KOSMOS-2 brings grounding to vision-language models, letting AI pinpoint visual regions based on text. In this blog, I explore how well it performs through real-world experiments and highlight both its promise and limitations in grounding and image understanding.
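KOSMOS-2 expresses grounding as plain text: phrases are wrapped in special tags and each grounded phrase is followed by location tokens that index bins on a 32×32 patch grid over the image. As a minimal sketch (assuming that 32×32 grid and the `<phrase>`/`<object>`/`<patch_index_XXXX>` tag format used by the released checkpoints; `parse_grounding` is a hypothetical helper, not part of any library):

```python
import re

GRID = 32  # KOSMOS-2 quantizes the image into a 32x32 grid of location bins

def parse_grounding(text):
    """Parse KOSMOS-2-style grounded output into (phrase, box) pairs.

    Boxes are returned as normalized (x0, y0, x1, y1) in [0, 1],
    reconstructed from the top-left / bottom-right <patch_index_XXXX> bins.
    """
    pattern = re.compile(
        r"<phrase>(.*?)</phrase><object>"
        r"<patch_index_(\d{4})><patch_index_(\d{4})></object>"
    )
    results = []
    for phrase, tl, br in pattern.findall(text):
        tl, br = int(tl), int(br)
        # top-left bin: near corner of that grid cell
        x0, y0 = (tl % GRID) / GRID, (tl // GRID) / GRID
        # bottom-right bin: +1 so the cell's far edge closes the box
        x1, y1 = (br % GRID + 1) / GRID, (br // GRID + 1) / GRID
        results.append((phrase, (x0, y0, x1, y1)))
    return results

sample = ("<grounding>An image of<phrase>a snowman</phrase>"
          "<object><patch_index_0044><patch_index_0863></object>")
print(parse_grounding(sample))
# → [('a snowman', (0.375, 0.03125, 1.0, 0.84375))]
```

Multiply the normalized coordinates by the image width and height to get pixel boxes; in practice the Hugging Face processor for KOSMOS-2 does this post-processing for you.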
21 sats \ 1 reply \ @k00b 9 Nov
I remain super curious about the data pipelines of these things.
I'm also curious about how wrong models can be and still be useful. Like, how often is something that works sometimes better than nothing?
0 sats \ 0 replies \ @optimism 13h
If you're a gambler, all the time. The problem starts when you have standards; then it becomes crappy real fast, unless you have a scalable means to judge output and discard at scale too.
I think that this is the hardest part of all generation in practice.
0 sats \ 0 replies \ @0xbitcoiner 13h
This is cool for computer vision projects, but at the same time it's kinda creepy when you think about what people could do with it and what could go wrong.