
VLMs, in contrast, are grounded in visual reality. They are trained on massive datasets containing millions, or even billions, of image-text pairs.
This process forces the model to connect the word "sunset" not just to other words, but to the actual pixels, colors, and shapes that constitute a sunset in a photograph. This grounding in the visual world makes their understanding richer, more concrete, and fundamentally closer to how humans perceive the world.
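One common way that connection gets learned is a CLIP-style contrastive objective: embeddings of matching image-caption pairs are pulled together while mismatched pairs are pushed apart. A minimal PyTorch sketch of that loss, assuming the image and text encoders live elsewhere and with `temperature` as an illustrative value, not any particular model's setting:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    image_embeds, text_embeds: (batch, dim) tensors where row i of each
    comes from the same image-caption pair.
    """
    # Normalize so dot products become cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds true pairs.
    logits = image_embeds @ text_embeds.T / temperature

    # Each image should "classify" its own caption, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# e.g. with random stand-in embeddings:
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```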
I remain super curious about the data pipelines of these things.
The model demonstrates impressive capabilities in associating textual phrases with visual regions, especially for straightforward prompts and well-lit, distinct images.
However, it still struggles with contextual complexity and logical reasoning, often producing hallucinated or incorrect outputs when faced with open-ended or abstract questions.
I'm also curious about how wrong models can be and still be useful. Like, how often is something that works only sometimes better than nothing?
If you're a gambler, all the time. The problem starts when you have standards: then it becomes crappy real fast, unless you have a scalable means to judge output and discard the bad stuff at scale too.
I think that this is the hardest part of all generation in practice.
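As a sketch of that generate-then-filter pattern: sample many candidates, score each with an automated judge, and keep only what clears a threshold. Everything here (`generate`, `judge`, the threshold) is a hypothetical stand-in; the hard part the comment points at is making `judge` both cheap and trustworthy.

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical stand-in for an unreliable generator (e.g. a VLM call).
    return f"answer-{random.randint(0, 9)} to {prompt!r}"

def judge(candidate: str) -> float:
    # Hypothetical scalable judge returning a quality score in [0, 1].
    # In practice: a reward model, unit tests, or task-specific heuristics.
    return random.random()

def best_of_n(prompt: str, n: int = 16, threshold: float = 0.8) -> list[str]:
    """Sample n candidates and discard everything the judge scores low."""
    candidates = [generate(prompt) for _ in range(n)]
    return [c for c in candidates if judge(c) >= threshold]

survivors = best_of_n("describe the sunset in this photo")
print(f"{len(survivors)} of 16 candidates passed the filter")
```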