
It's funny that part of what's described under "Unified Understanding and Generation Model" is basically what I was expecting from Gemma3n's multimodality yesterday:
Most multi-modality large language models (MLLMs) are designed to perform visual understanding tasks, where the model generates textual responses based on combined image and language inputs (Liu et al., 2023b; Li et al., 2024; Zhang et al., 2025; Bai et al., 2025). Recently, there has been a growing interest in unifying visual understanding and visual generation within a single framework (Team, 2024; Zhou et al., 2024). One line of work tokenizes images into discrete tokens akin to text, enabling large language models (LLMs) to both interpret and generate visual content seamlessly (Team, 2024; Wang et al., 2024).
I guess I have to be patient.
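For the curious, here's a rough, self-contained sketch of what that "images as discrete tokens" idea can look like in practice. Everything below (the toy random codebook, the begin/end-of-image tokens, all the names and sizes) is my own placeholder to illustrate the concept, not anything taken from the cited papers:

```python
# Illustrative sketch only: a VQ-style image tokenizer plus a shared
# text+image vocabulary, so one decoder-only LM could in principle
# predict either kind of token. All constants are made up.
import torch

TEXT_VOCAB_SIZE = 32_000          # hypothetical text vocabulary
CODEBOOK_SIZE = 8_192             # hypothetical image codebook
PATCH_DIM = 3 * 16 * 16           # flattened 16x16 RGB patches

class ToyImageTokenizer:
    """Maps image patches to discrete codebook indices (VQ-style quantization)."""
    def __init__(self, codebook_size: int = CODEBOOK_SIZE, patch_dim: int = PATCH_DIM):
        # In a real tokenizer this codebook is learned; here it is random.
        self.codebook = torch.randn(codebook_size, patch_dim)

    def encode(self, image: torch.Tensor) -> torch.Tensor:
        """image: (3, H, W) with H, W divisible by 16 -> (num_patches,) token ids."""
        patches = (
            image.unfold(1, 16, 16).unfold(2, 16, 16)    # (3, H/16, W/16, 16, 16)
            .permute(1, 2, 0, 3, 4)
            .reshape(-1, PATCH_DIM)                      # (num_patches, PATCH_DIM)
        )
        # Nearest codebook entry per patch = that patch's discrete token id.
        dists = torch.cdist(patches, self.codebook)      # (num_patches, codebook_size)
        return dists.argmin(dim=-1)

def to_unified_sequence(text_ids: torch.Tensor, image_ids: torch.Tensor) -> torch.Tensor:
    """Put text and image tokens into one sequence over a shared vocabulary."""
    image_ids_shifted = image_ids + TEXT_VOCAB_SIZE           # image ids live after text ids
    boi = torch.tensor([TEXT_VOCAB_SIZE + CODEBOOK_SIZE])     # <begin_of_image>
    eoi = torch.tensor([TEXT_VOCAB_SIZE + CODEBOOK_SIZE + 1]) # <end_of_image>
    return torch.cat([text_ids, boi, image_ids_shifted, eoi])

if __name__ == "__main__":
    tokenizer = ToyImageTokenizer()
    image = torch.rand(3, 64, 64)                        # 4x4 grid of 16x16 patches
    text_ids = torch.randint(0, TEXT_VOCAB_SIZE, (12,))  # stand-in prompt tokens
    seq = to_unified_sequence(text_ids, tokenizer.encode(image))
    print(seq.shape)  # 12 text + 1 BOI + 16 image + 1 EOI = 30 tokens
```

The appeal of this setup, as I understand it, is that once image patches are just ids in the same vocabulary, plain next-token prediction covers both directions: text after image tokens is understanding, image tokens after text is generation.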