
It's funny that part of what's described under "Unified Understanding and Generation Model" is basically what I was expecting from Gemma3n's multimodality yesterday:
Most multi-modality large language models (MLLMs) are designed to perform visual understanding tasks, where the model generates textual responses based on combined image and language inputs (Liu et al., 2023b; Li et al., 2024; Zhang et al., 2025; Bai et al., 2025). Recently, there has been a growing interest in unifying visual understanding and visual generation within a single framework (Team, 2024; Zhou et al., 2024). One line of work tokenizes images into discrete tokens akin to text, enabling large language models (LLMs) to both interpret and generate visual content seamlessly (Team, 2024; Wang et al., 2024).
I guess I have to be patient.
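For the curious, here's a rough, self-contained sketch of what that "images as discrete tokens" idea can look like in practice. Everything below (the toy random codebook, the begin/end-of-image tokens, all the names and sizes) is my own placeholder to illustrate the concept, not anything taken from the cited papers:

```python
# Illustrative sketch only: a VQ-style image tokenizer plus a shared
# text+image vocabulary, so one decoder-only LM could in principle
# predict either kind of token. All constants are made up.
import torch

TEXT_VOCAB_SIZE = 32_000          # hypothetical text vocabulary
CODEBOOK_SIZE = 8_192             # hypothetical image codebook
PATCH_DIM = 3 * 16 * 16           # flattened 16x16 RGB patches

class ToyImageTokenizer:
    """Maps image patches to discrete codebook indices (VQ-style quantization)."""
    def __init__(self, codebook_size: int = CODEBOOK_SIZE, patch_dim: int = PATCH_DIM):
        # In a real tokenizer this codebook is learned; here it is random.
        self.codebook = torch.randn(codebook_size, patch_dim)

    def encode(self, image: torch.Tensor) -> torch.Tensor:
        """image: (3, H, W) with H, W divisible by 16 -> (num_patches,) token ids."""
        patches = (
            image.unfold(1, 16, 16).unfold(2, 16, 16)    # (3, H/16, W/16, 16, 16)
            .permute(1, 2, 0, 3, 4)
            .reshape(-1, PATCH_DIM)                      # (num_patches, PATCH_DIM)
        )
        # Nearest codebook entry per patch = that patch's discrete token id.
        dists = torch.cdist(patches, self.codebook)      # (num_patches, codebook_size)
        return dists.argmin(dim=-1)

def to_unified_sequence(text_ids: torch.Tensor, image_ids: torch.Tensor) -> torch.Tensor:
    """Put text and image tokens into one sequence over a shared vocabulary."""
    image_ids_shifted = image_ids + TEXT_VOCAB_SIZE           # image ids live after text ids
    boi = torch.tensor([TEXT_VOCAB_SIZE + CODEBOOK_SIZE])     # <begin_of_image>
    eoi = torch.tensor([TEXT_VOCAB_SIZE + CODEBOOK_SIZE + 1]) # <end_of_image>
    return torch.cat([text_ids, boi, image_ids_shifted, eoi])

if __name__ == "__main__":
    tokenizer = ToyImageTokenizer()
    image = torch.rand(3, 64, 64)                        # 4x4 grid of 16x16 patches
    text_ids = torch.randint(0, TEXT_VOCAB_SIZE, (12,))  # stand-in prompt tokens
    seq = to_unified_sequence(text_ids, tokenizer.encode(image))
    print(seq.shape)  # 12 text + 1 BOI + 16 image + 1 EOI = 30 tokens
```

The appeal of this setup, as I understand it, is that once image patches are just ids in the same vocabulary, plain next-token prediction covers both directions: text after image tokens is understanding, image tokens after text is generation.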