
Unfortunately, it's multimodal for input only, not output.

I wonder how much it could be improved by removing 139 languages and the audio and video modalities.
