Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to coordinate these heterogeneous modalities effectively due to two challenges: the scarcity of training data with paired triplet conditions, and the difficulty of jointly handling the sub-tasks of subject preservation and audio-visual synchronization under multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control.
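As a rough illustration of what "paired triplet conditions" means here, the sketch below shows one way a training sample bundling text, a reference image, and audio might look. All names, shapes, and the sample rate are my own assumptions for illustration, not the authors' actual data format:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TripletSample:
    """Hypothetical paired training sample: text + reference image + audio.

    Field names and shapes are illustrative assumptions, not HuMo's format.
    """
    prompt: str                  # text description of the target human video
    reference_image: np.ndarray  # subject appearance, e.g. (H, W, 3) uint8
    audio_waveform: np.ndarray   # speech to sync lips against, e.g. (num_samples,)
    sample_rate: int = 16_000    # assumed audio sample rate


# The scarcity problem the abstract mentions: real datasets rarely provide
# all three conditions aligned for the same clip.
sample = TripletSample(
    prompt="A person speaking to the camera in a studio",
    reference_image=np.zeros((512, 512, 3), dtype=np.uint8),
    audio_waveform=np.zeros(16_000, dtype=np.float32),
)
```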
One of the things that annoyed me when I was playing with VibeVoice (#1205738) was that for samples rich in varied intonation, each session (such as a speaker change in the script) could cause wildly deviating voices. This promises to improve that, so that's cool.