The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space, at the cost of complicated cascade pipelines and increased token complexity. In contrast, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion.
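To make the "patch-wise decoding with a neural field" idea more concrete, here is a minimal sketch of what such a decoder could look like: a small MLP, conditioned on each patch's feature vector from the transformer, maps normalized pixel coordinates inside the patch to RGB values. The class name, dimensions, and layer choices below are my own illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a patch-wise neural field decoder.
# All names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn


class PatchNeuralFieldDecoder(nn.Module):
    def __init__(self, feat_dim: int = 768, hidden: int = 256, patch: int = 16):
        super().__init__()
        self.patch = patch
        # MLP takes (x, y) pixel coordinates concatenated with the patch feature.
        self.mlp = nn.Sequential(
            nn.Linear(2 + feat_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 3),  # RGB output per pixel
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, feat_dim) -- one feature vector per patch token
        B, N, _ = patch_feats.shape
        p = self.patch
        # Normalized pixel coordinates within a patch, shared across all patches.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, p), torch.linspace(-1, 1, p), indexing="ij"
        )
        coords = torch.stack([xs, ys], dim=-1).reshape(1, 1, p * p, 2)
        coords = coords.expand(B, N, -1, -1).to(patch_feats)
        feats = patch_feats.unsqueeze(2).expand(-1, -1, p * p, -1)
        rgb = self.mlp(torch.cat([coords, feats], dim=-1))  # (B, N, p*p, 3)
        # Caller reassembles the per-patch pixels into the full image.
        return rgb.reshape(B, N, p, p, 3)


# Usage: decode 256 patch tokens into 16x16 RGB patches.
decoder = PatchNeuralFieldDecoder()
tokens = torch.randn(2, 256, 768)
patches = decoder(tokens)  # (2, 256, 16, 16, 3)
```

The point of this shape of decoder is that it replaces a separately trained VAE decoder: the pixels come straight out of the same end-to-end model, one patch at a time.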
5 fingers, 2 legs, and 2 arms, from here on out!
I, or more likely the troll that lives inside me, couldn't help but wonder what it would do if you asked it to render 6 fingers, so I used the demo and asked it for:

a photo of a person signing a contract, with 6 fingers