pull down to refresh

The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder(VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion

5 fingers, 2 legs, and 2 arms, from here on out!

I, or more likely the troll that lives inside me, couldn't help but wonder what it does if you ask it to render with 6 fingers though: so I used the demo and asked it for a photo of a person signing a contract, with 6 fingers:
100 sats \ 1 reply \ @Scoresby 6h
I think the accidental six-finger images may have understood six-finger biology a little better. The left hand isn't bad, though.
reply
On the right hand it looks as if it really struggled with the instruction though; so there's good hope for Mariana-trench-depth-fakes!
reply