different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space
I'll admit I'm kinda on the edge of understanding things like this, but I think it means you can always pull an input (a prompt) back out of an LLM's internal state...which would mean even if OpenAI wasn't logging all your prompts (they are), any prompt could be reconstructed from the hidden activations the model computed while processing it.
A core question in understanding large language models is whether their internal representations faithfully preserve the information in their inputs. Since Transformer architectures rely heavily on non-linearities, normalization, and many-to-one attention mechanisms, it is often assumed that they discard information: different inputs could collapse to the same hidden state, making exact recovery of the input impossible.
Using tools from real analysis, we show that collisions (two different prompts producing the exact same representation) can only occur for a measure-zero set of parameter values; that is, they are mathematical exceptions rather than something one should expect in practice.
Moreover, we prove that common training procedures (gradient descent with standard step sizes) never move parameters into this exceptional set. In layman’s terms, almost all models at initialization are injective, and training preserves this property.
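To get a feel for what "no collisions" means, here's a rough sketch of the kind of sanity check you could run yourself (my sketch, not from the paper; it assumes the Hugging Face transformers library and the gpt2 checkpoint): feed two prompts that differ by a single word through the model and compare the final hidden states.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def last_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the final token after the last Transformer layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.last_hidden_state[0, -1]

a = last_hidden("The cat sat on the mat")
b = last_hidden("The cat sat on the rug")

print(torch.allclose(a, b))        # False: the two prompts do not collide
print((a - b).abs().max().item())  # the gap is far above floating-point noise
```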
Finally, we turn this theoretical guarantee into an operational tool: our algorithm SipIt uses gradient-based reconstruction to recover prompts exactly from internal activations, efficiently and with provable linear-time guarantees. This confirms empirically that collisions do not occur in practice. Beyond transparency and safety, this elevates invertibility to a first-class property of Transformer language models, enabling stronger interpretability, probing, and causal analyses.
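To make the "recover prompts exactly from activations" part concrete, here's a toy illustration of the underlying idea (my own sketch, not the paper's SipIt code; it assumes transformers and gpt2, and brute-forces the whole vocabulary, so it's slow): because the hidden state at position t only depends on tokens up to t, you can recover a prompt left to right by testing which candidate token reproduces the observed activation at each position.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def hidden_at(ids: torch.Tensor, pos: int) -> torch.Tensor:
    """Hidden states at position `pos` for a batch of token-id sequences."""
    return model(input_ids=ids).last_hidden_state[:, pos]

# Pretend all we observed are the per-token activations of a secret prompt.
secret_ids = tok("the quick brown fox", return_tensors="pt").input_ids[0]
with torch.no_grad():
    targets = model(input_ids=secret_ids[None]).last_hidden_state[0]

vocab = torch.arange(tok.vocab_size)
recovered: list[int] = []
for t in range(len(secret_ids)):
    best_tok, best_err = None, float("inf")
    # Try every vocabulary token as the next token (batched, but still brute force).
    for chunk in vocab.split(512):
        prefix = torch.tensor(recovered, dtype=torch.long).repeat(len(chunk), 1)
        cand = torch.cat([prefix, chunk[:, None]], dim=1)
        err = (hidden_at(cand, t) - targets[t]).abs().amax(dim=1)
        i = int(err.argmin())
        if float(err[i]) < best_err:
            best_tok, best_err = int(chunk[i]), float(err[i])
    recovered.append(best_tok)

print(tok.decode(recovered))  # in practice this reproduces the secret prompt exactly
```

The brute-force vocabulary sweep is only there to show why injectivity matters: exactly one candidate token drives the activation error to (numerically) zero at each step, so the prompt falls out token by token.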
Here's an X thread on the paper as well: