
Model performance degrades as input length increases, often in surprising and non-uniform ways.
Long context evaluations for these models often demonstrate consistent performance across input lengths. However, these evaluations are narrow in scope and not representative of how long context is used in practice. The most commonly used test, Needle in a Haystack (NIAH), is a simple lexical retrieval task often used to generalize a model’s ability to reliably handle long context.
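For anyone who hasn't seen NIAH, here's roughly what such a probe looks like. This is my own toy sketch, not the paper's harness: the needle, the filler text, and `call_model` are all made-up placeholders.

```python
# Minimal sketch of a NIAH-style probe: bury a "needle" fact inside repeated
# filler, then ask the model to retrieve it and grade with a string check.
# `call_model` is a placeholder for whatever client you use.

NEEDLE = "The secret launch code is 7-apricot-19."
FILLER = "The weather report mentioned light rain over the harbor. " * 50

def build_haystack(total_chunks: int, needle_depth: float) -> str:
    """Embed the needle at a relative depth inside repeated filler text."""
    chunks = [FILLER] * total_chunks
    chunks.insert(int(needle_depth * total_chunks), NEEDLE)
    return "\n".join(chunks)

def niah_trial(call_model, total_chunks: int, needle_depth: float) -> bool:
    """Ask for the needle back and check the answer lexically."""
    prompt = build_haystack(total_chunks, needle_depth) + "\n\nWhat is the secret launch code?"
    return "7-apricot-19" in call_model(prompt)
```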
The researchers tested LLMs with focused prompts (~300 tokens) and full prompts (~113k tokens):
Across all models, we see significantly higher performance on focused prompts compared to full prompts.
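How I read that setup, in pseudocode form: the same question gets answered twice, once with only the relevant excerpt and once with the whole corpus, and accuracy is compared. The tuple layout, `call_model`, and the substring grading below are my placeholders, not the paper's methodology.

```python
# Toy comparison of focused vs. full prompts on the same question set.
# cases: list of (question, relevant_excerpt, full_corpus, expected_answer).

def compare_focused_vs_full(call_model, cases):
    focused_hits = full_hits = 0
    for question, excerpt, corpus, expected in cases:
        focused_answer = call_model(f"{excerpt}\n\nQuestion: {question}")
        full_answer = call_model(f"{corpus}\n\nQuestion: {question}")
        focused_hits += expected.lower() in focused_answer.lower()
        full_hits += expected.lower() in full_answer.lower()
    return {"focused_acc": focused_hits / len(cases),
            "full_acc": full_hits / len(cases)}
```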
More broadly, our findings point to the importance of context engineering: the careful construction and management of a model’s context window. Where and how information is presented in a model’s context strongly influences task performance, making this a meaningful direction of future work for optimizing model performance.
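A crude sketch of what context engineering can look like in practice: rank candidate chunks against the question and keep only the best ones under a token budget, instead of concatenating everything. The word-overlap score and the 4-chars-per-token estimate are stand-ins for a real retriever and tokenizer, not anything from the paper.

```python
# Keep only the most relevant chunks under a token budget.

def overlap_score(question: str, chunk: str) -> int:
    question_words = set(question.lower().split())
    return sum(word in question_words for word in chunk.lower().split())

def build_context(question: str, chunks: list[str], token_budget: int = 2000) -> str:
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        estimated_tokens = len(chunk) // 4   # rough per-token heuristic
        if used + estimated_tokens > token_budget:
            break
        selected.append(chunk)
        used += estimated_tokens
    return "\n\n".join(selected) + "\n\nQuestion: " + question
```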
@freetx offered a nice analogy in #1027038 about the "teapot test" - basically that LLMs often hyperfocus on some part of the instruction. I've seen this happen with smaller reasoning models and even posted/complained about such hyperfocus-on-nonsense behavior in the past (#987426).
This is why I want to invest a bit in fine-tuned MCP. I'm starting to think that the tool descriptions take too much context away from the problem we want to solve. I'm trying to treat the models as if they have an attention disorder to see if that improves results.
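Roughly what I have in mind, as a toy sketch: only hand the model the few tools that plausibly match the task, with trimmed descriptions. The `Tool` shape and the keyword matching are hypothetical placeholders, not actual MCP APIs.

```python
# Shrink the tool surface the model sees before it goes into the prompt.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    keywords: tuple[str, ...]

def select_tools(task: str, tools: list[Tool], max_tools: int = 3,
                 max_desc_chars: int = 200) -> list[Tool]:
    """Keep tools whose keywords appear in the task, with shortened descriptions."""
    task_lower = task.lower()
    relevant = [t for t in tools if any(k in task_lower for k in t.keywords)]
    return [replace(t, description=t.description[:max_desc_chars])
            for t in relevant[:max_tools]]
```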
All of these context questions are equally interesting to consider wrt people. The power of what you try to bring to mind is significant in constructing reality.