By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
(3) is generally where I tend to have issues. For people who mostly see problems of type (2), it's hard to explain how insufficient even the state of the art is for (3).