By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
(3) is generally where I tend to have issues. For people who mostly see problems of type (2), it's hard to explain how insufficient even the state of the art is for (3).