
producing output only consumable by other agents, maybe? we’re being replaced

66 sats \ 0 replies \ @k00b OP 12h

I'd guess their success criteria aren't very sophisticated yet and are mostly "did it output something that gets the job done?"

I should probably go browse the SWE benchmarks they all use. I'd guess those track SOTA success criteria pretty well.
