I'd guess their success criteria aren't very sophisticated yet and mostly come down to "did it output something that gets the job done?"
I should probably go browse the SWE benchmarks they all use. I'd guess those track SOTA success criteria pretty well.