Do you do that in the same session?
So what I find is that if I say "review <xyz> for logic errors and omissions" in a new session (I work git-based with all models, so there's some indirection there too that may help trigger a different pattern through the layers), I can be sure Claude finds a lot of stuff it missed on the first run. (Hate the apologies tho, wtf Anthropic.)
However, I do admit that second-model review works better. I think k00b was experiencing the same thing by mixing GPT and Claude for reviews.
I might do a write up of my process. It was an iterative loop along these lines:
1. Start a new session (no memory)
2. Upload the current version of the model and the (incomplete) proof
3. Ask it to complete the proof
4. Evaluate the AI's proof completion
5. Poke the AI on parts I found incorrect or dissatisfying
6. Iterate within the chat on why that part was hard or dissatisfying:
   - If it's simply a mistake by the AI, fix it
   - If the model setup is genuinely problematic, revise the assumptions
7. Update the model with new assumptions as appropriate
8. Go to step 1
For step 5, is there a way to make it come to that conclusion?
Not sure if it's 100% comparable, but maybe there's something close: if I find a thing in code, instead of "arguing" or "pressing" I just say "write a test around <xyz> in case there are any issues," and then 90% of the time it finds its own error.
Mmm... so recently I've been using Opus 4.6 to help me write mathematical proofs.
It's actually super helpful. Gets me started on the right track. But it doesn't get me 100% of the way there: I frequently found holes in the proofs.
I think one of the issues is that when I say "Prove this," the AI tries really hard to satisfy me and doesn't give enough weight to the possibility that the theorem is wrong. If it's obviously wrong, it'll say so, but the proofs I was asking for involved fairly complex setups where the theorem is true in most cases, but maybe not for some edge cases.
So, I did find it making some stuff up, and I had to press it on its mistakes. After pressing it, we eventually arrived at the proper assumptions that would rule out the edge cases.
But if I hadn't pressed, I probably would have put out an erroneous proof.
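A toy analogue of that failure mode (not one of the actual theorems here): the claim "n² + n + 41 is prime for every n ≥ 0" survives n = 0 through 39, so a model eager to please could happily "prove" it, while a mechanical edge-case search finds the hole:

```python
def is_prime(n):
    """Trial division; fine for small n."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

# Claim: n*n + n + 41 is prime for all n >= 0 (Euler's polynomial).
# It holds for n = 0..39 and first fails at n = 40, where the value
# is 40*40 + 40 + 41 = 1681 = 41 * 41.
counterexample = next(n for n in range(100) if not is_prime(n * n + n + 41))
print(counterexample)  # 40
```

This is exactly the "true for most cases, wrong on an edge case" shape: the missing assumption (here, n < 40) is what pressing the model eventually surfaces.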