I suspect that #2 can't easily be fixed and that #1 is a product of #2.
#3 seems the most important to me. If you don't have a reasonable understanding of the topic you're about to research, AI tools are particularly dangerous -- you won't know shit when you smell it.
The temptation is: hey, now that I have a tool that gives me a detailed response to a question in any field, I'm now an expert in any field. This is clearly not true. AI may speed me on my way to expertise, but it doesn't take its place. I suspect people are going to make this mistake very frequently in the coming years.
I think this is what I was trying to get at yesterday with that post about wisdom.
I suspect people are going to make this mistake very frequently in the coming years.
The question is how quickly they'll get feedback that the AI has missed something important.
I worry about a future where AI is the de facto fact checker. Person A posts something that's 95% correct, but the AI made up the other 5%. Person B uses AI to check whether it's true, and the AI gives it the thumbs up. No one in that transaction has any idea about the 5% that was just made up.
Have you observed this recently though?
I really find that as long as I isolate contexts/sessions, and even more so when I push some GLM-5 reviews in to throw the US models off their game, I get a pretty good result. Of course I still meatreview, but there are times now when I'm in relative incompetence and still find nothing.
Mmm... so recently I've been using Opus 4.6 to help me write mathematical proofs.
It's actually super helpful. It gets me started on the right track, but it doesn't get me 100% of the way there. I frequently find holes in the proofs.
I think one of the issues is that when I say "Prove this," the AI tries really hard to satisfy me and doesn't give enough weight to the possibility that the theorem is wrong. If it's obviously wrong, it'll say so, but the proofs I was asking for involved fairly complex setups in which the theorem holds in most cases but maybe fails on some edge cases.
So, I did find it making some stuff up, and I had to press it on its mistakes. After pressing it, we eventually arrived at the proper assumptions that would rule out the edge cases.
But if I hadn't pressed, I probably would have put out an erroneous proof.
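To make the edge-case point concrete, here's a toy illustration in Lean 4 with Mathlib (not one of the actual proofs, just the simplest statement I know that's true everywhere except one edge case):

```lean
import Mathlib

-- Without the hypothesis `ha`, this statement is false at the edge case a = 0:
-- Mathlib defines 0 / 0 = 0, not 1. Adding the assumption rules the edge case
-- out, which is exactly the kind of fix we kept landing on.
example (a : ℝ) (ha : a ≠ 0) : a / a = 1 := div_self ha
```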
Do you do that in the same session?
So what I find is that if I say "review <xyz> for logic errors and omissions" in a new session, I can be sure Claude finds a lot of stuff it missed on the first run. (I work git-based with all models, so there's some indirection there too that may help trigger a different pattern through the layers.) Hate the apologies though, wtf Anthropic.
However, I do admit that second-model review works better. I think k00b was experiencing the same by mixing GPT and Claude for reviews.
I might do a write-up of my process. It was an iterative loop along these lines (a rough sketch of the fresh-session step as a script follows the list):
1. Start a new session (no memory)
2. Upload the current version of the model and the (incomplete) proof
3. Ask it to complete the proof
4. Evaluate the AI's proof completion
5. Poke the AI on parts that I found incorrect or dissatisfying
6. Iterate within the chat on why that part was hard or dissatisfying
   - If it's simply a mistake by the AI, fix it
   - If the model setup is genuinely problematic, revise the assumptions
7. Update the model with new assumptions as appropriate
8. Go to step 1
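For concreteness, here's a minimal sketch of what the "new session (no memory)" step could look like if scripted against the Anthropic Python SDK. The model id, prompt wording, and function name are placeholders of mine, not the actual setup:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def fresh_proof_attempt(model_text: str, partial_proof: str) -> str:
    """One loop iteration: a brand-new, stateless request with no chat history."""
    response = client.messages.create(
        model="claude-opus-4-6",  # placeholder model id
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Here is my model:\n\n" + model_text +
                "\n\nHere is the incomplete proof:\n\n" + partial_proof +
                "\n\nPlease complete the proof, and flag any step you are unsure of."
            ),
        }],
    )
    return response.content[0].text

# Steps 4-7 (evaluate, poke, revise assumptions) happen between calls; because
# each call is stateless, stale context from earlier attempts can't leak in.
```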
For step 5, is there a way to make it come to that conclusion?
Not sure if it's 100% comparable, but maybe there's something close to it: if I find a problem in code, instead of "arguing" or "pressing" I just say "write a test around <xyz> in case there are any issues", and 90% of the time it then finds its own error.
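A hedged sketch of what that prompt pattern buys you, with a made-up helper containing a deliberate bug (all names hypothetical): the test you ask the model to "write around" the suspect code fails and surfaces the bug without any arguing.

```python
# Hypothetical helper the model wrote, containing a subtle bug.
def chunked(items, size):
    """Split `items` into consecutive chunks of at most `size` elements."""
    # Bug: stepping by `size - 1` makes adjacent chunks overlap by one element.
    return [items[i:i + size] for i in range(0, len(items), size - 1)]

# The test you ask the model to "write around <xyz>".
def test_chunked_covers_each_item_exactly_once():
    items = list(range(10))
    flattened = [x for chunk in chunked(items, 3) for x in chunk]
    assert flattened == items  # fails: elements are duplicated across chunks
```

Run under pytest, the failing assertion points straight at the overlap, which is usually enough for the model to diagnose its own mistake.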
#1 is a product of #2.
Hmm, #1 can also simply be a wrong assessment of how reliable the process is.
If you don't have a reasonable understanding of the topic you're about to research
Worse: if you haven't got the slightest experience in what you're outsourcing! In the past, when corporations outsourced work to contractors, the results were also... subject to improvement, haha.
that post about wisdom.
It's #10 or so on my backlog of Scoresby's posts to re-read and reply to. Haha.
Exactly! Your wife deserves praise.
Per my short argument this morning, I do like what GPT has done over the last two versions. But you still need to fact-check it all, and you need a methodology for that -- one that works for you. For example, how @k00b does his LLM-aided coding and how I do it are worlds apart (except that we're both resisting being a yoloboi, as hard as that is), but we each master the process we've chosen.
So this is how I perceive it:
<boi | grrl> In all cases, the problem will solve itself into either massive reputational damage right now, or later. Eventually, yoloing a prompt into production output without a framework will fuck anyone up.