I suspect that #2 can't easily be fixed and that #1 is a product of #2.
#3 seems the most important to me. If you don't have a reasonable understanding of the topic you're about to research, AI tools are particularly dangerous -- you won't know shit when you smell it.
The temptation is: hey, now that I have a tool that gives me a detailed response to a question in any field, I'm now an expert in any field. This is clearly not true. AI may speed me on my way to expertise, but it doesn't take its place. I suspect people are going to make this mistake very frequently in the coming years.
I think this is what I was trying to get at yesterday with that post about wisdom.
I suspect people are going to make this mistake very frequently in the coming years.
The question is how quickly they'll get feedback that the AI has missed something important.
I worry about a future where AI is the de facto fact checker. Person A posts something that's 95% correct, but the AI made up the other 5%. Person B uses AI to check whether it's true, and the AI gives it the thumbs up. No one in that transaction has any idea about the 5% that was just made up.
Have you observed this recently though?
I really find that as long as I isolate contexts/sessions, and even more so when I push some GLM-5 reviews in to throw the US models off their game, I get a pretty good result. Of course I still meatreview, but there are times now when I'm in relative incompetence and still find nothing.
Mmm... so recently I've been using Opus 4.6 to help me write mathematical proofs.
It's actually super helpful. It gets me started on the right track, but it doesn't get me 100% of the way there. I frequently find holes in the proofs.
I think one of the issues is that when I say "Prove this," the AI tries really hard to satisfy me and doesn't give enough weight to the possibility that the theorem is wrong. If it's obviously wrong, it'll say so, but the proofs I was asking for involved fairly complex setups in which the theorem holds in most cases but maybe fails on some edge cases.
So, I did find it making some stuff up, and I had to press it on its mistakes. After pressing it, we eventually arrived at the proper assumptions that would rule out the edge cases.
But if I hadn't pressed, I probably would have put out an erroneous proof.
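To make the edge-case point concrete, here's a toy illustration in Lean 4 with Mathlib (not one of the actual proofs, just the simplest statement I know that's true everywhere except one edge case):

```lean
import Mathlib

-- Without the hypothesis `ha`, this statement is false at the edge case a = 0:
-- Mathlib defines 0 / 0 = 0, not 1. Adding the assumption rules the edge case
-- out, which is exactly the kind of fix we kept landing on.
example (a : ℝ) (ha : a ≠ 0) : a / a = 1 := div_self ha
```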
Do you do that in the same session?
So what I find is that if I say "review <xyz> for logic errors and omissions" in a new session, I can be sure Claude finds a lot of stuff it missed on the first run. (I work git-based with all models, so there's some indirection there too that may help trigger a different pattern through the layers.) Hate the apologies though, wtf Anthropic.
However, I do admit that second-model review works better. I think k00b was experiencing the same by mixing GPT and Claude for reviews.
I might do a write-up of my process. It was an iterative loop along these lines (a rough sketch of the fresh-session step as a script follows the list):
1. Start a new session (no memory)
2. Upload the current version of the model and the (incomplete) proof
3. Ask it to complete the proof
4. Evaluate the AI's proof completion
5. Poke the AI on parts that I found incorrect or dissatisfying
6. Iterate within the chat on why that part was hard or dissatisfying
   - If it's simply a mistake by the AI, fix it
   - If the model setup is genuinely problematic, revise the assumptions
7. Update the model with new assumptions as appropriate
8. Go to step 1
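For concreteness, here's a minimal sketch of what the "new session (no memory)" step could look like if scripted against the Anthropic Python SDK. The model id, prompt wording, and function name are placeholders of mine, not the actual setup:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def fresh_proof_attempt(model_text: str, partial_proof: str) -> str:
    """One loop iteration: a brand-new, stateless request with no chat history."""
    response = client.messages.create(
        model="claude-opus-4-6",  # placeholder model id
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Here is my model:\n\n" + model_text +
                "\n\nHere is the incomplete proof:\n\n" + partial_proof +
                "\n\nPlease complete the proof, and flag any step you are unsure of."
            ),
        }],
    )
    return response.content[0].text

# Steps 4-7 (evaluate, poke, revise assumptions) happen between calls; because
# each call is stateless, stale context from earlier attempts can't leak in.
```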
For step 5, is there a way to make it come to that conclusion?
Not sure if it's 100% comparable, but maybe there's something close to it: if I find a problem in code, instead of "arguing" or "pressing" I just say "write a test around <xyz> in case there are any issues", and 90% of the time it then finds its own error.
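A hedged sketch of what that prompt pattern buys you, with a made-up helper containing a deliberate bug (all names hypothetical): the test you ask the model to "write around" the suspect code fails and surfaces the bug without any arguing.

```python
# Hypothetical helper the model wrote, containing a subtle bug.
def chunked(items, size):
    """Split `items` into consecutive chunks of at most `size` elements."""
    # Bug: stepping by `size - 1` makes adjacent chunks overlap by one element.
    return [items[i:i + size] for i in range(0, len(items), size - 1)]

# The test you ask the model to "write around <xyz>".
def test_chunked_covers_each_item_exactly_once():
    items = list(range(10))
    flattened = [x for chunk in chunked(items, 3) for x in chunk]
    assert flattened == items  # fails: elements are duplicated across chunks
```

Run under pytest, the failing assertion points straight at the overlap, which is usually enough for the model to diagnose its own mistake.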
#1 is a product of #2.
Hmm, #1 can also simply be a wrong assessment of how reliable the process is.
If you don't have a reasonable understanding of the topic you're about to research
Worse: if you haven't got the slightest experience in what you're outsourcing! In the past, when corporations outsourced work to contractors, the results were also... subject to improvement, haha.
that post about wisdom.
It's #10 or so on my backlog of Scoresby's posts to re-read and reply to. Haha.
Exactly! Your wife deserves praise.
Per my short argument this morning, I do like what GPT has done over the last two versions. But you still need to fact-check it all, and you need a methodology for that -- one that works for you. For example, how @k00b does his LLM-aided coding and how I do it are worlds apart (except that we're both resisting being a yoloboi, as hard as that is), but we each master the process we've chosen.
So this is how I perceive it:
<boi | grrl> In all cases, the problem will solve itself into either massive reputational damage right now, or later. Eventually, yoloing a prompt into production output without a framework will fuck anyone up.