I think that using AI for research is fine, but you have to fact-check everything it comes up with. So what is being verified now? (I'm going to omit my gut feeling on this one.)
The business of double-checking every single detail really hinders the usefulness. My wife does research in a fairly niche area and is not at all pleased with AI as a research tool because of the high propensity to make stuff up in areas that are not broadly documented.
> the high propensity to make stuff up in areas that are not broadly documented.
Exactly! Your wife deserves praise.
Per my short argument this morning, I do like what GPT does since the last two versions. But you still need to fact-check it all, and you need a methodology for that. One that works for you, too: for example, how @k00b does his LLM-aided coding and how I do it are worlds apart (except that we're both resisting being a yoloboi, as hard as that is), but we each master the process we've chosen.
So this is how I perceive this:
- The prompter didn't think about the system enough, and especially not from an adversarial p.o.v.
- The prompter is a yolo<boi | grrl>
- The prompter lacks a frame of reference in this kind of research
In all cases the problem will resolve itself as massive reputational damage, either right now or later. Eventually, yoloing a prompt into production output without a framework will fuck anyone up.
I suspect that no 2 cannot be easily fixed and that no 1 is a product of no 2.
no 3 seems to be the most important to me. If you haven't got a reasonable understanding of the topic you are going to research, AI tools are particularly dangerous -- you won't know shit when you smell it.
The temptation is: hey, now that I have a tool that gives me a detailed response to a question in any field, I am now an expert in any field. This is clearly not true. AI may speed me on my way to expertise, but it doesn't take its place. I suspect people are going to make this mistake very frequently in the coming years.
I think this is what I was trying to get at yesterday with that post about wisdom.
> I suspect people are going to make this mistake very frequently in the coming years.
The question is how quickly will they get feedback that the AI has missed something important.
I worry about a future where AI is the de facto fact checker. Person A posts something 95% correct, but AI made shit up about 5% of it. Person B uses AI to check if it's true and AI gives it the thumbs up. No one in that transaction has any idea of the 5% the AI just made up.
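To make the worry concrete, here's a toy simulation (the rates are made-up assumptions, not measurements of any real model): the danger isn't the 5% error rate itself, it's that a checker drawn from the same model family shares the author's blind spots, so the fabricated claims are exactly the ones most likely to pass.

```python
import random

random.seed(42)

N_CLAIMS = 100_000
P_FABRICATE = 0.05          # assumption: the author's model invents ~5% of claims
P_CATCH_INDEPENDENT = 0.90  # assumption: an independent checker flags 90% of fabrications
P_CATCH_SAME_MODEL = 0.20   # assumption: a same-family checker shares the blind spots

def surviving_fabrications(p_catch: float) -> int:
    """Count fabricated claims that pass the fact-check."""
    survived = 0
    for _ in range(N_CLAIMS):
        fabricated = random.random() < P_FABRICATE  # author's model made it up
        caught = random.random() < p_catch          # checker flags it
        if fabricated and not caught:
            survived += 1
    return survived

for label, p in [("independent checker", P_CATCH_INDEPENDENT),
                 ("same-model checker", P_CATCH_SAME_MODEL)]:
    print(f"{label}: ~{surviving_fabrications(p)} fabrications pass per {N_CLAIMS} claims")
```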
Have you observed this recently though?
I really find that as long as I isolate context/sessions, and more so when I push some GLM-5 reviews in to throw the US models off their game, I get a pretty good result. Of course I still meatreview, but there are times now when I'm in relative incompetence and find nothing.
Mmm... so recently I've been using Opus 4.6 to help me write mathematical proofs.
It's actually super helpful. It gets me started on the right track, but it doesn't get me 100% of the way there. I frequently found holes in the proofs.
I think one of the issues is that when I say, "Prove this", the AI tries really hard to satisfy me and doesn't give enough weight to the possibility that the theorem is wrong. I think if it's obviously wrong, it'll say so, but the proofs I was asking for were for fairly complex setups in which the theorem is true for most cases, but maybe not for some edge cases.
So, I did find it making some stuff up, and I had to press it on its mistakes. After pressing it, we eventually arrived at the proper assumptions that would rule out the edge cases.
But if I hadn't pressed, I probably would have put out an erroneous proof.
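A minimal illustration of that failure mode (my own toy example, nothing to do with the actual theorems under discussion): a statement that holds generically but breaks at an edge case. Asked to prove the universal version, a satisfy-the-user model can produce a convincing argument that silently assumes the edge case away, whereas a proof assistant refuses:

```lean
-- Looks plausible: divide, then multiply back, and you recover the original.
-- But Nat division truncates, so it fails whenever b doesn't divide a
-- (and for b = 0): with a = 1, b = 2 we get 1 / 2 * 2 = 0, not 1.
example : ¬ ∀ a b : Nat, a / b * b = a :=
  fun h => absurd (h 1 2) (by decide)
```

The fixed theorem needs a divisibility hypothesis, which mirrors the thread's experience of only arriving at the proper edge-case-excluding assumptions after pressing the model.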
Do you do that in the same session?
So what I find is that if I say review <xyz> for logic errors and omissions in a new session (I work git-based with all models though, so there is some indirection here too that may be helpful in triggering a different pattern through the layers), I can be sure that Claude finds a lot of stuff it missed on the first run. (Hate the apologies tho, wtf Anthropic.)
However, I do admit that second-model review works better. I think k00b was experiencing the same by mixing GPT and Claude for reviews.
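For anyone wanting to wire that up, here's a minimal sketch of the pattern, not anyone's actual setup: the endpoint, env vars, and model name are placeholders for whatever you run. The point is structural: the reviewer gets a fresh context with nothing but the diff and a brief, ideally from a different model family.

```python
import subprocess

from openai import OpenAI  # pip install openai; any OpenAI-compatible endpoint works

client = OpenAI()  # reads OPENAI_API_KEY (and optionally OPENAI_BASE_URL) from the env

def review_last_commit(model: str) -> str:
    """Fresh-session review: no chat history, just the diff and a review brief."""
    diff = subprocess.run(
        ["git", "diff", "HEAD~1"],
        capture_output=True, text=True, check=True,
    ).stdout
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Review this diff for logic errors and omissions. "
                        "No apologies, no praise. Numbered findings only."},
            {"role": "user", "content": diff},
        ],
    )
    return response.choices[0].message.content

# Second-model review: a different family tends to have different blind spots.
print(review_last_commit("your-second-model"))  # placeholder model name
```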
> no 1 is a product of no 2.
Hmm, no 1 can also simply be a wrong assessment of how reliable the process is.
> If you haven't got a reasonable understanding of the topic you are going to research
Worse: if you haven't got the slightest experience in what you're outsourcing! This is where, in the past, when corporations outsourced work to contractors, things were also... subject to improvement, haha.
> that post about wisdom.
It's #10 or so on my backlog of Scoresby's posts to re-read and reply to. Haha.
I'm feeling weird about this, so I checked on the claims in the first paragraph that includes a named individual:
> Casey Stefanski, Executive Director, spent 10 years at NCOSE as Senior Director of Global Partnerships. Unusually, she never appears on any NCOSE 990 filing as an officer, key employee, or among the five highest-compensated staff. A senior director title at a $5.4M organization for a decade with no 990 appearance suggests either below-threshold compensation, an inflated title, or something else about the arrangement.
There is a detailed biography of Stefanski on the DCA website, which confirms that she did work at NCOSE for ten years before starting her tenure as ED at DCA in 2025.
I looked through the NCOSE's 990s myself, visually, not using AI, because PDFs suck: 2024, 2023, 2022, 2021, 2019, 2018, 2017, and 2016 (I couldn't find 2020). It is true that her name is not listed in Section A, "Officers, Directors, Trustees, Key Employees, and Highest Compensated Employees," on any of these. I'm not entirely clear whether she should have been listed on these 990s, but many other Director-level people are.
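For anyone who wants to repeat this kind of check without eyeballing every page, a rough sketch of how one could grep the filings locally. Assumptions: the 990s have already been downloaded into a filings/ directory with names like ncose_990_2016.pdf, and they carry a text layer (many scanned filings don't, which is part of why PDFs suck):

```python
from pathlib import Path

import pdfplumber  # pip install pdfplumber

NAME = "stefanski"

# Assumed local layout: filings/ncose_990_2016.pdf ... filings/ncose_990_2024.pdf
for pdf_path in sorted(Path("filings").glob("ncose_990_*.pdf")):
    hits = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""  # None on pages without a text layer
            if NAME in text.lower():
                hits.append(page_no)
    print(f"{pdf_path.name}: {'pages ' + str(hits) if hits else 'not found'}")
```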
So, at least on this one paragraph, the reddit post is accurate.
I'll also add that the reddit post's description of the various age-verification bills is accurate compared to the other sources I've been tracking over the last year.
The 2024 NCOSE staff page does not list Stefanski, nor does the 2022 staff page, but she is listed on the 2021 staff page. She has a small picture, though, while 12 people have big pictures. The NCOSE's 990s only list 10 or 12 people, so I suspect the reason Stefanski doesn't show up on the 990s is simply that she wasn't senior enough. So on this count, I'd say the reddit post is wrong.
I hadn't heard of vxunderground before. Their post came across my TL via peter todd.
I came across the reddit post via @standardcrypto's link and the discussion on hacker news. It also largely tracks with previous things I've read about the age-verification phenomenon (#1440441).
Reading through the HN comments, it seems that some people are dubious of the researcher and methods (lots of accusations of AI reliance and a lack of due diligence, given that the reddit post includes so many names).
I have not dug through the source data. The reddit post includes a number of articles from traditional news outlets, but it also claims to have done original research using publicly available databases. This was definitely done via AI.
From their Methodology section:
I'll admit this is one that confirms my priors, so I probably should be more skeptical.