First of all, using ChatGPT (the paid version, but not the most expensive one) has been a net positive for my work. I use it a lot to code simple scripts that I know how to write, but that would take me ten times longer to write myself. It has also introduced me to CS algorithms I didn't know of, which let me do some things in much cleaner ways. And for writing research proposals it's a great help, if only to save time.
But whereas I used to check every detail it gave me, I slowly became complacent and started to trust it more and more blindly. Then its limitations hit me, several times, in more or less painful ways.
- I am working on a paper for Science, which, by definition, goes beyond the state of the art. In it, some quite technical material is discussed. I asked ChatGPT a straightforward question: is method A or method B considered the mathematically stricter version of the other? It very confidently told me method A is stricter than B. However, I know it's the opposite; I worked with some of the people who developed these methods. I told ChatGPT it was wrong. For another 15 minutes, it kept arguing that it may look that way, but that I was wrong and it was right. It did not want to admit its mistake. Because of this, I started questioning many of its previous statements...
- Another time, I asked it a pretty basic question about a widely used piece of software in my field. I was too lazy to Google it. I took it at its word and built tens of simulations on top of the answer it gave me. Only today did I realize it had completely hallucinated the answer to my question. I confidently reported on my simulations to colleagues, and now I have to backtrack on several of those statements. And it does not look good when I have to admit it's because I relied on ChatGPT instead of doing the basic checks myself.
- Something pretty similar to the previous bullet point happened a week or so ago. I spent a whole weekend trying to implement some new stuff with ChatGPT helping me along the way. But as it was new stuff that I did not yet know how to code myself, it took me a while to catch ChatGPT's mistakes and hallucinated assumptions. Again, this led me to report some shamefully wrong results to colleagues.
- A recent report by a Nature referee (supposedly among the best in the field) clearly showed they had used AI to write it. The em dashes, for one, but also probably the stupidest question about a feature in my data, one that no one who works in this field would even dare to ask because it is so obvious... yet exactly the kind of question ChatGPT would suggest asking.
I have no proof for this last one, but it is quite a likely conclusion based on my own experience using LLMs.
Many times, I realize I cannot trust junior colleagues anymore. They face the same AI limitations that I do, but are even less able to spot that ChatGPT can and will hallucinate when it hasn't been trained on the appropriate data for some questions. I often get annoyed by the sloppiness of their code, but can I even blame them if I end up making the same mistakes because of AI?
I am mostly writing this to reflect on my use of AI. It's useful, for sure, but at what cost... I really need to implement a better process so that I can benefit from it without quietly getting f'd before I realize it.
My first instinct now would be to copy and paste this block of text into ChatGPT and ask it to review it for clarity and flow. It'd do a great job, I know, but not today... not today.
So, my question to you: have you had bad experiences using AI?
Somewhat similar outcomes here using Cursor to write sysadmin code (Ansible, Python, etc.).
Generally speaking I only use GPT models for things like helping write README.md or INSTALL.txt instructions.
For actual code, I consider Claude and Gemini to be the best (fewest hallucinations plus the greatest ability to handle complex code bases)... perhaps Cursor also changes some of the parameters to help minimize these ill effects as well (not sure).
Although I've only had a few rare outright hallucinations, two issues come up much more regularly:
AI has definitely improved my productivity, but I've found it works best when you constrain it as much as reasonably possible. Keep it focused on one or two changes max per prompt, and hint at how you want those changes done.
It's a bit like micromanaging an employee...
Spot on. And depending on the employee, this can lead to great outcomes, but often, you're left frustrated and annoyed.
Your example also reminds me of the limitations of its context window. I really should avoid hour-long coding sessions without some reset in the middle.
I also get lazy in terms of the choice of model. I've had decent experiences with ChatGPT, and thus barely followed or tried other models. But I'll keep the Claude recommendation in mind for coding. I just don't feel like constantly hopping between models to choose the best one for the task at hand. But then I get the results I'm complaining about, so that's on me.
I completely understand. This is where things like Cursor come into play. It provides you with multiple different models (Perplexity does this as well, but it's not an IDE).
In addition, Cursor gives you "modes", for which you can define individual base prompts and specify a model per mode. So you can say, for instance: "In Ask Mode you should never change code, this is for brainstorming and planning only" (and choose ChatGPT for that). Then you can have "Bug Fix Mode" and choose Claude 3.5 for that, etc.
See pic
I have experimental pipelines with fast-agent running locally (though I may switch frameworks or code something myself) that can do these things too. You simply pre-program the optimal model per agent, the prompting, and so on.
For example, in my current experimental setup I use a large qwen3 on leased compute for analysis or for walking a pre-collected code graph through MCP, and then use mistral 32b locally to code a prototype, call pylint through MCP, fix issues, and so on.
It works okay-ish if you define small enough actions and then just loop
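For illustration, a "code, lint, fix, repeat" loop of that kind could look something like the sketch below. This assumes a local OpenAI-compatible endpoint; the URL, model name, prompts, and the three-round cap are placeholders of mine, not the actual fast-agent setup described above.

```python
# Minimal sketch of a bounded "code, lint, fix, repeat" loop.
# Assumes a local OpenAI-compatible server at BASE_URL; the model name and
# prompt wording are illustrative placeholders only.
import subprocess
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"   # hypothetical local endpoint
MODEL = "local-coder"                    # hypothetical model name

client = OpenAI(base_url=BASE_URL, api_key="not-needed")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def lint(path: str) -> str:
    # Deterministic check: no LLM involved, just pylint's own report.
    result = subprocess.run(["pylint", path], capture_output=True, text=True)
    return result.stdout

task = "Write a function that parses a CSV of timestamps and returns daily counts."
code = ask(f"{task}\nReturn only the contents of a single Python file.")

for _ in range(3):  # small, bounded loop: at most three rounds of fixes
    with open("prototype.py", "w") as f:
        f.write(code)
    report = lint("prototype.py")
    if "Your code has been rated at 10" in report:  # crude "clean" check
        break
    code = ask(
        "Here is a Python file and its pylint report. "
        "Fix only the reported issues, change nothing else.\n\n"
        f"FILE:\n{code}\n\nPYLINT:\n{report}"
    )
```

Capping the number of rounds is what keeps the "just loop" part from running away.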
There is so much to learn.
I actually think we are eventually going to be able to self-host most of this. Coding models will top out at some point, their incremental usefulness will slow down, and commodity hardware will catch up (I've been watching the AMD AI MAX+ 395 setups).
Sure, the top-end models will keep being impressive, but eventually everything becomes a commodity... I mean, in the early days of smartphones it was practically a necessity to upgrade from iPhone 1 to 2 to 3, as each change was huge. Now a person could reasonably use an iPhone 10 even though it's going on a decade old... these things eventually become "solved problems".
Agreed.
On principle, I don't use any LLM plan and have no middlemen in my setup. They shall not steal my data, they shall not mess with the output, and they shall definitely not know what I'm coding. Because fuck these guys. They aren't players with your best interest at heart.
So yeah: everything sovereign. I wish there were a larger version of llama3.2 or a distilled version of llama4, because the small models, despite nice and clean instruct behavior, still hallucinate too much to do analysis, and I can't run the big ones on an Apple M4.
Cool! And so with the $20/month plan, it doesn't matter which model I use for each query? I need to think about whether I should switch my current ChatGPT subscription to Cursor then. 500 requests per month does not sound like a lot. How is your experience with the "slow pool"? Anyhow, I'll look into it. Tnx
They have recently added "MAX" plans which allow for greater use. They occasionally rate-limit you based on how heavily the models are being used, but in my experience that only happens rarely.
Personally, I've been investigating moving to a completely open-source setup, which would be something like: VS Code + the Cline or RooCode extensions + the Requesty service.
This would simulate Cursor. Requesty is an API service (like OpenRouter) that gives you access to different models. So in that case you would load $X onto Requesty and use VS Code plus the extension you want (note: RooCode broke off from Cline, but they share lots of similar features).
In non-IDE mode, I really, really like Perplexity. It's basically my new search engine. If Perplexity ever releases IDE plugins for VS Code, I would strongly consider dumping everything and just using them.
The biggest benefit of Perplexity is that it includes top-notch real-time web search, so it's much more useful for day-to-day tasks.
Stop giving me such detailed and useful answers, I must keep rewarding you with sats~~
I actually use OpenRouter to access many models, and web search can also be enabled with it: https://openrouter.ai/announcements/introducing-web-search-via-the-api
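If I read their announcement correctly, enabling it can be as simple as a suffix on the model slug. Treat the ":online" convention and the example slug below as assumptions to verify against the current OpenRouter docs; the rest is just the standard OpenAI-compatible client pointed at their endpoint.

```python
# Rough sketch of calling OpenRouter with web search enabled.
# Assumption: appending ":online" to the model slug turns on their web search
# plugin (check the linked announcement / current docs before relying on it).
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

resp = client.chat.completions.create(
    model="openai/gpt-4o:online",  # the slug is just an example
    messages=[{"role": "user", "content": "What changed in pylint's latest release?"}],
)
print(resp.choices[0].message.content)
```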
Was there anything besides this feature that makes you prefer Requesty over OpenRouter?
Yeah, I've noticed the same: the model can't stick to a single mental model across a session. I started saving a style snapshot after the first clean output and just re-prompting with "follow this style, nothing else." It keeps me from ending up with a Frankenstein codebase.
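A rough sketch of that habit in code, if it helps; the file name, helper names, and prompt wording are mine, not the exact setup described above.

```python
# Sketch: persist a "style snapshot" once, then prepend it to every later prompt.
# STYLE_FILE and the prompt wording are illustrative placeholders.
from pathlib import Path

STYLE_FILE = Path("style_snapshot.py")

def save_style(clean_output: str) -> None:
    """Store the first output whose style you're happy with."""
    STYLE_FILE.write_text(clean_output)

def styled_prompt(task: str) -> str:
    """Build a prompt that pins the model to the saved style."""
    style = STYLE_FILE.read_text()
    return (
        "Follow the style of the code below exactly; change nothing about "
        "naming, structure, or formatting conventions.\n\n"
        f"--- STYLE SNAPSHOT ---\n{style}\n--- END SNAPSHOT ---\n\n"
        f"Task: {task}"
    )
```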
Your comments about the referees and junior colleagues are what scare me most.
It seems like we're gonna have a harder time trusting each other that we're interacting with a real human intelligence and not ChatGPT. Because ChatGPT does a good enough job of simulating human intelligence most times, it's easy for the human side of us to get lazy and use ChatGPT as a substitute. I'm not optimistic that we can prevent this from occurring.
True. I feel like I can find a good balance because I know and have experienced the before times, but some junior colleagues already feel ill-equipped at this point to deal with AI hallucinations.
What's interesting, too, is that some journals seem to approach the use of AI with a "don't ask, don't tell" mentality. I am still rarely asked during submission whether I used AI in the creation of an article.
I think this is already occurring, tbh, and I don't think there's a way back.
As a programmer and researcher, I find LLMs are useful for:
What I have found LLMs are not useful for:
I used to rely on LLMs quite heavily, but I found they usually just ended up wasting more time than they saved. Quite often they would lead me down a deep rabbit hole that went nowhere, and to get back on track, I'd end up needing to research the topic manually myself anyway.
Today, I use them sparingly. I'll only use them when I know an LLM will give me a good answer, otherwise I don't bother. I've mostly reverted back to Google search and reading man pages for my work.
They have their uses; it's just about understanding how and when to use them. They aren't a replacement for traditional cognitive tasks such as research or programming. Rather, they're more like a helpful junior assistant that requires constant supervision and prompting to get any use out of it.
I would not say I have had a "bad" experience with AI, because I have always been conscious that it will unavoidably hallucinate: for one, because this is constantly reported both by users and by the companies themselves, and more fundamentally because if we ourselves can hallucinate when doing research, it's simply impossible for an AI not to do the same. Too many variables on the fly.
I did run some basic tests with questions I knew the answer to, and while the models have made great progress very fast, they still fall far too short to be really usable.
However, I do sometimes like to ask the AI about certain subjects and problems, because even when I know it will get them wrong, 50% of the time its slop gives me a clue about something, and 50% of that 50% of the time it gives me a clue indirectly, simply because the process of reviewing the slop points my mind in the right direction.
With that careful use, my experience has always been positive so far, although this approach implies minimal use.
If you were a top 1% programmer in the world (I consider myself to be one in my area, frontend; my salary also tells me so), you would be surprised how often this happens at the cutting edge of my field too.
I often had to correct my colleagues when AI made an error.
It's not bad. I know juniors are sh*tting their pants. But reaaally complex stuff, naaw man
A LOT! So I was making this somewhat basic Java Spring Boot app and I told ChatGPT to review it.
```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.stereotype.Service;

@SpringBootApplication
public class Base {
    public static void main(String[] args) {
        SpringApplication.run(Base.class, args);
    }
}

// Controller layer
@RestController
class GreetingController {
    private final GreetingService greetingService;

    // Constructor-based dependency injection
    public GreetingController(GreetingService greetingService) {
        this.greetingService = greetingService;
    }

    @GetMapping("/greet")
    public String greet(@RequestParam(value = "name", defaultValue = "World") String name) {
        return greetingService.getGreetingMessage(name);
    }
}

// Service layer
@Service
class GreetingService {
    public String getGreetingMessage(String name) {
        return "Hello, " + name + "! Welcome to the BPE Window";
    }
}
```

This app is just supposed to send a hello message, as you can see from the GreetingService class. When ChatGPT gave its output, it was 200+ lines of code, wrapped in error-handling blocks, with 20+ variables (mine has only 1) and things I hadn't even learned yet, just to make it "ERROR FREE". Like wtf man! I asked you to check whether it works or not and you gave me a whole thesis on it.
Then again one day, I asked it another science question, and it was pretty simple: how to prepare fresh ferrous sulphate, because apparently I could find how to make ferrous sulphate, but not "fresh", on the internet. And it replied with a flowchart of how to create ferrous sulphate from pyrites. So I straightaway asked my chem teacher, who said it's a reaction of iron and sulphuric acid, filtered afterwards, and rebuked me for not studying the basics thoroughly. But it wasn't MY fault! I WAS MISLED!
Can you brief me about it? I might understand something :) Or is it top secret?
Currently I'm pivoting my setup:
- from synchronous (letting the LLM run unstructured with different models in a pipeline)
- to asynchronous, where everything that can be done with code, like linting, does not use LLMs.
Damn, I just wanted the AI to check if my plant was alive and it built a greenhouse with a self-watering system and an AI-powered scarecrow.
I swear, sometimes ChatGPT doesn’t review code — it rewrites it like it's auditioning for a job at NASA. Like bro, I’m still trying to survive public static void main, not orchestrate microservices across a Kubernetes cluster.
Same thing with chemistry — I asked for a fresh ferrous sulphate recipe and got a mining operation flowchart straight outta a metallurgy PhD thesis. Asked my chem teacher and he just said “use Fe + H₂SO₄ and move on.”
It's like these LLMs read Thus Spoke Zarathustra and thought every answer must ascend the mountain of abstraction before descending to meet us mortals.
But fr tho, loving that async pivot you're on @optimism. Turning LLMs from noisy sidekicks into focused bug-hunters with issue-detection filtering? That’s pretty GOOD
Don't worry, I won't steal your repo; I'm building a Human Behaviour Prediction Engine too: https://github.com/axelvyrn/TiresiasIQ (and it's quite good, believe me; I'd like your input).
Also, curious: How are you ranking issue significance without it hallucinating a crisis over a missing semicolon?
It doesn't matter. Every task should be small, or otherwise needs to be broken down.
It's harder to make it "just fix a semicolon", so in that case using non-LLM tools is better, or at least expose the tools needed to the LLM through MCP. Syntax fixing can be done with existing tools, so in this case you just expose an MCP tool, e.g.:
`code_fixing::correct_semicolons(files[])`

that implements the syntactical logic in code, without needing the LLM to actually write correct code.

Like using `standard --fix` to lint .js files?

Exactly!
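For illustration, exposing such a deterministic fixer as an MCP tool could look something like the sketch below, using FastMCP from the Python MCP SDK. The tool name mirrors the example above, but the naive fix logic is a made-up placeholder, not a real implementation.

```python
# Sketch: expose a deterministic "fixer" as an MCP tool so the LLM can call it
# instead of editing syntax itself. The fix logic is an illustrative placeholder.
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("code_fixing")

@mcp.tool()
def correct_semicolons(files: list[str]) -> dict[str, int]:
    """Remove stray trailing semicolons from files; return lines fixed per file."""
    fixed_counts: dict[str, int] = {}
    for name in files:
        path = Path(name)
        cleaned, removed = [], 0
        for line in path.read_text().splitlines():
            stripped = line.rstrip()
            if stripped.endswith(";"):
                stripped = stripped.rstrip(";")
                removed += 1
            cleaned.append(stripped)
        path.write_text("\n".join(cleaned) + "\n")
        fixed_counts[name] = removed
    return fixed_counts

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio for an MCP-capable client
```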
I will submit it to the editor this week or next, hopefully. I can send you the paper once it's published. For now, I prefer not to dox myself too much by being detailed about the field I work in. Even though some people here already know more about me than is good for staying anon~~
Okayy!
lol
Nearly every experience I have with AI is bad, whether it's ChatGPT, Claude, or Grok. If they would just reply with "I don't know" instead of spewing bullshit, they would be somewhat useful.
I have to say that I have found Claude to be much better for me than ChatGPT. My committee and the House Bipartisan AI Taskforce from last year met with Anthropic what seemed like every few weeks between the two of them, and Claude has just always seemed to outperform. Nothing ever too flashy, but much more reliable.
A lot of great insight in this post and in the comments. Thanks for sharing!
Nice post! AI, while useful, is still not sophisticated enough for complex tasks.
Some great comments here. When I have doubts about the quality of a response, I try to ask the question from several different angles, and I ask several different bots. But on net it is still a time winner. And challenging the bot is still a good mental exercise; it keeps you sharp.
I had similar experiences with ChatGPT 3.5 in the past. I was feeding it matrices and it was consistently generating wrong answers in the calculations. I swore I would never use it again. Then came GPT-4o. I think for people who want to learn a foreign language it became useful and fun. It also became useful for math, at least for me: I do my calculations and check my results against ChatGPT's. If they match, I consider mine likely correct; otherwise I re-check. If after re-checking they still don't match, I keep my calculations and ignore ChatGPT's. This way I'm never disappointed by it anymore, since I use it as a tool for double checks.
For coding, I feel like Claude can also give valuable answers, sometimes better than ChatGPT's. Recently I asked both to convert a Golang password-encryption algorithm to Python... and the results were consistently, totally wrong. Very disappointing, isn't it? But in this case I was double-checking against a good Golang version by decrypting the encrypted password, so with that kind of check I could immediately spot incorrect code. Disappointing, but fine, since the only negative impact was losing a little bit of my time. Bottom line regarding code: Claude can be good as well but is currently more expensive, and regardless, it's really important to have unit tests or some other kind of test to double-check results.
So for coding I use it more like a very advanced auto-completion tool. E.g., I want to read a CSV file and parse it a particular way, and it gives me the code right away. Or I want to write a SQL statement to show some statistics about a metric, and it writes it. It saves a tremendous amount of time, but everything has to be constantly double-checked: by testing the code, by doing a quick visual check, or by carefully comparing results if money is involved and it is extremely important that the code be right.
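The kind of double-check I mean can be as small as pinning the generated code against a trusted reference on a few inputs. Everything below is a toy stand-in (in practice the reference would be the known-good Golang implementation), not code from my actual project.

```python
# Sketch of a double-check for LLM-generated code: compare the ported function
# against a trusted reference on a handful of inputs. Both functions here are
# toy stand-ins for the real Golang reference and the LLM's Python port.
import hashlib

def reference_hash(password: str, salt: str) -> str:
    """Stand-in for the trusted implementation."""
    return hashlib.sha256((salt + password).encode()).hexdigest()

def ported_hash(password: str, salt: str) -> str:
    """Stand-in for the LLM-converted version under test."""
    return hashlib.sha256((salt + password).encode()).hexdigest()

def test_port_matches_reference():
    cases = [("hunter2", "pepper"), ("correct horse", "battery staple"), ("", "s")]
    for password, salt in cases:
        assert ported_hash(password, salt) == reference_hash(password, salt)

if __name__ == "__main__":
    test_port_matches_reference()
    print("all cases match")
```

If the port drifts from the reference on any case, the check fails immediately instead of weeks later.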
Hey, what's up, how's it going! I featured your article in the Stacker News zine this week.
Check it out.