A lot of these challenges can be overcome by teaching the LLM to use a separate tool [1]. For example, the LLM doesn't need to know how to play chess; it just needs to know how to drive a chess engine. It doesn't need to know how to calculate; it just needs to know how to write a Python script using NumPy/SciPy.
If you take your example and have the LLM respond with Python code that counts the letters and returns their positions, it is far more likely to get it right. At some point the OpenAI team will figure out how to do this behind the scenes. You can already ask it to analyze a CSV file and it automatically writes Python code to do so, so this is close to reality.
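As a rough sketch of what that looks like (the word and letter here are just placeholder inputs, not anything OpenAI actually runs), the model only has to emit something like:

    # Count occurrences of a letter and report their positions,
    # rather than having the LLM "count" characters it never sees
    # individually because of tokenization.
    def letter_positions(word: str, letter: str) -> tuple[int, list[int]]:
        positions = [i for i, ch in enumerate(word) if ch.lower() == letter.lower()]
        return len(positions), positions

    count, positions = letter_positions("strawberry", "r")
    print(count, positions)  # 3 [2, 7, 8]

Running that is trivially correct in a way that token-level "counting" isn't.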
Eventually we'll see these systems translate prompts into logical arguments and then use tools like Coq [2] to make provably true statements. They can lean on compilers and language servers to error-correct as they write code. Along the way they can write tests to check their work, as in the sketch below. The bar these LLMs have to clear is producing no more bugs than the average software engineer, and that's honestly a fairly low bar.
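The self-checking part can be as simple as the model also emitting a test for its own helper (continuing the hypothetical letter_positions example above):

    # A generated sanity check the model could run before answering.
    def test_letter_positions():
        count, positions = letter_positions("strawberry", "r")
        assert count == 3
        assert positions == [2, 7, 8]

    test_letter_positions()

If the assertion fails, that's a signal to regenerate the code instead of shipping a wrong answer.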

  1. https://arxiv.org/abs/2302.04761
  2. https://coq.inria.fr/about-coq
The example OP used is like saying, "look, this sports car can't go up stairs, therefore walking is the superior form of transport."
LLMs are not meant to count instances of a character in a string. We have better tools for that.