We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time—not significantly more or less often than the humans they were being compared to—while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
(emphasis mine)
While checking for a published version (there isn't one yet), I stumbled on this general-audience article on the topic, in case the hedged language of the arXiv preprint is not your thing.
Wow, really surprised it hadn't been done before.
I've long thought we were past the Turing Test. Now we just need a new definition of AGI.
Surprising indeed. I thought it'd have been one of the obvious benchmarks people would try to test. I guess they did test it, but this is the first time it has been passed.
On the contrary. A "small" language model—by which we generally mean one with on the order of millions (rather than billions) of parameters—struggles to mimic human conversation convincingly for several interlocking reasons:
Fewer parameters mean the model can store and manipulate far less information about language patterns, world knowledge, and subtle linguistic nuances.
As a result, it often resorts to simplistic or repetitive responses, rather than the rich variety of expression a human would use.
Small models tend to overfit to the specific data they were trained on, so they struggle with novel topics or unexpected turns in conversation.
They lack the depth to maintain a consistent persona, long-term context, or coherent thread over multiple turns, making their dialogue feel disjointed or “robotic.”
Passing a Turing test usually requires not just fluent language but also common-sense reasoning, up‑to‑date facts, and the ability to draw inferences.
With constrained capacity, small models cannot internalize large-scale factual databases or sophisticated reasoning patterns; they often hallucinate or give incorrect answers when pressed.
At their core, small LMs are powerful pattern‑matchers but lack the deeper latent structures (e.g., causal models, theory of mind) that larger models can approximate.
This leads to responses that may look grammatically correct but fail to capture intentions, emotions, or the pragmatic subtleties of human dialogue.
Implications for the Turing Test
Alan Turing’s original proposal envisioned an interlocutor capable of sustained, varied, and contextually appropriate conversation. Small language models simply don’t have the “brain‑like” resources—be it memory, breadth of knowledge, or reasoning scaffolds—to convincingly impersonate a human over an extended exchange. In short, they lack both the scale and depth required to fool a well‑informed judge.
I prefer original thoughts.
Sure, we all do, but AI is taking over the world courtesy of human knowledge, because we actually programmed the AI to function the way it does. 🤷♂️
What do you get, personally, out of copy-pasting this kind of text from an LLM? Genuine question. I really don't understand. Would you have done it in absence of the incentive of sats? Are you hoping to start a discussion?
I stopped reading as soon as I realised it was AI.
Yeah sure the territory is AI, lol
Ok, you do you :)
Wish I understood your phrase.
Ask ChatGPT ;)
Jk, it just means you do as you wish. I won't mind and will just focus on my own stuff.