Forget DeepSeek. Large language models are getting cheaper still

Non-paywalled: https://archive.md/ZYNAH

As recently as 2022, just building a large language model (LLM) was a feat at the cutting edge of artificial-intelligence (AI) engineering. Three years on, experts are harder to impress. To really stand out in the crowded marketplace, an AI lab needs not just to build a high-quality model, but to build it cheaply. In December a Chinese firm, DeepSeek, earned itself headlines for cutting the dollar cost of training a frontier model down from $61.6m (the cost of Llama 3.1, an LLM produced by Meta, a technology company) to just $6m. In a preprint posted online in February, researchers at Stanford University and the University of Washington claim to have gone several orders of magnitude better, training their s1 LLM for just $6. Phrased another way, DeepSeek took 2.7m hours of computer time to train; s1 took just under seven hours.

The figures are eye-popping, but the comparison is not exactly like-for-like. Where DeepSeek’s v3 chatbot was trained from scratch—accusations of data theft from OpenAI, an American competitor, and peers notwithstanding—s1 is instead “fine-tuned” on the pre-existing Qwen2.5 LLM, produced by Alibaba, China’s other top-tier AI lab. Before s1’s training began, in other words, the model could already write, ask questions, and produce code. Piggybacking of this kind can lead to savings, but can’t cut costs down to single digits on its own. To do that, the American team had to break free of the dominant paradigm in AI research, wherein the amount of data and computing power available to train a language model is thought to improve its performance. They instead hypothesised that a smaller amount of data, of high enough quality, could do the job just as well. To test that proposition, they gathered a selection of 59,000 questions covering everything from standardised English tests to graduate-level problems in probability, with the intention of narrowing them down to the most effective training set possible.
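For a rough sense of scale, here is a minimal Python sketch of the arithmetic behind the figures quoted above. The variable names are my own, and s1's training time is rounded up to seven hours because the article only says "just under seven". As the article itself notes, the comparison is not like-for-like, so treat the ratios as illustrative only.

```python
import math

# Figures as quoted in the article (US dollars and compute hours).
llama_cost = 61.6e6      # Llama 3.1 training cost
deepseek_cost = 6e6      # DeepSeek training cost
s1_cost = 6.0            # s1 fine-tuning cost
deepseek_hours = 2.7e6   # DeepSeek compute hours
s1_hours = 7.0           # "just under seven hours", so an upper bound (assumption)


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Return how many powers of ten separate two positive quantities."""
    return math.log10(larger / smaller)


# Cost gap between DeepSeek and s1, in powers of ten.
cost_gap = orders_of_magnitude(deepseek_cost, s1_cost)

# Cost gap between Llama 3.1 and DeepSeek, for comparison.
llama_gap = orders_of_magnitude(llama_cost, deepseek_cost)

# How many times more compute hours DeepSeek used than s1.
hours_ratio = deepseek_hours / s1_hours

print(f"DeepSeek -> s1 cost gap: {cost_gap:.1f} orders of magnitude")
print(f"Llama 3.1 -> DeepSeek cost gap: {llama_gap:.2f} orders of magnitude")
print(f"Compute-hours ratio: about {hours_ratio:,.0f}x")
```

On the quoted numbers, the step from DeepSeek to s1 is about six orders of magnitude in dollars, while the step from Llama 3.1 to DeepSeek is only about one.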

Not sure exactly how to think about this. They did build on a previous LLM, which was trained on a staggering amount of data, so it is not truly a $6 model. But maybe we have reached the point where we can consider existing LLMs part of the public good, and reading in all the weights (or whatever other method they use to reverse engineer it if the weights are not public) is akin to reading one extremely dense volume of an encyclopedia?

158 sats \ 0 replies \ @optimism 21 Jul

To me, this is exactly the kind of evolution needed.

Brute-forcing a problem ("bigger == better") is the crudest way to solve it. But because the big companies boast about their costs as if they were a benefit to society, and people are jumping on that bandwagon, somehow a very large part of society really wants Elon to get another 50B in funding to waste on feeding X posts as truth to an autocorrect algo. You know what? For 50B, we can also feed in Reddit and 4chan. AI will make the world a better place with all that knowledge. lol.

Qwen is open weights. It is meant to be built upon, and so are DeepSeek and Llama (though there seem to be some issues with Llama 4; Zuck's probably off the open source path). Not using that is a shame. I think these open models can definitely use fine-tuning, and in open source, you just do it if you want it.

People using their big brains and taking smart approaches to problem solving in this space is just what we need, so that we can move away from building massive datacenters just to transform text to math in massive batches that we then throw away. I hope for more of this.