DeepSeek is a top Chinese AI company. It recently released R1, a competitor to OpenAI's o1 model that can answer advanced scientific questions.
In a week, DeepSeek shattered media narratives around AI, China, and open source. It went from a niche technical release on inauguration day to a national press cover story.
Whenever conventional narratives are challenged, both hype and cope distort the truth. Here’s what we know about DeepSeek and what it means for sacred narratives.
Introduction
DeepSeek R1 is a "reasoning" model. What does that mean? A reasoning model modifies a base model, in this case DeepSeek V3, to answer more difficult questions by generating a sequence of intermediate outputs called a "Chain of Thought". Many of DeepSeek's most notable optimizations, around algorithm structure, memory management, and data types, were made in the base model V3, released in December.
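To make "Chain of Thought" concrete, here is a toy illustration (my own, not from the papers) of the difference between a direct answer and a reasoning trace:

```python
# A toy illustration (mine, not DeepSeek's) of what a "Chain of Thought"
# output looks like compared with a direct answer.
question = "A train travels 120 km in 1.5 hours. What is its average speed?"

direct_answer = "80 km/h"

chain_of_thought = (
    "Speed is distance divided by time. "
    "Distance = 120 km, time = 1.5 hours. "
    "120 / 1.5 = 80. "
    "Answer: 80 km/h"
)

# Reasoning models like R1 are trained so that generating the intermediate
# steps improves the odds of a correct final answer on hard problems.
print(chain_of_thought)
```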
V3 was competitive with Western base models, but R1 put DeepSeek into an exclusive club with OpenAI and Google, passing Meta, xAI, and Anthropic. OpenAI released the first "reasoning" model, o1, in September; DeepSeek is the third company to release this type of model, after OpenAI and Google. On many metrics, R1 surpasses Google's Flash Thinking model.
Are DeepSeek’s Results Real?
DeepSeek claims similar performance to o1 on a variety of tasks.
ARC Prize has independently verified R1’s performance, which scores similarly to o1. LMSYS, a benchmark based on live feedback from real users, ranks it higher than everyone but OpenAI and Google. R1 has also been rapidly adopted by scientists and engineers. Open source models like R1 are overwhelmingly favored by researchers, because they are much easier to modify. Western open source models like UC Berkeley’s Sky-T1 have also successfully replicated near-o1 performance.
The balance of evidence favors DeepSeek’s results being real. While only a few independent verification attempts have been made, they have all stood up to scrutiny.
Is DeepSeek’s Cost Real?
DeepSeek V3, the base model behind R1, claims a total training cost of roughly 5.6M USD. Several people have expressed skepticism about this figure for various reasons. Scale AI CEO Alexandr Wang claimed that DeepSeek had smuggled 50,000 H100s in violation of US export controls.
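For context on where the headline number comes from: the V3 technical report counts GPU-hours for the final training run and multiplies by an assumed rental price. A quick back-of-envelope reproduction of that arithmetic:

```python
# Back-of-envelope reproduction of the V3 paper's headline cost figure.
# These are DeepSeek's own reported numbers, not independent measurements.
gpu_hours = 2.788e6   # total H800 GPU-hours reported for training
price_per_hour = 2.0  # assumed rental price in USD per GPU-hour

print(f"${gpu_hours * price_per_hour / 1e6:.2f}M")  # ~$5.58M
```

Note that this accounting covers only the final training run, excluding prior research and ablation experiments, which is one reason critics consider the figure misleading.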
There's no clear way to verify the training cost of DeepSeek V3. The closest thing we can do is attempt replication using similar techniques and spending, and some attempts are already under way. Replication can show that similar results at a similar cost are possible, but a failed replication cannot definitively prove DeepSeek's numbers false.
Export Controls
In Washington, DC, a political debate has been raging over whether AI hardware export controls are worth the cost. Both sides seem to want to shoehorn DeepSeek's algorithmic improvements into this debate.
The export control critics argue that the export controls caused Chinese firms like DeepSeek to innovate. On the narrow point, they might be right, but that’s not enough to declare export controls dead. Consider the equilibrium. When American firms have a hardware advantage, they can integrate the innovations of Chinese firms with an asymmetric advantage. This may not be true for a narrow set of H800-specific improvements, but it is true for the majority of algorithmic improvements made in DeepSeek V3.
What export control critics get right is that China can't be frozen in place. Chinese firms will always be capable of improving. If America wants to maintain its lead, American firms must keep innovating and integrating new discoveries.
Three Stories to Explain DeepSeek
The DeepSeek papers are concise (53 and 22 pages respectively, even including references and appendices)! For clear descriptions of implementation and results, just read the papers.
The trouble with "just read the papers" is that the significance of these improvements is unclear to a general audience. Imperfect as this is, I'll introduce three general stories that capture the context for DeepSeek's improvements. Of course, stories will not perfectly represent the technology or history involved. Hopefully, these stories are better than the alternatives.
The first story is a continuous story of model architecture, or what most people consider the AI model's "algorithm". RNNs, Transformers, and Mixture of Experts models form the rough evolutionary chain of this progression. The direction of this story is toward models that are faster, cheaper, and more accurate. Speed and cost are both functions of computational complexity, a rough correlate of how long a model takes to run. Accuracy can be in tension with the other two: by shrinking a model, AI researchers often increase speed at the cost of accuracy. Sometimes the increased speed allows for longer training, which can more than compensate for that loss. The challenge is to make models more efficient without significantly harming accuracy.
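To make that tradeoff concrete, consider the arithmetic behind Mixture of Experts models, using DeepSeek V3's published parameter counts: the model keeps a very large total capacity but activates only a small fraction of it per token.

```python
# MoE arithmetic using DeepSeek V3's published parameter counts.
total_params = 671e9   # total parameters across all experts
active_params = 37e9   # parameters activated per token by the router

# Per-token compute scales with *active* parameters, so the model runs
# like a ~37B dense model while retaining 671B parameters of capacity.
print(f"fraction of model used per token: {active_params / total_params:.1%}")
```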
To that end, DeepSeek introduced several optimizations: Auxiliary-Loss-Free Load Balancing, Multi-Head Latent Attention, and Multi-Token Prediction.
Auxiliary-Loss-Free Load Balancing is an improvement on the existing Mixture of Experts architecture: it solves a known problem, load balancing, in a more efficient way. Multi-Head Latent Attention is a more efficient implementation of the standard Multi-Head Attention step that does not appear to carry a significant cost to accuracy. Multi-Token Prediction is a technique that increases accuracy by predicting multiple tokens at once, at a greater cost per token. However, just as lower cost per token can buy longer training and therefore more accuracy, greater accuracy per token can lower long-run cost.
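Of the three, Auxiliary-Loss-Free Load Balancing is the easiest to sketch. A minimal version, based on my reading of the V3 paper (sizes and names below are illustrative, not DeepSeek's): each expert carries a bias that is added to its routing score only when selecting the top-k experts, and the bias is nudged up or down outside of gradient descent depending on whether the expert is under- or over-loaded.

```python
import numpy as np

# Minimal sketch of auxiliary-loss-free load balancing (my reading of the
# V3 paper; sizes are illustrative). Each expert carries a bias added to
# its routing score for top-k *selection only*; the bias is adjusted by a
# fixed step outside of gradient descent to even out expert load.
n_experts, top_k, step = 8, 2, 0.001
bias = np.zeros(n_experts)

def route(scores):
    """scores: (tokens, n_experts) router affinities for one batch."""
    global bias
    # Select experts using biased scores (expert outputs would still be
    # weighted by the original, unbiased scores).
    topk = np.argsort(scores + bias, axis=1)[:, -top_k:]
    load = np.bincount(topk.ravel(), minlength=n_experts)
    # Nudge biases: overloaded experts down, underloaded experts up.
    bias -= step * np.sign(load - load.mean())
    return topk

tokens = np.random.randn(1024, n_experts)
print(np.bincount(route(tokens).ravel(), minlength=n_experts))
```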
The second narrative revolves around combating hardware limitations. The US export controls on H100s, or more specifically on chips with high-bandwidth memory, created an asymmetry in model loading costs. In short, DeepSeek pays more in dollars and time to load a model and update weights. Changes to DualPipe Pipeline Parallelism and Cross-Node All-to-All Communication fall under this category. I defer to the following Tom’s Hardware article:
DeepSeek used the DualPipe algorithm to overlap computation and communication phases within and across forward and backward micro-batches and, therefore, reduced pipeline inefficiencies. In particular, dispatch (routing tokens to experts) and combine (aggregating results) operations were handled in parallel with computation using customized PTX (Parallel Thread Execution) instructions, which means writing low-level, specialized code that is meant to interface with Nvidia CUDA GPUs and optimize their operations. The DualPipe algorithm minimized training bottlenecks, particularly for the cross-node expert parallelism required by the MoE architecture, and this optimization allowed the cluster to process 14.8 trillion tokens during pre-training with near-zero communication overhead, according to DeepSeek.
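The core idea of overlapping computation and communication can be shown in miniature. The sketch below is a generic PyTorch pattern, not DeepSeek's DualPipe (which schedules pipeline micro-batches across nodes): a second CUDA stream moves the next batch of data while the default stream computes on the current one, so transfer time hides behind compute time.

```python
import torch

# Generic compute/communication overlap (not DeepSeek's DualPipe): a second
# CUDA stream copies micro-batch i+1 to the GPU while the default stream
# computes on micro-batch i. Host-to-device copies stand in for the much
# more complex cross-node all-to-all traffic described above.
assert torch.cuda.is_available()
comm_stream = torch.cuda.Stream()
weight = torch.randn(4096, 4096, device="cuda")
batches = [torch.randn(4096, 4096).pin_memory() for _ in range(4)]

with torch.cuda.stream(comm_stream):
    current = batches[0].to("cuda", non_blocking=True)  # prefetch first batch

for i in range(len(batches)):
    torch.cuda.current_stream().wait_stream(comm_stream)  # copy must finish
    nxt = None
    if i + 1 < len(batches):
        with torch.cuda.stream(comm_stream):              # overlap next copy
            nxt = batches[i + 1].to("cuda", non_blocking=True)
    out = current @ weight                                # stand-in compute
    current = nxt
```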
A related but more generalizable improvement is low-precision training. A common practice throughout the lifetime of machine learning, at both the hardware and software levels, is optimizing the data type. Computers use a series of bits, each 0 or 1, to store data, and typically use 32 or 64 bits to store numbers (both integers and floating-point numbers). As performance became more important, two things happened in sequence. First, researchers realized that many of those bits could be removed without a significant cost to accuracy. This was low-hanging fruit, and it was optimized almost immediately once machine learning began to be taken seriously commercially. Later data type improvements required much more difficult work to reduce bits without a significant loss in accuracy. Recently, that work has focused on reducing the bits in an already-trained model for more efficient inference, a practice known as "quantization". As the name suggests, low-precision training applies those same low-precision adaptations to training itself.
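A toy example of the inference-side practice makes the bits-versus-accuracy trade visible. This is simple int8 quantization, not DeepSeek's FP8 training recipe, which applies similar ideas to the training loop:

```python
import numpy as np

# Toy symmetric int8 quantization: store weights as 8-bit integers plus a
# scale factor, then reconstruct approximate float values for use.
weights = np.random.randn(1024, 1024).astype(np.float32)

scale = np.abs(weights).max() / 127.0              # one scale per tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale         # reconstruct for use

print(f"memory: {weights.nbytes:,} -> {q.nbytes:,} bytes")  # 4x smaller
print(f"mean abs error: {np.abs(weights - dequantized).mean():.6f}")
```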
We might also consider DeepSeek’s algorithmic changes as a response to their unique hardware situation.
A third narrative is the development of fine-tuning techniques, which modify AI models that have already completed the earlier step of "pre-training". Progress in fine-tuning is often framed as an academic debate between Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO). DeepSeek has weighed in with a variant of PPO called Group Relative Policy Optimization (GRPO), introduced by DeepSeek researcher Zhihong Shao in 2024. According to the DeepSeek R1 paper, this change was made "to save the training costs of [reinforcement learning]". Historically, fine-tuning has not been a significant share of training cost, but that may be changing with "reasoning" models such as R1 and OpenAI's o1.
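The cost saving comes from dropping PPO's separate value network: GRPO samples a group of answers per prompt and uses the group's own reward statistics as the baseline. A minimal sketch of the group-relative advantage, based on my reading of the papers:

```python
import numpy as np

# Group-relative advantage, the core of GRPO (my reading of the DeepSeek
# papers): sample G answers per prompt, score them, and normalize each
# reward against the group's mean and standard deviation. No value network.
def group_relative_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. four sampled answers to one math problem, scored 1 if correct
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 1. -1. -1.  1.]
```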
Conclusion
Decisions matter. AI researchers who believe that decisions matter make better decisions. A culture that takes AI progress for granted will fail to make good research decisions. Open source is a check on this system: it gives more researchers the tools to make improvements, bust myths, and move the state of the art forward.