The chips behind the DeepSeek miracle sound a wake-up call for Nvidia

Over the past two weeks, DeepSeek has become a global sensation. In the Western world especially, this generative artificial intelligence system from China has triggered widespread discussion. Within the first 18 days of its release, DeepSeek was downloaded an astonishing 16 million times, almost twice as many as competitor OpenAI's ChatGPT over the same period, which demonstrates its strong market appeal.

According to data from market analysis company Appfigures, DeepSeek's app topped Apple's App Store for the first time on January 26 and has held the top spot globally since then. Since its release at the beginning of this year, it has quickly climbed to the top of the Apple App Store download rankings in 140 countries, and it also holds the top position in the Android Play Store in the United States.

DeepSeek, a large AI model from China, owes this attention not only to its excellent performance but also to its low training cost, which is the key to its global appeal. In today's article, we take a look at the chips and systems behind DeepSeek.

DeepSeek's architecture, in its own words

Back in August 2024, the DeepSeek team published a paper describing a new load-balancing method it had created to tie together the elements of its Mixture-of-Experts (MoE) base model.

DeepSeek stated in the paper that, for Mixture-of-Experts (MoE) models, an unbalanced expert load leads to routing collapse or increased computational overhead. Existing methods usually use an auxiliary loss to encourage load balance, but a large auxiliary loss introduces non-negligible interference gradients into training and thus impairs model performance.

To control load balance during training without producing undesired gradients, the DeepSeek team proposed Loss-Free Balancing, an auxiliary-loss-free load balancing strategy.

Specifically, before making the top-K routing decision, Loss-Free Balancing first applies an expert-wise bias to each expert's routing score. By dynamically updating each expert's bias according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load.

Furthermore, since Loss-Free Balancing does not produce any interference gradients, it also raises the upper bound of model performance obtained from MoE training. The DeepSeek team validated Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that, compared with traditional auxiliary-loss-controlled load balancing strategies, Loss-Free Balancing achieves both better performance and better load balance.

Figure 1: Loss-Free Balancing selects experts according to their biased gating scores at each training step and updates the expert-wise bias after each training step.
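The routing rule described above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not DeepSeek's implementation: the sign-based bias update, the step size, the sigmoid gating scores, and the skewed random router in `simulate` are all illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def biased_topk_route(scores, bias, k):
    """Select top-k experts per token from biased scores.

    The bias only influences which experts are selected; the gate
    values returned are taken from the original, unbiased scores
    (an assumption of this sketch).
    """
    biased = scores + bias
    topk = np.argsort(-biased, axis=1)[:, :k]          # (tokens, k) expert ids
    gates = np.take_along_axis(scores, topk, axis=1)   # unbiased gate values
    return topk, gates

def update_bias(bias, topk, n_experts, step_size):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    load = np.bincount(topk.ravel(), minlength=n_experts)
    return bias + step_size * np.sign(load.mean() - load)

def imbalance(loads):
    """Average over steps of (busiest expert load / mean expert load)."""
    return float((loads.max(axis=1) / loads.mean(axis=1)).mean())

def simulate(n_steps=200, n_tokens=512, n_experts=8, k=2, step_size=0.01, seed=0):
    rng = np.random.default_rng(seed)
    skew = np.linspace(0.0, 1.0, n_experts)   # hypothetical router preference
    bias = np.zeros(n_experts)
    loads = []
    for _ in range(n_steps):
        logits = rng.normal(size=(n_tokens, n_experts)) + skew
        scores = 1.0 / (1.0 + np.exp(-logits))          # sigmoid gating scores
        topk, _ = biased_topk_route(scores, bias, k)
        loads.append(np.bincount(topk.ravel(), minlength=n_experts))
        bias = update_bias(bias, topk, n_experts, step_size)
    return np.array(loads), bias

loads, bias = simulate()
print("load per expert, first step:", loads[0])
print("load per expert, last step: ", loads[-1])
print("imbalance first step vs last 20 steps:",
      imbalance(loads[:1]), imbalance(loads[-20:]))
```

Because no extra loss term is involved, the bias update never touches the gradients; it only shifts which experts win the top-K selection in the next step.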

In the "DeepSeek-V3 Technical Report" released at the end of 2024, the DeepSeek team described the technical architecture of its DeepSeek-V3 model in depth, which gives us further material for understanding the company's technology.

They stated plainly in the report that, taking a forward-looking view, the company has consistently pursued strong model performance at low cost. Architecturally, DeepSeek-V3 therefore still uses Multi-head Latent Attention (MLA) for efficient inference and DeepSeekMoE for cost-effective training. To make training efficient, the DeepSeek team supports FP8 mixed-precision training and comprehensively optimizes the training framework. In their view, low-precision training has become a promising approach to efficient training, and its development is closely tied to advances in hardware capabilities.

Figure 2: Overall mixed precision framework using FP8 data format. For clarity, only linear operators are illustrated.

By supporting FP8 computation and storage, the DeepSeek team achieved both accelerated training and reduced GPU memory usage. For the training framework, they designed the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap.

Figure 3: Basic architecture of DeepSeek-V3. Following DeepSeek-V2, the company adopts MLA and DeepSeekMoE for efficient inference and economical training.

The DeepSeek team says this overlap ensures that as the model scales further, the company can still use fine-grained experts across nodes while achieving near-zero all-to-all communication overhead as long as it maintains a constant compute-to-communication ratio.
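A toy timing model may help show why a constant compute-to-communication ratio matters here. This is an idealized sketch, not DualPipe itself: it assumes perfect overlap (a step costs the longer of the two phases), and the millisecond figures are hypothetical.

```python
def step_time(compute_ms, comm_ms, overlap=True):
    """Idealized step time: overlapped phases cost max(), serialized phases the sum."""
    return max(compute_ms, comm_ms) if overlap else compute_ms + comm_ms

def exposed_comm(compute_ms, comm_ms, overlap=True):
    """Communication time that is not hidden behind computation."""
    return step_time(compute_ms, comm_ms, overlap) - compute_ms

# Scale the model up while keeping the compute-to-communication ratio fixed.
for scale in (1, 2, 4):
    compute_ms, comm_ms = 10.0 * scale, 8.0 * scale   # hypothetical timings
    print(f"scale x{scale}: "
          f"serialized exposes {exposed_comm(compute_ms, comm_ms, overlap=False)} ms, "
          f"overlapped exposes {exposed_comm(compute_ms, comm_ms)} ms")
```

In this model the exposed communication stays at zero for as long as communication takes no longer than computation, which is the condition the constant ratio preserves as the model grows.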

In addition, the DeepSeek team developed efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidth. The company also carefully optimized the memory footprint so that DeepSeek-V3 can be trained without expensive tensor parallelism.

By combining these efforts, the DeepSeek team achieved high training efficiency.

Table 1: Training costs of DeepSeek-V3, assuming the H800 rental price is $2 per GPU hour.

As the DeepSeek team emphasizes in the paper, this efficiency was achieved through the co-design of algorithms, frameworks, and hardware. In the pre-training stage, training DeepSeek-V3 requires only 180K H800 GPU hours per trillion tokens, that is, only 3.7 days on its cluster of 2,048 H800 GPUs. As a result, the pre-training stage was completed in less than two months and took 2,664K GPU hours. Including 119K GPU hours for context length extension and 5K GPU hours for post-training, the full training of DeepSeek-V3 took only 2.788 million GPU hours.

Assuming a rental price of US$2 per GPU hour for the H800, the total training cost comes to only US$5.576 million. The DeepSeek team also specifically emphasized that this figure covers only the official training of DeepSeek-V3 and excludes the costs of prior research and ablation experiments on architecture, algorithms, or data. For comparison, OpenAI chief Sam Altman has said that training GPT-4 cost more than $100 million.
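The arithmetic behind these figures is easy to check. The short script below only re-derives the numbers quoted from the report; the variable names are illustrative, and the US$2 rate is the rental price assumed in the report, not a measured cost.

```python
# GPU hours reported for each training stage, in thousands of H800 GPU hours.
GPU_HOURS_K = {"pre-training": 2664, "context extension": 119, "post-training": 5}
PRICE_PER_GPU_HOUR = 2.0                   # USD, rental price assumed in the report
CLUSTER_GPUS = 2048                        # H800 GPUs in the training cluster
GPU_HOURS_PER_TRILLION_TOKENS = 180_000

total_gpu_hours = sum(GPU_HOURS_K.values()) * 1_000
total_cost = total_gpu_hours * PRICE_PER_GPU_HOUR

# Wall-clock days when all GPUs in the cluster work in parallel.
days_per_trillion = GPU_HOURS_PER_TRILLION_TOKENS / CLUSTER_GPUS / 24
pretrain_days = GPU_HOURS_K["pre-training"] * 1_000 / CLUSTER_GPUS / 24

print(f"total GPU hours:          {total_gpu_hours:,}")
print(f"total cost (USD):         {total_cost:,.0f}")
print(f"days per trillion tokens: {days_per_trillion:.1f}")
print(f"pre-training days:        {pretrain_days:.1f}")
```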

On January 20, DeepSeek launched the DeepSeek-R1 model, which adds two reinforcement learning stages and two supervised fine-tuning stages to enhance the model's reasoning capabilities. DeepSeek charges 6.5 times more for the R1 model than for the base V3 model. DeepSeek subsequently released Janus-Pro, an updated version of its multimodal model Janus. The new model improves the training strategy, expands the training data, and scales up the model size, enhancing multimodal understanding and text-to-image generation.

With that, DeepSeek became a sensation all over the world.

The chips behind DeepSeek

After DeepSeek's release, discussion of its systems and technical framework spread across the Internet, particularly with regard to hardware. Its extremely low cost has sent shock waves through the entire AI chip market, and the sharp decline in Nvidia's share price over the past few days is the most direct reflection of this concern.

As mentioned above, DeepSeek said that the cluster used to train the V3 model has only 256 server nodes, each with 8 H800 GPU accelerators, for a total of 2,048 GPUs. Analysts at The Next Platform speculate that these are the H800 SXM5 version of Nvidia's H800, which has its FP64 floating-point performance capped at 1 teraflops and is otherwise identical to the 80 GB version of the H100 that most companies around the world can buy.

Within each node, the eight GPUs are interconnected through NVSwitch to create a shared memory domain across their GPU memories, and the node has multiple InfiniBand cards (perhaps one per GPU) that provide high-bandwidth links to the other nodes in the cluster.
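The two-level topology described here (NVSwitch inside a node, InfiniBand between nodes) can be sketched as a simple lookup. The contiguous rank numbering and the function names are assumptions made for illustration; the source does not say how DeepSeek numbers its GPUs.

```python
GPUS_PER_NODE = 8
NODES = 256
TOTAL_GPUS = NODES * GPUS_PER_NODE   # 2,048 GPUs in total

def locate(rank):
    """Map a global GPU rank to (node index, local GPU index).

    Assumes ranks are numbered contiguously node by node (hypothetical).
    """
    if not 0 <= rank < TOTAL_GPUS:
        raise ValueError(f"rank {rank} is outside 0..{TOTAL_GPUS - 1}")
    return divmod(rank, GPUS_PER_NODE)

def link_between(rank_a, rank_b):
    """Name the interconnect that traffic between two GPUs would cross."""
    if rank_a == rank_b:
        return "local"
    same_node = locate(rank_a)[0] == locate(rank_b)[0]
    return "NVLink/NVSwitch" if same_node else "InfiniBand"

print(TOTAL_GPUS)
print(locate(2047))
print(link_between(0, 7), link_between(7, 8))
```

Any traffic between GPUs on different nodes has to cross the slower InfiniBand tier, which is why cross-node all-to-all communication receives so much attention in the report.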

The H800 itself is the GPU that Nvidia launched in response to US export restrictions. The US GPU export rules at the time mainly limited two things, computing power and bandwidth, with computing power capped at 4,800 TOPS and bandwidth capped at 600 GB/s. The A800 and H800 match the computing power of their original counterparts, but their bandwidth is reduced.

Figure 4: Details of H800

As mentioned above, DeepSeek used the SXM version of the H800 for training. SXM is a high-bandwidth socket solution for connecting NVIDIA Tensor Core accelerators to NVIDIA's proprietary DGX and HGX systems. For each generation of NVIDIA Tensor Core GPUs, the HGX boards in DGX systems come with a matching SXM socket type, which provides high bandwidth, power delivery, and other functions for the corresponding GPU daughter cards.

A specialized HGX system board interconnects 8 GPUs through NVLink, providing high bandwidth between GPUs. NVLink enables extremely fast data flow between GPUs, allowing them to operate like a single giant GPU without going through PCIe or communicating with the CPU to exchange data. The NVIDIA DGX H800 connects 8 SXM5 H800 GPUs through 4 NVLink switch chips; each GPU has 400 GB/s of bandwidth, and the total bidirectional bandwidth exceeds 3.2 TB/s. Each H800 SXM GPU is also connected to the CPU via PCI Express, so data computed by any of the 8 GPUs can be sent back to the CPU.

Figure 5: Basic DGX/HGX-to-CPU framework diagram

Over the past few years, large enterprises have become increasingly interested in NVIDIA DGX because SXM GPUs are better suited to large-scale deployment. As mentioned above, the eight H800 GPUs are fully interconnected via NVLink and NVSwitch. In DGX and HGX systems, the 8 SXM GPUs are connected differently from PCIe cards: each GPU is connected to 4 NVLink Switch chips, essentially letting all the GPUs run as one big GPU. This scalability can be extended further with the NVIDIA NVLink Switch System to deploy and connect 256 DGX H800 systems into a GPU-accelerated AI factory.

Figure 6: Basic 8-PCIe-GPU-to-CPU framework diagram

DeepSeek in the eyes of foreign analysts

Given these GPUs and systems, many analysts in the West have questioned what the DeepSeek team achieved. However, analysts at The Next Platform said that anyone who reads the 53-page paper carefully will find that DeepSeek adopted all kinds of ingenious optimizations and methods to build the V3 model. They also genuinely believe that these measures reduced inefficiency and improved DeepSeek's training and inference performance on the hardware.

They believe that the key innovation in the DeepSeek team's approach to training the V3 base model is the use of 20 of the 132 streaming multiprocessors (SMs) on the Hopper GPU as a communication accelerator and scheduler for data as it passes around the cluster, while the training run works through the tokens and generates the model's weights. The Next Platform speculates that this approach, in which "overlap between computation and communication can hide communication latency during computation," as the V3 paper states, uses the SMs to create what is effectively an L3 cache controller and data aggregator between GPUs that are not in the same node.

As The Next Platform put it in its reading of the paper, DeepSeek in effect created its own virtual DPU on the GPU to perform various SHARP-like processing for all-to-all communication in the GPU cluster.

As mentioned above, the DeepSeek team designed the DualPipe algorithm to achieve efficient pipeline parallelism. In this regard, The Next Platform points out that if DeepSeek can push computational efficiency on these 2,048 GPUs close to 100%, the cluster starts to look like one with 8,192 GPUs (minus some SMs, of course) that run less efficiently because they lack DualPipe. For comparison, OpenAI's GPT-4 base model was trained on 8,000 Nvidia "Ampere" A100 GPUs, which is equivalent to 4,000 H100s (sort of).
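The ratios implied by these figures can be worked out directly. The sketch below only rearranges numbers quoted in the article; the 25% baseline efficiency is not stated anywhere in the source and is merely what the 2,048 versus 8,192 comparison implies, and the helper name is hypothetical.

```python
TOTAL_SMS = 132    # streaming multiprocessors per Hopper GPU
COMM_SMS = 20      # SMs DeepSeek dedicates to communication

compute_sms = TOTAL_SMS - COMM_SMS
comm_share = COMM_SMS / TOTAL_SMS

def equivalent_gpus(gpus, efficiency, baseline_efficiency):
    """GPUs running at baseline_efficiency needed to match `gpus` at `efficiency`."""
    return gpus * efficiency / baseline_efficiency

# Baseline efficiency implied by the quoted comparison (an inference, not a source figure).
implied_baseline = 2048 / 8192

print(f"SMs left for compute: {compute_sms} ({comm_share:.1%} reserved for communication)")
print(f"implied baseline efficiency: {implied_baseline:.0%}")
print(f"equivalent GPUs: {equivalent_gpus(2048, 1.0, implied_baseline):.0f}")
print(f"A100s per H100 in the quoted conversion: {8000 / 4000:.0f}")
```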

DeepSeek's other innovations include auxiliary-loss-free load balancing; FP8 low-precision processing; promoting the intermediate results of high-precision matrix math from the tensor cores to the vector units on the CUDA cores to preserve a higher-precision representation; recomputing all RMSNorm operations during backpropagation; and recomputing all MLA up-projections.

Dylan Patel of SemiAnalysis, a well-known semiconductor analysis firm, has doubts about the costs disclosed by the DeepSeek team, but the firm also acknowledges DeepSeek's strengths.

SemiAnalysis said that DeepSeek-R1 achieves results comparable to OpenAI's o1, which was only released in September. How did DeepSeek catch up so quickly? Mainly because reasoning has become a new paradigm: compared with the previous one, it allows faster iteration and delivers meaningful gains with less compute. The previous paradigm relied on pre-training, which is becoming more and more expensive and makes steady gains hard to achieve.

They noted that the new paradigm focuses on building reasoning capabilities through synthetic data generation and reinforcement learning (RL) in post-training on existing models, which yields faster gains at a lower price. A low barrier to entry coupled with easy optimization meant that DeepSeek was able to replicate o1's methods faster than usual.

"R1 is a very good model, we do not dispute that, and it is objectively impressive that it caught up to the reasoning frontier so quickly," SemiAnalysis emphasized. They concluded:

On the one hand, DeepSeek-V3 uses multi-token prediction (MTP) at an unprecedented scale. These additional attention modules predict the next several tokens instead of a single token. They improve model performance during training and can be discarded during inference. This is an example of an algorithmic innovation that improves performance with less compute. There are additional considerations as well, such as the use of FP8 precision in training;

On the other hand, DeepSeek-V3 is also a mixture-of-experts model, that is, one large model composed of many smaller models that specialize in different things. One difficulty mixture-of-experts models face is determining which token goes to which sub-model, or "expert". DeepSeek implements a "gating network" that routes tokens to the appropriate expert in a balanced manner that does not hurt model performance. This makes routing very efficient: for each token, only a small number of parameters change during training relative to the overall size of the model. This improves training efficiency and also reduces inference costs;

Finally, R1 benefits greatly from having a strong base model (V3). Part of the answer lies in reinforcement learning (RL).

The reinforcement learning had two focuses: formatting (ensuring that the model provides coherent output) and helpfulness and harmlessness (ensuring that the model is useful). Reasoning capabilities emerged when the model was fine-tuned on synthetic datasets.

SemiAnalysis reiterated that MLA is the key DeepSeek innovation that significantly reduces the cost of inference. The reason is that MLA reduces the KV cache required per query by approximately 93.3% compared with standard attention. The KV cache is a memory mechanism in transformer models that stores data representing the context of the conversation, thereby reducing unnecessary computation.
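To get a feel for what a 93.3% reduction means in practice, the sketch below sizes a standard multi-head attention KV cache and then applies the quoted reduction. It does not model MLA's internals, and the layer count, head count, head dimension, context length, and 2-byte values are hypothetical numbers chosen for illustration, not DeepSeek-V3's actual configuration.

```python
def kv_cache_bytes(tokens, n_layers, n_heads, head_dim, bytes_per_value=2):
    """Size of a standard attention KV cache: one key and one value vector
    per token, per layer, per head."""
    return 2 * tokens * n_layers * n_heads * head_dim * bytes_per_value

MLA_REDUCTION = 0.933   # reduction quoted by SemiAnalysis

# Hypothetical model shape and context length, for illustration only.
baseline = kv_cache_bytes(tokens=32_768, n_layers=60, n_heads=128, head_dim=128)
reduced = baseline * (1 - MLA_REDUCTION)

GIB = 1024 ** 3
print(f"standard attention KV cache: {baseline / GIB:.1f} GiB")
print(f"after a 93.3% reduction:     {reduced / GIB:.1f} GiB")
```

Since the cache grows linearly with context length and with the number of concurrent queries, a reduction of this size translates directly into more requests served per GPU.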

Potential impact on Nvidia chips

As we mentioned earlier, Nvidia's stock plunged after DeepSeek became popular. If large US technology companies start to learn from DeepSeek and choose cheaper artificial intelligence solutions, this may put pressure on Nvidia.

Subsequently, Nvidia gave positive comments on DeepSeek's progress. The company said in a statement that DeepSeek's progress is a good demonstration of new ways of operating AI models. The company said that delivering such AI models to users requires a large number of Nvidia chips.

However, Cathie Wood, a well-known investor and CEO of ARK Invest, said in an interview that DeepSeek proved that success in the AI field does not require so much money and that it has accelerated the collapse of costs.

Sun Wei, chief analyst of artificial intelligence at Counterpoint Research, also said that Nvidia’s sell-off reflects people’s changing views on the development of artificial intelligence. She further noted: "DeepSeek's success challenges the belief that larger models and more powerful computing power lead to better performance, posing a threat to Nvidia's GPU-driven growth strategy."

SemiAnalysis emphasized that algorithms are improving so quickly that this, too, is bad for Nvidia and GPUs.

The US magazine Fortune even warned that DeepSeek is threatening Nvidia's AI dominance.

As mentioned earlier, DeepSeek has used lower-performance, cheaper chips to build its latest models, which has also put pressure on Nvidia, and some people worry that other large technology companies may reduce demand for Nvidia's more advanced products.

Kate Leaman, chief market analyst at AvaTrade, told Fortune: "Investors are concerned that DeepSeek's ability to work with weaker AI chips could undermine Nvidia's dominance in AI hardware, especially given that its valuation relies heavily on AI demand."

It is worth mentioning that, according to Tom's Hardware, DeepSeek's AI breakthrough bypasses Nvidia's standard CUDA programming for some functions and instead uses assembly-like PTX programming, which to some extent has added to concerns about Nvidia.

According to the report, Nvidia's PTX (Parallel Thread Execution) is an intermediate instruction set architecture that Nvidia designed for its GPUs. PTX sits between high-level GPU programming languages (such as CUDA C/C++ or other language front ends) and low-level machine code (streaming assembly, or SASS). PTX is a close-to-metal ISA that exposes the GPU as a data-parallel computing device and therefore allows fine-grained optimizations, such as register allocation and thread/warp-level tuning, that are not possible with CUDA C/C++ and other languages. Once PTX is translated into SASS, it is optimized for a specific generation of Nvidia GPUs.

When training the V3 model, DeepSeek reconfigured Nvidia's H800 GPUs: of the 132 streaming multiprocessors, it allocated 20 to server-to-server communication, possibly for compressing and decompressing data to overcome the processor's connectivity limitations and speed up transactions. To maximize performance, DeepSeek also implemented advanced pipelining algorithms, possibly by making extra-fine thread/warp-level adjustments.

The report pointed out that these modifications go far beyond standard CUDA-level development, but they are also very difficult to maintain.

However, Morningstar strategist Brian Colello said that DeepSeek's entry has undoubtedly added uncertainty to the entire artificial intelligence ecosystem, but that it has not changed the overwhelming momentum behind the movement. "We believe demand for AI GPUs continues to outpace supply," he wrote in a note. "So while slimmer models may be able to achieve more with the same number of chips, we still think tech companies will continue to buy all the GPUs they can as part of this AI gold rush."

Industry veterans like former Intel CEO Pat Gelsinger also believe applications like artificial intelligence can take advantage of all the computing power they have access to. As for DeepSeek's breakthrough, Gelsinger sees it as a way to add artificial intelligence to a plethora of cheap devices in the mass market.

SemiAnalysis revealed in its report that AWS GPU prices for the H100 have risen in many regions since the release of DeepSeek V3 and R1, and that H200s are similarly harder to find. "After the launch of V3, H100 prices skyrocketed as GPU monetization rates began to increase significantly. More intelligence at a lower price means more demand. This is a significant shift from the sluggish H100 spot prices of previous months," SemiAnalysis said.

So, how do you think DeepSeek will develop? Can Nvidia chips continue to dominate the world?