February 24: Last week, DeepSeek announced that this week would be Open Source Week, during which it would open source five software libraries in succession. At about 9:30 a.m. today, DeepSeek announced the first library of Open Source Week: FlashMLA, an efficient MLA decoding kernel optimized for Hopper GPUs.


On GitHub, the project received more than 5,000 stars and 188 forks (copies of the repository) within six hours of being open sourced. After hearing about the release of FlashMLA and the rapid growth in its star and fork counts, the CTO of a Hong Kong-listed company told Sina Technology: "It's too powerful."

An investor who focuses on AI hardware research and investment told Sina Technology after reviewing FlashMLA that the release is a major benefit for domestic GPUs. "The previous domestic GPU cards were very weak. Now we can use the optimization ideas and methodologies provided by FlashMLA to try to significantly improve the performance of domestic cards. Even if the architecture is different, it will be a matter of course for the inference performance of domestic cards to be improved later."


According to DeepSeek's official introduction, FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for serving variable-length sequences.

Across DeepSeek's technical roadmap, MLA (Multi-head Latent Attention) is one of the core technologies in the V2 and V3 models the company has released. It addresses bottlenecks in computing efficiency and memory usage, and can significantly improve training and inference efficiency while maintaining or even enhancing model performance.

Previously, Zheng Weimin, an academician of the Chinese Academy of Engineering and a professor in the Department of Computer Science at Tsinghua University, told Sina Technology: "DeepSeek's self-developed MLA architecture has played a key role in reducing its own model training costs." He pointed out: "By modifying the attention operator, MLA compresses the size of the KV cache, so that more KV cache can be stored in the same capacity. This architecture, combined with the modification of the FFN layer in the DeepSeek-V3 model, yields a very large sparse MoE layer, which is the most critical reason for DeepSeek's low training cost."
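To make the KV cache point concrete, the following rough sketch compares the per-token cache size of standard multi-head attention (MHA) with that of MLA. The dimensions used (128 heads of size 128, a 512-dimensional compressed latent plus a 64-dimensional positional key, 60 layers, 2 bytes per element) are assumptions based on the publicly described DeepSeek-V2 configuration, not figures from this report, and the function names are purely illustrative.

```python
# Rough KV cache size comparison: standard multi-head attention (MHA) vs. MLA.
# All dimensions below are assumptions for illustration, not figures from the article.

BYTES_PER_ELEMENT = 2  # assumes BF16/FP16 storage


def mha_cache_elems_per_token(num_heads: int, head_dim: int) -> int:
    """MHA caches a full key vector and a full value vector for every head."""
    return 2 * num_heads * head_dim


def mla_cache_elems_per_token(latent_dim: int, rope_dim: int) -> int:
    """MLA caches one compressed latent shared by all heads, plus a small positional key."""
    return latent_dim + rope_dim


def cache_gib(elems_per_token: int, num_layers: int, num_tokens: int) -> float:
    """Total cache size in GiB for one sequence of num_tokens tokens."""
    total_bytes = elems_per_token * num_layers * num_tokens * BYTES_PER_ELEMENT
    return total_bytes / 2**30


mha = mha_cache_elems_per_token(num_heads=128, head_dim=128)
mla = mla_cache_elems_per_token(latent_dim=512, rope_dim=64)

print(f"MHA elements per token per layer: {mha}")
print(f"MLA elements per token per layer: {mla}")
print(f"MLA / MHA ratio: {mla / mha:.4f}")
print(f"MHA cache, 32,768 tokens, 60 layers: {cache_gib(mha, 60, 32768):.2f} GiB")
print(f"MLA cache, 32,768 tokens, 60 layers: {cache_gib(mla, 60, 32768):.2f} GiB")
```

Under these assumed dimensions, the MLA cache is less than 2% of the size of the MHA cache, which is the sense in which more KV cache fits in the same capacity.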

This time, DeepSeek has directly open sourced its MLA decoding kernel, FlashMLA, which means the core low-level MLA code is now freely available. Developers can reuse the FlashMLA code base to complete the same task with fewer GPU servers, directly reducing the cost of inference. This is undoubtedly a great benefit for teams that hope to do low-level optimization and AI application development on top of DeepSeek's open source capabilities.

Interestingly, the MLA decoding kernel DeepSeek released this time is optimized mainly for Hopper GPUs. Generally speaking, "Hopper GPU" refers to the H-series products built on NVIDIA's Hopper architecture. NVIDIA has so far released several chips in this series, including the H100, H800 and H20.

According to DeepSeek, in benchmark tests on the NVIDIA H800 SXM5 GPU, FlashMLA reaches up to 3000 GB/s in memory-bound configurations and up to 580 TFLOPS in compute-bound configurations.
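One way to read these two figures together is a simple roofline-style estimate: a kernel can run no faster than the slower of its memory limit and its compute limit. The sketch below uses only the two peak numbers quoted by DeepSeek. The workload (3 GB read, 60 GFLOP of work) is a hypothetical example and does not describe any measured FlashMLA run.

```python
# Roofline-style estimate using the two peak figures quoted for FlashMLA on the H800 SXM5.
PEAK_BANDWIDTH_BYTES_PER_S = 3000e9  # 3000 GB/s, memory-bound figure quoted by DeepSeek
PEAK_COMPUTE_FLOPS_PER_S = 580e12    # 580 TFLOPS, compute-bound figure quoted by DeepSeek


def ridge_point() -> float:
    """Arithmetic intensity (FLOPs per byte) at which the two limits meet."""
    return PEAK_COMPUTE_FLOPS_PER_S / PEAK_BANDWIDTH_BYTES_PER_S


def min_kernel_time_s(flops: float, bytes_moved: float) -> float:
    """Lower bound on run time: the slower of the compute limit and the memory limit."""
    compute_time = flops / PEAK_COMPUTE_FLOPS_PER_S
    memory_time = bytes_moved / PEAK_BANDWIDTH_BYTES_PER_S
    return max(compute_time, memory_time)


def is_memory_bound(flops: float, bytes_moved: float) -> bool:
    """True when the workload's arithmetic intensity is below the ridge point."""
    return flops / bytes_moved < ridge_point()


# Hypothetical decode step: read 3 GB of KV cache and perform 60 GFLOP of work.
flops, bytes_moved = 60e9, 3e9
print(f"ridge point: {ridge_point():.1f} FLOPs per byte")
print(f"memory bound: {is_memory_bound(flops, bytes_moved)}")
print(f"minimum time: {min_kernel_time_s(flops, bytes_moved) * 1e3:.3f} ms")
```

In this hypothetical case the work per byte is far below the ridge point, so the step is limited by memory bandwidth rather than by compute, which is typical of decoding workloads that repeatedly read a large KV cache.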


Public information shows that U.S. export control regulations keep the H800's chip-to-chip interconnect bandwidth below a 600 GB/s threshold, which is lower than that of some flagship products. That limit concerns data transfer between chips rather than the GPU's on-board memory, so the 3000 GB/s figure is best read against the H800's memory bandwidth, not against the 600 GB/s limit. On that reading, FlashMLA pushes memory bandwidth utilization on the H800 close to the hardware's theoretical upper limit, allowing the development community to fully "squeeze" the capabilities of NVIDIA H-series chips, achieve stronger model performance with fewer chips, and maximize the value of the GPU.

An investor who focuses on AI hardware research and investment said after reviewing FlashMLA: "FlashMLA is an optimization solution that can make LLMs run faster and more efficiently on the H800. It is especially suitable for high-performance AI tasks. Its core is to accelerate the decoding process of large language models and improve the model's response speed and throughput. This is very important for real-time generation tasks such as chatbots. It will greatly improve the capabilities and user experience of large models, and the speed will be significantly improved."

Although FlashMLA is a code library optimized for Hopper GPUs, its release also benefits domestic GPUs. After reviewing FlashMLA, the same investor called it a major benefit for domestic GPUs. "The previous domestic GPU cards were very weak. Now we can use the optimization ideas and methodologies provided by FlashMLA to try to significantly improve the performance of domestic cards. Even if the architecture is different, it will be a matter of course for the inference performance of domestic cards to be improved later."