On the evening of March 16, Tesla CEO Elon Musk posted on social media praising the latest technical work from the Kimi team at the Chinese artificial intelligence company Moonshot AI, calling it "impressive" and drawing public attention to the cutting-edge research behind this Chinese-developed large model.

With the technical paper released at the same time, first author Chen Guangyu drew attention across the Internet: the core author turned out to be a 17-year-old high school student from Shenzhen, Guangdong.
According to the paper's appendix, Chen Guangyu, Zhang Yu, and Su Jianlin are all listed as co-first authors with equal contributions; the remaining 34 authors are not given this designation.
Among them, Zhang Yu is a core developer of Kimi's efficient model architecture, and Su Jianlin is the original proposer of rotary position embedding (RoPE).
It is worth mentioning that Chen Guangyu has been deeply involved in AI for only about a year. Early on, he quickly built up foundational knowledge and hands-on skills by independently studying cutting-edge papers and following open-source projects on GitHub.
Last summer, he completed a seven-week internship in San Francisco; after returning to China, he joined the Kimi team as an intern in November.
After the paper was released, Chen Guangyu posted a reflection on WeChat Moments, specifically crediting the three equally contributing authors as well as the team colleagues responsible for model scaling and infrastructure. He responded modestly: "This is the team's work; no one here is a god."
According to reports, the technical report released by the Kimi team proposes a new Attention Residuals mechanism, fundamentally rethinking the traditional residual connection that has underpinned deep learning for nearly a decade.
In effect, Kimi's innovation installs an "intelligent filter" inside the model: it migrates the Transformer's attention mechanism to the depth dimension, letting each layer dynamically attend to and select useful information from all preceding layers' outputs, reducing redundancy and improving the efficiency of information flow through the network.
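The report's exact formulation is not reproduced in this article, so the following is only a minimal, hypothetical PyTorch sketch of what "attention over the depth dimension" could look like: the plain identity skip in h + f(h) is replaced by a learned, per-token attention mixture over the outputs of all previous layers. The class names, the single-query dot-product form, and the feed-forward block are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DepthAttentionResidual(nn.Module):
    """Toy sketch of a depth-wise 'attention residual'.

    Instead of the standard residual h_{l+1} = h_l + f(h_l), each layer
    builds its skip input by attending over the outputs of ALL previous
    layers, so the model can dynamically filter which earlier features
    to carry forward. (Hypothetical formulation; the actual Kimi report
    may define this differently.)
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)  # query from current state
        self.k_proj = nn.Linear(dim, dim, bias=False)  # keys from layer history
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, history: list) -> torch.Tensor:
        # history: list of L tensors, each (batch, seq, dim) - prior layer outputs
        h = torch.stack(history, dim=2)            # (batch, seq, L, dim)
        q = self.q_proj(x).unsqueeze(2)            # (batch, seq, 1, dim)
        k = self.k_proj(h)                         # (batch, seq, L, dim)
        # attention over the DEPTH axis (layers), not the sequence axis
        scores = (q * k).sum(-1) * self.scale      # (batch, seq, L)
        weights = scores.softmax(dim=-1)           # per-token mixture over layers
        skip = (weights.unsqueeze(-1) * h).sum(2)  # (batch, seq, dim)
        return skip                                # replaces the plain identity skip


class ToyBlock(nn.Module):
    """A transformer-style block whose skip path is the depth attention above."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.depth_attn = DepthAttentionResidual(dim)

    def forward(self, x: torch.Tensor, history: list) -> torch.Tensor:
        skip = self.depth_attn(x, history)         # dynamically filtered skip
        return skip + self.ffn(self.norm(x))       # instead of x + ffn(norm(x))


if __name__ == "__main__":
    dim, depth = 64, 4
    blocks = nn.ModuleList(ToyBlock(dim) for _ in range(depth))
    x = torch.randn(2, 16, dim)                    # (batch, seq, dim)
    history = [x]                                  # layer-0 "output" = embeddings
    for block in blocks:
        x = block(x, history)
        history.append(x)
    print(x.shape)                                 # torch.Size([2, 16, 64])
```

In this toy version, the softmax weights act as the "intelligent filter" described above: a layer that finds an early feature useful can assign it high weight, while redundant intermediate features can be attenuated rather than accumulated, which is the intuition behind reducing redundancy in the residual stream.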