The first public research results from Musk's xAI are here! One of the co-authors is Greg Yang, a founding member of xAI and a student of Shing-Tung Yau. Yang has previously said publicly that his research directions at xAI are "Math for AI" and "AI for Math". A key part of this is continuing his earlier line of work, Tensor Programs, a unified programming language for describing neural network architectures; related results have already been applied in GPT-4.
This new paper belongs to this series and focuses on "how to train infinitely deep networks".
Yang also hosted a live broadcast about the paper on X.
Let's take a look at the highlights.
Training infinitely deep neural networks
Simply put, the paper studies how residual networks (ResNets) scale in the depth direction.
Residual networks were introduced to address the performance degradation that deep convolutional neural networks suffer as depth increases. But as networks keep getting deeper, training a good deep residual network is still not easy:
As the network deepens, the magnitude of the features keeps growing, which makes the network unstable; and after deepening the network, the hyperparameters have to be retuned, which takes a lot of work...
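The growth in feature size can be seen in a toy experiment. The sketch below is not from the paper: it assumes a purely linear residual network with fresh Gaussian weights of variance 1/width in every layer, and the function name `residual_norm_growth` is made up for this illustration. In this toy, the squared feature norm doubles per layer in expectation when the branch is unscaled, while scaling each branch by 1/sqrt(L) (the depth-dependent coefficient described later in this article, with a = 1) keeps it bounded.

```python
import math
import random


def residual_norm_growth(depth, width=32, branch_scale=1.0, seed=0):
    """Return ||x_L|| / ||x_0|| for a random linear residual network.

    Each layer computes x <- x + branch_scale * (W @ x), where W has
    i.i.d. Gaussian entries with variance 1/width (an assumption made
    for this toy example, not a detail taken from the paper).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = math.sqrt(sum(v * v for v in x))
    std = 1.0 / math.sqrt(width)
    for _ in range(depth):
        # Residual branch: a fresh random matrix applied to the features.
        branch = [
            sum(rng.gauss(0.0, std) * v for v in x)
            for _ in range(width)
        ]
        # Skip connection plus the (optionally scaled) branch output.
        x = [v + branch_scale * b for v, b in zip(x, branch)]
    return math.sqrt(sum(v * v for v in x)) / start


for depth in (4, 16, 64):
    plain = residual_norm_growth(depth)
    scaled = residual_norm_growth(depth, branch_scale=1.0 / math.sqrt(depth))
    print(f"L={depth:3d}  unscaled: {plain:.3e}  scaled by 1/sqrt(L): {scaled:.3f}")
```

The exact numbers depend on the seed and width, but the unscaled ratio grows exponentially with depth while the scaled one stays of order one.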
The idea of Yang and his co-authors is to find a depth parameterization that both learns features and allows hyperparameter transfer.
They started from the two limiting behaviors of infinitely wide neural networks: such a network is either a kernel machine or a feature learner. For the latter, the optimal hyperparameters do not change with width.
Here, they use the Tensor Programs framework to analyze the infinite-width limit.
As mentioned earlier, Tensor Programs is a long-term research goal of Yang's: using mathematical language to build a low-level programming language that can describe and analyze neural network architectures.
Specifically, Tensor Programs are built from matrix multiplications and activation functions. Yang found that if a neural network computation can be expressed in this language, its behavior at initialization can be analyzed automatically and completely.
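To make the idea of "a program made of matrix multiplications and activation functions" concrete, here is a loose illustration in Python. It is not the formal Tensor Programs language: the tuple encoding, the instruction names `"matmul"` and `"nonlin"`, and the helper `run_program` are all assumptions invented for this sketch. It only shows how an MLP forward pass can be written as a straight-line sequence of those two kinds of steps.

```python
import math
import random


def run_program(program, matrices, vectors):
    """Execute a list of instructions over named vectors.

    Two instruction kinds exist in this toy (hypothetical) encoding:
      ("matmul", out, W, x): env[out] = matrices[W] @ env[x]
      ("nonlin", out, f, x): env[out] = f applied coordinate-wise to env[x]
    """
    env = dict(vectors)
    for kind, out, op, arg in program:
        if kind == "matmul":
            env[out] = [
                sum(w * v for w, v in zip(row, env[arg]))
                for row in matrices[op]
            ]
        elif kind == "nonlin":
            env[out] = [op(v) for v in env[arg]]
        else:
            raise ValueError(f"unknown instruction: {kind}")
    return env


def random_matrix(n, rng):
    # i.i.d. Gaussian entries with variance 1/n (a common initialization,
    # chosen here as an assumption).
    std = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]


n = 64
rng = random.Random(0)
matrices = {"W1": random_matrix(n, rng), "W2": random_matrix(n, rng)}
vectors = {"x": [rng.gauss(0.0, 1.0) for _ in range(n)]}

# A two-layer MLP body written as a straight-line program.
mlp = [
    ("matmul", "h1", "W1", "x"),
    ("nonlin", "a1", math.tanh, "h1"),
    ("matmul", "h2", "W2", "a1"),
    ("nonlin", "a2", math.tanh, "h2"),
]

env = run_program(mlp, matrices, vectors)
print(sorted(env))
```

Once a network is written down as such a sequence, every intermediate vector has a name and a known recipe, which is the kind of structure a systematic analysis can work with.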
We will not go through the mathematical derivations here; a quick glance at the paper is enough to get a feel for the style...
Building on these derivations, the authors propose the Depth-μP method, which enables hyperparameter transfer in the depth direction and greatly simplifies hyperparameter tuning across depths.
Depth-μP consists of the following points:
Each residual branch is multiplied by a coefficient a/sqrt(L), inversely proportional to the square root of the depth L.
The learning rate of each weight matrix scales with the depth L in a way that depends on the optimizer. For SGD, the learning rate is a constant η, independent of depth; for adaptive optimizers such as Adam, the learning rate is η/sqrt(L).
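The two points above can be written down as a small helper. This is a minimal sketch of the scaling rule exactly as stated in this article, not code from the paper; the function name `depth_mup_hparams` and its default values are hypothetical, while `a`, `eta` and `depth` correspond to a, η and L in the text.

```python
import math


def depth_mup_hparams(depth, a=1.0, eta=0.1, optimizer="sgd"):
    """Return (branch_multiplier, learning_rate) for `depth` residual blocks.

    Implements the rule as summarized in the article:
      branch multiplier = a / sqrt(L)
      learning rate     = eta            for SGD
                        = eta / sqrt(L)  for Adam-style adaptive optimizers
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    branch_multiplier = a / math.sqrt(depth)
    if optimizer == "sgd":
        learning_rate = eta
    elif optimizer == "adam":
        learning_rate = eta / math.sqrt(depth)
    else:
        raise ValueError(f"unsupported optimizer: {optimizer}")
    return branch_multiplier, learning_rate


# The base constants a and eta are tuned once; only depth changes below.
for depth in (4, 16, 64):
    mult, lr = depth_mup_hparams(depth, a=1.0, eta=0.1, optimizer="adam")
    print(f"L={depth:3d}  branch multiplier={mult:.4f}  Adam lr={lr:.4f}")
```

The point of the rule is that a and η can be tuned on a shallow network and then reused as the depth grows, with only the explicit depth factors changing.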
Notably, the authors find that when each residual block has depth 1, Depth-μP is the optimal depth parameterization: it ensures that the hyperparameters converge as depth increases, which makes hyperparameter transfer in the depth direction possible.
However, when the residual block depth is ≥ 2, hyperparameter transfer can still fail and training performance can still degrade.
In addition, the paper explores the notion of "feature diversity" and argues that it plays a key role in deep networks.
Another co-author of the paper is Dingli Yu from Princeton. He graduated from the Yao Class of Tsinghua University and is currently pursuing a Ph.D. in the Department of Computer Science at Princeton.
What did Yang say during the live broadcast?
During the live broadcast, Yang also answered questions from the audience. Qubit has compiled some of them below without changing the original meaning.
Q: For many of us, (the content of the paper) may be beyond our understanding. But I want to know, how is the model you mentioned different from the ChatGPT and OpenAI technologies we can experience? What are the significant differences or innovations between this paper and OpenAI’s results?
Yang: Let me make a brief comment. These properties are not directly related to practical applications at the moment; the work is more of a research nature.
Of course, the ultimate goal of all this is to make models better and safer, and in turn to benefit mankind. What we are doing now is describing the expected effect, which does not necessarily have a direct impact.
We're all in the same boat now, and we're doing what we can, whether it's short-term work or long-term applied research, to make it work for everyone.
Q: It sounds like you're building an artificial computer brain capable of reasoning, so is that what you're working on? Also, I am a mother and my 7-year-old son is very interested in mathematics. Do you have any suggestions for him to continue to be interested and enthusiastic about the field of AI?
Yang: "Neural network" here refers to artificial neural networks. I think they are the backbone of many modern technologies, including Google, Facebook, Instagram and other services you use every day; artificial neural networks run underneath all of them. These networks were inspired by real neural networks in animals and humans some sixty or seventy years ago, but they have since diverged from real neuroscience.
These networks are, at bottom, mathematical objects, so once we have a handle on the new mathematics they raise, we can gain a deep understanding of them through extensive analysis.
Although we don’t yet know how neurons are actually connected, through mathematical research we can optimize these artificial neural networks to help technology companies improve people’s lives.
Regarding your second question, it's great to hear that your son is very interested in math. This is the foundation for creating great things in technology and improving everyone's lives.
The advice I would like to give is that it is very important that you maintain your son's passion for math first. Once you lose this love, it will be difficult to continue learning.
Also pay attention to the things he likes, so that you can make learning fun and further stimulate his interest. At the same time, cultivate his curiosity about how things work and try to develop a scientific way of thinking, for example by taking things apart and trying to understand how they work. Research should be driven by curiosity.
If a person loses his passion for exploring the mathematical truths of the universe, it may be difficult to have the motivation to move forward. Overall, I recommend that you develop in your son a strong interest and curiosity about the world, especially about the nature of mathematics and science.
Q: I have a more abstract question. You had the idea that depth approaches infinity, and then you wrote this paper based on that idea. So have you considered using different architectures of neural networks? Not a standard architecture with neurons and countless layers, but something completely different. Like these neurons are connected in a completely different way, maybe some kind of square shape?
Yang: In fact, the insights into nonlinearity and the number of layers in our work are only very preliminary. There are certainly many open questions about what an appropriate structure is, or what a structure should look like.
The Meta team has previously studied what happens when neurons are randomly connected and obtained some interesting results. So, there’s definitely a lot to do here. Now I really don't have a concrete answer as to what would be the correct or better structure.
About Greg Yang
Greg Yang was born in Hunan Province. After finishing elementary school he moved to the United States, and he later studied under Professor Shing-Tung Yau at Harvard.
△ Greg Yang and Shing-Tung Yau. Image source: Greg Yang's Twitter
In 2017, Yang graduated from Harvard and then joined Microsoft on the recommendation of Harry Shum (Shen Xiangyang).
At Microsoft, Yang was highly praised by Shum. A few months ago, at a forum called "Basic Science and Artificial Intelligence", Shum said publicly:
Microsoft Research usually only hires people with doctorates, but Greg Yang joined straight out of his undergraduate degree. Not only did he get into Microsoft Research, he has also done extremely well over the past five years, in particular making a decisive contribution to the development of GPT.
It is worth mentioning that Yang himself has acknowledged that GPT-4 used his μTransfer method (part of the Tensor Programs series).
Yang began working on Tensor Programs very early. He published "Tensor Programs I" in 2019 and continued to explore the topic in depth while at Microsoft. In his view, almost any computation in deep learning can be expressed as a Tensor Program.
In July this year, Musk announced the founding of a new company, xAI. Yang left Microsoft to join xAI's founding team as a mathematician.
Since joining xAI, Yang has said more than once that the long-term goal of the Tensor Programs project is to develop a "theory of everything" for large-scale deep learning, that is, to find theoretical rules that make it possible to truly understand the behavior of large AI models.
He also said:
AI will enable everyone to understand our mathematical universe in ways previously unimaginable.
Paper link: https://arxiv.org/abs/2310.02244