In the past two days, AI has once again made headlines in major media. On December 6, Google officially announced Gemini, its new multimodal large model, which comes in three versions. According to Google's benchmark results, Gemini Ultra showed "state-of-the-art performance" in many tests and even beat OpenAI's GPT-4 in most of them.
While Gemini stole the show, Google also dropped another blockbuster: TPUv5p, a new in-house chip and its most powerful TPU to date.
According to official data, each TPUv5p pod combines 8,960 chips in a three-dimensional torus topology, linked by Google's highest-bandwidth inter-chip interconnect (ICI) at 4,800Gbps per chip. Compared with TPUv4, TPUv5p doubles the FLOPS and triples the high-bandwidth memory (HBM).
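To make the three-dimensional ring (torus) topology concrete, here is a minimal Python sketch of how direct neighbors can be computed in such a layout. The 16 x 20 x 28 grid shape is an assumption, chosen only because it multiplies out to 8,960 chips. The function name `torus_neighbors` is hypothetical, and this is not Google's routing code.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]

# Assumed grid shape: 16 * 20 * 28 = 8,960 chips (illustrative only).
DIMS: Coord = (16, 20, 28)


def torus_neighbors(coord: Coord, dims: Coord = DIMS) -> List[Coord]:
    """Return the six direct neighbors of a chip in a 3D torus.

    Each axis wraps around, so a chip on the edge of the grid is linked
    to the chip on the opposite edge along that axis.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + step) % dims[axis]  # wraparound link
            neighbors.append(tuple(nxt))
    return neighbors


total_chips = DIMS[0] * DIMS[1] * DIMS[2]
corner = torus_neighbors((0, 0, 0))
print(total_chips)
print(corner)
```

Because of the wraparound links, even a chip at the corner of the grid has six neighbors, so no chip sits at a dead end of the network.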
In addition, TPUv5p trains large language models (LLMs) 2.8 times faster than the previous-generation TPUv4 and, using second-generation SparseCores, trains embedding-dense models 1.9 times faster. TPUv5p is also 4 times more scalable than TPUv4 in terms of total available FLOPS per pod: it delivers twice the floating-point operations per second (FLOPS) per chip and puts twice as many chips in a single pod, which greatly improves training speed.
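The 4x pod-level figure follows from multiplying the per-chip gain by the increase in pod size. A rough Python check, assuming a TPUv4 pod of 4,096 chips (the figure quoted later in this article) and treating the per-chip FLOPS gain as exactly 2x:

```python
# Figures taken from the article; the variable names are illustrative.
V4_CHIPS_PER_POD = 4096      # TPUv4 pod size
V5P_CHIPS_PER_POD = 8960     # TPUv5p pod size
FLOPS_PER_CHIP_RATIO = 2.0   # TPUv5p vs. TPUv4 per chip (assumed exact)

chip_count_ratio = V5P_CHIPS_PER_POD / V4_CHIPS_PER_POD
pod_flops_ratio = FLOPS_PER_CHIP_RATIO * chip_count_ratio

print(f"chips per pod: {chip_count_ratio:.2f}x")
print(f"total FLOPS per pod: {pod_flops_ratio:.2f}x")
```

Under these assumptions the pod-level ratio comes out slightly above 4, which is consistent with the rounded "4 times" claim.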
Google also invited several scientists to endorse TPUv5p’s AI performance:
Erik Nijkamp, senior research scientist at Salesforce, said: “We have been leveraging Google Cloud TPUv5p to pre-train Salesforce’s base models that will serve as the core engine for professional production use cases, and we are seeing significant improvements in training speed. In fact, Cloud TPUv5p has 2x the compute power of the previous generation TPUv4. We also love the seamless and easy transition from Cloud TPUv4 to v5p using JAX and are excited to optimize our models for further speed with the Accurate Quantized Training (AQT) library.”
Dr. Yoav HaCohen, who leads the core generative AI research team at Lightricks, said: "Leveraging the superior performance and ample memory of the Google Cloud TPUv5p, we successfully trained a text-to-video generative model without splitting it into separate processes. This optimal hardware utilization greatly accelerated each training cycle, allowing us to quickly launch a series of experiments. The ability to quickly train the model in each experiment promotes rapid iteration, which is a valuable advantage for our research team in the highly competitive field of generative AI."
Jeff Dean, chief scientist of Google DeepMind and Google Research, also backed the company's own chip: "In early use, Google DeepMind and Google Research observed a 2x speedup in LLM training workloads on TPUv5p chips compared with the TPUv4 generation. For ML framework (JAX) "
For Google, Gemini is a powerful weapon against OpenAI, and TPUv5p is a stepping stone it can use to build a high wall against Nvidia GPUs. With both software and hardware in hand, Google appears to be in an unbeatable position in the AI era.
The question is, why does Google have the advantage it currently has?
From obscurity to worldwide fame
Google's TPU was not built overnight. Its in-house development journey began ten years ago.
As a technology company, Google actually considered building an application-specific integrated circuit (ASIC) for neural networks as early as 2006. By 2013, however, the situation had become urgent. Google scientists began to realize that the rapidly growing computing needs of neural networks could not be reconciled with the number of data centers available.
Jeff Dean, then head of Google AI, calculated that if 100 million Android users used the phone's voice-to-text service for 3 minutes a day, the computing power consumed would be twice the total of all Google data centers, and there were far more than 100 million Android users worldwide.
Data centers cannot expand indefinitely, and Google cannot limit how long users spend on its services. Yet neither CPUs nor GPUs could meet Google's needs: a CPU can only handle a relatively small number of tasks at a time, while a GPU is less efficient at a single task and can handle a narrower range of tasks. In-house development became the last resort.
Google set itself an ambitious goal: to build a domain-specific architecture for machine learning and cut the total cost of ownership (TCO) of deep neural network inference to one-tenth of its original value.
Usually, the development of an ASIC takes several years, but Google completed the design, verification, manufacturing and data-center deployment of the TPU in only 15 months. Norm Jouppi, the technical lead of the TPU project (and one of the main architects of the MIPS processor), described the sprint phase this way:
"We're designing chips very fast. It's really remarkable. We're shipping the first chip without fixing bugs or changing masks. It's all very hectic considering we're building the chip while still hiring a team, then hiring RTL (circuit design) people, and rushing to hire design verification people."
The first-generation TPU, the culmination of Google's engineering effort, was manufactured on a 28-nanometer process, with a clock frequency of 700MHz and a power consumption of 40W. Google packaged the processor as an external accelerator card that fits into a SATA hard drive slot for plug-and-play installation. The TPU connects to the host over a PCIe Gen3 x16 bus, which provides an effective bandwidth of 12.5GB/s.
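For context, the 12.5GB/s effective figure can be compared with the theoretical limit of the link. The sketch below uses the standard PCIe Gen3 parameters (8 GT/s per lane, 128b/130b encoding). The constant names are illustrative, and the utilization it prints is derived from the article's number rather than from any measurement.

```python
# PCIe Gen3 link parameters.
GT_PER_SEC_PER_LANE = 8e9          # 8 gigatransfers per second per lane
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding
LANES = 16

# Each transfer carries one bit per lane; divide by 8 to convert to bytes.
theoretical_gb_per_s = (
    GT_PER_SEC_PER_LANE * ENCODING_EFFICIENCY * LANES / 8 / 1e9
)

EFFECTIVE_GB_PER_S = 12.5          # effective bandwidth quoted in the article
utilization = EFFECTIVE_GB_PER_S / theoretical_gb_per_s

print(f"theoretical: {theoretical_gb_per_s:.2f} GB/s")
print(f"effective / theoretical: {utilization:.0%}")
```

The gap between the two numbers is plausibly protocol overhead such as packet headers, though the article does not break it down.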
Compared with CPUs and GPUs, the single-threaded TPU has no complex microarchitectural features. Minimalism is the advantage of domain-specific processors. Google's TPU can only run one kind of task at a time, neural network prediction, but its performance per watt reaches 30 times that of a GPU and 80 times that of a CPU.
Google kept quiet about the project. It was not until the 2016 Google I/O developer conference that CEO Sundar Pichai officially showed the world the results of its in-house TPU work.
Pichai told the audience that the TPU in the underlying hardware was indispensable to AlphaGo, developed by DeepMind, defeating Korean Go player Lee Sedol. TPU is like Helen, the woman who triggered the Trojan War in Greek mythology: its appearance set "thousands of chips" competing with it.
But Google did not stop there. Almost as soon as the first-generation TPU was completed, it began developing the next generation: in 2017, TPUv2 came out; in 2018, TPUv3 was launched; in 2021, TPUv4 was unveiled at the Google I/O developer conference...
Google has also grown more and more adept with AI chips: the first-generation TPU only supports 8-bit integer operations, which means it can perform inference but training is out of reach; TPUv2 not only introduces HBM memory but also supports floating-point operations, and so supports both training and inference for machine learning models; TPUv3 focuses on improving performance over the previous generation and quadruples the number of chips deployed in a Pod.
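To illustrate why 8-bit integer arithmetic is enough for inference, the following Python sketch quantizes weights and activations to the int8 range, performs the multiply-accumulate in integers, and rescales the result. It uses simple symmetric per-tensor quantization, which is an assumption for illustration and not necessarily the scheme the first-generation TPU used. All function names are hypothetical.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to the int8 range."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid division by zero
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(weights: List[float], activations: List[float]) -> float:
    """Dot product computed with integer multiply-accumulate."""
    q_w, scale_w = quantize(weights)
    q_a, scale_a = quantize(activations)
    accumulator = sum(w * a for w, a in zip(q_w, q_a))  # integer arithmetic
    return accumulator * scale_w * scale_a  # rescale back to a float


weights = [0.12, -0.5, 0.33, 0.9]
activations = [1.0, 0.25, -0.75, 0.5]

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(weights, activations)
print(exact, approx)
```

The integer result stays close to the floating-point one, which is acceptable for prediction. Training typically needs a wider dynamic range to represent small gradient updates, which is one reason floating-point support matters there.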
When it comes to TPUv4, Pichai proudly said: "The progress of AI technology depends on the support of computing infrastructure, and TPU is an important part of Google's computing infrastructure. The new-generation TPUv4 chip is more than twice as fast as v3. Google uses TPU clusters to build Pod supercomputers. A single TPUv4 Pod contains 4,096 v4 chips, and the inter-chip interconnect bandwidth of each Pod is 10 times that of other interconnect technologies. As a result, a TPUv4 Pod can reach 1 exaFLOP of computing power, that is, 10 to the 18th power floating-point operations per second, equivalent to the combined computing power of 10 million laptops."
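The figures in the quote can be sanity-checked with a few lines of Python. The per-chip and per-laptop numbers below are simply implied by dividing the quoted totals; they are not official specifications.

```python
POD_FLOPS = 1e18          # 1 exaFLOP per second, as quoted
CHIPS_PER_POD = 4096      # TPUv4 chips in one Pod
LAPTOPS = 10_000_000      # "10 million laptops"

per_chip_tflops = POD_FLOPS / CHIPS_PER_POD / 1e12
per_laptop_gflops = POD_FLOPS / LAPTOPS / 1e9

print(f"implied per chip: {per_chip_tflops:.0f} TFLOPS")
print(f"implied per laptop: {per_laptop_gflops:.0f} GFLOPS")
```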
Today, in 2023, TPU has become synonymous with AI chips and another important class of processor after the CPU and GPU. It is deployed in dozens of Google data centers and completes hundreds of millions of AI computing tasks every day.
Google’s in-house chip empire
TPU is just the prelude to Google’s in-house chip efforts.
At the Google Cloud Next '17 conference in 2017, Google launched a custom security chip called Titan, which is designed for hardware-level cloud security. It achieves more secure identification and authentication by establishing an encrypted identity for specific hardware, thus preventing increasingly rampant BIOS attacks.
The Titan chip is not only for Google's own benefit. It exists to convince enterprises that data stored in Google Cloud is more secure than in their own on-premises data centers. Google said that the Titan chip verifies system firmware and software components by establishing a strong hardware-based system identity, protecting the boot process. All of this relies on hardware logic created by Google itself, which fundamentally reduces the possibility of hardware backdoors. The Titan-based ecosystem also ensures that facilities only run authorized and verifiable code, ultimately making Google Cloud more secure and reliable than on-premises data centers.
Titan was only a warm-up. In March 2021, at the ASPLOS conference, Google introduced for the first time an in-house chip for YouTube servers, the Argos VCU. Its task is simple: transcoding videos uploaded by users.
According to statistics, users upload more than 500 hours of video content in various formats to YouTube every minute, and Google needs to quickly convert this content into multiple resolutions (including 144p, 240p, 360p, 480p, 720p, 1080p, 1440p, 2160p and 4320p) and various formats (such as H.264, VP9 or AV1). Without a chip with powerful encoding capabilities, transcoding at that speed is impossible.
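A back-of-the-envelope Python estimate shows the scale of the problem. It assumes, purely for illustration, that every upload is transcoded into every listed resolution and format. In practice the set of outputs depends on the source video, so treat the result as an upper bound.

```python
UPLOAD_HOURS_PER_MINUTE = 500

RESOLUTIONS = ["144p", "240p", "360p", "480p", "720p",
               "1080p", "1440p", "2160p", "4320p"]
FORMATS = ["H.264", "VP9", "AV1"]

# Hours of source video arriving per day.
source_hours_per_day = UPLOAD_HOURS_PER_MINUTE * 60 * 24

# Assumed worst case: one output per (resolution, format) pair.
variants = len(RESOLUTIONS) * len(FORMATS)
output_hours_per_day = source_hours_per_day * variants

print(source_hours_per_day, variants, output_hours_per_day)
```

Even before counting encoding cost per hour of video, this is hundreds of thousands of hours of new source material per day, each potentially fanned out into dozens of outputs.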
Google tried two solutions. The first was Intel's Visual Computing Accelerator (VCA), which contains three Xeon E3 CPUs with built-in Iris Pro P6300/P580 GT4e integrated graphics cores and advanced hardware encoders. The second used Intel Xeon processors plus software encoding to complete the task.
But both approaches required enormous server fleets and enormous power consumption. Google therefore began developing another in-house chip, the VCU. Scott Silver, the Google vice president of engineering who oversees YouTube's huge infrastructure, said that starting in 2015, a team of about 100 Google engineers devoted themselves to designing the first-generation Argos chip. In the following years, the team not only completed development but also deployed the chip in Google's data centers, where Argos proved its strength: it processes video 20 to 33 times more efficiently than traditional servers, and the time to process high-resolution 4K video dropped from days to hours.
The next generation of Argos may have already been quietly launched on Google servers. According to reports, Google's self-developed second-generation VCU will support AV1, H.264 and VP9 codecs, which can further improve the efficiency of its encoding technology and will also be the most powerful support for YouTube's content creation ecosystem.
Google's boldest move, though, was into the most complex chip of all, the mobile phone SoC. On October 19, 2021, at its fall launch event, the flagship Pixel 6 series made its debut, equipped with Tensor, Google's first in-house chip.
Google Senior Vice President Rick Osterloh said at the event that the chip is "the biggest mobile hardware innovation in the company's history", and Google CEO Sundar Pichai even posted a photo of the Tensor chip on Twitter ahead of the launch, showing his pride in the in-house project.
However, this in-house chip is essentially based on the semi-custom chip design service that Samsung opened up in 2020. In the TechInsights teardown, the Tensor package measures 10.38 mm x 10.43 mm = 108.26 mm², and the die inside is labeled "S5P9845", which follows the traditional Samsung Exynos naming scheme. For example, the Exynos 990 is marked S5E9830, and the Exynos 2100 5G SoC is marked S5E9840. It is essentially a chip defined by Google and designed and manufactured by Samsung.
Even so, the layout of Google's self-developed chips has begun to take shape. From TPU to Titan, from VCU to Tensor, Google has gone through a ten-year journey, and its ambition to fully master this empire of silicon chips is also clear.
Google’s shrewd calculations and stumbling blocks
Google has the money, the technology, and the application scenarios. It can be said to be the furthest along the road of in-house AI chips among the major technology giants. While other companies are still pouring money into Nvidia's accounts, Google has prepared on both fronts. Many people even regard it as the strongest challenger to Nvidia's current monopoly.
Compared with Microsoft and Amazon, Google's most prominent advantage is that it designs the TPU from a system-level perspective. A single chip is important, but how chips are combined into real-world systems matters even more. Although Nvidia also thinks from a systems perspective, its systems are smaller and narrower in scope than Google's. Google also uses a custom network stack, ICI, between TPUs. Similar to Nvidia's NVLink, this link offers low latency and high performance compared with expensive Ethernet and InfiniBand deployments.
In fact, Google's TPUv2 can scale to 256 TPU chips, the same number as Nvidia's H100 GPU. In TPUv3 and TPUv4, this number rises to 1,024 and 4,096 respectively. Continuing that trend, the latest TPUv5p can scale to 8,960 chips without going through inefficient Ethernet.
In addition, Google has unique advantages in optical circuit switching (OCS), topology, and deep learning recommendation model (DLRM) optimization. The experience accumulated over the past ten years has helped Google's TPU shine in data centers and large AI models. In specific applications, it is no exaggeration to call it far ahead. In the future, it is not impossible for Google to free itself completely from the constraints of Nvidia GPUs.
However, Google still has a minor stumbling block.
In-house TPU development began in 2013, and the chip was deployed to data centers within 15 months, followed by rapid generation-over-generation performance gains. Google researchers worked around the clock, but the help of another company was also extremely important.
According to a 2020 report by J.P. Morgan analyst Harlan Sur, Google's TPUv1 through v4 were co-designed with Broadcom. At that time, Broadcom had already begun producing TPUv4 on a 7nm process and had started working with Google on the design of TPUv5 on a 5nm process.
Sur said that Broadcom's application-specific integrated circuit (ASIC) business's full-year revenue in 2020 was US$750 million, up from US$50 million in 2016. In addition to chip design, Broadcom also provided key intellectual property rights to Google and was responsible for steps such as manufacturing, testing and packaging new chips to supply Google's new data centers. Broadcom also cooperates with other customers such as Facebook, Microsoft and AT&T to design ASIC chips.
According to Broadcom's 2022 financial report, it divides ASIC revenue into two parts: routing and switching and computing offloading. Compute offloading is handled in the data center in two steps. When a compute request comes in, routers and switches decide which part of the data center should handle the work. Once decided, a processor (usually a CPU or GPU, like those designed by NVIDIA) does the calculations, which are then sent back to the end user again by those routers and switches over the Internet or private network.
By revenue, Broadcom is the world's second-largest artificial intelligence chip company, behind only Nvidia, with ASIC sales in the billions of dollars. This is the result of Google ramping up TPU deployment in response to Microsoft's partnership with OpenAI. Google's TPU alone more than quadrupled Broadcom's ASIC revenue. The AI tax that Google did not pay to Nvidia ended up in Broadcom's pocket in another form.
No company would be willing to keep paying this bill. In September this year, it was reported that Google was preparing to end its partnership with Broadcom before 2027. Sources said that Google executives had set the goal of dropping Broadcom and had begun to consider its competitor Marvell. The two companies had been locked in a months-long impasse over the pricing of TPU chips.
Although Google officials later came out to refute the rumors and stated that they currently have no plans to change their cooperative relationship with Broadcom, it is already known that the two companies are at odds in private.
Google made a shrewd calculation with the TPU. While Microsoft and other giants obediently paid Nvidia, it brought out TPUv5p to take on Nvidia. What it did not expect was that the ASIC partnership, inconspicuous a few years ago, has now become the biggest stumbling block in TPU development. The more TPU deployment expands, the more Google has to pay Broadcom.
Seen this way, the giants can dodge the bill for a while, but not forever. They can escape the 70% margin on Nvidia GPUs, but they cannot escape partner companies like Broadcom. If Microsoft and others want to save money on AI chips, they will inevitably run into the same difficulties Google faces today.