Nvidia’s head of autonomous driving explains in detail the roadmap to “surpassing Tesla and Waymo”

Every six months or so, once the self-driving system has iterated to a point where he feels "confident enough," Xinzhou Wu, the head of Nvidia's automotive business, invites CEO Jensen Huang into the car for a test drive on public roads. Most recently, the two rode from Woodside, California, to downtown San Francisco in a Mercedes-Benz CLA equipped with the MB.Drive Assist Pro hands-free driver-assistance system. The system was partly designed by Nvidia and closely resembles Tesla's "Full Self-Driving" (FSD) in form.

Traffic was heavy, but the mood inside the car was relaxed. Huang even joked that he "started to worry less about safety" once the system entered autonomous driving mode.

Nvidia later gave the media a 22-minute in-car video showing the Mercedes-Benz working through complex everyday urban scenes such as construction zones, illegally parked vehicles, and narrow lanes squeezed by orange traffic cones. The system performed quite smoothly, although the video was edited and is not a complete recording. An Nvidia spokesperson later stressed that no takeover requiring human intervention occurred at any point in the drive. The author previously rode with Nvidia executives in San Francisco in a car running a similar system and was impressed by how it handled traffic lights, four-way intersections, illegally parked vehicles, and unprotected left turns, and by how it moved among pedestrians, bicycles, and scooters. In the author's view, now that Tesla has already proven out an approach built on cameras and chips, it is not hard for the world's most valuable chip company to build an equivalent or even better system.

After years of working behind the scenes as an enabler, Nvidia has begun to place itself at the center of the autonomous driving stage. Besides continuing to supply automotive-grade chips to Tesla and other carmakers, it now packages its in-house AI driving functions into a platform and offers them to partners such as Mercedes-Benz, Jaguar Land Rover, and Lucid. At CES earlier this year, Huang unveiled an autonomous driving development portfolio called "Alpamayo," which covers AI models, simulation blueprints, and data sets and aims to let vehicles achieve L4 autonomous driving under specific conditions. He even called this milestone "the ChatGPT moment of physical world AI."

In conversation with Wu in the car, however, Huang set aside the bravado of the keynote and struck a more reflective tone, while remaining extremely optimistic about the technology's future. He admitted that Alpamayo's strength lies in its ability to reason about the environment, but that the real difficulty is that "we don't know what it can't do," so it still needs to be deeply integrated with the traditional "classic technology stack." In his view, the safety of a purely end-to-end large model is hard to demonstrate from an engineering perspective, whereas the classic stack rests on mature engineering processes and lends itself better to safety verification of specific behaviors. Combining the two can produce a driving style close to a human's while keeping behavior within the framework of traditional traffic rules. Other self-driving players also layer explicit safety rules on top of end-to-end neural networks, even as end-to-end learning becomes the industry's new trend: Waymo takes a hybrid approach, while Tesla is betting almost entirely on end-to-end networks.
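
To make the idea of combining the two stacks concrete, here is a minimal Python sketch of a rule layer that constrains a learned planner's output. It is purely illustrative and is not Nvidia's design: the names `Proposal`, `RoadContext`, and `apply_rule_layer`, and the default gap and headway values, are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical output of a learned (end-to-end) planner."""
    target_speed_mps: float


@dataclass
class RoadContext:
    """Hypothetical facts supplied by a classic, rule-based stack."""
    speed_limit_mps: float
    red_light_ahead: bool
    gap_to_lead_m: float  # distance to the vehicle in front


def apply_rule_layer(proposal, ctx, min_gap_m=5.0, time_headway_s=2.0):
    """Clamp the learned proposal so it respects explicit, checkable rules."""
    # Rule 1: never exceed the posted speed limit.
    speed = min(proposal.target_speed_mps, ctx.speed_limit_mps)
    # Rule 2: keep a time headway to the lead vehicle (illustrative values).
    max_by_gap = max(0.0, (ctx.gap_to_lead_m - min_gap_m) / time_headway_s)
    speed = min(speed, max_by_gap)
    # Rule 3: a red light overrides everything else.
    if ctx.red_light_ahead:
        speed = 0.0
    return Proposal(target_speed_mps=max(0.0, speed))
```

Because each rule is a small, explicit function of known inputs, it can be tested in isolation, which is the verification advantage Huang attributes to the classic stack.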

Wu said in the interview that an end-to-end model reduces the mechanical feel and "robotic" lag in subtle situations such as speed bumps and lane changes, bringing the car closer to the rhythm of human driving. This is why he emphasizes the "ChatGPT moment." “Only when your car behaves confidently will users be more willing to keep using it,” he said.

When comparing Nvidia with Tesla, Wu places the difference in the sensor suite and system architecture rather than commenting publicly on rivals' safety controversies. Tesla FSD has accumulated more than 8.5 billion miles on the road so far, but it has also been involved in a number of serious accidents, and regulators have cited it in connection with 23 injuries and at least two fatal crashes. An Nvidia executive revealed last year that the company had run internal tests comparing its own system with Tesla FSD. Judged by the number of driver takeovers, each had the edge in different scenarios.

Wu emphasized that Nvidia insists on a "multi-source redundant" sensor suite: in addition to cameras and millimeter-wave radar, it deploys ultrasonic sensors and adds lidar in higher configurations. In his view, the redundancy and diversity of multiple sensor types are the key to covering extreme edge cases and raising the overall safety margin. Of course, more sensors mean higher hardware costs for the whole system, lidar in particular, which raises the concern that the solutions with the highest safety specifications will appear only in expensive luxury cars. Wu's answer is that Nvidia's "vertical integration" approach, together with the overall decline in sensor prices, can bring the cost of that safety performance down to the "lowest possible" range.

He explained that Nvidia's DRIVE Hyperion platform has supported tiered configurations from the outset. The entry-level version uses a pared-down setup based on cameras and radar; after more than ten years of mass production, the cost of these components has fallen significantly, and ultrasonic sensors are very cheap to begin with. For higher levels of autonomous driving, lidar can be added on demand. As lidar prices continue to fall, he believes it is not unimaginable to fit a complete sensor stack to mass-produced models priced between $40,000 and $50,000.

Asked about Waymo's recent safety incidents in San Francisco and elsewhere, such as self-driving taxis collectively blocking intersections when traffic signals failed during a power outage, Wu said Nvidia has moved such extreme cases into simulation environments so they can be replayed over and over. Unlike Tesla, with its huge fleet of privately owned cars, and Waymo, which has accumulated nearly 200 million miles of fully autonomous driving on public roads, Nvidia has no advantage in real-world road data, so it invests more heavily in infrastructure for "synthetic data" and high-fidelity simulation.

Nvidia's simulation strategy relies mainly on two methods. The first is "Neural Reconstruction" (NuRec): engineers use sensor data collected by real vehicles to reconstruct realistic three-dimensional road scenes, letting the system relive a given real-world situation repeatedly in a virtual environment. The second is augmentation: variables in the reconstructed scene are continually modified, for example the timing, speed, and position of pedestrians, to generate a series of new situations that differ only subtly, so engineers can observe how the system behaves under slightly varying conditions. Internally, this process is referred to as "fuzzing" the data set. Nvidia not only obtains dashcam footage from partners but also recreates publicly reported incidents, such as the traffic jams Waymo ran into, in simulation, training the system to proactively avoid behavior patterns like "collective jamming."
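
The augmentation step described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Nvidia's tooling: `PedestrianEvent`, `augment_scene`, and the jitter ranges are hypothetical names and values chosen for the example, and a real reconstructed scene would carry far more state.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PedestrianEvent:
    """Hypothetical description of one pedestrian in a reconstructed scene."""
    start_time_s: float  # when the pedestrian starts moving
    speed_mps: float     # walking speed
    x_m: float           # lateral position relative to the ego lane
    y_m: float           # longitudinal position along the road


def augment_scene(base, n_variants, seed=0,
                  time_jitter_s=0.5, speed_jitter_mps=0.3, pos_jitter_m=1.0):
    """Return n_variants copies of `base`, each with small bounded perturbations."""
    rng = random.Random(seed)  # seeded so a failing variant can be replayed
    variants = []
    for _ in range(n_variants):
        variants.append(replace(
            base,
            start_time_s=max(0.0, base.start_time_s
                             + rng.uniform(-time_jitter_s, time_jitter_s)),
            speed_mps=max(0.0, base.speed_mps
                          + rng.uniform(-speed_jitter_mps, speed_jitter_mps)),
            x_m=base.x_m + rng.uniform(-pos_jitter_m, pos_jitter_m),
            y_m=base.y_m + rng.uniform(-pos_jitter_m, pos_jitter_m),
        ))
    return variants
```

Calling `augment_scene(PedestrianEvent(2.0, 1.4, 3.0, 20.0), 100)` would yield a hundred near-identical crossings to run the driving system against. Fixing the seed keeps each variant reproducible, so any run that exposes a problem can be replayed exactly.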

In Wu's vision, however, the truly ideal autonomous driving system of the future should not depend on endless real-world road data to cover every edge case. Instead, it should be able to "reason," deriving coping strategies by analogy from rules and limited experience. To this end, his team is developing a new model called "Vision Language Action" (VLA), which unifies visual perception, language understanding, and physical action in a single architecture and builds on a foundation model pretrained on internet-scale data to give vehicles stronger understanding and reasoning capabilities. Wu compared this to how humans learn to drive: they first read a traffic rule manual and then practice on the road for twenty hours. By then most new drivers are ready for the road, and they continue to improve with experience. "Our goal is to enable the model to do the same: in the future, it will need only a rule book and twenty hours of training data to learn to drive," he said.

On a track where Tesla and Waymo are already running ahead, Nvidia is trying to close the gap in mileage and experience with a complete "chip + platform + model + simulation" combination, and to transform itself from a behind-the-scenes "computing infrastructure builder" into a key force in setting the technical direction and safety standards of autonomous driving. For Jensen Huang and Xinzhou Wu, this bet on "the ChatGPT moment of AI in the physical world" has only just crossed the starting line.