2024 is shaping up to be a big year at the intersection of generative AI/large foundation models and robotics. From learning to product design, the potential applications are exciting. Google's DeepMind Robotics team is one of many groups exploring the area. In a blog post today, the team highlights ongoing research aimed at giving robots a better understanding of what we humans want them to do.

Traditionally, robots have spent their working lives performing a single task over and over. Single-purpose bots tend to be very good at that one thing, but even they can struggle when unintended changes or errors are introduced.

The newly announced AutoRT is designed to harness large foundation models for several different purposes. In a standard example given by the DeepMind team, the system first uses a visual language model (VLM) to improve situational awareness. AutoRT manages a fleet of robots that work together and are equipped with cameras to capture the layout of the environment and the objects within it.

A large language model, meanwhile, proposes tasks that the hardware, including its end effector, can perform. Many believe that language models are the key to unlocking robotics, allowing robots to understand more natural language instructions and reducing the need for hard-coded skills.
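The flow described above (a VLM describes the scene, then an LLM proposes tasks suited to the hardware) can be sketched roughly as follows. This is only an illustration under stated assumptions, not DeepMind's implementation: `Robot`, `describe_scene`, `propose_tasks` and `plan_fleet` are hypothetical names, and both model calls are replaced with simple stubs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Robot:
    name: str
    end_effector: str  # hypothetical labels, e.g. "gripper" or "none"


def describe_scene(camera_frame: dict) -> list[str]:
    """Stand-in for a VLM call: returns the objects visible in a frame.

    The 'frame' here is a dict that already lists its objects, so no
    real model is involved.
    """
    return sorted(camera_frame["objects"])


def propose_tasks(objects: list[str], robot: Robot) -> list[str]:
    """Stand-in for an LLM call: suggests tasks the robot's hardware allows."""
    if robot.end_effector == "gripper":
        return [f"pick up the {obj}" for obj in objects]
    # Without a gripper, only propose tasks that need no manipulation.
    return [f"move closer to the {obj}" for obj in objects]


def plan_fleet(robots: list[Robot], frames: dict) -> dict:
    """Run describe -> propose for each robot and keep its first proposal."""
    plan = {}
    for robot in robots:
        objects = describe_scene(frames[robot.name])
        tasks = propose_tasks(objects, robot)
        plan[robot.name] = tasks[0] if tasks else None
    return plan


robots = [Robot("r1", "gripper"), Robot("r2", "none")]
frames = {
    "r1": {"objects": ["sponge", "cup"]},
    "r2": {"objects": ["chair"]},
}
plan = plan_fleet(robots, frames)
print(plan)
```

In a real system the two stubs would be calls to actual models, and the proposals would presumably be checked further before any robot acts on them.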

The system has been tested extensively over the past seven-plus months. AutoRT can coordinate up to 20 robots at once and has been run on a total of 52 different devices. In all, DeepMind has collected approximately 77,000 trials covering more than 6,000 tasks.

Also new from the team is RT-Trajectory, which uses video input for robot learning. Many teams are exploring the use of YouTube videos to train robots at scale, but RT-Trajectory adds an interesting layer by superimposing a 2D sketch of the arm's movement on top of the video.
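To make the overlay idea concrete, here is a minimal sketch of drawing a 2D trajectory onto an RGB frame. It assumes the arm's path is already available as pixel waypoints; the function names, the nested-list image format and the red stroke color are illustrative choices, not details from DeepMind's work.

```python
def blank_image(width, height, color=(0, 0, 0)):
    """Create an RGB image as a list of rows of (r, g, b) tuples."""
    return [[color for _ in range(width)] for _ in range(height)]


def line_points(x0, y0, x1, y1):
    """Integer pixel positions along a segment, by linear interpolation."""
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    return [
        (round(x0 + (x1 - x0) * t / steps), round(y0 + (y1 - y0) * t / steps))
        for t in range(steps + 1)
    ]


def overlay_trajectory(image, waypoints, color=(255, 0, 0)):
    """Return a copy of the image with a polyline drawn through waypoints."""
    out = [row[:] for row in image]  # copy rows so the input is untouched
    height, width = len(out), len(out[0])
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        for x, y in line_points(x0, y0, x1, y1):
            if 0 <= x < width and 0 <= y < height:  # skip off-frame pixels
                out[y][x] = color
    return out


frame = blank_image(8, 6)
waypoints = [(1, 1), (4, 1), (4, 4)]  # hypothetical (x, y) pixel coordinates
sketched = overlay_trajectory(frame, waypoints)
```

The result is still an ordinary RGB image, so it can be passed to a model alongside or in place of the raw frame.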

"These trajectories, in the form of RGB images, provide low-level practical visual cues to the model as it learns robot control strategies," the team noted.

DeepMind said that in testing on 41 tasks, RT-Trajectory training achieved a success rate roughly double that of RT-2 training: 63% compared with 29%.

"RT-Trajectory exploits the rich wealth of robot motion information that is present in all robot datasets but is currently underutilized," the team noted. "RT-Trajectory not only represents another step on the road to building robots that can move efficiently and accurately in new situations, but also unlocks knowledge from existing data sets."