In a cluttered open-plan office in Mountain View, California, a tall, slender wheeled robot has been busy playing tour guide and informal office helper, thanks to a major language model upgrade, Google DeepMind revealed today. The robot uses the latest version of Google's Gemini large language model to both parse commands and find its way around.
For example, when a human gives the command, “Find a place where I can write,” the robot dutifully sets out and leads the person to a pristine whiteboard located somewhere in the building.
Gemini’s ability to process video and text, along with its ability to absorb large amounts of information in the form of previously recorded video tours of the office, allows the “Google helper” robot to understand its environment and navigate correctly when given commands that require some common sense. The robot combines Gemini with an algorithm that generates specific actions for the robot to perform, such as turning, in response to commands and what it sees in front of it.
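To make that two-part design concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not DeepMind's implementation: the vision language model is replaced by a simple word-overlap score over hypothetical text captions of tour frames, and the action-generating algorithm is replaced by a breadth-first search over a hypothetical graph of connected locations. Every name in it (`TourFrame`, `pick_goal_frame`, `shortest_path`, `plan`, `FRAMES`, `GRAPH`) is invented for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class TourFrame:
    index: int
    caption: str  # hypothetical text description standing in for a video frame


def pick_goal_frame(command, frames):
    """Stand-in for the vision language model (an assumption, not Gemini):
    choose the tour frame whose caption shares the most words with the command."""
    words = set(command.lower().replace("?", "").split())
    return max(frames, key=lambda f: len(words & set(f.caption.lower().split())))


def shortest_path(graph, start, goal):
    """Stand-in for the low-level planner: breadth-first search over an
    adjacency dict, returning the list of locations to visit, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def plan(command, frames, graph, current):
    """High level picks a destination frame; low level finds a route to it."""
    goal = pick_goal_frame(command, frames)
    return goal, shortest_path(graph, current, goal.index)


# Hypothetical office tour: four locations and the connections between them.
FRAMES = [
    TourFrame(0, "lobby entrance desk"),
    TourFrame(1, "kitchen with fridge and drinks"),
    TourFrame(2, "meeting room with whiteboard to write on"),
    TourFrame(3, "desk area with monitors"),
]
GRAPH = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0]}

goal, path = plan("Find a place where I can write", FRAMES, GRAPH, current=0)
print(goal.caption, path)
```

In a real system, the word-overlap scorer would be replaced by a call to a multimodal model that sees the actual tour video, and each hop in the route would be turned into motor commands such as turning or driving forward. The split shown here, with one component choosing a destination and another producing the steps to reach it, is the only part the sketch is meant to convey.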
When Gemini was introduced in December, Google DeepMind CEO Demis Hassabis told WIRED that its multimodal capabilities would likely unlock new abilities for robots. He added that the company’s researchers were hard at work testing the model’s robotic potential.
In a new paper describing the project, the researchers behind the work say their robot was up to 90 percent reliable at navigating, even when given tricky commands like “Where did I leave my coaster?” DeepMind’s system “significantly improved the naturalness of the human-robot interaction and greatly increased the usability of the robot,” the team writes.
The demo neatly illustrates the potential of large language models to reach into the physical world and do useful work. Gemini and other chatbots typically operate within the confines of a web browser or app, though they are increasingly capable of processing visual and auditory input, as both Google and OpenAI have recently demonstrated. In May, Hassabis demonstrated an improved version of Gemini that can understand an office layout as seen through a smartphone camera.
Academic and industrial research labs are busy seeing how language models can be used to enhance the capabilities of robots. The May program for the International Conference on Robotics and Automation, a popular event for robotics researchers, includes nearly two dozen papers using vision language models.
Investors are pouring money into startups that want to apply AI advances to robotics. Several researchers involved in Google’s project have since left to found a startup called Physical Intelligence, which has raised an initial $70 million in funding; it’s working on combining large language models with real-world training to give robots general problem-solving skills. Skild AI, founded by roboticists at Carnegie Mellon University, has a similar goal. This month, it announced $300 million in funding.
Just a few years ago, a robot needed a map of its environment and carefully chosen commands to navigate successfully. Large language models contain useful information about the physical world, and newer versions trained on images, video and text, known as vision language models, can answer questions that require perception. With Gemini, Google’s robot can parse visual instructions as well as spoken ones, for example by following a sketch on a whiteboard that shows a route to a new destination.
In their paper, the researchers say they plan to test the system on different types of robots. They add that Gemini should be able to make sense of more complex questions, such as “Do they have my favorite drink today?” from a user with a lot of empty Coke cans on their desk.