On Monday, a group of AI researchers from Google and the Berlin University of Technology unveiled PaLM-E, a multimodal embodied visual-language model (VLM) with 562 billion parameters that integrates vision and language for robotic control. They claim it is the largest VLM ever developed and that it can perform a variety of tasks without the need for retraining.
According to Google, when given a high-level command, such as “bring me the rice chips from the drawer,” PaLM-E can generate a plan of action for a mobile robot platform with an arm (developed by Google Robotics) and execute the actions itself.
PaLM-E does this by analyzing data from the robot’s camera without requiring a pre-processed scene representation. This eliminates the need for humans to preprocess or interpret data and allows for more autonomous robot control.
In a demo video provided by Google, PaLM-E carries out “bring me the rice chips from the drawer,” which involves multiple planning steps as well as incorporating visual feedback from the robot’s camera.
It is also resilient and can react to its environment. For example, the PaLM-E model can guide a robot to get a bag of chips from a kitchen, and with PaLM-E integrated into the control loop, it becomes resistant to interruptions that might occur during the task. In one video example, a researcher grabs the chips from the robot and moves them, but the robot locates the chips and grabs them again.
In another example, the same PaLM-E model autonomously controls a robot through tasks with complex sequences that previously required human guidance. Google’s research paper explains how PaLM-E turns instructions into actions:
We demonstrate the performance of PaLM-E on challenging and diverse mobile manipulation tasks. We largely follow the setup in Ahn et al. (2022), where the robot needs to plan a sequence of navigation and manipulation actions based on an instruction by a human. For example, given the instruction “I spilled my drink, can you bring me something to clean it up?”, the robot needs to plan a sequence containing “1. Find a sponge, 2. Pick up the sponge, 3. Bring it to the user, 4. Put down the sponge.” Inspired by these tasks, we develop 3 use cases to test the embodied reasoning abilities of PaLM-E: affordance prediction, failure detection, and long-horizon planning. The low-level policies are from RT-1 (Brohan et al., 2022), a transformer model that takes RGB images and natural language instructions, and outputs end-effector control commands.
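As a rough illustration of that pipeline, the sketch below shows a high-level planner proposing one step at a time from the instruction and the current camera image, while a low-level policy turns each step into end-effector commands. This is a simplified assumption of how such a loop could be wired, not Google’s implementation; `plan_next_step`, `low_level_policy`, `get_camera_image`, and `send_commands` are hypothetical stand-ins for the real components.

```python
# Simplified sketch (not Google's code) of the pipeline described in the quoted
# passage: a high-level planner proposes one step at a time from the instruction
# plus the current camera image, and a low-level policy (RT-1 in the paper)
# turns each step and image into end-effector commands.
def control_loop(instruction, plan_next_step, low_level_policy,
                 get_camera_image, send_commands, max_steps=20):
    """All callables are hypothetical stand-ins for the real components."""
    for _ in range(max_steps):
        image = get_camera_image()                 # current RGB observation
        step = plan_next_step(instruction, image)  # e.g. "1. Find a sponge"
        if step is None:                           # planner signals completion
            break
        commands = low_level_policy(step, image)   # language + RGB -> motor commands
        send_commands(commands)
```

Replanning from a fresh camera image on each iteration is also what would make a loop like this robust to interruptions, such as the chip-grabbing example above.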
PaLM-E is a next-token predictor, and it is called “PaLM-E” because it is based on Google’s existing large language model (LLM) called “PaLM” (which is similar to the technology behind ChatGPT). Google has made PaLM “embodied” by adding sensory information and robotic control.
Since it is based on a language model, PaLM-E takes continuous observations, such as images or sensor data, and encodes them into a sequence of vectors that are the same size as language tokens. This allows the model to “understand” the sensory information in the same way it processes language.
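In concrete terms, that design can be illustrated with a short sketch: a vision encoder produces patch features, and a learned linear projection maps them into the language model’s token-embedding space so they can sit in the same sequence as text tokens. The dimensions, layer, and function names below are assumptions for illustration, not PaLM-E’s actual code.

```python
# Minimal sketch (not Google's code) of the idea described above: continuous
# observations are encoded and projected into vectors with the same
# dimensionality as the language model's token embeddings, then interleaved
# with text tokens so the LLM can process them as one sequence.
import torch
import torch.nn as nn

LLM_EMBED_DIM = 4096     # hypothetical token-embedding size of the language model
VIT_FEATURE_DIM = 1024   # hypothetical feature size from a vision encoder (e.g. a ViT)

# Learned projection that maps image features into the LLM's embedding space.
image_projector = nn.Linear(VIT_FEATURE_DIM, LLM_EMBED_DIM)

def build_multimodal_sequence(text_embeddings: torch.Tensor,
                              image_features: torch.Tensor) -> torch.Tensor:
    """Interleave projected image vectors with text-token embeddings.

    text_embeddings: (num_text_tokens, LLM_EMBED_DIM)
    image_features:  (num_image_patches, VIT_FEATURE_DIM)
    """
    image_tokens = image_projector(image_features)   # -> (num_patches, LLM_EMBED_DIM)
    # Here the image "tokens" simply precede the text; a real system would place
    # them wherever the observation appears in the prompt.
    return torch.cat([image_tokens, text_embeddings], dim=0)

# Toy usage with random tensors standing in for real encoder outputs.
text = torch.randn(12, LLM_EMBED_DIM)       # e.g. embedded prompt tokens
image = torch.randn(256, VIT_FEATURE_DIM)   # e.g. ViT patch features for one camera frame
sequence = build_multimodal_sequence(text, image)
print(sequence.shape)  # torch.Size([268, 4096])
```

Once the image vectors are in the same space as the text embeddings, the language model attends over both identically, which is what lets it treat sensory input like language.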
A demo video provided by Google shows a robot guided by PaLM-E following the instruction, “Bring me a green star.” The researchers say the green star “is an object that this robot wasn’t directly exposed to.”
In addition to the RT-1 robotics transformer, PaLM-E draws from Google’s previous work on ViT-22B, a vision transformer model revealed in February. ViT-22B has been trained on various visual tasks, such as image classification, object detection, semantic segmentation, and image captioning.
Google Robotics is not the only research group working on robotic control with neural networks. This particular work resembles Microsoft’s recent “ChatGPT for Robotics” paper, which experimented with combining visual data and large language models for robot control in a similar way.
Robotics aside, the Google researchers noticed several interesting effects that apparently come from using a large language model as the basis of PaLM-E. For one, it exhibits “positive transfer,” which means it can transfer the knowledge and skills it has learned from one task to another, resulting in “significantly higher performance” compared to single-task robot models.
Also, they observed a trend with the scale of the model: “The larger the language model, the more it maintains its language capabilities when training on visual-language and robotics tasks; quantitatively, the 562B PaLM-E model nearly retains all of its language capabilities.”
PaLM-E is the largest VLM reported to date. We observe emergent capabilities like multimodal chain of thought reasoning, and multi-image inference, despite being trained on only single-image prompts. Though not the focus of our work, PaLM-E sets a new SOTA on the OK-VQA benchmark. pic.twitter.com/9FHug25tOF
— Danny Driess (@DannyDriess) March 7, 2023
And the researchers note that PaLM-E exhibits emergent capabilities like multimodal chain-of-thought reasoning (allowing the model to analyze a sequence of inputs that include both language and visual information) and multi-image inference (using multiple images as input to make an inference or prediction), despite being trained on only single-image prompts. In that sense, PaLM-E seems to continue the trend of surprising capabilities emerging as deep learning models grow larger and more complex over time.
Google researchers plan to explore more applications of PaLM-E for real-world scenarios such as home automation or industrial robotics. And they hope PaLM-E will inspire more research on multimodal reasoning and embodied AI.
“Multimodal” is a buzzword we will be hearing more and more as companies reach for artificial general intelligence that can ostensibly perform general tasks the way a human does.