UC Berkeley researchers have
developed a robotic learning technology that enables robots to imagine
the future of their actions so they can figure out how to manipulate
objects they have never encountered before. In the future, this
technology could help self-driving cars anticipate future events on the
road and produce more intelligent robotic assistants in homes, but the
initial prototype focuses on learning simple manual skills entirely from
autonomous play.
Berkeley researchers have
programmed the robot, Vestri, to complete tasks like a baby would – by
playing with objects and then imagining how to get the task done.
Using this technology, called visual foresight, the robots can predict
what their cameras will see if they perform a particular sequence of
movements. These robotic imaginations are still relatively simple for
now – predictions made only several seconds into the future – but they
are enough for the robot to figure out how to move objects around on a
table without disturbing obstacles. Crucially, the robot can learn to
perform these tasks without any help from humans or prior knowledge
about physics, its environment or what the objects are. That’s because
the visual imagination is learned entirely from scratch through unattended,
unsupervised exploration, in which the robot plays with objects on a
table. After this play phase, the robot builds a predictive model of the
world, and can use this model to manipulate new objects that it has not
seen before.
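The pipeline behind that play phase is conceptually simple: the robot executes random motions, records what its camera sees, and keeps the resulting image-action sequences as training data, with no labels involved. The sketch below is a minimal illustration of that loop in Python; the `TabletopRobot` class is a hypothetical stand-in for the real robot and camera, which are not described in detail in the article.

```python
import numpy as np

class TabletopRobot:
    """Hypothetical stand-in for the real robot/camera interface."""

    def observe(self):
        # In the real system this would be a camera frame; here, random pixels.
        return np.random.rand(64, 64, 3).astype(np.float32)

    def act(self, action):
        # action: a small (dx, dy) motion of the arm; the real robot would execute it.
        pass

def collect_play_data(robot, num_episodes=10, horizon=15):
    """Record (image, action) sequences from random pushing; no labels, no supervision."""
    episodes = []
    for _ in range(num_episodes):
        frames, actions = [robot.observe()], []
        for _ in range(horizon):
            a = np.random.uniform(-1.0, 1.0, size=2)  # random end-effector motion
            robot.act(a)
            actions.append(a)
            frames.append(robot.observe())
        episodes.append((np.stack(frames), np.stack(actions)))
    return episodes

data = collect_play_data(TabletopRobot())
```

A video-prediction model trained on sequences like these is all the "world knowledge" the robot starts with.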
“In the same way that we can imagine how our actions will move the
objects in our environment, this method can enable a robot to visualize
how different behaviors will affect the world around it,” said Sergey
Levine, assistant professor in Berkeley’s Department of Electrical
Engineering and Computer Sciences, whose lab developed the technology.
“This can enable intelligent planning of highly flexible skills in
complex real-world situations.”
The research team will perform a demonstration of the visual foresight
technology at the Neural Information Processing Systems conference in
Long Beach, California, on December 5.
At the core of this system is a deep learning technology based on
convolutional recurrent video prediction, or dynamic neural advection
(DNA). DNA-based models predict how pixels in an image will move from
one frame to the next based on the robot’s actions. Recent improvements
to this class of models, as well as greatly improved planning
capabilities, have enabled robotic control based on video prediction to
perform increasingly complex tasks, such as sliding toys around
obstacles and repositioning multiple objects.
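In rough terms, a DNA-style model does not paint the next frame from scratch; it predicts, for every pixel, where the contents of the previous frame should move, and then warps the previous frame accordingly. The snippet below is a simplified illustration of that warping step, assuming the per-pixel displacements (a flow field) have already been predicted; in the actual model, a convolutional recurrent network conditioned on the robot's action outputs these transformations.

```python
import numpy as np

def advect_frame(prev_frame, flow):
    """Form the next frame by moving pixels of the previous frame along a given flow.

    prev_frame: (H, W, 3) image; flow: (H, W, 2) per-pixel displacement in pixels.
    In the real DNA model these displacements come from a learned, action-conditioned
    network; here they are supplied directly for illustration.
    """
    H, W, _ = prev_frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, W - 1)
    return prev_frame[src_y, src_x]

# Example: shift the whole scene two pixels to the right, a crude "push" of everything.
frame = np.random.rand(64, 64, 3)
flow = np.zeros((64, 64, 2))
flow[..., 1] = 2.0
next_frame = advect_frame(frame, flow)
```

Because the next frame is assembled by rearranging pixels the camera has already seen, the model can be trained on unannotated video of the robot's own actions.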
“In the past, robots have learned skills with a human supervisor
helping and providing feedback. What makes this work exciting is that
the robots can learn a range of visual object manipulation skills
entirely on their own,” said Chelsea Finn, a doctoral student in
Levine’s lab and inventor of the original DNA model.
With the new technology, a robot pushes objects on a table, then uses
the learned prediction model to choose motions that will move an object
to a desired location. Working directly from raw camera observations, the
robots use the learned model to teach themselves how to avoid obstacles
and push objects around obstructions.
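One simple way to choose those motions, shown here as a sketch rather than the exact planner used by the researchers, is random shooting: sample many candidate action sequences, ask the learned model where the object (tracked as a designated pixel) would end up under each, and execute the first action of the sequence whose prediction lands closest to the goal, replanning after every step. The `predict_pixel` argument below is a hypothetical stand-in for the learned video-prediction model.

```python
import numpy as np

def plan_push(predict_pixel, start_pixel, goal_pixel, horizon=5, num_samples=200):
    """Random-shooting planner: sample action sequences, predict where the designated
    object pixel ends up under each, and return the first action of the best sequence.

    predict_pixel(pixel, actions) -> predicted pixel location after the action sequence;
    this stands in for the learned prediction model (an assumption, not the real API).
    """
    best_cost, best_actions = np.inf, None
    for _ in range(num_samples):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, 2))  # candidate arm motions
        predicted = predict_pixel(start_pixel, actions)
        cost = np.linalg.norm(predicted - goal_pixel)  # distance from goal location
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions[0]  # execute only the first action, then replan

# Toy stand-in model: assume each action displaces the tracked pixel proportionally.
toy_model = lambda pixel, actions: pixel + 10.0 * actions.sum(axis=0)
action = plan_push(toy_model,
                   start_pixel=np.array([32.0, 20.0]),
                   goal_pixel=np.array([32.0, 48.0]))
```

Replanning after every executed action lets the robot correct for prediction errors as the scene actually changes.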
“Humans learn object manipulation skills without any teacher through
millions of interactions with a variety of objects during their
lifetime. We have shown that it is possible to build a robotic system that
also leverages large amounts of autonomously collected data to learn
widely applicable manipulation skills, specifically object pushing
skills,” said Frederik Ebert, a graduate student in Levine’s lab who
worked on the project.
Since control through video prediction relies only on observations that
can be collected autonomously by the robot, such as through camera
images, the resulting method is general and broadly applicable. In
contrast to conventional computer vision methods, which require humans
to manually label thousands or even millions of images, building video
prediction models only requires unannotated video, which can be
collected by the robot entirely autonomously. Indeed, video prediction
models have also been applied to datasets that represent everything from
human activities to driving, with compelling results.
“Children
can learn about their world by playing with toys, moving them around,
grasping, and so forth. Our aim with this research is to enable a robot
to do the same: to learn about how the world works through autonomous
interaction,” Levine said. “The capabilities of this robot are still
limited, but its skills are learned entirely automatically, and allow it
to predict complex physical interactions with objects that it has never
seen before by building on previously observed patterns of interaction.”
The Berkeley scientists are continuing to research control through video
prediction, focusing on further improving video prediction and
prediction-based control, as well as on developing more sophisticated
methods by which robots can collect more focused video data for complex
tasks such as picking and placing objects, manipulating soft and
deformable objects such as cloth or rope, and assembly.