Try to See It My Way

Robots are not real quick on the uptake, if you catch my drift. One of the more common ways to teach a robot a new trick is to show its control system videos of human demonstrations so that it can learn by example. To become at all proficient at the task, it will generally need to be shown a large number of demonstrations. These demonstrations can be quite time-consuming and laborious to produce, and may require the use of complex, specialized equipment.

That is bad news for those of us who want domestic robots à la Rosey the Robot to finally make their way into our homes. Between the initial training datasets needed to give the robots a reasonable ability to generalize across different environments, and the fine-tuning datasets that will inevitably be needed to achieve decent success rates in each home, it is simply not practical to train these robots to do even one thing, let alone a dozen household chores.

A group of researchers at New York University and UC Berkeley had an idea that could greatly simplify data collection when it comes to human demonstrations. Their approach, called EgoZero, makes the process as frictionless as possible by recording first-person video from a pair of smart glasses, with no complex setups or specialized hardware needed. These demonstrations could even be collected over time, as a person goes about their normal daily routine.

The glasses used by the researchers are Meta’s Project Aria smart glasses, which are equipped with both RGB and SLAM cameras that can capture video from the wearer’s perspective. Using this minimal setup, the wearer can collect high-quality, action-labeled demonstrations of everyday tasks — things like opening a drawer, placing a dish in the sink, or grabbing a box off a shelf.
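
To make that more concrete, here is a minimal sketch of what a single action-labeled frame of such a demonstration might contain. The field names and the world-frame lifting helper are illustrative assumptions only, not the actual Project Aria or EgoZero data formats.

```python
# Illustrative sketch only: a simplified record for one frame of an
# action-labeled egocentric demonstration. Field names are hypothetical
# and do not reflect the actual EgoZero or Project Aria schemas.
from dataclasses import dataclass
import numpy as np

@dataclass
class DemoFrame:
    timestamp_ns: int            # capture time from the glasses' clock
    rgb: np.ndarray              # H x W x 3 image from the RGB camera
    camera_pose: np.ndarray      # 4 x 4 world-from-camera transform (from SLAM)
    hand_keypoints: np.ndarray   # N x 3 fingertip/hand points in the camera frame
    gripper_closed: bool         # discretized "grasp" action label

def to_world(points_cam: np.ndarray, camera_pose: np.ndarray) -> np.ndarray:
    """Lift camera-frame 3D points into the world frame using the SLAM pose."""
    homogeneous = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (camera_pose @ homogeneous.T).T[:, :3]
```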

Once the video data is captured, EgoZero converts it into 3D point-based representations that are morphology-agnostic. Because of this transformation, it does not matter whether the person performing the task has five fingers and the robot has two. The system abstracts the behavior in a way that can generalize across physical differences. These compact representations can then be used to train a robotic policy capable of performing the task autonomously.
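
The sketch below illustrates the general idea of behavior cloning on point inputs: a small network maps the tracked 3D points to a robot action, so the policy never sees raw pixels or a human hand. This is not the authors' actual architecture, and every shape and name here is an assumption made for illustration.

```python
# Minimal behavior-cloning sketch over 3D point inputs. NOT the EgoZero
# architecture; it only shows how a policy trained on morphology-agnostic
# points (rather than raw pixels) can transfer from human video to a robot.
import torch
import torch.nn as nn

class PointPolicy(nn.Module):
    def __init__(self, num_points: int = 16, action_dim: int = 7):
        super().__init__()
        # Input: flattened (x, y, z) coordinates of tracked object/hand points.
        self.net = nn.Sequential(
            nn.Linear(num_points * 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),  # e.g., end-effector delta pose + gripper
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        return self.net(points.flatten(start_dim=1))

# Behavior cloning: regress the demonstrated action from the point observation.
policy = PointPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(points_batch: torch.Tensor, action_batch: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(policy(points_batch), action_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```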

In their experiments, the team used EgoZero data to train a Franka Panda robotic arm with a gripper, testing it on seven manipulation tasks. With just 20 minutes of human demonstration data per task and no robot-specific data, the robot achieved a 70% average success rate. That is an impressive level of performance for what is essentially zero-shot learning in the physical world. This performance even held up under changing conditions, like new camera angles, different spatial configurations, and the addition of unfamiliar objects. This suggests EgoZero-based training could be practical for real-world use, even in dynamic or varied environments like homes.

The team has made their system publicly available on GitHub, hoping to spur further research and dataset collection. They are now exploring how to scale the approach even further, including integrating fine-tuned vision-language models and testing broader task generalization.

Showing a robot how it’s done with smart glasses (📷: V. Liu et al.)

An overview of the training approach (📷: V. Liu et al.)
