From something simple like a screwdriver or a frying pan, all the way up to modern technology like a computer, we need all sorts of tools to get much of anything done. That is no different for robots, yet teaching them to use tools is extraordinarily difficult. Until they can learn these skills, they will effectively be stuck in the Stone Age compared to us, and that certainly does not move us nearer to our goal of building general-purpose robots that can help with household tasks.
Since we learn to use tools by watching others, it only seems natural that we should teach robots in a similar way. And efforts have been made to do exactly that. Teleoperation is one frequently used method, but it requires expensive equipment and does not scale well. Learning from videos of human demonstrations is another option, but working with single-view data limits the insights that can be drawn from it.
The Tool-as-Interface approach (📷: H. Chen et al.)
A more effective and scalable solution is sorely needed, and one such solution has just been proposed by a team led by researchers at the University of Illinois Urbana-Champaign. They have developed what they call Tool-as-Interface, which teaches robots to use tools by learning from human demonstrations. Their approach differs from existing methods in a few key ways that make it both more effective and easier to scale.
Instead of relying on cumbersome teleoperation setups or specialized hand-held grippers, the new framework makes use of natural human interaction data. That means people simply use tools as they normally would; no special equipment or technical expertise is required. This raw, unstructured human activity is then recorded using a pair of RGB cameras that capture the scene from different angles.
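The article does not include the team's capture code, but the recording setup itself is simple enough to sketch. Below is a minimal, hypothetical example of grabbing roughly synchronized frames from two RGB webcams with OpenCV and saving them as paired images for later reconstruction; the device indices, resolution, and output paths are assumptions for illustration, not details from the paper.

```python
# Hypothetical sketch of a two-camera recording loop (not the authors' code):
# grab roughly synchronized frames from two RGB webcams and save them as
# index-paired images for later 3D reconstruction.
import os
import cv2
import numpy as np

CAM_IDS = (0, 1)          # assumed device indices for the two RGB cameras
FRAME_SIZE = (1280, 720)  # assumed capture resolution

caps = [cv2.VideoCapture(i) for i in CAM_IDS]
for cap in caps:
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, FRAME_SIZE[0])
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, FRAME_SIZE[1])
for cam_id in CAM_IDS:
    os.makedirs(f"demo/cam{cam_id}", exist_ok=True)

frame_idx = 0
while True:
    frames = [cap.read() for cap in caps]   # (ok, image) per camera
    if not all(ok for ok, _ in frames):
        break                               # a camera stopped delivering frames

    # Save one image per camera per time step; later stages pair them by index.
    for cam_id, (_, image) in zip(CAM_IDS, frames):
        cv2.imwrite(f"demo/cam{cam_id}/{frame_idx:06d}.png", image)
    frame_idx += 1

    # Show a side-by-side preview; press 'q' to stop recording.
    cv2.imshow("capture", np.hstack([image for _, image in frames]))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

for cap in caps:
    cap.release()
cv2.destroyAllWindows()
```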
With data from this simple setup, the system generates 3D reconstructions of the person’s actions, enabling robust and view-invariant learning. The researchers also applied a technique called Gaussian splatting to generate new, synthetic views of the same action, further enhancing the diversity of training data. To make the demonstrations robot-friendly, a segmentation model filters out any embodiment-specific details, like human hands or arms, allowing the robot to focus solely on how the tool itself is used.
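To make that pipeline a little more concrete, here is a minimal Python/OpenCV sketch (not the authors' implementation) of two of the ideas described above: lifting matched 2D tool keypoints from the two calibrated cameras into 3D, and blanking out the demonstrator's hands and arms given a binary person mask from any off-the-shelf segmentation model. The projection matrices, keypoints, and mask are all assumed inputs.

```python
# Minimal sketch (not the authors' implementation) of two steps the article
# describes: triangulating matched tool keypoints from the two calibrated RGB
# views into 3D, and removing "embodiment" pixels (the demonstrator's hands
# and arms) given a mask from an off-the-shelf segmentation model.
import cv2
import numpy as np

def triangulate_tool_keypoints(P1, P2, pts_cam1, pts_cam2):
    """Lift N matched 2D keypoints (pixel coordinates) into 3D points.

    P1, P2   : 3x4 projection matrices of the two calibrated cameras.
    pts_cam1 : Nx2 array of keypoints seen by camera 1.
    pts_cam2 : Nx2 array of the same keypoints seen by camera 2.
    """
    pts1 = np.asarray(pts_cam1, dtype=np.float64).T       # 2xN, as OpenCV expects
    pts2 = np.asarray(pts_cam2, dtype=np.float64).T
    points_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # 4xN homogeneous
    return (points_h[:3] / points_h[3]).T                 # Nx3 Euclidean points

def remove_embodiment(frame, human_mask):
    """Inpaint the demonstrator's pixels so only the tool and scene remain.

    frame      : HxWx3 BGR image from one of the cameras.
    human_mask : HxW uint8 mask, nonzero wherever a segmentation model
                 flagged the demonstrator's hands or arms.
    """
    return cv2.inpaint(frame, human_mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
```

The real framework goes a step further than this sketch: Gaussian splatting is used to render the reconstructed scene from new synthetic viewpoints, so the training data is not limited to the two physical camera positions.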
Some tasks learned with the system (📷: H. Chen et al.)
The method achieved a 71% higher average success rate than diffusion policies trained on teleoperation data, while reducing data collection time by 77%. Some tasks, such as pan flipping and wine bottle balancing, could only be solved at all with the new framework. Compared to existing hand-held gripper systems like UMI, Tool-as-Interface cut data collection time by 41%. Beyond the performance gains, the framework also proved robust in challenging conditions, such as changes in camera positioning, robot base movement, and unexpected disturbances during task execution.
Tool-as-Interface may not get us all the way to a general-purpose domestic robot, but this new training method could prove to be an important step along the way. Until robots can harness the power of tools, they will remain severely limited in their ability to assist us in meaningful ways.