Microsoft just announced Rho-alpha (ρα), their first robotics model derived from the Phi series of vision-language models.

Rho-alpha translates natural language commands into control signals for robotic systems performing bimanual manipulation tasks. Commands like "push the green button with the right gripper," "pull out the red wire," "flip the top switch on," or "turn the knob to position 5" get executed directly by dual-arm robots.
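
To make that command-to-control mapping concrete, here's a minimal sketch of what such an interface could look like. The function name, action layout, and dimensions are assumptions for illustration, not Microsoft's published API.

```python
import numpy as np

# Hypothetical action layout for a dual-arm robot: 7 joint targets plus a
# gripper command per arm. Rho-alpha's real action space isn't public.
ACTION_DIM_PER_ARM = 8


def act(instruction: str, rgb_frames: list[np.ndarray]) -> dict[str, np.ndarray]:
    """Map a natural-language command plus camera views to per-arm commands.

    Stand-in for the model call: it returns zero (hold-still) actions so the
    interface runs end to end.
    """
    del instruction, rgb_frames  # a real VLA conditions on both
    return {
        "left_arm": np.zeros(ACTION_DIM_PER_ARM),
        "right_arm": np.zeros(ACTION_DIM_PER_ARM),
    }


frames = [np.zeros((224, 224, 3), dtype=np.uint8)]  # one dummy camera view
action = act("push the green button with the right gripper", frames)
print({name: a.shape for name, a in action.items()})
```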

What makes this different from standard vision-language-action (VLA) models is the additional modalities. Rho-alpha is a VLA+ model that adds tactile sensing to the perceptual mix, with plans to incorporate force feedback.
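
Here's a rough picture of what a VLA+ observation might carry, with tactile added alongside vision and proprioception. Field names and shapes are guesses; the optional force field mirrors the feedback Microsoft says is planned but not yet included.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    """Hypothetical per-step input to a VLA+ policy; names and shapes are illustrative."""

    rgb: np.ndarray                  # (num_cameras, H, W, 3) camera images
    proprio: np.ndarray              # joint positions/velocities for both arms
    tactile: np.ndarray              # e.g. per-fingertip pressure grids
    force: np.ndarray | None = None  # wrench estimates, once force feedback lands


obs = Observation(
    rgb=np.zeros((2, 224, 224, 3), dtype=np.uint8),
    proprio=np.zeros(28),
    tactile=np.zeros((4, 16, 16)),
)
print(obs.tactile.shape)
```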

On the learning side, the model is designed to continually improve during deployment by learning from human feedback.
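
One plausible shape for that deployment loop is collecting operator corrections and periodically fine-tuning on them. Microsoft hasn't described the mechanism, so everything below (the buffer, `record_feedback`, `policy.update`) is a hypothetical sketch.

```python
from collections import deque

# Hypothetical deployment-time feedback loop: store operator corrections,
# fine-tune once enough of them accumulate.
feedback_buffer: deque = deque(maxlen=10_000)


def record_feedback(instruction, observation, model_action, corrected_action):
    """Log what the model did versus what the operator says it should have done."""
    feedback_buffer.append(
        {
            "instruction": instruction,
            "observation": observation,
            "model_action": model_action,
            "corrected_action": corrected_action,
        }
    )


def maybe_finetune(policy, min_samples: int = 256):
    """Run a small supervised update on accumulated corrections, then clear them."""
    if len(feedback_buffer) < min_samples:
        return
    policy.update(list(feedback_buffer))  # placeholder: e.g. clone the corrected actions
    feedback_buffer.clear()
```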

The training approach combines trajectories from physical demonstrations and simulated tasks with web-scale visual question answering data.
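
In practice that usually means sampling batches from the different sources at fixed ratios. The weights below are made up purely to illustrate the idea; the real mixture isn't public.

```python
import random

# Illustrative mixture weights over the three data sources.
SOURCES = {
    "real_demos": 0.3,        # teleoperated physical demonstrations
    "sim_trajectories": 0.4,  # synthetic rollouts from simulation
    "web_vqa": 0.3,           # image-question-answer pairs for grounding
}

rng = random.Random(0)


def sample_source() -> str:
    """Pick which dataset the next training example is drawn from."""
    names, weights = zip(*SOURCES.items())
    return rng.choices(names, weights=weights, k=1)[0]


counts = {name: 0 for name in SOURCES}
for _ in range(1_000):
    counts[sample_source()] += 1
print(counts)  # roughly tracks the configured weights
```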

Since teleoperation data is scarce and expensive, Microsoft is using NVIDIA Isaac Sim on Azure to generate physically accurate synthetic datasets via reinforcement learning. These simulated trajectories get combined with commercial and open physical demonstration datasets.
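
A generic version of that harvesting step is rolling out a trained policy in simulation and logging the transitions. The `env` and `policy` interfaces below are gym-style placeholders, not the Isaac Sim API.

```python
import numpy as np


def collect_trajectory(env, policy, max_steps: int = 200) -> list[dict]:
    """Roll out one episode and log (observation, action, reward) steps."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append({"obs": obs, "action": action, "reward": reward})
        obs = next_obs
        if done:
            break
    return trajectory


class ToyEnv:
    """Stand-in environment so the sketch runs without a simulator."""

    def __init__(self, horizon: int = 10):
        self.horizon, self.t = horizon, 0

    def reset(self) -> np.ndarray:
        self.t = 0
        return np.zeros(4)

    def step(self, action):
        self.t += 1
        return np.zeros(4), 0.0, self.t >= self.horizon


traj = collect_trajectory(ToyEnv(), policy=lambda obs: np.zeros(2))
print(f"collected {len(traj)} steps")
```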

The model is currently under evaluation on dual-arm setups and humanoid robots. Microsoft is opening an Early Access Program for organizations interested in evaluating Rho-alpha.

Robots that can adapt to dynamic situations and human preferences are more useful in real environments and more trusted by the people operating them.