Philip-MIT/SOLE-R1-8B
Philip-MIT/SOLE-R1-8B is an 8-billion-parameter video-language reward reasoning model developed by Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, and Ondrej Biza. It estimates robot task progress from video frames and a natural-language task description, generating both per-timestep reasoning traces and scalar progress predictions. It is intended to provide dense reward signals for online robot reinforcement learning, particularly when manual reward engineering is impractical. It processes visual observations and task descriptions within a 32,768-token context window and outputs a progress percentage.
SOLE-R1-8B: Video-Language Reward Reasoning for Robotics
SOLE-R1-8B is an 8-billion-parameter model developed by Philip Schroeder et al. that functions as a video-language reward reasoning system for robotics. It estimates how far a robot has progressed on a task by analyzing video frames together with a natural-language task description. The model outputs both a reasoning trace and a scalar progress prediction, and the prediction can be used directly as a dense reward signal for online robot reinforcement learning.
Key Capabilities & Features
- Task Progress Estimation: Predicts robot task progress from visual observations (video) and a natural-language task description.
- Reasoning Traces: Generates per-timestep reasoning traces, explaining its progress assessment.
- Scalar Progress Prediction: Outputs a percentage-based progress estimate (e.g., 22%) suitable for use as a reward function.
- Reinforcement Learning Integration: Specifically designed to provide dense reward signals for robotic reinforcement learning, addressing scenarios where manual reward engineering is challenging.
- Multi-view Support: Trained to reason over visual observations from multiple camera views (e.g., external and wrist cameras).
- RoboReason Interface: Works with the RoboReason library for inference and visualization.
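As an illustration of how a percentage-based progress estimate could feed a reinforcement learning loop, the sketch below converts a sequence of per-timestep progress values into per-step rewards. The function name `progress_to_rewards` and both reward conventions (absolute progress, or the change in progress since the previous timestep) are assumptions made for this example; the model card does not specify which convention the authors use.

```python
from typing import List


def progress_to_rewards(
    progress_percent: List[float],
    use_delta: bool = True,
    initial_progress: float = 0.0,
) -> List[float]:
    """Turn per-timestep progress percentages into per-step rewards.

    Hypothetical helper, not part of SOLE-R1 or RoboReason.
    use_delta=True  -> reward is the change in progress since the previous step.
    use_delta=False -> reward is the absolute progress, scaled to [0, 1].
    """
    rewards: List[float] = []
    previous = initial_progress
    for current in progress_percent:
        if use_delta:
            # Positive when the robot advances, negative when it regresses.
            rewards.append((current - previous) / 100.0)
        else:
            rewards.append(current / 100.0)
        previous = current
    return rewards


if __name__ == "__main__":
    # Example progress values, as percentages, for four timesteps.
    print(progress_to_rewards([0.0, 22.0, 50.0, 40.0]))
    print(progress_to_rewards([0.0, 22.0, 50.0, 40.0], use_delta=False))
```

With the delta convention, a drop in predicted progress yields a negative reward, which may or may not be desirable depending on the learning algorithm; the absolute convention is always non-negative for progress values in the 0 to 100 range.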
How it Works
The model takes a task description, initial progress (typically 0%), previous timestep progress, and visual observations from multiple timesteps and camera views as input. It then generates an output in the format <think>reasoning about task progress</think><answer>progress%</answer>, where the numeric value within the <answer> tag represents the current task progress.
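The output format described above can be parsed with a small amount of standard-library Python. The following is a minimal sketch, assuming the completion contains exactly the <think>...</think><answer>...%</answer> structure; the names `ProgressPrediction` and `parse_progress` are hypothetical and are not part of the RoboReason library. Clamping the value to the 0 to 100 range is also an assumption added for robustness.

```python
import re
from typing import NamedTuple, Optional


class ProgressPrediction(NamedTuple):
    reasoning: str   # text found inside <think>...</think>
    progress: float  # percentage from <answer>, clamped to [0, 100]


# DOTALL lets the reasoning span multiple lines; the percent sign is optional.
_OUTPUT_PATTERN = re.compile(
    r"<think>(?P<reasoning>.*?)</think>\s*"
    r"<answer>\s*(?P<value>-?\d+(?:\.\d+)?)\s*%?\s*</answer>",
    re.DOTALL,
)


def parse_progress(completion: str) -> Optional[ProgressPrediction]:
    """Extract the reasoning trace and progress value from a completion.

    Returns None when the expected tags are missing or malformed.
    """
    match = _OUTPUT_PATTERN.search(completion)
    if match is None:
        return None
    value = float(match.group("value"))
    value = min(100.0, max(0.0, value))  # assumption: clamp out-of-range values
    return ProgressPrediction(match.group("reasoning").strip(), value)


if __name__ == "__main__":
    sample = "<think>The gripper is above the cube.</think><answer>22%</answer>"
    print(parse_progress(sample))
```

Returning None for malformed output lets the caller decide how to handle a failed parse, for example by reusing the previous timestep's progress.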
Training Data
SOLE-R1-8B was trained on the Philip-MIT/sole_training_data dataset, which comprises approximately 2 TB of robot task progress examples, including images, prompts, reasoning completions, and progress labels.
Good for
- Robotics Researchers: Ideal for those developing and experimenting with robot reinforcement learning, especially when seeking automated, dense reward functions.
- Automating Reward Engineering: Useful for scenarios where designing manual reward functions for complex robotic tasks is difficult or time-consuming.
- Understanding Robot Behavior: The generated reasoning traces can provide insights into how the model perceives and evaluates robot task progress.