Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC
Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC is a 1.5 billion parameter Qwen2.5-Instruct variant, fine-tuned to generate both natural language chain-of-thought and discrete motion tokens for 3D human animation. This proof-of-concept model translates action prompts into first-person reasoning and RVQ-encoded movement sequences. It is specifically designed for embodied AI applications where an LLM needs to control physical actions. The model integrates motion token vocabulary directly into chat output, enabling real-time generation of animated human movements.
Model Overview
This model, Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC, is a specialized variant of the Qwen/Qwen2.5-1.5B-Instruct large language model. It is a proof-of-concept designed for embodied AI, enabling an LLM to generate both natural language reasoning and corresponding 3D human motion sequences from a single prompt.
Key Differentiators
- Integrated Motion Generation: Unlike standard LLMs, this model is trained to emit discrete movement tokens directly within its chat output, interleaved with reasoning text.
- Explicit Motion Token Vocabulary: It uses a custom vocabulary including <move>, </move>, and <m_{level}_{value}> tokens to represent quantized motion data.
- Efficient Motion Encoding: Only 3 movement tokens are needed to decode approximately 0.5 seconds of coarse motion, and 10 tokens for detailed motion, allowing for responsive control even with slower LLM inference.
- Chain-of-Thought for Motion: The model produces a first-person chain of thought about the movement alongside the motion tokens.
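As a rough illustration of how interleaved output could be separated into reasoning text and motion tokens, here is a minimal Python sketch. It assumes that {level} and {value} are non-negative integers and that motion tokens always appear inside <move>...</move> blocks; the sample string, the helper names extract_motion_tokens and strip_motion, and the token values are made up for the example and are not taken from the model's actual output.

```python
import re

# Assumed layout: motion tokens sit inside <move>...</move> blocks,
# each token written as <m_{level}_{value}> with integer fields.
MOVE_BLOCK = re.compile(r"<move>(.*?)</move>", re.DOTALL)
MOTION_TOKEN = re.compile(r"<m_(\d+)_(\d+)>")


def extract_motion_tokens(text):
    """Return (level, value) pairs found inside <move> blocks, in order."""
    pairs = []
    for block in MOVE_BLOCK.findall(text):
        for level, value in MOTION_TOKEN.findall(block):
            pairs.append((int(level), int(value)))
    return pairs


def strip_motion(text):
    """Return only the reasoning text, with motion blocks removed."""
    without_blocks = MOVE_BLOCK.sub(" ", text)
    return re.sub(r"\s+", " ", without_blocks).strip()


# Hypothetical model output, invented for this example.
sample = "I lift my right arm. <move><m_0_12><m_1_7><m_2_301></move> Now I wave."
print(extract_motion_tokens(sample))  # [(0, 12), (1, 7), (2, 301)]
print(strip_motion(sample))           # I lift my right arm. Now I wave.
```

Keeping the parser separate from the decoder means the reasoning text can be shown or logged immediately while the motion tokens are handed to the RVQ decoder.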
How it Works
The model takes a natural-language action prompt and, when given a specific system prompt ("You are an embodied AI..."), generates a response containing both descriptive text and motion tokens. An included RVQ (Residual Vector Quantization) decoder then turns these tokens into a 3D human motion sequence. Decoding involves four steps: extracting the tokens, rebuilding an RVQ token matrix, summing the quantizer embeddings, and decoding the resulting latent sequence.
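The middle two decoding steps (rebuilding the token matrix and summing quantizer embeddings) can be sketched in plain Python. This is an illustration of the general RVQ idea, not the included decoder: the function names, the rule that a level-0 token starts a new time step, and the tiny toy codebooks are all assumptions made for the example. The final step, decoding latents into 3D motion, depends on the decoder shipped with the model and is left out.

```python
def build_token_matrix(pairs, num_levels):
    """Group (level, value) pairs into rows of one time step each.

    Assumption: a level-0 token opens a new time step, and the
    following tokens fill higher levels of that same step.
    """
    frames = []
    for level, value in pairs:
        if level == 0:
            frames.append([None] * num_levels)
        if not frames or level >= num_levels:
            continue  # skip tokens that cannot be placed
        frames[-1][level] = value
    return frames


def sum_embeddings(frames, codebooks):
    """Sum the codebook vectors of each time step across RVQ levels."""
    dim = len(codebooks[0][0])
    latents = []
    for frame in frames:
        latent = [0.0] * dim
        for level, value in enumerate(frame):
            if value is None:
                break  # coarse decoding: stop at the first missing level
            vector = codebooks[level][value]
            latent = [a + b for a, b in zip(latent, vector)]
        latents.append(latent)
    return latents


# Toy codebooks (3 levels, 2 entries each, 2-D vectors), invented for the demo.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [-0.5, 0.5]],
    [[0.25, 0.0], [0.0, 0.25]],
]
pairs = [(0, 1), (1, 0), (2, 1), (0, 0), (1, 1)]

frames = build_token_matrix(pairs, num_levels=3)
print(frames)                             # [[1, 0, 1], [0, 1, None]]
print(sum_embeddings(frames, codebooks))  # [[0.5, 1.75], [0.5, 0.5]]
```

Because each RVQ level refines the residual left by the previous one, summing only the first few levels still yields a usable coarse latent, which is consistent with fewer tokens being enough for coarse motion.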
Limitations
As a proof of concept, the model's motion quality depends on the RVQ decoder and on the correctness of the generated tokens. It generally handles basic movements but may struggle with more complex actions. It is intended for research and prototyping.