Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC-2

TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Jun 5, 2026License:cc-by-nc-4.0Architecture:Transformer Open Weights Cold

Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC-2 is a 1.5 billion parameter Qwen2.5-Instruct variant developed by Wojtekb30, specifically fine-tuned for generating human motion sequences. This model uniquely integrates discrete movement tokens directly into its chat output, alongside natural language reasoning, enabling it to produce 3D human motion animations. It functions as a proof-of-concept for embodied AI, translating natural language action prompts into motor actions decodable by an included RVQ model. Its primary strength lies in its ability to generate motion sequences efficiently, requiring only a few tokens for coarse or detailed movements.

Loading preview...

Overview

Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC-2 is a specialized 1.5 billion parameter Qwen2.5-Instruct model designed for embodied AI applications. It uniquely generates both natural language reasoning and discrete motion tokens in response to action prompts. This model is a proof-of-concept for translating linguistic instructions into physical movements, with the motion tokens being decodable into 3D human animation sequences using an integrated RVQ (Residual Vector Quantization) decoder.

Key Capabilities

  • Integrated Motion Generation: Emits explicit motion tokens (<m_level_value>) directly within the chat output, alongside textual reasoning.
  • Efficient Motion Encoding: Uses only 3 movement tokens to decode 0.5 seconds of coarse motion and 10 tokens for detailed motion, allowing for real-time robot or avatar control even with slower LLM inference.
  • First-Person Chain of Thought: Generates a first-person chain of thought about the movement before outputting motor actions.
  • Custom Tokenization: Incorporates special tokens for motion vocabulary (4 x 1024 RVQ bins + move delimiters) into its tokenizer.

What Makes This Model Different

This model stands out by directly embedding motor actions as discrete tokens within its language output, enabling a seamless integration of language and physical action generation. Unlike general-purpose LLMs, it's specifically trained to produce sequences that can be immediately translated into 3D human motion, making it suitable for robotics, animation, and embodied AI research. It's a proof-of-concept, demonstrating the feasibility of generating basic movements, though it may struggle with more complex actions.

Recommended Use Cases

  • Embodied AI Research: Prototyping and research into language-to-motion generation for virtual agents or robots.
  • Animation Generation: Creating basic 3D human motion sequences from natural language descriptions.
  • Proof-of-Concept Demonstrations: Showcasing the integration of LLMs with physical action generation systems.