ArcadaLabs/Ouro-2.6B-Thinking-mlx-bf16
ArcadaLabs/Ouro-2.6B-Thinking-mlx-bf16 is an unquantized (bf16) MLX conversion of ByteDance's Ouro-2.6B-Thinking, a 2.6 billion parameter Looped Language Model. The model is designed for chain-of-thought reasoning and produces visible <think> tokens before generating its final answer. It relies on architectural recurrence, applying 24 layers 4 times (96 effective layers), and it runs on Apple Silicon via MLX, where it offers higher throughput than PyTorch fp16 on MPS.
ArcadaLabs/Ouro-2.6B-Thinking-mlx-bf16 Overview
This model is an unquantized (bf16) MLX conversion of ByteDance/Ouro-2.6B-Thinking, a 2.6 billion parameter Looped Language Model (LoopLM), also described as a Universal Transformer. Its distinguishing feature is chain-of-thought reasoning: it generates explicit <think> tokens before formulating a final response. Underneath, the model uses architectural recurrence: its 24 physical layers are applied 4 times in sequence, giving 24 × 4 = 96 effective layers.
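The layer arithmetic above can be illustrated with a toy sketch. This is not the Ouro or MLX implementation: the function name `looped_forward` and the trivial stand-in blocks are hypothetical, and only the counts (24 blocks, 4 loops) come from the description above.

```python
from typing import Callable, List, Tuple

# A "block" here is any function from a hidden state to a hidden state.
# Real transformer blocks operate on tensors; a float keeps the sketch tiny.
Block = Callable[[float], float]


def looped_forward(x: float, blocks: List[Block], n_loops: int) -> Tuple[float, int]:
    """Apply the same stack of blocks n_loops times, reusing each block."""
    applications = 0
    for _ in range(n_loops):
        for block in blocks:
            x = block(x)          # same block object on every loop (shared weights)
            applications += 1
    return x, applications


# 24 stand-in blocks, each adding 1.0 so the effective depth is easy to read off.
blocks: List[Block] = [(lambda h: h + 1.0) for _ in range(24)]
out, effective_depth = looped_forward(0.0, blocks, n_loops=4)

print(len(blocks), effective_depth)  # 24 96
```

The point of the sketch is that the number of stored blocks stays at 24 while the number of block applications per forward pass is 96.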
Key Capabilities & Features
- Explicit Reasoning: Generates a visible chain-of-thought, allowing users to observe the model's reasoning process.
- MLX Optimization: Converted for efficient inference on Apple Silicon. In informal benchmarks it reached 11.9 tok/s versus 5.0 tok/s for PyTorch fp16 on MPS, roughly 2.4× the throughput.
- Looped Architecture: Utilizes recurrent looping over transformer blocks, enabling deeper processing with fewer physical layers.
- High Context Length: Trained with a 4K context length, extendable up to 64K tokens.
- Unquantized Weights: This conversion keeps the weights in bfloat16 rather than quantizing them, offering a balance between performance and numerical stability.
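Because the reasoning is emitted as part of the generated text, downstream code often wants to separate it from the final answer. The following is a minimal sketch of one way to do that. It assumes the reasoning is wrapped as `<think>...</think>`; the closing tag is an assumption and is not confirmed here, and `split_reasoning` is a hypothetical helper, not part of any library.

```python
import re
from typing import Tuple

# Assumption: reasoning is delimited by <think> ... </think> in the output text.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first think block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(sample)
print(reasoning)  # 2 + 2 is 4.
print(answer)     # The answer is 4.
```

If the actual delimiters differ, only `THINK_RE` needs to change.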
When to Use This Model
- Reasoning-Intensive Tasks: Ideal for applications requiring transparent, step-by-step reasoning, such as complex problem-solving or analysis.
- Apple Silicon Deployment: Excellent choice for developers targeting Apple devices due to its MLX optimization.
- Understanding Model Thought Process: Useful for research or debugging where insight into the model's internal deliberation is beneficial.
- Resource-Efficient Deep Processing: Its looped architecture reuses the same layers across loops, so it provides deep processing with a relatively small parameter count.
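For a rough sense of what these figures mean in practice, the sketch below works through the arithmetic. It uses only the numbers quoted above (2.6 billion parameters, 11.9 tok/s, 5.0 tok/s) plus the fact that bfloat16 uses 2 bytes per value. The helper names are hypothetical, and the result is a back-of-the-envelope estimate of weight storage only, not measured memory use.

```python
PARAMS = 2.6e9          # nominal parameter count from the model name
BYTES_PER_BF16 = 2      # bfloat16 is a 16-bit format
MLX_TOK_S = 11.9        # informal MLX throughput quoted above
MPS_TOK_S = 5.0         # informal PyTorch fp16 on MPS throughput quoted above


def weight_bytes(n_params: float, bytes_per_param: int) -> float:
    """Approximate storage for the weights alone (no activations or KV cache)."""
    return n_params * bytes_per_param


def seconds_for(n_tokens: int, tok_per_s: float) -> float:
    """Time to generate n_tokens at a constant throughput."""
    return n_tokens / tok_per_s


weights_gb = weight_bytes(PARAMS, BYTES_PER_BF16) / 1e9
speedup = MLX_TOK_S / MPS_TOK_S

print(f"weights: about {weights_gb:.1f} GB")                  # weights: about 5.2 GB
print(f"speedup: about {speedup:.2f}x")                       # speedup: about 2.38x
print(f"1000 tokens on MPS: {seconds_for(1000, MPS_TOK_S):.0f} s")  # 1000 tokens on MPS: 200 s
```

Since a thinking model spends many tokens on reasoning before it answers, the throughput difference translates directly into wall-clock time per response.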