ArcadaLabs/Ouro-2.6B-Thinking-mlx-bf16

Text Generation · Concurrency Cost: 1 · Model Size: 2.6B · Quant: BF16 · Ctx Length: 32k · Published: May 3, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

ArcadaLabs/Ouro-2.6B-Thinking-mlx-bf16 is an unquantized (bf16) MLX conversion of ByteDance's Ouro-2.6B-Thinking, a 2.6 billion parameter Looped Language Model. The model is designed for chain-of-thought reasoning and produces visible reasoning tokens before generating its final answer. Its 24 layers are applied recurrently 4 times (96 effective layers), and the MLX format targets efficient inference on Apple Silicon, with higher throughput than PyTorch fp16 on MPS.


ArcadaLabs/Ouro-2.6B-Thinking-mlx-bf16 Overview

This model is an unquantized (bf16) MLX conversion of ByteDance/Ouro-2.6B-Thinking, a 2.6 billion parameter Looped Language Model (LoopLM), an architecture in the Universal Transformer family. Its defining behavior is chain-of-thought reasoning: it writes out a reasoning trace in <think> tokens before formulating its final response. This runs on architectural recurrence, in which 24 physical layers are applied 4 times in a loop to give 96 effective layers.
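The recurrence described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not Ouro's implementation: `looped_forward` and `effective_depth` are hypothetical names, each "layer" is a plain function on a float rather than a transformer block on tensors, and the loop count is fixed at 4. The point it shows is that the same 24 layers (the same weights) are reused on every pass, so depth grows with the loop count while the parameter count does not.

```python
from typing import Callable, List

# Hypothetical stand-in for a transformer block: maps a hidden state to a new one.
# Real blocks operate on tensors; plain floats keep this sketch runnable anywhere.
Layer = Callable[[float], float]


def looped_forward(x: float, layers: List[Layer], num_loops: int) -> float:
    """Apply the same stack of layers num_loops times, reusing it on every pass."""
    for _ in range(num_loops):      # recurrence over the whole stack
        for layer in layers:        # one pass through the physical layers
            x = layer(x)
    return x


def effective_depth(num_physical_layers: int, num_loops: int) -> int:
    """Number of layer applications per forward pass."""
    return num_physical_layers * num_loops


if __name__ == "__main__":
    # 24 toy layers that each add 1.0, so the output counts layer applications.
    toy_layers = [lambda h: h + 1.0 for _ in range(24)]
    print(effective_depth(24, 4))               # 96
    print(looped_forward(0.0, toy_layers, 4))   # 96.0
```

Only 24 layer objects exist in `toy_layers`, yet each input passes through 96 layer applications, which mirrors the 24 physical / 96 effective split described above.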

Key Capabilities & Features

  • Explicit Reasoning: Generates a visible chain-of-thought, allowing users to observe the model's reasoning process.
  • MLX Optimization: Converted for efficient inference on Apple Silicon, reaching 11.9 tok/s versus 5.0 tok/s for PyTorch fp16 on MPS (about 2.4× faster) in informal benchmarks.
  • Looped Architecture: Utilizes recurrent looping over transformer blocks, enabling deeper processing with fewer physical layers.
  • Long Context: Trained with a 4K context length and extendable up to 64K tokens.
  • Unquantized Weights: This conversion keeps the weights in bfloat16 rather than a quantized format, avoiding quantization loss at the cost of a larger memory footprint.
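The figures in the list above support some quick arithmetic. The sketch below is a back-of-envelope estimate, assuming weight storage is simply parameter count times bytes per parameter (2 for bf16, 4 for fp32) and ignoring the KV cache, activations, and file overhead; `weight_bytes` and `speedup` are hypothetical helper names. Because looping reuses the same weights, the 2.6B parameter count, not the 96-layer effective depth, determines weight memory.

```python
# Bytes per parameter for a few common weight formats.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_bytes(num_params: int, dtype: str) -> int:
    """Approximate storage for the weights alone (no KV cache, activations, or overhead)."""
    return num_params * BYTES_PER_PARAM[dtype]


def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Throughput ratio between two measurements in tokens per second."""
    return new_tok_s / old_tok_s


if __name__ == "__main__":
    params = 2_600_000_000  # 2.6B, rounded as in the model card
    print(f"bf16 weights: {weight_bytes(params, 'bf16') / 1e9:.1f} GB")  # 5.2 GB
    print(f"fp32 weights: {weight_bytes(params, 'fp32') / 1e9:.1f} GB")  # 10.4 GB
    print(f"MLX vs PyTorch MPS: {speedup(11.9, 5.0):.2f}x")              # 2.38x
```

Actual memory use during generation will be higher than the weight estimate, since the KV cache grows with context length.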

When to Use This Model

  • Reasoning-Intensive Tasks: Ideal for applications requiring transparent, step-by-step reasoning, such as complex problem-solving or analysis.
  • Apple Silicon Deployment: A good fit for developers targeting Apple Silicon devices, since the weights are already in MLX format.
  • Understanding Model Thought Process: Useful for research or debugging where insight into the model's internal deliberation is beneficial.
  • Resource-Efficient Deep Processing: The looped architecture provides 96 layers of effective depth while storing weights for only 24 layers, so memory use stays close to that of a 2.6B-parameter model (compute per token still scales with the 96 effective layers).
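When inspecting the model's reasoning, it is often convenient to separate the trace from the final answer. The following is a minimal sketch, assuming the trace is delimited by `<think>` and `</think>` and that the answer follows the closing tag; the exact closing tag is an assumption, so check the tokenizer's chat template. `split_reasoning` is a hypothetical helper, not part of MLX or the model repository.

```python
from typing import Tuple


def split_reasoning(text: str,
                    open_tag: str = "<think>",
                    close_tag: str = "</think>") -> Tuple[str, str]:
    """Split generated text into (reasoning, answer).

    If the opening tag is missing (some chat templates put it in the prompt),
    reasoning is taken from the start of the text. If the closing tag is
    missing (e.g. generation stopped mid-thought), everything is reasoning
    and the answer is empty.
    """
    end = text.find(close_tag)
    if end == -1:
        start = text.find(open_tag)
        reasoning = text[start + len(open_tag):] if start != -1 else text
        return reasoning.strip(), ""
    start = text.find(open_tag, 0, end)  # opening tag must precede the closing tag
    reasoning_start = start + len(open_tag) if start != -1 else 0
    reasoning = text[reasoning_start:end]
    answer = text[end + len(close_tag):]
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    raw = "<think>2 + 2 is 4.</think>The answer is 4."
    reasoning, answer = split_reasoning(raw)
    print(reasoning)  # 2 + 2 is 4.
    print(answer)     # The answer is 4.
```

Returning an empty answer for an unterminated trace makes truncated generations easy to detect, for example when the token budget runs out before the model finishes thinking.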