ArcadaLabs/Ouro-2.6B-mlx-bf16

Text Generation | Concurrency Cost: 1 | Model Size: 2.6B | Quant: BF16 | Ctx Length: 32k | Published: May 3, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

ArcadaLabs/Ouro-2.6B-mlx-bf16 is an unquantized bfloat16 MLX conversion of ByteDance's Ouro-2.6B, a 2.6 billion parameter Looped Language Model. The model applies its 24 physical transformer layers recurrently, giving an effective depth of 96 layers and performance beyond what its parameter count would suggest. It was trained on 7.7 trillion tokens with a 4K context length (extendable to 64K). The MLX conversion targets inference on Apple Silicon, where informal benchmarks show higher throughput than PyTorch fp16 on MPS.


ArcadaLabs/Ouro-2.6B-mlx-bf16: A Looped Language Model for MLX

This model is an unquantized (bfloat16) conversion of ByteDance's Ouro-2.6B for Apple Silicon, built on the MLX framework. Ouro-2.6B is a Looped Language Model (LoopLM), a Universal Transformer-style architecture that applies its stack of 24 transformer blocks 4 times in sequence, for an effective depth of 24 × 4 = 96 layers. Reusing the same weights on each pass lets it reach performance typically associated with larger models while keeping the parameter count at 2.6 billion.
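
The looping scheme described above can be sketched in a few lines of plain Python. This is a toy illustration, not the Ouro or mlx-lm implementation: `ToyBlock` and `looped_forward` are hypothetical names, and each block is a trivial residual scaling rather than attention plus an MLP. It only shows how 24 weight-sharing blocks applied 4 times yield 96 block applications.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ToyBlock:
    """Hypothetical stand-in for one transformer block (not the real Ouro block)."""
    weight: float
    calls: int = 0  # how many times this block has been applied

    def __call__(self, x: Vector) -> Vector:
        self.calls += 1
        # Residual update x + f(x); here f is just a scaling by `weight`.
        return [v + self.weight * v for v in x]


def looped_forward(blocks: List[ToyBlock], x: Vector, num_loops: int) -> Vector:
    """Run the whole stack of blocks `num_loops` times, reusing the same weights."""
    for _ in range(num_loops):
        for block in blocks:
            x = block(x)
    return x


# 24 physical blocks, looped 4 times, as described for Ouro-2.6B.
blocks = [ToyBlock(weight=0.01) for _ in range(24)]
out = looped_forward(blocks, [1.0, 2.0], num_loops=4)

physical_layers = len(blocks)
effective_depth = sum(block.calls for block in blocks)
print(physical_layers, effective_depth)  # 24 96
```

The number of stored blocks stays at 24 throughout; only the number of applications grows with the loop count.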

Key Capabilities & Features

  • Efficient MLX Conversion: Provides an unquantized bfloat16 version for Apple Silicon with roughly 1.7x faster inference than PyTorch fp16 on MPS (12.0 tok/s vs. 6.9 tok/s in informal benchmarks).
  • Looped Architecture: Employs recurrent processing of transformer blocks, enhancing reasoning capabilities and performance relative to its parameter count.
  • Extensive Training: Trained on 7.7 trillion tokens, which supports its language understanding and generation quality.
  • Adaptable Context: Trained with a 4K-token context length, extendable to 64K for longer inputs.
  • Custom MLX Integration: Requires a custom ouro.py model file for mlx-lm to support its unique architecture, including sandwich RMSNorm and KV cache management for recurrent passes.
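
To put the informal benchmark figures above in perspective, the short calculation below derives the speedup ratio and the time needed to generate a fixed number of tokens. The two throughput values come from the list above; the 512-token response length is an assumption chosen only for illustration, and the constant and function names are hypothetical.

```python
# Throughput figures quoted in the informal benchmark (tokens per second).
MLX_BF16_TOK_S = 12.0
TORCH_FP16_MPS_TOK_S = 6.9

# Assumption for illustration only: a 512-token response.
RESPONSE_TOKENS = 512


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit `num_tokens` at a steady decoding rate."""
    return num_tokens / tokens_per_second


speedup = MLX_BF16_TOK_S / TORCH_FP16_MPS_TOK_S
mlx_seconds = generation_seconds(RESPONSE_TOKENS, MLX_BF16_TOK_S)
mps_seconds = generation_seconds(RESPONSE_TOKENS, TORCH_FP16_MPS_TOK_S)

print(f"speedup: {speedup:.2f}x")
print(f"MLX bf16: {mlx_seconds:.1f} s, PyTorch fp16 on MPS: {mps_seconds:.1f} s")
```

The estimate assumes a constant decoding rate and ignores prompt processing time, so it is a rough guide rather than a measurement.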

Ideal Use Cases

  • Apple Silicon Development: Developers targeting Apple Silicon for local LLM inference will benefit from its MLX optimization.
  • Resource-Constrained Environments: The looped architecture reuses the same weights across passes, making it a strong candidate where output quality matters but the memory footprint of larger, non-looped models is too high.
  • Experimental MLX Projects: Useful for those exploring advanced model architectures and custom MLX integrations.
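
The memory argument above can be made concrete with a back-of-the-envelope estimate of weight storage. The sketch assumes exactly 2.6 billion parameters (the nominal, rounded figure) and counts only the weights: it ignores the KV cache, activations, and framework overhead, so real memory use will be higher. The fp32 row is included purely for contrast.

```python
# Assumption: the nominal "2.6B" figure is treated as exactly 2.6e9 parameters.
NOMINAL_PARAMS = 2.6e9

# Storage size per parameter for common dtypes, in bytes.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_bytes(num_params: float, dtype: str) -> float:
    """Bytes needed to store the weights alone in the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype]


bf16_gb = weight_bytes(NOMINAL_PARAMS, "bf16") / 1e9       # decimal gigabytes
bf16_gib = weight_bytes(NOMINAL_PARAMS, "bf16") / 2 ** 30  # binary gibibytes
fp32_gb = weight_bytes(NOMINAL_PARAMS, "fp32") / 1e9

print(f"bf16 weights: {bf16_gb:.1f} GB ({bf16_gib:.2f} GiB)")
print(f"fp32 weights: {fp32_gb:.1f} GB")
```

Because the looped passes reuse the same 24 blocks, the effective depth of 96 layers does not multiply this weight footprint.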