ByteDance/Ouro-2.6B
ByteDance/Ouro-2.6B is a 2.6 billion parameter Looped Language Model (LoopLM) developed by ByteDance, designed for exceptional parameter efficiency. It achieves performance comparable to 3-4B parameter standard transformers by utilizing iterative shared-weight computation and recurrent steps. This model excels at latent reasoning through recurrent computation and supports adaptive computation with early exit mechanisms. It is primarily intended for research purposes exploring efficient and adaptive language model architectures.
Overview
ByteDance/Ouro-2.6B is a 2.6 billion parameter Looped Language Model (LoopLM). Instead of stacking more layers, it applies the same weights repeatedly over several recurrent steps, which lets it reach performance on par with standard transformers of 3-4 billion parameters. The model was trained on 7.7 trillion tokens spanning web data, code, mathematics, and long-context documents.
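The idea of iterative shared-weight computation can be illustrated with a toy sketch. This is not Ouro's implementation: the "block" below is a two-parameter scalar function rather than a stack of transformer layers, and the names `shared_block` and `looped_forward` are hypothetical. It only shows that the number of steps can grow while the parameter set stays fixed.

```python
import math


def shared_block(hidden, weight, bias):
    """One pass through a toy block: an affine map followed by tanh, per element."""
    return [math.tanh(weight * h + bias) for h in hidden]


def looped_forward(hidden, weight, bias, steps):
    """Apply the same block `steps` times, reusing one set of parameters.

    Returns the list of hidden states, starting with the input.
    """
    states = [hidden]
    for _ in range(steps):
        hidden = shared_block(hidden, weight, bias)
        states.append(hidden)
    return states


# Four recurrent steps, yet still only two parameters (weight and bias).
states = looped_forward([0.5, -1.0, 2.0], weight=0.8, bias=0.1, steps=4)
print(len(states) - 1, "recurrent steps computed with one shared parameter set")
```

A standard transformer would need a separate set of weights for each additional layer. Here, extra depth comes from repeating the loop.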
Key Capabilities
- Exceptional Parameter Efficiency: Matches the performance of larger models (3-4B parameters) with only 2.6 billion parameters through its unique LoopLM architecture.
- Iterative Latent Reasoning: Performs reasoning tasks by recurrently processing information within its latent space.
- Adaptive Computation: Features an adaptive exit mechanism, allowing for dynamic allocation of computational resources based on the task's complexity. This can be configured via `early_exit_threshold` in `config.json`.
- Configurable Recurrent Steps: Users can adjust the `total_ut_steps` parameter to control the number of recurrent computations, balancing performance and inference speed.
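As a minimal sketch of how these two settings might be changed, the helper below edits a local copy of `config.json` using only the standard library. The key names `total_ut_steps` and `early_exit_threshold` come from the text above. The helper name `override_loop_settings`, the validation rule, and the example values are assumptions for illustration, not documented defaults or recommended settings.

```python
import json
from pathlib import Path


def override_loop_settings(config_path, total_ut_steps=None, early_exit_threshold=None):
    """Update the loop-related keys in a local config.json and return the new config.

    Only the keys passed as arguments are changed; all other keys are preserved.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    if total_ut_steps is not None:
        # Assumption: at least one recurrent step is required.
        if int(total_ut_steps) < 1:
            raise ValueError("total_ut_steps must be at least 1")
        config["total_ut_steps"] = int(total_ut_steps)

    if early_exit_threshold is not None:
        config["early_exit_threshold"] = float(early_exit_threshold)

    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call (hypothetical path and values):
# override_loop_settings("Ouro-2.6B/config.json", total_ut_steps=3, early_exit_threshold=0.9)
```

Fewer recurrent steps should reduce inference cost at some cost in quality, so it is worth measuring both on the target task before settling on a value.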
Architecture and Training
Ouro-2.6B is based on a decoder-only Transformer architecture with 24 layers, a hidden size of 2048, and a vocabulary of 49,152 tokens. It uses RoPE for position embeddings and Sandwich RMSNorm for normalization. The model was trained through a multi-stage pipeline consisting of pre-training, CT annealing, long-context training, and mid-training. The training context length was 4K tokens, and the context is extendable to 64K.
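To put these figures in perspective, the arithmetic below works out two derived quantities: the size of a single token-embedding matrix, and the number of layer applications per forward pass once looping is taken into account. The step count used here is an assumption chosen for illustration, and the function names are hypothetical. The calculation does not cover the rest of the parameter budget.

```python
# Figures quoted in the text above.
NUM_LAYERS = 24
HIDDEN_SIZE = 2048
VOCAB_SIZE = 49_152

# Assumption for illustration only: the number of recurrent steps.
ASSUMED_UT_STEPS = 4


def embedding_parameters(vocab_size, hidden_size):
    """Parameters in one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def layer_applications(num_layers, ut_steps):
    """Layer evaluations per forward pass when the full stack is looped ut_steps times."""
    return num_layers * ut_steps


print(embedding_parameters(VOCAB_SIZE, HIDDEN_SIZE))      # 100663296
print(layer_applications(NUM_LAYERS, ASSUMED_UT_STEPS))   # 96
```

Under the assumed step count, the model performs as many layer evaluations as a much deeper network would, while storing only one copy of the layer weights.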
Intended Use
This model is primarily intended for research purposes to explore and develop efficient language model architectures. Developers interested in parameter-efficient models, iterative reasoning, or adaptive computation will find Ouro-2.6B particularly relevant. Note that the adaptive exit feature is not currently supported in vLLM, where the model will always execute all recurrent steps.