amandaa/AutoL2S-Plus-7b

Text generation · Concurrency cost: 1 · Model size: 7.6B · Quantization: FP8 · Context length: 32k · License: apache-2.0 · Architecture: Transformer · Open weights

amandaa/AutoL2S-Plus-7b is a 7.6-billion-parameter model fine-tuned for efficient reasoning, building on the AutoL2S framework. It is trained with a two-stage pipeline: Long-Short Concatenated Distillation followed by off-policy Reinforcement Learning with a length-aware objective. The model is optimized to generate shorter reasoning paths while preserving correctness, making it well suited to tasks that need concise, accurate logical deduction, and giving it better reasoning efficiency than its base model.


AutoL2S-Plus-7b: Efficient Reasoning with Length-Aware RL

amandaa/AutoL2S-Plus-7b is a 7.6-billion-parameter model developed by amandaa and engineered for efficient reasoning. It builds on the AutoL2S framework, which uses a two-stage training methodology to strengthen reasoning capabilities while optimizing for conciseness.

Key Capabilities & Training:

  • Two-Stage Training: The model undergoes Supervised Fine-Tuning (SFT) followed by off-policy Reinforcement Learning (RL).
  • Long-Short Concatenated Distillation (Stage 1): This initial phase trains the model on paired long and short chains of thought (CoT), with a special <EASY> token that switches the model automatically between the two reasoning modes (see the data-format sketch after this list). The base SFT model is amandaa/AutoL2S-7b.
  • Off-Policy RL with Length-Aware Objective (Stage 2): This stage refines reasoning efficiency by rewarding shorter reasoning paths while preserving correctness. It uses a PPO-style clipped loss and treats the SFT model, whose long- and short-form outputs supply the training trajectories, as the reference policy (see the objective sketch below).
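
The card does not publish the Stage 1 data format, but a hypothetical sketch of a long-short concatenated training example might look like the following. The field names ("question", "short_cot", "long_cot") and the exact placement of the <EASY> token are illustrative assumptions, not the authors' specification.

```python
def build_sft_example(question: str, short_cot: str,
                      long_cot: str, is_easy: bool) -> str:
    """Hypothetical long-short concatenated SFT target. For easy
    problems the model learns to emit <EASY> followed by the short
    CoT; otherwise it produces the full long CoT."""
    if is_easy:
        # <EASY> acts as the mode switch: once emitted, the model
        # commits to the concise reasoning path.
        return f"{question}\n<EASY>\n{short_cot}"
    return f"{question}\n{long_cot}"

example = build_sft_example(
    question="What is 17 * 6?",
    short_cot="17 * 6 = 102. Answer: 102",
    long_cot="Break it down: 17 * 6 = (10 + 7) * 6 = 60 + 42 = 102. "
             "Answer: 102",
    is_easy=True,
)
print(example)
```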
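Likewise, the exact Stage 2 reward and loss are not given here. Below is a minimal sketch of what a length-aware, PPO-style clipped objective could look like, assuming a correctness-gated length discount; the coefficient values and the gating are assumptions, not the published method.

```python
import torch

def shaped_reward(correct: bool, num_tokens: int,
                  length_coef: float = 1e-3) -> float:
    """Assumed length-aware reward: correctness first, discounted by
    length so shorter correct traces score higher. Wrong answers earn
    nothing, so brevity can never outrank correctness."""
    return float(correct) * max(0.0, 1.0 - length_coef * num_tokens)

def clipped_pg_loss(logp_policy: torch.Tensor,
                    logp_ref: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate. In the off-policy setup described
    above, logp_ref would be scored by the frozen SFT model acting as
    the reference policy."""
    ratio = torch.exp(logp_policy - logp_ref)
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    # Minimize the negative surrogate to maximize the shaped reward.
    return -surrogate.mean()
```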

Good for:

  • Applications requiring efficient and concise reasoning.
  • Tasks where balancing accuracy with output length is critical.
  • Developers looking for a model optimized for logical deduction with reduced verbosity.

Serving this model with vLLM is recommended for the best inference performance.
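
A minimal vLLM inference sketch follows. The prompt and sampling parameters are illustrative, and depending on how the model was fine-tuned you may need to apply its chat template to the prompt first.

```python
from vllm import LLM, SamplingParams

# Load the model with vLLM; 32768 matches the 32k context length
# listed in the card metadata.
llm = LLM(model="amandaa/AutoL2S-Plus-7b", max_model_len=32768)

# Illustrative sampling settings, not tuned values from the authors.
params = SamplingParams(temperature=0.6, max_tokens=4096)

outputs = llm.generate(
    ["Solve step by step: if 3x + 5 = 20, what is x?"],
    params,
)
print(outputs[0].outputs[0].text)
```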