cs-552-2026-MMRF/3000Alpaca_15kDPO

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:2BQuant:BF16Ctx Length:32kPublished:May 18, 2026Architecture:Transformer Warm

The cs-552-2026-MMRF/3000Alpaca_15kDPO is a 2 billion parameter language model, fine-tuned from the 3000alpaca base model using Direct Preference Optimization (DPO). This model is designed for generating high-quality, preference-aligned text responses, leveraging its 32768 token context length. It specializes in producing outputs that align with human preferences, making it suitable for conversational AI and instruction-following tasks.

Loading preview...

Model Overview

cs-552-2026-MMRF/3000Alpaca_15kDPO is a 2 billion parameter language model, building upon the cs-552-2026-MMRF/3000alpaca base model. Its key differentiator is the application of Direct Preference Optimization (DPO) during its training procedure, a method detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". This fine-tuning approach aims to align the model's outputs more closely with human preferences.

Key Capabilities

  • Preference-Aligned Text Generation: Excels at generating responses that are optimized based on human preferences, a direct result of its DPO training.
  • Instruction Following: Capable of understanding and responding to user instructions effectively, as demonstrated by its quick start example.
  • Extended Context Window: Features a substantial context length of 32768 tokens, allowing it to process and generate longer, more coherent texts.

Training Details

The model was fine-tuned using the TRL library, specifically implementing the DPO algorithm. This method directly optimizes a language model to align with human preferences without the need for a separate reward model. The training utilized TRL version 1.3.0, Transformers 5.7.0, Pytorch 2.10.0+cu128, Datasets 4.8.5, and Tokenizers 0.22.2.