Model Overview
The normster/RealGuardrails-Qwen2.5-7B-SFT-DPO is a 7.6 billion parameter language model built upon the Qwen2.5 architecture, designed to excel in system prompt adherence and precedence. It leverages a substantial 32768 token context length, allowing for complex and lengthy instructions.
Key Capabilities
- Enhanced System Prompt Adherence: The model was initially fine-tuned via Supervised Fine-Tuning (SFT) on the
systemmix split of the RealGuardrails dataset, comprising 150,000 examples. This process specifically trained the model to follow system-level instructions more reliably. - Improved Preference Alignment: Further training involved Direct Preference Optimization (DPO) on the
preferencemix split (30,000 examples) of the RealGuardrails dataset. This DPO phase refines the model's responses to align better with desired behaviors and preferences, particularly concerning guardrail enforcement. - Robust Training Methodology: The model was developed using
normster's custom training library, torchllms, ensuring a controlled and optimized training environment.
Training Details
The DPO training utilized specific hyperparameters including a beta of 0.01, AdamW optimizer, a batch size of 128, and a learning rate of 1e-5 with a cosine scheduler. Training was conducted for 1 epoch with bf16 precision and a maximum sequence length of 4096 tokens.
Ideal Use Cases
This model is particularly well-suited for applications where strict adherence to predefined rules, safety guidelines, or specific output formats (guardrails) is critical. It can be beneficial in scenarios requiring reliable instruction following and robust control over model behavior.