normster/RealGuardrails-Qwen2.5-7B-SFT-DPO
The normster/RealGuardrails-Qwen2.5-7B-SFT-DPO model is a 7.6 billion parameter language model based on the Qwen2.5 architecture, featuring a 32768 token context length. Developed by normster, this model is specifically fine-tuned using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) on the RealGuardrails dataset. Its primary differentiator is an enhanced ability to adhere to system prompts and maintain precedence, making it highly effective for applications requiring strict guardrail enforcement.
Loading preview...
Model Overview
The normster/RealGuardrails-Qwen2.5-7B-SFT-DPO is a 7.6 billion parameter language model built upon the Qwen2.5 architecture, designed to excel in system prompt adherence and precedence. It leverages a substantial 32768 token context length, allowing for complex and lengthy instructions.
Key Capabilities
- Enhanced System Prompt Adherence: The model was initially fine-tuned via Supervised Fine-Tuning (SFT) on the
systemmixsplit of the RealGuardrails dataset, comprising 150,000 examples. This process specifically trained the model to follow system-level instructions more reliably. - Improved Preference Alignment: Further training involved Direct Preference Optimization (DPO) on the
preferencemixsplit (30,000 examples) of the RealGuardrails dataset. This DPO phase refines the model's responses to align better with desired behaviors and preferences, particularly concerning guardrail enforcement. - Robust Training Methodology: The model was developed using
normster's custom training library, torchllms, ensuring a controlled and optimized training environment.
Training Details
The DPO training utilized specific hyperparameters including a beta of 0.01, AdamW optimizer, a batch size of 128, and a learning rate of 1e-5 with a cosine scheduler. Training was conducted for 1 epoch with bf16 precision and a maximum sequence length of 4096 tokens.
Ideal Use Cases
This model is particularly well-suited for applications where strict adherence to predefined rules, safety guidelines, or specific output formats (guardrails) is critical. It can be beneficial in scenarios requiring reliable instruction following and robust control over model behavior.