Jihyung803/Qwen3-8B-SOCIALIQA-DPO
Jihyung803/Qwen3-8B-SOCIALIQA-DPO is an 8 billion parameter language model fine-tuned from Qwen/Qwen3-8B using Direct Preference Optimization (DPO). This model is specifically trained to enhance its ability to generate responses aligned with human preferences, particularly in social intelligence and conversational contexts. It leverages a 32K token context length, making it suitable for nuanced and extended interactions. The DPO training method aims to improve the model's helpfulness and harmlessness in open-ended dialogue.
Loading preview...
Model Overview
Jihyung803/Qwen3-8B-SOCIALIQA-DPO is an 8 billion parameter language model derived from the Qwen3-8B architecture. This model has undergone fine-tuning using the Direct Preference Optimization (DPO) method, a technique designed to align language models with human preferences by treating preference data as implicit reward signals. The training process utilized the TRL (Transformer Reinforcement Learning) framework.
Key Capabilities
- Preference Alignment: Fine-tuned with DPO to generate responses that are more aligned with human preferences, potentially leading to more helpful and less harmful outputs.
- Conversational AI: Optimized for social intelligence tasks, making it suitable for generating nuanced and contextually appropriate responses in dialogue.
- Base Model Strength: Benefits from the robust capabilities of the Qwen3-8B base model, including its 32,768 token context length, allowing for processing and generating longer, more complex texts.
Training Details
The model was trained using the DPO method, as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." This approach directly optimizes a policy to satisfy human preferences without explicitly training a separate reward model. The training procedure was tracked and can be visualized via Weights & Biases. Key frameworks used include TRL 0.25.0 and Transformers 4.57.6.