Model Overview
albertfares/MNLP_SFT_DPO is a 0.8-billion-parameter language model developed by albertfares. It is a fine-tuned variant of Qwen/Qwen3-0.6B-Base, trained with filtered Direct Preference Optimization (fDPO) on the MNLP M3 DPO dataset (~69,000 samples) to align the model's outputs with human preferences.
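For orientation, here is a minimal PyTorch sketch of the standard DPO objective that fDPO builds on. The function name and the `beta` default are illustrative assumptions, not confirmed training settings, and the data-filtering step that distinguishes fDPO from plain DPO is not shown.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the trained policy against a frozen reference model.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the margin between chosen and rejected responses apart;
    # beta=0.1 is a common default, assumed here for illustration.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```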
Key Characteristics
- Base Model: Qwen/Qwen3-0.6B-Base
- Fine-tuning Method: Filtered Direct Preference Optimization (fDPO)
- Training Data: MNLP M3 DPO dataset (~69k samples)
- Parameter Count: 0.8 billion
- Context Length: 40,960 tokens
- Format: SafeTensors weights, for safer and faster loading (see the loading sketch after this list).
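Since the weights ship as SafeTensors, they load directly with the transformers library. A minimal loading sketch, assuming the repo id from this card; the dtype choice is an assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "albertfares/MNLP_SFT_DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# SafeTensors weights are picked up automatically when present in the repo.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```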
Use Cases
This model is particularly suited for applications where preference-based alignment is crucial, such as:
- Generating responses that adhere to specific stylistic or content preferences.
- Tasks requiring nuanced understanding and generation based on comparative feedback.
- Scenarios where a small, efficiently loaded model with DPO-enhanced alignment is beneficial (a usage sketch follows this list).
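As a quick start, a usage sketch with the transformers pipeline API; the prompt and generation settings are illustrative assumptions, not recommendations from the model authors:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="albertfares/MNLP_SFT_DPO")
# Illustrative prompt and settings; tune max_new_tokens and sampling as needed.
output = generator(
    "Rewrite this sentence in a formal tone: the results look pretty good.",
    max_new_tokens=64,
    do_sample=False,
)
print(output[0]["generated_text"])
```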