Model Overview
kenzrx/dpo-qwen-cot-merged is a 4-billion-parameter instruction-tuned language model derived from the Qwen3-4B-Instruct-2507 base. The model was trained in two stages (supervised fine-tuning followed by direct preference optimization) to improve response quality and alignment.
Training Methodology
- Supervised Fine-Tuning (SFT): First, the model was fine-tuned on the structured_data_with_cot_dataset_v2 dataset to learn high-quality reference answers and specific formatting requirements. This stage established a strong foundation for generating coherent, well-structured outputs.
- Direct Preference Optimization (DPO): Following SFT, the model was further optimized with DPO, which aligns outputs toward preferred responses by training the model to favor "chosen" completions over "rejected" ones for the same prompt. The DPO run, conducted for 1 epoch with a learning rate of 1e-07 and a beta of 0.1, significantly improved response alignment and the overall structured quality of the generated text.
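The "favor chosen over rejected" objective above can be sketched as the per-example DPO loss. This is a minimal illustrative implementation, not the actual training code: the function name and the log-probability values in the example are hypothetical, while beta=0.1 comes from this card.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is the summed log-probability of a completion under the
    policy being trained or the frozen reference model. beta (0.1 here, as
    stated in this card) scales how strongly preferences shape the update.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp       # log pi_theta/pi_ref for chosen
    rejected_ratio = policy_rejected_logp - ref_rejected_logp # log pi_theta/pi_ref for rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))         # -log sigmoid(logits)

# Illustrative numbers: once the policy shifts probability toward the
# chosen answer relative to the reference, the loss drops below its
# no-preference starting point of -log sigmoid(0) = log 2.
improving = dpo_loss(-10.0, -12.0, -11.0, -11.0)  # policy now prefers chosen
neutral   = dpo_loss(-11.0, -11.0, -11.0, -11.0)  # no preference learned yet
print(improving < neutral)
```

Minimizing this loss pushes the policy's log-probability margin for the chosen completion above the rejected one, relative to the reference model, which is what the alignment gains described above refer to.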
Key Characteristics
- Enhanced Response Alignment: The DPO stage specifically targets and improves the model's ability to generate outputs that are aligned with desired preferences and quality standards.
- Structured Output Quality: Optimized for producing well-formatted and structured answers, making it suitable for tasks where output consistency is crucial.
- Full Merged Weights: Provided as full-merged 16-bit weights, eliminating the need for adapter loading during deployment.
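Because the weights are fully merged, the model can be loaded directly without a PEFT/adapter step. A minimal loading sketch, assuming the standard transformers AutoModel API (the helper function itself is illustrative, not part of this repository):

```python
def load_model(model_id: str = "kenzrx/dpo-qwen-cot-merged"):
    """Illustrative sketch: load the merged 16-bit weights directly.

    No adapter loading (e.g. peft.PeftModel.from_pretrained) is required,
    since the DPO-trained weights are already merged into the base model.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # weights ship in 16-bit precision
        device_map="auto",
    )
    return tokenizer, model
```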
Ideal Use Cases
This model is particularly well-suited for applications requiring:
- Generation of high-quality, structured responses.
- Tasks where adherence to specific formatting or answer styles is important.
- Scenarios that benefit from preference-aligned outputs.