Model Overview
This model, takeshi200ok/qwen3-4B-dpo-anti-fence-240slow26, is a 4-billion-parameter language model developed by takeshi200ok. It is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 base model, trained with Direct Preference Optimization (DPO) via the Unsloth library.
Key Characteristics
- Optimization Objective: The primary goal of this DPO training was to align the model's responses with preferred outputs, specifically focusing on enhancing:
  - Reasoning (Chain-of-Thought): Improving the model's ability to generate logical, step-by-step thought processes.
  - Structured Response Quality: Producing more coherent and well-organized outputs based on a preference dataset.
- Training Process: The DPO training was initiated from an existing SFT (Supervised Fine-Tuning) LoRA adapter and resulted in a fully merged 16-bit model combining the base, SFT, and DPO weights, so no adapter loading is required at inference time.
- Configuration: Training involved 1 epoch with a learning rate of 3e-06, a beta value of 0.05, and a maximum sequence length of 3072.
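To make the configuration concrete, the following is a minimal sketch of the standard DPO objective with the beta value used here (0.05). This is an illustrative implementation of the published DPO loss formula, not the actual Unsloth training code; the log-probability values in the example are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.05) -> float:
    """Standard DPO loss for a single preference pair.

    loss = -log sigmoid(beta * [(pi_chosen - ref_chosen)
                                - (pi_rejected - ref_rejected)])
    A smaller beta (here 0.05) allows the policy to drift further
    from the reference model before the KL-like penalty dominates.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) \
             - (policy_rejected_logp - ref_rejected_logp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical sequence log-probs: when the policy matches the
# reference exactly, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # ~0.6931
```

As the policy assigns relatively more probability to the chosen response than the reference does, the margin grows and the loss falls toward zero.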
Usage and Integration
As a fully merged 16-bit model, it can be used directly with the transformers library for inference. The model was trained on the u-10bei/dpo-dataset-qwen-cot dataset and is released under the MIT License, while also adhering to the original base model's license terms.
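A minimal inference sketch with the transformers library is shown below. It assumes a recent transformers release with chat-template support and a GPU with enough memory for 16-bit 4B weights; the helper function and its parameters are illustrative, and model loading is deferred inside the function so the module imports cheaply.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "takeshi200ok/qwen3-4B-dpo-anti-fence-240slow26"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    # The weights are fully merged, so no PEFT/LoRA adapter loading is needed.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Format the request with the Qwen3 chat template inherited
    # from the instruct base model.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs.shape[-1]:], skip_special_tokens=True
    )
```

Note the maximum sequence length used in training was 3072 tokens, so prompts plus generated output beyond that length fall outside the fine-tuning distribution.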