takeshi200ok/qwen3-4B-dpo-anti-fence-600 is a 4-billion-parameter, Qwen3-based, instruction-tuned causal language model, fine-tuned with Direct Preference Optimization (DPO) by takeshi200ok. The fine-tuning targets stronger Chain-of-Thought reasoning and more structured responses, making the model suited to tasks that require high-quality, preference-aligned outputs.
## Model Overview
This model, takeshi200ok/qwen3-4B-dpo-anti-fence-600, is a 4-billion-parameter language model based on Qwen3-4B-Instruct-2507. It was fine-tuned by takeshi200ok using Direct Preference Optimization (DPO) via the Unsloth library, starting from an SFT LoRA adapter. The final artifact is a fully merged 16-bit model, so no adapter loading is required when using it with `transformers`.
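Because the checkpoint is fully merged, loading is a standard `from_pretrained` call. The sketch below assumes a recent `transformers` release; the `bfloat16` dtype is an inference from the card's "16-bit" note, not a documented requirement.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "takeshi200ok/qwen3-4B-dpo-anti-fence-600"

# The DPO weights are already merged into the base model,
# so no PEFT/adapter loading step is required.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",  # merged artifact is 16-bit per the card
    device_map="auto",       # requires accelerate; uses GPU if available
)
```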
## Key Capabilities
- Enhanced Reasoning: Optimized to strengthen Chain-of-Thought reasoning.
- Structured Responses: Produces higher-quality, more structured outputs.
- Preference Alignment: Aligned with preferred outputs through DPO training on a dedicated preference dataset.
## Training Details
The model was trained for one epoch of DPO with a learning rate of 1e-5, a DPO beta of 0.1, and a maximum sequence length of 3072 tokens. The training data was u-10bei/dpo-dataset-qwen-cot.
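The card does not publish the full training script; the following is a hypothetical reconstruction using TRL's `DPOTrainer` with the stated hyperparameters. The author's actual Unsloth pipeline, which started from an SFT LoRA adapter, may differ in detail, and argument names follow recent TRL versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference dataset named on the card.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

config = DPOConfig(
    output_dir="qwen3-4b-dpo",
    num_train_epochs=1,   # one epoch, per the card
    learning_rate=1e-5,   # stated learning rate
    beta=0.1,             # DPO KL-penalty coefficient, per the card
    max_length=3072,      # stated maximum sequence length
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```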
## Usage & License
Because the weights are fully merged, the model can be loaded directly with the `transformers` library, as shown above. It is released under the MIT License, in keeping with the terms of its training dataset; users must also comply with the license terms of the original base model.
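Continuing from the loading sketch above, here is a minimal generation example using the model's chat template; the prompt and generation settings are illustrative only.

```python
messages = [
    {"role": "user", "content": "Walk through the reasoning: why does ice float on water?"}
]

# Qwen3 instruct models ship a chat template, so apply_chat_template
# handles role formatting and special tokens.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```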