ottys/dpo-qwen-cot-merged
The ottys/dpo-qwen-cot-merged model is a 4 billion parameter instruction-tuned causal language model based on the Qwen3-4B-Instruct-2507 architecture. Developed by ottys, it utilizes Direct Preference Optimization (DPO) on a filtered subset of official DPO data, focusing on enhancing structured data output accuracy and Chain-of-Thought reasoning. With a 32768 token context length, this model is specifically optimized for tasks requiring precise structured outputs and improved inference processes.
Loading preview...
Overview
This model, ottys/dpo-qwen-cot-merged, is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base. It was developed by ottys using Direct Preference Optimization (DPO) as part of a competition, adhering strictly to specified guidelines regarding base model, training methodology, and data usage.
Key Capabilities
- Enhanced Structured Data Output: The model is specifically trained to improve the accuracy of structured data generation.
- Improved Chain-of-Thought (CoT) Reasoning: It aims to strengthen the model's ability to articulate its reasoning process.
- DPO Fine-tuning: Utilizes DPO with a filtered, high-quality subset of official DPO data, focusing on specific tasks.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. The maximum sequence length used during training was 512 tokens. Notably, no new data was generated or modified using AI; all training data was selected from the provided official dataset.
Usage
For evaluation, users are instructed to use the provided "2026 final assignment main competition_standard code 2 (submission JSON generation)" by replacing the model ID with ottys/dpo-qwen-cot-merged.
Licensing
The base model operates under the Apache 2.0 license, and the training data consists solely of the officially distributed dataset.