Name: sfutenma/dpo-qwen3_4b-cot-merged_v260301-220140 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: sfutenma

Model Overview

This model, sfutenma/dpo-qwen3_4b-cot-merged_v260301-220140, is a 4 billion parameter language model derived from sfutenma/lora_structeval_t_qwen3_4b_v260228-172650. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting enhanced reasoning and structured response generation.

Key Capabilities

Improved Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, aligning responses with preferred outputs.
Structured Response Quality: Enhanced ability to produce high-quality, structured answers based on preference datasets.
DPO Fine-tuning: Leverages DPO for better alignment and coherence in generated text.
Merged Weights: Provides full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment with transformers.

Training Details

The model was trained for 5 epochs with a learning rate of 2e-05 and a beta value of 0.03. It utilized a maximum sequence length of 768 during training and incorporated LoRA with r=8 and alpha=16, which has been merged into the base model. The training data used was u-10bei/dpo-dataset-qwen-cot.

Good For

Applications requiring models with strong reasoning capabilities.
Generating structured and aligned text outputs.
Use cases where direct preference optimization leads to desired response quality.

License

The model is released under the MIT License, consistent with its training dataset. Users must also adhere to the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Good For

License

Full Model Card (README)