TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model

Text Generation · Model size: 4B · Precision: BF16 · Context length: 32k · Published: Feb 3, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507, trained with Direct Preference Optimization (DPO) using Unsloth. This 4-billion-parameter model aims to improve reasoning through Chain-of-Thought (CoT) and to produce better-structured responses. It is intended for applications that require aligned, high-quality outputs on reasoning tasks.


Model Overview

This model, TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model, is a specialized variant of the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO), leveraging the Unsloth library, to align its responses with preferred outputs.

Key Capabilities & Optimization

  • Enhanced Reasoning: The primary objective of this DPO fine-tuning was to improve the model's reasoning abilities, particularly through Chain-of-Thought (CoT) processes.
  • Structured Response Quality: It is optimized to produce higher quality and more structured responses based on a preference dataset.
  • Full-Merged Weights: The repository provides the full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
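Because the adapter has already been merged, the model can be loaded like any standard checkpoint. A minimal deployment sketch using the `transformers` library (the helper name `load_merged_model` is illustrative, not part of the repository):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model"

def load_merged_model(model_id: str = MODEL_ID):
    """Load the full-merged BF16 weights directly; no PEFT/adapter step is needed."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the repository publishes 16-bit weights
        device_map="auto",           # place layers on available accelerators
    )
    return tokenizer, model
```

This is the same loading path as for the base Qwen3-4B-Instruct model, which is the practical benefit of shipping merged weights.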

Training Details

  • Methodology: DPO was applied over 1 epoch with a learning rate of 1e-07 and a beta value of 0.1.
  • Context Length: The training utilized a maximum sequence length of 1024 tokens.
  • LoRA Configuration: LoRA (r=8, alpha=16) was used during training and subsequently merged into the base model.
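For reference, the per-pair DPO objective these hyperparameters feed into can be sketched in plain Python; beta=0.1 matches the value above, while the example log-probabilities are illustrative:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) rewritten in a numerically stable form: log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

# When the policy favors the chosen response more than the reference model does,
# the margin is positive and the loss drops below log(2) ≈ 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # → ≈ 0.598
```

The small beta (0.1) and low learning rate (1e-07) keep the fine-tuned policy close to the reference model while still rewarding preferred responses.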

Intended Use Cases

This model is particularly well-suited for applications where:

  • Improved Reasoning is critical, especially for tasks benefiting from Chain-of-Thought prompting.
  • High-Quality, Aligned Outputs are required, reflecting preferred response styles and structures.
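Qwen instruct models use the ChatML conversation format. A minimal sketch of assembling a CoT-style prompt by hand (the system instruction and question are illustrative; in practice, prefer `tokenizer.apply_chat_template`, which applies the model's own template):

```python
def build_chatml_prompt(messages: list[dict]) -> str:
    """Render a message list into ChatML and append the assistant
    header so the model generates the next turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant. Reason step by step."},
    {"role": "user", "content": "A train travels 60 km in 45 minutes. What is its average speed in km/h?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

The "reason step by step" system instruction elicits the Chain-of-Thought behavior this DPO fine-tune was optimized for.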

Licensing

The model's training data is sourced from u-10bei/dpo-dataset-qwen-cot under an MIT License. Users must also comply with the original base model's license terms.