Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged

Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Context Length: 32k · Published: Feb 7, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. Developed by Hi-Satoh, it was refined with Direct Preference Optimization (DPO) via Unsloth to enhance its reasoning and alignment, building on domain-specific knowledge acquired during its Supervised Fine-Tuning (SFT) phase.


Overview

Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged is a 4-billion-parameter language model fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. Training proceeded in stages: Supervised Fine-Tuning (SFT) first, then Direct Preference Optimization (DPO) using the Unsloth library. The primary objective of this fine-tuning was to improve the model's reasoning and alignment capabilities.

Key Capabilities

  • Enhanced Reasoning: The model's DPO phase specifically targeted improvements in reasoning, building on domain-specific knowledge from its SFT phase.
  • Optimized Alignment: DPO training contributes to better alignment with desired outputs and user preferences.
  • Full-Merged Weights: This repository provides the full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
  • Qwen3-4B-Instruct-2507 Base: Leverages the robust architecture and initial capabilities of the Qwen3-4B-Instruct-2507 model.
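Because the repository ships full-merged 16-bit weights rather than LoRA adapters, the model should load like any standard causal LM. A minimal sketch using the Hugging Face `transformers` library (the exact generation settings here are illustrative, not taken from the model card):

```python
# Minimal loading sketch: full-merged BF16 weights mean no PEFT/adapter
# loading step is needed -- plain AutoModelForCausalLM suffices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are published in BF16
    device_map="auto",
)

# Qwen3 instruct models ship a chat template; apply it when prompting.
messages = [{"role": "user", "content": "Briefly explain what DPO is."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Since the context length is listed as 32k, long prompts can be passed directly without any adapter-merging or rope-scaling configuration.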

Training Details

The model's training involved:

  • SFT Phase: Initial training on the u-10bei/structured_data_with_cot_dataset_512_v5 dataset to acquire domain-specific knowledge.
  • DPO Phase: Applied Direct Preference Optimization for 2 epochs with a learning rate of 1e-07 and a beta of 0.1, using u-10bei/dpo-dataset-qwen-cot as the training data. The maximum sequence length for DPO was 2048 tokens.
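To make the DPO phase concrete, the sketch below implements the standard DPO loss for a single preference pair, using the beta = 0.1 reported above. The log-probability values are made-up inputs for illustration; in real training these come from the policy and a frozen reference model.

```python
# Illustrative sketch of the DPO objective (not the author's training code).
# For a chosen response y_w and rejected response y_l, the per-pair loss is:
#   L = -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
import math

def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), split by sign for numerical stability
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Example: the policy favors the chosen response more than the reference
# does, so the margin is positive (0.1 * (2 - (-2)) = 0.4) and loss is low.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))  # ~0.513
```

Minimizing this loss pushes the policy to assign relatively higher probability to preferred responses than the reference model does, while the small beta (0.1) and tiny learning rate (1e-07) keep the policy close to the SFT starting point.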

Good For

  • Applications requiring a 4B parameter model with improved reasoning and alignment.
  • Developers looking for a readily deployable Qwen3-4B variant with enhanced performance characteristics.
  • Use cases benefiting from a model fine-tuned with DPO for better preference alignment.