Taichi11/sft_v7_dpo_v2_merged

Hugging Face
Text generation · 4B parameters · BF16 · 32k context length · Published: Feb 22, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Taichi11/sft_v7_dpo_v2_merged is a 4 billion parameter language model fine-tuned by Taichi11 using Direct Preference Optimization (DPO) on the Taichi11/LLM_main_v7_merged base model. Optimized for improved Chain-of-Thought reasoning and higher-quality structured responses, this model is designed for applications requiring precise and well-organized outputs. It offers a 32768-token context length and ships as fully merged 16-bit weights for direct use without adapter loading.


Overview

Taichi11/sft_v7_dpo_v2_merged is a 4 billion parameter language model developed by Taichi11, built upon the Taichi11/LLM_main_v7_merged base model. It was fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, and its fully merged 16-bit weights are published directly, eliminating the need for adapter loading.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning abilities.
  • Structured Output Quality: Specifically aligned to produce higher quality structured responses based on preference datasets.
  • Direct Use: Provided as a fully merged model, ready for immediate deployment with transformers.
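Because the checkpoint is fully merged, it can be loaded with the standard transformers API alone, with no PEFT or adapter step. The sketch below assumes the model defines a chat template; the prompt text and generation settings are illustrative, not taken from the card.

```python
# Minimal loading sketch for the merged BF16 checkpoint, assuming the
# standard transformers AutoModel/AutoTokenizer API and a chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Taichi11/sft_v7_dpo_v2_merged"


def load_model_and_tokenizer():
    """Load the fully merged 16-bit weights directly (no adapter loading)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # the card lists BF16 weights
        device_map="auto",
    )
    return model, tokenizer


def generate(model, tokenizer, user_message, max_new_tokens=256):
    """Single-turn chat generation using the model's chat template."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_message}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True)


# Usage (downloads the ~4B-parameter weights):
# model, tokenizer = load_model_and_tokenizer()
# print(generate(model, tokenizer, "Explain step by step why 17 is prime."))
```

The 32768-token context applies at inference time; the 1024-token limit mentioned under Training Details was only the maximum sequence length during DPO training.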

Good For

  • Applications requiring models with improved logical reasoning steps.
  • Use cases where generating well-structured and precise outputs is critical.
  • Developers seeking a DPO-optimized model for better response alignment without complex setup.

Training Details

The model underwent DPO training for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1, using a maximum sequence length of 1024. Training used LoRA adapters (r=8, alpha=16) that were subsequently merged into the base model weights. The training data was Taichi11/dpo_dataset_v1.
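As a rough configuration sketch, the hyperparameters above map onto TRL's `DPOConfig` and PEFT's `LoraConfig` as shown below. This is an assumption about the recipe, not the author's actual script: the card says training used Unsloth (which wraps TRL), and the batch size and output path here are illustrative placeholders.

```python
# Hedged sketch of the DPO recipe described on the card, expressed with
# TRL's DPOConfig and PEFT's LoraConfig. Values marked "from the card"
# come from the Training Details section; everything else is illustrative.
from peft import LoraConfig
from trl import DPOConfig

dpo_args = DPOConfig(
    output_dir="dpo_out",            # illustrative path, not from the card
    beta=0.1,                        # DPO beta, from the card
    learning_rate=1e-7,              # from the card
    num_train_epochs=1,              # from the card
    max_length=1024,                 # max sequence length, from the card
    per_device_train_batch_size=2,   # assumption, not stated on the card
    report_to="none",
)

lora_cfg = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")  # from the card
```

These objects would then be passed to a `DPOTrainer` along with the base model and the Taichi11/dpo_dataset_v1 preference dataset, and the trained LoRA adapter merged back into the base weights to produce the published checkpoint.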