tomofusa/exp033-dpo-wd005-merged

TEXT GENERATION
Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Mar 1, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

The tomofusa/exp033-dpo-wd005-merged model is a 4 billion parameter language model developed by tomofusa. It is a merged model that combines a Supervised Fine-Tuning (SFT) stage with a Direct Preference Optimization (DPO) stage. It is provided with full 16-bit weights, so no adapter loading is needed. The DPO stage used a learning rate of 5e-07 and a beta of 0.1, which makes the model a candidate for tasks where preference-aligned responses matter.


Model Overview

The tomofusa/exp033-dpo-wd005-merged model is a 4 billion parameter language model developed by tomofusa. It is a merged model that combines a Supervised Fine-Tuning (SFT) phase with a subsequent Direct Preference Optimization (DPO) phase. It is distributed with full 16-bit weights, so it can be loaded directly without a separate adapter-loading step, which simplifies deployment.
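Because the weights are already merged, loading should look like loading any other causal language model. The following is a minimal sketch, assuming the repository is compatible with the Hugging Face transformers Auto classes and that torch and transformers are installed; the helper names load_model and generate are hypothetical, and the card does not state whether a chat template is expected, so a plain prompt is used.

```python
MODEL_ID = "tomofusa/exp033-dpo-wd005-merged"


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and the merged BF16 weights (no PEFT adapter step)."""
    # Imports live inside the function so the module can be imported
    # without pulling in torch until a model is actually requested.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # matches the BF16 quantization listed on the card
    )
    return tokenizer, model


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Run one plain-text generation on the default device (assumption: no chat template)."""
    tokenizer, model = load_model()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage (downloads the weights on first call):
#   print(generate("Explain preference optimization in one sentence."))
```

Note that there is no PeftModel or adapter path anywhere in the sketch; that is the practical meaning of "merged" here.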

Training Details

The model's training pipeline involved two main stages:

  • SFT Phase: Initialized from tomofusa/exp015-blend-h-lora.
  • DPO Phase: Further optimized on the u-10bei/dpo-dataset-qwen-cot dataset for one epoch. The DPO configuration used a learning rate of 5e-07, a beta of 0.1, and the ipo loss type. LoRA was applied during DPO with r=64 and alpha=128, and the maximum sequence length was 1024.
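The DPO settings listed above can be collected into a configuration sketch. The card does not say which training library was used; the example below assumes TRL's DPOConfig and PEFT's LoraConfig, whose parameter names happen to match the values given. The output_dir value and the build_configs helper are hypothetical, and every setting not named on the card is left at the library default.

```python
# Values taken from the training details; nothing else is specified on the card.
DPO_DATASET = "u-10bei/dpo-dataset-qwen-cot"

DPO_HPARAMS = {
    "learning_rate": 5e-7,
    "beta": 0.1,
    "loss_type": "ipo",
    "num_train_epochs": 1,
    "max_length": 1024,
}

LORA_HPARAMS = {
    "r": 64,
    "lora_alpha": 128,
}


def build_configs(output_dir: str = "dpo-output"):
    """Build TRL and PEFT config objects (assumption: TRL/PEFT were the tools used)."""
    # Imported lazily so the hyperparameter dicts can be inspected without trl/peft.
    from peft import LoraConfig
    from trl import DPOConfig

    dpo_config = DPOConfig(output_dir=output_dir, **DPO_HPARAMS)
    lora_config = LoraConfig(task_type="CAUSAL_LM", **LORA_HPARAMS)
    return dpo_config, lora_config


# With standard LoRA scaling, the adapter update is multiplied by alpha / r.
LORA_SCALE = LORA_HPARAMS["lora_alpha"] / LORA_HPARAMS["r"]
```

Under standard LoRA scaling, r=64 with alpha=128 gives a scale factor of 2 on the adapter update.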

Key Characteristics

  • Merged Architecture: Benefits from both SFT for foundational instruction following and DPO for preference alignment.
  • Full 16-bit Weights: Ready-to-use without adapter loading.
  • DPO Alignment: Tuned with Direct Preference Optimization, which aims to improve response quality by aligning outputs with the preferences expressed in the training pairs.
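The training details name an ipo loss type with beta = 0.1. As a rough illustration of what that choice means, the sketch below compares the per-pair IPO objective with the standard sigmoid DPO objective, written over scalar log-probabilities. It follows the commonly used formulation, in which IPO regresses the preference margin toward 1 / (2 * beta); the function names are hypothetical, and how the log-probabilities are normalized (summed or averaged over tokens) depends on the implementation.

```python
import math


def preference_margin(policy_chosen: float, policy_rejected: float,
                      ref_chosen: float, ref_rejected: float) -> float:
    """Policy log-ratio (chosen vs. rejected) minus the same log-ratio under the reference model."""
    return (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)


def ipo_loss(margin: float, beta: float = 0.1) -> float:
    """IPO: squared distance between the margin and the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2


def sigmoid_dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Standard DPO: -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin))."""
    return math.log1p(math.exp(-beta * margin))


# With beta = 0.1 the IPO target margin is 1 / (2 * 0.1) = 5.
margin = preference_margin(-10.0, -14.0, -11.0, -12.0)  # (4.0) - (1.0) = 3.0
loss = ipo_loss(margin)                                 # (3.0 - 5.0) ** 2
```

One consequence of the squared form is that IPO penalizes margins above the target as well as below it, whereas the sigmoid loss keeps decreasing as the margin grows.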