tomofusa/exp034-toml-upsample-dpo-merged

Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Mar 1, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

tomofusa/exp034-toml-upsample-dpo-merged is a 4-billion-parameter language model developed by tomofusa. It ships as full 16-bit (BF16) weights, so no adapter needs to be loaded. It was trained in a two-stage pipeline: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). The DPO stage targets responses aligned with human preferences, which makes the model a candidate for conversational AI and instruction-following tasks.


Model Overview

tomofusa/exp034-toml-upsample-dpo-merged is a 4-billion-parameter language model developed by tomofusa. It is distributed as full 16-bit weights, so it loads directly without a separate adapter, which simplifies deployment. It was trained in two stages intended to improve output quality and alignment with human preferences.
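Because the weights are already merged, loading should need nothing beyond the usual causal-LM entry points. The sketch below assumes the checkpoint is hosted on the Hugging Face Hub under the id shown in the page title and that it works with the standard `transformers` Auto classes; neither is confirmed by this page. The helper names `load_model` and `generate` are illustrative.

```python
# Sketch only: assumes the checkpoint is on the Hugging Face Hub under this id
# and loads through the standard transformers Auto classes.
MODEL_ID = "tomofusa/exp034-toml-upsample-dpo-merged"


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and the merged BF16 weights; no adapter step is involved."""
    # Imported lazily so this file can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 128) -> str:
    """Greedy-decode a completion for a plain-text prompt."""
    import torch

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A typical call would be `tokenizer, model = load_model()` followed by `generate(tokenizer, model, "Write a haiku about TOML.")`. If the tokenizer ships a chat template, formatting the prompt with it would likely give better results, but this page does not say whether one is included.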

Training Pipeline

The model's training consisted of two distinct phases:

  • Supervised Fine-Tuning (SFT): The initial phase used the tomofusa/exp034-blend-h-toml-up-lora model as its base, providing the foundation for language understanding and generation.
  • Direct Preference Optimization (DPO): Following SFT, the model underwent DPO on the u-10bei/dpo-dataset-qwen-cot dataset. This phase aligns the model's outputs with the preferences encoded in that dataset, using the following configuration:
    • Learning rate: 5e-07
    • Beta: 0.1
    • Loss type: ipo
    • LoRA parameters: r=64, alpha=128
    • Max length: 1024
    • Training duration: 1 epoch
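To make the listed settings concrete, the sketch below collects them in a small dataclass and derives two quantities they imply: the LoRA scaling factor alpha / r, and the margin that the IPO loss regresses toward, 1 / (2 * beta). The page does not say which training library was used, so `DpoSettings`, `lora_scale`, and `ipo_loss` are hypothetical names, and the loss follows the general published form of the IPO objective rather than the author's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DpoSettings:
    """Values copied from the list above; the class itself is illustrative."""
    learning_rate: float = 5e-07
    beta: float = 0.1
    loss_type: str = "ipo"
    lora_r: int = 64
    lora_alpha: int = 128
    max_length: int = 1024
    num_epochs: int = 1


def lora_scale(cfg: DpoSettings) -> float:
    """Standard LoRA scaling: the low-rank update is multiplied by alpha / r."""
    return cfg.lora_alpha / cfg.lora_r


def ipo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """IPO loss for one preference pair, given sequence log-probabilities.

    The margin measures how much more the policy prefers the chosen response
    over the rejected one, relative to the reference model. IPO pulls that
    margin toward 1 / (2 * beta) with a squared error.
    """
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return (margin - 1.0 / (2.0 * beta)) ** 2


cfg = DpoSettings()
print(lora_scale(cfg))         # 2.0
print(1.0 / (2.0 * cfg.beta))  # 5.0
```

With beta = 0.1, the loss is zero when the policy's log-probability margin exceeds the reference margin by exactly 5, and it penalizes both undershooting and overshooting that target. Library implementations may normalize the log-probabilities differently, for example by sequence length.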

Key Capabilities

The DPO stage is aimed at preference-aligned text generation. The 32768-token context length leaves room for long prompts and long responses within a single window. Because the model ships as full 16-bit weights, it runs without the overhead of loading or managing an adapter.
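For deployment planning, two back-of-the-envelope figures follow from the numbers on this page: the memory needed for the weights alone, and the tokens left for generation after the prompt. The sketch below is an estimate under stated assumptions: "4B" is treated as exactly 4 billion parameters, and activations, KV cache, and framework overhead are ignored. The function names are illustrative.

```python
CONTEXT_LENGTH = 32768        # from the model card
PARAM_COUNT = 4_000_000_000   # "4B" is rounded, so this is approximate
BYTES_PER_PARAM_BF16 = 2      # bfloat16 stores each weight in 16 bits


def weight_footprint_gib(params: int = PARAM_COUNT) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return params * BYTES_PER_PARAM_BF16 / 2**30


def max_new_tokens(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Tokens left for generation once the prompt is counted against the window."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_length - prompt_tokens, 0)


print(round(weight_footprint_gib(), 2))  # 7.45
print(max_new_tokens(30_000))            # 2768
```

Actual memory use will be higher than the weight figure, especially at long context lengths where the KV cache grows with the number of tokens.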

Good For

  • Applications requiring models that adhere closely to human preferences.
  • Conversational AI and chatbot development where response quality and alignment are critical.
  • Instruction-following tasks where nuanced understanding and generation are beneficial.