Model Overview
tomoofusa/exp034-toml-upsample-dpo-merged is a 4-billion-parameter language model developed by tomofusa. It ships with full 16-bit weights, so no adapter loading is required, which simplifies deployment and usage. The model was developed through a two-stage training process designed to improve performance and alignment with human preferences.
Training Pipeline
The model's training consisted of two distinct phases:
- Supervised Fine-Tuning (SFT): The initial phase used the tomoofusa/exp034-blend-h-toml-up-lora model as its base, establishing a strong foundation for language understanding and generation.
- Direct Preference Optimization (DPO): Following SFT, the model underwent DPO on the u-10bei/dpo-dataset-qwen-cot dataset. This phase aligned the model's outputs with human preferences, using the following configuration:
  - Learning rate: 5e-07
  - Beta: 0.1
  - Loss type: ipo
  - LoRA parameters: r=64, alpha=128
  - Max length: 1024
  - Training duration: 1 epoch
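The DPO hyperparameters above can be collected into a plain configuration object. This is a sketch only: the key names mirror the conventions of TRL's DPOConfig and PEFT's LoraConfig, but the author's actual training script is not published, so the field names are an assumption.

```python
# Hypothetical reconstruction of the DPO settings listed above.
# Key names follow TRL/PEFT conventions (an assumption, not the author's code).
dpo_config = {
    "learning_rate": 5e-07,
    "beta": 0.1,             # regularization strength of the preference loss
    "loss_type": "ipo",      # IPO variant of the DPO objective
    "max_length": 1024,      # maximum sequence length per example
    "num_train_epochs": 1,
}

# LoRA settings used during the DPO phase.
lora_config = {"r": 64, "lora_alpha": 128}

# Effective LoRA scaling factor is alpha / r.
scaling = lora_config["lora_alpha"] / lora_config["r"]
print(scaling)  # → 2.0
```

Note that alpha = 2 × r, a common choice that scales the LoRA update by a factor of 2 relative to the base weights.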
Key Capabilities
Thanks to its DPO training, the model generates high-quality, preference-aligned text. Its 32768-token context length supports processing and generating longer, more coherent responses, and the merged full 16-bit weights deliver full performance without the overhead of adapter management.
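Because the weights are merged, the model should load directly without a PEFT adapter step. The snippet below is an illustrative sketch assuming the Hugging Face transformers library and that the checkpoint is hosted on the Hub under the repo id above; it requires transformers and torch to be installed and downloads the weights on first use.

```python
# Sketch only: assumes `transformers`/`torch` are installed and the merged
# checkpoint is available on the Hugging Face Hub under this repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tomoofusa/exp034-toml-upsample-dpo-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # full 16-bit weights; no adapter loading needed
    device_map="auto",
)

prompt = "Explain direct preference optimization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With no adapter to attach, this is a single `from_pretrained` call rather than a base-model load followed by a PEFT merge.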
Good For
- Applications requiring models that adhere closely to human preferences.
- Conversational AI and chatbot development where response quality and alignment are critical.
- Instruction-following tasks where nuanced understanding and generation are beneficial.