Taichi11/sft_v7_dpo_v2_merged
Taichi11/sft_v7_dpo_v2_merged is a 4-billion-parameter language model fine-tuned by Taichi11 with Direct Preference Optimization (DPO) on the Taichi11/LLM_main_v7_merged base model. It is optimized for stronger Chain-of-Thought reasoning and higher-quality structured responses, making it suited to applications that demand precise, well-organized outputs. The model offers a 32,768-token context length and ships as full-merged 16-bit weights, ready for direct use without adapter loading.
Overview
Taichi11/sft_v7_dpo_v2_merged is a 4 billion parameter language model developed by Taichi11, built upon the Taichi11/LLM_main_v7_merged base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its full-merged 16-bit weights available directly, eliminating the need for adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning abilities.
- Structured Output Quality: Specifically aligned to produce higher quality structured responses based on preference datasets.
- Direct Use: Provided as a fully merged model, ready for immediate deployment with `transformers` (see the sketch after this list).
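A minimal inference sketch with Hugging Face `transformers` is shown below. Only the repository name comes from this card; the prompt, dtype choice, and generation settings are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Taichi11/sft_v7_dpo_v2_merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # card states full-merged 16-bit weights
    device_map="auto",
)

# Assumes the tokenizer ships a chat template; fall back to plain
# tokenizer(prompt, return_tensors="pt") if it does not.
messages = [{"role": "user", "content": "Explain, step by step, why 97 is prime."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```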
Good For
- Applications requiring models with improved logical reasoning steps.
- Use cases where generating well-structured and precise outputs is critical.
- Developers seeking a DPO-optimized model for better response alignment without complex setup.
Training Details
The model underwent DPO training for 1 epoch with a learning rate of 1e-07 and a beta of 0.1. Training used a maximum sequence length of 1024, with the LoRA adapters (r=8, alpha=16) merged into the base model afterward. The training data was Taichi11/dpo_dataset_v1.
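For reference, below is a hedged sketch of that configuration with Unsloth and TRL. The hyperparameters (learning rate 1e-07, beta 0.1, 1 epoch, max sequence length 1024, LoRA r=8/alpha=16, dataset Taichi11/dpo_dataset_v1) are taken from this card; the batch size, dataset column names, and exact API surface are assumptions and may vary with library versions.

```python
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import DPOConfig, DPOTrainer

max_seq_length = 1024  # training-time sequence length from this card

# Load the base model and attach LoRA adapters (r=8, alpha=16).
model, tokenizer = FastLanguageModel.from_pretrained(
    "Taichi11/LLM_main_v7_merged",
    max_seq_length=max_seq_length,
)
model = FastLanguageModel.get_peft_model(model, r=8, lora_alpha=16)

# Assumes the dataset has the standard prompt/chosen/rejected columns.
dataset = load_dataset("Taichi11/dpo_dataset_v1", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,
        learning_rate=1e-7,
        num_train_epochs=1,
        max_length=max_seq_length,
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA weights into the base model and save 16-bit weights,
# matching the full-merged 16-bit artifact this card describes.
model.save_pretrained_merged("sft_v7_dpo_v2_merged", tokenizer, save_method="merged_16bit")
```

The final merge step is what lets the published checkpoint be loaded directly with `transformers`, without PEFT or adapter files.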