Hi-Satoh/adv_sft_dpo_final_1_merged
Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 28, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Hi-Satoh/adv_sft_dpo_final_1_merged is a 4-billion-parameter instruction-tuned causal language model developed by Hi-Satoh. It is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507, further optimized with Direct Preference Optimization (DPO) to improve chain-of-thought reasoning and the quality of structured responses. The model targets tasks that require closer alignment with preferred outputs and stronger logical coherence.
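As an instruction-tuned causal language model with open weights, it can be loaded like any other chat model on the Hub. The sketch below is a hypothetical usage example, not taken from the model's own documentation: it assumes the standard Hugging Face transformers chat workflow (AutoTokenizer, AutoModelForCausalLM, apply_chat_template) works for this checkpoint, as it does for its Qwen3 base model. The prompt text and generation settings are illustrative.

```python
MODEL_ID = "Hi-Satoh/adv_sft_dpo_final_1_merged"

def build_messages(user_prompt: str) -> list[dict]:
    # Chat-style message list in the format expected by apply_chat_template.
    return [{"role": "user", "content": user_prompt}]

if __name__ == "__main__":
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the published quantization; device_map="auto" places layers on GPU if available.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="bfloat16", device_map="auto"
    )

    inputs = tokenizer.apply_chat_template(
        build_messages("Explain Direct Preference Optimization in one sentence."),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Because the model supports a 32k context length, long documents can be included in the user message; generation length is still capped by `max_new_tokens`.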
