kikiyaa/qwen-dpo-finetuned-ver2

Text Generation | Concurrency Cost: 1 | Model Size: 7.6B | Quant: FP8 | Ctx Length: 32k | Published: Apr 17, 2026 | Architecture: Transformer | Cold

kikiyaa/qwen-dpo-finetuned-ver2 is a 7.6 billion parameter causal language model fine-tuned from Qwen/Qwen2.5-7B by kikiyaa. It was trained with Direct Preference Optimization (DPO) and supports a context length of 32768 tokens. It is intended for general text generation, with the preference-based training aimed at producing more aligned and helpful responses.

Overview

kikiyaa/qwen-dpo-finetuned-ver2 is a 7.6 billion parameter language model built on Qwen/Qwen2.5-7B. Developed by kikiyaa, it was further fine-tuned with Direct Preference Optimization (DPO), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (arXiv:2305.18290). This training approach aims to align the model's outputs more closely with human preferences.

Key Capabilities

  • Preference-tuned Responses: Trained with DPO so that outputs follow the preferences expressed in the training data, potentially leading to more helpful and desirable text.
  • General Text Generation: Handles a range of text generation tasks, with 7.6 billion parameters and a context window of 32768 tokens.
  • TRL Framework: Trained using TRL (Transformer Reinforcement Learning), Hugging Face's library for post-training language models.
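
As a rough illustration of the text generation capability listed above, here is a minimal sketch of loading the model with the Transformers library. It assumes the weights are published on the Hugging Face Hub under the ID kikiyaa/qwen-dpo-finetuned-ver2 and that plain prompt completion is appropriate; the card does not say whether a chat template is provided. The names `generate`, `max_prompt_tokens`, and `MAX_CONTEXT_TOKENS` are hypothetical helpers introduced for this example.

```python
MODEL_ID = "kikiyaa/qwen-dpo-finetuned-ver2"  # assumed Hugging Face Hub ID
MAX_CONTEXT_TOKENS = 32768  # context length stated on this card


def max_prompt_tokens(max_new_tokens: int) -> int:
    """Prompt tokens that still leave room for max_new_tokens in the context window."""
    if not 0 < max_new_tokens < MAX_CONTEXT_TOKENS:
        raise ValueError("max_new_tokens must be between 1 and the context length")
    return MAX_CONTEXT_TOKENS - max_new_tokens


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Complete a prompt; downloads the weights on first use."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Truncate the prompt so prompt + generation fits in the context window.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_prompt_tokens(max_new_tokens),
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Write a short note about preference tuning.")` would return the prompt followed by the model's continuation. A 7.6B model needs substantial memory at default precision, so a GPU or a quantized serving setup is likely required in practice.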

Training Details

Fine-tuning used DPO, which optimizes the language model directly on pairs of preferred and rejected responses. The reward is expressed implicitly through the model's own log-probabilities relative to a frozen reference model, so no separately trained reward model is needed. Training was conducted with TRL 1.1.0, Transformers 5.5.4, and PyTorch 2.9.1+cu128.
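
To make the DPO objective concrete, the following is a minimal sketch of the per-pair loss from the DPO paper, written in plain Python rather than with TRL. It is not the training code used for this model. The function name `dpo_loss` and the value `beta=0.1` are illustrative assumptions, since the card does not report the hyperparameters used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value; not reported on this card
) -> float:
    """DPO loss for one (chosen, rejected) pair of summed sequence log-probabilities."""
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference model the margin is zero and the loss equals log 2. The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, which is how preference pairs stand in for an explicit reward model.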