lhkhiem28/qwen2.5-1.5b-dpo-iter1

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: Nov 2, 2025 · Architecture: Transformer

lhkhiem28/qwen2.5-1.5b-dpo-iter1 is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct using Direct Preference Optimization (DPO). This model is designed for instruction-following tasks, leveraging DPO to align its responses with human preferences. It is suitable for applications requiring a compact yet capable model for generating coherent and contextually relevant text based on prompts.


Model Overview

lhkhiem28/qwen2.5-1.5b-dpo-iter1 is a 1.5 billion parameter language model, building upon the Qwen2.5-1.5B-Instruct architecture. This model has undergone further fine-tuning using Direct Preference Optimization (DPO), a method designed to align language model outputs more closely with human preferences by treating preference data as implicit rewards.

Key Characteristics

  • Base Model: Fine-tuned from Qwen/Qwen2.5-1.5B-Instruct, inheriting its foundational capabilities.
  • Training Method: Utilizes Direct Preference Optimization (DPO), as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (paper link). This method aims to improve the model's ability to generate preferred responses without explicit reward modeling.
  • Framework: Training was conducted using the TRL library (TRL GitHub), a Transformer Reinforcement Learning framework.
  • Context Length: Supports a context window of 32768 tokens, allowing for processing and generating longer sequences of text.
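The DPO objective mentioned above can be written down concretely. A minimal sketch of the per-pair loss, using scalar sequence log-probabilities and a `beta` of 0.1 (a common default from the DPO paper and TRL; the actual hyperparameters for this model are not published on the card):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the total log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    beta scales the implicit reward (0.1 is an assumed default here).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written with log1p for numerical stability
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2) ≈ 0.6931.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

Minimizing this loss pushes the policy to assign relatively more probability to preferred responses than the reference model does, which is the "implicit reward" view of preference data described above. In practice this training is handled by TRL's `DPOTrainer` rather than hand-written loops.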

Use Cases

This model is particularly well-suited for:

  • Instruction Following: Generating responses that adhere to specific instructions or prompts.
  • Text Generation: Creating coherent and contextually appropriate text in various scenarios.
  • Preference Alignment: Applications where human-like preferences in generated text are crucial, benefiting from its DPO training.
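For the instruction-following use case above, prompts must follow the ChatML layout that Qwen2.5 instruct models expect. In real use you would call `tokenizer.apply_chat_template(...)` from `transformers` after loading this model's tokenizer; the sketch below only illustrates the approximate prompt structure, and the message contents are made up for the example:

```python
def build_chatml_prompt(messages):
    """Approximate the ChatML prompt layout used by Qwen2.5 models.

    This is an illustration only; prefer tokenizer.apply_chat_template
    with add_generation_prompt=True, which applies the model's actual
    template.
    """
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Trailing assistant header cues the model to generate its reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize DPO in one sentence."},
])
print(prompt)
```

The resulting string would then be tokenized and passed to the model's `generate` method, staying well within the 32k-token context window for typical prompts.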