CEIA-RL/qwen3-4b-dw-lr-dpo

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Apr 24, 2026Architecture:Transformer Cold

CEIA-RL/qwen3-4b-dw-lr-dpo is a 4 billion parameter language model, fine-tuned from cemig-temp/qwen3-4b-dw-lr using Online DPO. This model specializes in direct language model alignment through online AI feedback, leveraging a 32768 token context length. It is designed for tasks requiring refined conversational responses and alignment with human preferences.

Loading preview...

Model Overview

CEIA-RL/qwen3-4b-dw-lr-dpo is a 4 billion parameter language model, building upon the cemig-temp/qwen3-4b-dw-lr base model. Its key differentiator lies in its training methodology: it has been fine-tuned using Online DPO (Direct Language Model Alignment from Online AI Feedback), a method introduced in the paper "Direct Language Model Alignment from Online AI Feedback" (arXiv:2402.04792). This approach aims to align the model's outputs more closely with desired human preferences through continuous feedback.

Key Capabilities

  • Online DPO Fine-tuning: Utilizes a novel training procedure for direct alignment based on online AI feedback.
  • Qwen3 Architecture: Benefits from the foundational capabilities of the Qwen3 model family.
  • Context Length: Supports a substantial context window of 32768 tokens, enabling processing of longer inputs and generating more coherent, extended responses.

Training Details

The model was trained using the TRL (Transformers Reinforcement Learning) library, specifically implementing the Online DPO method. This training process is designed to enhance the model's ability to generate aligned and preferred responses, making it suitable for applications where nuanced and human-like interaction is crucial.

Use Cases

This model is particularly well-suited for applications requiring:

  • Conversational AI: Generating more aligned and contextually appropriate dialogue.
  • Instruction Following: Producing outputs that better adhere to user instructions and preferences.
  • Research in Alignment: Exploring the effectiveness of Online DPO for language model fine-tuning.