CEIA-RL/qwen3-4b-dw-lr-dpo-offline
Text Generation · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Concurrency Cost: 1 · Published: Apr 8, 2026 · Architecture: Transformer

CEIA-RL/qwen3-4b-dw-lr-dpo-offline is a 4-billion-parameter language model fine-tuned from cemig-temp/qwen3-4b-dw-lr using Direct Preference Optimization (DPO), with a 32,768-token context length. Developed by CEIA-RL, the model applies preference-alignment techniques from the RLHF family to improve conversational quality and alignment. It is intended for general text generation tasks where response quality and adherence to user preferences are critical.


Model Overview

This model, CEIA-RL/qwen3-4b-dw-lr-dpo-offline, is a 4-billion-parameter language model derived from cemig-temp/qwen3-4b-dw-lr. It was fine-tuned with Direct Preference Optimization (DPO), a method that aligns a language model with human preferences directly from preference data, without training a separate reward model. Training used Hugging Face's TRL library, which implements DPO and related reinforcement learning from human feedback (RLHF) techniques.
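Concretely, DPO minimizes a logistic loss over preference pairs, comparing the fine-tuned policy's log-probability ratio against a frozen reference model (here, the base model being fine-tuned from). In the standard formulation:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]
```

where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $\sigma$ is the sigmoid, and $\beta$ controls how far the policy may deviate from the reference model ($\beta$'s value for this particular training run is not stated in the card).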

Key Capabilities

  • Preference Alignment: Enhanced to generate responses that better align with human preferences due to DPO fine-tuning.
  • Conversational Generation: Suited to interactive, chat-style text generation tasks.
  • Qwen3 Architecture: Inherits the strengths of the underlying Qwen3 base model, including its 32k context window.

Training Details

The model was trained with DPO, the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". DPO directly optimizes the policy on preference pairs, making alignment more efficient than classic RLHF pipelines that train an explicit reward model. The training run can be inspected on Weights & Biases.
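To make the optimization concrete, here is a minimal pure-Python sketch of the per-pair loss that a DPO trainer such as TRL's computes (the function name and the default β value are illustrative, not taken from this model's training config). Inputs are the summed token log-probabilities of the chosen and rejected responses under the policy and the frozen reference model:

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is the summed log-probability of a full response.
    `beta` controls how strongly the policy is kept near the reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)) computed as softplus(-logits) for numerical stability
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# When the policy still matches the reference, the loss is log(2) ~= 0.693:
baseline = dpo_pair_loss(-10.0, -12.0, -10.0, -12.0)

# If the policy shifts probability mass toward the chosen response, the loss drops:
improved = dpo_pair_loss(-9.0, -13.0, -10.0, -12.0)
```

Training drives this loss down across the preference dataset, which increases the margin between chosen and rejected responses without ever fitting a separate reward model.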

Good For

  • Applications requiring improved response quality and alignment with user preferences.
  • General-purpose text generation where preference-tuned response style and conversational flow are valuable.