CEIA-RL/qwen3-4b-dw-lr-dpo-offline
Text Generation · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Concurrency Cost: 1 · Published: Apr 8, 2026 · Architecture: Transformer

CEIA-RL/qwen3-4b-dw-lr-dpo-offline is a 4-billion-parameter language model fine-tuned from cemig-temp/qwen3-4b-dw-lr using Direct Preference Optimization (DPO), with a 32,768-token context length. Developed by CEIA-RL, the model applies preference-alignment techniques from the RLHF family to improve conversational quality and alignment. It is intended for general text generation tasks where response quality and adherence to user preferences are critical.


Model Overview

This model, CEIA-RL/qwen3-4b-dw-lr-dpo-offline, is a 4-billion-parameter language model derived from cemig-temp/qwen3-4b-dw-lr. It was fine-tuned with Direct Preference Optimization (DPO), a method that aligns a language model with human preferences directly from preference data, without training a separate reward model. Training used Hugging Face's TRL library, which implements DPO and related reinforcement learning from human feedback (RLHF) techniques.
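Concretely, DPO minimizes a logistic loss over preference pairs, comparing the fine-tuned policy's log-probability ratio against a frozen reference model (here, the base model being fine-tuned from). In the standard formulation:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]
```

where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $\sigma$ is the sigmoid, and $\beta$ controls how far the policy may deviate from the reference model ($\beta$'s value for this particular training run is not stated in the card).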

Key Capabilities

  • Preference Alignment: Enhanced to generate responses that better align with human preferences due to DPO fine-tuning.
  • Conversational Generation: Suited to interactive, chat-style text generation tasks.
  • Qwen3 Architecture: Inherits the strengths of the underlying Qwen3 base model, including its 32k context window.

Training Details

The model was trained with DPO, the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". DPO directly optimizes the policy on preference pairs, making alignment more efficient than classic RLHF pipelines that train an explicit reward model. The training run can be inspected on Weights & Biases.
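To make the optimization concrete, here is a minimal pure-Python sketch of the per-pair loss that a DPO trainer such as TRL's computes (the function name and the default β value are illustrative, not taken from this model's training config). Inputs are the summed token log-probabilities of the chosen and rejected responses under the policy and the frozen reference model:

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is the summed log-probability of a full response.
    `beta` controls how strongly the policy is kept near the reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)) computed as softplus(-logits) for numerical stability
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# When the policy still matches the reference, the loss is log(2) ~= 0.693:
baseline = dpo_pair_loss(-10.0, -12.0, -10.0, -12.0)

# If the policy shifts probability mass toward the chosen response, the loss drops:
improved = dpo_pair_loss(-9.0, -13.0, -10.0, -12.0)
```

Training drives this loss down across the preference dataset, which increases the margin between chosen and rejected responses without ever fitting a separate reward model.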

Good For

  • Applications requiring improved response quality and alignment with user preferences.
  • General-purpose text generation where preference-tuned response style and conversational flow are valuable.