Model Overview
akseljoonas/qwen3-4b-dpo-hh-rlhf-reversed is a 4-billion-parameter instruction-tuned causal language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO), a method that trains directly on human preference data (pairs of preferred and rejected responses) to bring the model's outputs closer to desired behavior. The training was implemented with Hugging Face's TRL framework and targets general text generation, aiming for higher-quality, preference-aligned responses.
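The card does not include the training script, but a standard TRL DPO run looks like the sketch below. Everything here is illustrative: the toy dataset, `beta`, and batch size are assumptions, not the values used for this checkpoint (the "hh-rlhf" in the model name hints at Anthropic's HH-RLHF preference data, though the card itself does not confirm the data or preprocessing).

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Toy preference pairs in TRL's standard "prompt"/"chosen"/"rejected" format;
# the real run presumably used a full preference dataset.
train_dataset = Dataset.from_dict({
    "prompt": ["How do I sort a list in Python?"],
    "chosen": ["Use the built-in sorted() function, e.g. sorted(my_list)."],
    "rejected": ["Lists cannot be sorted in Python."],
})

# beta controls how strongly the policy is kept close to the reference model;
# 0.1 is TRL's default, not a documented value for this checkpoint.
training_args = DPOConfig(
    output_dir="qwen3-4b-dpo",
    beta=0.1,
    per_device_train_batch_size=1,
)

trainer = DPOTrainer(
    model=model,  # TRL clones the policy as the frozen reference model when ref_model is not given
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

The implicit reference model is what anchors the DPO objective to the base model's behavior, so preference training sharpens responses without drifting far from Qwen3-4B-Instruct-2507.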
Key Characteristics
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Parameter Count: 4 billion
- Context Length: 40,960 tokens (see the config check after this list)
- Training Method: Direct Preference Optimization (DPO) for preference alignment
- Framework: Hugging Face's TRL library
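As a quick sanity check, the advertised context length can be read from the model config; this assumes the standard `max_position_embeddings` field that Qwen3 configs expose.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("akseljoonas/qwen3-4b-dpo-hh-rlhf-reversed")
print(config.max_position_embeddings)  # expected: 40960 per the model card
```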
Use Cases
This model is suitable for text generation tasks where response quality and alignment with human preferences are important (a minimal usage sketch follows the list below). Its DPO fine-tuning makes it particularly effective for:
- Generating conversational responses.
- Answering open-ended questions.
- Creating coherent and contextually relevant text.
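The snippet below is a minimal usage sketch for these tasks, following the standard transformers chat workflow; the prompt and sampling settings are illustrative, not values recommended by the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "akseljoonas/qwen3-4b-dpo-hh-rlhf-reversed"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat-formatted prompt with the tokenizer's chat template.
messages = [{"role": "user", "content": "Explain what DPO fine-tuning does, briefly."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a response; sampling settings here are illustrative defaults.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Prompting through `apply_chat_template` with `add_generation_prompt=True` matches how instruction-tuned Qwen3 checkpoints are typically queried.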