Model Overview
akseljoonas/qwen3-4b-dpo-hh-rlhf-reversed is a 4-billion-parameter instruction-tuned causal language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO), a method that trains directly on human preference data (pairs of preferred and rejected responses) to bring the model's outputs closer to desired behavior. The training was implemented with Hugging Face's TRL framework and targets general text generation, aiming for higher-quality, preference-aligned responses.
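The card does not include the training script, but a standard TRL DPO run looks like the sketch below. Everything here is illustrative: the toy dataset, `beta`, and batch size are assumptions, not the values used for this checkpoint (the "hh-rlhf" in the model name hints at Anthropic's HH-RLHF preference data, though the card itself does not confirm the data or preprocessing).

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Toy preference pairs in TRL's standard "prompt"/"chosen"/"rejected" format;
# the real run presumably used a full preference dataset.
train_dataset = Dataset.from_dict({
    "prompt": ["How do I sort a list in Python?"],
    "chosen": ["Use the built-in sorted() function, e.g. sorted(my_list)."],
    "rejected": ["Lists cannot be sorted in Python."],
})

# beta controls how strongly the policy is kept close to the reference model;
# 0.1 is TRL's default, not a documented value for this checkpoint.
training_args = DPOConfig(
    output_dir="qwen3-4b-dpo",
    beta=0.1,
    per_device_train_batch_size=1,
)

trainer = DPOTrainer(
    model=model,  # TRL clones the policy as the frozen reference model when ref_model is not given
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

The implicit reference model is what anchors the DPO objective to the base model's behavior, so preference training sharpens responses without drifting far from Qwen3-4B-Instruct-2507.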
Key Characteristics
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Parameter Count: 4 billion
- Context Length: 40,960 tokens (see the config check after this list)
- Training Method: Direct Preference Optimization (DPO) for preference alignment
- Framework: Hugging Face's TRL library
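As a quick sanity check, the advertised context length can be read from the model config; this assumes the standard `max_position_embeddings` field that Qwen3 configs expose.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("akseljoonas/qwen3-4b-dpo-hh-rlhf-reversed")
print(config.max_position_embeddings)  # expected: 40960 per the model card
```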
Use Cases
This model is suitable for text generation tasks where response quality and alignment with human preferences are important (a minimal usage sketch follows the list below). Its DPO fine-tuning makes it particularly effective for:
- Generating conversational responses.
- Answering open-ended questions.
- Creating coherent and contextually relevant text.
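The snippet below is a minimal usage sketch for these tasks, following the standard transformers chat workflow; the prompt and sampling settings are illustrative, not values recommended by the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "akseljoonas/qwen3-4b-dpo-hh-rlhf-reversed"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat-formatted prompt with the tokenizer's chat template.
messages = [{"role": "user", "content": "Explain what DPO fine-tuning does, briefly."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a response; sampling settings here are illustrative defaults.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Prompting through `apply_chat_template` with `add_generation_prompt=True` matches how instruction-tuned Qwen3 checkpoints are typically queried.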