yunjae-won/ubq30i_qwen4b_dpo_topk20_j0
The yunjae-won/ubq30i_qwen4b_dpo_topk20_j0 model is a 4 billion parameter language model fine-tuned from yunjae-won/ubq30i_qwen4b_sft_both using Direct Preference Optimization (DPO). DPO aligns the model's outputs with human preferences, and the model inherits its base's 32,768-token context length. It is designed to generate text that human evaluators prefer, making it suitable for conversational AI and response generation tasks.
Model Overview
The yunjae-won/ubq30i_qwen4b_dpo_topk20_j0 is a 4 billion parameter language model developed by yunjae-won. It is a fine-tuned variant of the yunjae-won/ubq30i_qwen4b_sft_both model, optimized using Direct Preference Optimization (DPO). DPO is a training method that aligns language model outputs with human preferences by treating preference data as an implicit reward signal, as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (https://arxiv.org/abs/2305.18290).
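Concretely, DPO minimizes the following objective over preference pairs, where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model (here, presumably the SFT base yunjae-won/ubq30i_qwen4b_sft_both), and $\beta$ controls how far the policy may drift from the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here $y_w$ and $y_l$ are the preferred and dispreferred responses for prompt $x$, and $\sigma$ is the logistic function; the difference of log-ratios acts as the implicit reward margin.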
Key Characteristics
- Architecture: Based on the Qwen 4B family, with 4 billion parameters.
- Training Method: Direct Preference Optimization (DPO) for preference alignment.
- Context Length: Supports a 32,768-token context window.
- Framework: Trained with the TRL (Transformer Reinforcement Learning) library; a training sketch follows this list.
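The card does not publish the exact training configuration, but a DPO run with TRL generally follows the pattern below. This is a minimal sketch: the preference dataset, hyperparameters, and output path are illustrative placeholders, not the values used to train this model.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the SFT checkpoint that this model was fine-tuned from.
model_name = "yunjae-won/ubq30i_qwen4b_sft_both"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Any preference dataset with "prompt", "chosen", and "rejected" columns works;
# this public dataset is a stand-in, not the one used for this model.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="qwen4b-dpo",
    beta=0.1,  # strength of the KL penalty toward the frozen reference model
)

# With ref_model omitted, TRL clones the initial policy as the frozen reference.
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in TRL versions before 0.12
)
trainer.train()
```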
Intended Use Cases
This model is particularly well-suited for applications requiring the following (a quick-start inference example appears after the list):
- Preference-aligned text generation: Generating responses that are more likely to be preferred by users.
- Conversational AI: Enhancing the quality and naturalness of dialogue systems.
- Instruction following: Producing outputs that better adhere to given instructions due to DPO fine-tuning.
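To try the model, the standard transformers chat workflow applies. This sketch assumes the checkpoint ships a chat template like its Qwen base; the prompt is only an illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yunjae-won/ubq30i_qwen4b_dpo_topk20_j0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map requires `accelerate`
)

messages = [{"role": "user", "content": "Summarize what DPO fine-tuning changes in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```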