Kyleyee/VRPO_hh-seed5

Text generation · Concurrency cost: 1 · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Published: Apr 23, 2026 · Architecture: Transformer

Kyleyee/VRPO_hh-seed5 is a 1.5 billion parameter language model fine-tuned by Kyleyee from the Qwen2.5-1.5B-sft-hh-3e base model. It was trained with the DRDPO method on a helpfulness preference dataset, optimizing its ability to generate helpful, aligned responses. With a context length of 32768 tokens, it is designed for conversational AI and instruction-following tasks where helpfulness is a key requirement.


Model Overview

Kyleyee/VRPO_hh-seed5 is a 1.5 billion parameter language model developed by Kyleyee, building upon the Qwen2.5-1.5B-sft-hh-3e base model. Its primary distinction lies in its fine-tuning process, which used DRDPO, a variant of Direct Preference Optimization (DPO). DPO, introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" [http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html], aligns a model's outputs with human preferences by optimizing directly on pairs of preferred and dispreferred responses, using the policy itself as an implicit reward model rather than training a separate one.
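For orientation, the standard DPO objective from the paper above is shown below; this card does not specify how DRDPO modifies it, so only the base loss is given. Here \(\pi_\theta\) is the policy being trained, \(\pi_{\mathrm{ref}}\) is the frozen reference (SFT) model, \(y_w\) and \(y_l\) are the preferred and dispreferred responses to prompt \(x\), \(\beta\) controls the strength of the implicit KL constraint, and \(\sigma\) is the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```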

Key Capabilities

  • Preference Alignment: Specifically trained on the Kyleyee/train_data_Helpful_drdpo_preference dataset to enhance helpfulness in responses.
  • Instruction Following: Optimized for generating outputs that adhere to user instructions, benefiting from the DRDPO fine-tuning.
  • Efficient Performance: At 1.5 billion parameters, it offers a balance between performance and computational efficiency for preference-aligned tasks.
  • Extended Context: Supports a context length of 32768 tokens, allowing it to process longer prompts and maintain conversational coherence (see the usage sketch after this list).
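Since the model is intended for conversational, instruction-following use, it can be tried with a plain Hugging Face transformers workflow. The following is a minimal sketch, assuming the model loads as a standard causal LM from the Hub under the ID shown on this card; the prompt is illustrative only.

```python
# Minimal inference sketch; assumes Kyleyee/VRPO_hh-seed5 is available on the
# Hugging Face Hub as a standard causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kyleyee/VRPO_hh-seed5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# BF16 matches the quantization listed in this card's metadata.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Explain, in two sentences, how to politely decline a meeting."  # hypothetical prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```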

Training Details

The model was fine-tuned using the TRL (Transformer Reinforcement Learning) library [https://github.com/huggingface/trl], leveraging the DRDPO algorithm. Like standard DPO, this approach directly optimizes the policy so that preferred responses become more likely than dispreferred ones relative to the reference model, making it particularly suitable for applications requiring robust and helpful conversational agents.
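TRL does not ship a trainer named DRDPO, so the sketch below uses TRL's documented DPOTrainer as the closest analogue. The base-model Hub ID, the hyperparameters (beta, batch size), and the dataset column layout are assumptions rather than published details of this training run; the preference dataset name is taken from this card.

```python
# DPO-style preference fine-tuning sketch with TRL. DRDPO itself is not a
# TRL class; DPOTrainer is used here as the closest documented analogue.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"  # assumed Hub ID for the SFT base model

model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# TRL's DPO data convention expects "prompt", "chosen", and "rejected" columns;
# this assumes the dataset follows that layout.
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

config = DPOConfig(
    output_dir="vrpo_hh-seed5",     # hypothetical output path
    beta=0.1,                       # KL strength; the actual value is not published
    seed=5,                         # matches the "seed5" suffix in the model name
    per_device_train_batch_size=2,  # hypothetical; tune to available memory
)

trainer = DPOTrainer(
    model=model,                 # ref_model omitted: TRL clones the policy as reference
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # recent TRL versions; older ones use tokenizer=
)
trainer.train()
```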