jondurbin/bagel-dpo-7b-v0.1: A DPO-Tuned Conversational Model
This model, developed by jondurbin, is a 7 billion parameter language model that has undergone Direct Preference Optimization (DPO) on top of the original bagel-7b-v0.1. Its primary differentiation is this DPO tuning, which aims to mitigate excessive AI refusals and produce more natural, less constrained responses, even under system prompts that explicitly cast the assistant as a human. This makes it a strong candidate for conversational AI applications where flexibility and reduced censorship are desired.
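To make the DPO tuning concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. This is an illustrative reimplementation of the published DPO objective, not code from this model's actual training pipeline; the function name and example log-probability values are hypothetical.

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    pi_logp_w / pi_logp_l   : policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l : reference-model log-probs of the same responses
    beta                    : temperature controlling deviation from the reference
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: minimized as the policy
    # increasingly prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs: favoring the chosen response lowers the loss.
low = dpo_loss(-5.0, -9.0, -6.0, -6.0)   # policy prefers chosen
high = dpo_loss(-9.0, -5.0, -6.0, -6.0)  # policy prefers rejected
```

When the policy and reference assign identical log-probs, the margin is zero and the loss is log 2, the neutral starting point before tuning.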
Key Capabilities & Performance
The model exhibits competitive performance across a range of benchmarks:
- General Knowledge & Reasoning: Achieves notable scores on `mmlu` (0.6408), `arc_challenge` (0.6715), and `openbookqa` (0.51).
- Mathematical Reasoning: Scores 0.5618 on `gsm8k`.
- Reading Comprehension: Performs well on `boolq` (0.8813) and `piqa` (0.8406).
- Conversational Quality: Achieves an average MT-Bench score of 7.30625, indicating strong multi-turn dialogue capabilities.
Training & Data
The model was trained on a composite dataset including both Supervised Fine-Tuning (SFT) and DPO data. The SFT phase incorporated a diverse array of datasets covering instruction following, coding, reading comprehension, and creative writing. The DPO phase utilized datasets like airoboros 3.1 (for creative responses), helpsteer (human-annotated preferences), orca_dpo_pairs, toxic-dpo (for de-censorship research), and truthy (to increase truthfulness). A unique aspect of its training is the use of four different prompt formats (Alpaca, Vicuna, ChatML-like, Llama-2 chat) for each instruction, enhancing generalization.
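The multi-format training described above means the model should accept any of these prompt styles at inference time. Below is a sketch of three of the four formats; the exact templates (preamble wording, spacing, special tokens) are approximations and may differ slightly from those used in training.

```python
def alpaca_prompt(instruction: str) -> str:
    # Alpaca-style format: fixed preamble, then instruction/response headers.
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

def vicuna_prompt(instruction: str,
                  system: str = "A chat between a user and an assistant.") -> str:
    # Vicuna-style format: system line followed by USER/ASSISTANT turns.
    return f"{system} USER: {instruction} ASSISTANT: "

def llama2_chat_prompt(instruction: str,
                       system: str = "You are a helpful assistant.") -> str:
    # Llama-2 chat format: [INST] tags with an embedded <<SYS>> block.
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{instruction} [/INST] "

prompt = alpaca_prompt("Summarize the plot of Hamlet in one sentence.")
```

Any of these strings can be passed directly to the model; per the training note, the same instruction was seen under each format, so outputs should be comparable across them.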
Good For
- Conversational AI: Especially where a less restrictive and more human-like response style is preferred.
- Applications requiring reduced AI refusals: If other models are too conservative or prone to "As an AI language model..."-style responses.
- General-purpose instruction following: Due to its diverse SFT dataset covering many task types.