jondurbin/bagel-dpo-7b-v0.1

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Context Length: 8K · Published: Dec 13, 2023 · License: apache-2.0 · Architecture: Transformer · Open Weights

jondurbin/bagel-dpo-7b-v0.1 is a 7-billion-parameter language model from jondurbin: a DPO-tuned version of the original bagel-7b-v0.1. It is fine-tuned with Direct Preference Optimization (DPO) to reduce unnecessary refusals and improve conversational quality, making it well suited to applications that call for more flexible, less restrictive responses. It posts strong results across benchmarks including MMLU, ARC-Challenge, and MT-Bench, indicating broad general knowledge and solid conversational ability.


jondurbin/bagel-dpo-7b-v0.1: A DPO-Tuned Conversational Model

This model, developed by jondurbin, is a 7 billion parameter language model that applies Direct Preference Optimization (DPO) to the original bagel-7b-v0.1. Its primary differentiation is the DPO tuning, which aims to mitigate excessive refusals and produce more natural, less constrained responses, even under system prompts that explicitly cast the model as a human. This makes it a strong candidate for conversational AI applications where flexibility and reduced censorship are desired.
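As a quick orientation, here is a minimal inference sketch using the Hugging Face transformers library, assuming the weights are fetched from the jondurbin/bagel-dpo-7b-v0.1 repository on the Hub. The dtype, device settings, sampling values, and the Alpaca-style prompt are illustrative choices, not prescribed by the model card.

```python
# Minimal inference sketch with Hugging Face transformers.
# Assumes weights are pulled from the Hub repo jondurbin/bagel-dpo-7b-v0.1;
# the prompt uses an Alpaca-style wrapper, one of the formats noted below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jondurbin/bagel-dpo-7b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # illustrative; bfloat16 or quantized loading also works
    device_map="auto",
)

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nExplain what DPO tuning changes about a model's behavior.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```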

Key Capabilities & Performance

The model exhibits competitive performance across a range of benchmarks:

  • General Knowledge & Reasoning: Achieves notable scores on MMLU (0.6408), ARC-Challenge (0.6715), and OpenBookQA (0.51).
  • Mathematical Reasoning: Scores 0.5618 on GSM8K.
  • Reading Comprehension: Performs well on BoolQ (0.8813) and PIQA (0.8406).
  • Conversational Quality: Achieves an average MT-Bench score of 7.30625, indicating strong multi-turn dialogue capabilities.

Training & Data

The model was trained on a composite dataset including both Supervised Fine-Tuning (SFT) and DPO data. The SFT phase incorporated a diverse array of datasets covering instruction following, coding, reading comprehension, and creative writing. The DPO phase utilized datasets like airoboros 3.1 (for creative responses), helpsteer (human-annotated preferences), orca_dpo_pairs, toxic-dpo (for de-censorship research), and truthy (to increase truthfulness). A unique aspect of its training is the use of four different prompt formats (Alpaca, Vicuna, ChatML-like, Llama-2 chat) for each instruction, enhancing generalization.
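To make the four prompt formats concrete, the sketch below renders one instruction in each style. The exact templates are assumptions based on common conventions for Alpaca, Vicuna, ChatML, and Llama-2 chat, so verify them against the upstream model card before relying on them.

```python
# Illustrative renderings of one instruction in the four prompt formats used
# during training. The exact templates are assumptions based on common
# conventions for each format; check the upstream model card for specifics.
instruction = "Summarize the plot of Moby-Dick in two sentences."

alpaca = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{instruction}\n\n### Response:\n"
)

vicuna = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    f"USER: {instruction} ASSISTANT: "
)

# "ChatML-like": plain role headers stand in for the <|im_start|>/<|im_end|>
# special tokens, since the base tokenizer does not include them (assumption).
chatml_like = f"user\n{instruction}\nassistant\n"

llama2_chat = f"[INST] {instruction} [/INST]"

for name, prompt in [("alpaca", alpaca), ("vicuna", vicuna),
                     ("chatml-like", chatml_like), ("llama-2 chat", llama2_chat)]:
    print(f"--- {name} ---\n{prompt}\n")
```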

Good For

  • Conversational AI: Especially where a less restrictive and more human-like response style is preferred.
  • Applications requiring reduced AI refusals: Useful when other models are too conservative or prone to "As an AI language model..." boilerplate.
  • General-purpose instruction following: Due to its diverse SFT dataset covering many task types.

Popular Sampler Settings

The three most popular parameter combinations among Featherless users for this model tune the following samplers: temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, and min_p.
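As a hypothetical illustration of how such sampler settings are applied in practice, the sketch below sends a request through an OpenAI-compatible chat completions endpoint. The base URL, API key handling, and all parameter values are placeholders, not the tracked top configurations.

```python
# Hypothetical sketch: applying sampler settings via an OpenAI-compatible
# chat completions endpoint. Base URL and parameter values are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.featherless.ai/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="jondurbin/bagel-dpo-7b-v0.1",
    messages=[{"role": "user", "content": "Write a haiku about bagels."}],
    temperature=0.8,           # placeholder values; tune per use case
    top_p=0.95,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    # Non-standard samplers go through extra_body, since the OpenAI schema
    # does not define top_k, min_p, or repetition_penalty.
    extra_body={"top_k": 40, "min_p": 0.05, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
```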