Dolphin 2.6 Mistral 7b DPO: Uncensored and Compliant
Dolphin 2.6 Mistral 7b DPO is a 7-billion-parameter language model built on the Mistral-7b architecture, with a 4096-token context window. Developed by dphn and sponsored by Convai, this iteration adds DPO (Direct Preference Optimization) tuning using the argilla/ultrafeedback-binarized-preferences-cleaned dataset.
Key Capabilities & Characteristics
- Enhanced Coding Performance: The model has been trained with a significant amount of coding data, making it particularly proficient in coding tasks.
- High Compliance & Uncensored: DPO tuning has made the model highly compliant with user instructions. The model is uncensored: its training data was filtered to remove alignment and bias, so it may comply even with unethical requests. Users are advised to implement their own alignment layer before exposing the model as a service.
- ChatML Format: Utilizes the ChatML prompt format, with <|im_end|> mapped to token_id 2 for broader compatibility.
- Performance Benchmarks: Achieves an average score of 67.20 on the Open LLM Leaderboard, including 65.61 on the AI2 Reasoning Challenge and 63.24 on MMLU.
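To make the ChatML convention above concrete, here is a minimal sketch of how a prompt for this model might be assembled. The helper name `format_chatml` and the example system prompt are illustrative, not part of the model card; in practice you may prefer a tokenizer's built-in chat template if one is provided.

```python
def format_chatml(system: str, user: str) -> str:
    """Build a ChatML-style prompt: each turn is wrapped in
    <|im_start|>role ... <|im_end|> markers, and the prompt ends with an
    open assistant turn so the model continues from there."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = format_chatml(
    "You are Dolphin, a helpful AI assistant.",  # illustrative system prompt
    "Write a quicksort in Python.",
)
print(prompt)
```

Generation should then be stopped on the `<|im_end|>` token (token_id 2), which is why mapping it to the base model's EOS id aids compatibility with common inference stacks.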
Training Details
- Trained for 3 epochs over 2 days on 4x A100 GPUs, using a full-weights finetune with Axolotl.
Future Enhancements (Dolphin 3.0)
Planned enhancements for Dolphin 3.0 include improved general chat, structured output, agent use cases (such as Autogen, MemGPT, and function calling), and role-playing.