agentlans/Llama3.1-Daredevilish

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Jan 22, 2025 · License: llama3.1 · Architecture: Transformer

agentlans/Llama3.1-Daredevilish is an experimental 8.03 billion parameter Llama 3.1-based model, created by agentlans, that merges top-performing Llama 3.1 8B models on the MMLU-Pro task. It incorporates additional supervised fine-tuning (SFT) to ensure compatibility with the Llama 3 prompt format. The model is intended for research and development, particularly for tasks that benefit from strong reasoning, as indicated by its MMLU-Pro optimization.


Llama 3.1 Daredevilish Overview

agentlans/Llama3.1-Daredevilish is an experimental 8.03 billion parameter Llama 3.1-based model developed by agentlans. It is a merge inspired by mlabonne/Daredevil-8B, combining several of the highest-performing Llama 3.1 8B models on the MMLU-Pro task as of January 21, 2025.

Key Features & Training

  • Merged Architecture: Utilizes mergekit with a dare_ties method to integrate multiple Llama 3.1 8B models, specifically those excelling in MMLU-Pro evaluations.
  • Supervised Fine-Tuning (SFT): The merged model underwent additional SFT using the agentlans/crash-course dataset (1200-row configuration) within LLaMA-Factory. This SFT ensures adherence to the Llama 3 prompt format.
  • Fine-Tuning Approach: Trained for 1 epoch using LoRA with rank 4, alpha 4, and rank-stabilized LoRA (rsLoRA).
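
A dare_ties merge of this kind is typically driven by a mergekit YAML recipe. The card does not publish the actual recipe, so the source models, densities, and weights below are placeholders; this is only a sketch of what such a config looks like.

```yaml
# Hypothetical mergekit recipe; the actual source models and weights
# used for Llama3.1-Daredevilish are not listed in this card.
base_model: meta-llama/Llama-3.1-8B-Instruct
merge_method: dare_ties
dtype: bfloat16
models:
  - model: example-org/llama3.1-8b-mmlu-pro-a   # placeholder source model
    parameters:
      density: 0.5    # fraction of delta weights kept by DARE pruning
      weight: 0.5     # contribution to the TIES merge
  - model: example-org/llama3.1-8b-mmlu-pro-b   # placeholder source model
    parameters:
      density: 0.5
      weight: 0.5
```

A recipe like this is run with mergekit's `mergekit-yaml config.yaml ./merged-model` command, which writes the merged checkpoint to the output directory.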

Usage and Limitations

This model is intended for research and development. Users should be aware that it may fail to end replies properly with certain system prompts, in which case the instruct version (agentlans/Llama3.1-Daredevilish-Instruct) is recommended. It is important to validate outputs and use the model responsibly due to potential biases inherent in language models.
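
Since the SFT step targets the Llama 3 prompt format specifically, prompts sent to this model should follow that template. The snippet below is a minimal sketch that assembles the single-turn format by hand; `build_llama3_prompt` is a hypothetical helper, and in practice `tokenizer.apply_chat_template` from Hugging Face transformers handles this for you.

```python
# Minimal sketch of the Llama 3 instruct prompt format this model expects.
# build_llama3_prompt is a hypothetical helper for illustration only.

def build_llama3_prompt(system: str, user: str) -> str:
    """Assemble a single-turn Llama 3 prompt with the special header tokens."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt("You are a helpful assistant.", "What is MMLU-Pro?")
```

The trailing assistant header leaves the prompt open for the model to complete; a well-behaved reply should terminate with its own `<|eot_id|>` token, which is the behavior the card warns can fail under certain system prompts.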

Performance Snapshot

Evaluations on the Open LLM Leaderboard show an Average score of 25.54%, with specific metrics including:

  • IFEval (0-Shot): 62.92%
  • MMLU-PRO (5-shot): 29.96%

Further evaluation and fine-tuning are suggested for optimizing performance across diverse tasks.