Menlo/llama3-s-2024-07-08

Text Generation · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Jul 8, 2024 · License: apache-2.0 · Architecture: Transformer

Menlo/llama3-s-2024-07-08 is an 8 billion parameter Llama-3 architecture model developed by Homebrew Research. Built upon Meta-Llama-3-8B-Instruct, it natively understands both audio and text inputs. It was fine-tuned on 700 million tokens from the Instruction Speech v1 dataset, making it suitable for research applications focused on sound-text semantics.


Model Overview

Menlo/llama3-s-2024-07-08 is an 8 billion parameter model from the llama3-s family, developed by Homebrew Research. It is built upon the Llama-3 architecture and extends the capabilities of Meta-Llama-3-8B-Instruct by integrating native audio and text understanding. The model processes both sound and text as input to generate text output.

Key Capabilities & Training

This model's primary differentiator is its multimodal input capability, specifically its ability to interpret sound. It was continually trained for 8 hours on a cluster of 8x NVIDIA H100-SXM-80GB GPUs, using 700 million tokens from the Instruction Speech v1 dataset to enhance its sound understanding. Training used the Adam-mini optimizer with a learning rate of 5e-5 and a global batch size of 128. Although still at an early stage, the model shows an emerging grasp of sound-text semantics.
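The stated token budget and batch size let us estimate the scale of the run. This is a back-of-the-envelope sketch: the 700M-token budget and global batch size of 128 come from the card above, while the 4096-token packed sequence length is an assumption for illustration only.

```python
# Rough step-count estimate for the training run described above.
# TOTAL_TOKENS and GLOBAL_BATCH_SIZE are from the model card;
# SEQ_LEN is an assumed packed sequence length, not stated in the card.
TOTAL_TOKENS = 700_000_000
GLOBAL_BATCH_SIZE = 128
SEQ_LEN = 4096

tokens_per_step = GLOBAL_BATCH_SIZE * SEQ_LEN  # tokens consumed per optimizer step
num_steps = TOTAL_TOKENS // tokens_per_step

print(f"~{num_steps} optimizer steps")  # roughly 1,335 steps under these assumptions
```

Under these assumptions the 8-hour run works out to only on the order of a thousand optimizer steps, consistent with a short continual-training pass rather than full pre-training.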

Intended Use Cases

This model family is primarily intended for research applications, particularly those focused on improving and exploring sound understanding capabilities within large language models. Users can convert audio files into sound tokens using the provided Encodec-based Python script before feeding them into the model alongside text. The model is English-only.
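The card does not reproduce the Encodec-based conversion script itself, so the flow can only be sketched. In this sketch, the `<|sound_NNNN|>` token format and the `<|sound_start|>`/`<|sound_end|>` markers are illustrative assumptions, not the project's actual vocabulary; the codec indices stand in for output from EnCodec's first codebook.

```python
# Hedged sketch: render codec indices as special-token text that can be
# interleaved with a text instruction. The token names below are assumed
# for illustration; the real llama3-s script may use a different format.
from typing import List

def codes_to_sound_tokens(codes: List[int]) -> str:
    """Render a sequence of codec indices as sound-token text."""
    body = "".join(f"<|sound_{c:04d}|>" for c in codes)
    return f"<|sound_start|>{body}<|sound_end|>"

def build_prompt(codes: List[int], instruction: str) -> str:
    """Place the encoded audio before the text instruction."""
    return f"{codes_to_sound_tokens(codes)}\n{instruction}"

# Example with made-up indices:
prompt = build_prompt([17, 842, 3], "Transcribe the audio above.")
print(prompt)
```

The model then consumes this combined string as ordinary input and generates English text in response.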