Menlo/llama3-s-v0.1 is an 8-billion-parameter Llama-3-architecture model developed by Homebrew Research, designed to natively understand both audio and text inputs. This model expands on previous checkpoints by incorporating 1.3 billion tokens from the Instruction Speech v1.5 dataset, enhancing its sound-understanding capabilities. It is primarily intended for research applications focused on improving large language models' ability to process and interpret sound alongside text.
Overview
Menlo/llama3-s-v0.1 is an 8 billion parameter model built on the Llama-3 architecture by Homebrew Research. This model is uniquely designed to process and understand both text and audio inputs, generating text outputs. It represents a continuation of the llama3s family, specifically enhancing sound understanding capabilities by training on an additional 1.3 billion tokens from the Instruction Speech v1.5 dataset.
Key Capabilities
- Multimodal Input: Natively understands and processes both text and sound inputs.
- Llama-3 Architecture: Leverages the robust Llama-3 base for strong language understanding.
- Enhanced Sound Understanding: Continually trained to improve its ability to interpret audio, building on previous llama3s checkpoints.
- Research-Oriented: Primarily intended for research applications exploring multimodal LLMs.
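The capabilities above can be exercised with a standard Hugging Face transformers loading pattern. This is a hedged sketch, not an official usage example from the card: the repo id `Menlo/llama3-s-v0.1` is taken from the card, but how raw audio is converted into model inputs is project-specific, so it is represented here as an opaque string.

```python
# Hypothetical usage sketch for Menlo/llama3-s-v0.1 with Hugging Face
# transformers. The audio-to-token step is handled by the project's own
# tooling and is NOT shown here; "audio_repr" stands in for its output.

def build_prompt(audio_repr: str, instruction: str) -> str:
    """Combine an audio representation (e.g. discrete sound tokens produced
    by the project's preprocessing) with a text instruction."""
    return f"{audio_repr}\n{instruction}"

def main() -> None:
    # Heavy imports are kept inside main() so the prompt helper above
    # stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Menlo/llama3-s-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = build_prompt("<audio representation>", "Describe this sound.")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```

Because the model emits text only, the output of `generate` is decoded back to a plain string, matching the text-output behavior described above.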
Training Details
The model underwent continual training for 14 hours on a cluster of 8x NVIDIA H100-SXM-80GB GPUs. Key training arguments included a global batch size of 128, a learning rate of 1.5e-4 with a cosine scheduler, and the Adam optimizer. The training process focused on improving sound-text semantics, as reflected in the reported training loss curve.
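The cosine schedule named above can be sketched in a few lines. The peak learning rate of 1.5e-4 comes from the card; the warmup length, total step count, and minimum learning rate are illustrative assumptions, not values from the card.

```python
# Sketch of a cosine learning-rate schedule with optional linear warmup.
# PEAK_LR matches the card's reported learning rate; everything else
# (warmup_steps, total_steps, min_lr) is an assumption for illustration.
import math

PEAK_LR = 1.5e-4  # from the training arguments in the card

def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# The schedule starts at the peak (1.5e-4 here), passes through half the
# peak-to-min range at the midpoint, and reaches min_lr at the final step.
```

With `warmup_steps=0`, `cosine_lr(0, 1000)` returns the peak rate and `cosine_lr(1000, 1000)` returns the minimum, which is the shape the card's loss-curve description implies.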
Good for
- Multimodal Research: Ideal for researchers exploring the integration of audio and text in LLMs.
- Sound-to-Text Applications: Suitable for experimental use cases requiring an LLM to respond to spoken or environmental audio cues.
- Developing Audio-Aware Agents: A foundational model for building agents that can interpret and react to sound alongside textual instructions.