Model Overview
Menlo/llama3-s-instruct-v0.2 is an 8-billion-parameter instruction-tuned model from Homebrew Research, built on the Llama-3 architecture. Its core innovation is native multimodal input: it processes both text and audio. The model uses WhisperVQ to tokenize audio files into discrete semantic tokens, allowing it to understand sound input the same way it handles text tokens.
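The idea behind vector-quantized audio tokenization can be illustrated with a toy sketch: continuous audio frame features are snapped to the nearest entry in a learned codebook, and the entry's index becomes a discrete token. This is a minimal illustration of the general technique, not WhisperVQ's actual implementation; the codebook here is random for demonstration.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # Pairwise distances: shape (num_frames, codebook_size)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy example: a random 8-entry codebook over 16-dim features
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))

# Four "audio frames" built from codebook rows 3, 1, 3, 5 plus tiny noise
features = codebook[[3, 1, 3, 5]] + 0.01 * rng.normal(size=(4, 16))

token_ids = quantize(features, codebook)
print(token_ids.tolist())  # each frame snaps back to its source entry: [3, 1, 3, 5]
```

In a real VQ tokenizer the codebook is learned during training, so the resulting indices carry semantic content that the language model can consume like ordinary text tokens.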
Key Capabilities
- Multimodal Input: Processes both text and sound inputs simultaneously.
- Sound Understanding: Specifically enhanced for interpreting and responding to audio information.
- Llama-3 Architecture: Benefits from the robust foundation of the Llama-3 family.
- English Language Support: Primarily developed for English language tasks.
Training and Use
The model was continually trained for 6 hours on a cluster of 8x NVIDIA H100 GPUs, using the torchtune library with FSDP2. Training used a global batch size of 128 and a learning rate of 5e-5. This version is primarily intended for research on improving LLM sound understanding. To get started, convert audio files into sound tokens with the provided Python script, then run inference on the model as with any other LLM.
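The two-step workflow above (tokenize audio, then prompt as usual) can be sketched as follows. The special token names and prompt template here are assumptions for illustration; consult the provided conversion script for the model's actual sound-token format.

```python
def build_sound_prompt(sound_token_ids, instruction):
    """Render discrete sound-token ids into a text prompt.

    The <|sound_NNNN|> naming and <|sound_start|>/<|sound_end|> delimiters
    are hypothetical placeholders, not the model's documented vocabulary.
    """
    sound_str = "".join(f"<|sound_{i:04d}|>" for i in sound_token_ids)
    return f"<|sound_start|>{sound_str}<|sound_end|>\n{instruction}"

# Ids as they might come out of the audio-tokenization script
prompt = build_sound_prompt([12, 7, 512], "Transcribe the audio above.")
print(prompt)
```

Once the sound tokens are embedded in the prompt string, the prompt can be passed to any standard text-generation pipeline (for example, a Hugging Face `generate` call), since the model consumes sound tokens exactly like text tokens.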