glenn2/LFG-1
LFG-1 (Listening Fusion Gemma) by glenn2 is a 26-billion-parameter multimodal conversational audio-language model that bridges a Gemma 4 E2B audio encoder and a Gemma 4 26B-A4B text model through a custom projection layer. This architecture lets it process raw acoustic features directly, capturing conversational nuances such as pacing and tone, for real-time audio applications. It is optimized for local execution on Apple Silicon with 64 GB or more of unified memory, targets native speech-to-response generation with streaming text output, and retains Gemma 4's core text reasoning capabilities.
LFG-1: Multimodal Conversational Audio-Language Model
LFG-1 (Listening Fusion Gemma) is a 26-billion-parameter multimodal model developed by glenn2 as a personal learning project. It connects a Gemma 4 E2B audio encoder to a Gemma 4 26B-A4B text model through a custom-trained projection layer. With this design the model ingests raw acoustic features directly instead of relying on a traditional Speech-to-Text (STT) pipeline, so conversational cues that a transcript would discard, such as pacing, pauses, and tone, remain available to the text model.
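The role of the projection layer can be illustrated with a minimal sketch. It assumes the projector is a single linear map from the audio encoder's feature width to the text model's embedding width; the card does not describe the projector's internal structure, and the dimensions, function names, and random weights below are hypothetical placeholders, not values from LFG-1.

```python
import random

# Hypothetical sizes for illustration only; the real widths are not given in the card.
AUDIO_DIM = 8    # width of one audio-encoder output frame (assumed)
TEXT_DIM = 16    # width of one text-model input embedding (assumed)


def make_projection(in_dim, out_dim, seed=0):
    """Create a randomly initialised linear layer (weight matrix and zero bias)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(frames, weight, bias):
    """Map each audio frame (length in_dim) to a vector of length out_dim."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weight, bias)]
        for frame in frames
    ]


weight, bias = make_projection(AUDIO_DIM, TEXT_DIM)

# Five fake audio-encoder frames standing in for real acoustic features.
audio_frames = [[0.1 * (t + i) for i in range(AUDIO_DIM)] for t in range(5)]
audio_tokens = project(audio_frames, weight, bias)

# Three fake text embeddings standing in for an embedded prompt.
text_embeddings = [[0.0] * TEXT_DIM for _ in range(3)]

# After projection, audio frames and text embeddings share one width,
# so they can be concatenated into a single input sequence.
sequence = audio_tokens + text_embeddings
```

The point of the sketch is the shape contract: once audio frames are projected to the text model's embedding width, they can sit in the same sequence as ordinary text embeddings, which is why no intermediate transcript is needed.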
Key Capabilities & Features
- Native Audio Understanding: Processes raw acoustic data for nuanced conversational interaction.
- Real-time Conversational Audio: Designed for immediate speech-to-response generation.
- Streaming Text Output: Provides continuous text responses.
- Multimodal Support: Capable of handling simultaneous audio and image inputs.
- Preserved Text Reasoning: The core Gemma 4 text backbone remains frozen during audio-projection training, ensuring its original language and reasoning capabilities are maintained.
- Apple Silicon Optimized: Built with the MLX framework (mlx-vlm) for efficient local execution on Mac hardware.
Intended Use & Requirements
LFG-1 is primarily intended for real-time conversational audio applications running locally on Apple Silicon. The current audio projection layer is trained exclusively on English. Because the combined weights total approximately 48 GB, the model requires at least 64 GB of unified memory on Apple Silicon devices and about 50 GB of disk space. The project is actively evolving, with plans to extend the audio training data to additional languages.
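As a rough pre-flight check, the figures above can be turned into a small script. This is a sketch under stated assumptions: the three constants are taken from the requirements in this card, while the function names are hypothetical and the unified memory size is passed in by hand rather than detected.

```python
import shutil

# Figures quoted in this card.
WEIGHTS_GB = 48             # approximate combined weight size
MIN_UNIFIED_MEMORY_GB = 64  # minimum unified memory
MIN_DISK_GB = 50            # approximate disk space needed


def memory_headroom_gb(unified_memory_gb, weights_gb=WEIGHTS_GB):
    """Memory left after loading the weights, for the OS, caches, and activations."""
    return unified_memory_gb - weights_gb


def free_disk_gb(path="."):
    """Free space on the volume containing `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def meets_requirements(unified_memory_gb, disk_gb):
    """True if both the memory and disk minimums from the card are satisfied."""
    return unified_memory_gb >= MIN_UNIFIED_MEMORY_GB and disk_gb >= MIN_DISK_GB


if __name__ == "__main__":
    memory_gb = 64  # set this to the machine's unified memory size
    print("headroom (GB):", memory_headroom_gb(memory_gb))
    print("requirements met:", meets_requirements(memory_gb, free_disk_gb()))
```

On a 64 GB machine the weights alone leave about 16 GB of headroom, which has to cover the operating system and everything the model allocates at run time, so the stated minimum leaves little slack.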