Model Overview
This model is an instruction-tuned variant from Meta's Llama 3.2 collection, with 3.21 billion parameters and a 128K-token context length. It is built on an optimized transformer architecture that uses Grouped-Query Attention (GQA) for improved inference scalability. The model was aligned using supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to match human preferences for helpfulness and safety, particularly in multilingual contexts.
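GQA's inference benefit comes from sharing key/value heads across groups of query heads, which shrinks the KV cache. A minimal back-of-envelope sketch, assuming the commonly published Llama 3.2 3B hyperparameters (28 layers, 24 query heads, 8 KV heads, head dimension 128 — verify against the model's config before relying on them):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV-cache size: keys + values across all layers.

    The leading factor of 2 accounts for storing both K and V;
    dtype_bytes=2 assumes fp16/bf16 activations.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Standard multi-head attention would cache one K/V pair per query head (24);
# GQA caches only the 8 shared KV heads.
mha = kv_cache_bytes_per_token(n_layers=28, n_kv_heads=24, head_dim=128)
gqa = kv_cache_bytes_per_token(n_layers=28, n_kv_heads=8, head_dim=128)
print(mha // gqa)  # the cache shrinks by the query-to-KV head ratio: 3
```

At long contexts this 3x reduction in cache size translates directly into lower memory use and less bandwidth per decoded token.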
Key Capabilities
- Multilingual Dialogue: Optimized for conversational AI in English, Hindi, German, French, Italian, Portuguese, Spanish, and Thai.
- Agentic Applications: Designed for agentic tasks such as knowledge retrieval and summarization, and for on-device uses like mobile AI-powered writing assistants.
- Quantization Support: Includes quantized versions (SpinQuant and QLoRA) for efficient deployment in constrained environments such as mobile devices, delivering significantly faster decoding and a smaller memory footprint.
- Robust Safety Alignment: Incorporates comprehensive safety fine-tuning, including tuned refusal behavior and tone, and is intended to be deployed alongside additional system-level safeguards.
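To see why quantization matters for constrained devices, a rough weight-memory estimate for a 3.21B-parameter model at different precisions can be sketched as follows (this ignores activation and KV-cache memory and any quantization overhead such as scale factors, so treat it as a lower bound):

```python
PARAMS = 3.21e9  # parameter count from the model overview above

def weight_gib(bits_per_param):
    """Approximate weight storage in GiB at the given precision."""
    return PARAMS * bits_per_param / 8 / 2**30

for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_gib(bits):.2f} GiB")
```

The bf16 weights alone come to roughly 6 GiB, which exceeds the RAM budget of many phones; a 4-bit scheme brings that to about 1.5 GiB, which is what makes on-device deployment of the quantized variants practical.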
Good For
- Developing assistant-like chat applications requiring multilingual support.
- Implementing agentic systems for information retrieval and summarization.
- Deploying LLM capabilities on devices with limited compute resources, leveraging its optimized quantized versions.
- Researching safety fine-tuning and robust model deployment strategies.
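For the assistant-style chat use case above, instruction-tuned models of this kind are typically driven with a structured list of role-tagged messages, which a serving library (for example, Hugging Face transformers' chat templating) renders into the model's prompt format. A minimal sketch of that structure, with model loading and generation omitted:

```python
# Conversation state as passed to a chat-templating API: an ordered list of
# {"role", "content"} dicts. The system message steers behavior; user and
# assistant turns alternate after it.
messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "Fasse diesen Text in einem Satz zusammen."},
    # German: "Summarize this text in one sentence." -- one of the
    # supported dialogue languages listed above.
]

# Each entry carries exactly a role and a content field.
assert all(set(m) == {"role", "content"} for m in messages)
```

The assistant's reply would be appended as a `{"role": "assistant", ...}` entry before the next user turn, keeping the full dialogue history in the list.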