Model Overview
This model is an instruction-tuned variant from Meta's Llama 3.2 collection, with 3.21 billion parameters and a 128K-token context length. It is built on an optimized transformer architecture that uses Grouped-Query Attention (GQA) for improved inference scalability. The model was aligned using supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to match human preferences for helpfulness and safety, particularly in multilingual contexts.
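GQA's inference benefit comes from sharing key/value heads across groups of query heads, which shrinks the KV cache. A minimal back-of-envelope sketch, assuming the commonly published Llama 3.2 3B hyperparameters (28 layers, 24 query heads, 8 KV heads, head dimension 128 — verify against the model's config before relying on them):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV-cache size: keys + values across all layers.

    The leading factor of 2 accounts for storing both K and V;
    dtype_bytes=2 assumes fp16/bf16 activations.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Standard multi-head attention would cache one K/V pair per query head (24);
# GQA caches only the 8 shared KV heads.
mha = kv_cache_bytes_per_token(n_layers=28, n_kv_heads=24, head_dim=128)
gqa = kv_cache_bytes_per_token(n_layers=28, n_kv_heads=8, head_dim=128)
print(mha // gqa)  # the cache shrinks by the query-to-KV head ratio: 3
```

At long contexts this 3x reduction in cache size translates directly into lower memory use and less bandwidth per decoded token.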
Key Capabilities
- Multilingual Dialogue: Optimized for conversational AI in English, Hindi, German, French, Italian, Portuguese, Spanish, and Thai.
- Agentic Applications: Designed for agentic tasks such as knowledge retrieval and summarization, and for on-device uses like mobile AI-powered writing assistants.
- Quantization Support: Includes quantized versions (SpinQuant and QLoRA) for efficient deployment in constrained environments such as mobile devices, delivering significantly faster decoding and a smaller memory footprint.
- Robust Safety Alignment: Incorporates comprehensive safety fine-tuning, including tuned refusal behavior and tone, and is intended to be deployed alongside additional system-level safeguards.
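To see why quantization matters for constrained devices, a rough weight-memory estimate for a 3.21B-parameter model at different precisions can be sketched as follows (this ignores activation and KV-cache memory and any quantization overhead such as scale factors, so treat it as a lower bound):

```python
PARAMS = 3.21e9  # parameter count from the model overview above

def weight_gib(bits_per_param):
    """Approximate weight storage in GiB at the given precision."""
    return PARAMS * bits_per_param / 8 / 2**30

for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_gib(bits):.2f} GiB")
```

The bf16 weights alone come to roughly 6 GiB, which exceeds the RAM budget of many phones; a 4-bit scheme brings that to about 1.5 GiB, which is what makes on-device deployment of the quantized variants practical.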
Good For
- Developing assistant-like chat applications requiring multilingual support.
- Implementing agentic systems for information retrieval and summarization.
- Deploying LLM capabilities on devices with limited compute resources, leveraging its optimized quantized versions.
- Researching safety fine-tuning and robust model deployment strategies.
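For the assistant-style chat use case above, instruction-tuned models of this kind are typically driven with a structured list of role-tagged messages, which a serving library (for example, Hugging Face transformers' chat templating) renders into the model's prompt format. A minimal sketch of that structure, with model loading and generation omitted:

```python
# Conversation state as passed to a chat-templating API: an ordered list of
# {"role", "content"} dicts. The system message steers behavior; user and
# assistant turns alternate after it.
messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "Fasse diesen Text in einem Satz zusammen."},
    # German: "Summarize this text in one sentence." -- one of the
    # supported dialogue languages listed above.
]

# Each entry carries exactly a role and a content field.
assert all(set(m) == {"role", "content"} for m in messages)
```

The assistant's reply would be appended as a `{"role": "assistant", ...}` entry before the next user turn, keeping the full dialogue history in the list.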