FritzStack/IRF-Llama-3.2-3B_4bit-merged-mlx-fp16

Text Generation · Model Size: 3.2B · Quant: BF16 · Context Length: 32k · Concurrency Cost: 1 · Published: Feb 25, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

FritzStack/IRF-Llama-3.2-3B_4bit-merged-mlx-fp16 is a 3.2 billion parameter Llama-3.2 model from FritzStack, converted to the MLX format for efficient deployment on Apple Silicon. It is a 4-bit merged variant, optimized for inference speed and a reduced memory footprint, and is intended for general-purpose text generation where a capable yet resource-efficient model is needed on MLX-compatible hardware.


Model Overview

FritzStack/IRF-Llama-3.2-3B_4bit-merged-mlx-fp16 is a 3.2 billion parameter language model based on the Llama-3.2 architecture, developed by FritzStack and converted to the MLX format for optimized inference on Apple Silicon. As a 4-bit merged variant, it trades some numerical precision for lower memory consumption and faster inference.

Key Characteristics

  • Architecture: Llama-3.2 base model.
  • Parameter Count: 3.2 billion parameters.
  • Quantization: 4-bit merged for efficiency.
  • Format: Converted to MLX format using mlx-lm version 0.29.1, enabling native execution on Apple Silicon.
  • Context Length: Supports a context length of 32768 tokens.
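Because the model ships in MLX format, it can be loaded directly with the `mlx-lm` Python package. The sketch below is illustrative, not an official quickstart from this repository: it assumes `mlx-lm` is installed (`pip install mlx-lm`) and that you are on Apple Silicon; the prompt text is made up, and the repo id is the one named in this card.

```python
# Minimal local-inference sketch with mlx-lm (assumed installed; Apple Silicon only).
from mlx_lm import load, generate

# Downloads (or reuses a cached copy of) the MLX weights and tokenizer.
model, tokenizer = load("FritzStack/IRF-Llama-3.2-3B_4bit-merged-mlx-fp16")

prompt = "Summarize the benefits of 4-bit quantization in one sentence."

# Use the chat template if the tokenizer defines one (Llama-3.2 chat models do).
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```

The `load`/`generate` pair is the standard `mlx-lm` entry point; swapping in a different repo id is all that is needed to run another MLX-converted model.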

Use Cases

This model is particularly well-suited for developers and applications that require a capable language model to run efficiently on Apple Silicon hardware. Its 4-bit quantization and MLX conversion make it ideal for:

  • Local inference on macOS devices.
  • Applications where memory footprint and inference speed are critical.
  • General text generation, summarization, and question-answering tasks within resource-constrained environments.
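The memory-footprint claim above can be sanity-checked with back-of-envelope arithmetic. This is an illustrative estimate of weight storage only (not a measurement of this model), ignoring KV cache, activations, and quantization scale overhead:

```python
# Illustrative weight-memory estimate; real usage adds KV cache and overhead.
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(3.2e9, 16)  # 16-bit weights
q4 = weight_memory_gb(3.2e9, 4)     # 4-bit weights
print(f"fp16: {fp16:.1f} GB, 4-bit: {q4:.1f} GB")
# → fp16: 6.4 GB, 4-bit: 1.6 GB
```

At roughly 1.6 GB of weights, a 4-bit 3.2B model fits comfortably in the unified memory of entry-level Apple Silicon machines, which is what makes local macOS inference practical.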