SC117/QwenPaw-Flash-9B-heretic
SC117/QwenPaw-Flash-9B-heretic is a 9 billion parameter dense language model, fine-tuned from Qwen3.5-9B using the Heretic methodology. It supports Multi-Token Prediction (MTP) for accelerated inference, with a reported decode throughput speedup of roughly 1.5-1.9x. It is intended for general text generation and is available in F32 safetensors and GGUF formats.
Overview
SC117/QwenPaw-Flash-9B-heretic is a 9 billion parameter dense language model, derived from the Qwen3.5-9B base model. It has been fine-tuned using the "Heretic" methodology and is provided in F32 (float32) safetensors format.
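For rough sizing, the weight footprint is the parameter count times the bytes per parameter. The sketch below assumes a nominal count of exactly 9 billion parameters (the true count may differ) and covers weights only, not the KV cache, activations, or the MTP head. The helper name `weight_bytes` is hypothetical and used only for illustration.

```python
# Bytes per parameter for a few common storage dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumption: nominal "9 billion" taken literally; the real count may differ.
NOMINAL_PARAMS = 9_000_000_000


def weight_bytes(num_params: int, dtype: str) -> int:
    """Return the storage needed for the weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16"):
        size = weight_bytes(NOMINAL_PARAMS, dtype)
        # Report both decimal gigabytes and binary gibibytes.
        print(f"{dtype}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Under these assumptions the F32 weights alone come to about 36 GB, so the quantized GGUF builds are likely the more practical choice on consumer hardware.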
Key Capabilities & Features
- Base Model: Built on the Qwen3.5-9B architecture.
- Precision: Main weights are in F32 (float32) for high fidelity.
- Multi-Token Prediction (MTP): Includes an MTP head (extracted from Qwen3.5-9B) that predicts multiple future tokens in a single forward pass. These predictions are used for speculative decoding, which speeds up generation:
  - MTP acceptance rate of approximately 43%.
  - Decode throughput speedup of roughly 1.5-1.9x.
- GGUF Versions: Available in both standard GGUF and MTP-enabled GGUF formats for compatibility with inference engines like llama.cpp, Ollama, and LM Studio.
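To give a feel for how an acceptance rate relates to decode speed, here is a minimal sketch of the standard idealized estimate for speculative decoding. It assumes each draft token is accepted independently with the same probability and that drafting stops at the first rejection. It ignores the cost of running the MTP head, and this card does not state how many draft tokens were used or exactly how the acceptance rate was measured, so the numbers are illustrative and not a reproduction of the reported 1.5-1.9x. The function name and the `num_draft` parameter are hypothetical.

```python
def expected_tokens_per_step(accept_rate: float, num_draft: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Idealized model (an assumption, not the card's measurement method):
    each draft token is accepted independently with probability
    `accept_rate`, and the first rejection discards the remaining drafts.
    The verification pass always contributes one token of its own, so the
    expectation is 1 + a + a^2 + ... + a^num_draft.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if num_draft < 0:
        raise ValueError("num_draft must be non-negative")
    return sum(accept_rate ** i for i in range(num_draft + 1))


if __name__ == "__main__":
    # 0.43 is the approximate acceptance rate quoted in this card.
    for k in (1, 2, 3):
        print(f"draft tokens={k}: {expected_tokens_per_step(0.43, k):.2f} tokens/step")
```

With no draft tokens the estimate falls back to one token per step, which is ordinary autoregressive decoding. Measured throughput also depends on the inference engine, batch size, and hardware.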
When to Use This Model
This model suits developers who want a 9B parameter model with faster decoding for text generation. Consider it when your use case benefits from Qwen3.5-9B's capabilities and from the decode speedup that MTP-based speculative decoding provides.