AISafety-Student/Phi-4-reasoning-heretic

Text Generation · Concurrency Cost: 1 · Model Size: 14.7B · Quant: FP8 · Ctx Length: 32k · Published: Apr 2, 2026 · License: MIT · Architecture: Transformer · Open Weights

AISafety-Student/Phi-4-reasoning-heretic is a 14.7 billion parameter, dense decoder-only Transformer model, a decensored version of Microsoft Research's Phi-4-reasoning. It was trained with supervised fine-tuning on chain-of-thought traces and with reinforcement learning, and is optimized for advanced reasoning in math, science, and coding. With a 32k-token context length, the model is designed for memory- and compute-constrained, latency-bound environments, and produces structured responses consisting of a reasoning chain-of-thought block followed by a summarization block.
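Below is a minimal inference sketch, assuming the checkpoint loads with the standard Hugging Face Transformers AutoModelForCausalLM/AutoTokenizer classes like the base Phi-4-reasoning; the prompt and generation settings are illustrative placeholders, not recommended defaults.

```python
# Minimal inference sketch (not verified against this specific checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AISafety-Student/Phi-4-reasoning-heretic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "How many positive divisors does 360 have?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models typically need a generous token budget, since the
# chain-of-thought block precedes the final summary.
outputs = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(generated_text)
```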


Overview

AISafety-Student/Phi-4-reasoning-heretic is a 14.7 billion parameter, dense decoder-only Transformer model, derived from Microsoft Research's Phi-4-reasoning. This version has been decensored using Heretic v1.2.0, reducing refusals from 66/100 for the original model to 52/100, with a KL divergence of 0.0049 from the original.

Key Capabilities

  • Advanced Reasoning: Fine-tuned on a blend of synthetic prompts and high-quality filtered data focusing on math, science, and coding skills, as well as alignment data for safety and Responsible AI.
  • Structured Outputs: Generates responses with a distinct reasoning chain-of-thought block followed by a summarization block (see the parsing sketch after this list).
  • Optimized for Efficiency: Designed for memory- and compute-constrained, latency-bound environments.
  • Long Context: Supports a context length of 32,768 tokens, allowing for more complex queries and longer chain-of-thought processes.
  • Performance: Achieves strong results across various reasoning benchmarks, including AIME (75.3% on AIME 2024), OmniMath (76.6%), GPQA-Diamond (65.8%), and LiveCodeBench (53.8%), often outperforming significantly larger open-weight models.
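As referenced in the Structured Outputs item above, a small helper can separate the reasoning block from the final summary. This sketch assumes the model keeps the base Phi-4-reasoning convention of wrapping the chain-of-thought in `<think>...</think>` tags before the answer; adjust the delimiters if this checkpoint differs.

```python
# Split a response into (reasoning, summary), assuming <think>...</think>
# delimiters around the chain-of-thought block.
import re

def split_response(text: str) -> tuple[str, str]:
    """Return (reasoning, summary); reasoning is empty if no <think> block is found."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    summary = text[match.end():].strip()
    return reasoning, summary

# generated_text comes from the inference sketch in the Overview above.
reasoning, answer = split_response(generated_text)
print("Reasoning (truncated):", reasoning[:200])
print("Final answer:", answer)
```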

Good For

  • Research on Language Models: Designed to accelerate research on language models and to serve as a building block for generative AI-powered features.
  • General Purpose AI Systems: Suitable for applications requiring strong reasoning and logic capabilities, primarily in English.
  • Math, Science, and Coding Tasks: Excels in domains requiring detailed problem-solving and logical deduction.
  • Environments with Constraints: Ideal for scenarios where memory, compute, or latency are critical factors (see the serving sketch after this list).
  • Instruction Following: Demonstrates improved instruction following (IFEval Strict: 83.4%) and functional code generation (HumanEvalPlus: 92.9%) compared to its base model.
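For the constrained, latency-bound deployments mentioned above, one common option is to serve the model with vLLM. The sketch below is hedged: it assumes the checkpoint is vLLM-compatible in the same way as the base Phi-4-reasoning, and the sampling parameters are placeholders.

```python
# Serving sketch with vLLM (one option among several); not verified against
# this specific checkpoint.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "AISafety-Student/Phi-4-reasoning-heretic"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a chat-formatted prompt string; vLLM handles batching and KV-cache
# management, which matters in latency-bound settings.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain why the sum of two odd numbers is even."}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=model_id, max_model_len=32768)
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=4096)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```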