AscendKernelGen/KernelGen-LM-4B

Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Jan 23, 2026 · License: apache-2.0 · Architecture: Transformer

KernelGen-LM-4B is a domain-adaptive large language model developed by AscendKernelGen, specialized for generating low-level NPU kernels for the Huawei Ascend architecture using AscendC. Built on the Qwen3-4B backbone, it is trained on the Ascend-CoT dataset and refined with reinforcement learning using execution feedback. This model excels at hardware-specific code generation, demonstrating significant improvements in complex kernel implementation compared to general-purpose LLMs. Its primary use is to automate and optimize the creation of NPU kernels, bridging the gap between high-level code generation and hardware-specific programming.


Overview

AscendKernelGen/KernelGen-LM-4B is a specialized large language model designed for generating highly optimized low-level NPU (Neural Processing Unit) kernels for Huawei Ascend hardware, utilizing the AscendC programming language. Developed by AscendKernelGen, this model is built upon the Qwen3-4B architecture and has undergone extensive domain-adaptive post-training.
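The card does not specify a prompt format, so the following is a hedged sketch of how one might structure a kernel-generation request in the common chat-message style; the system/user wording, the `build_kernel_prompt` helper, and the `vec_add` signature are all illustrative assumptions, not part of the model's documented interface.

```python
def build_kernel_prompt(op_name, signature, constraints):
    """Assemble a chat-style request for an AscendC kernel.
    (Hypothetical prompt format -- the model card does not define one.)"""
    system = ("You are an expert in AscendC, the kernel programming language "
              "for Huawei Ascend NPUs. Produce a complete, compilable kernel.")
    user = (f"Implement the operator `{op_name}` with this signature:\n"
            f"{signature}\n"
            f"Constraints: {constraints}")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

# Example request for a simple element-wise kernel (signature is illustrative).
messages = build_kernel_prompt(
    "vec_add",
    'extern "C" __global__ __aicore__ void vec_add(GM_ADDR x, GM_ADDR y, GM_ADDR z);',
    "float16 inputs, 2048-element tiles",
)
```

With the standard Hugging Face `transformers` chat interface, such a message list would typically be passed through the tokenizer's `apply_chat_template` before generation.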

Key Capabilities

  • Domain-Specific Code Generation: Excels at producing correct and efficient AscendC code for NPU kernels, a task where general-purpose LLMs often fail.
  • Chain-of-Thought (CoT) Reasoning: Leverages the proprietary Ascend-CoT dataset, which incorporates documentation-based, code-centric, and general reasoning chains to understand complex NPU programming logic.
  • Reinforcement Learning with Execution Feedback: Applies Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) on execution-derived correctness and performance signals, steering generation toward code that both works and runs fast.
  • Hardware-Grounded Evaluation: Validated using NPUKernelBench, a comprehensive benchmark assessing compilation, functional correctness, and latency on real Ascend hardware.
  • Improved Complex Kernel Handling: Demonstrates significant qualitative and quantitative improvements in generating complex Level-2 kernels and handling intricate tiling strategies compared to baseline models.
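The execution-feedback stage described above can be pictured as turning hardware results into DPO preference pairs. A minimal sketch, assuming a scalar reward that gates on compilation and correctness before favoring lower latency (the `KernelResult` fields, `reward` shape, and pair format are illustrative assumptions, not the published training recipe):

```python
from dataclasses import dataclass

@dataclass
class KernelResult:
    compiled: bool      # did the AscendC kernel build?
    correct: bool       # did outputs match the reference?
    latency_us: float   # measured on hardware

def reward(r: KernelResult) -> float:
    """Hypothetical scalar reward: failing to compile is worst,
    incorrect output is neutral, correct kernels score higher when faster."""
    if not r.compiled:
        return -1.0
    if not r.correct:
        return 0.0
    return 1.0 + 1.0 / r.latency_us

def to_preference_pair(prompt, cand_a, cand_b, res_a, res_b):
    """Order two sampled kernels into a (chosen, rejected) DPO pair."""
    ra, rb = reward(res_a), reward(res_b)
    if ra == rb:
        return None  # tie: no preference signal
    if ra > rb:
        return {"prompt": prompt, "chosen": cand_a, "rejected": cand_b}
    return {"prompt": prompt, "chosen": cand_b, "rejected": cand_a}

# A correct-but-slower kernel still beats an incorrect one.
pair = to_preference_pair(
    "implement vec_add", "kernel_a", "kernel_b",
    KernelResult(compiled=True, correct=True, latency_us=10.0),
    KernelResult(compiled=True, correct=False, latency_us=5.0),
)
```

Pairs in this `{"prompt", "chosen", "rejected"}` shape are the form DPO trainers commonly consume.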

Good For

  • Automating NPU Kernel Development: Ideal for developers and researchers working on Huawei Ascend platforms who need to generate optimized low-level kernels.
  • Bridging Hardware-Software Gaps: Useful for tasks requiring precise, hardware-specific code generation, where general-purpose LLMs typically fall short.
  • Research in Domain-Adaptive LLMs: Provides a strong example and framework for developing LLMs specialized in highly technical and hardware-specific domains.