AscendKernelGen/KernelGen-LM-1.7B

Text Generation · Concurrency Cost: 1 · Model Size: 2B · Quant: BF16 · Ctx Length: 32K · Published: Jan 23, 2026 · License: apache-2.0 · Architecture: Transformer

KernelGen-LM-1.7B by AscendKernelGen is a domain-adaptive large language model built on the Qwen3-1.7B backbone, specialized for low-level NPU kernel generation for the Huawei Ascend architecture using AscendC. It is trained on the Ascend-CoT dataset and refined with reinforcement learning using execution feedback. This model excels at generating complex Level-2 NPU kernels, addressing tasks where general-purpose models typically fail. Its primary use case is to bridge the gap between general code generation and hardware-specific programming for neural processing units.

AscendKernelGen/KernelGen-LM-1.7B Overview

KernelGen-LM-1.7B is a specialized large language model developed by AscendKernelGen for generating low-level NPU kernels targeting the Huawei Ascend architecture in the AscendC programming language. Built on the Qwen3-1.7B foundation, the model undergoes a two-stage domain-adaptive post-training process: Supervised Fine-Tuning (SFT) with error-derived supervision, followed by Reinforcement Learning (RL) via Direct Preference Optimization (DPO) driven by execution-based correctness and performance signals.
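As a sketch of how such a model might be invoked, the snippet below loads the checkpoint through Hugging Face `transformers` and prompts it for an AscendC kernel. The repo id comes from this card, but the prompt wording, the `format_kernel_request` helper, and the generation settings are illustrative assumptions, not a documented interface.

```python
# Hedged sketch: prompting KernelGen-LM-1.7B for an AscendC kernel.
# The prompt format below is an assumption; consult the model card
# for any officially recommended prompting style.


def format_kernel_request(op_name: str, description: str) -> str:
    """Build a plain-text kernel-generation request (assumed format)."""
    return (
        f"Write an AscendC kernel named `{op_name}`.\n"
        f"Specification: {description}\n"
        "Target: Huawei Ascend NPU. Include tiling and multi-core logic."
    )


def generate_kernel(prompt: str, max_new_tokens: int = 1024) -> str:
    """Run generation; requires `transformers` and the model weights."""
    # Imported lazily so the prompt helper stays usable without the
    # heavy dependencies installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "AscendKernelGen/KernelGen-LM-1.7B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    prompt = format_kernel_request(
        "vector_add",
        "Element-wise addition of two float16 tensors of equal shape.",
    )
    print(prompt)
```

The lazy import keeps the prompt-building step lightweight, so request formatting can be tested or batched without loading the 1.7B-parameter weights.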

Key Capabilities & Innovations

  • Domain-Specific Training: Utilizes the high-quality, domain-specific Ascend-CoT Dataset, which incorporates Chain-of-Thought (CoT) reasoning from documentation, code-centric analysis, and general reasoning chains.
  • Hardware-Grounded Evaluation: Performance is rigorously validated using NPUKernelBench, a comprehensive benchmark assessing compilation success, functional correctness, and latency on real Ascend hardware.
  • Enhanced Kernel Generation: Demonstrates significant qualitative improvement in generating complex NPU kernels, particularly for Level-2 tasks, by accurately understanding AscendC-specific APIs, data layout constraints, and multi-core parallelization strategies like tiling.
  • Superior Performance: Outperforms general-purpose models (e.g., Qwen3, Llama3.1) on complex NPU kernel generation, effectively solving tasks where baselines completely fail.

Ideal Use Cases

This model is ideal for developers and researchers focused on:

  • Automating the generation of highly optimized, hardware-specific kernels for Huawei Ascend NPUs.
  • Bridging the gap between high-level AI models and low-level hardware programming.
  • Developing and optimizing custom operations for neural processing units where precise control over hardware resources is critical.