AscendKernelGen/KernelGen-LM-32B-RL: Specialized NPU Kernel Generation
KernelGen-LM-32B-RL is a specialized large language model developed by AscendKernelGen, designed to generate low-level NPU kernels for the Huawei Ascend architecture in the AscendC programming language. Built on the Qwen3-32B backbone, the model is optimized in two stages, making it well suited to hardware-specific code generation.
Key Capabilities and Innovations
- Domain-Specific Training: The model is trained on the Ascend-CoT dataset, which incorporates Chain-of-Thought (CoT) reasoning from documentation, real-world kernel implementations, and general reasoning chains, specifically tailored for low-level NPU programming.
- Two-Stage Optimization: A Supervised Fine-Tuning (SFT) phase with error-derived supervision corrects API misuse and numerical errors, followed by a Reinforcement Learning (RL) phase using Direct Preference Optimization (DPO), driven by execution-based correctness and performance signals.
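The model card does not publish training code, but the DPO objective named above is standard. The following is a minimal sketch of the per-pair DPO loss, where preference pairs could plausibly come from execution signals (a compiling, numerically correct kernel preferred over a failing one); all function and variable names here are illustrative, not from the released training pipeline.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_* are summed token log-probabilities of the preferred (chosen)
    and dispreferred (rejected) completion under the policy being
    trained; ref_logp_* are the same quantities under the frozen SFT
    reference model. beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen completion than the reference model does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy cleanly prefers
    # the chosen (e.g. compiling, correct) kernel.
    return math.log1p(math.exp(-margin))
```

In this setup the "reward model" is implicit: driving the margin up pushes probability mass toward kernels that compiled and passed correctness checks, relative to the SFT reference.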
- Hardware-Grounded Evaluation: Performance is rigorously validated using NPUKernelBench, a comprehensive benchmark that assesses compilation success, functional correctness, and latency on actual Ascend hardware.
- Significant Performance Improvement: The model demonstrates substantial improvements on complex Level-2 kernels, achieving a 95.5% compilation success rate (Pass@10) and 64.3% functional correctness, effectively solving tasks where general-purpose models like Qwen3 and Llama3.1 fail completely.
Good for
- Developers and researchers working on Huawei Ascend NPU kernel development.
- Automating the generation of AscendC code for hardware acceleration.
- Tasks requiring high compilation success and functional correctness in low-level hardware programming.
- Bridging the gap between general-purpose code generation and hardware-specific programming.