lunahr/CeluneNorm-0.6B-v1.2
CeluneNorm-0.6B-v1.2 by lunahr is a lightweight 0.8 billion parameter text normalization model, fine-tuned from Qwen3-0.6B-Base and designed for text-to-speech (TTS) and general preprocessing pipelines. It converts poorly formatted input into clean, readable text while preserving the original meaning and avoiding sentence rewriting. Optimized for conservative text correction, it handles mixed text well and keeps domain-specific tokens intact. This version improves casing accuracy.
Overview
CeluneNorm-0.6B-v1.2 is a compact 0.8 billion parameter text normalization model developed by lunahr, based on Qwen3-0.6B-Base. Its primary function is to clean and standardize poorly formatted text for applications like text-to-speech (TTS) and general text preprocessing. This model is designed to be conservative, focusing on preserving the original meaning and structure of the input without rewriting sentences or altering domain-specific tokens such as URLs or names. Version 1.2 specifically enhances casing accuracy compared to its predecessor.
Key Capabilities
- Lightweight and Efficient: A 0.8B parameter model suitable for integration into various pipelines.
- Conservative Normalization: Prioritizes meaning preservation, avoiding aggressive corrections or sentence rewrites.
- Improved Casing: Version 1.2 offers better capitalization for specific names and phrases.
- Deterministic Output: Decodes without sampling, so the same input yields the same output.
- Handles Mixed Text: Capable of processing natural language alongside technical content.
- Multi-Sentence Support: Can normalize multiple sentences when boundaries are clear.
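Because the model is meant to leave domain-specific tokens such as URLs untouched, a pipeline can verify that property after normalization. The following is a minimal sketch of such a guard. The names `URL_PATTERN`, `extract_urls`, and `preserves_urls` are hypothetical helpers, and the regular expression is a simplifying assumption rather than anything the model itself uses.

```python
import re

# Simple URL pattern; an assumption for illustration, not the model's own logic.
URL_PATTERN = re.compile(r"https?://[^\s]+")


def extract_urls(text: str) -> list[str]:
    """Return URLs in order of appearance, with trailing punctuation stripped."""
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]


def preserves_urls(source: str, normalized: str) -> bool:
    """True if the normalized text contains exactly the same URLs as the source."""
    return extract_urls(source) == extract_urls(normalized)


raw = "check https://example.com/docs its rly good"
good = "Check https://example.com/docs, it's really good."
bad = "Check https://Example.com/docs, it's really good."

print(preserves_urls(raw, good))  # URL unchanged
print(preserves_urls(raw, bad))   # URL casing was altered
```

A check like this can act as a fallback trigger: if it fails, the pipeline might pass the original text through unchanged instead of the normalized version.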
Training and Performance
The model was fine-tuned from Qwen3-0.6B-Base on a mixed dataset of formal, conversational, and synthetic text, plus an additional 10k rows aimed at improving casing. Training ran for 3 epochs, followed by 1 further epoch for casing, and reached a mean token accuracy of 97.53% (99.77% for the casing CFT). Although token accuracy is high, real-world human-level correctness is estimated at 90–95%.
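To make the token accuracy figure concrete, here is a minimal sketch of how such a metric is commonly computed: the fraction of unmasked label positions where the predicted token ID matches the label. This is an assumption about the metric's definition, not the author's evaluation code. The function name and the `-100` ignore index are illustrative conventions.

```python
IGNORE_INDEX = -100  # common convention for masked label positions; an assumption here


def mean_token_accuracy(predictions, labels, ignore_index=IGNORE_INDEX):
    """Fraction of unmasked positions where the predicted token ID equals the label."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(predictions, labels):
        for pred, label in zip(pred_seq, label_seq):
            if label == ignore_index:
                continue  # skip prompt or padding positions
            total += 1
            correct += int(pred == label)
    if total == 0:
        raise ValueError("no unmasked label positions to score")
    return correct / total


# Toy token IDs: 6 unmasked positions, 4 of them predicted correctly.
predictions = [[5, 7, 9, 2], [4, 4, 1]]
labels = [[-100, 7, 9, 3], [4, 8, 1]]
accuracy = mean_token_accuracy(predictions, labels)
```

A per-token score of this kind counts each position separately, so it does not directly measure whether a whole output is correct. That is one reason it can sit above an end-to-end correctness estimate.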
Limitations
CeluneNorm is not a full grammar correction system. It may occasionally miss punctuation or casing corrections, be conservative with contractions, and preserve ambiguous casing. It is most reliable for sequences under 128 tokens and prioritizes safety and meaning preservation over extensive linguistic correction.
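Given the 128-token reliability note, longer inputs can be split into smaller pieces before normalization. The sketch below packs sentences greedily under a token budget. It assumes sentence boundaries are marked by `.`, `!`, or `?`, and it approximates token counts by whitespace splitting. Both are simplifications, and `chunk_text` and `whitespace_token_count` are hypothetical helper names. A real tokenizer's counting function can be passed in through `count_tokens`.

```python
import re
from typing import Callable, List

MAX_TOKENS = 128  # budget taken from the reliability note above


def whitespace_token_count(text: str) -> int:
    """Rough token estimate; real subword tokenizers usually produce more tokens."""
    return len(text.split())


def chunk_text(
    text: str,
    max_tokens: int = MAX_TOKENS,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_text("One two three. Four five six. Seven eight nine.", max_tokens=6)
```

A single sentence longer than the budget is kept as one oversized chunk rather than cut mid-sentence, which matches the model's emphasis on preserving meaning. Input without clear sentence punctuation will not be split by this sketch.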