CeluneNorm-0.6B-v1.1: Lightweight Text Normalization
CeluneNorm-0.6B-v1.1, developed by lunahr, is a 0.6 billion parameter causal language model fine-tuned from Qwen3-0.6B-Base. Its primary function is text normalization: transforming poorly formatted input into clean, readable text suitable for Text-to-Speech (TTS) and other preprocessing pipelines. The model is deliberately conservative, prioritizing preservation of the original meaning and avoiding sentence rewriting or changes to domain-specific tokens such as URLs and names.
Key Capabilities
- Deterministic output: Ensures consistent normalization without sampling.
- Meaning preservation: Avoids altering the original intent or content of the text.
- Structure and intent handling: Maintains the structural integrity of the input.
- Mixed text support: Effectively processes natural language combined with technical content.
- Conservative punctuation: Prefers standard punctuation like periods over exclamation marks unless explicitly indicated.
- Multi-sentence normalization: Can normalize multiple sentences when boundaries are clear.
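Taken together, these properties imply plain greedy decoding at inference time. The sketch below shows one way to call the model with the Hugging Face transformers API; the Hub model ID (`lunahr/CeluneNorm-0.6B-v1.1`) and the prompt template are assumptions for illustration, not documented behavior.

```python
MODEL_ID = "lunahr/CeluneNorm-0.6B-v1.1"  # assumed Hugging Face Hub ID


def build_prompt(raw_text: str) -> str:
    # Hypothetical instruction format; adjust to the model's actual template.
    return f"Normalize the following text.\nInput: {raw_text}\nOutput:"


def normalize(raw_text: str, max_new_tokens: int = 256) -> str:
    # transformers is imported lazily so the prompt helper stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(raw_text), return_tensors="pt")
    # do_sample=False selects greedy decoding, matching the model's
    # deterministic-output design goal.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Because decoding is greedy (`do_sample=False`), repeated calls on the same input should produce identical output, which is the behavior the "deterministic output" bullet describes.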
Training and Performance
The model was fine-tuned on a mixed dataset of formal, conversational, and synthetic text, reaching a mean token accuracy of 97.53%. While not a full grammar-correction system, it provides reliable text cleaning, particularly for applications where meaning preservation and structural integrity are paramount. It is not intended for aggressive correction or slang expansion.
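The card does not define how mean token accuracy was computed; a common definition is the fraction of teacher-forced positions where the predicted token matches the reference token. A minimal sketch under that assumption:

```python
def mean_token_accuracy(predicted_ids, reference_ids):
    """Fraction of aligned positions where the predicted token equals the
    reference token (one common definition of mean token accuracy)."""
    if len(predicted_ids) != len(reference_ids):
        raise ValueError("sequences must be aligned position-by-position")
    matches = sum(p == r for p, r in zip(predicted_ids, reference_ids))
    return matches / len(reference_ids)


# Toy example: 3 of 4 positions match, so accuracy is 0.75.
print(mean_token_accuracy([5, 8, 13, 21], [5, 8, 99, 21]))  # → 0.75
```

Under this reading, the reported 97.53% means roughly 9,753 of every 10,000 output tokens matched the reference normalization during evaluation.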