lunahr/CeluneNorm-0.6B-v2.0-ctx2048
CeluneNorm-0.6B-v2.0-ctx2048 is a 0.8 billion parameter causal language model developed by lunahr, fine-tuned from Qwen3-0.6B-Base. This model specializes in lightweight text normalization for TTS and preprocessing pipelines, converting poorly formatted input into clean, readable text while preserving meaning. It is designed to be conservative, avoiding sentence rewrites and meaning changes, and supports long-context normalization up to 2048 tokens.
Overview
CeluneNorm-0.6B-v2.0-ctx2048 is a 0.8 billion parameter causal language model developed by lunahr, based on Qwen3-0.6B-Base. It is designed for text normalization: it transforms poorly formatted input into clean, readable text without altering the original meaning or intent. Version 2.0 significantly improves performance on longer contexts and supports inputs up to 2048 tokens, so it can normalize longer text segments than its predecessor.
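Input longer than the context window has to be split before normalization. The following is a minimal sketch of one way to do that, not something the model card prescribes. The helper name `split_for_context`, the `count_tokens` callable, and the default budget of 900 tokens are all illustrative assumptions; the budget assumes the prompt and the generated output share the 2048-token window and that the output is roughly as long as the input.

```python
import re
from typing import Callable, List


def split_for_context(
    text: str,
    count_tokens: Callable[[str], int],
    budget: int = 900,  # assumed input budget, leaving room for the output
) -> List[str]:
    """Greedily pack whitespace-delimited pieces into chunks of <= budget tokens.

    Splitting happens on whitespace rather than on sentence boundaries,
    because unnormalized input may have no reliable punctuation.
    A single piece larger than the budget is emitted as its own chunk.
    """
    # Each piece is a run of non-space characters plus its trailing whitespace,
    # so joining the chunks reproduces the text (minus leading whitespace).
    pieces = re.findall(r"\S+\s*", text)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if current and count_tokens(current + piece) > budget:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


# Toy usage with a word counter standing in for a real tokenizer.
demo = split_for_context(
    "one two three four five",
    count_tokens=lambda s: len(s.split()),
    budget=2,
)
print(demo)
```

With a Hugging Face tokenizer, `count_tokens` could be `lambda s: len(tokenizer(s)["input_ids"])`. Chunk boundaries chosen this way can fall mid-sentence, which may affect casing and punctuation at the seams.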
Key Capabilities
- Lightweight Text Normalization: Converts informal or poorly formatted text into a standardized, readable format.
- Meaning Preservation: Conservatively avoids rewriting sentences or changing the original meaning.
- Domain-Specific Token Handling: Preserves URLs, commands, names, and other domain-specific tokens.
- Long Context Support: Handles normalization for inputs up to 2048 tokens, an improvement over previous versions.
- Deterministic Output: Provides consistent normalization without requiring sampling.
- Mixed Text Handling: Capable of processing text containing both natural language and technical content.
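The deterministic-output point above corresponds to greedy decoding, which can be sketched with the Hugging Face `transformers` library as follows. This is an illustrative sketch under assumptions: the helper names `load`, `greedy_kwargs`, and `normalize` are hypothetical, the `max_new_tokens` default is arbitrary, and the raw input text is passed as the prompt without any template. The model may expect a specific prompt format that is not described here.

```python
def load(model_id: str = "lunahr/CeluneNorm-0.6B-v2.0-ctx2048"):
    """Load tokenizer and model (downloads weights on first use)."""
    # Imported lazily so the rest of the sketch works without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


def greedy_kwargs(max_new_tokens: int = 512) -> dict:
    """Generation settings for greedy decoding: no sampling, single beam."""
    return {
        "do_sample": False,
        "num_beams": 1,
        "max_new_tokens": max_new_tokens,  # assumed cap, not from the model card
    }


def normalize(text: str, tokenizer, model, max_new_tokens: int = 512) -> str:
    """Run one normalization pass.

    ASSUMPTION: the raw text is used as the prompt. Adjust this if the
    model requires a particular input template.
    """
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, **greedy_kwargs(max_new_tokens))
    # Keep only the tokens generated after the prompt.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


if __name__ == "__main__":
    tokenizer, model = load()
    print(normalize("hello world this is a test", tokenizer, model))
```

With sampling disabled, repeated calls on the same input and hardware should yield the same output, which is convenient for caching results in a TTS or preprocessing pipeline.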
Training and Performance
The model was fine-tuned on a mixed dataset of formal text, conversational text, synthetic edge cases, and quoted text, and received additional tuning for casing accuracy and long-context coherence. Training metrics show a mean token accuracy of 97.53%. Correctness on real-world input, as judged at a human level, is estimated at 90-95%.
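To illustrate why token-level accuracy can sit above whole-output correctness, here is a small sketch of how a mean token accuracy is commonly computed. The exact definition used during training is not stated, so this is an assumption: the function name `mean_token_accuracy` is hypothetical, and the `-100` ignore label follows the usual convention for masked positions in fine-tuning.

```python
from typing import Sequence


def mean_token_accuracy(
    predicted: Sequence[int],
    reference: Sequence[int],
    ignore_id: int = -100,  # assumed label for positions excluded from scoring
) -> float:
    """Fraction of scored positions where the predicted token matches."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    correct = 0
    total = 0
    for pred, ref in zip(predicted, reference):
        if ref == ignore_id:
            continue  # masked position, e.g. prompt tokens
        total += 1
        correct += int(pred == ref)
    return correct / total if total else 0.0


# A 40-token output with a single wrong token scores 39/40 at token level,
# yet the output as a whole still contains an error.
reference = list(range(40))
predicted = list(range(40))
predicted[7] = 999
print(mean_token_accuracy(predicted, reference))
```

Because one wrong token is enough to make an entire sentence incorrect, a high per-token score is compatible with a noticeably lower rate of fully correct outputs.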
Limitations
CeluneNorm is not a full grammar correction system. It may miss some punctuation or casing corrections, it is conservative with contractions, and it may leave ambiguous casing unchanged. It prioritizes safety and meaning preservation over aggressive correction.