CeluneNorm-0.6B-v1.3: Lightweight Text Normalization
CeluneNorm-0.6B-v1.3, developed by lunahr, is a 0.6 billion parameter causal language model based on Qwen3-0.6B-Base. Its primary function is to normalize poorly formatted text into clean, readable output, making it ideal for Text-to-Speech (TTS) systems and text preprocessing. The model is designed to be conservative: it preserves the original meaning, avoids sentence rewriting, and keeps domain-specific tokens such as URLs and names intact.
Key Capabilities
- Deterministic Output: Provides consistent normalization without sampling.
- Meaning Preservation: Converts text while maintaining the original intent and structure.
- Improved Punctuation: Version 1.3 significantly improves punctuation handling and sentence-boundary inference compared to previous versions.
- Mixed Text Handling: Capable of processing both natural language and technical content.
- Conservative Correction: Prioritizes safety and meaning over aggressive grammar correction, avoiding changes to slang or informal language.
- Broad Training Data: Fine-tuned on a mix of formal, conversational, and synthetic data, including dedicated casing data, for robust performance.
Usage Considerations
This model expects input in the format `YOUR INPUT<NORM>` and works reliably on sequences below 128 tokens. It is not a full grammar-correction system and may miss some nuanced corrections or preserve ambiguous casing. On its training metrics it achieves a mean token accuracy of 97.53% (99.77% for casing), which translates to roughly 90-95% human-level correctness in real-world normalization tasks.
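
The input format and decoding constraints above can be sketched as a short inference call. This is a minimal sketch, not part of the model card: it assumes the standard Hugging Face `transformers` AutoModel API, and the helper names, token limits, and generation settings are illustrative choices (greedy decoding is used to match the deterministic-output claim).

```python
# Hypothetical usage sketch for CeluneNorm-0.6B-v1.3; helper names and
# generation settings are assumptions, not documented by the model card.

MODEL_ID = "lunahr/CeluneNorm-0.6B-v1.3"
MAX_INPUT_TOKENS = 128  # the card notes reliability drops past ~128 tokens


def build_prompt(text: str) -> str:
    """Wrap raw text in the format the model expects: YOUR INPUT<NORM>."""
    return f"{text}<NORM>"


def normalize(text: str) -> str:
    # Imported lazily so build_prompt() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(text), return_tensors="pt")
    if inputs["input_ids"].shape[1] > MAX_INPUT_TOKENS:
        raise ValueError("input exceeds the ~128-token reliability window")

    # do_sample=False -> greedy decoding, i.e. the same input always yields
    # the same normalized output, matching the deterministic-output claim.
    outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Example (requires downloading the model weights):
# print(normalize("hello world how r u today"))
```

Greedy decoding (`do_sample=False`) is the natural fit here, since the card advertises consistent normalization without sampling; the 128-token guard simply surfaces the card's stated reliability window as an explicit error.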