juanquivilla/sotto-cleanup-lfm25-350m

Text Generation | Concurrency Cost: 1 | Model Size: 0.35B | Quant: BF16 | Ctx Length: 32k | Published: Apr 1, 2026 | License: mit | Architecture: Transformer | Open Weights | Cold

juanquivilla/sotto-cleanup-lfm25-350m is a 350-million-parameter, full-precision bf16 fine-tune of LiquidAI/LFM2.5-350M-Base, designed for on-device cleanup of speech-to-text transcripts. It is a "model soup" (weight-space average) of two strong checkpoints, chosen to balance number accuracy against performance on adversarial sampling benchmarks. It removes filler words and corrects transcribed numbers, which makes it suited to refining ASR output.

Overview

juanquivilla/sotto-cleanup-lfm25-350m is a 350-million-parameter, full-precision bf16 fine-tune of the LiquidAI/LFM2.5-350M-Base model. Its primary purpose is on-device cleanup of speech-to-text transcripts, that is, refining raw ASR output.

Key Differentiators

  • Model Souping: The model is a weight-space average of two high-performing checkpoints (v55 and v51) from the same fine-tuning lineage. Averaging recovers v51's adversarial sampling performance while retaining v55's gains in number accuracy and filler stripping.
  • Optimized for Accuracy: In production-mode evaluations, the model reaches 96.5% number accuracy and scores 86.4% on the adversarial benchmark with greedy decoding. It also reduces substantive deletions (sub-deletion) and sampling loops.
  • Specialized Training: The training pipeline included GRPO (Group Relative Policy Optimization) with substantive-deletion-aware rewards, augmented number examples, and anti-loop n-gram penalties.
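
The weight-space averaging described under Model Souping can be sketched as a per-parameter linear interpolation between two checkpoints. This is a minimal illustration, not the author's actual merge script: the function name `soup_state_dicts`, the equal mixing weight `alpha=0.5`, and the toy scalar "checkpoints" are all assumptions, since the card does not state how v55 and v51 were weighted.

```python
from typing import Any, Dict


def soup_state_dicts(sd_a: Dict[str, Any], sd_b: Dict[str, Any], alpha: float = 0.5) -> Dict[str, Any]:
    """Linearly interpolate two checkpoints that share the same parameter names.

    alpha is the weight given to sd_b; alpha=0.5 (an assumption here) is a plain average.
    Values can be floats, NumPy arrays, or floating-point tensors: anything that
    supports scalar multiplication and addition.
    """
    if sd_a.keys() != sd_b.keys():
        raise ValueError("checkpoints must have identical parameter names")
    return {name: (1.0 - alpha) * sd_a[name] + alpha * sd_b[name] for name in sd_a}


# Toy stand-ins for the v55 and v51 checkpoints (scalars instead of weight tensors).
v55 = {"layer.weight": 1.0, "layer.bias": -2.0}
v51 = {"layer.weight": 3.0, "layer.bias": 0.0}

soup = soup_state_dicts(v55, v51)
print(soup)  # {'layer.weight': 2.0, 'layer.bias': -1.0}
```

Averaging only makes sense when both checkpoints share an architecture and a common starting point, which matches the card's note that v55 and v51 come from the same fine-tuning lineage.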

Recommended Usage

For best results, especially on Apple Silicon, the MLX 5-bit variant is recommended. Use the following inference settings:

  • repetition_penalty=1.05 to prevent rare 5-gram loops.
  • max_new_tokens >= 1.5 × input_word_count (or 900 minimum) to avoid content truncation.
  • do_sample=False for deterministic greedy output, or temperature=0.1, top_k=50 for sampling.
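
The settings above can be collected into a small helper that builds generation keyword arguments for a given transcript. This is a sketch under stated assumptions: the helper name `cleanup_generation_kwargs` is hypothetical, word count is approximated by whitespace splitting, and "900 minimum" is read as a floor on `max_new_tokens`. The keys follow the parameter names listed above, which match those accepted by Hugging Face Transformers' `generate`.

```python
import math


def cleanup_generation_kwargs(transcript: str, sample: bool = False) -> dict:
    """Build generation settings for one transcript (hypothetical helper)."""
    # Approximate the input word count by splitting on whitespace.
    word_count = len(transcript.split())
    # Assumption: 900 is a floor, so short inputs still get a 900-token budget.
    max_new_tokens = max(900, math.ceil(1.5 * word_count))
    kwargs = {
        "repetition_penalty": 1.05,  # guards against rare 5-gram loops
        "max_new_tokens": max_new_tokens,
        "do_sample": sample,
    }
    if sample:
        # Sampling settings from the card; omitted for greedy decoding.
        kwargs.update(temperature=0.1, top_k=50)
    return kwargs


short = cleanup_generation_kwargs("um so the total is like forty two dollars")
long_input = cleanup_generation_kwargs(" ".join(["word"] * 1000), sample=True)
print(short["max_new_tokens"])       # 900
print(long_input["max_new_tokens"])  # 1500
```

The resulting dictionary could then be passed as `model.generate(**inputs, **kwargs)` in a Transformers pipeline; other runtimes such as MLX may use different parameter names.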