anonymous-dada/DialFactSum-Base-8B

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: Apr 25, 2026 | Architecture: Transformer | Cold

DialFactSum-Base-8B is an 8-billion-parameter dialogue summarization model developed by anonymous-dada and fine-tuned from Qwen3-8B. It uses Group Relative Policy Optimization (GRPO) driven by Atomic Content Unit (ACU) rewards to achieve state-of-the-art factual coverage and density in dialogue summarization. The model is optimized to lengthen its summaries where doing so captures more facts without sacrificing precision, and it outperforms standard SFT models and strong baselines on factual metrics.

DialFactSum-Base-8B: Advanced Dialogue Summarization

DialFactSum-Base-8B is an 8-billion-parameter model, fine-tuned from Qwen3-8B and designed for dialogue summarization. Developed by anonymous-dada, it uses ACU-driven Group Relative Policy Optimization (GRPO) to improve the factual coverage and density of its summaries. It is evaluated on the RoSE benchmark (SAMSum subset).

Key Capabilities & Improvements

  • Strategic Token Reallocation: Unlike traditional SFT models that often truncate summaries, DialFactSum-ACU-8B learns to expand sequence length to include more facts while maintaining high precision.
  • State-of-the-Art Factual Performance: It significantly outperforms strong baselines such as Ctrl-DiaSumm and MV-BART across all factual metrics, achieving superior ACU F1 (0.5685) and Normalized ACU (0.4635).
  • Unified Evaluation Protocol: All performance metrics are reported under a single GPT-4o (G-Eval) protocol, so every system is scored the same way.
  • Mitigation of the "Truncation Trap": SFT models tend to converge to a conservative summary length, which limits factual recall. The GRPO policy resolves this: DialFactSum-ACU-8B generates longer summaries (~30 words) that capture more atomic facts without sacrificing precision.
  • Superior Factual Consistency: The model's bidirectional ACU reward function effectively mitigates hallucinations and structural errors, leading to high factual consistency.
  • Quality Preservation: It maintains high linguistic quality, showing improvements in Coherence (0.9507) and Relevance (0.9041) compared to its SFT predecessor, avoiding the typical "alignment tax" associated with reinforcement learning.
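The ACU F1 figure above combines a precision direction and a recall direction. The card does not give the exact formula, so the following is only a minimal sketch of one common reading: recall is the share of reference ACUs supported by the candidate summary, precision is the share of candidate ACUs supported by the reference, and F1 is their harmonic mean. The `AcuCounts` container and the `acu_scores` function are hypothetical names introduced for illustration. In practice the counts would come from the GPT-4o parsing and verification step.

```python
from dataclasses import dataclass


@dataclass
class AcuCounts:
    """Hypothetical container for ACU matching counts on one summary."""
    reference_acus: int      # ACUs extracted from the reference summary
    reference_matched: int   # reference ACUs supported by the candidate
    candidate_acus: int      # ACUs extracted from the candidate summary
    candidate_matched: int   # candidate ACUs supported by the reference


def acu_scores(c: AcuCounts) -> dict:
    """Return ACU precision, recall, and F1 (assumed definitions, see lead-in)."""
    recall = c.reference_matched / c.reference_acus if c.reference_acus else 0.0
    precision = c.candidate_matched / c.candidate_acus if c.candidate_acus else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up counts: 2 of 4 reference ACUs covered, 4 of 5 candidate ACUs supported.
    print(acu_scores(AcuCounts(4, 2, 5, 4)))
```

Under this reading, a longer summary raises F1 only if the extra sentences add supported facts: unsupported additions lower precision, which is consistent with the "without sacrificing precision" claim above.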

Training & Evaluation

Training has two stages: Stage-1 supervised fine-tuning (SFT) on distilled rationale trajectories, followed by Stage-2 GRPO optimization with a composite reward function ($R_{ACU} + R_{len} + R_{BERT}$). Evaluation relies on GPT-4o G-Eval for ACU parsing and verification.
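The Stage-2 reward and the group-relative step of GRPO can be sketched as follows. This is an illustration under stated assumptions, not the authors' training code: the three reward terms are summed with equal weights (the card lists the sum but no weights), and each sampled summary's advantage is its reward standardized against the other samples drawn for the same dialogue, as in the usual GRPO formulation. The function names, the `weights` parameter, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def composite_reward(r_acu, r_len, r_bert, weights=(1.0, 1.0, 1.0)):
    """R = w_acu * R_ACU + w_len * R_len + w_bert * R_BERT (weights are assumed)."""
    w_acu, w_len, w_bert = weights
    return w_acu * r_acu + w_len * r_len + w_bert * r_bert


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical summaries sampled for one dialogue, with made-up reward terms.
    group = [
        composite_reward(0.40, 0.10, 0.50),
        composite_reward(0.60, 0.20, 0.55),
        composite_reward(0.55, 0.05, 0.52),
    ]
    print(group_relative_advantages(group))
```

Because advantages are computed relative to the group mean, a summary is reinforced only when it beats the other samples for the same dialogue, so no separate value network is needed.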