google/DiarizationLM-8b-Fisher-v2
DiarizationLM-8b-Fisher-v2 is an 8-billion-parameter DiarizationLM model developed by Google, fine-tuned on the Fisher corpus using a LoRA adapter. Built upon unsloth/llama-3-8b-bnb-4bit, it specializes in speaker diarization post-processing, that is, refining speaker attribution in transcribed audio, and lowers Word Diarization Error Rate (WDER) on both the Fisher and Callhome test sets relative to a USM + turn-to-diarize baseline. Unlike its predecessor, it computes loss only on completion tokens.
DiarizationLM-8b-Fisher-v2 Overview
This model is a specialized 8 billion parameter DiarizationLM, developed by Google, for speaker diarization post-processing. It is fine-tuned on the training subset of the Fisher corpus using a LoRA adapter (rank 256) and built upon the unsloth/llama-3-8b-bnb-4bit foundation model. A key distinction from its predecessor, google/DiarizationLM-8b-Fisher-v1, is that this version computes loss exclusively on completion tokens during training.
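Computing loss only on completion tokens is commonly implemented by masking the prompt positions in the label sequence. The following is a minimal sketch of that idea, not the actual training code for this model: the helper name build_labels and the toy token IDs are hypothetical, and -100 is used because it is the default ignore_index of PyTorch's cross-entropy loss.

```python
# Sketch: build labels so that only completion tokens contribute to the loss.
# Assumption: the trainer skips positions labeled -100, which is the default
# ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def build_labels(prompt_ids, completion_ids, completion_only=True):
    """Return (input_ids, labels) for one prompt-completion pair."""
    input_ids = list(prompt_ids) + list(completion_ids)
    if completion_only:
        # Mask every prompt position; keep completion tokens as targets.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    else:
        # Alternative: prompt tokens also contribute to the loss.
        labels = list(input_ids)
    return input_ids, labels


# Toy token IDs (hypothetical, not produced by a real tokenizer).
prompt_ids = [11, 12, 13]
completion_ids = [21, 22]
input_ids, labels = build_labels(prompt_ids, completion_ids)
print(input_ids)
print(labels)
```

With completion_only=True the model still sees the full prompt as context, but gradients come only from predicting the completion.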
Key Capabilities & Performance
- Speaker Diarization Post-Processing: Designed to refine speaker attribution in transcribed audio.
- Improved WDER: Reduces Word Diarization Error Rate (WDER) compared to a USM + turn-to-diarize baseline:
  - Fisher testing set: 3.28% WDER (baseline 5.32%).
  - Callhome testing set: 6.66% WDER (baseline 7.72%).
- Training Details: Trained for approximately 9 epochs (28800 steps) on 51,063 prompt-completion pairs from the Fisher corpus, combining the hyp2ora and deg2ref data flavors.
- Context Length: Supports a maximal sequence length of 4096 tokens, with prompts up to 6000 characters.
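The figures in the list above can be cross-checked with some simple arithmetic. The sketch below derives the relative WDER reduction on each test set and the effective batch size implied by the step, epoch, and pair counts. The batch size is not stated in this section, so the derived value is only an inference from the quoted numbers.

```python
def relative_reduction(baseline, system):
    """Relative error reduction in percent."""
    return (baseline - system) / baseline * 100.0


# WDER figures (in percent) quoted in the list above.
fisher_gain = relative_reduction(5.32, 3.28)
callhome_gain = relative_reduction(7.72, 6.66)

# Training figures quoted in the list above.
steps = 28800
pairs = 51063
epochs = 9  # "approximately 9 epochs"

# Assumption: one step consumes one batch, so
# steps * batch_size is roughly pairs * epochs.
implied_batch_size = pairs * epochs / steps

print(f"Fisher relative WDER reduction:   {fisher_gain:.1f}%")
print(f"Callhome relative WDER reduction: {callhome_gain:.1f}%")
print(f"Implied effective batch size:     {implied_batch_size:.2f}")
```

The gain is considerably larger on Fisher, the corpus the model was fine-tuned on, than on Callhome.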
Good For
- Enhancing Speaker Diarization Accuracy: Suited to applications that need more accurate word-level speaker attribution in conversational audio.
- Research in Diarization Post-Processing: Useful for researchers exploring LLM-based methods for refining diarization outputs.
- Integration with ASR Systems: Can be used as a post-processing step for Automatic Speech Recognition (ASR) systems to produce more accurate speaker-attributed transcripts.
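When the model is used as a post-processing step, a long transcript has to be cut into pieces that respect the prompt limit of 6000 characters mentioned above. The sketch below shows one way to do that with a greedy splitter. The function names are hypothetical, and the `<speaker:N>` tag format is an assumption made for illustration; the model's actual prompt template is not described in this section and should be checked before use.

```python
def to_tagged_text(words, speakers):
    """Render words as text, emitting a speaker tag at each speaker change.

    Assumption: the '<speaker:N>' tag format is illustrative only.
    """
    parts = []
    previous = None
    for word, speaker in zip(words, speakers):
        if speaker != previous:
            parts.append(f"<speaker:{speaker}>")
            previous = speaker
        parts.append(word)
    return " ".join(parts)


def split_into_prompts(words, speakers, max_chars=6000):
    """Greedily group words into segments whose tagged text fits max_chars.

    A single word that exceeds max_chars on its own is kept as its own
    segment. Quadratic in the worst case, which is fine for a sketch.
    """
    segments = []
    start = 0
    for end in range(1, len(words) + 1):
        text = to_tagged_text(words[start:end], speakers[start:end])
        if len(text) > max_chars and end - 1 > start:
            # Close the segment before the word that caused the overflow.
            segments.append(
                to_tagged_text(words[start:end - 1], speakers[start:end - 1])
            )
            start = end - 1
    if start < len(words):
        segments.append(to_tagged_text(words[start:], speakers[start:]))
    return segments


# Toy ASR output with word-level speaker labels (hypothetical data).
words = ["hello", "how", "are", "you", "good", "thanks"]
speakers = [1, 1, 1, 1, 2, 2]
segments = split_into_prompts(words, speakers, max_chars=30)
for segment in segments:
    print(segment)
```

Each segment restarts with a speaker tag, so every prompt is self-contained. Mapping the model's completions back onto the original word sequence is a separate step that is not covered here.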