selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge
selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge is a 4.5 billion parameter LLM-as-a-Judge model developed by selfhypnosis-ai, built on the Qwen 3.5 architecture with a 32,768 token context length. It is fine-tuned to evaluate and rank creative writing responses on accuracy, clarity, and originality. The model performs pairwise preference evaluation by analyzing token log probabilities, and shows high discriminative ability and robustness to length and positional biases in creative writing assessments.
selfhypnosis-ai/Qwen3.5-4B-Creative-Writing-Judge: A Specialized LLM for Creative Writing Evaluation
This model, developed by selfhypnosis-ai, is a specialized 4.5 billion parameter LLM-as-a-Judge built on the Qwen 3.5 architecture. It is uniquely designed for evaluating and ranking creative writing responses by assessing pairs of texts for accuracy, clarity, and originality. Unlike traditional LLMs that generate text, this model's primary function is to act as an automated judge, providing a quantitative preference score between two given responses.
Key Capabilities
- LLM-as-a-Judge for Creative Writing: Optimized for pairwise preference evaluation of creative texts.
- Logit-Based Evaluation: Utilizes underlying token log probabilities of 'A' and 'B' to determine confidence scores, rather than simple text generation, for more robust and consistent judgments.
- Bias Mitigation: Employs a dual-pass inference strategy (swapping response positions) to effectively reduce positional bias, a common issue in LLM judges.
- High Discriminative Ability: Achieves a Combined Separability Score of 87.92 / 100.0 in prompt-isolated Elo evaluations, indicating strong ability to differentiate writing quality.
- Leaderboard Stability: Demonstrates extreme stability in tournament rankings with an Omega-Squared of 0.9020 in bootstrapped evaluations, meaning rankings are highly consistent across different prompt selections.
- 32K Context Length: Supports a fine-tuned context length of 32,768 tokens, allowing for evaluation of longer creative pieces.
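
The logit-based, dual-pass scoring described in the list above can be sketched roughly as follows. This is a minimal illustration and not the model's published inference code: the function names `preference_from_logprobs` and `dual_pass_score` are hypothetical, the log probabilities are made-up inputs standing in for what the judge would return for the tokens 'A' and 'B', and averaging the two passes is an assumed aggregation rule that the description above does not specify.

```python
import math


def preference_from_logprobs(logp_a: float, logp_b: float) -> float:
    """Probability that slot 'A' is preferred, renormalized over 'A' and 'B' only."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logp_a, logp_b)
    ea = math.exp(logp_a - m)
    eb = math.exp(logp_b - m)
    return ea / (ea + eb)


def dual_pass_score(forward: tuple[float, float], swapped: tuple[float, float]) -> float:
    """Combine two passes into one preference score for response 1.

    forward: (logp 'A', logp 'B') with response 1 shown in slot A.
    swapped: (logp 'A', logp 'B') with response 1 shown in slot B.
    Averaging the two passes is an assumption made for this sketch.
    """
    p_forward = preference_from_logprobs(*forward)        # response 1 sits in slot A
    p_swapped = 1.0 - preference_from_logprobs(*swapped)  # response 1 sits in slot B
    return 0.5 * (p_forward + p_swapped)


# Illustrative values only: the judge favors response 1 in both orderings.
score = dual_pass_score(
    forward=(math.log(0.8), math.log(0.2)),
    swapped=(math.log(0.3), math.log(0.7)),
)

# A judge that always picks slot A regardless of content cancels out to 0.5.
biased = dual_pass_score(
    forward=(math.log(0.9), math.log(0.1)),
    swapped=(math.log(0.9), math.log(0.1)),
)
```

Under this assumed averaging, a score above 0.5 favors response 1 and a score below 0.5 favors response 2. A purely positional preference contributes equally to both sides and so lands at 0.5, which is the intuition behind swapping response positions.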
Good for
- Automated evaluation and ranking of creative writing outputs from other LLMs or human writers.
- Developing and maintaining leaderboards for creative writing models.
- Research into LLM-as-a-Judge methodologies, particularly for subjective tasks.
- Applications requiring objective, bias-mitigated comparison of creative text quality.