ZiyiYe/Con-J-Qwen2-7B

Text Generation | Concurrency Cost: 1 | Model Size: 7.6B | Quant: FP8 | Ctx Length: 32k | Published: Sep 20, 2024 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

Con-J-Qwen2-7B by ZiyiYe is a 7.6 billion parameter generative judge model built on Qwen2-7B-Instruct. Given a question and two candidate answers, it judges which answer is better and gives a rationale for the choice. It is trained with Direct Preference Optimization (DPO) on self-generated contrastive judgment pairs derived from the Skywork/Skywork-Reward-Preference-80K-v0.1 dataset. Because every judgment comes with a rationale, the model is suited to automated evaluation and feedback systems.

Overview

Con-J-Qwen2-7B is a 7.6 billion parameter generative judge model developed by ZiyiYe and built on Qwen2-7B-Instruct. As a generative judge, it takes a question with two candidate answers, states which answer it prefers, and explains that preference in a rationale. It is trained on preference data with Direct Preference Optimization (DPO), learning from both positive and negative judgments, each accompanied by a natural language rationale.
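The judging flow described above can be sketched roughly as follows. This is only an illustration, not the model's documented interface: the prompt wording, the `Preferred: A` / `Preferred: B` verdict convention, and the helper names `build_judge_prompt` and `parse_judgment` are all assumptions made for this example. The actual prompt template should be taken from the original model card.

```python
import re
from typing import Optional, Tuple

# Hypothetical prompt template; the real Con-J prompt may differ.
PROMPT_TEMPLATE = (
    "Question:\n{question}\n\n"
    "Answer A:\n{answer_a}\n\n"
    "Answer B:\n{answer_b}\n\n"
    "Explain which answer is better, then finish with a line of the form "
    "'Preferred: A' or 'Preferred: B'."
)


def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the (assumed) template with a question and two candidate answers."""
    return PROMPT_TEMPLATE.format(
        question=question.strip(),
        answer_a=answer_a.strip(),
        answer_b=answer_b.strip(),
    )


def parse_judgment(output: str) -> Tuple[Optional[str], str]:
    """Split raw judge output into (verdict, rationale).

    The verdict is 'A', 'B', or None when no verdict line is found.
    """
    matches = list(re.finditer(r"Preferred:\s*([AB])\b", output))
    if not matches:
        return None, output.strip()
    last = matches[-1]  # use the final verdict if several appear
    rationale = output[: last.start()].strip()
    return last.group(1), rationale
```

The string returned by `build_judge_prompt` would be sent to the model through whatever inference client is in use, and the generated text passed to `parse_judgment`. Returning `None` for an unparseable output keeps malformed generations from being silently counted as a preference.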

Key Capabilities

  • Generative Judgment: Evaluates two candidate answers to a question and determines which is superior.
  • Rationale Generation: Provides detailed, natural language explanations for its judgments, enhancing transparency and interpretability.
  • Preference-based Training: Trained with DPO on self-generated contrastive judgment pairs built from the Skywork/Skywork-Reward-Preference-80K-v0.1 dataset.
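One plausible way to build contrastive judgment pairs from preference data is sketched below. It is an assumption about the general idea, not the author's actual pipeline: several judgments are sampled for the same prompt, those whose verdict agrees with the dataset's preference label become `chosen`, and those that disagree become `rejected`. The `SampledJudgment` class and `make_contrastive_pairs` function are hypothetical names, and the `prompt` / `chosen` / `rejected` record layout is simply a common convention for DPO training data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SampledJudgment:
    text: str      # full judgment, rationale included
    verdict: str   # "A" or "B", as parsed from the text


def make_contrastive_pairs(
    prompt: str,
    samples: List[SampledJudgment],
    gold_verdict: str,
) -> List[Dict[str, str]]:
    """Pair judgments that agree with the gold label against ones that do not."""
    correct = [s.text for s in samples if s.verdict == gold_verdict]
    wrong = [s.text for s in samples if s.verdict != gold_verdict]
    # zip stops at the shorter list, so each judgment is used at most once.
    return [
        {"prompt": prompt, "chosen": c, "rejected": w}
        for c, w in zip(correct, wrong)
    ]
```

A prompt whose sampled judgments all agree, or all disagree, with the label yields no pairs under this scheme, since there is nothing to contrast.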

Performance Highlights

On several reward model benchmarks, Con-J-Qwen2-7B scores higher than much larger models. The reported results include:

  • 81.0 on Infinity-Preference, surpassing GPT-4o (75.0) and Llama3.1-70B (64.0).
  • 73.0 on Ultra-Feedback, outperforming GPT-4o (72.2) and Llama3.1-70B (71.4).
  • 79.6 on Reward-Bench Chat-H, ahead of GPT-4o (74.3) and Llama3.1-70B (70.2).
  • 88.0 on Reward-Bench Safety, exceeding GPT-4o (87.6) and Llama3.1-70B (82.8).

Good For

  • Automated evaluation of LLM outputs.
  • Providing detailed feedback and rationales for answer quality.
  • Developing systems that require nuanced judgment of text coherence, accuracy, and coverage.
  • Research into generative judge models and preference-based learning.
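For the automated-evaluation use case, a batch of pairwise comparisons can be tallied along these lines. This is a minimal sketch under stated assumptions: `judge` is a hypothetical callable that wraps the model and returns `"A"`, `"B"`, or `None` when no verdict could be extracted, and `tally_preferences` is an illustrative name, not part of any library.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical judge signature: (question, answer_a, answer_b) -> "A", "B", or None.
Judge = Callable[[str, str, str], Optional[str]]


def tally_preferences(
    judge: Judge,
    examples: Iterable[Tuple[str, str, str]],
) -> Counter:
    """Count how often the judge prefers answer A, answer B, or gives no verdict."""
    counts: Counter = Counter()
    for question, answer_a, answer_b in examples:
        verdict = judge(question, answer_a, answer_b)
        # Anything other than a clean "A" or "B" is recorded as undecided.
        counts[verdict if verdict in ("A", "B") else "undecided"] += 1
    return counts
```

Tracking undecided cases separately makes it visible when the judge's output could not be parsed, rather than folding those cases into either side's win count.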