Model Overview
IsmaelMousa/Qwen2.5-3B-Instruct-EngSaf-628K is a specialized large language model, fine-tuned from the Qwen2.5-3B-Instruct architecture by Ismael Mousa. With 3.1 billion parameters and a 32K context length, this model is specifically designed for Automatic Essay Grading (AEG), focusing on short-answer responses.
Key Capabilities
- Essay Grading: Evaluates student answers against reference answers and mark schemes.
- Rationale Generation: Provides a textual rationale explaining the assigned score.
- JSON Output: Designed to output scores and rationales in a structured JSON format.
- Domain-Specific Training: Fine-tuned on the EngSAF-628K dataset, comprising short-answer responses from engineering examinations.
Performance Metrics
Evaluation on a held-out test set demonstrated the model's capabilities in both scoring and rationale generation:
- Score F1: 0.6141
- Score Accuracy: 0.6200
- Score Cohen's Kappa (CKS): 0.4123
- Rationale F1 (BERT-Score): 0.6438
When to Use This Model
This model is particularly well-suited for:
- Automated Educational Assessment: Grading short-answer questions in academic settings.
- Feedback Generation: Providing structured feedback to students based on their responses.
- Research in AEG: As a baseline or component in further research on automatic essay grading systems.
Limitations
The model's evaluation revealed instances where it correctly identified key aspects of student answers but occasionally failed to align its scoring perfectly with rubric criteria. It is primarily trained on engineering examination data, which may affect performance on other domains.