cs-552-2026-databand/general_knowledge_model
The cs-552-2026-databand/general_knowledge_model is a specialized language model developed by cs-552-2026-databand, fine-tuned for multiple-choice general knowledge questions. It is an SFT-only merged model, optimized to output concise, boxed answers (e.g., \boxed{A}). This model excels at accurately answering general knowledge questions from diverse datasets like Kaggle LLM Science and OpenBookQA, demonstrating strong performance on its SFT validation set and competitive results on MMLU benchmarks.
Loading preview...
Overview
The cs-552-2026-databand/general_knowledge_model is a specialized language model developed for the CS-552 Modern NLP Spring 2026 project. It is an SFT-only merged model, specifically fine-tuned for multiple-choice general knowledge questions. The model's primary function is to provide a single, concise, boxed answer (e.g., \boxed{A}) from a given set of choices (A through T).
Key Capabilities & Training
- Specialized Answering: Designed to output only a boxed letter for multiple-choice questions, enforced by a custom chat template.
- Supervised Fine-Tuning (SFT): Trained using LoRA SFT with a masked loss function, focusing on the final assistant boxed answer.
- Diverse Training Data: Built from six general knowledge datasets, including Kaggle LLM Science, EduQG, EduAdapt, NCERT_MCQs, SciQ, and OpenBookQA, with balanced answer distributions.
- Performance: Achieved 85.30% accuracy on its 2,000-example SFT validation set and 56.25% on MMLU Redux 2k, outperforming the baseline significantly.
Intended Use Cases
- Automated Quiz/Test Answering: Ideal for systems requiring precise, single-choice answers to general knowledge questions.
- Educational Tools: Can be integrated into platforms for evaluating understanding of factual information.
- Knowledge Retrieval: Useful for applications where quick, definitive answers to multiple-choice queries are needed.
Limitations
- The model is highly specialized for multiple-choice formats and may not perform optimally on open-ended or generative tasks.
- A DPO experiment was conducted but ultimately not selected as it reduced external benchmark accuracy, indicating the SFT-only model is the most robust for its intended purpose.