eth-nlped/TutorRL-7B
TutorRL-7B is a 7.6-billion-parameter variant of Qwen2.5-7B-Instruct, developed by eth-nlped and fine-tuned with reinforcement learning (GRPO) for pedagogical alignment. The model functions as a math tutor, optimized to scaffold reasoning and guide students through Socratic questioning rather than solving problems for them. It is well suited to interactive math tutoring and research on educational LLM alignment, and supports a 131,072-token context length.
TutorRL-7B: A Pedagogically Aligned Math Tutor
TutorRL-7B, developed by eth-nlped, is a 7.6-billion-parameter model based on Qwen2.5-7B-Instruct, fine-tuned to act as a math tutor rather than a direct problem-solver. Its core contribution is alignment with pedagogical principles via reinforcement learning (GRPO) in a synthetic multi-turn classroom environment, which removes the need for human-labeled tutoring data.
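The group-relative idea behind GRPO can be sketched briefly: several candidate tutor responses are sampled for the same student turn, each is scored, and each score is normalized against its own group's statistics to form an advantage. This is a minimal illustration of that normalization step only; the actual reward design and training loop are not specified in this card, and the example rewards are hypothetical.

```python
# Sketch of the group-relative advantage used in GRPO
# (Group Relative Policy Optimization). The pedagogical reward
# values below are made up for illustration.
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Normalize each sampled completion's reward against its group
    mean and standard deviation (eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four tutor responses to the same student turn, scored by a
# (hypothetical) pedagogical reward: higher = better tutoring.
advs = grpo_advantages([0.2, 0.8, 0.5, 0.5])
```

Because advantages are centered within each group, responses that tutor better than their siblings get positive advantage and are reinforced, without needing an absolute reward scale or a learned value function.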
Key Capabilities
- Socratic Tutoring: Guides users through problem-solving with Socratic questioning, scaffolding reasoning, and withholding direct answers to foster learning.
- Pedagogical Alignment: Optimized for educational interactions, focusing on teaching methodologies over solution provision.
- Annotation-Free Training: Trained with a scalable, annotation-free pipeline, requiring no human-labeled tutoring dialogues.
Good For
- Interactive Math Tutoring: Ideal for applications requiring an AI to teach math concepts and problem-solving.
- Educational Research: A valuable tool for research into the educational alignment of large language models.
- Socratic Dialogue Generation: Capable of generating guided, inquiry-based conversations for learning.
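For the tutoring use cases above, inference follows the standard chat pattern for instruction-tuned Qwen2.5 models. The sketch below assumes the model is hosted on the Hugging Face Hub under the id shown in this card and uses the stock Qwen2.5 chat template; the system-prompt wording is an assumption, not the prompt used in training.

```python
# Minimal inference sketch (assumptions: Hub id as in this card,
# standard chat template; system prompt is illustrative only).
def tutor_reply(messages, model_id="eth-nlped/TutorRL-7B"):
    # Heavyweight imports kept local: calling this downloads the checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=256)
    # Decode only the newly generated tokens.
    return tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

# A tutoring exchange: the model is expected to reply with a guiding
# question (e.g. about isolating x), not the final answer.
messages = [
    {"role": "system",
     "content": "You are a patient math tutor. Guide the student "
                "step by step; do not reveal the final answer."},
    {"role": "user", "content": "How do I solve 2x + 3 = 11?"},
]
# reply = tutor_reply(messages)  # requires downloading the 7B checkpoint
```

For multi-turn tutoring, append each student message and tutor reply to `messages` and call the function again, so the model sees the full dialogue history.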
Unlike its variant, TutorRL-7B-think, this model does not emit <think> blocks for explicit planning before responding.