turboderp/turbcat-instruct-72b: Enhanced Multilingual Instruction Model
The turboderp/turbcat-instruct-72b is a significant upgrade over the previous Cat 70B model, featuring a dataset expanded from 2GB to 5GB and Chinese language support whose quality matches that of the English data. The model is particularly distinguished by its rigorous data generation and quality-control processes.
Key Capabilities & Features
- Enhanced Medical & Scientific Reasoning: The dataset includes a dedicated medical Chain-of-Thought (CoT) portion sponsored by steelskull, annotated by 20 postdocs specializing in computational biology, biomedicine, biophysics, and biochemistry. It includes manually answered GRE and MCAT/Kaoyan questions with strict CoT application.
- Comprehensive Chinese Support: Incorporates Chinese Ph.D. entrance exams, Traditional Chinese, and Chinese storytelling data. The Chinese data is verified to be of quality comparable to the English data via PCA on BERT embeddings.
- Controlled Roleplay: Whether serving as an API backend or in roleplay scenarios, the model avoids generating irrelevant content not specified by the system prompt.
- Quality Control: Individual task clusters are quality-checked by projecting BERT embeddings with UMAP; outliers are manually reviewed by doctors.
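The embedding-based quality check described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the random vectors stand in for real BERT embeddings, and a plain NumPy PCA stands in for the UMAP projection; only the outlier-flagging step (route far-from-centroid samples to human review) is the point being illustrated.

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project embeddings onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD gives the principal directions without forming the covariance matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def flag_outliers(points, z_thresh=3.0):
    """Flag points whose distance from the centroid exceeds z_thresh sigmas."""
    dists = np.linalg.norm(points - points.mean(axis=0), axis=1)
    z = (dists - dists.mean()) / dists.std()
    return np.where(z > z_thresh)[0]  # indices to route to human review

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(200, 768))  # stand-in for BERT embeddings
cluster[0] += 50.0                               # inject one obvious outlier
projected = pca_project(cluster)
print(flag_outliers(projected))
```

In a real pipeline the flagged indices would be handed to human reviewers rather than dropped automatically, mirroring the manual review by doctors described above.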
Use Cases & Differentiators
This model is particularly well-suited for applications requiring high-quality instruction following in both English and Chinese, especially in medical, scientific, and academic domains. Its training methodology, involving expert human annotation and rigorous quality checks, sets it apart for tasks demanding precision and factual accuracy. The 72B variant uses the ChatML prompt format.
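ChatML, the prompt format named above, wraps each conversation turn in `<|im_start|>` / `<|im_end|>` markers tagged with a role; generation continues after the final `<|im_start|>assistant` line. A typical prompt looks like:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain the mechanism of action of aspirin.<|im_end|>
<|im_start|>assistant
```

The system turn is where the behavioral constraints mentioned above (e.g. restricting roleplay content) would be specified.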