MedAlpaca-7b: A Medical Domain LLM
MedAlpaca-7b is a 7-billion-parameter large language model built on the LLaMA architecture and fine-tuned specifically for the medical domain. Its primary objective is to improve performance on medical question-answering and dialogue tasks.
Key Capabilities
- Medical Question Answering: Answers open-ended questions about medical topics.
- Medical Dialogue: Capable of engaging in medical-related conversational exchanges.
- Specialized Knowledge: Fine-tuned on a diverse dataset to acquire medical domain expertise.
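The model can be queried with the Hugging Face `transformers` library. The sketch below assumes the model is published under the Hub ID `medalpaca/medalpaca-7b` and uses a simple context/question prompt template; the exact template is an assumption, not an official specification.

```python
# Minimal sketch of querying MedAlpaca-7b via a text-generation pipeline.
# The Hub ID "medalpaca/medalpaca-7b" and the prompt layout are assumptions.

def build_prompt(context: str, question: str) -> str:
    """Assemble a context/question prompt for the text-generation pipeline."""
    return f"Context: {context}\n\nQuestion: {question}\n\nAnswer: "

def answer(context: str, question: str) -> str:
    """Load the model and generate an answer (downloads several GB of weights)."""
    from transformers import pipeline  # heavy import kept local to this call
    qa = pipeline(
        "text-generation",
        model="medalpaca/medalpaca-7b",
        tokenizer="medalpaca/medalpaca-7b",
    )
    prompt = build_prompt(context, question)
    return qa(prompt, max_new_tokens=128)[0]["generated_text"]
```

For example, `answer("Diabetes is a metabolic disorder marked by high blood sugar.", "What are common symptoms of diabetes?")` would return the prompt followed by the model's generated answer.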
Training Data
The model was trained using a comprehensive dataset compiled from several sources:
- Anki Flashcards: Automatically generated question-answer pairs.
- Wikidoc: Medical question-answer pairs generated with ChatGPT (GPT-3.5) from relevant paragraphs.
- StackExchange: Top-rated question-answer pairs from Academia, Bioinformatics, Biology, Fitness, and Health categories.
- ChatDoctor: Approximately 200,000 medical question-answer pairs.
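Heterogeneous sources like these are typically normalized into a single instruction-tuning schema before fine-tuning. The sketch below uses the common Alpaca-style instruction/input/output convention; the field names and the example record are illustrative assumptions, not a confirmed MedAlpaca specification.

```python
# Hypothetical normalization of a question-answer pair into an
# Alpaca-style instruction-tuning record. The schema (instruction/input/
# output) follows the common Alpaca convention and is an assumption here.
import json

def to_record(question: str, answer: str, context: str = "") -> dict:
    """Wrap a question-answer pair as one instruction-tuning record."""
    return {"instruction": question, "input": context, "output": answer}

record = to_record(
    "Which vitamin deficiency causes scurvy?",
    "Vitamin C (ascorbic acid) deficiency causes scurvy.",
)
print(json.dumps(record, indent=2))
```

Records in this shape can be concatenated into one JSON file per source and mixed for fine-tuning.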
Limitations
MedAlpaca-7b is fine-tuned for the medical domain and may perform poorly outside it. Because the training data targets the knowledge level of medical students, its utility for board-certified physicians may be limited. The model has not been tested in real-world clinical applications and should be treated as a research tool, not a substitute for professional medical advice.