BioMedGPT-LM-7B: A Specialized Biomedical Language Model
BioMedGPT-LM-7B, developed by PharMolix, is the first large generative language model based on Llama2 that is specifically fine-tuned for the biomedical domain. It builds on the Llama2-7B-Chat architecture and was fine-tuned on over 26 billion tokens drawn from millions of biomedical papers in the S2ORC corpus.
Key Capabilities and Features
- Biomedical Specialization: Fine-tuned on a large corpus of biomedical literature, giving it a strong command of domain-specific terminology and knowledge.
- High Performance on QA: Performs on par with, or better than, human experts and larger general-purpose foundation models on several biomedical question-answering (QA) benchmarks.
- Foundation for Multimodal AI: Serves as the generative language model component of BioMedGPT-10B, an open multimodal generative pre-trained transformer that bridges natural language with diverse biomedical data modalities.
Training Details
The model was fine-tuned for 5 epochs with a batch size of 192, a context length of 2,048 tokens, and a learning rate of 2e-5. Training data was selected from S2ORC by keeping papers that carry a PubMed Central (PMC) ID or a PubMed ID, which restricts the corpus to the biomedical literature.
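As a rough sketch of the scale these hyperparameters imply (assuming each of the ~26 billion tokens is seen once per epoch, and ignoring sequence-packing overhead, which the document does not specify):

```python
# Back-of-the-envelope arithmetic for the fine-tuning run described above:
# batch size 192, context length 2048, 5 epochs over ~26B tokens.
tokens_per_step = 192 * 2048          # sequences per batch x tokens per sequence
corpus_tokens = 26_000_000_000        # ~26B tokens in the fine-tuning corpus
epochs = 5

total_tokens = corpus_tokens * epochs # tokens processed across all epochs
steps = total_tokens // tokens_per_step

print(f"tokens per optimizer step: {tokens_per_step:,}")   # 393,216
print(f"total tokens processed:    {total_tokens:,}")      # 130,000,000,000
print(f"approximate steps:         {steps:,}")             # ~330,607
```

That is, each optimizer step consumes roughly 0.4M tokens, so the full run corresponds to on the order of 330K steps.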
Use Cases
BioMedGPT-LM-7B is ideal for applications requiring deep understanding and generation of biomedical text, such as:
- Biomedical question answering systems.
- Information extraction from scientific literature.
- Assisting in research and development within the pharmaceutical and medical fields.
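For the QA use case, a minimal sketch of querying the model with Hugging Face `transformers` is shown below. The repository id `PharMolix/BioMedGPT-LM-7B` and the plain instruction-style prompt format are assumptions, not prescribed by this document; check the model card for the recommended prompt template.

```python
def build_prompt(question: str) -> str:
    """Wrap a biomedical question in a simple instruction-style prompt.
    NOTE: this format is an assumption, not an official template."""
    return f"### Question:\n{question}\n\n### Answer:\n"

def answer(question: str, max_new_tokens: int = 256) -> str:
    """Generate an answer with BioMedGPT-LM-7B (requires transformers + torch).
    The repo id below is assumed; verify it on the Hugging Face Hub."""
    # Deferred import so build_prompt() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "PharMolix/BioMedGPT-LM-7B"  # assumed Hub repository id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage (downloads a ~7B-parameter checkpoint; needs a capable GPU):
# print(answer("What is the mechanism of action of metformin?"))
```

Keeping the model-loading code inside `answer` means the prompt helper can be reused (for instance, for batch preprocessing of a QA dataset) without pulling in the heavyweight dependencies.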
For more technical details, refer to the technical report on "BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine".