Instruction Pre-Training: Biomedicine-Llama3-8B
This model, developed by instruction-pretrain, is a specialized version of Llama3-8B adapted to the biomedicine domain through the "Instruction Pre-Training" framework. Rather than plain fine-tuning, the approach continually pre-trains the base model on massive raw corpora that have been augmented with instruction-response pairs, generated by an efficient instruction synthesizer built on open-source models.
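As a quick orientation, below is a minimal usage sketch. The Hub repository ID, example prompt, and generation settings are illustrative assumptions (check the instruction-pretrain organization page for the exact model name); the model itself loads through the standard transformers API like any other Llama3 checkpoint.

```python
# Minimal sketch: loading the model and running a completion-style
# biomedical prompt. The repo ID and prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "instruction-pretrain/medicine-Llama3-8B"  # assumed Hub repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Base-model (completion-style) prompting: no chat template is applied.
prompt = "Question: What is the mechanism of action of metformin?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and decode only the newly generated answer.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```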
Key Capabilities & Differentiators
- Enhanced Domain Adaptation: Outperforms vanilla pre-training in domain-adaptive continual pre-training, specifically within the biomedicine domain.
- Scalable Augmentation: The instruction synthesizer scales to 500 million synthesized instruction-response pairs, bringing the total pre-trained token count to 250 billion.
- Performance: In continual pre-training, this Llama3-8B variant matches or even surpasses Llama3-70B on domain-specific tasks.
- Base Model: Built on the pre-trained Llama3-8B base model rather than the instruction-tuned variant, so pre-training sequences carry only BOS and EOS special tokens instead of the Llama3 chat template (see the sketch after this list).
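To illustrate the BOS/EOS point above, here is a minimal sketch of how a pre-training sequence would be framed for the base model. The repo ID is an assumption; the special-token names are those of the standard Llama3 tokenizer.

```python
# Minimal sketch: framing a pre-training example with only BOS and EOS,
# as done for base (non-instruction-tuned) Llama3 models. Repo ID is an
# assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("instruction-pretrain/medicine-Llama3-8B")

doc = "Aspirin irreversibly inhibits cyclooxygenase-1 and -2."
# No chat template: the sequence is just BOS + text + EOS.
sequence = f"{tokenizer.bos_token}{doc}{tokenizer.eos_token}"

# add_special_tokens=False because BOS/EOS are already in the string.
ids = tokenizer(sequence, add_special_tokens=False)["input_ids"]
print(tokenizer.bos_token, tokenizer.eos_token)  # e.g. <|begin_of_text|> <|end_of_text|>
```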
Use Cases
- Biomedical NLP: Ideal for applications requiring deep understanding and generation within the biomedical field.
- Research: Useful for researchers exploring advanced pre-training techniques and domain adaptation for large language models.
- Evaluation: Serves as a strong baseline for domain-specific benchmarks; the authors provide evaluation scripts compatible with Hugging Face models (a sketch of one possible setup follows this list).
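The authors' own scripts are the reference evaluation method. As a stand-in, the sketch below shows one common way to benchmark a Hugging Face causal LM on biomedical tasks using EleutherAI's lm-evaluation-harness; the repo ID and task selection are assumptions, not the authors' protocol.

```python
# Minimal sketch using EleutherAI's lm-evaluation-harness (pip install lm-eval),
# a stand-in for the authors' own evaluation scripts. Repo ID and task list
# are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=instruction-pretrain/medicine-Llama3-8B,dtype=bfloat16",
    tasks=["pubmedqa", "medqa_4options"],  # example biomedical benchmarks
    batch_size=8,
)
print(results["results"])  # per-task accuracy and related metrics
```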