Instruction Pre-Training: Instruction Synthesizer
This repository introduces the context-based instruction synthesizer, a core component of the Instruction Pre-Training framework. Developed by instruction-pretrain and presented at EMNLP 2024, the framework reframes pre-training as supervised multitask learning: it synthesizes instruction-response pairs from raw text and uses them to augment the pre-training corpora of large language models.
Key Capabilities & Features
- Scalable Data Augmentation: Efficiently generates millions of instruction-response pairs (up to 500M) from raw corpora, covering over 40 task categories.
- Improved Pre-training: Models pre-trained with this method consistently outperform their vanilla pre-training counterparts, both when training from scratch and during domain-adaptive continual pre-training.
- Performance Gains: In continual pre-training scenarios, Instruction Pre-Training can enable smaller models such as Llama3-8B to match or exceed the performance of Llama3-70B.
- Flexible Usage: Provides methods for basic synthesis of instruction-response pairs from single texts and advanced, large-scale conversion of raw corpora into instruction-augmented datasets.
- Open-Source Integration: Utilizes open-source models for its synthesizer and offers resources including pre-trained models (e.g., InstructLM-500M, InstructLM-1.3B) and domain-specific models (e.g., Finance-Llama3-8B).
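To illustrate the basic synthesis flow described above, the sketch below wraps a raw text in a context prompt for the synthesizer and parses a generation back into instruction-response pairs. The delimiter tokens (`<CON>`, `<QUE>`, `<ANS>`, `<END>`) and the helper names are assumptions for illustration only; consult the repository's prompt templates for the synthesizer's actual format.

```python
import re

# Hypothetical delimiters -- assumed for illustration; the repository's
# templates define the synthesizer's actual prompt format.
CONTEXT_TEMPLATE = "<CON> {context} </CON>\n\n"
PAIR_PATTERN = re.compile(r"<QUE>\s*(.*?)\s*<ANS>\s*(.*?)\s*<END>", re.DOTALL)

def build_prompt(raw_text: str) -> str:
    """Wrap a raw text in the context template fed to the synthesizer."""
    return CONTEXT_TEMPLATE.format(context=raw_text.strip())

def parse_pairs(generation: str) -> list[tuple[str, str]]:
    """Extract (instruction, response) pairs from a synthesizer generation."""
    return PAIR_PATTERN.findall(generation)

# Example: parsing a mock generation. A real run would feed
# build_prompt(...) to the synthesizer model (e.g. via transformers or
# vLLM) and parse its output the same way.
mock_output = (
    "<QUE> What does photosynthesis produce? "
    "<ANS> Glucose and oxygen. <END>\n"
    "<QUE> Where does it occur? <ANS> In chloroplasts. <END>"
)
pairs = parse_pairs(mock_output)
```

For large-scale conversion, the same two helpers would be applied over every document in a raw corpus, with the parsed pairs appended after each document to form the instruction-augmented pre-training dataset.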
Good For
- Researchers and Developers looking to enhance language model pre-training with synthetic instruction data.
- Creating Custom Instruction Datasets from any raw text for fine-tuning or pre-training.
- Improving Continual Pre-training of existing base models, especially for domain adaptation, by generating relevant instruction-response pairs.
- Exploring Supervised Multitask Learning as a pre-training paradigm to boost model performance and efficiency.