instruction-pretrain/instruction-synthesizer

Text generation · Model size: 7B · Quantization: FP8 · Context length: 4k · Published: Jun 18, 2024 · License: apache-2.0 · Architecture: Transformer · Open weights

The instruction-pretrain/instruction-synthesizer is a context-based instruction synthesizer developed by instruction-pretrain to generate instruction-response pairs from raw text. It is the core component of the Instruction Pre-Training framework, which augments massive corpora with synthetic instruction data before pre-training language models. The synthesizer has been used to produce 200M to 500M instruction-response pairs spanning 40+ task categories, and models pre-trained on the augmented data outperform vanilla pre-training in both general and domain-adaptive continual pre-training, with smaller models such as Llama3-8B reaching performance comparable to or exceeding Llama3-70B.


Instruction Pre-Training: Instruction Synthesizer

This repository introduces the context-based instruction synthesizer, a core component of the Instruction Pre-Training framework. Developed by instruction-pretrain and presented at EMNLP 2024, the framework explores supervised multitask pre-training: instruction-response pairs generated from raw text are used to augment the corpora on which large language models are pre-trained.

Key Capabilities & Features

  • Scalable Data Augmentation: Generates instruction-response pairs from raw corpora at scale (up to 500M pairs), covering over 40 task categories.
  • Improved Pre-training: Models pre-trained using this method consistently outperform those from vanilla pre-training, both from scratch and in domain-adaptive continual pre-training.
  • Performance Gains: Demonstrates that Instruction Pre-Training can enable smaller models, such as Llama3-8B, to achieve performance comparable to or better than Llama3-70B in continual pre-training scenarios.
  • Flexible Usage: Supports both basic synthesis of instruction-response pairs from a single text and large-scale conversion of raw corpora into instruction-augmented datasets (see the sketches below).
  • Open-Source Integration: The synthesizer itself is built on open-source models, and the project releases pre-trained models (e.g., InstructLM-500M, InstructLM-1.3B) and domain-specific models (e.g., Finance-Llama3-8B).
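
For the basic single-text case, the documented flow is: wrap the raw text in the synthesizer's context delimiters, generate greedily, and parse question/answer pairs out of the completion. Below is a minimal sketch using vLLM; the `<CON>`/`<QUE>`/`<ANS>`/`</END>` delimiters follow the prompt template described on the model card, but verify the exact format against the card before relying on it.

```python
from vllm import LLM, SamplingParams

# Load the synthesizer; its context window is 4k tokens.
llm = LLM(model="instruction-pretrain/instruction-synthesizer", max_model_len=4096)
sampling_params = SamplingParams(temperature=0, max_tokens=400)

def parse_pairs(pred: str) -> list[tuple[str, str]]:
    """Split a completion into (instruction, response) tuples.

    Assumes pairs are emitted as '<QUE> ... <ANS> ... </END>' blocks,
    per the model card's template.
    """
    pairs = []
    for block in pred.split('</END>'):
        if '<ANS>' not in block:
            continue
        que, ans = block.split('<ANS>', 1)
        pairs.append((que.replace('<QUE>', '').strip(), ans.strip()))
    return pairs

def synthesize(context: str) -> list[tuple[str, str]]:
    """Generate instruction-response pairs grounded in one raw text."""
    prompt = f'<s> <CON> {context} </CON>\n\n'
    outputs = llm.generate([prompt], sampling_params)
    return parse_pairs(outputs[0].outputs[0].text)

raw_text = "Photosynthesis converts light energy into chemical energy stored in glucose ..."
for instruction, response in synthesize(raw_text):
    print("Q:", instruction)
    print("A:", response)
```

Greedy decoding (temperature=0) keeps the synthesized pairs deterministic and closely grounded in the input context.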

Good For

  • Researchers and Developers looking to enhance language model pre-training with synthetic instruction data.
  • Creating Custom Instruction Datasets from any raw text for fine-tuning or pre-training (see the corpus-scale sketch after this list).
  • Improving Continual Pre-training of existing base models, especially for domain adaptation, by generating relevant instruction-response pairs.
  • Exploring Supervised Multitask Learning as a pre-training paradigm to boost model performance and efficiency.
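
As a rough illustration of the corpus-scale path, the sketch below batches prompts through vLLM and writes an instruction-augmented JSONL dataset. The prompt template and delimiters carry the same assumptions as the earlier sketch, and the output schema (`context`/`instruction`/`response`) is a hypothetical choice, not a format prescribed by the repository.

```python
import json
import re

from vllm import LLM, SamplingParams

llm = LLM(model="instruction-pretrain/instruction-synthesizer", max_model_len=4096)
sampling_params = SamplingParams(temperature=0, max_tokens=400)

def corpus_to_jsonl(docs: list[str], out_path: str = "instruction_augmented.jsonl") -> None:
    """Convert raw documents into an instruction-augmented JSONL dataset."""
    prompts = [f'<s> <CON> {doc} </CON>\n\n' for doc in docs]
    # vLLM schedules the whole batch with continuous batching, so a single
    # call suffices; chunk the list for very large corpora.
    outputs = llm.generate(prompts, sampling_params)
    with open(out_path, "w") as f:
        for doc, out in zip(docs, outputs):
            text = out.outputs[0].text
            # Same assumed '<QUE> ... <ANS> ... </END>' layout as the earlier sketch.
            for que, ans in re.findall(r'<QUE>(.*?)<ANS>(.*?)</END>', text, re.DOTALL):
                f.write(json.dumps({"context": doc,
                                    "instruction": que.strip(),
                                    "response": ans.strip()}) + "\n")
```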