instruction-pretrain/instruction-synthesizer
Text Generation | Concurrency Cost: 1 | Model Size: 7B | Quant: FP8 | Ctx Length: 4k | Published: Jun 18, 2024 | License: apache-2.0 | Architecture: Transformer | 0.1K | Open Weights | Cold

instruction-pretrain/instruction-synthesizer is a context-based instruction synthesizer developed by instruction-pretrain. Given raw text, it generates instruction-response pairs grounded in that text. The model is central to the Instruction Pre-Training framework, which augments massive raw corpora with synthetic instruction data and then pre-trains language models on the result. It has been used to synthesize 200M to 500M instruction-response pairs across 40+ task categories. Models trained this way improve over vanilla pre-training in both general pre-training and domain-adaptive continual pre-training, and smaller models such as Llama3-8B reach performance comparable to or exceeding Llama3-70B.
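The augmentation step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the `Question:`/`Answer:` template, and the `dummy_synthesizer` stand-in are all hypothetical, chosen only to show how a raw text might be combined with synthesized instruction-response pairs into a single pre-training example. In practice the synthesizer model itself would take the place of the stub.

```python
from typing import Callable, List, Tuple

# (instruction, response); a hypothetical representation of one synthesized pair
Pair = Tuple[str, str]


def augment_with_instructions(
    raw_text: str, synthesize: Callable[[str], List[Pair]]
) -> str:
    """Append synthesized instruction-response pairs to a raw text.

    The Question/Answer template is an assumption made for illustration,
    not the template used by the Instruction Pre-Training framework.
    """
    blocks = [raw_text.strip()]
    for instruction, response in synthesize(raw_text):
        blocks.append(
            f"Question: {instruction.strip()}\nAnswer: {response.strip()}"
        )
    return "\n\n".join(blocks)


def dummy_synthesizer(text: str) -> List[Pair]:
    """Placeholder for the real model: returns one trivial pair."""
    first_sentence = text.strip().split(".")[0].strip() + "."
    return [("What does the passage state first?", first_sentence)]


example = augment_with_instructions(
    "Water boils at 100 C at sea level. It boils at lower temperatures at altitude.",
    dummy_synthesizer,
)
print(example)
```

Passing the synthesizer in as a callable keeps the sketch runnable without downloading the 7B model; a real setup would swap in a function that prompts the model and parses its output.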
