whynlp/tinyllama-zh
whynlp/tinyllama-zh is a Llama-architecture language model pretrained on approximately 45 billion Chinese tokens from the WuDaoCorpora Text dataset. Developed by whynlp, this model serves as a demonstration for pretraining TinyLlama on large Chinese corpora. It utilizes the THUDM/chatglm3-6b tokenizer and is primarily intended for research and educational purposes to showcase the pretraining process rather than achieving state-of-the-art performance.
Overview
whynlp/tinyllama-zh is a Llama-architecture model developed by whynlp, specifically pretrained on Chinese corpora. This project primarily functions as a demonstration of how to pretrain a TinyLlama model using the Hugging Face transformers library on a large dataset. It is fine-tuned from a TinyLlama-2.5T checkpoint.
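Since the model uses the standard Llama architecture, it should load through the usual transformers API. The following is an untested sketch, not an official quick-start from the authors; `trust_remote_code=True` is assumed to be needed because the chatglm3-6b tokenizer ships custom tokenizer code, and the prompt string is an arbitrary example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer comes from THUDM/chatglm3-6b, which bundles custom code
tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/chatglm3-6b", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")

# Arbitrary Chinese prompt ("The weather today") for demonstration
inputs = tokenizer("今天天气", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```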
Key Training Details
- Dataset: WuDaoCorpora Text, comprising approximately 45 billion tokens.
- Training Epochs: 2 epochs.
- Training Duration: Approximately 6 days using 8 A100 GPUs.
- Tokenizer: Employs the THUDM/chatglm3-6b tokenizer.
- License: MIT.
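The figures above imply a per-GPU throughput that is easy to sanity-check. This back-of-envelope calculation is not from the model card itself; it only combines the numbers listed above (45B tokens, 2 epochs, 6 days, 8 A100s):

```python
# Rough throughput implied by the stated training details
total_tokens = 45e9 * 2        # ~45B tokens seen twice (2 epochs)
wall_seconds = 6 * 24 * 3600   # ~6 days of wall-clock time
num_gpus = 8                   # 8 A100 GPUs

per_gpu_tps = total_tokens / wall_seconds / num_gpus
print(round(per_gpu_tps))      # → 21701, i.e. ~22k tokens/s per A100
```

That figure is in the same ballpark as the ~24k tokens/s per A100 reported by the original TinyLlama project, so the stated duration is plausible.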
Intended Use and Limitations
This model is designed to illustrate the pretraining process on a large corpus. While functional, its performance is weak: its CMMLU score is only slightly above 25, barely better than the 25% random-guess baseline for CMMLU's four-choice questions. For stronger results in practical applications, the developers suggest using a higher-quality corpus such as Wanjuan. The model is therefore best suited to researchers and developers who want to study the mechanics of pretraining TinyLlama on Chinese data, rather than to high-performance production use cases.