whynlp/tinyllama-zh

Text Generation · Model Size: 1B · Quant: BF16 · Context Length: 32k · License: MIT · Architecture: Transformer · Open Weights

whynlp/tinyllama-zh is a Llama-architecture language model pretrained on approximately 45 billion Chinese tokens from the WuDaoCorpora Text dataset. Developed by whynlp, it serves as a demonstration of pretraining TinyLlama on a large Chinese corpus. It uses the THUDM/chatglm3-6b tokenizer and is intended primarily for research and educational purposes, showcasing the pretraining process rather than targeting state-of-the-art performance.


Overview

whynlp/tinyllama-zh is a Llama-architecture model developed by whynlp and pretrained specifically on Chinese corpora. The project primarily functions as a demonstration of how to pretrain a TinyLlama model on a large dataset using the Hugging Face transformers library. The model is fine-tuned from a TinyLlama 2.5T checkpoint.
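
Because the model follows the standard Llama architecture and only the tokenizer is swapped in from chatglm3-6b, it should load with the regular transformers APIs. The snippet below is a minimal generation sketch, not an official usage example: the prompt and sampling settings are illustrative, and trust_remote_code=True is assumed to be needed for the chatglm3-6b tokenizer's custom code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer comes from chatglm3-6b rather than from the model repo itself;
# trust_remote_code is assumed to be required for its custom tokenizer class.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

# The model is a plain Llama-architecture checkpoint stored in BF16.
model = AutoModelForCausalLM.from_pretrained(
    "whynlp/tinyllama-zh",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Illustrative prompt; this is a base model, so it does plain text continuation.
prompt = "人工智能的发展历史可以追溯到"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)
attention_mask = inputs["attention_mask"].to(model.device)

outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```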

Key Training Details

  • Dataset: WuDaoCorpora Text, comprising approximately 45 billion tokens.
  • Training Epochs: 2.
  • Training Duration: Approximately 6 days using 8 A100 GPUs.
  • Tokenizer: Employs the THUDM/chatglm3-6b tokenizer (see the setup sketch after this list).
  • License: MIT.
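
Since the repository's stated purpose is to demonstrate pretraining with the Hugging Face transformers library, the sketch below shows roughly how a run with the details listed above could be wired up. This is not the authors' training script: the base checkpoint ID, data paths, text field name, packing length, and all hyperparameters except the epoch count are assumptions, and the real run used 8 A100 GPUs (e.g., launched with torchrun or accelerate) rather than a single process.

```python
from itertools import chain

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

# Hypothetical names: the base checkpoint ID and data path are assumptions,
# not taken from the whynlp repository.
BASE_CHECKPOINT = "TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T"
CORPUS_FILES = "wudao/*.jsonl"   # placeholder for locally prepared WuDaoCorpora Text shards
BLOCK_SIZE = 2048                # illustrative packing length

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(BASE_CHECKPOINT, torch_dtype=torch.bfloat16)

# Swapping in the chatglm3 tokenizer changes the vocabulary, so the embedding
# matrices must be resized to match it (the new rows start untrained).
model.resize_token_embeddings(len(tokenizer))

def tokenize(batch):
    # "text" is a placeholder field name; adjust to the corpus schema.
    return tokenizer(batch["text"])

def pack(batch):
    # Concatenate all token ids and cut them into fixed-length blocks,
    # the usual packing scheme for causal-LM pretraining.
    ids = list(chain.from_iterable(batch["input_ids"]))
    total = (len(ids) // BLOCK_SIZE) * BLOCK_SIZE
    blocks = [ids[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    return {"input_ids": blocks, "labels": [b.copy() for b in blocks]}

raw = load_dataset("json", data_files=CORPUS_FILES, split="train")
tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
packed = tokenized.map(pack, batched=True, remove_columns=tokenized.column_names)

args = TrainingArguments(
    output_dir="tinyllama-zh-pretrain",
    per_device_train_batch_size=8,    # illustrative; real values depend on GPU memory
    gradient_accumulation_steps=16,
    num_train_epochs=2,               # matches the two epochs reported above
    learning_rate=4e-4,               # assumed, not taken from the repo
    bf16=True,
    logging_steps=100,
    save_steps=5000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=packed,
    data_collator=default_data_collator,
)
trainer.train()
```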

Intended Use and Limitations

This model is designed to illustrate the pretraining process on a large corpus. While functional, the model's performance is not especially strong, with a CMMLU result only slightly above 25. For better performance in practical applications, the developers suggest using a higher-quality corpus such as Wanjuan. It is therefore best suited for researchers and developers interested in the mechanics of pretraining TinyLlama on Chinese data rather than for high-performance production use cases.
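
For reference, CMMLU is a four-option multiple-choice benchmark, so a score around 25 is close to the random-guess baseline. A rough way to reproduce such an evaluation is sketched below, assuming the EleutherAI lm-evaluation-harness (v0.4+) and its cmmlu task; the authors' actual evaluation setup is not documented here, so the arguments are illustrative.

```python
import lm_eval

# Sketch of a CMMLU run with the EleutherAI lm-evaluation-harness (v0.4+).
# The tokenizer override and trust_remote_code flag are assumptions about what
# the chatglm3 tokenizer needs; the batch size is illustrative.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=whynlp/tinyllama-zh,"
        "tokenizer=THUDM/chatglm3-6b,"
        "trust_remote_code=True,"
        "dtype=bfloat16"
    ),
    tasks=["cmmlu"],
    batch_size=8,
)
print(results["results"])
```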