shuoxing/llama3-8b-full-pretrain-wash-c4-1-2m-sft-bs64
The shuoxing/llama3-8b-full-pretrain-wash-c4-1-2m-sft-bs64 is an 8-billion-parameter language model based on the Llama 3 architecture. It was trained from scratch, i.e., through a foundational pre-training effort rather than a fine-tune of an existing checkpoint. Specific differentiators and intended uses are not documented, but the from-scratch training suggests it is meant as a robust base for further specialization. It is suitable for general language understanding and generation tasks where a Llama 3-based model of this size is appropriate.
Model Overview
The shuoxing/llama3-8b-full-pretrain-wash-c4-1-2m-sft-bs64 is an 8-billion-parameter language model built on the Llama 3 architecture. Notably, it was trained entirely from scratch, implying a foundational pre-training phase rather than an instruction-tuned or fine-tuned version of an existing model. Training ran on 4 GPUs with a total batch size of 64, using the AdamW optimizer and a cosine learning rate scheduler over 3 epochs.
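The cosine schedule with linear warmup mentioned above can be sketched in plain Python. This is an illustrative reconstruction, not the exact trainer code: the total step count is a placeholder, and the warmup fraction of 0.1 is taken from the card's scheduler settings.

```python
import math

def lr_at_step(step, base_lr=1e-5, total_steps=1000, warmup_ratio=0.1):
    """Cosine learning-rate schedule with linear warmup.

    base_lr and warmup_ratio mirror the card's reported settings
    (lr=1e-05, warmup ratio 0.1); total_steps is illustrative.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

At the end of warmup the rate peaks at `base_lr`, reaches half of it at the midpoint of the decay phase, and anneals to 0 at the final step.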
Key Training Details
- Architecture: Llama 3-based
- Parameters: 8 billion
- Training Approach: Trained from scratch
- Learning Rate: 1e-05
- Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- Scheduler: Cosine learning rate scheduler with a warmup ratio of 0.1
- Epochs: 3.0
- Batch Size: 64 (total across 4 GPUs)
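The reported hyperparameters can be collected into a single config sketch. Note that only the totals are stated on the card: the per-device batch size and gradient-accumulation split below are assumptions chosen so that the product matches the reported total of 64.

```python
# Reconstruction of the card's reported hyperparameters as a plain dict.
# per_device_train_batch_size and gradient_accumulation_steps are
# assumptions; only the total batch size of 64 across 4 GPUs is stated.
config = {
    "learning_rate": 1e-5,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "num_train_epochs": 3.0,
    "num_gpus": 4,
    "per_device_train_batch_size": 16,  # assumption
    "gradient_accumulation_steps": 1,   # assumption
}

# Effective (total) batch size = per-device batch * GPUs * accumulation.
effective_batch = (
    config["per_device_train_batch_size"]
    * config["num_gpus"]
    * config["gradient_accumulation_steps"]
)
```

With these assumed values the effective batch size works out to the reported 64.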
Intended Use
Because it was pre-trained from scratch, this model is best suited as a base model for further fine-tuning or for research into foundational language understanding. Specific downstream applications are not documented, but it can serve as a starting point for a range of natural language processing tasks.