shuoxing/llama3-8b-full-pretrain-wash-c4-2-4m-sft-bs64
The shuoxing/llama3-8b-full-pretrain-wash-c4-2-4m-sft-bs64 model is an 8-billion-parameter language model based on the Llama 3 architecture. It was trained from scratch through a full pre-training run. No task-specific differentiators are documented; the published training parameters point toward general language understanding and generation rather than a specialized domain.
Overview
This model, shuoxing/llama3-8b-full-pretrain-wash-c4-2-4m-sft-bs64, is an 8-billion-parameter language model built on the Llama 3 architecture and pre-trained from scratch. Training used a learning rate of 1e-05, a total batch size of 64, and a cosine learning rate schedule over 3 epochs, on a multi-GPU setup with 4 devices and the AdamW optimizer.
Key Training Details
- Architecture: Llama 3
- Parameters: 8 billion
- Training Process: Full pre-training from scratch
- Learning Rate: 1e-05
- Optimizer: AdamW_TORCH with betas=(0.9, 0.999) and epsilon=1e-08
- Scheduler: Cosine LR scheduler with a warmup ratio of 0.1
- Epochs: 3.0
- Batch Size: Total train batch size of 64 (8 per device × 4 devices × 2 gradient accumulation steps)
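The two derived quantities above can be sanity-checked in plain Python. This is a minimal sketch, not the training code: it assumes the standard definitions of effective batch size and of a linear-warmup-then-cosine-decay schedule (as commonly used by HF-style trainers), with the card's reported values plugged in. The 1000-step total below is an illustrative placeholder, not a figure from the card.

```python
import math

def effective_batch_size(per_device: int, num_devices: int, grad_accum: int) -> int:
    # One optimizer step sees per_device * num_devices examples per forward pass,
    # accumulated grad_accum times before the weights are updated.
    return per_device * num_devices * grad_accum

def cosine_lr(step: int, total_steps: int, base_lr: float,
              warmup_ratio: float = 0.1) -> float:
    # Linear warmup over the first warmup_ratio of training,
    # then cosine decay from base_lr down to 0.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The card's reported settings: 8 per device, 4 GPUs, 2 accumulation steps.
print(effective_batch_size(8, 4, 2))        # -> 64
# Peak LR is reached exactly when warmup ends (step 100 of a 1000-step run here).
print(cosine_lr(100, 1000, 1e-05))          # -> 1e-05
```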
Intended Use
Based on the available information, this model is suitable for general language understanding and generation tasks where an 8-billion-parameter model is appropriate. Because it was pre-trained from scratch, it is best treated as a foundational model to be fine-tuned for specific downstream applications.
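For further fine-tuning or inference, a model like this would typically be loaded with the Hugging Face `transformers` library. The sketch below assumes the checkpoint is hosted on the Hub under the name in this card and is compatible with `AutoModelForCausalLM`; the `load_model` helper is illustrative, not part of the published repository.

```python
MODEL_ID = "shuoxing/llama3-8b-full-pretrain-wash-c4-2-4m-sft-bs64"

def load_model(model_id: str = MODEL_ID):
    # Lazy import so this sketch can be read/imported without transformers installed.
    # Assumes a standard causal-LM checkpoint layout on the Hugging Face Hub.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model
```

Calling `load_model()` downloads the weights (roughly 16 GB in fp16 for an 8B model), so a GPU with sufficient memory, or quantized loading, is advisable.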