shuoxing/llama3-8b-full-pretrain-wash-c4-2-4m-sft-bs64

Text generation · Concurrency cost: 1 · Model size: 8B · Quant: FP8 · Ctx length: 8k · Published: Mar 27, 2026 · Architecture: Transformer · Status: Cold

The shuoxing/llama3-8b-full-pretrain-wash-c4-2-4m-sft-bs64 model is an 8-billion-parameter language model based on the Llama 3 architecture, trained from scratch with a full pre-training run. Specific differentiators are not detailed, but its training parameters suggest a focus on general language understanding and generation tasks.


Overview

This model, shuoxing/llama3-8b-full-pretrain-wash-c4-2-4m-sft-bs64, is an 8-billion-parameter language model built on the Llama 3 architecture and pre-trained from scratch. Training used a learning rate of 1e-05, a total batch size of 64, and a cosine learning-rate scheduler over 3 epochs, on a multi-GPU setup with 4 devices and the AdamW optimizer.

Key Training Details

  • Architecture: Llama 3
  • Parameters: 8 billion
  • Training Process: Full pre-training from scratch
  • Learning Rate: 1e-05
  • Optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08
  • Scheduler: Cosine LR scheduler with a warmup ratio of 0.1
  • Epochs: 3.0
  • Batch Size: Total train batch size of 64 (8 per device × 4 devices × 2 gradient accumulation steps); see the configuration sketch below
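
The actual training script is not published, so the following is only a minimal sketch of how the hyperparameters listed above might be expressed as Hugging Face `TrainingArguments`; the output directory and any omitted settings are illustrative assumptions, not the author's configuration.

```python
from transformers import TrainingArguments

# Illustrative sketch only: mirrors the reported hyperparameters.
# output_dir is hypothetical; the real training script is not published.
training_args = TrainingArguments(
    output_dir="llama3-8b-full-pretrain",  # hypothetical path
    learning_rate=1e-5,                    # reported learning rate
    per_device_train_batch_size=8,         # 8 per device
    gradient_accumulation_steps=2,         # 8 x 4 devices x 2 steps = 64 total
    num_train_epochs=3.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                      # reported 0.1 warmup (read as a ratio)
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```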

Intended Use

Given the available information, this model is suitable for general language understanding and generation tasks where an 8 billion parameter model is appropriate. Its "from scratch" pre-training suggests a foundational model that could be further fine-tuned for specific downstream applications.
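
As a starting point, the checkpoint can be loaded like any standard Llama 3 causal language model. The snippet below is a minimal sketch using the transformers library; it assumes the repository follows the usual Hugging Face checkpoint layout, and the dtype and device choices are illustrative (the hosted variant above is listed as FP8).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shuoxing/llama3-8b-full-pretrain-wash-c4-2-4m-sft-bs64"

# Load tokenizer and model; bfloat16 and device_map are illustrative choices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Simple text-generation check.
prompt = "The Llama 3 architecture is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```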