shuoxing/llama3-8b-full-pretrain-wash-c4-1-2m-sft-bs64

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 8K · Published: Mar 27, 2026 · Architecture: Transformer

shuoxing/llama3-8b-full-pretrain-wash-c4-1-2m-sft-bs64 is an 8-billion-parameter language model based on the Llama 3 architecture. It was trained from scratch, indicating a foundational pre-training effort rather than a fine-tune of an existing model. Specific differentiators and intended uses are not documented, but the from-scratch training suggests a focus on establishing a robust base for further specialization. It is suitable for general language understanding and generation tasks where a Llama 3-based model of this size is appropriate.


Model Overview

shuoxing/llama3-8b-full-pretrain-wash-c4-1-2m-sft-bs64 is an 8-billion-parameter language model built on the Llama 3 architecture. Notably, it was trained entirely from scratch, implying a foundational pre-training phase rather than an instruction-tuned or fine-tuned version of an existing model. Training used a multi-GPU setup with four devices, a total batch size of 64, and the AdamW optimizer with a cosine learning-rate schedule over 3 epochs.

Key Training Details

  • Architecture: Llama 3-based
  • Parameters: 8 billion
  • Training Approach: Trained from scratch
  • Learning Rate: 1e-05
  • Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
  • Scheduler: Cosine learning rate schedule with a warmup ratio of 0.1
  • Epochs: 3.0
  • Batch Size: 64 (total across 4 GPUs)
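
For readers who want to reproduce or adapt this setup, the hyperparameters above map directly onto a Hugging Face `TrainingArguments` configuration. This is a minimal sketch under the assumption that the `transformers` Trainer produced the run (the card does not name the training framework); the output path is hypothetical.

```python
# Minimal sketch of the reported training configuration using the
# Hugging Face transformers Trainer API (framework is an assumption;
# the model card does not state which one was used).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-8b-full-pretrain",  # hypothetical output path
    per_device_train_batch_size=16,        # 16 per GPU x 4 GPUs = 64 total
    num_train_epochs=3.0,                  # as reported
    learning_rate=1e-5,
    adam_beta1=0.9,                        # AdamW betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",            # cosine LR schedule
    warmup_ratio=0.1,                      # 10% of training steps as warmup
)
```

Launched with `torchrun --nproc_per_node=4`, a per-device batch size of 16 reproduces the reported total batch size of 64; gradient accumulation could equivalently be used to reach the same effective size.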

Intended Use

Given that it was pre-trained from scratch, this model is most likely intended as a base model for further fine-tuning or for research into foundational language understanding. Specific applications are not documented, but it can serve as a solid starting point for a range of natural language processing tasks.
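
If the checkpoint follows the standard Hugging Face layout (an assumption; the card does not show usage code), it can be loaded for quick experimentation with the usual `transformers` causal-LM classes. A minimal inference sketch:

```python
# Minimal inference sketch, assuming the checkpoint loads with the
# standard transformers causal-LM classes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shuoxing/llama3-8b-full-pretrain-wash-c4-1-2m-sft-bs64"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 fits on a single modern GPU
    device_map="auto",
)

# A base model continues text rather than following instructions, so
# prompt with a passage to complete.
inputs = tokenizer("The history of natural language processing", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model rather than an instruction-tuned one, prompts should be written as text to be continued rather than as instructions or chat turns.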