shuoxing/llama3-8b-full-pretrain-wash-c4-0-9m-sft-bs64

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Mar 27, 2026 · Architecture: Transformer

shuoxing/llama3-8b-full-pretrain-wash-c4-0-9m-sft-bs64 is an 8-billion-parameter model based on the Llama 3 architecture and trained from scratch. It was trained with a learning rate of 1e-05 and a total batch size of 64 over 3 epochs. The model card does not detail specific differentiators, but because the model was pre-trained from scratch rather than initialized from an official Llama 3 checkpoint, its behavior will depend heavily on its undisclosed training dataset.


Model Overview

The shuoxing/llama3-8b-full-pretrain-wash-c4-0-9m-sft-bs64 is an 8 billion parameter language model based on the Llama 3 architecture. It was trained from scratch, meaning its weights come from its own pre-training run rather than from the standard Llama 3 checkpoints.
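As a minimal sketch, the checkpoint can presumably be loaded like any other Llama 3-style causal language model through the Hugging Face transformers library. The repository id comes from the model card; the dtype and generation settings shown here are illustrative assumptions, not documented defaults.

```python
# Minimal loading sketch, assuming the repository exposes standard
# Llama 3-style weights and tokenizer files compatible with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shuoxing/llama3-8b-full-pretrain-wash-c4-0-9m-sft-bs64"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; the dtype of the published weights is not documented
    device_map="auto",
)

prompt = "Explain the difference between pre-training and fine-tuning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)  # generation settings are illustrative
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```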

Training Details

The model card reports the following key training hyperparameters; a configuration sketch follows the list:

  • Learning Rate: 1e-05
  • Batch Size: A train_batch_size of 8 combined with gradient_accumulation_steps of 2 gave a total_train_batch_size of 64, which implies the run was distributed across 4 devices (8 × 2 × 4 = 64).
  • Optimizer: ADAMW_TORCH with default betas and epsilon.
  • Scheduler: Cosine learning rate scheduler with a warmup ratio of 0.1.
  • Epochs: Trained for 3.0 epochs.
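
For orientation, the reported values roughly map onto the Hugging Face Trainer API as shown below. This is a hypothetical reconstruction: details the model card does not state (number of devices, mixed-precision mode, dataset handling) are assumptions.

```python
# Hypothetical reconstruction of the reported training setup with the
# Hugging Face Trainer API; values not listed on the model card are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-8b-full-pretrain-wash-c4-0-9m-sft-bs64",
    learning_rate=1e-5,
    per_device_train_batch_size=8,   # reported train_batch_size
    gradient_accumulation_steps=2,   # with 4 devices this yields the reported total of 64
    num_train_epochs=3.0,
    optim="adamw_torch",             # default betas (0.9, 0.999) and epsilon 1e-8
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,                       # assumption; precision is not stated on the card
)
```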

Current Status

As per the model card, more information is needed regarding the model's intended uses, capabilities, limitations, and the exact nature of its training and evaluation data. In particular, developers should note that the dataset used for its from-scratch pre-training is currently undisclosed.