SII-GAIR-NLP/davinci-llm-model

Text generation · Model size: 3.1B · Quantization: BF16 · Context length: 32k · Published: Mar 26, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The SII-GAIR-NLP/davinci-llm-model is a 3.09 billion parameter decoder-only Transformer from the Qwen2 family, developed by SII-GAIR-NLP. It features a 4096-token context length and is designed for transparent, reproducible pretraining research: its data processing decisions, training trajectories, and over 200 ablation studies are publicly documented. As a base model it performs strongly in general language understanding, math and science reasoning, and code generation, achieving an overall average score of 51.72 across 19 benchmarks, comparable to larger 7B-scale models.


daVinci-LLM-3B: A Transparent Pretraining Research Model

daVinci-LLM-3B is a 3.09 billion parameter base language model developed by SII-GAIR-NLP, designed to advance the science of pretraining. Unlike many LLMs, this project emphasizes full transparency and reproducibility, releasing not only the final model weights but also detailed training trajectories, intermediate checkpoints, data processing decisions, and over 200 ablation studies. This allows researchers to deeply investigate data quality, mixture design, training dynamics, and evaluation validity.
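
A minimal usage sketch, assuming the weights load through the standard Hugging Face transformers API under the repo id shown above; the dtype and generation settings are illustrative choices, not values prescribed by this card:

```python
# Minimal sketch: load the base model from the Hugging Face Hub and run a
# plain text completion. The dtype and generation settings below are
# illustrative assumptions, not values prescribed by this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SII-GAIR-NLP/davinci-llm-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the card lists BF16 weights
    device_map="auto",
)

# Base model: use a plain completion prompt rather than a chat template.
prompt = "The main obstacles to reproducible LLM pretraining research are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```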

Key Capabilities

  • Transparent Pretraining: All aspects of the pretraining pipeline, including data processing logic, mixtures, logs, and checkpoints, are publicly documented (see the checkpoint-comparison sketch after this list).
  • Data Darwinism Framework: Utilizes a systematic L0–L9 taxonomy for categorizing data processing depth.
  • Extensive Ablations: Includes over 200 controlled experiments, providing insights into both positive and negative training outcomes.
  • Strong General Performance: Achieves an overall average score of 51.72 across 19 benchmarks, matching or exceeding the performance of larger 7B-scale models like OLMo-3 7B.
  • Specialized Reasoning: Demonstrates strong performance in math (62.80) and code generation (55.99), surpassing comparable models.
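
Because intermediate checkpoints and training trajectories are released alongside the final weights, training dynamics can be probed directly. The sketch below compares per-token loss between the final weights and an earlier checkpoint, assuming checkpoints are published as Hub revisions; the revision name is a hypothetical placeholder, so consult the repository for the actual naming scheme.

```python
# Sketch: probe training dynamics by comparing per-token loss between the
# final weights and an earlier checkpoint. Assumes intermediate checkpoints
# are published as Hub revisions; "step-100000" is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SII-GAIR-NLP/davinci-llm-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)

probe = "Water boils at 100 degrees Celsius at standard atmospheric pressure."
inputs = tokenizer(probe, return_tensors="pt")

for revision in ["main", "step-100000"]:  # check the repo for real names
    model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
    model.eval()
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"{revision}: per-token loss = {loss.item():.3f}")
```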

Good For

  • Research in Pretraining Science: Ideal for studying data quality, training dynamics, and evaluation stability.
  • General Language Understanding: Capable of broad language tasks.
  • Math and Science Reasoning: Excels in complex mathematical and scientific problem-solving.
  • Code Generation: Strong performance in generating code across various languages.

This model is a base model and requires additional instruction-tuning and safety alignment for production deployment.
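
Since instruction tuning is absent, completion-style or few-shot prompting is usually the most effective way to query the base model directly. A short sketch, with illustrative prompt content:

```python
# Sketch: few-shot, completion-style prompting for the untuned base model.
# The prompt content is illustrative; no chat template is applied because
# the card describes this as a base model without instruction tuning.
from transformers import pipeline

generator = pipeline("text-generation", model="SII-GAIR-NLP/davinci-llm-model")

few_shot_prompt = (
    "Q: What is 12 * 7?\nA: 84\n"
    "Q: What is 15 * 6?\nA: 90\n"
    "Q: What is 23 * 4?\nA:"
)
result = generator(few_shot_prompt, max_new_tokens=8, do_sample=False)
print(result[0]["generated_text"])
```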