yang0104/OryzaG3-8k

Text Generation | Concurrency Cost: 1 | Model Size: 1B | Quant: BF16 | Ctx Length: 32k | Published: May 22, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

OryzaG3-8k is a 700-million-parameter single-species DNA language model developed by yang0104 and pretrained on 149 high-quality rice pangenomes. It uses non-overlapping 3-mer tokenization and a Causal Language Modeling objective, and it is intended for genomic analysis of rice. It offers an 8k-token context length and, on rice-specific tasks, performs competitively with larger multi-species models while running inference faster.


OryzaG3-8k: A Genomic Foundation Model for Rice

OryzaG3-8k is a 700-million-parameter DNA language model developed by yang0104 and focused exclusively on rice (Oryza). It was pretrained on 149 high-quality rice pangenomes with Causal Language Modeling (CLM) as the pretraining objective. Sequences are tokenized into non-overlapping 3-mers, so each token covers three consecutive bases. OryzaG3 is available in two context-length versions; OryzaG3-8k is the one with an 8k-token context.
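To make the tokenization scheme concrete, here is a minimal sketch of non-overlapping 3-mer tokenization in plain Python. It is not the model's actual tokenizer: the vocabulary order, the `UNK_ID` fallback, the handling of trailing bases, and the reading of "8k" as 8,192 tokens are all assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
K = 3  # k-mer size described in the model card

# Hypothetical vocabulary: the 64 possible 3-mers in lexicographic order.
# The real OryzaG3 vocabulary and its special tokens may differ.
VOCAB = {"".join(p): i for i, p in enumerate(product(BASES, repeat=K))}
UNK_ID = len(VOCAB)  # assumed id for k-mers containing N or other symbols


def kmer_tokenize(seq, k=K):
    """Split seq into consecutive, non-overlapping k-mers (stride == k)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % k  # assumption: drop bases that do not fill a k-mer
    return [seq[i:i + k] for i in range(0, usable, k)]


def encode(seq):
    """Map each k-mer to an integer id, using UNK_ID for unknown k-mers."""
    return [VOCAB.get(tok, UNK_ID) for tok in kmer_tokenize(seq)]


tokens = kmer_tokenize("ATGGCGTACGA")  # 11 bases, so the last 2 are dropped
ids = encode("ATGGCGTACGA")

# Assuming "8k" means 8,192 tokens and ignoring special tokens,
# one context window spans this many bases:
CONTEXT_TOKENS = 8192
max_bases_per_window = CONTEXT_TOKENS * K

print(tokens, ids, max_bases_per_window)
```

Because the stride equals the k-mer size, a sequence of n bases yields about n/3 tokens, which is why a fixed token budget covers roughly three times as many bases as single-nucleotide tokenization would.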

Key Capabilities & Performance

  • Species-Specific Genomic Analysis: Designed specifically for rice and trained only on rice pangenomes.
  • Competitive Benchmarking: On the Plants Genomic Benchmark-polyA for the Indica Group, OryzaG3-8k (700M) achieves an AUC of 0.970, AP of 0.942, and Accuracy of 0.924. It matches or exceeds the performance of larger multi-species models like AgroNT (1B) and Botanic0-L (991M) on rice-specific tasks.
  • Superior Inference Efficiency: Reaches 400.41 samples/s, about 4.2 times the throughput of AgroNT (95.47 samples/s), which suits high-throughput genomic research.
  • Reproducible Framework: Provides a technical framework for developing lightweight, crop-specific genomic foundation models.
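The throughput figures above can be turned into a rough wall-clock estimate for a given workload. The sketch below uses only the two samples/s values quoted in the list; the helper names and the one-million-sample workload are hypothetical, and the source does not state the hardware, batch size, or sequence length behind the measurements, so results on other setups will differ.

```python
# Throughput values quoted in the model card (samples per second).
ORYZAG3_8K_SPS = 400.41
AGRONT_SPS = 95.47


def speedup(fast_sps, slow_sps):
    """Ratio of two throughputs; values above 1 mean the first is faster."""
    if slow_sps <= 0:
        raise ValueError("throughput must be positive")
    return fast_sps / slow_sps


def hours_for(n_samples, samples_per_second):
    """Estimated hours to process n_samples at a constant throughput."""
    if samples_per_second <= 0:
        raise ValueError("throughput must be positive")
    return n_samples / samples_per_second / 3600


ratio = speedup(ORYZAG3_8K_SPS, AGRONT_SPS)

# Hypothetical workload size, chosen only for illustration.
N_SAMPLES = 1_000_000
oryza_hours = hours_for(N_SAMPLES, ORYZAG3_8K_SPS)
agront_hours = hours_for(N_SAMPLES, AGRONT_SPS)

print(f"speedup: {ratio:.2f}x")
print(f"OryzaG3-8k: {oryza_hours:.2f} h, AgroNT: {agront_hours:.2f} h")
```

The estimate assumes throughput stays constant across the whole run, which ignores data loading and warm-up costs.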

When to Use OryzaG3-8k

This model is ideal for researchers and developers working on:

  • Rice Genomics: Tasks requiring detailed analysis and understanding of rice DNA sequences.
  • Crop-Specific AI: Developing specialized AI applications for agricultural genomics, particularly for rice.
  • Efficient Genomic Inference: Scenarios where high throughput and efficient processing of genomic data are critical.

OryzaG3 uses the Gemma3-1B architecture configuration but does not load Gemma's original pretrained weights; the model was trained from scratch on rice pangenomes.