yang0104/OryzaG3-8k
OryzaG3-8k is a 700-million-parameter single-species DNA language model developed by yang0104 and pretrained on 149 high-quality rice pangenomes. It uses a non-overlapping 3-mer tokenization strategy and a Causal Language Modeling objective, and is intended for genomic analysis within the rice species. It offers an 8k-token context length and performs competitively against larger multi-species models on rice-specific tasks while delivering higher inference throughput.
OryzaG3-8k: A Genomic Foundation Model for Rice
OryzaG3-8k is a 700-million-parameter DNA language model developed by yang0104 and focused exclusively on rice (Oryza). It was pretrained on 149 high-quality rice pangenomes using non-overlapping 3-mer tokenization, with Causal Language Modeling (CLM) as the pretraining objective. OryzaG3 is released in two context-length versions; this one, OryzaG3-8k, has an 8k-token context.
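To make the tokenization concrete, here is a minimal sketch of non-overlapping 3-mer splitting. It is not the model's actual tokenizer: the function names `kmer_tokenize` and `bases_covered` are hypothetical, the handling of a trailing partial k-mer is an assumption, and reading "8k" as 8,192 tokens is also an assumption.

```python
def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into consecutive, non-overlapping k-mers.

    Assumption: a trailing chunk shorter than k is kept as its own token.
    The real tokenizer may pad or drop it instead.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def bases_covered(n_tokens: int, k: int = 3) -> int:
    """Number of bases spanned by n_tokens non-overlapping k-mers."""
    return n_tokens * k


tokens = kmer_tokenize("ATGCGTAC")
print(tokens)  # ['ATG', 'CGT', 'AC']

# Assumption: "8k" means 8,192 tokens; special tokens are ignored.
print(bases_covered(8192))  # 24576
```

Because each token covers three bases, a context of this size would span roughly three times as many base pairs as tokens, which is one reason non-overlapping k-mers are attractive for long genomic windows.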
Key Capabilities & Performance
- Species-Specific Genomic Analysis: Designed specifically for rice, enabling deep insights into its genomics.
- Competitive Benchmarking: On the Plants Genomic Benchmark-polyA for the Indica Group, OryzaG3-8k (700M) achieves an AUC of 0.970, AP of 0.942, and Accuracy of 0.924. It matches or exceeds the performance of larger multi-species models like AgroNT (1B) and Botanic0-L (991M) on rice-specific tasks.
- Superior Inference Efficiency: Reaches 400.41 samples/s, roughly 4.2 times the 95.47 samples/s reported for AgroNT, which makes it well suited to high-throughput genomic research.
- Reproducible Framework: Provides a technical framework for developing lightweight, crop-specific genomic foundation models.
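The throughput figures quoted above can be turned into a relative speedup and a rough wall-clock estimate. This is only an arithmetic sketch: the helper names and the one-million-sample workload are hypothetical, and the numbers are taken as given, since the hardware and batch size behind them are not stated here.

```python
# Reported throughput in samples per second, copied from the list above.
THROUGHPUT = {"OryzaG3-8k": 400.41, "AgroNT": 95.47}


def speedup(fast: float, slow: float) -> float:
    """Ratio of two throughputs measured in the same units."""
    return fast / slow


def seconds_for(n_samples: int, samples_per_s: float) -> float:
    """Estimated time to process n_samples at a constant rate."""
    return n_samples / samples_per_s


ratio = speedup(THROUGHPUT["OryzaG3-8k"], THROUGHPUT["AgroNT"])
print(f"{ratio:.2f}x")  # 4.19x

# Hypothetical workload of one million sequences.
for name, rate in THROUGHPUT.items():
    minutes = seconds_for(1_000_000, rate) / 60
    print(f"{name}: {minutes:.1f} min")
```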
When to Use OryzaG3-8k
This model is ideal for researchers and developers working on:
- Rice Genomics: Tasks requiring detailed analysis and understanding of rice DNA sequences.
- Crop-Specific AI: Developing specialized AI applications for agricultural genomics, particularly for rice.
- Efficient Genomic Inference: Scenarios where high throughput and efficient processing of genomic data are critical.
OryzaG3 was initialized from the Gemma3-1B architecture configuration without loading Gemma3's original pretrained weights, so the model was trained from scratch on rice pangenomes.