SoilFM Language Tower: Domain-Adapted for Soil Science
This model, northenlab/soilfm-qwen2.5-14b-literature-cpt, is a specialized large language model (LLM) built on the 14.2-billion-parameter Qwen2.5-14B-Instruct. Developed by the Northen Lab at Lawrence Berkeley National Laboratory, it is the "Language Tower" component of the multi-modal SoilFM2 foundation model for soil microbiome analysis. What sets it apart from the base model is continued pretraining on soil science and soil microbiology literature.
Key Capabilities & Features
- Domain-Specific Knowledge: Trained through continued pretraining on 200,000 curated text passages from sources such as PubMed Central soil microbiology papers, Wikipedia soil science articles, and the USDA Soil Survey Manual.
- High Context Length: Inherits a 32,768-token context window from its base model, suitable for processing extensive scientific texts.
- Efficient Training: Continued pretraining used QLoRA (base weights quantized to 4-bit NF4), which reduces training memory requirements. Validation loss improved by 7.2% over 1,500 training steps.
- Integration with SoilFM2: Designed to provide domain-grounded context within the broader SoilFM2 multi-modal pipeline, supporting applications like prebiotic recommendation.
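To give a rough sense of why 4-bit NF4 quantization makes training a model of this size more tractable, the following back-of-the-envelope sketch compares weight storage at 16 bits and 4 bits per parameter. It uses the parameter count stated in this card and is only an approximation: it ignores quantization constants, LoRA adapter weights, activations, and optimizer state, and the helper name `weight_memory_gb` is hypothetical.

```python
# Rough weight-memory estimate (GB = 10**9 bytes).
# Assumption: only the base model weights are counted; quantization
# constants, LoRA adapters, activations, and optimizer state are ignored.

N_PARAMS = 14_200_000_000  # parameter count as stated in this card


def weight_memory_gb(n_params: int, bits_per_param: float) -> float:
    """Return the storage needed for n_params weights at the given bit width."""
    return n_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit NF4 weights

print(f"16-bit weights: {bf16_gb:.1f} GB")  # prints 28.4 GB
print(f"4-bit weights:  {nf4_gb:.1f} GB")   # prints 7.1 GB
```

Under these assumptions the quantized base weights take about a quarter of the 16-bit footprint; actual training memory will be higher.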
Intended Uses
- Generating detailed explanations of complex soil microbial processes, rhizosphere ecology, and plant-microbe interactions.
- Serving as a specialized backbone for downstream fine-tuning or Retrieval-Augmented Generation (RAG) systems in soil science.
- Supporting research and educational applications requiring deep knowledge in soil microbiology.
This model is intended for research and non-commercial use only, because it inherits licensing constraints from its training data.
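As an illustration of the RAG use case listed above, here is a minimal sketch that retrieves passages by simple word overlap and assembles a prompt. The mini-corpus, the helper names (`tokenize`, `retrieve`, `build_prompt`, `generate`), and the prompt layout are all hypothetical. The `generate` function assumes the repository hosts weights loadable with `transformers` and that `bitsandbytes` is installed; neither is confirmed by this card, so it is defined but not called here.

```python
import re

MODEL_ID = "northenlab/soilfm-qwen2.5-14b-literature-cpt"  # repo id from this card

# Hypothetical mini-corpus; a real system would index curated soil science texts.
PASSAGES = [
    "Rhizobia fix atmospheric nitrogen inside legume root nodules.",
    "Arbuscular mycorrhizal fungi help plants take up phosphorus.",
    "Nitrification converts ammonium to nitrate in aerated soils.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, passages, k=2):
    """Rank passages by word overlap with the question and keep the top k."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a plain-text prompt with retrieved context (hypothetical layout)."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def generate(prompt, max_new_tokens=128):
    """Assumption: the repo loads via AutoModelForCausalLM with 4-bit NF4 weights."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, quantization_config=quant, device_map="auto"
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)


question = "How do legume root nodules fix nitrogen?"
prompt = build_prompt(question, PASSAGES, k=1)
print(prompt)
# answer = generate(prompt)  # requires a GPU and downloads the model weights
```

Word-overlap retrieval is used only to keep the sketch dependency-free; a production system would more likely use an embedding-based retriever.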