Overview
C2S-Scale-Gemma-2-2B: A Specialized LLM for Single-Cell Biology
C2S-Scale-Gemma-2-2B is a 2.6-billion-parameter language model developed by the van Dijk Lab (Yale), Google Research, and Google DeepMind. It is built on the Gemma-2 2B architecture and fine-tuned for single-cell biology using the Cell2Sentence (C2S) framework. The model represents high-dimensional scRNA-seq data as "cell sentences" (sequences of gene names ordered by descending expression), so that single-cell analyses can be posed as text tasks.
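To make the "cell sentence" idea concrete, here is a minimal sketch of the conversion for one cell. It assumes the expression profile arrives as a plain list of counts aligned with a list of gene names. The function name `cell_to_sentence`, the `max_genes` cutoff, the alphabetical tie-breaking, and the toy counts are all illustrative assumptions, not the preprocessing used to train the model.

```python
def cell_to_sentence(expression, gene_names, max_genes=None):
    """Return gene names ordered by descending expression, space-separated.

    Genes with zero counts are dropped. Ties are broken alphabetically so
    the output is deterministic (an assumption made for this sketch).
    """
    if len(expression) != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")

    # Keep only genes that are actually expressed in this cell.
    expressed = [(count, name)
                 for count, name in zip(expression, gene_names)
                 if count > 0]

    # Highest count first; alphabetical order among equal counts.
    expressed.sort(key=lambda pair: (-pair[0], pair[1]))

    names = [name for _, name in expressed]
    if max_genes is not None:
        names = names[:max_genes]  # optionally truncate to the top genes
    return " ".join(names)


# Toy example: five genes, two of them not detected in this cell.
counts = [0, 12, 3, 7, 0]
genes = ["ACTB", "CD3E", "GAPDH", "MALAT1", "PTPRC"]
print(cell_to_sentence(counts, genes))  # CD3E MALAT1 GAPDH
```

The resulting string can be tokenized like any other text, which is what lets a general-purpose language model architecture operate on expression data.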
Key Capabilities
- Single-Cell Data Interpretation: Processes scRNA-seq data by converting it into a language-like format.
- Versatile Performance: Demonstrates strong capabilities across diverse single-cell and multi-cell tasks, including cell type prediction and tissue classification.
- Scalability: Trained on over 57 million cells from more than 800 datasets drawn from CellxGene and the Human Cell Atlas.
- Generative Power: Capable of generating realistic single-cell gene expression profiles for in silico experiments.
- Foundational Model: Serves as a robust pretrained base for fine-tuning on specialized, domain-specific single-cell analysis tasks.
Good For
- Research in Single-Cell Genomics: Ideal for computational biologists studying cellular diversity.
- Cell Atlas Annotation: Streamlining the annotation of large-scale single-cell datasets.
- Biomarker Discovery: Identifying gene patterns relevant to specific cell states or diseases.
- In Silico Experimentation: Generating cells under specific conditions to test biological hypotheses.
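For in silico experiments, generated text has to be mapped back to genes before it can be analyzed. The sketch below shows one possible post-processing step, assuming the model output is a space-separated cell sentence that may contain repeated or unrecognized tokens. The function name `sentence_to_ranks`, the vocabulary, and the example string are hypothetical. Converting ranks into expression values is a separate step that is not shown here.

```python
def sentence_to_ranks(sentence, known_genes):
    """Map each recognized gene to the 1-based rank of its first appearance.

    Tokens outside `known_genes` and repeated genes are skipped, so ranks
    stay contiguous (1, 2, 3, ...).
    """
    ranks = {}
    for token in sentence.split():
        if token in known_genes and token not in ranks:
            ranks[token] = len(ranks) + 1
    return ranks


# Hypothetical generated output with a duplicate and an unknown token.
vocabulary = {"ACTB", "CD3E", "GAPDH", "MALAT1", "PTPRC"}
generated = "CD3E MALAT1 CD3E FOO1 GAPDH"
print(sentence_to_ranks(generated, vocabulary))
# {'CD3E': 1, 'MALAT1': 2, 'GAPDH': 3}
```

A lower rank means the gene appeared earlier in the sentence, which under the cell-sentence convention corresponds to higher expression.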
This model advances the application of large language models to biological data and sets new benchmarks in single-cell biology. For more details, refer to the C2S-Scale paper.