Overview
C2S-Scale-Gemma-2-27B: Single-Cell Biology LLM
C2S-Scale-Gemma-2-27B is a 27-billion-parameter language model developed through a collaboration between Yale's van Dijk Lab, Google Research, and Google DeepMind. It combines the Gemma-2 architecture with the Cell2Sentence (C2S) framework, which represents single-cell RNA sequencing (scRNA-seq) data as 'cell sentences': sequences of gene names ordered by expression level. Trained on over 57 million human and mouse cells from CellxGene and the Human Cell Atlas, the model significantly scales up LLM capabilities for biological analysis.
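At the core of C2S is a rank-based encoding of each cell. The sketch below shows one plausible way to build a cell sentence by ranking genes by expression and keeping the top names; the function name, cutoff, and normalization here are assumptions for illustration and may differ from the released C2S pipeline.

```python
import numpy as np

def cell_to_sentence(expression, gene_names, top_k=100):
    """Turn one cell's expression vector into a 'cell sentence':
    gene names ordered by descending expression, truncated to the
    top_k expressed genes. Illustrative only; the released C2S
    pipeline may normalize and truncate differently."""
    expression = np.asarray(expression, dtype=float)
    # Indices of genes sorted from highest to lowest expression.
    order = np.argsort(expression)[::-1]
    # Drop genes with zero expression and keep at most top_k names.
    expressed = [i for i in order if expression[i] > 0][:top_k]
    return " ".join(gene_names[i] for i in expressed)

# Toy example with made-up counts for four marker genes.
genes = ["CD3D", "MS4A1", "LYZ", "NKG7"]
counts = [5.0, 0.0, 12.0, 2.0]
print(cell_to_sentence(counts, genes, top_k=3))  # -> "LYZ CD3D NKG7"
```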
Key Capabilities
- Single-Cell Data Understanding: Processes high-dimensional scRNA-seq data by converting it into a language-like format.
- Versatile Performance: Demonstrates strong results across diverse single-cell and multi-cell tasks, including advanced downstream applications like cluster captioning and perturbation prediction.
- Generative Power: Capable of generating realistic single-cell gene expression profiles for in silico experiments (see the loading and generation sketch after this list).
- Foundation Model: Serves as a powerful pretrained base for fine-tuning on specialized, domain-specific single-cell analysis tasks.
- Scalability: Trained on a massive dataset using Google's TPU v5s, enabling a significant increase in model size and capability.
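Because the model builds on the Gemma-2 architecture, it can in principle be loaded and queried like any causal LM with the Hugging Face transformers library. The sketch below is minimal; the repository ID, prompt wording, and generation settings are assumptions to verify against the official release.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository ID; check the official release for the exact name.
model_id = "vandijklab/C2S-Scale-Gemma-2-27B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 27B weights are large; use bf16 where supported
    device_map="auto",
)

# Assumed prompt style: ask the model to continue a partial cell sentence,
# i.e. to propose further highly expressed genes for a hypothetical T cell.
prompt = "Cell sentence of a human T cell: CD3D CD3E TRAC IL7R"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```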
Good for
- Cell Type Prediction & Annotation: Streamlining the annotation of large-scale single-cell datasets.
- Biomarker Discovery: Identifying gene patterns for specific cell states or diseases.
- In Silico Experiments: Generating cells under specific conditions to test biological hypotheses.
- Research in Single-Cell Genomics: A foundational tool for computational biology and interpreting scRNA-seq experiments.
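For the annotation use case, one plausible pattern is to pair a cell sentence with a short natural-language question and read off the completion. The prompt wording below is illustrative, not an official template, and reuses the model and tokenizer loaded in the earlier sketch.

```python
def annotate_cell(cell_sentence, model, tokenizer, max_new_tokens=16):
    """Ask the model to name the cell type for a given cell sentence.
    The prompt wording here is an assumption, not an official template."""
    prompt = (
        "Below is a cell sentence listing a cell's most highly expressed "
        "genes in descending order.\n"
        f"Cell sentence: {cell_sentence}\n"
        "The most likely cell type is:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Example call, reusing the model/tokenizer from the loading sketch above:
# print(annotate_cell("CD3D CD3E TRAC IL7R CCR7 LTB", model, tokenizer))
```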