yerevann/chemma-2b

Text Generation | Concurrency Cost: 1 | Model Size: 2.5B | Quant: BF16 | Ctx Length: 8k | Published: Jun 2, 2024 | License: cc-by-nc-4.0 | Architecture: Transformer | Open Weights

yerevann/chemma-2b is a 2 billion parameter language model based on Gemma-2B and continually pretrained by YerevaNN on organic molecules. Training used 40 billion tokens covering over 110 million molecules from PubChem, together with their chemical properties and pairwise similarities. The model predicts molecular properties and generates molecules that match specified property and similarity criteria, which makes it suited to chemical space exploration and drug discovery applications.

Chemma-2B: A Specialized LLM for Organic Molecules

Chemma-2B is a 2 billion parameter language model, continually pretrained from Google's Gemma-2B and designed for tasks involving organic molecules. Developed by YerevaNN, it was trained on 40 billion tokens derived from over 110 million molecules from PubChem.
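
As a rough illustration of how a checkpoint like this is typically loaded, the sketch below uses the Hugging Face transformers library with the repository id from the page title and the BF16 precision listed in the metadata. The helper name generate_text is hypothetical, and the prompt format the model expects is not given on this page, so the prompt is left entirely to the caller.

```python
MODEL_ID = "yerevann/chemma-2b"  # repository id taken from the page title


def generate_text(prompt: str, max_new_tokens: int = 64) -> str:
    """Load the checkpoint and return the decoded continuation of `prompt`.

    Hypothetical helper: the prompt format is NOT specified in this
    document, so the caller must supply a prompt the model understands.
    """
    # Imported lazily so this file can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the "Quant: BF16" entry in the listing metadata.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Special tokens are kept because they may carry structure in the output.
    return tokenizer.decode(output_ids[0], skip_special_tokens=False)
```

Calling generate_text downloads the weights on first use, so it needs network access and enough memory for a 2.5B-parameter model in BF16.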

Key Capabilities

  • Molecular Property Prediction: Predicts various chemical properties such as molecular weight, synthetic accessibility score (SAS), drug-likeness (QED), cLogP, TPSA, and ring count for given SMILES strings.
  • Conditional Molecule Generation: Generates novel molecules based on desired chemical properties and similarity to a reference molecule, using Tanimoto distance between ECFP fingerprints.
  • Chemical Space Exploration: Designed to be integrated into optimization loops for traversing and exploring chemical spaces, as demonstrated by the associated ChemLactica GitHub repository.
  • State-of-the-Art Performance: An arXiv preprint describes an optimization algorithm built on the model and reports state-of-the-art results on benchmarks such as Practical Molecular Optimization.
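
To make the similarity criterion in the list above concrete, here is a minimal sketch of the Tanimoto coefficient computed on fingerprints represented as sets of on-bit indices. The bit sets are invented toy values rather than real ECFP fingerprints, and the function names are illustrative; in practice a cheminformatics toolkit would compute the fingerprints. This page does not say whether the model is conditioned on the similarity or on the distance form, so both are shown.

```python
from typing import Set


def tanimoto_similarity(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto (Jaccard) coefficient: shared on-bits / total distinct on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def tanimoto_distance(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Distance form: 1 minus the similarity."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)


# Toy on-bit sets standing in for ECFP fingerprints (made up for illustration).
reference = {3, 17, 42, 128, 901}
candidate = {3, 17, 42, 256}

similarity = tanimoto_similarity(reference, candidate)  # 3 shared / 6 distinct = 0.5
distance = tanimoto_distance(reference, candidate)      # 1 - 0.5 = 0.5
```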

Good For

  • Drug Discovery: Accelerating the design and optimization of new drug candidates.
  • Materials Science: Discovering molecules with specific desired properties.
  • Computational Chemistry: Researchers and developers working on molecular design and property prediction.
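
The design-and-optimization use cases above usually take the form of a loop around the model: propose candidates, score them, keep the best. The sketch below is a generic top-k loop written for illustration only; it is not the algorithm from the ChemLactica repository or the preprint. The propose callable stands in for a model call that generates a molecule near a parent, and oracle stands in for a property scorer. Both are hypothetical, and the toy run uses integers instead of molecules.

```python
import random
from typing import Callable, Hashable, List, Sequence, Tuple


def optimize(
    seed_pool: Sequence[Hashable],
    propose: Callable[[Hashable, random.Random], Hashable],
    oracle: Callable[[Hashable], float],
    rounds: int = 50,
    pool_size: int = 5,
    seed: int = 0,
) -> List[Tuple[float, Hashable]]:
    """Keep the `pool_size` best (score, candidate) pairs seen so far."""
    rng = random.Random(seed)
    pool = sorted({(oracle(c), c) for c in seed_pool}, reverse=True)[:pool_size]
    for _ in range(rounds):
        _, parent = rng.choice(pool)      # pick a parent from the current pool
        child = propose(parent, rng)      # stand-in for a model generation call
        pool.append((oracle(child), child))
        # Deduplicate, then keep only the highest-scoring candidates.
        pool = sorted(set(pool), reverse=True)[:pool_size]
    return pool


# Toy problem: candidates are integers and the best possible value is 10.
def toy_oracle(x: int) -> float:
    return -float((x - 10) ** 2)


def toy_propose(parent: int, rng: random.Random) -> int:
    return parent + rng.choice([-1, 1])


final_pool = optimize([0], toy_propose, toy_oracle)
best_score, best_candidate = final_pool[0]
```

Because the best candidates are never discarded, the top score in the pool cannot get worse from one round to the next.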

Chemma-2B is part of a family of models, including Chemlactica-125M and Chemlactica-1.3B, all focused on chemical applications.