ProLLaMA: A Specialized Protein Language Model

ProLLaMA is a 7-billion-parameter protein large language model (LLM) developed by Lyu6PosHao and built on the Llama-2-7b architecture. It is designed for multi-task protein language processing and is aimed at researchers and developers in bioinformatics.

Key Capabilities

Protein Sequence Generation: Given a protein superfamily, ProLLaMA generates protein sequences belonging to it. Users can optionally specify the initial amino acids of the desired sequence.
Protein Superfamily Determination: Given a protein sequence, the model identifies the superfamily it belongs to.
Llama-2-7b Foundation: ProLLaMA keeps the widely used Llama-2-7b architecture as its base model and is fine-tuned for the domain of protein language.

Input Format

ProLLaMA uses a fixed instruction format for each task, such as [Generate by superfamily] Superfamily=<xxx> for sequence generation or [Determine superfamily] Seq=<yyy> for superfamily determination. The bracketed prefix selects the task, and the field after it carries the superfamily name or the sequence. A comprehensive list of supported superfamilies is available here.
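
The two instruction strings can be assembled with a small helper. The following Python is a minimal sketch, not part of ProLLaMA itself: the function names generation_prompt and classification_prompt are hypothetical, the angle brackets are assumed to be literal characters in the prompt, and stripping whitespace and uppercasing the sequence are assumptions made here for tidiness. The superfamily name and the sequence in the usage lines are illustrative placeholders only.

```python
# Hypothetical helpers for building ProLLaMA-style instructions.
# Assumption: the angle brackets around the value are literal prompt characters.


def generation_prompt(superfamily: str) -> str:
    """Build a '[Generate by superfamily]' instruction for a superfamily name."""
    superfamily = superfamily.strip()
    if not superfamily:
        raise ValueError("superfamily must be non-empty")
    return f"[Generate by superfamily] Superfamily=<{superfamily}>"


def classification_prompt(sequence: str) -> str:
    """Build a '[Determine superfamily]' instruction for a protein sequence."""
    # Assumption: drop all whitespace and uppercase the one-letter residue codes.
    sequence = "".join(sequence.split()).upper()
    if not (sequence.isascii() and sequence.isalpha()):
        raise ValueError("sequence must contain only ASCII letters")
    return f"[Determine superfamily] Seq=<{sequence}>"


if __name__ == "__main__":
    # Illustrative inputs only; check the supported superfamily list before use.
    print(generation_prompt("Ankyrin repeat-containing domain superfamily"))
    print(classification_prompt("mkta yiak qr"))
```

The resulting string would then be passed to the model as the prompt text. Validating the input before building the prompt keeps malformed sequences from reaching the model, though the exact normalization ProLLaMA expects should be confirmed against the full model card.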

Good For

Researchers and scientists working on protein design and engineering.
Bioinformaticians needing to classify unknown protein sequences.
Applications requiring the generation of novel protein sequences based on known classifications.
