BAAI/OPI-Llama-3.1-8B-Instruct
BAAI/OPI-Llama-3.1-8B-Instruct is an 8 billion parameter instruction-tuned causal language model developed by BAAI, fine-tuned from Meta-Llama-3.1-8B-Instruct. This model specializes in protein-related tasks, including sequence understanding, annotation prediction, and knowledge mining, utilizing a 32768 token context length. It is specifically adapted for tasks such as EC number prediction, fold type prediction, subcellular localization, and gene ontology term prediction. The model demonstrates capabilities in extracting and interpreting complex biological information from protein sequences and related data.
Loading preview...
OPI-Llama-3.1-8B-Instruct: Protein-Related Task Specialist
OPI-Llama-3.1-8B-Instruct is an 8 billion parameter model developed by BAAI, fine-tuned from Meta-Llama-3.1-8B-Instruct. Its primary focus is on protein-related tasks, leveraging the OPI (Open Instruction) training dataset, which comprises 1.61 million examples. This specialization allows the model to perform a range of complex biological predictions and analyses.
Key Capabilities
- Sequence Understanding: Predicts EC numbers, fold types, and subcellular localization from protein sequences.
- Annotation Prediction: Excels at predicting function keywords, Gene Ontology (GO) terms, and function descriptions.
- Knowledge Mining: Capable of predicting tissue locations and cancer associations from gene symbols and names.
Performance Highlights
The model has been evaluated across 9 distinct protein-related tasks, demonstrating its proficiency in specialized biological domains. For instance, it achieves F1 scores up to 0.7374 for UniProtSeq_keywords_test and Rouge-L scores up to 0.7524 for CASPSimilarSeq_function_test. Detailed evaluation metrics for accuracy, precision, recall, F1, and Rouge-L are available for various sub-tasks, showcasing its targeted performance in bioinformatics.
Good for
- Researchers and developers working on protein function prediction.
- Applications requiring automated annotation of biological sequences.
- Tasks involving the extraction of biological knowledge from gene and protein data.