saketlab/seqoutlm-0.5B
SeqoutLM 0.5B is a 500-million-parameter model for biomedical metadata normalization, developed by saketlab. Fine-tuned from Llama 3.2 1B Instruct using QLoRA, it converts unstructured genomic sample metadata into a standardized 16-field JSON representation. It is designed for large-scale metadata harmonization across public genomics repositories such as GEO and SRA, supporting downstream search, filtering, and analytics workflows.
SeqoutLM 0.5B: Biomedical Metadata Normalization
SeqoutLM 0.5B is a specialized language model for biomedical metadata normalization. It takes unstructured genomic sample metadata and transforms it into a fixed 16-field JSON schema. It is intended for harmonizing heterogeneous metadata from public repositories such as GEO and SRA, so that records from different submitters can be integrated and analyzed together.
Key Capabilities
- Standardized Output: Always produces a JSON object with 16 predefined fields (e.g., `organism`, `tissue`, `disease`, `assay`).
- Missing Value Handling: Outputs `null` for fields that cannot be determined from the input text.
- Biomedical Focus: Specifically trained on the `saketlab/seqout-normalized-conversation` dataset, comprising over 600K samples of free-text biomedical metadata paired with normalized JSON targets.
- Efficient Fine-tuning: Built upon Llama 3.2 1B Instruct and fine-tuned using the Unsloth training stack with QLoRA, optimizing for performance and resource efficiency.
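Because the output is a fixed-schema JSON object with `null` for unknown values, a consumer can check each response before storing it. The following is a minimal sketch of such a check, not part of the model's own tooling. The helper name `parse_normalized` and the sample response are hypothetical, and only the four field names mentioned above are used; the full 16-field list would need to be supplied from the model's schema.

```python
import json

# Only these four field names appear in the model card. The real schema has
# 16 fields, so pass the complete list as expected_fields once you have it.
EXAMPLE_FIELDS = ("organism", "tissue", "disease", "assay")


def parse_normalized(raw_output, expected_fields=EXAMPLE_FIELDS):
    """Parse one model response and coerce it to the expected field set.

    Returns (record, unexpected), where record has exactly the expected
    fields and unexpected lists any extra keys found in the response.
    """
    record = json.loads(raw_output)  # raises ValueError on malformed JSON
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object at the top level")
    unexpected = sorted(set(record) - set(expected_fields))
    # Absent keys are treated like JSON null: "could not be determined".
    normalized = {field: record.get(field) for field in expected_fields}
    return normalized, unexpected


# Hypothetical response text, written by hand for illustration only.
raw = '{"organism": "Homo sapiens", "tissue": "liver", "disease": null, "assay": "RNA-Seq"}'
record, unexpected = parse_normalized(raw)
print(record["disease"])  # None (JSON null maps to Python None)
```

Returning the unexpected keys instead of raising lets a batch job log schema drift without discarding the whole record; a stricter pipeline might reject such responses instead.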
Good For
- Large-scale Metadata Harmonization: Ideal for standardizing vast amounts of genomic sample metadata.
- Enabling Downstream Analytics: Facilitates improved search, filtering, and integration of biomedical datasets.
- Automated Data Curation: Automates the process of converting varied text descriptions into a structured, queryable format.
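Once records share one schema, filtering reduces to plain dictionary comparisons. The sketch below illustrates this on three hand-written, hypothetical records; the `matches` helper is not part of any SeqoutLM tooling, and it assumes that determined field values are strings and that a `null` (Python `None`) value should never satisfy a filter.

```python
def matches(record, **criteria):
    """Return True if every criterion equals the record's value.

    Comparison is case-insensitive. A field whose value is None (unknown)
    never matches, so unknown samples are excluded rather than guessed.
    """
    for field, wanted in criteria.items():
        value = record.get(field)
        if value is None or value.lower() != wanted.lower():
            return False
    return True


# Hypothetical normalized records using only the four fields named above.
records = [
    {"organism": "Homo sapiens", "tissue": "liver", "disease": None, "assay": "RNA-Seq"},
    {"organism": "Mus musculus", "tissue": "liver", "disease": None, "assay": "ChIP-Seq"},
    {"organism": "Homo sapiens", "tissue": None, "disease": "hepatocellular carcinoma", "assay": "RNA-Seq"},
]

human_liver = [r for r in records if matches(r, organism="homo sapiens", tissue="liver")]
print(len(human_liver))  # 1
```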