ether0: A Specialized Chemistry Reasoning Model
ether0 is a 24 billion parameter language model developed by FutureHouse, built by fine-tuning and reinforcement learning from Mistral-Small-24B-Instruct-2501. Unlike general-purpose chat models, ether0 is engineered specifically to reason in English and to emit molecular structures as SMILES strings. It supports a 32,768 token context length.
Key Capabilities
- Molecular Property Prediction & Modification: Converts IUPAC names or molecular formulas to SMILES, and modifies molecules based on desired properties like pKa, LogS, scent, or ADME properties (e.g., LD50, efflux ratio).
- Reactivity & Synthesis: Proposes one-step retrosynthesis from commercially available reagents and predicts reaction outcomes.
- Biological Interactions: Predicts which molecule binds a given human cell receptor and in what mode (agonist/antagonist), or modifies molecules to adjust their binding effects, drawing on data from EveBio.
- Inverse Molecule Captioning: Generates SMILES from natural language descriptions of specific molecules.
- Natural Product Elucidation: Identifies potential SMILES structures from molecular formulas and organism sources.
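Every capability above returns a molecule encoded as SMILES, so downstream code typically sanity-checks model outputs before use. Proper validation requires a cheminformatics toolkit such as RDKit; the sketch below is only an illustrative lightweight pre-filter (balanced parentheses/brackets and paired ring-closure digits), not anything shipped with ether0.

```python
def looks_like_smiles(s: str) -> bool:
    """Cheap syntactic sanity check for a SMILES string.

    Verifies that () and [] are balanced and that every ring-closure
    digit (outside bracket atoms, where digits mean isotopes/charges)
    appears an even number of times. Not a substitute for a real
    parser such as RDKit's MolFromSmiles.
    """
    if not s or " " in s:
        return False
    depth_paren = depth_brack = 0
    ring_counts: dict[str, int] = {}
    for ch in s:
        if ch == "(":
            depth_paren += 1
        elif ch == ")":
            depth_paren -= 1
            if depth_paren < 0:
                return False
        elif ch == "[":
            depth_brack += 1
        elif ch == "]":
            depth_brack -= 1
            if depth_brack < 0:
                return False
        elif ch.isdigit() and depth_brack == 0:
            ring_counts[ch] = ring_counts.get(ch, 0) + 1
    return (depth_paren == 0 and depth_brack == 0
            and all(n % 2 == 0 for n in ring_counts.values()))

print(looks_like_smiles("c1ccccc1O"))  # phenol: True
print(looks_like_smiles("C1CC(C"))     # unclosed ring and parenthesis: False
```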
Training & Benchmarking
Training began with supervised fine-tuning on reasoning traces, followed by specialist rounds of reinforcement learning using GRPO with verifiable rewards for specific tasks, and aggregation of filtered reasoning traces back into a single model. The model then underwent safety post-training. Benchmarking was conducted on a custom dataset (futurehouse/ether0-benchmark) of chemistry tasks where every answer is a molecule, showing strong performance in its specialized domains.
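GRPO dispenses with a learned value function by normalizing each completion's reward within its sampling group. A minimal sketch of that group-relative advantage computation, with made-up reward values for illustration:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: subtract the group mean reward and
    divide by the group standard deviation, so no critic is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. four completions sampled for one prompt, scored 1.0 when the
# proposed molecule passes the task's verifiable reward, else 0.0
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # verified answers get positive advantage, failures negative
```

Completions with positive advantage are reinforced; because the normalization is per group, only relative quality within the sampled batch matters.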
Limitations
ether0 has limited general knowledge and performs poorly on general chemistry textbook questions (e.g., ChemBench). For best results, input molecules as SMILES rather than by name: common names can trigger reasoning errors (e.g., confusing lysine with glutamic acid).
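One way to avoid the common-name pitfall is to resolve names to SMILES before building the prompt. The sketch below uses a tiny hard-coded table purely for illustration; a real pipeline would query a resolver such as RDKit plus a name database, or a service like PubChem. The SMILES entries are standard literature structures, not ether0 output.

```python
# Illustrative name-to-SMILES table; real code would use a proper
# name resolver (e.g. PubChem) instead of a hand-written dict.
NAME_TO_SMILES = {
    "lysine": "NCCCCC(N)C(=O)O",
    "glutamic acid": "OC(=O)CCC(N)C(=O)O",
}

def smiles_for(name: str) -> str:
    """Return the SMILES for a known common name, raising on unknown
    names so no unresolved name is silently passed to the model."""
    try:
        return NAME_TO_SMILES[name.lower()]
    except KeyError:
        raise ValueError(f"unknown name {name!r}; supply SMILES directly") from None

prompt = f"Modify {smiles_for('lysine')} to increase its water solubility."
print(prompt)
```

Failing loudly on unknown names is the point: handing the model a raw common name is exactly the input mode the limitation above warns against.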