Janus: Historical Word Usage Generation Model

Janus is an 8-billion-parameter model built on Meta Llama 3, developed by Pierluigi Cassotti and Nina Tahmasebi at the University of Gothenburg. Given a word, a definition of one of its senses, and a target year, it generates example sentences that are historically and semantically appropriate for that sense and period.

Key Capabilities

Historical Usage Generation: Produces example sentences reflecting linguistic usage from 1700 to 2020.
Semantic Accuracy: In human evaluations, generated usages were judged comparable to Oxford English Dictionary (OED) test data.
Temporal Accuracy: Achieves a root mean squared error (RMSE) of approximately 52.7 years against OED ground truth, so generated usages are typically placed within about half a century of the target year.
Context Variability: Maintains low lexical repetition, preserving natural linguistic diversity in generated text.
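
The temporal accuracy and context variability figures above rest on two simple measurements, sketched below in Python. This is a minimal illustration, not the authors' evaluation code: the function names, the toy years, and the toy sentences are invented, and it assumes each generated sentence has already been assigned an estimated year by some dating procedure that this summary does not describe. The repetition measure shown (unique tokens divided by total tokens) is one common choice and may differ from the metric the authors used.

```python
import math


def rmse(predicted_years, reference_years):
    """Root mean squared error, in years, between two equal-length lists."""
    if len(predicted_years) != len(reference_years) or not predicted_years:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (p - r) ** 2 for p, r in zip(predicted_years, reference_years)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def distinct_token_ratio(sentences):
    """Unique tokens / total tokens over a set of sentences.

    Values near 1.0 mean little lexical repetition; values near 0.0
    mean the sentences reuse the same words heavily.
    Tokenization is naive whitespace splitting (an assumption).
    """
    tokens = [tok.lower() for s in sentences for tok in s.split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented toy data, for illustration only.
predicted = [1750, 1820, 1900, 1990]   # hypothetical estimated years
reference = [1700, 1850, 1900, 1950]   # hypothetical ground-truth years
error_in_years = rmse(predicted, reference)

toy_sentences = ["the cat sat", "the dog sat"]
diversity = distinct_token_ratio(toy_sentences)

print(f"RMSE: {error_in_years:.1f} years")
print(f"Distinct-token ratio: {diversity:.2f}")
```

Because RMSE squares each error before averaging, a few usages dated far from their target year raise the score more than many small misses, which is worth keeping in mind when reading the 52.7-year figure.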

Training and Data

Janus was fine-tuned using QLoRA on a dataset of over 1.2 million sense-annotated historical usages extracted from the Oxford English Dictionary (OED), covering the period from 1700 to 2020.

Good For

Semantic Change Detection: Investigating the evolution of word meanings over time.
Historical NLP: Enhancing the understanding and processing of historical texts.
Linguistic Research: Generating sense-annotated corpora for various linguistic studies.

Limitations

Users should be aware of three limitations: the training data may carry historical biases, temporal resolution is approximate (an RMSE of roughly 50 years), and the model may produce modern phrasing in sentences meant to reflect older periods. The model has not been explicitly trained for fairness or bias mitigation.
