metadata
language:
- en
base_model:
- meta-llama/Meta-Llama-3-8B
pipeline_tag: text2text-generation
Janus
(Built with Meta Llama 3)
For the variant that also takes a part-of-speech (PoS) tag as input, see Janus (PoS).
Model Details
- Model Name: Janus
- Version: 1.0
- Developers: Pierluigi Cassotti, Nina Tahmasebi
- Affiliation: University of Gothenburg
- License: MIT
- GitHub Repository: Historical Word Usage Generation
- Paper: Sense-specific Historical Word Usage Generation
- Contact: [email protected]
Model Description
Janus is a fine-tuned Llama 3 8B model designed to generate historically and semantically accurate word usages. It takes a word, its sense definition, and a year as input, and produces example sentences that reflect linguistic usage from the specified period. The model is particularly useful for semantic change detection, historical NLP, and linguistic research.
Intended Use
- Semantic Change Detection: Investigating how word meanings evolve over time.
- Historical Text Processing: Enhancing the understanding and modeling of historical texts.
- Corpus Expansion: Generating sense-annotated corpora for linguistic studies.
Training Data
- Dataset: Extracted from the Oxford English Dictionary (OED)
- Size: Over 1.2 million sense-annotated historical usages
- Time Span: 1700 - 2020
- Data Format:
<year><|t|><lemma><|t|><definition><|s|><historical usage sentence><|end|>
- Janus (PoS) Format:
<year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|><historical usage sentence><|end|>
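The two layouts above can be assembled with a small helper. This is a minimal sketch rather than code from the Janus repository: the function name `build_prompt` is hypothetical, and the PoS value `"adjective"` is only a placeholder, since the tag set used by Janus (PoS) is not listed here. The prompt stops at `<|s|>` so that the model continues with the usage sentence.

```python
# Marker strings copied from the Data Format lines above.
FIELD_SEP = "<|t|>"
POS_SEP = "<|p|>"
USAGE_START = "<|s|>"


def build_prompt(year, lemma, definition, pos=None):
    """Return a generation prompt ending at the usage-start marker.

    With pos=None the plain Janus layout is used; otherwise the PoS tag is
    wrapped in <|p|> markers, following the Janus (PoS) layout.
    """
    prompt = f"{year}{FIELD_SEP}{lemma}{FIELD_SEP}{definition}"
    if pos is not None:
        prompt += f"{POS_SEP}{pos}{POS_SEP}"
    return prompt + USAGE_START


definition = "Used to emphasize something unpleasant or negative."
print(build_prompt(1800, "awful", definition))
# "adjective" is a placeholder tag, not a confirmed Janus (PoS) label.
print(build_prompt(1800, "awful", definition, pos="adjective"))
```

Keeping the markers in named constants makes it easier to switch between the two model variants without retyping the separators.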
Training Procedure
- Base Model: meta-llama/Meta-Llama-3-8B
- Optimization: QLoRA (Quantized Low-Rank Adaptation)
- Batch Size: 4
- Learning Rate: 2e-4
- Epochs: 1
Model Performance
- Temporal Accuracy: Root mean squared error (RMSE) of ~52.7 years (close to OED ground truth)
- Semantic Accuracy: Comparable to OED test data on human evaluations
- Context Variability: Low lexical repetition, preserving natural linguistic diversity
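For readers unfamiliar with the temporal metric, RMSE here is the square root of the mean squared difference, in years, between a target year and a predicted year. The sketch below only illustrates that definition. How the predicted years were obtained for the reported figure is not described in this card, and the function name `year_rmse` and the sample values are hypothetical.

```python
import math


def year_rmse(target_years, predicted_years):
    """Root mean squared error, in years, between two equal-length sequences."""
    if len(target_years) != len(predicted_years) or not target_years:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(t - p) ** 2 for t, p in zip(target_years, predicted_years)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values, for illustration only (not taken from the evaluation).
targets = [1750, 1850, 1950]
predictions = [1800, 1800, 1990]
print(round(year_rmse(targets, predictions), 1))
```

Because errors are squared before averaging, a few usages dated far from their target year raise the RMSE more than many small deviations do.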
Usage Example
Generating Historical Usages
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChangeIsKey/llama3-janus"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Prompt layout: <year><|t|><lemma><|t|><definition><|s|>
input_text = "1800<|t|>awful<|t|>Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’.<|s|>"
# Move the inputs to the device chosen by device_map="auto" instead of hard-coding "cuda"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature and top_p to take effect
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
For more examples, see the GitHub repository Historical Word Usage Generation.
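The decoded output contains the prompt followed by the generated sentence, so some post-processing is usually needed. The sketch below assumes the text was decoded with `skip_special_tokens=False`, so that the `<|s|>` and `<|end|>` markers are still present. Whether the tokenizer registers these markers as special tokens is not stated in this card, and the helper name `extract_usage` is hypothetical.

```python
USAGE_START = "<|s|>"
END = "<|end|>"


def extract_usage(decoded):
    """Return the text between <|s|> and <|end|>, or None if <|s|> is absent.

    If <|end|> is missing (for example, generation hit max_new_tokens),
    everything after <|s|> is returned.
    """
    start = decoded.find(USAGE_START)
    if start == -1:
        return None
    usage = decoded[start + len(USAGE_START):]
    end = usage.find(END)
    if end != -1:
        usage = usage[:end]
    return usage.strip()


# Hypothetical decoded string, not real model output.
sample = "1800<|t|>awful<|t|>Some definition.<|s|>A made-up example sentence.<|end|>trailing text"
print(extract_usage(sample))
```

If the markers turn out to be stripped during decoding, an alternative is to decode only the newly generated token ids, that is, the part of `output[0]` after the prompt length.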
Limitations & Ethical Considerations
- Historical Bias: The model may reflect biases present in historical texts.
- Time Granularity: The temporal resolution is approximate (RMSE of ~52.7 years).
- Modern Influence: Despite fine-tuning, the model may still generate modern phrases in older contexts.
- Not Trained for Fairness: The model has not been explicitly trained to be fair or unbiased. It may produce sensitive, outdated, or culturally inappropriate content.
Citation
If you use Janus, please cite:
@article{Cassotti2024Janus,
author = {Pierluigi Cassotti and Nina Tahmasebi},
title = {Sense-specific Historical Word Usage Generation},
journal = {TACL},
year = {2025}
}