---
language:
- en
base_model:
- meta-llama/Meta-Llama-3-8B
pipeline_tag: text-generation
---
## Janus

(Built with Meta Llama 3)

For the version with the PoS tag, visit [Janus (PoS)](https://huggingface.co/ChangeIsKey/llama3-janus-pos).
### Model Details

- **Model Name**: Janus
- **Version**: 1.0
- **Developers**: Pierluigi Cassotti, Nina Tahmasebi
- **Affiliation**: University of Gothenburg
- **License**: MIT
- **GitHub Repository**: [Historical Word Usage Generation](https://github.com/ChangeIsKey/historical-word-usage-generation)
- **Paper**: [Sense-specific Historical Word Usage Generation](https://transacl.org)
- **Contact**: [email protected]
### Model Description

Janus is a fine-tuned **Llama 3 8B** model designed to generate historically and semantically accurate word usages. Given a word, its sense definition, and a year, it produces example sentences that reflect linguistic usage from the specified period. This makes it particularly useful for **semantic change detection**, **historical NLP**, and **linguistic research**.
### Intended Use

- **Semantic Change Detection**: Investigating how word meanings evolve over time.
- **Historical Text Processing**: Enhancing the understanding and modeling of historical texts.
- **Corpus Expansion**: Generating sense-annotated corpora for linguistic studies.
### Training Data

- **Dataset**: Extracted from the **Oxford English Dictionary (OED)**
- **Size**: Over **1.2 million** sense-annotated historical usages
- **Time Span**: **1700–2020**
- **Data Format**:

```
<year><|t|><lemma><|t|><definition><|s|><historical usage sentence><|end|>
```

- **Janus (PoS) Format**:

```
<year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|><historical usage sentence><|end|>
```
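At inference time, the prompt is everything up to and including `<|s|>`, and the model completes the usage sentence. As a minimal illustration (the `build_prompt` helper is hypothetical, not part of the released code):

```python
def build_prompt(year, lemma, definition, pos=None):
    """Assemble a Janus-style prompt; pass a PoS tag only for the Janus (PoS) variant."""
    if pos is None:
        return f"{year}<|t|>{lemma}<|t|>{definition}<|s|>"
    return f"{year}<|t|>{lemma}<|t|>{definition}<|p|>{pos}<|p|><|s|>"

# Prompt for the plain Janus model; the model completes the usage sentence.
print(build_prompt(1800, "awful", "Used to emphasize something unpleasant or negative."))
```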
### Training Procedure

- **Base Model**: `meta-llama/Meta-Llama-3-8B`
- **Optimization**: **QLoRA** (Quantized Low-Rank Adaptation)
- **Batch Size**: **4**
- **Learning Rate**: **2e-4**
- **Epochs**: **1**
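
For orientation, a QLoRA setup consistent with these hyperparameters might look like the sketch below. Only the base model, batch size, learning rate, and epoch count come from this card; the adapter rank, alpha, dropout, and target modules are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Train low-rank adapters on top of the frozen quantized weights.
# r, lora_alpha, lora_dropout, and target_modules are assumed values.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

# Hyperparameters reported in this card; a Trainer would consume these
# together with the formatted OED prompts shown above.
args = TrainingArguments(
    output_dir="janus-qlora",
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    num_train_epochs=1,
)
```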
### Model Performance

- **Temporal Accuracy**: Root mean squared error (RMSE) of **~52.7 years** relative to OED ground-truth dates
- **Semantic Accuracy**: Judged comparable to OED test data in human evaluations
- **Context Variability**: Low lexical repetition, preserving natural linguistic diversity
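
The temporal-accuracy figure is a standard RMSE over years. A minimal sketch of the computation (the year lists are invented toy examples; the model that assigns a date to each generated sentence is outside the scope of this card):

```python
import math

# Invented example: years assigned to generated sentences vs. the years requested.
predicted = [1812, 1875, 1990]
target = [1800, 1900, 2000]

rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))
print(f"RMSE: {rmse:.1f} years")  # ≈ 17.0 for this toy example
```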
### Usage Example

#### Generating Historical Usages
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChangeIsKey/llama3-janus"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Prompt format: <year><|t|><lemma><|t|><definition><|s|>
input_text = "1800<|t|>awful<|t|>Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’.<|s|>"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature and top_p to take effect.
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
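The decoded string repeats the prompt before the generated sentence. One way to isolate just the usage, as a sketch that assumes the `<|s|>` and `<|end|>` markers survive decoding when special tokens are not skipped:

```python
# Keep the markers, then slice out the generated usage sentence.
decoded = tokenizer.decode(output[0], skip_special_tokens=False)
usage = decoded.split("<|s|>")[-1].split("<|end|>")[0].strip()
print(usage)
```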
For more examples, see the GitHub repository [Historical Word Usage Generation](https://github.com/ChangeIsKey/historical-word-usage-generation).
### Limitations & Ethical Considerations

- **Historical Bias**: The model may reflect biases present in historical texts.
- **Time Granularity**: The temporal resolution is approximate (~50 years RMSE).
- **Modern Influence**: Despite fine-tuning, the model may still generate modern phrases in older contexts.
- **Not Trained for Fairness**: The model has not been explicitly trained to be fair or unbiased. It may produce sensitive, outdated, or culturally inappropriate content.
### Citation

If you use Janus, please cite:

```
@article{Cassotti2025Janus,
  author  = {Pierluigi Cassotti and Nina Tahmasebi},
  title   = {Sense-specific Historical Word Usage Generation},
  journal = {Transactions of the Association for Computational Linguistics},
  year    = {2025}
}
```