---
language: fr
---

# Modern French normalisation model

Normalisation model from Modern (17th c.) French to contemporary French. It was introduced in [this paper](https://hal.inria.fr/hal-03540226/) (see citation below). The main research repository can be found [here](https://github.com/rbawden/ModFr-Norm).

## Model description

The normalisation model is trained on the [FreEM_norm corpus](https://freem-corpora.github.io/corpora/norm/), a parallel corpus of 17th-century French texts and their manually normalised versions following contemporary French spelling. The model is a transformer with 2 encoder layers, 4 decoder layers, an embedding dimension of 256 and a feedforward dimension of 1024. The associated tokeniser is trained with SentencePiece using the BPE strategy, with a vocabulary of 1000 BPE tokens.
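
To double-check these hyperparameters against the ported checkpoint, you can inspect its configuration (a minimal sketch; the exact attribute names shown depend on the architecture class produced by the conversion):

```
from transformers import AutoConfig

# Load the configuration of the ported model and print its hyperparameters
config = AutoConfig.from_pretrained("rbawden/modern_french_normalisation")
print(config)
```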

### Intended uses & limitations

The model is designed to normalise 17th c. French texts. It performs best on texts from genres similar to those written during that century.

### How to use

The model is to be used with the custom pipeline available in the original repository [here](https://github.com/rbawden/ModFr-Norm/blob/main/hf-conversion/pipeline.py) and in this repository [here](https://huggingface.co/rbawden/modern_french_normalisation/blob/main/pipeline.py). You first need to download the pipeline file so that you can use it locally (since it is not integrated into HuggingFace).
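
For example, assuming the standard raw-file URL of this repository (i.e. the link above with `blob` replaced by `raw`), one way to fetch the file is:

```
import urllib.request

# Fetch pipeline.py so that NormalisationPipeline can be imported locally
urllib.request.urlretrieve(
    "https://huggingface.co/rbawden/modern_french_normalisation/raw/main/pipeline.py",
    "pipeline.py",
)
```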

```
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from pipeline import NormalisationPipeline  # the downloaded pipeline file

tokeniser = AutoTokenizer.from_pretrained("rbawden/modern_french_normalisation", use_auth_token=True)
model = AutoModelForSeq2SeqLM.from_pretrained("rbawden/modern_french_normalisation", use_auth_token=True)
normalisation_pipeline = NormalisationPipeline(model=model,
                                               tokenizer=tokeniser,
                                               batch_size=32,  # example value
                                               beam_size=5)    # example value

list_sents = ["1. QVe cette propoſtion, qu'vn eſpace eſt vuidé, repugne au ſens commun.", "Adieu, i'iray chez vous tantoſt vous rendre grace."]
normalised_outputs = normalisation_pipeline(list_sents)
print(normalised_outputs)

>> ["1. QUe cette propôtion, qu'un espace est vidé, répugne au sens commun.", "Adieu, j'irai chez vous tantôt vous rendre grâce."]
```

### Limitations and bias

The model has been trained in a supervised fashion and, like any such model, is likely to perform well on texts similar to those used for training and less well on other texts. Whilst care was taken to include a range of domains from different periods of the 17th c. in the training data, there are nevertheless imbalances, notably with some decades (e.g. the 1610s) being underrepresented.

The model achieves high performance, but could in rare cases alter the text beyond spelling conventions (e.g. changing, deleting or hallucinating words). To avoid these problems, the pipeline file includes a post-processing step that looks the normalised words up in a contemporary French lexicon (the [Le*fff*](http://almanach.inria.fr/software_and_resources/custom/Alexina-en.html)) and checks that they do not stray too far from the original source words.
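
The gist of this check, as a hypothetical sketch (the actual implementation in the pipeline file differs; the lexicon membership test, distance threshold and fallback behaviour here are illustrative assumptions):

```
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def check_normalisation(source_word, normalised_word, lexicon, max_dist=2):
    """Keep the model's output only if it is a known contemporary word and
    has not strayed too far from the source; otherwise keep the source word."""
    if normalised_word in lexicon and edit_distance(source_word, normalised_word) <= max_dist:
        return normalised_word
    return source_word
```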

## Training data

The model is trained on the parallel [FreEM_norm corpus](https://freem-corpora.github.io/corpora/norm/), consisting of 17,930 training sentences and 2,443 development sentences (used for model selection).

## Training procedure

### Preprocessing

Texts are normalised (in terms of apostrophes, quotes and spaces) before being tokenised with SentencePiece and a vocabulary size of 1000. The inputs are of the form:

```
Sentence in Early Modern French </s>
```

where `</s>` is the end-of-sentence (eos) token.
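
For reference, a SentencePiece BPE model with these settings could be trained along these lines (a sketch under stated assumptions; the file names are placeholders, not the project's actual ones):

```
import sentencepiece as spm

# Train a BPE tokeniser with a 1000-token vocabulary (illustrative file names)
spm.SentencePieceTrainer.train(
    input="train.modfr.txt",
    model_prefix="modfr_bpe",
    vocab_size=1000,
    model_type="bpe",
)
```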

### Training

The model was trained using [Fairseq](https://github.com/facebookresearch/fairseq) and ported to HuggingFace using an adapted version of [Stas's scripts for FSMT models](https://huggingface.co/blog/porting-fsmt).

### Evaluation results

Coming soon... (once the post-processing extension has been finalised)

## BibTeX entry and citation info

Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. Automatic Normalisation of Early Modern French. 2022. Preprint.

BibTeX:

```
@misc{bawden:hal-03540226,
  title = {{Automatic Normalisation of Early Modern French}},
  author = {Bawden, Rachel and Poinhos, Jonathan and Kogkitsidou, Eleni and Gambette, Philippe and Sagot, Beno{\^i}t and Gabay, Simon},
  url = {https://hal.inria.fr/hal-03540226},
  note = {working paper or preprint},
  year = {2022},
  HAL_ID = {hal-03540226},
  HAL_VERSION = {v1},
}
```

And to reference the FreEM-norm dataset used in the experiments:

Simon Gabay. (2022). FreEM-corpora/FreEMnorm: FreEM norm Parallel corpus (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.5865428

```
@software{simon_gabay_2022_5865428,
  author    = {Simon Gabay},
  title     = {{FreEM-corpora/FreEMnorm: FreEM norm Parallel corpus}},
  month     = jan,
  year      = 2022,
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.5865428},
  url       = {https://doi.org/10.5281/zenodo.5865428}
}
```