---
language: fr
license: cc-by-4.0
---

# Modern French normalisation model 

Normalisation model from Early Modern (17th c.) French to contemporary French. It was introduced in the paper [Automatic Normalisation of Early Modern French](https://hal.inria.fr/hal-03540226/), and the main research repository is [rbawden/ModFr-Norm](https://github.com/rbawden/ModFr-Norm). If you use this model, please cite the paper (see [below](#cite)).

## Model description

The normalisation model is trained on the [FreEM_norm corpus](https://freem-corpora.github.io/corpora/norm/), a parallel corpus of 17th-century French texts and their manually normalised versions, which follow contemporary French spelling. The model is a transformer with 2 encoder layers, 4 decoder layers, an embedding dimension of 256 and a feedforward dimension of 1024. The associated tokeniser is trained with SentencePiece using the BPE strategy and a vocabulary of 1000 tokens.
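
To give an intuition for what a BPE vocabulary of this kind contains, here is a toy, word-level BPE learner in pure Python. It is only a sketch of the general BPE idea and not SentencePiece itself (which works on raw text and treats whitespace as a symbol); the function names `learn_bpe` and `apply_bpe` are hypothetical and not part of this repository.

```python
from collections import Counter
from typing import List, Tuple


def _merge(symbols: Tuple[str, ...], pair: Tuple[str, str]) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merges by repeatedly joining the most frequent pair."""
    vocab = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Ties are broken by the pair itself so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment a word by replaying the learnt merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = _merge(symbols, pair)
    return list(symbols)


merges = learn_bpe(["vous", "vous", "nous", "tous"], num_merges=2)
print(apply_bpe("sous", merges))
```

With a small vocabulary such as the 1000 tokens used here, most words are split into several subword units, which suits a task where the changes are mostly character-level respellings.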

### Intended uses & limitations

The model is designed to normalise 17th c. French texts. It performs best on texts from genres similar to those of its 17th-century training data.

### How to use

The model is to be used with the custom pipeline available in the original repository [here](https://github.com/rbawden/ModFr-Norm/blob/main/hf-conversion/pipeline.py) and in this repository [here](https://huggingface.co/rbawden/modern_french_normalisation/blob/main/pipeline.py). You first need to download the pipeline file so that you can use it locally (since it is not integrated into HuggingFace).
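
One possible way to fetch the pipeline file is sketched below using only the Python standard library. It assumes the usual `resolve/<revision>/<filename>` raw-download URL of HuggingFace repositories; the helper names `raw_file_url` and `download_pipeline` are hypothetical and not part of this repository.

```python
from pathlib import Path
from urllib.request import urlretrieve

REPO_ID = "rbawden/modern_french_normalisation"


def raw_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the raw-download URL of a file in a HuggingFace model repository."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_pipeline(target_dir: str = ".") -> Path:
    """Save pipeline.py into `target_dir` so that `from pipeline import ...` works."""
    destination = Path(target_dir) / "pipeline.py"
    urlretrieve(raw_file_url(REPO_ID, "pipeline.py"), str(destination))
    return destination


# download_pipeline()  # uncomment to fetch the file (needs network access)
```

Once `pipeline.py` sits next to your script, the model can be loaded and applied as follows: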

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from pipeline import NormalisationPipeline # N.B. local file

cache_lexicon_path="~/.normalisation_lex.pickle" # optionally set a path to store the processed lexicon (speeds up loading)
tokeniser = AutoTokenizer.from_pretrained("rbawden/modern_french_normalisation")
model = AutoModelForSeq2SeqLM.from_pretrained("rbawden/modern_french_normalisation")
norm_pipeline = NormalisationPipeline(model=model, tokenizer=tokeniser, batch_size=32, beam_size=5, cache_file=cache_lexicon_path)
                                              
list_inputs = ["Elle haïſſoit particulierement le Cardinal de Lorraine;", "Adieu, i'iray chez vous tantoſt vous rendre grace."]
list_outputs = norm_pipeline(list_inputs)
print(list_outputs)

# Output:
# [{'text': 'Elle haïssait particulièrement le Cardinal de Lorraine; ', 'alignment': [([0, 3], [0, 3]), ([5, 12], [5, 12]), ([14, 29], [14, 29]), ([31, 32], [31, 32]), ([34, 41], [34, 41]), ([43, 44], [43, 44]), ([46, 53], [46, 53]), ([54, 54], [54, 54])]}, {'text': "Adieu, j'irai chez vous tantôt vous rendre grâce. ", 'alignment': [([0, 4], [0, 4]), ([5, 5], [5, 5]), ([7, 8], [7, 8]), ([9, 12], [9, 12]), ([14, 17], [14, 17]), ([19, 22], [19, 22]), ([24, 30], [24, 29]), ([32, 35], [31, 34]), ([37, 42], [36, 41]), ([44, 48], [43, 47]), ([49, 49], [48, 48])]}]
```
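
The `alignment` field can be used to recover which source tokens were changed. The sketch below assumes, based on the example output above, that each entry is a pair `([src_start, src_end], [tgt_start, tgt_end])` of inclusive character offsets into the input sentence and the normalised text respectively; the helper names are hypothetical.

```python
from typing import List, Tuple


def aligned_pairs(source: str, output: dict) -> List[Tuple[str, str]]:
    """Return (source_token, normalised_token) pairs for one pipeline output."""
    target = output["text"]
    pairs = []
    for (s_start, s_end), (t_start, t_end) in output["alignment"]:
        # Offsets are assumed inclusive, hence the +1 on the slice end.
        pairs.append((source[s_start:s_end + 1], target[t_start:t_end + 1]))
    return pairs


def changed_tokens(source: str, output: dict) -> List[Tuple[str, str]]:
    """Keep only the pairs where normalisation modified the token."""
    return [(src, tgt) for src, tgt in aligned_pairs(source, output) if src != tgt]


source = "Adieu, i'iray chez vous tantoſt vous rendre grace."
output = {
    "text": "Adieu, j'irai chez vous tantôt vous rendre grâce. ",
    "alignment": [([0, 4], [0, 4]), ([5, 5], [5, 5]), ([7, 8], [7, 8]),
                  ([9, 12], [9, 12]), ([14, 17], [14, 17]), ([19, 22], [19, 22]),
                  ([24, 30], [24, 29]), ([32, 35], [31, 34]), ([37, 42], [36, 41]),
                  ([44, 48], [43, 47]), ([49, 49], [48, 48])],
}
print(changed_tokens(source, output))
```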

### Limitations and bias

The model was trained in a supervised fashion and, like any such model, is therefore likely to perform well on texts similar to those used for training and less well on other texts. Whilst care was taken to include a range of domains from different periods of the 17th c. in the training data, there are nevertheless imbalances, notably with some decades (e.g. the 1610s) being underrepresented.

The model performs well, but may in rare cases change the text in ways that go beyond spelling conventions (e.g. changing, deleting or hallucinating words). To mitigate these problems, the pipeline file includes a post-processing step, which looks words up in a contemporary French lexicon ([the Le*fff*](http://almanach.inria.fr/software_and_resources/custom/Alexina-en.html)) and checks that the normalised words do not stray too far from the original source words.
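
The idea behind such a check can be illustrated as follows. This is a minimal sketch and not the code in `pipeline.py`: the similarity measure (`difflib.SequenceMatcher`), the threshold, the fallback rules and the tiny stand-in lexicon are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Set


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two tokens."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def postprocess_token(source_tok: str, predicted_tok: str,
                      lexicon: Set[str], min_similarity: float = 0.5) -> str:
    """Keep the prediction only if it is a plausible respelling of the source token."""
    if predicted_tok == source_tok:
        return predicted_tok
    if similarity(source_tok, predicted_tok) < min_similarity:
        return source_tok  # prediction strayed too far: fall back to the source
    if predicted_tok.lower() in lexicon:
        return predicted_tok  # known contemporary word, close to the source
    if source_tok.lower() in lexicon:
        return source_tok  # source was already a valid word, prediction is unknown
    return predicted_tok


# Tiny stand-in lexicon (hypothetical); a real lexicon would be far larger.
lexicon = {"tantôt", "grâce", "vous", "irai"}
print(postprocess_token("tantoſt", "tantôt", lexicon))
print(postprocess_token("grace", "maison", lexicon))
```

The first call keeps the respelling, whereas the second falls back to the source token because the predicted word is too different from it.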

## Training data

The model is trained on the parallel [FreEM_norm corpus](https://freem-corpora.github.io/corpora/norm/), consisting of 17,930 training sentences and 2,443 development sentences (used for model selection).

## Training procedure

### Preprocessing

Texts are normalised (in terms of apostrophes, quotes and spaces) before being tokenised with SentencePiece, using a vocabulary size of 1000. The inputs are of the form:

```
Sentence in Early Modern French </s>
```
where `</s>` is the end-of-sentence (eos) token.
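
A rough sketch of this kind of character normalisation is given below. The exact character mappings used for the model are not listed here, so the sets of apostrophes and quotes are assumptions, and `normalise_text` and `format_input` are hypothetical names.

```python
import re

# Assumed mappings: typographic apostrophes and curly double quotes to ASCII.
_CHAR_MAP = str.maketrans({
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalise_text(text: str) -> str:
    """Unify apostrophes and quotes, then collapse all whitespace runs."""
    text = text.translate(_CHAR_MAP)
    text = re.sub(r"\s+", " ", text)  # also covers non-breaking spaces
    return text.strip()


def format_input(text: str, eos: str = "</s>") -> str:
    """Show the input form described above: normalised sentence followed by eos."""
    return f"{normalise_text(text)} {eos}"


print(format_input("Adieu,  i\u2019iray\u00a0chez vous "))
```

In practice the tokeniser or pipeline may append the eos token itself; `format_input` only illustrates the input form shown above.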

### Training

The model was trained using [Fairseq](https://github.com/facebookresearch/fairseq) and ported to HuggingFace using an adapted version of [Stas's scripts for FSMT models](https://huggingface.co/blog/porting-fsmt).

### Evaluation results

Coming soon... (once post-processing extension has been finalised)

## BibTeX entry and citation info
<a name="cite"></a>

Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. [Automatic Normalisation of Early Modern French](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.358.pdf). In Proceedings of the 13th Language Resources and Evaluation Conference. European Language Resources Association. Marseille, France.

BibTeX:
```
@inproceedings{bawden-etal-2022-automatic,
  title = {{Automatic Normalisation of Early Modern French}},
  author = {Bawden, Rachel and Poinhos, Jonathan and Kogkitsidou, Eleni and Gambette, Philippe and Sagot, Beno{\^i}t and Gabay, Simon},
  url = {https://hal.inria.fr/hal-03540226},
  booktitle = {Proceedings of the 13th Language Resources and Evaluation Conference},
  publisher = {European Language Resources Association},
  year = {2022},
  address = {Marseille, France},
  pages = {3354--3366},
  url = {http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.358.pdf}
}
```

And to reference the FreEM-norm dataset used in the experiments:

Simon Gabay. (2022). FreEM-corpora/FreEMnorm: FreEM norm Parallel corpus (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.5865428
```
@software{simon_gabay_2022_5865428,
  author       = {Simon Gabay},
  title        = {{FreEM-corpora/FreEMnorm: FreEM norm Parallel 
                   corpus}},
  month        = jan,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5865428},
  url          = {https://doi.org/10.5281/zenodo.5865428}
}
```