---
language: fr
tags:
- NER
- camembert
- literary-texts
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- f1
- precision
- recall
base_model:
- almanach/camembertV2-base
pipeline_tag: token-classification
---

## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on [camembertV2-base](https://huggingface.co/almanach/camembertV2-base) embeddings and trained to predict nested entities in French literary texts.

The predicted entities are:
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive determiners (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- facilities (FAC): château, sentier, chambre, couloir, ...
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
- locations (LOC): le sud, Mars, l'océan, le bois, ...
- vehicles (VEH): avion, voitures, calèche, vélos, ...

## MODEL PERFORMANCE (LOOCV):
| NER_tag   | precision   | recall   | f1_score   | support   | support %   |
|-----------|-------------|----------|------------|-----------|-------------|
| PER       | 90.10%      | 93.38%   | 91.71%     | 31,570    | 83.87%      |
| FAC       | 70.14%      | 70.97%   | 70.55%     | 2,294     | 6.09%       |
| TIME      | 58.04%      | 58.98%   | 58.51%     | 1,670     | 4.44%       |
| GPE       | 75.85%      | 76.81%   | 76.33%     | 871       | 2.31%       |
| LOC       | 61.22%      | 46.57%   | 52.90%     | 773       | 2.05%       |
| VEH       | 66.37%      | 48.82%   | 56.26%     | 465       | 1.24%       |
| micro_avg | 86.25%      | 88.60%   | 87.36%     | 37,643    | 100.00%     |
| macro_avg | 70.29%      | 65.92%   | 67.71%     | 37,643    | 100.00%     |
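The F1 scores and macro averages in the table follow from the per-class precision and recall. A minimal sketch, assuming the standard harmonic-mean F1 and an unweighted mean over the six classes (the variable and function names are illustrative, not taken from the training code):

```python
# Per-class (precision, recall) in %, copied from the table above.
scores = {
    "PER": (90.10, 93.38),
    "FAC": (70.14, 70.97),
    "TIME": (58.04, 58.98),
    "GPE": (75.85, 76.81),
    "LOC": (61.22, 46.57),
    "VEH": (66.37, 48.82),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 per entity type, computed from the rounded table values.
per_class_f1 = {tag: f1(p, r) for tag, (p, r) in scores.items()}

# Macro averages: every class counts equally, whatever its support.
n_classes = len(scores)
macro_precision = sum(p for p, _ in scores.values()) / n_classes
macro_recall = sum(r for _, r in scores.values()) / n_classes
macro_f1 = sum(per_class_f1.values()) / n_classes

for tag, value in per_class_f1.items():
    print(f"{tag}: {value:.2f}")
print(f"macro: P={macro_precision:.2f} R={macro_recall:.2f} F1={macro_f1:.2f}")
```

Because PER accounts for 83.87% of the support, the micro average stays close to the PER scores, while the macro average is pulled down by the rarer LOC, VEH and TIME classes.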

## TRAINING PARAMETERS:
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
- Tagging scheme: BIOES
- Nested entity levels: [0, 1]
- Split strategy: Leave-one-out cross-validation (28 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16
- Initial learning rate: 0.00014
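With six entity types and the BIOES scheme, the label inventory has 6 × 4 + 1 = 25 entries, which matches the 25-dimensional output of the linear layer. A minimal sketch of how such an inventory could be built; the `B-PER`-style label strings and the label order are assumptions, not taken from the model's configuration:

```python
# Entity types as listed in the training parameters.
ENTITY_TYPES = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]

# BIOES: Begin, Inside, End, Single-token entity, plus "O" for outside.
PREFIXES = ["B", "I", "E", "S"]

# Assumed label format "<prefix>-<type>"; the real model may order or name them differently.
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in PREFIXES]
label2id = {label: index for index, label in enumerate(labels)}
id2label = {index: label for label, index in label2id.items()}

print(len(labels))  # 25
```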

## MODEL ARCHITECTURE:
Model Input: Maximum context camembertV2-base embeddings (768 dimensions)

- Locked Dropout: 0.5

- Projection layer:
  - layer type: highway layer
  - input: 768 dimensions
  - output: 2048 dimensions

- BiLSTM layer:
  - input: 2048 dimensions
  - output: 256 dimensions (hidden state)

- Linear layer:
  - input: 256 dimensions
  - output: 25 dimensions (predicted labels with BIOES tagging scheme)

- CRF layer

Model Output: BIOES labels sequence
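
A BIOES label sequence can be turned into entity spans with a single pass over the tags. The function below is a hypothetical helper, not part of the released code. It assumes `B-PER`-style label strings and decodes one flat layer of tags, so each nested level would need to be decoded separately:

```python
def bioes_to_spans(tags):
    """Convert BIOES tags into (entity_type, start, end) spans, end inclusive.

    Malformed sequences (e.g. an "E-" tag without a matching "B-") are skipped.
    """
    spans = []
    start, current = None, None  # start index and type of the open entity
    for i, tag in enumerate(tags):
        if tag == "O":
            start, current = None, None
            continue
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((etype, i, i))
            start, current = None, None
        elif prefix == "B":
            # Open a new multi-token entity.
            start, current = i, etype
        elif prefix == "I":
            # Continue only if the type matches the open entity.
            if current != etype:
                start, current = None, None
        elif prefix == "E":
            # Close the open entity if the type matches.
            if current == etype and start is not None:
                spans.append((etype, start, i))
            start, current = None, None
    return spans


tokens = ["le", "capitaine", "entra", "dans", "la", "chambre"]
tags = ["B-PER", "E-PER", "O", "O", "B-FAC", "E-FAC"]
for etype, start, end in bioes_to_spans(tags):
    print(etype, " ".join(tokens[start:end + 1]))
```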

## HOW TO USE:
*** UNDER CONSTRUCTION ***

## TRAINING CORPUS:
|    | Document                                                       | Tokens Count   | Is included in model eval          |
|----|----------------------------------------------------------------|----------------|------------------------------------|
|  0 | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,299 tokens  | True                               |
|  1 | 1840_Sand-George_Pauline                                       | 12,315 tokens  | True                               |
|  2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | True                               |
|  3 | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,987 tokens  | True                               |
|  4 | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | True                               |
|  5 | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | True                               |
|  6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | True                               |
|  7 | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | True                               |
|  8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | True                               |
|  9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | True                               |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | True                               |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | True                               |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | True                               |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | True                               |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | True                               |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,864 tokens   | True                               |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,141 tokens   | True                               |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,441 tokens   | True                               |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,860 tokens   | True                               |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | True                               |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | True                               |
| 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | True                               |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | True                               |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | True                               |
| 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | True                               |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | True                               |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | True                               |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | True                               |
| 28 | TOTAL                                                          | 275,360 tokens | 28 files used for cross-validation |

## PREDICTIONS CONFUSION MATRIX:
| Gold Labels   | PER    | FAC   |   TIME |   GPE |   LOC |   VEH | O     | support   |
|---------------|--------|-------|--------|-------|-------|-------|-------|-----------|
| PER           | 29,481 | 31    |     14 |    12 |    11 |    21 | 2,000 | 31,570    |
| FAC           | 53     | 1,628 |      1 |    31 |    19 |     3 | 559   | 2,294     |
| TIME          | 5      | 1     |    985 |     0 |     1 |     0 | 678   | 1,670     |
| GPE           | 19     | 29    |      0 |   669 |    27 |     1 | 126   | 871       |
| LOC           | 3      | 71    |      0 |    59 |   360 |     0 | 280   | 773       |
| VEH           | 61     | 5     |      0 |     1 |     0 |   227 | 171   | 465       |
| O             | 3,053  | 536   |    696 |   106 |   163 |    90 | 0     | 4,644     |
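
Each gold-label row sums to its support, and the diagonal divided by the support reproduces the recall column of the performance table. A short sketch with the counts copied from the matrix above (variable names are illustrative):

```python
# Column order of each row: PER, FAC, TIME, GPE, LOC, VEH, O.
COLUMNS = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH", "O"]

# Gold-label rows copied from the confusion matrix (the O row is left out).
confusion = {
    "PER": [29481, 31, 14, 12, 11, 21, 2000],
    "FAC": [53, 1628, 1, 31, 19, 3, 559],
    "TIME": [5, 1, 985, 0, 1, 0, 678],
    "GPE": [19, 29, 0, 669, 27, 1, 126],
    "LOC": [3, 71, 0, 59, 360, 0, 280],
    "VEH": [61, 5, 0, 1, 0, 227, 171],
}

# Support = number of gold mentions of each type (row sum).
support = {tag: sum(row) for tag, row in confusion.items()}

# Correct predictions sit on the diagonal.
correct = {tag: row[COLUMNS.index(tag)] for tag, row in confusion.items()}

# Recall per class, in %.
recall = {tag: 100 * correct[tag] / support[tag] for tag in confusion}

# Micro recall pools all gold mentions before dividing.
micro_recall = 100 * sum(correct.values()) / sum(support.values())

for tag in confusion:
    print(f"{tag}: recall={recall[tag]:.2f}% support={support[tag]}")
print(f"micro recall={micro_recall:.2f}%")
```

The `O` column shows where most recall is lost: missed mentions account for 678 of the 1,670 TIME mentions and 280 of the 773 LOC mentions.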

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com