metadata
language: fr
tags:
- NER
- camembert
- literary-texts
- nested-entities
- BookBLP-fr
license: apache-2.0
metrics:
- f1
- precision
- recall
base_model:
- almanach/camembertV2-base
pipeline_tag: token-classification
INTRODUCTION:
This model, developed as part of the BookNLP-fr project, is a NER model built on top of camembertV2-base embeddings and trained to predict nested entities in French literary texts.
The predicted entities are:
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- facilities (FAC): château, sentier, chambre, couloir, ...
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
- locations (LOC): le sud, Mars, l'océan, le bois, ...
- vehicles (VEH): avion, voitures, calèche, vélos, ...
MODEL PERFORMANCE (LOOCV):
NER_tag | precision | recall | f1_score | support | support % |
---|---|---|---|---|---|
PER | 90.17% | 95.76% | 92.88% | 4,061 | 85.80% |
FAC | 79.19% | 78.12% | 78.65% | 224 | 4.73% |
TIME | 63.18% | 70.56% | 66.67% | 214 | 4.52% |
LOC | 62.50% | 54.55% | 58.25% | 110 | 2.32% |
GPE | 74.58% | 68.75% | 71.54% | 64 | 1.35% |
VEH | 69.12% | 78.33% | 73.44% | 60 | 1.27% |
micro_avg | 87.31% | 92.25% | 89.68% | 4,733 | 100.00% |
macro_avg | 73.12% | 74.35% | 73.57% | 4,733 | 100.00% |
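As a sanity check on the table above, the per-class F1 scores and the macro averages can be recomputed from the reported precision and recall. This is a minimal sketch; the variable and function names are illustrative, and small differences in the last digit are expected because the inputs are already rounded percentages.

```python
# Precision and recall copied from the table above (percentages).
scores = {
    "PER": (90.17, 95.76),
    "FAC": (79.19, 78.12),
    "TIME": (63.18, 70.56),
    "LOC": (62.50, 54.55),
    "GPE": (74.58, 68.75),
    "VEH": (69.12, 78.33),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {tag: f1(p, r) for tag, (p, r) in scores.items()}

# Macro averages: unweighted means over the six entity types.
macro_precision = sum(p for p, _ in scores.values()) / len(scores)
macro_recall = sum(r for _, r in scores.values()) / len(scores)
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for tag, value in per_class_f1.items():
    print(f"{tag}: {value:.2f}")
print(f"macro: P={macro_precision:.2f} R={macro_recall:.2f} F1={macro_f1:.2f}")
```

The reported macro F1 appears to be the mean of the per-class F1 scores rather than the F1 of the macro precision and macro recall, which would give a slightly higher value.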
TRAINING PARAMETERS:
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
- Tagging scheme: BIOES
- Nested entity levels: [0, 1]
- Split strategy: Leave-one-out cross-validation (28 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16
- Initial learning rate: 0.00014
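To illustrate the BIOES tagging scheme listed above, the sketch below converts entity spans into one tag per token for a single nesting level. The helper name `spans_to_bioes`, the `B-PER` style label strings, and the example sentence are assumptions made for illustration; they are not taken from the BookNLP-fr code.

```python
def spans_to_bioes(num_tokens, spans):
    """Convert non-overlapping (start, end, type) spans to BIOES tags.

    `end` is inclusive. Tokens outside every span are tagged "O".
    """
    tags = ["O"] * num_tokens
    for start, end, entity_type in spans:
        if start == end:
            # Single-token entity.
            tags[start] = f"S-{entity_type}"
        else:
            tags[start] = f"B-{entity_type}"      # Beginning
            for i in range(start + 1, end):
                tags[i] = f"I-{entity_type}"      # Inside
            tags[end] = f"E-{entity_type}"        # End
    return tags


# Illustrative sentence built from the example mentions listed in this card.
tokens = ["Le", "capitaine", "quitta", "Montrouge", "ce", "matin"]
spans = [(0, 1, "PER"), (3, 3, "GPE"), (4, 5, "TIME")]
tags = spans_to_bioes(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With nested entities, each nesting level would need its own tag sequence of this kind; how the two levels are combined in this model is not described here.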
MODEL ARCHITECTURE:
Model Input: Maximum context camembertV2-base embeddings (768 dimensions)
Locked Dropout: 0.5
Projection layer:
- layer type: highway layer
- input: 768 dimensions
- output: 2048 dimensions
BiLSTM layer:
- input: 2048 dimensions
- output: 256 dimensions (hidden state)
Linear layer:
- input: 256 dimensions
- output: 25 dimensions (predicted labels with BIOES tagging scheme)
CRF layer
Model Output: BIOES labels sequence
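The 25 output dimensions of the linear layer are consistent with a BIOES scheme over the six entity types: four positional prefixes per type plus a single O label. A minimal sketch of that count follows; the label strings and their ordering are assumptions and may differ from the model's actual label index.

```python
# Entity types as listed in the training parameters.
entity_types = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]
# BIOES positional prefixes (O is handled separately).
prefixes = ["B", "I", "E", "S"]

labels = ["O"] + [f"{prefix}-{etype}" for etype in entity_types for prefix in prefixes]
num_labels = len(labels)  # 1 + 4 * 6

print(num_labels)
print(labels)
```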
HOW TO USE:
*** UNDER CONSTRUCTION ***
TRAINING CORPUS:
ID | Document | Tokens Count | Is included in model eval |
---|---|---|---|
0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | False |
1 | 1840_Sand-George_Pauline | 12,315 tokens | False |
2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | False |
3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | False |
4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | False |
5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | False |
6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | False |
7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | False |
8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | False |
9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | False |
13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | False |
14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | False |
15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | False |
16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | False |
17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | False |
18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | False |
19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | False |
20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | False |
21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | False |
22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | False |
25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | False |
26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | False |
28 | TOTAL | 275,360 tokens | 5 files used for cross-validation |
PREDICTIONS CONFUSION MATRIX:
Gold Labels | PER | FAC | TIME | LOC | GPE | VEH | O | support |
---|---|---|---|---|---|---|---|---|
PER | 3,889 | 3 | 2 | 2 | 1 | 1 | 163 | 4,061 |
FAC | 6 | 175 | 0 | 2 | 0 | 1 | 40 | 224 |
TIME | 0 | 0 | 151 | 0 | 0 | 0 | 63 | 214 |
LOC | 1 | 0 | 0 | 60 | 9 | 0 | 40 | 110 |
GPE | 2 | 0 | 0 | 8 | 44 | 0 | 10 | 64 |
VEH | 1 | 0 | 0 | 0 | 0 | 47 | 12 | 60 |
O | 411 | 43 | 85 | 24 | 5 | 19 | 0 | 587 |
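The per-class recall values and the micro-averaged recall reported in the performance table can be recovered from the rows of this confusion matrix. The sketch below is illustrative only: the variable names are assumptions, and it covers recall only.

```python
labels = ["PER", "FAC", "TIME", "LOC", "GPE", "VEH", "O"]

# Rows are gold labels, columns are predicted labels, both in the order of `labels`.
matrix = [
    [3889, 3, 2, 2, 1, 1, 163],
    [6, 175, 0, 2, 0, 1, 40],
    [0, 0, 151, 0, 0, 0, 63],
    [1, 0, 0, 60, 9, 0, 40],
    [2, 0, 0, 8, 44, 0, 10],
    [1, 0, 0, 0, 0, 47, 12],
    [411, 43, 85, 24, 5, 19, 0],
]

entity_rows = range(len(labels) - 1)  # skip the gold "O" row

# Recall per entity type: correct predictions divided by gold support.
recall = {}
for i in entity_rows:
    support = sum(matrix[i])
    recall[labels[i]] = 100 * matrix[i][i] / support

# Micro recall: all correct entity predictions over all gold entities.
total_support = sum(sum(matrix[i]) for i in entity_rows)
total_correct = sum(matrix[i][i] for i in entity_rows)
micro_recall = 100 * total_correct / total_support

for tag, value in recall.items():
    print(f"{tag}: {value:.2f}%")
print(f"micro recall: {micro_recall:.2f}%")
```

The O column shows where recall is lost: most missed entities are not confused with another type but are left untagged.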
CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com