---
language: fr
tags:
- NER
- camembert
- literary-texts
- nested-entities
- BookBLP-fr
license: apache-2.0
metrics:
- f1
- precision
- recall
base_model:
- almanach/camembertV2-base
pipeline_tag: token-classification
---

## INTRODUCTION:

This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on top of [camembertV2-base](https://huggingface.co/almanach/camembertV2-base) embeddings and trained to predict nested entities in French, specifically in literary texts.

The predicted entities are:

- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- facilities (FAC): château, sentier, chambre, couloir, ...
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
- locations (LOC): le sud, Mars, l'océan, le bois, ...
- vehicles (VEH): avion, voitures, calèche, vélos, ...
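
To make the label space concrete, the sketch below combines the six entity types above with the BIOES tagging scheme listed under the training parameters, which gives the 25 output labels mentioned in the architecture section (4 prefixes x 6 types, plus `O`). The label string format (`B-PER`, ...), the helper names `bioes_labels` and `spans_to_bioes`, and the example phrase are illustrative assumptions, not the model's actual label vocabulary or training data. How the model maps nesting levels 0 and 1 onto the outer and inner layers is not stated in this card.

```python
# Entity types as listed in this model card.
ENTITY_TYPES = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]


def bioes_labels(entity_types):
    """Build a BIOES label set: B-, I-, E-, S- for each type, plus "O".

    The "PREFIX-TYPE" string format is an assumption for illustration.
    """
    labels = ["O"]
    for entity_type in entity_types:
        for prefix in ("B", "I", "E", "S"):
            labels.append(f"{prefix}-{entity_type}")
    return labels


def spans_to_bioes(n_tokens, spans):
    """Encode non-overlapping (start, end_exclusive, type) spans as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end, entity_type in spans:
        if end - start == 1:
            # Single-token entity.
            tags[start] = f"S-{entity_type}"
        else:
            tags[start] = f"B-{entity_type}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{entity_type}"
            tags[end - 1] = f"E-{entity_type}"
    return tags


LABELS = bioes_labels(ENTITY_TYPES)
print(len(LABELS))

# Hypothetical nested mention: a character (PER) containing a vehicle (VEH).
# Nested entities are represented here as one tag sequence per nesting layer.
tokens = ["le", "capitaine", "de", "la", "calèche"]
outer = spans_to_bioes(len(tokens), [(0, 5, "PER")])
inner = spans_to_bioes(len(tokens), [(3, 5, "VEH")])
for token, outer_tag, inner_tag in zip(tokens, outer, inner):
    print(f"{token}\t{outer_tag}\t{inner_tag}")
```

Keeping one flat BIOES sequence per layer is one common way to handle nesting with a sequence tagger, since a single sequence cannot assign two entity labels to the same token.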

## MODEL PERFORMANCES (LOOCV):

| NER_tag   | precision | recall | f1_score | support | support % |
|-----------|-----------|--------|----------|---------|-----------|
| PER       | 90.10%    | 93.38% | 91.71%   | 31,570  | 83.87%    |
| FAC       | 70.14%    | 70.97% | 70.55%   | 2,294   | 6.09%     |
| TIME      | 58.04%    | 58.98% | 58.51%   | 1,670   | 4.44%     |
| GPE       | 75.85%    | 76.81% | 76.33%   | 871     | 2.31%     |
| LOC       | 61.22%    | 46.57% | 52.90%   | 773     | 2.05%     |
| VEH       | 66.37%    | 48.82% | 56.26%   | 465     | 1.24%     |
| micro_avg | 86.25%    | 88.60% | 87.36%   | 37,643  | 100.00%   |
| macro_avg | 70.29%    | 65.92% | 67.71%   | 37,643  | 100.00%   |

## TRAINING PARAMETERS:

- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
- Tagging scheme: BIOES
- Nested entity levels: [0, 1]
- Split strategy: Leave-one-out cross-validation (28 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16
- Initial learning rate: 0.00014

## MODEL ARCHITECTURE:

Model Input: Maximum context camembertV2-base embeddings (768 dimensions)

- Locked Dropout: 0.5
- Projection layer:
  - layer type: highway layer
  - input: 768 dimensions
  - output: 2048 dimensions
- BiLSTM layer:
  - input: 2048 dimensions
  - output: 256 dimensions (hidden state)
- Linear layer:
  - input: 256 dimensions
  - output: 25 dimensions (predicted labels with BIOES tagging scheme)
- CRF layer

Model Output: BIOES label sequence

## HOW TO USE:

*** UNDER CONSTRUCTION ***

## TRAINING CORPUS:

|    | Document                                                       | Tokens Count   | Is included in model eval          |
|----|----------------------------------------------------------------|----------------|------------------------------------|
| 0  | 1836_Gautier-Theophile_La-morte-amoureuse                      | 14,299 tokens  | True                               |
| 1  | 1840_Sand-George_Pauline                                       | 12,315 tokens  | True                               |
| 2  | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote             | 24,776 tokens  | True                               |
| 3  | 1844_Balzac-Honore-de_La-Maison-Nucingen                       | 30,987 tokens  | True                               |
| 4  | 1844_Balzac-Honore-de_Sarrasine                                | 15,408 tokens  | True                               |
| 5  | 1856_Cousin-Victor_Madame-de-Hautefort                         | 11,768 tokens  | True                               |
| 6  | 1863_Gautier-Theophile_Le-capitaine-Fracasse                   | 11,834 tokens  | True                               |
| 7  | 1873_Zola-Emile_Le-ventre-de-Paris                             | 12,557 tokens  | True                               |
| 8  | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet                      | 12,281 tokens  | True                               |
| 9  | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens   | True                               |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE   | 2,554 tokens   | True                               |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE        | 2,929 tokens   | True                               |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA           | 4,067 tokens   | True                               |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE          | 2,251 tokens   | True                               |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE        | 2,034 tokens   | True                               |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU               | 1,864 tokens   | True                               |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL            | 2,141 tokens   | True                               |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE          | 2,441 tokens   | True                               |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL          | 2,860 tokens   | True                               |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON      | 2,343 tokens   | True                               |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis   | 12,703 tokens  | True                               |
| 21 | 1903_Conan-Laure_Elisabeth_Seton                               | 13,023 tokens  | True                               |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube              | 10,982 tokens  | True                               |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin           | 10,305 tokens  | True                               |
| 24 | 1917_Adèle-Bourgeois_Némoville                                 | 12,389 tokens  | True                               |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps                       | 14,637 tokens  | True                               |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin                   | 11,902 tokens  | True                               |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere                           | 12,285 tokens  | True                               |
| 28 | TOTAL                                                          | 275,360 tokens | 28 files used for cross-validation |

## PREDICTIONS CONFUSION MATRIX:

| Gold Labels | PER    | FAC   | TIME | GPE | LOC | VEH | O     | support |
|-------------|--------|-------|------|-----|-----|-----|-------|---------|
| PER         | 29,481 | 31    | 14   | 12  | 11  | 21  | 2,000 | 31,570  |
| FAC         | 53     | 1,628 | 1    | 31  | 19  | 3   | 559   | 2,294   |
| TIME        | 5      | 1     | 985  | 0   | 1   | 0   | 678   | 1,670   |
| GPE         | 19     | 29    | 0    | 669 | 27  | 1   | 126   | 871     |
| LOC         | 3      | 71    | 0    | 59  | 360 | 0   | 280   | 773     |
| VEH         | 61     | 5     | 0    | 1   | 0   | 227 | 171   | 465     |
| O           | 3,053  | 536   | 696  | 106 | 163 | 90  | 0     | 4,644   |

## CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com