BookNLP-fr_NER_camembertV2-base / JCLS_model_card.md
metadata
language: fr
tags:
  - NER
  - camembert
  - literary-texts
  - nested-entities
  - BookNLP-fr
license: apache-2.0
metrics:
  - f1
  - precision
  - recall
base_model:
  - almanach/camembertV2-base
pipeline_tag: token-classification

INTRODUCTION:

This model, developed as part of the BookNLP-fr project, is a named entity recognition (NER) model built on top of camembertV2-base embeddings. It is trained to predict nested entities in French literary texts.

The predicted entities are:

  • mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
  • facilities (FAC): château, sentier, chambre, couloir, ...
  • time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
  • geo-political entities (GPE): Montrouge, France, le petit hameau, ...
  • locations (LOC): le sud, Mars, l'océan, le bois, ...
  • vehicles (VEH): avion, voitures, calèche, vélos, ...

MODEL PERFORMANCE (LOOCV):

| NER_tag | precision | recall | f1_score | support | support % |
|---|---|---|---|---|---|
| PER | 90.17% | 95.76% | 92.88% | 4,061 | 85.80% |
| FAC | 79.19% | 78.12% | 78.65% | 224 | 4.73% |
| TIME | 63.18% | 70.56% | 66.67% | 214 | 4.52% |
| LOC | 62.50% | 54.55% | 58.25% | 110 | 2.32% |
| GPE | 74.58% | 68.75% | 71.54% | 64 | 1.35% |
| VEH | 69.12% | 78.33% | 73.44% | 60 | 1.27% |
| micro_avg | 87.31% | 92.25% | 89.68% | 4,733 | 100.00% |
| macro_avg | 73.12% | 74.35% | 73.57% | 4,733 | 100.00% |
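
The relation between the per-class rows and the macro_avg row can be checked with a few lines of Python. This is a minimal sketch, not the project's evaluation code: the names `REPORTED`, `f1_score`, and `macro_average` are illustrative. Because it starts from the rounded percentages in the table, a recomputed F1 can differ from the reported one in the last digit.

```python
# Per-class figures copied from the table: (precision %, recall %, reported F1 %).
REPORTED = {
    "PER": (90.17, 95.76, 92.88),
    "FAC": (79.19, 78.12, 78.65),
    "TIME": (63.18, 70.56, 66.67),
    "LOC": (62.50, 54.55, 58.25),
    "GPE": (74.58, 68.75, 71.54),
    "VEH": (69.12, 78.33, 73.44),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_average(values):
    """Unweighted mean over classes, so rare classes weigh as much as PER."""
    values = list(values)
    return sum(values) / len(values)


recomputed_f1 = {tag: f1_score(p, r) for tag, (p, r, _) in REPORTED.items()}
macro_precision = macro_average(p for p, _, _ in REPORTED.values())
macro_recall = macro_average(r for _, r, _ in REPORTED.values())
# The macro F1 in the table is the mean of the per-class F1 scores.
macro_f1 = macro_average(f for _, _, f in REPORTED.values())

for tag, (_, _, reported) in REPORTED.items():
    print(f"{tag}: recomputed F1 {recomputed_f1[tag]:.2f}, reported {reported:.2f}")
print(f"macro: P {macro_precision:.2f}, R {macro_recall:.2f}, F1 {macro_f1:.2f}")
```

The gap between micro_avg and macro_avg follows from the class imbalance: PER accounts for 85.80% of the support, so it dominates the micro average, while the macro average gives LOC and TIME the same weight as PER.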

TRAINING PARAMETERS:

  • Entities types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
  • Tagging scheme: BIOES
  • Nested entities levels: [0, 1]
  • Split strategy: Leave-one-out cross-validation (28 files)
  • Train/Validation split: 0.85 / 0.15
  • Batch size: 16
  • Initial learning rate: 0.00014
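
As an illustration of the BIOES scheme with two nesting levels, the sketch below encodes entity spans as one tag sequence per level. It is a hedged example, not the project's preprocessing code: the helper `spans_to_bioes`, the sample phrase, and the choice of which span sits on level 0 or level 1 are all assumptions made for illustration.

```python
def spans_to_bioes(n_tokens, spans):
    """Encode non-overlapping (start, end_inclusive, label) spans as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"S-{label}"      # single-token entity
        else:
            tags[start] = f"B-{label}"      # beginning
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"      # inside
            tags[end] = f"E-{label}"        # end
    return tags


# Hypothetical phrase: a facility mention that contains a character mention.
tokens = ["la", "chambre", "de", "la", "princesse"]

# Assumption: the outer span is stored on level 0 and the inner span on level 1.
level_0 = spans_to_bioes(len(tokens), [(0, 4, "FAC")])
level_1 = spans_to_bioes(len(tokens), [(3, 4, "PER")])

for token, outer, inner in zip(tokens, level_0, level_1):
    print(f"{token:<10} {outer:<6} {inner}")
```

Each level is a flat sequence on its own, which is why nested mentions need more than one tag layer: a single BIOES sequence cannot mark "la princesse" as PER while the same tokens are also inside a FAC span.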

MODEL ARCHITECTURE:

Model Input: Maximum context camembertV2-base embeddings (768 dimensions)

  • Locked Dropout: 0.5

  • Projection layer:

    • layer type: highway layer
    • input: 768 dimensions
    • output: 2048 dimensions
  • BiLSTM layer:

    • input: 2048 dimensions
    • output: 256 dimensions (hidden state)
  • Linear layer:

    • input: 256 dimensions
    • output: 25 dimensions (predicted labels with BIOES tagging scheme)
  • CRF layer

Model Output: BIOES labels sequence
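
The 25 output dimensions of the linear layer are consistent with the six entity types and the BIOES scheme: four positional prefixes per type, plus the outside label. The short check below enumerates them. The ordering of the labels is an assumption, since the card does not give the model's actual label index.

```python
# Entity types as listed under TRAINING PARAMETERS.
ENTITY_TYPES = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]
PREFIXES = ["B", "I", "E", "S"]  # begin, inside, end, single; "O" has no prefix

# Assumption: label order is illustrative, not the model's real index mapping.
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in PREFIXES]

print(len(labels))
print(labels[:5])
```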

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

| | Document | Tokens count | Included in model eval |
|---|---|---|---|
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | False |
| 1 | 1840_Sand-George_Pauline | 12,315 tokens | False |
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | False |
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | False |
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | False |
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | False |
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | False |
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | False |
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | False |
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | False |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | False |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | False |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | False |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | False |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | False |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | False |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | False |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | False |
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | False |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | False |
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | False |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | False |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | False |
| 28 | TOTAL | 275,360 tokens | 5 files used for cross-validation |
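
One plausible reading of this table, combined with the leave-one-out strategy listed under TRAINING PARAMETERS, is that each of the 5 documents marked True is held out in turn while the model is trained on the other 27. The sketch below enumerates such folds by row index. It is an assumption about the protocol, not the project's actual code, and it leaves out the 0.85 / 0.15 train/validation split because the card does not say at which granularity that split is made.

```python
N_DOCUMENTS = 28                   # rows 0..27 of the corpus table
EVAL_DOCS = [9, 10, 11, 22, 26]    # rows where the eval column is True


def leave_one_out_folds(n_documents, held_out_ids):
    """Yield (held_out, training_ids): one fold per held-out document."""
    for held_out in held_out_ids:
        training_ids = [i for i in range(n_documents) if i != held_out]
        yield held_out, training_ids


folds = list(leave_one_out_folds(N_DOCUMENTS, EVAL_DOCS))
for held_out, training_ids in folds:
    print(f"test on document {held_out}, train on {len(training_ids)} documents")
```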

PREDICTIONS CONFUSION MATRIX:

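| Gold Labels | PER | FAC | TIME | LOC | GPE | VEH | O | support |
|---|---|---|---|---|---|---|---|---|
| PER | 3,889 | 3 | 2 | 2 | 1 | 1 | 163 | 4,061 |
| FAC | 6 | 175 | 0 | 2 | 0 | 1 | 40 | 224 |
| TIME | 0 | 0 | 151 | 0 | 0 | 0 | 63 | 214 |
| LOC | 1 | 0 | 0 | 60 | 9 | 0 | 40 | 110 |
| GPE | 2 | 0 | 0 | 8 | 44 | 0 | 10 | 64 |
| VEH | 1 | 0 | 0 | 0 | 0 | 47 | 12 | 60 |
| O | 411 | 43 | 85 | 24 | 5 | 19 | 0 | 587 |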
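
Each row of the matrix is a gold label and each column a predicted label, so a row sums to that label's support and the diagonal cell divided by the row sum gives its recall. The sketch below reads both off the matrix. The names `LABELS`, `MATRIX`, `support`, and `recall` are illustrative and not taken from the project's code.

```python
LABELS = ["PER", "FAC", "TIME", "LOC", "GPE", "VEH", "O"]

# Rows: gold label. Columns: predicted label, in the order of LABELS.
MATRIX = {
    "PER": [3889, 3, 2, 2, 1, 1, 163],
    "FAC": [6, 175, 0, 2, 0, 1, 40],
    "TIME": [0, 0, 151, 0, 0, 0, 63],
    "LOC": [1, 0, 0, 60, 9, 0, 40],
    "GPE": [2, 0, 0, 8, 44, 0, 10],
    "VEH": [1, 0, 0, 0, 0, 47, 12],
    "O": [411, 43, 85, 24, 5, 19, 0],
}


def support(label):
    """Number of gold mentions of this label (row sum)."""
    return sum(MATRIX[label])


def recall(label):
    """Correctly predicted mentions divided by gold mentions."""
    return MATRIX[label][LABELS.index(label)] / support(label)


for label in LABELS[:-1]:  # skip the O row, which has no correct cell
    print(f"{label}: support {support(label)}, recall {100 * recall(label):.2f}%")
```

For every entity type, the largest off-diagonal cell of the gold row is the O column, meaning the most common error is a missed mention rather than a confusion between types. The main type confusion is between LOC and GPE (9 and 8 cases).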

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com