metadata

license: mit
base_model: almanach/camembertav2-base
metrics:
  - precision
  - recall
  - f1
  - accuracy
model-index:
  - name: NERmemberta-4entities
    results: []
datasets:
  - CATIE-AQ/frenchNER_4entities
language:
  - fr
widget:
  - text: >-
      Le dévoilement du logo officiel des JO s'est déroulé le 21 octobre 2019 au
      Grand Rex. Ce nouvel emblème et cette nouvelle typographie ont été conçus
      par le designer Sylvain Boyer avec les agences Royalties & Ecobranding.
      Rond, il rassemble trois symboles : une médaille d'or, la flamme olympique
      et Marianne, symbolisée par un visage de femme mais privée de son bonnet
      phrygien caractéristique. La typographie dessinée fait référence à l'Art
      déco, mouvement artistique des années 1920, décennie pendant laquelle ont
      eu lieu pour la dernière fois les Jeux olympiques à Paris en 1924. Pour la
      première fois, ce logo sera unique pour les Jeux olympiques et les Jeux
      paralympiques.
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 27.5

NERmemberta-4entities

Model Description

We present NERmemberta-4entities, a CamemBERTa v2 base model fine-tuned for Named Entity Recognition in French on four French NER datasets covering four entity types (LOC, PER, ORG, MISC).
These datasets were concatenated and cleaned into a single dataset that we called frenchNER_4entities.
It contains 384,773 rows in total: 328,757 for training, 24,131 for validation and 31,885 for testing.
Our methodology is described in a blog post available in English or French.
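
As a quick sanity check on the split sizes quoted above, the short sketch below recomputes the total and the share of each split. The variable names are illustrative only; the row counts are the ones reported in this card.

```python
# Split sizes as reported in the model description of this card.
splits = {"train": 328_757, "validation": 24_131, "test": 31_885}

# The three splits should add up to the reported total of 384,773 rows.
total = sum(splits.values())

# Share of each split, in percent, rounded to one decimal.
shares = {name: round(100 * rows / total, 1) for name, rows in splits.items()}

print(total)
print(shares)
```

The training split therefore accounts for roughly 85% of the rows, with the remainder divided between validation and test.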

Evaluation results

The evaluation was carried out using the evaluate Python package.
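
The card does not state which metric configuration was used. Since the tables report a separate "O" column, the sketch below assumes token-level scoring and shows, in plain Python, how per-label precision, recall and F1 can be derived from aligned reference and predicted labels. The function name `per_label_scores` and the toy labels are hypothetical and are not taken from the authors' evaluation code.

```python
from collections import Counter


def per_label_scores(references, predictions):
    """Token-level precision, recall and F1 for each label (illustrative sketch)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(references, predictions):
        if ref == pred:
            tp[ref] += 1   # correct token for this label
        else:
            fp[pred] += 1  # predicted label was wrong
            fn[ref] += 1   # reference label was missed
    scores = {}
    for label in sorted(set(references) | set(predictions)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical toy example: one label per token, already aligned.
references = ["PER", "PER", "O", "LOC", "O", "ORG", "MISC", "O"]
predictions = ["PER", "O", "O", "LOC", "O", "ORG", "O", "O"]

scores = per_label_scores(references, predictions)
accuracy = sum(r == p for r, p in zip(references, predictions)) / len(references)

for label, values in scores.items():
    print(label, values)
print("accuracy", accuracy)
```

In this toy example the missed MISC token lowers the recall of MISC and the precision of O at the same time, which is the kind of trade-off visible in the MISC column of the tables.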

frenchNER_4entities

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

Model	PER	LOC	ORG	MISC
Jean-Baptiste/camembert-ner (110M)	0.971	0.947	0.902	0.663
cmarkea/distilcamembert-base-ner (67.5M)	0.974	0.948	0.892	0.658
NERmembert-base-4entities	0.978	0.958	0.903	0.814
NERmembert2-4entities (111M)	0.978	0.958	0.901	0.806
NERmemberta-4entities (111M) (this model)	0.979	0.961	0.915	0.812
NERmembert-large-4entities (336M)	0.982	0.964	0.919	0.834

Full results

Model	Metrics	PER	LOC	ORG	MISC	O	Overall
Jean-Baptiste/camembert-ner (110M)	Precision	0.952	0.924	0.870	0.845	0.986	0.976
	Recall	0.990	0.972	0.938	0.546	0.992	0.976
	F1	0.971	0.947	0.902	0.663	0.989	0.976
cmarkea/distilcamembert-base-ner (67.5M)	Precision	0.962	0.933	0.857	0.830	0.985	0.976
	Recall	0.987	0.963	0.930	0.545	0.993	0.976
	F1	0.974	0.948	0.892	0.658	0.989	0.976
NERmembert-base-4entities	Precision	0.973	0.951	0.888	0.850	0.993	0.984
	Recall	0.983	0.964	0.918	0.781	0.993	0.984
	F1	0.978	0.958	0.903	0.814	0.993	0.984
NERmembert2-4entities (111M)	Precision	0.973	0.951	0.882	0.860	0.991	0.982
	Recall	0.982	0.965	0.921	0.759	0.994	0.982
	F1	0.978	0.958	0.901	0.806	0.992	0.982
NERmemberta-4entities (111M) (this model)	Precision	0.976	0.955	0.894	0.856	0.991	0.983
	Recall	0.983	0.968	0.936	0.772	0.994	0.983
	F1	0.979	0.961	0.915	0.812	0.992	0.983
NERmembert-large-4entities (336M)	Precision	0.977	0.961	0.896	0.872	0.993	0.986
	Recall	0.987	0.966	0.943	0.798	0.995	0.986
	F1	0.982	0.964	0.919	0.834	0.994	0.986
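
The F1 values in the table are the harmonic mean of the precision and recall on the rows above them. The sketch below recomputes F1 for this model on frenchNER_4entities from the rounded precision and recall copied from the table; because the inputs are already rounded to three decimals, agreement is only expected up to rounding. The helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for NERmemberta-4entities, copied from the table above.
reported = {
    "PER": (0.976, 0.983, 0.979),
    "LOC": (0.955, 0.968, 0.961),
    "ORG": (0.894, 0.936, 0.915),
    "MISC": (0.856, 0.772, 0.812),
}

for label, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(label, round(recomputed, 3), f1)
```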

In detail:

multiconer

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

Model	PER	LOC	ORG	MISC
Jean-Baptiste/camembert-ner (110M)	0.940	0.761	0.723	0.560
cmarkea/distilcamembert-base-ner (67.5M)	0.921	0.748	0.694	0.530
NERmembert-base-4entities	0.960	0.890	0.867	0.852
NERmembert2-4entities (111M)	0.964	0.888	0.864	0.850
NERmemberta-4entities (111M) (this model)	0.966	0.891	0.867	0.862
NERmembert-large-4entities (336M)	0.969	0.919	0.904	0.864

Full results

Model	Metrics	PER	LOC	ORG	MISC	O	Overall
Jean-Baptiste/camembert-ner (110M)	Precision	0.908	0.717	0.753	0.620	0.936	0.889
	Recall	0.975	0.811	0.696	0.511	0.938	0.889
	F1	0.940	0.761	0.723	0.560	0.937	0.889
cmarkea/distilcamembert-base-ner (67.5M)	Precision	0.885	0.738	0.737	0.589	0.928	0.881
	Recall	0.960	0.759	0.655	0.482	0.939	0.881
	F1	0.921	0.748	0.694	0.530	0.934	0.881
NERmembert-base-4entities	Precision	0.954	0.893	0.851	0.849	0.979	0.954
	Recall	0.967	0.887	0.883	0.855	0.974	0.954
	F1	0.960	0.890	0.867	0.852	0.977	0.954
NERmembert2-4entities (111M)	Precision	0.953	0.890	0.870	0.842	0.976	0.952
	Recall	0.975	0.887	0.857	0.858	0.970	0.952
	F1	0.964	0.888	0.864	0.850	0.973	0.952
NERmemberta-4entities (111M) (this model)	Precision	0.961	0.895	0.859	0.845	0.978	0.953
	Recall	0.972	0.886	0.876	0.879	0.970	0.953
	F1	0.966	0.891	0.867	0.862	0.974	0.953
NERmembert-large-4entities (336M)	Precision	0.964	0.922	0.904	0.856	0.981	0.961
	Recall	0.975	0.917	0.904	0.872	0.976	0.961
	F1	0.969	0.919	0.904	0.864	0.978	0.961

multinerd

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

Model	PER	LOC	ORG	MISC
Jean-Baptiste/camembert-ner (110M)	0.962	0.934	0.888	0.419
cmarkea/distilcamembert-base-ner (67.5M)	0.972	0.938	0.884	0.430
NERmembert-base-4entities	0.985	0.973	0.938	0.770
NERmembert2-4entities (111M)	0.986	0.974	0.937	0.761
NERmemberta-4entities (111M) (this model)	0.987	0.976	0.942	0.770
NERmembert-large-4entities (336M)	0.987	0.976	0.948	0.790

Full results

Model	Metrics	PER	LOC	ORG	MISC	O	Overall
Jean-Baptiste/camembert-ner (110M)	Precision	0.931	0.893	0.827	0.725	0.979	0.966
	Recall	0.994	0.980	0.959	0.295	0.990	0.966
	F1	0.962	0.934	0.888	0.419	0.984	0.966
cmarkea/distilcamembert-base-ner (67.5M)	Precision	0.954	0.908	0.817	0.705	0.977	0.967
	Recall	0.991	0.969	0.963	0.310	0.990	0.967
	F1	0.972	0.938	0.884	0.430	0.984	0.967
NERmembert-base-4entities	Precision	0.976	0.961	0.911	0.829	0.991	0.983
	Recall	0.994	0.985	0.967	0.719	0.993	0.983
	F1	0.985	0.973	0.938	0.770	0.992	0.983
NERmembert2-4entities (111M)	Precision	0.976	0.962	0.903	0.846	0.988	0.980
	Recall	0.995	0.986	0.974	0.692	0.992	0.980
	F1	0.986	0.974	0.937	0.761	0.990	0.980
NERmemberta-4entities (111M) (this model)	Precision	0.979	0.963	0.912	0.848	0.988	0.981
	Recall	0.996	0.989	0.975	0.705	0.992	0.981
	F1	0.987	0.976	0.942	0.770	0.990	0.981
NERmembert-large-4entities (336M)	Precision	0.979	0.967	0.922	0.852	0.991	0.985
	Recall	0.996	0.986	0.974	0.736	0.994	0.985
	F1	0.987	0.976	0.948	0.790	0.993	0.985

wikiner

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

Model	PER	LOC	ORG	MISC
Jean-Baptiste/camembert-ner (110M)	0.986	0.966	0.938	0.938
cmarkea/distilcamembert-base-ner (67.5M)	0.983	0.964	0.925	0.926
NERmembert-base-4entities	0.970	0.945	0.876	0.872
NERmembert2-4entities (111M)	0.968	0.945	0.874	0.871
NERmemberta-4entities (111M) (this model)	0.969	0.950	0.897	0.871
NERmembert-large-4entities (336M)	0.975	0.953	0.896	0.893

Full results

Model	Metrics	PER	LOC	ORG	MISC	O	Overall
Jean-Baptiste/camembert-ner (110M)	Precision	0.986	0.962	0.925	0.943	0.998	0.992
	Recall	0.987	0.969	0.951	0.933	0.997	0.992
	F1	0.986	0.966	0.938	0.938	0.998	0.992
cmarkea/distilcamembert-base-ner (67.5M)	Precision	0.982	0.964	0.910	0.942	0.997	0.991
	Recall	0.985	0.963	0.940	0.910	0.998	0.991
	F1	0.983	0.964	0.925	0.926	0.997	0.991
NERmembert-base-4entities	Precision	0.970	0.944	0.872	0.878	0.996	0.986
	Recall	0.969	0.947	0.880	0.866	0.996	0.986
	F1	0.970	0.945	0.876	0.872	0.996	0.986
NERmembert2-4entities (111M)	Precision	0.970	0.942	0.865	0.883	0.996	0.985
	Recall	0.966	0.948	0.883	0.859	0.996	0.985
	F1	0.968	0.945	0.874	0.871	0.996	0.985
NERmemberta-4entities (111M) (this model)	Precision	0.974	0.949	0.883	0.869	0.996	0.986
	Recall	0.965	0.951	0.910	0.872	0.996	0.986
	F1	0.969	0.950	0.897	0.871	0.996	0.986
NERmembert-large-4entities (336M)	Precision	0.975	0.957	0.872	0.901	0.997	0.989
	Recall	0.975	0.949	0.922	0.884	0.997	0.989
	F1	0.975	0.953	0.896	0.893	0.997	0.989

Usage

Code

from transformers import pipeline

ner = pipeline('token-classification', model='CATIE-AQ/NERmemberta-4entities', tokenizer='CATIE-AQ/NERmemberta-4entities', aggregation_strategy="simple")

result = ner(
"Le dévoilement du logo officiel des JO s'est déroulé le 21 octobre 2019 au Grand Rex. Ce nouvel emblème et cette nouvelle typographie ont été conçus par le designer Sylvain Boyer avec les agences Royalties & Ecobranding. Rond, il rassemble trois symboles : une médaille d'or, la flamme olympique et Marianne, symbolisée par un visage de femme mais privée de son bonnet phrygien caractéristique. La typographie dessinée fait référence à l'Art déco, mouvement artistique des années 1920, décennie pendant laquelle ont eu lieu pour la dernière fois les Jeux olympiques à Paris en 1924. Pour la première fois, ce logo sera unique pour les Jeux olympiques et les Jeux paralympiques."
)

print(result)
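
With aggregation_strategy="simple", the pipeline returns a list of dictionaries, one per detected entity, with keys such as entity_group, score and word (plus character offsets). A minimal sketch of how such output could be grouped by entity type is shown below. The helper `group_entities`, its `min_score` threshold and the sample values are hypothetical; the sample only mimics the output format so that the snippet runs without downloading the model.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group entity strings by label, keeping those at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hypothetical sample in the pipeline's output format (scores are invented,
# character offsets omitted for brevity).
sample = [
    {"entity_group": "LOC", "score": 0.99, "word": "Grand Rex"},
    {"entity_group": "PER", "score": 0.98, "word": "Sylvain Boyer"},
    {"entity_group": "ORG", "score": 0.45, "word": "Royalties"},
    {"entity_group": "LOC", "score": 0.97, "word": "Paris"},
]

print(group_entities(sample, min_score=0.5))
```

In practice, `sample` would be replaced by the `result` variable from the code above.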

Try it through Space

A Space has been created to test the model. It is available here.

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.

Hardware Type: A100 PCIe 40/80GB
Hours used: 2h15min
Cloud Provider: Private Infrastructure
Carbon Efficiency (kg/kWh): 0.047 (estimated from electricitymaps for November 20, 2024)
Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 0.0275 kg eq. CO2
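
The formula above can be sketched in a few lines. The card does not state the power draw used in the estimate, so the snippet below does not assume one; instead it derives the average power implied by the reported runtime, carbon efficiency and emissions. The function and variable names are illustrative.

```python
def carbon_emitted_kg(power_kw, hours, carbon_efficiency_kg_per_kwh):
    """Power consumption x Time x Carbon produced per kWh, in kg eq. CO2."""
    return power_kw * hours * carbon_efficiency_kg_per_kwh


hours = 2 + 15 / 60        # 2h15min, as reported
carbon_efficiency = 0.047  # kg/kWh, as reported
reported_kg = 0.0275       # kg eq. CO2, as reported

# The power draw is not given in the card; this is the average value implied
# by the three reported figures, not a measured number.
implied_power_kw = reported_kg / (hours * carbon_efficiency)

print(round(implied_power_kw, 3))
print(carbon_emitted_kg(implied_power_kw, hours, carbon_efficiency))
```

The reported 0.0275 kg corresponds to the 27.5 g given in the co2_eq_emissions metadata field.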

Citations

NERmemBERTa-4entities

@misc {NERmemberta2024,
    author       = { {BOURDOIS, Loïck} },  
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { NERmemberta-4entities },
    year         = 2024,
    url          = { https://huggingface.co/CATIE-AQ/NERmemberta-4entities },
    doi          = { 10.57967/hf/1752 },
    publisher    = { Hugging Face }
}

NERmemBERT

@misc {NERmembert2024,
    author       = { {BOURDOIS, Loïck} },  
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { NERmembert-base-4entities },
    year         = 2024,
    url          = { https://huggingface.co/CATIE-AQ/NERmembert-base-4entities },
    doi          = { 10.57967/hf/1752 },
    publisher    = { Hugging Face }
}

CamemBERT

@inproceedings{martin2020camembert,  
  title={CamemBERT: a Tasty French Language Model},  
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},  
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},  
  year={2020}}

CamemBERT 2.0

@misc{antoun2024camembert20smarterfrench,
      title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection}, 
      author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2024},
      eprint={2411.08868},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.08868}, 
}

multiconer

@inproceedings{multiconer2-report,  
    title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},  
    author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},  
    booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},  
    year={2023},  
    publisher={Association for Computational Linguistics}}

@article{multiconer2-data,  
    title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},  
    author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},  
    year={2023}}

multinerd

@inproceedings{tedeschi-navigli-2022-multinerd,  
    title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",  
    author = "Tedeschi, Simone and  Navigli, Roberto",  
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",  
    month = jul,  
    year = "2022",  
    address = "Seattle, United States",  
    publisher = "Association for Computational Linguistics",  
    url = "https://aclanthology.org/2022.findings-naacl.60",  
    doi = "10.18653/v1/2022.findings-naacl.60",  
    pages = "801--812"}

pii-masking-200k

@misc {ai4privacy_2023,  
    author = { {ai4Privacy} },  
    title = { pii-masking-200k (Revision 1d4c0a1) },  
    year = 2023,  
    url = { https://huggingface.co/datasets/ai4privacy/pii-masking-200k },  
    doi = { 10.57967/hf/1532 },  
    publisher = { Hugging Face }}

wikiner

@article{NOTHMAN2013151,  
    title = {Learning multilingual named entity recognition from Wikipedia},  
    journal = {Artificial Intelligence},  
    volume = {194},  
    pages = {151-175},  
    year = {2013},  
    note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},  
    issn = {0004-3702},  
    doi = {https://doi.org/10.1016/j.artint.2012.03.006},  
    url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},  
    author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}

frenchNER_4entities

@misc {frenchNER2024,  
    author       = { {BOURDOIS, Loïck} },  
    organization  = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { frenchNER_4entities },  
    year         = 2024,  
    url          = { https://huggingface.co/CATIE-AQ/frenchNER_4entities },  
    doi          = { 10.57967/hf/1751 },  
    publisher    = { Hugging Face }  
}

License

MIT