---
license: mit
base_model: almanach/camembertav2-base
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: NERmemberta-4entities
  results: []
datasets:
- CATIE-AQ/frenchNER_4entities
language:
- fr
widget:
- text: "Le dévoilement du logo officiel des JO s'est déroulé le 21 octobre 2019 au Grand Rex. Ce nouvel emblème et cette nouvelle typographie ont été conçus par le designer Sylvain Boyer avec les agences Royalties & Ecobranding. Rond, il rassemble trois symboles : une médaille d'or, la flamme olympique et Marianne, symbolisée par un visage de femme mais privée de son bonnet phrygien caractéristique. La typographie dessinée fait référence à l'Art déco, mouvement artistique des années 1920, décennie pendant laquelle ont eu lieu pour la dernière fois les Jeux olympiques à Paris en 1924. Pour la première fois, ce logo sera unique pour les Jeux olympiques et les Jeux paralympiques."
library_name: transformers
pipeline_tag: token-classification
co2_eq_emissions: 27.5
---

# NERmemberta-4entities

## Model Description

We present **NERmemberta-4entities**, a [CamemBERTa v2 base](https://huggingface.co/almanach/camembertav2-base) model fine-tuned for Named Entity Recognition (NER) in French on four French NER datasets covering 4 entity types (LOC, PER, ORG, MISC).
All these datasets were concatenated and cleaned into a single dataset that we called [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities).
There are a total of **384,773** rows, of which **328,757** are for training, **24,131** for validation and **31,885** for testing.
Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/NER_en/) or [French](https://blog.vaniila.ai/NER/).

## Evaluation results

The evaluation was carried out using the [**evaluate**](https://pypi.org/project/evaluate/) Python package.

### frenchNER_4entities

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner (110M) | 0.971 | 0.947 | 0.902 | 0.663 |
| cmarkea/distilcamembert-base-ner (67.5M) | 0.974 | 0.948 | 0.892 | 0.658 |
| NERmembert-base-4entities | 0.978 | 0.958 | 0.903 | 0.814 |
| NERmembert2-4entities (111M) | 0.978 | 0.958 | 0.901 | 0.806 |
| NERmemberta-4entities (111M) (this model) | 0.979 | 0.961 | 0.915 | 0.812 |
| NERmembert-large-4entities (336M) | 0.982 | 0.964 | 0.919 | 0.834 |

Full results

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner (110M) | Precision | 0.952 | 0.924 | 0.870 | 0.845 | 0.986 | 0.976 |
| | Recall | 0.990 | 0.972 | 0.938 | 0.546 | 0.992 | 0.976 |
| | F1 | 0.971 | 0.947 | 0.902 | 0.663 | 0.989 | 0.976 |
| cmarkea/distilcamembert-base-ner (67.5M) | Precision | 0.962 | 0.933 | 0.857 | 0.830 | 0.985 | 0.976 |
| | Recall | 0.987 | 0.963 | 0.930 | 0.545 | 0.993 | 0.976 |
| | F1 | 0.974 | 0.948 | 0.892 | 0.658 | 0.989 | 0.976 |
| NERmembert-base-4entities | Precision | 0.973 | 0.951 | 0.888 | 0.850 | 0.993 | 0.984 |
| | Recall | 0.983 | 0.964 | 0.918 | 0.781 | 0.993 | 0.984 |
| | F1 | 0.978 | 0.958 | 0.903 | 0.814 | 0.993 | 0.984 |
| NERmembert2-4entities (111M) | Precision | 0.973 | 0.951 | 0.882 | 0.860 | 0.991 | 0.982 |
| | Recall | 0.982 | 0.965 | 0.921 | 0.759 | 0.994 | 0.982 |
| | F1 | 0.978 | 0.958 | 0.901 | 0.806 | 0.992 | 0.982 |
| NERmemberta-4entities (111M) (this model) | Precision | 0.976 | 0.955 | 0.894 | 0.856 | 0.991 | 0.983 |
| | Recall | 0.983 | 0.968 | 0.936 | 0.772 | 0.994 | 0.983 |
| | F1 | 0.979 | 0.961 | 0.915 | 0.812 | 0.992 | 0.983 |
| NERmembert-large-4entities (336M) | Precision | 0.977 | 0.961 | 0.896 | 0.872 | 0.993 | 0.986 |
| | Recall | 0.987 | 0.966 | 0.943 | 0.798 | 0.995 | 0.986 |
| | F1 | 0.982 | 0.964 | 0.919 | 0.834 | 0.994 | 0.986 |

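
As a quick sanity check on the full results above, each F1 value should be the harmonic mean of the precision and recall reported with it. The sketch below recomputes it for the NERmemberta-4entities rows on frenchNER_4entities. `f1_score` is a helper name introduced here for illustration only, and the small tolerance is an assumption made to absorb the three-decimal rounding of the inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for NERmemberta-4entities on
# frenchNER_4entities, copied from the full results table.
reported = {
    "PER": (0.976, 0.983, 0.979),
    "LOC": (0.955, 0.968, 0.961),
    "ORG": (0.894, 0.936, 0.915),
    "MISC": (0.856, 0.772, 0.812),
    "O": (0.991, 0.994, 0.992),
}

for label, (p, r, f1) in reported.items():
    recomputed = f1_score(p, r)
    # Inputs are rounded to three decimals, so compare with a tolerance.
    assert abs(recomputed - f1) < 0.002, label
    print(f"{label}: recomputed F1 = {recomputed:.3f}, reported = {f1:.3f}")
```

The same check can be applied to any other row of the tables by swapping in its precision, recall and F1 values.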
In detail:

### multiconer

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner (110M) | 0.940 | 0.761 | 0.723 | 0.560 |
| cmarkea/distilcamembert-base-ner (67.5M) | 0.921 | 0.748 | 0.694 | 0.530 |
| NERmembert-base-4entities | 0.960 | 0.890 | 0.867 | 0.852 |
| NERmembert2-4entities (111M) | 0.964 | 0.888 | 0.864 | 0.850 |
| NERmemberta-4entities (111M) (this model) | 0.966 | 0.891 | 0.867 | 0.862 |
| NERmembert-large-4entities (336M) | 0.969 | 0.919 | 0.904 | 0.864 |

Full results

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner (110M) | Precision | 0.908 | 0.717 | 0.753 | 0.620 | 0.936 | 0.889 |
| | Recall | 0.975 | 0.811 | 0.696 | 0.511 | 0.938 | 0.889 |
| | F1 | 0.940 | 0.761 | 0.723 | 0.560 | 0.937 | 0.889 |
| cmarkea/distilcamembert-base-ner (67.5M) | Precision | 0.885 | 0.738 | 0.737 | 0.589 | 0.928 | 0.881 |
| | Recall | 0.960 | 0.759 | 0.655 | 0.482 | 0.939 | 0.881 |
| | F1 | 0.921 | 0.748 | 0.694 | 0.530 | 0.934 | 0.881 |
| NERmembert-base-4entities | Precision | 0.954 | 0.893 | 0.851 | 0.849 | 0.979 | 0.954 |
| | Recall | 0.967 | 0.887 | 0.883 | 0.855 | 0.974 | 0.954 |
| | F1 | 0.960 | 0.890 | 0.867 | 0.852 | 0.977 | 0.954 |
| NERmembert2-4entities (111M) | Precision | 0.953 | 0.890 | 0.870 | 0.842 | 0.976 | 0.952 |
| | Recall | 0.975 | 0.887 | 0.857 | 0.858 | 0.970 | 0.952 |
| | F1 | 0.964 | 0.888 | 0.864 | 0.850 | 0.973 | 0.952 |
| NERmemberta-4entities (111M) (this model) | Precision | 0.961 | 0.895 | 0.859 | 0.845 | 0.978 | 0.953 |
| | Recall | 0.972 | 0.886 | 0.876 | 0.879 | 0.970 | 0.953 |
| | F1 | 0.966 | 0.891 | 0.867 | 0.862 | 0.974 | 0.953 |
| NERmembert-large-4entities (336M) | Precision | 0.964 | 0.922 | 0.904 | 0.856 | 0.981 | 0.961 |
| | Recall | 0.975 | 0.917 | 0.904 | 0.872 | 0.976 | 0.961 |
| | F1 | 0.969 | 0.919 | 0.904 | 0.864 | 0.978 | 0.961 |

### multinerd

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner (110M) | 0.962 | 0.934 | 0.888 | 0.419 |
| cmarkea/distilcamembert-base-ner (67.5M) | 0.972 | 0.938 | 0.884 | 0.430 |
| NERmembert-base-4entities | 0.985 | 0.973 | 0.938 | 0.770 |
| NERmembert2-4entities (111M) | 0.986 | 0.974 | 0.937 | 0.761 |
| NERmemberta-4entities (111M) (this model) | 0.987 | 0.976 | 0.942 | 0.770 |
| NERmembert-large-4entities (336M) | 0.987 | 0.976 | 0.948 | 0.790 |

Full results

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner (110M) | Precision | 0.931 | 0.893 | 0.827 | 0.725 | 0.979 | 0.966 |
| | Recall | 0.994 | 0.980 | 0.959 | 0.295 | 0.990 | 0.966 |
| | F1 | 0.962 | 0.934 | 0.888 | 0.419 | 0.984 | 0.966 |
| cmarkea/distilcamembert-base-ner (67.5M) | Precision | 0.954 | 0.908 | 0.817 | 0.705 | 0.977 | 0.967 |
| | Recall | 0.991 | 0.969 | 0.963 | 0.310 | 0.990 | 0.967 |
| | F1 | 0.972 | 0.938 | 0.884 | 0.430 | 0.984 | 0.967 |
| NERmembert-base-4entities | Precision | 0.976 | 0.961 | 0.911 | 0.829 | 0.991 | 0.983 |
| | Recall | 0.994 | 0.985 | 0.967 | 0.719 | 0.993 | 0.983 |
| | F1 | 0.985 | 0.973 | 0.938 | 0.770 | 0.992 | 0.983 |
| NERmembert2-4entities (111M) | Precision | 0.976 | 0.962 | 0.903 | 0.846 | 0.988 | 0.980 |
| | Recall | 0.995 | 0.986 | 0.974 | 0.692 | 0.992 | 0.980 |
| | F1 | 0.986 | 0.974 | 0.937 | 0.761 | 0.990 | 0.980 |
| NERmemberta-4entities (111M) (this model) | Precision | 0.979 | 0.963 | 0.912 | 0.848 | 0.988 | 0.981 |
| | Recall | 0.996 | 0.989 | 0.975 | 0.705 | 0.992 | 0.981 |
| | F1 | 0.987 | 0.976 | 0.942 | 0.770 | 0.990 | 0.981 |
| NERmembert-large-4entities (336M) | Precision | 0.979 | 0.967 | 0.922 | 0.852 | 0.991 | 0.985 |
| | Recall | 0.996 | 0.986 | 0.974 | 0.736 | 0.994 | 0.985 |
| | F1 | 0.987 | 0.976 | 0.948 | 0.790 | 0.993 | 0.985 |

### wikiner

For space reasons, we show only the F1 of the different models. You can see the full results below the table.

| Model | PER | LOC | ORG | MISC |
|---|---|---|---|---|
| Jean-Baptiste/camembert-ner (110M) | 0.986 | 0.966 | 0.938 | 0.938 |
| cmarkea/distilcamembert-base-ner (67.5M) | 0.983 | 0.964 | 0.925 | 0.926 |
| NERmembert-base-4entities | 0.970 | 0.945 | 0.876 | 0.872 |
| NERmembert2-4entities (111M) | 0.968 | 0.945 | 0.874 | 0.871 |
| NERmemberta-4entities (111M) (this model) | 0.969 | 0.950 | 0.897 | 0.871 |
| NERmembert-large-4entities (336M) | 0.975 | 0.953 | 0.896 | 0.893 |

Full results

| Model | Metrics | PER | LOC | ORG | MISC | O | Overall |
|---|---|---|---|---|---|---|---|
| Jean-Baptiste/camembert-ner (110M) | Precision | 0.986 | 0.962 | 0.925 | 0.943 | 0.998 | 0.992 |
| | Recall | 0.987 | 0.969 | 0.951 | 0.933 | 0.997 | 0.992 |
| | F1 | 0.986 | 0.966 | 0.938 | 0.938 | 0.998 | 0.992 |
| cmarkea/distilcamembert-base-ner (67.5M) | Precision | 0.982 | 0.964 | 0.910 | 0.942 | 0.997 | 0.991 |
| | Recall | 0.985 | 0.963 | 0.940 | 0.910 | 0.998 | 0.991 |
| | F1 | 0.983 | 0.964 | 0.925 | 0.926 | 0.997 | 0.991 |
| NERmembert-base-4entities | Precision | 0.970 | 0.944 | 0.872 | 0.878 | 0.996 | 0.986 |
| | Recall | 0.969 | 0.947 | 0.880 | 0.866 | 0.996 | 0.986 |
| | F1 | 0.970 | 0.945 | 0.876 | 0.872 | 0.996 | 0.986 |
| NERmembert2-4entities (111M) | Precision | 0.970 | 0.942 | 0.865 | 0.883 | 0.996 | 0.985 |
| | Recall | 0.966 | 0.948 | 0.883 | 0.859 | 0.996 | 0.985 |
| | F1 | 0.968 | 0.945 | 0.874 | 0.871 | 0.996 | 0.985 |
| NERmemberta-4entities (111M) (this model) | Precision | 0.974 | 0.949 | 0.883 | 0.869 | 0.996 | 0.986 |
| | Recall | 0.965 | 0.951 | 0.910 | 0.872 | 0.996 | 0.986 |
| | F1 | 0.969 | 0.950 | 0.897 | 0.871 | 0.996 | 0.986 |
| NERmembert-large-4entities (336M) | Precision | 0.975 | 0.957 | 0.872 | 0.901 | 0.997 | 0.989 |
| | Recall | 0.975 | 0.949 | 0.922 | 0.884 | 0.997 | 0.989 |
| | F1 | 0.975 | 0.953 | 0.896 | 0.893 | 0.997 | 0.989 |

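
The card does not include the evaluation script, so the following is only a minimal sketch of how scores of this shape can be produced, assuming token-level scoring over the five labels (PER, LOC, ORG, MISC, O). The function name `per_label_scores` and the toy label sequences are invented for illustration; they are not taken from the dataset or from the `evaluate` package.

```python
from collections import Counter


def per_label_scores(references, predictions):
    """Token-level precision, recall and F1 per label, plus overall accuracy."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")

    true_pos, pred_count, ref_count = Counter(), Counter(), Counter()
    for ref, pred in zip(references, predictions):
        ref_count[ref] += 1
        pred_count[pred] += 1
        if ref == pred:
            true_pos[ref] += 1

    scores = {}
    for label in sorted(set(ref_count) | set(pred_count)):
        tp = true_pos[label]
        # Guard against labels that were never predicted or never present.
        precision = tp / pred_count[label] if pred_count[label] else 0.0
        recall = tp / ref_count[label] if ref_count[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(true_pos.values()) / len(references) if references else 0.0
    return scores, accuracy


# Toy labels, invented for illustration only (one MISC token is missed).
references = ["PER", "PER", "O", "O", "LOC", "ORG", "MISC", "O"]
predictions = ["PER", "PER", "O", "O", "LOC", "ORG", "O", "O"]

scores, accuracy = per_label_scores(references, predictions)
for label, values in scores.items():
    print(label, values)
print("accuracy:", accuracy)
```

Under this assumption every token carries exactly one reference label and one predicted label, so a micro-average over all tokens equals accuracy. That would be consistent with the identical precision, recall and F1 values in the Overall column, although the card itself does not state how that column was computed.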
## Usage

### Code

```python
from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
# aggregation_strategy="simple" groups adjacent tokens that share an entity
# label into a single span instead of returning one result per token.
ner = pipeline(
    "token-classification",
    model="CATIE-AQ/NERmemberta-4entities",
    tokenizer="CATIE-AQ/NERmemberta-4entities",
    aggregation_strategy="simple",
)

result = ner(
    "Le dévoilement du logo officiel des JO s'est déroulé le 21 octobre 2019 au Grand Rex. Ce nouvel emblème et cette nouvelle typographie ont été conçus par le designer Sylvain Boyer avec les agences Royalties & Ecobranding. Rond, il rassemble trois symboles : une médaille d'or, la flamme olympique et Marianne, symbolisée par un visage de femme mais privée de son bonnet phrygien caractéristique. La typographie dessinée fait référence à l'Art déco, mouvement artistique des années 1920, décennie pendant laquelle ont eu lieu pour la dernière fois les Jeux olympiques à Paris en 1924. Pour la première fois, ce logo sera unique pour les Jeux olympiques et les Jeux paralympiques."
)

# Each item in the list describes one detected entity span.
print(result)
```

### Try it through Space

A Space has been created to test the model. It is available [here](https://huggingface.co/spaces/CATIE-AQ/NERmembert).

## Environmental Impact

*Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.*

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 2h15min
- **Cloud Provider:** Private Infrastructure
- **Carbon Efficiency (kg/kWh):** 0.047 (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR) for the day of November 20, 2024.)
- **Carbon Emitted** *(Power consumption x Time x Carbon produced based on location of power grid)*: 0.0275 kg eq. CO2

## Citations

### NERmemBERTa-4entities

```
@misc {NERmemberta2024,
    author       = { {BOURDOIS, Loïck} },
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
    title        = { NERmemberta-4entities },
    year         = 2024,
    url          = { https://huggingface.co/CATIE-AQ/NERmemberta-4entities },
    doi          = { 10.57967/hf/1752 },
    publisher    = { Hugging Face }
}
```

### NERmemBERT

```
@misc {NERmembert2024,
    author       = { {BOURDOIS, Loïck} },
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
    title        = { NERmembert-base-3entities },
    year         = 2024,
    url          = { https://huggingface.co/CATIE-AQ/NERmembert-base-4entities },
    doi          = { 10.57967/hf/1752 },
    publisher    = { Hugging Face }
}
```

### CamemBERT

```
@inproceedings{martin2020camembert,
    title={CamemBERT: a Tasty French Language Model},
    author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
    booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    year={2020}
}
```

### CamemBERT 2.0

```
@misc{antoun2024camembert20smarterfrench,
    title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
    author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
    year={2024},
    eprint={2411.08868},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2411.08868},
}
```

### multiconer

```
@inproceedings{multiconer2-report,
    title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},
    author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},
    booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},
    year={2023},
    publisher={Association for Computational Linguistics}
}

@article{multiconer2-data,
    title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},
    author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},
    year={2023}
}
```

### multinerd

```
@inproceedings{tedeschi-navigli-2022-multinerd,
    title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
    author = "Tedeschi, Simone and Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.60",
    doi = "10.18653/v1/2022.findings-naacl.60",
    pages = "801--812"
}
```

### pii-masking-200k

```
@misc {ai4privacy_2023,
    author    = { {ai4Privacy} },
    title     = { pii-masking-200k (Revision 1d4c0a1) },
    year      = 2023,
    url       = { https://huggingface.co/datasets/ai4privacy/pii-masking-200k },
    doi       = { 10.57967/hf/1532 },
    publisher = { Hugging Face }
}
```

### wikiner

```
@article{NOTHMAN2013151,
    title = {Learning multilingual named entity recognition from Wikipedia},
    journal = {Artificial Intelligence},
    volume = {194},
    pages = {151-175},
    year = {2013},
    note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},
    issn = {0004-3702},
    doi = {https://doi.org/10.1016/j.artint.2012.03.006},
    url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},
    author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}
}
```

### frenchNER_4entities

```
@misc {frenchNER2024,
    author       = { {BOURDOIS, Loïck} },
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
    title        = { frenchNER_4entities },
    year         = 2024,
    url          = { https://huggingface.co/CATIE-AQ/frenchNER_4entities },
    doi          = { 10.57967/hf/1751 },
    publisher    = { Hugging Face }
}
```

## License

MIT