---
license: apache-2.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
pipeline_tag: token-classification
widget:
- text: "Amelia Earhart voló su Lockheed Vega 5B monomotor a través del Océano Atlántico hasta París ."
  example_title: "Spanish"
- text: "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris ."
  example_title: "English"
- text: "Amelia Earhart a fait voler son monomoteur Lockheed Vega 5B à travers l' ocean Atlantique jusqu'à Paris ."
  example_title: "French"
- text: "Amelia Earhart flog mit ihrer einmotorigen Lockheed Vega 5B über den Atlantik nach Paris ."
  example_title: "German"
- text: "Амелия Эрхарт перелетела на своем одномоторном самолете Lockheed Vega 5B через Атлантический океан в Париж ."
  example_title: "Russian"
- text: "Amelia Earhart vloog met haar één-motorige Lockheed Vega 5B over de Atlantische Oceaan naar Parijs ."
  example_title: "Dutch"
- text: "Amelia Earhart przeleciała swoim jednosilnikowym samolotem Lockheed Vega 5B przez Ocean Atlantycki do Paryża ."
  example_title: "Polish"
- text: "Amelia Earhart flaug eins hreyfils Lockheed Vega 5B yfir Atlantshafið til Parísar ."
  example_title: "Icelandic"
- text: "Η Amelia Earhart πέταξε το μονοκινητήριο Lockheed Vega 5B της πέρα από τον Ατλαντικό Ωκεανό στο Παρίσι ."
  example_title: "Greek"
model-index:
- name: SpanMarker w. xlm-roberta-base on MultiNERD by Tom Aarsen
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: Babelscape/multinerd
      name: MultiNERD
      split: test
      revision: 2814b78e7af4b5a1f1886fe7ad49632de4d9dd25
    metrics:
    - type: f1
      value: 0.91314
      name: F1
    - type: precision
      value: 0.91994
      name: Precision
    - type: recall
      value: 0.90643
      name: Recall
datasets:
- Babelscape/multinerd
language:
- multilingual
metrics:
- f1
- recall
- precision
---

# SpanMarker for Named Entity Recognition

**Note**: Due to major [tokenization limitations](#Limitations), this model is deprecated in favor of the much superior [tomaarsen/span-marker-mbert-base-multinerd](https://huggingface.co/tomaarsen/span-marker-mbert-base-multinerd) model.

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model for multilingual Named Entity Recognition, trained on the [MultiNERD](https://huggingface.co/datasets/Babelscape/multinerd) dataset. It uses [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) as the underlying encoder. See [train.py](train.py) for the training script.

## Metrics

| **Language** | **F1** | **Precision** | **Recall** |
|--------------|--------|---------------|------------|
| **all**      | 91.31  | 91.99         | 90.64      |
| **de**       | 93.77  | 93.56         | 93.87      |
| **en**       | 94.55  | 94.01         | 95.10      |
| **es**       | 90.82  | 92.58         | 89.13      |
| **fr**       | 90.90  | 93.23         | 88.68      |
| **it**       | 93.40  | 94.23         | 92.60      |
| **nl**       | 92.47  | 93.61         | 91.36      |
| **pl**       | 91.66  | 92.51         | 90.81      |
| **pt**       | 91.73  | 93.29         | 90.22      |
| **ru**       | 92.64  | 92.37         | 92.91      |
| **zh**       | 82.38  | 83.23         | 81.55      |

## Label set

| Class | Description | Examples |
|-------|-------------|----------|
| PER (person) | People | Ray Charles, Jessica Alba, Leonardo DiCaprio, Roger Federer, Anna Massey. |
| ORG (organization) | Associations, companies, agencies, institutions, nationalities and religious or political groups | University of Edinburgh, San Francisco Giants, Google, Democratic Party. |
| LOC (location) | Physical locations (e.g. mountains, bodies of water), geopolitical entities (e.g. cities, states), and facilities (e.g. bridges, buildings, airports). | Rome, Lake Paiku, Chrysler Building, Mount Rushmore, Mississippi River. |
| ANIM (animal) | Breeds of dogs, cats and other animals, including their scientific names. | Maine Coon, African Wild Dog, Great White Shark, New Zealand Bellbird. |
| BIO (biological) | Genus of fungus, bacteria and protoctists, families of viruses, and other biological entities. | Herpes Simplex Virus, Escherichia Coli, Salmonella, Bacillus Anthracis. |
| CEL (celestial) | Planets, stars, asteroids, comets, nebulae, galaxies and other astronomical objects. | Sun, Neptune, Asteroid 187 Lamberta, Proxima Centauri, V838 Monocerotis. |
| DIS (disease) | Physical, mental, infectious, non-infectious, deficiency, inherited, degenerative, social and self-inflicted diseases. | Alzheimer’s Disease, Cystic Fibrosis, Dilated Cardiomyopathy, Arthritis. |
| EVE (event) | Sport events, battles, wars and other events. | American Civil War, 2003 Wimbledon Championships, Cannes Film Festival. |
| FOOD (food) | Foods and drinks. | Carbonara, Sangiovese, Cheddar Beer Fondue, Pizza Margherita. |
| INST (instrument) | Technological instruments, mechanical instruments, musical instruments, and other tools. | Spitzer Space Telescope, Commodore 64, Skype, Apple Watch, Fender Stratocaster. |
| MEDIA (media) | Titles of films, books, magazines, songs and albums, fictional characters and languages. | Forbes, American Psycho, Kiss Me Once, Twin Peaks, Disney Adventures. |
| PLANT (plant) | Types of trees, flowers, and other plants, including their scientific names. | Salix, Quercus Petraea, Douglas Fir, Forsythia, Artemisia Maritima. |
| MYTH (mythological) | Mythological and religious entities. | Apollo, Persephone, Aphrodite, Saint Peter, Pope Gregory I, Hercules. |
| TIME (time) | Specific and well-defined time intervals, such as eras, historical periods, centuries, years and important days. No months and days of the week. | Renaissance, Middle Ages, Christmas, Great Depression, 17th Century, 2012. |
| VEHI (vehicle) | Cars, motorcycles and other vehicles. | Ferrari Testarossa, Suzuki Jimny, Honda CR-X, Boeing 747, Fairey Fulmar. |

## Usage

To use this model for inference, first install the `span_marker` library:

```bash
pip install span_marker
```

You can then run inference with this model like so:

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-base-multinerd")
# Run inference (punctuation is separated from the preceding word, see Limitations)
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris .")
```

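As a rough sketch of what to do with the result, the snippet below groups predicted spans by label. It assumes that `model.predict` returns a list of dictionaries with at least the keys `"span"`, `"label"` and `"score"`. The helper name `group_spans_by_label` is hypothetical, and `example_entities` is a hand-written stand-in with placeholder scores, not real model output.

```python
from collections import defaultdict


def group_spans_by_label(entities):
    """Group entity spans by label (hypothetical helper, not part of span_marker)."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written stand-in for `model.predict(...)` output.
# The labels follow the label set above; the scores are placeholders.
example_entities = [
    {"span": "Amelia Earhart", "label": "PER", "score": 0.9},
    {"span": "Lockheed Vega 5B", "label": "VEHI", "score": 0.9},
    {"span": "Atlantic", "label": "LOC", "score": 0.9},
    {"span": "Paris", "label": "LOC", "score": 0.9},
]

for label, spans in group_spans_by_label(example_entities).items():
    print(label, spans)
```

With a loaded model, `example_entities` would be replaced by the `entities` list returned by `model.predict`.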

### Limitations

**Warning**: This model works best when punctuation is separated from the preceding word, for example:

```python
# ✅
model.predict("He plays J. Robert Oppenheimer , an American theoretical physicist .")
# ❌
model.predict("He plays J. Robert Oppenheimer, an American theoretical physicist.")

# You can also supply a list of words directly: ✅
model.predict(["He", "plays", "J.", "Robert", "Oppenheimer", ",", "an", "American", "theoretical", "physicist", "."])
```

Similar splitting may help in some languages, for example turning `"l'ocean Atlantique"` into `"l' ocean Atlantique"`.

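The punctuation separation described above can be approximated with a small pre-processing step before calling `model.predict`. The following is a minimal sketch under stated assumptions: `separate_punctuation` is a hypothetical helper, not part of `span_marker`, and it only peels trailing `, . ; : ! ?` off whitespace-separated chunks while leaving single-letter initials such as `J.` intact. It does not handle apostrophes (`l'ocean`), quotes, brackets, or multi-letter abbreviations such as `U.S.`.

```python
import re

# Hypothetical helper patterns; the punctuation set is an assumption.
_TRAILING_PUNCT = re.compile(r"(.+?)([,.;:!?]+)")
_INITIAL = re.compile(r"[A-Z]\.")


def separate_punctuation(text: str) -> list[str]:
    """Split text on whitespace and detach trailing punctuation from each word."""
    words = []
    for chunk in text.split():
        match = _TRAILING_PUNCT.fullmatch(chunk)
        if match and not _INITIAL.fullmatch(chunk):
            words.append(match.group(1))   # the word itself
            words.extend(match.group(2))   # each punctuation mark as its own token
        else:
            words.append(chunk)            # no trailing punctuation, or an initial
    return words


words = separate_punctuation(
    "He plays J. Robert Oppenheimer, an American theoretical physicist."
)
print(words)
```

For this sentence the helper produces the same word list that is passed to `model.predict` in the example above, so the result can be supplied to `model.predict` directly.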

See the [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) repository for documentation and additional information on this library.

## Contributions

Many thanks to [Simone Tedeschi](https://huggingface.co/sted97) from [Babelscape](https://babelscape.com) for his insight when training this model and his involvement in the creation of the training dataset.