---
license: mit
datasets:
- numind/NuNER
library_name: gliner
language:
- en
pipeline_tag: token-classification
tags:
- entity recognition
- NER
- named entity recognition
- zero shot
- zero-shot
---

NuNerZero is a family of zero-shot entity recognition models inspired by [GLiNER](https://huggingface.co/papers/2311.08526) and built with insights we gathered throughout our work on [NuNER](https://huggingface.co/collections/numind/nuner-token-classification-and-ner-backbones-65e1f6e14639e2a465af823b).

The key differences between NuNerZero Token Long and GLiNER are:

* **4096-token context window** vs. the 512-token context in GLiNER. This allows processing a page at a time instead of a few sentences!
* The ability to **detect entities longer than 12 tokens**, because NuNerZero operates at the token level rather than the span level.
* The NuZero family is trained on a **diverse dataset tailored for real-life use cases**: the NuNER v2.0 dataset.

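To illustrate why token-level prediction removes the length cap, here is a minimal sketch of turning per-token labels into entity spans. The function name `spans_from_token_tags` and the simple one-label-per-token scheme (with `"O"` for non-entity tokens) are assumptions made for illustration; they are not taken from the NuNerZero or GLiNER code, and the model's actual tagging scheme may differ.

```python
from typing import List, Tuple


def spans_from_token_tags(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group consecutive tokens with the same label into (start, end, label) spans.

    `tags` holds one label per token; "O" marks tokens outside any entity.
    The end index is exclusive. Hypothetical helper, for illustration only.
    """
    spans = []
    start = None
    # A trailing "O" sentinel flushes a span that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        if start is not None and tag != tags[start]:
            spans.append((start, i, tags[start]))
            start = None
        if start is None and tag != "O":
            start = i
    return spans


# A 15-token entity followed by a 2-token entity.
tags = ["O"] + ["organization"] * 15 + ["O", "person", "person"]
spans = spans_from_token_tags(tags)
print(spans)  # [(1, 16, 'organization'), (17, 19, 'person')]
```

A span-level model that only scores candidate spans up to 12 tokens wide would never propose the 15-token entity above, whereas grouping token-level labels places no limit on entity length. Note that this simple scheme cannot separate two directly adjacent entities that share a label.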
<p align="center">
<img src="zero_shot_performance_unzero_token_long.png">
</p>

## Installation & Usage

```
!pip install gliner
```

**NuZero requires labels to be lower-cased.**

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("numind/NuNerZero_long_contex")

# NuZero requires labels to be lower-cased!
labels = ["person", "award", "date", "competitions", "teams"]
labels = [l.lower() for l in labels]

# Replace with the text to analyze
text = """

"""

entities = model.predict_entities(text, labels)

# Print each detected entity with its predicted label
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

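Because the model works at the token level, a multi-word entity may come back as several adjacent pieces with the same label. If that happens, a small post-processing step can join them. The sketch below is one possible approach, not part of the `gliner` API: `merge_adjacent_entities` is a hypothetical helper, and it assumes each predicted entity is a dict with character offsets under `"start"` and `"end"` in addition to `"text"` and `"label"`.

```python
from typing import Dict, List


def merge_adjacent_entities(text: str, entities: List[Dict]) -> List[Dict]:
    """Join same-label entities separated only by whitespace.

    Hypothetical helper: assumes each entity dict has "start", "end",
    "text", and "label" keys, with "start"/"end" as character offsets
    into `text`. The input list and its dicts are left unmodified.
    """
    merged: List[Dict] = []
    for entity in sorted(entities, key=lambda e: e["start"]):
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and previous["label"] == entity["label"]
            and entity["start"] >= previous["end"]
            and text[previous["end"]:entity["start"]].strip() == ""
        ):
            # Extend the previous entity to cover this one.
            previous["end"] = entity["end"]
            previous["text"] = text[previous["start"]:previous["end"]]
        else:
            merged.append(dict(entity))
    return merged


# Hand-written sample input, standing in for model output.
sample_text = "Bill Gates founded Microsoft"
sample_entities = [
    {"start": 0, "end": 4, "text": "Bill", "label": "person"},
    {"start": 5, "end": 10, "text": "Gates", "label": "person"},
    {"start": 19, "end": 28, "text": "Microsoft", "label": "organization"},
]
for entity in merge_adjacent_entities(sample_text, sample_entities):
    print(entity["text"], "=>", entity["label"])
```

With real predictions, the call would look like `merge_adjacent_entities(text, entities)`. Merging only across whitespace is a deliberate, conservative choice: two same-label entities separated by punctuation or other words stay separate.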
## Fine-tuning

A fine-tuning script can be found [here](https://colab.research.google.com/drive/19WDnuD2U-B0h-FzX7I5FySNP6sHt4Cru?usp=sharing).

## Citation

### This work

```bibtex
@misc{bogdanov2024nuner,
      title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
      author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
      year={2024},
      eprint={2402.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Previous work

```bibtex
@misc{zaratiana2023gliner,
      title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
      author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
      year={2023},
      eprint={2311.08526},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```