# NERCat Classifier

## Model Overview
The NERCat classifier is a fine-tuned version of the GLiNER Knowledgator model, designed specifically for Named Entity Recognition (NER) in the Catalan language. By leveraging a manually annotated dataset of Catalan-language television transcriptions, this classifier significantly improves the recognition of named entities across diverse categories, addressing the challenges posed by the scarcity of high-quality training data for Catalan.
The pre-trained version used for fine-tuning was `knowledgator/gliner-bi-large-v1.0`.
## Quickstart
```python
import torch
from gliner import GLiNER

# Load the fine-tuned model and move it to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("Ugiat/NERCat").to(device)

text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

labels = [
    "Person",
    "Facility",
    "Organization",
    "Location",
    "Product",
    "Event",
    "Date",
    "Law",
]

entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```
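Each item returned by `predict_entities` is a dictionary carrying the matched span, its predicted label, and a confidence score. As a minimal sketch of post-filtering by confidence (the dictionary shape shown mirrors standard GLiNER output; the helper function and the sample values are illustrative, not part of NERCat):

```python
def filter_entities(entities, min_score=0.7):
    """Keep only predictions whose confidence score meets min_score."""
    return [e for e in entities if e["score"] >= min_score]

# Mock predictions with the same shape as GLiNER output:
sample = [
    {"text": "Universitat de Barcelona", "label": "Organization", "score": 0.94},
    {"text": "Catalunya", "label": "Location", "score": 0.55},
]
print(filter_entities(sample, min_score=0.7))
```

Raising the cutoff above the `threshold=0.5` used at prediction time trades recall for precision without re-running the model.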
## Performance Evaluation

We evaluated the fine-tuned NERCat model against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. NERCat improves recall and F1 in every entity category; precision also improves everywhere except Date, where it drops slightly:
| Entity Type  | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1  |
|--------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person       | 1.00             | 1.00          | 1.00      | 0.92             | 0.80          | 0.86      | +0.08       | +0.20    | +0.14 |
| Facility     | 0.89             | 1.00          | 0.94      | 0.67             | 0.25          | 0.36      | +0.22       | +0.75    | +0.58 |
| Organization | 1.00             | 1.00          | 1.00      | 0.72             | 0.62          | 0.67      | +0.28       | +0.38    | +0.33 |
| Location     | 1.00             | 0.97          | 0.99      | 0.83             | 0.54          | 0.66      | +0.17       | +0.43    | +0.33 |
| Product      | 0.96             | 1.00          | 0.98      | 0.63             | 0.21          | 0.31      | +0.34       | +0.79    | +0.67 |
| Event        | 0.88             | 0.88          | 0.88      | 0.60             | 0.38          | 0.46      | +0.28       | +0.50    | +0.41 |
| Date         | 0.88             | 1.00          | 0.93      | 1.00             | 0.07          | 0.13      | -0.13       | +0.93    | +0.80 |
| Law          | 0.67             | 1.00          | 0.80      | 0.00             | 0.00          | 0.00      | +0.67       | +1.00    | +0.80 |
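The F1 columns are the standard harmonic mean of precision and recall. As a quick sanity check against the table (using the Facility row), the relationship can be computed directly:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Facility row: NERCat precision 0.89, recall 1.00
print(round(f1_score(0.89, 1.00), 2))  # → 0.94
```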
## Fine-Tuning Process

The fine-tuning process followed a structured approach covering dataset preparation, model training, and optimization:

- Data splitting: the dataset was shuffled and split into training (90%) and testing (10%) subsets.
- Training setup:
  - Batch size: 8
  - Steps: 500
  - Loss function: focal loss (α = 0.75, γ = 2) to address class imbalance
  - Learning rates:
    - Entity layers: $5 \times 10^{-6}$
    - Other model parameters: $1 \times 10^{-5}$
  - Scheduler: linear, with a warmup ratio of 0.1
  - Evaluation frequency: every 100 steps
  - Checkpointing: every 1000 steps
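The focal loss and the split learning rates above can be sketched in PyTorch. This is an illustrative re-implementation of the standard binary focal loss formula with the stated α and γ, not the exact training code; the parameter-group attribute names in the commented optimizer setup are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights easy
    examples, which helps with the class imbalance noted above."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Two learning rates via optimizer parameter groups; `entity_layers` and
# `backbone` are placeholder attribute names, not real GLiNER internals:
# optimizer = torch.optim.AdamW([
#     {"params": model.entity_layers.parameters(), "lr": 5e-6},
#     {"params": model.backbone.parameters(),      "lr": 1e-5},
# ])
```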
The dataset included 13,732 named entity instances across the eight categories used above: Person, Facility, Organization, Location, Product, Event, Date, and Law.
## Citation Information

```bibtex
@misc{article_id,
  title         = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
  author        = {Guillem Cadevall Ferreres and Marc Bardeli Gámez and Marc Serrano Sanz and Pol Gerdt Basuillas and Francesc Tarres Ruiz and Raul Quijada Ferrero},
  year          = {2025},
  archivePrefix = {arXiv},
  url           = {https://github.com/ugiat/NERCat/blob/main/Catalan_GLiNER_Paper.pdf}
}
```