# NERCat Classifier

## Model Overview
The NERCat classifier is a fine-tuned version of the GLiNER Knowledgator model, designed specifically for Named Entity Recognition (NER) in the Catalan language. By leveraging a manually annotated dataset of Catalan-language television transcriptions, this classifier significantly improves the recognition of named entities across diverse categories, addressing the challenges posed by the scarcity of high-quality training data for Catalan.
The pre-trained version used for fine-tuning was `knowledgator/gliner-bi-large-v1.0`.
## Quickstart
```python
import torch
from gliner import GLiNER

# Load the fine-tuned model and move it to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("Ugiat/NERCat").to(device)

text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

labels = [
    "Person",
    "Facility",
    "Organization",
    "Location",
    "Product",
    "Event",
    "Date",
    "Law",
]

entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```
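Each item returned by `predict_entities` is a dictionary carrying the matched span, its predicted label, and a confidence score. As a minimal sketch of post-filtering by confidence (the dictionary shape shown mirrors standard GLiNER output; the helper function and the sample values are illustrative, not part of NERCat):

```python
def filter_entities(entities, min_score=0.7):
    """Keep only predictions whose confidence score meets min_score."""
    return [e for e in entities if e["score"] >= min_score]

# Mock predictions with the same shape as GLiNER output:
sample = [
    {"text": "Universitat de Barcelona", "label": "Organization", "score": 0.94},
    {"text": "Catalunya", "label": "Location", "score": 0.55},
]
print(filter_entities(sample, min_score=0.7))
```

Raising the cutoff above the `threshold=0.5` used at prediction time trades recall for precision without re-running the model.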
## Performance Evaluation

We evaluated the fine-tuned NERCat model against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. NERCat improves recall and F1 in every entity category; precision also improves everywhere except Date, where it drops slightly:
| Entity Type  | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1  |
|--------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person       | 1.00             | 1.00          | 1.00      | 0.92             | 0.80          | 0.86      | +0.08       | +0.20    | +0.14 |
| Facility     | 0.89             | 1.00          | 0.94      | 0.67             | 0.25          | 0.36      | +0.22       | +0.75    | +0.58 |
| Organization | 1.00             | 1.00          | 1.00      | 0.72             | 0.62          | 0.67      | +0.28       | +0.38    | +0.33 |
| Location     | 1.00             | 0.97          | 0.99      | 0.83             | 0.54          | 0.66      | +0.17       | +0.43    | +0.33 |
| Product      | 0.96             | 1.00          | 0.98      | 0.63             | 0.21          | 0.31      | +0.34       | +0.79    | +0.67 |
| Event        | 0.88             | 0.88          | 0.88      | 0.60             | 0.38          | 0.46      | +0.28       | +0.50    | +0.41 |
| Date         | 0.88             | 1.00          | 0.93      | 1.00             | 0.07          | 0.13      | -0.13       | +0.93    | +0.80 |
| Law          | 0.67             | 1.00          | 0.80      | 0.00             | 0.00          | 0.00      | +0.67       | +1.00    | +0.80 |
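The F1 columns are the standard harmonic mean of precision and recall. As a quick sanity check against the table (using the Facility row), the relationship can be computed directly:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Facility row: NERCat precision 0.89, recall 1.00
print(round(f1_score(0.89, 1.00), 2))  # → 0.94
```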
## Fine-Tuning Process

The fine-tuning process followed a structured approach covering dataset preparation, model training, and optimization:

- Data splitting: the dataset was shuffled and split into training (90%) and testing (10%) subsets.
- Training setup:
  - Batch size: 8
  - Steps: 500
  - Loss function: focal loss (α = 0.75, γ = 2) to address class imbalance
  - Learning rates:
    - Entity layers: $5 \times 10^{-6}$
    - Other model parameters: $1 \times 10^{-5}$
  - Scheduler: linear, with a warmup ratio of 0.1
  - Evaluation frequency: every 100 steps
  - Checkpointing: every 1000 steps
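The focal loss and the split learning rates above can be sketched in PyTorch. This is an illustrative re-implementation of the standard binary focal loss formula with the stated α and γ, not the exact training code; the parameter-group attribute names in the commented optimizer setup are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights easy
    examples, which helps with the class imbalance noted above."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Two learning rates via optimizer parameter groups; `entity_layers` and
# `backbone` are placeholder attribute names, not real GLiNER internals:
# optimizer = torch.optim.AdamW([
#     {"params": model.entity_layers.parameters(), "lr": 5e-6},
#     {"params": model.backbone.parameters(),      "lr": 1e-5},
# ])
```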
The dataset included 13,732 named entity instances across the eight categories used above: Person, Facility, Organization, Location, Product, Event, Date, and Law.
## Citation Information

```bibtex
@misc{article_id,
  title         = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
  author        = {Guillem Cadevall Ferreres and Marc Bardeli Gámez and Marc Serrano Sanz and Pol Gerdt Basuillas and Francesc Tarres Ruiz and Raul Quijada Ferrero},
  year          = {2025},
  archivePrefix = {arXiv},
  url           = {https://github.com/ugiat/NERCat/blob/main/Catalan_GLiNER_Paper.pdf}
}
```