NERCat Classifier

Model Overview

The NERCat classifier is a fine-tuned version of the GLiNER Knowledgator model, designed specifically for Named Entity Recognition (NER) in the Catalan language. By leveraging a manually annotated dataset of Catalan-language television transcriptions, this classifier significantly improves the recognition of named entities across diverse categories, addressing the challenges posed by the scarcity of high-quality training data for Catalan.

The pre-trained version used for fine-tuning was: knowledgator/gliner-bi-large-v1.0.

Quickstart

```python
import torch
from gliner import GLiNER

# Load the fine-tuned model and move it to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("Ugiat/NERCat").to(device)

text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

# The eight entity categories the model was fine-tuned on
labels = [
    "Person",
    "Facility",
    "Organization",
    "Location",
    "Product",
    "Event",
    "Date",
    "Law",
]

entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

Performance Evaluation

We evaluated the fine-tuned NERCat classifier against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. NERCat improves recall and F1 in every category; precision also improves everywhere except Date, where it dips slightly (−0.13) while recall rises from 0.07 to 1.00:

| Entity Type  | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1  |
|--------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person       | 1.00             | 1.00          | 1.00      | 0.92             | 0.80          | 0.86      | +0.08       | +0.20    | +0.14 |
| Facility     | 0.89             | 1.00          | 0.94      | 0.67             | 0.25          | 0.36      | +0.22       | +0.75    | +0.58 |
| Organization | 1.00             | 1.00          | 1.00      | 0.72             | 0.62          | 0.67      | +0.28       | +0.38    | +0.33 |
| Location     | 1.00             | 0.97          | 0.99      | 0.83             | 0.54          | 0.66      | +0.17       | +0.43    | +0.33 |
| Product      | 0.96             | 1.00          | 0.98      | 0.63             | 0.21          | 0.31      | +0.34       | +0.79    | +0.67 |
| Event        | 0.88             | 0.88          | 0.88      | 0.60             | 0.38          | 0.46      | +0.28       | +0.50    | +0.41 |
| Date         | 0.88             | 1.00          | 0.93      | 1.00             | 0.07          | 0.13      | −0.13       | +0.93    | +0.80 |
| Law          | 0.67             | 1.00          | 0.80      | 0.00             | 0.00          | 0.00      | +0.67       | +1.00    | +0.80 |
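As an illustration of how such per-category scores can be computed, the sketch below does an exact-match comparison of predicted vs. gold (span, label) pairs. The `score_category` helper and the example data are hypothetical, not the actual evaluation code or dataset.

```python
# Illustrative per-category precision/recall/F1 over exact-match
# (span, label) pairs. Helper and data are hypothetical examples.

def score_category(gold, pred, label):
    """Precision, recall, and F1 for one entity label."""
    gold_l = {e for e in gold if e[1] == label}
    pred_l = {e for e in pred if e[1] == label}
    tp = len(gold_l & pred_l)  # true positives: exact span + label match
    precision = tp / len(pred_l) if pred_l else 0.0
    recall = tp / len(gold_l) if gold_l else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Universitat de Barcelona", "Organization"), ("Catalunya", "Location")}
pred = {("Universitat de Barcelona", "Organization")}
print(score_category(gold, pred, "Organization"))  # (1.0, 1.0, 1.0)
print(score_category(gold, pred, "Location"))      # (0.0, 0.0, 0.0)
```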

Fine-Tuning Process

The fine-tuning process followed a structured approach, including dataset preparation, model training, and optimization:

  • Data Splitting: The dataset was shuffled and split into training (90%) and testing (10%) subsets.
  • Training Setup:
    • Batch size: 8
    • Steps: 500
    • Loss function: Focal loss (α = 0.75, γ = 2) to address class imbalances
    • Learning rates:
      • Entity layers: $5 \times 10^{-6}$
      • Other model parameters: $1 \times 10^{-5}$
    • Scheduler: Linear with a warmup ratio of 0.1
    • Evaluation frequency: Every 100 steps
    • Checkpointing: Every 1000 steps
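The focal loss named above can be sketched in PyTorch as follows. This is a minimal illustration of the formulation with α = 0.75 and γ = 2, not the loss implementation used inside the GLiNER training code.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Binary focal loss: down-weights easy examples via (1 - p_t)^gamma
    and re-balances positives vs. negatives via alpha."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([2.0, -1.0, 0.5])   # toy predictions
targets = torch.tensor([1.0, 0.0, 1.0])   # toy binary labels
print(focal_loss(logits, targets))
```

With γ = 0 and α = 0.5 the expression reduces to a scaled binary cross-entropy, which is a quick sanity check on the implementation.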

The dataset included 13,732 named entity instances across the eight categories listed in the Quickstart above.

Citation Information

@misc{article_id,
  title         = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
  author        = {Guillem Cadevall Ferreres and Marc Bardeli Gámez and Marc Serrano Sanz and Pol Gerdt Basuillas and Francesc Tarres Ruiz and Raul Quijada Ferrero},
  year          = {2025},
  archivePrefix = {arXiv},
  url           = {https://github.com/ugiat/NERCat/blob/main/Catalan_GLiNER_Paper.pdf}
}