EU Organization Classifier

A multilingual transformer model fine-tuned for classifying European organizations into standardized categories based on EU funding database schemas.

Model Description

This model is fine-tuned from intfloat/multilingual-e5-large to classify organizations from European Union funding databases (CORDIS, Erasmus+, LIFE, Creative Europe) into five standardized organization types:

  • PUB: Public bodies (governments, municipalities, agencies)
  • HES: Higher Education Sector (universities, schools, educational institutions)
  • REC: Research Organizations (research institutes, laboratories, R&D centers)
  • PRC: Private Companies (businesses, SMEs, corporations)
  • OTH: Other Organizations (NGOs, associations, foundations, cultural organizations)

Training Data

The model was trained on ~140,000 organization names from multiple EU funding databases:

  • CORDIS: European research and innovation database
  • Erasmus+: European education and training programs
  • LIFE: European environment and climate action program
  • Creative Europe: European cultural and creative sector programs

Note: INTERREG data was excluded from training because of data quality issues, but the model can still be applied to INTERREG organization names.

Performance

  • Overall Accuracy: 85% on held-out test data (per-class results are listed under Performance Metrics)
  • Multilingual Support: Trained on organization names in 20+ European languages
  • Domain Expertise: Specialized for European institutional and funding contexts

Usage

Direct Classification

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "SIRIS-Lab/org_type_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

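# Organization type mapping (class index -> label code).
# The index order is taken from this card; compare it with
# model.config.id2label before relying on it.
LABELS = {0: 'PUB', 1: 'HES', 2: 'REC', 3: 'PRC', 4: 'OTH'}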

def classify_organization(org_name):
    """Return (label code, confidence) for a single organization name."""
    # Tokenize input
    inputs = tokenizer(org_name, return_tensors="pt", truncation=True, max_length=128)

    # Run inference without tracking gradients
    with torch.no_grad():
        logits = model(**inputs).logits

    # Softmax turns logits into probabilities; the largest one is the confidence
    probs = torch.softmax(logits, dim=-1)
    confidence, prediction = probs.max(dim=-1)

    return LABELS[prediction.item()], confidence.item()

# Examples
examples = [
    "Universität Wien",
    "Ministry of Education and Science",
    "ACME Technology Solutions Ltd",
    "European Research Council",
    "Greenpeace International"
]

for org in examples:
    org_type, confidence = classify_organization(org)
    print(f"{org} -> {org_type} (confidence: {confidence:.3f})")

Batch Processing

def classify_organizations_batch(org_names, batch_size=32):
    results = []

    for i in range(0, len(org_names), batch_size):
        batch = org_names[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, 
                          padding=True, max_length=128)

        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)
            confidences = torch.softmax(outputs.logits, dim=-1).max(dim=-1).values

        # Convert to labels
        for pred, conf in zip(predictions, confidences):
            results.append((LABELS[pred.item()], conf.item()))

    return results
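The batch function returns one (label, confidence) pair per name. As a minimal sketch of how those pairs might be used, the snippet below routes low-confidence predictions to manual review. The helper `split_by_confidence` and the 0.60 cutoff are hypothetical and not part of the model; the sample values are made up to show the data shape, not real model output.

```python
# Hypothetical cutoff; tune it on your own validation data.
REVIEW_THRESHOLD = 0.60

def split_by_confidence(org_names, results, threshold=REVIEW_THRESHOLD):
    """Split (name, label, confidence) rows into accepted and needs-review lists."""
    accepted, review = [], []
    for name, (label, conf) in zip(org_names, results):
        row = (name, label, conf)
        if conf >= threshold:
            accepted.append(row)
        else:
            review.append(row)
    return accepted, review

# Made-up values in the same (label, confidence) shape that
# classify_organizations_batch returns; not real model output.
names = ["Universität Wien", "Stichting Voorbeeld", "ACME Technology Solutions Ltd"]
results = [("HES", 0.97), ("OTH", 0.41), ("PRC", 0.88)]

accepted, review = split_by_confidence(names, results)
print(len(accepted), "accepted;", len(review), "flagged for review")
```

Given the lower per-class scores reported for some classes, a review step of this kind may be worth considering before using predictions in downstream statistics.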

Use Cases

EU Funding Analysis

  • Classify beneficiary organizations in EU funding databases
  • Analyze funding distribution across organization types
  • Identify research collaboration patterns
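Once each beneficiary has a predicted type, the funding distribution can be summarized with a simple aggregation. The sketch below assumes records shaped as dictionaries with hypothetical `org_type` and `funding_eur` fields; the amounts are illustrative and do not come from any EU database.

```python
def funding_by_type(records):
    """Count beneficiaries and sum funding per predicted organization type."""
    totals = {}
    for rec in records:
        bucket = totals.setdefault(rec["org_type"], {"count": 0, "funding": 0.0})
        bucket["count"] += 1
        bucket["funding"] += rec["funding_eur"]
    return totals

# Illustrative records; field names and amounts are assumptions.
records = [
    {"name": "Universität Wien", "org_type": "HES", "funding_eur": 250000.0},
    {"name": "ACME Technology Solutions Ltd", "org_type": "PRC", "funding_eur": 120000.0},
    {"name": "Example Polytechnic", "org_type": "HES", "funding_eur": 80000.0},
]

summary = funding_by_type(records)
grand_total = sum(b["funding"] for b in summary.values())

for org_type, bucket in sorted(summary.items()):
    share = bucket["funding"] / grand_total
    print(f"{org_type}: {bucket['count']} beneficiaries, {share:.1%} of funding")
```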

Organization Deduplication

  • Standardize organization types for entity resolution
  • Improve clustering of similar organizations across databases
  • Enhance data quality in multi-source datasets
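One way the predicted type could support entity resolution is as a blocking key, so that only names with the same type are compared. The following is a rough sketch under that assumption; `normalize` and `block_by_type` are hypothetical helpers, the normalizer is deliberately crude, and the rows are illustrative.

```python
import unicodedata
from collections import defaultdict

def normalize(name):
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())

def block_by_type(rows):
    """Group rows by (predicted type, normalized name)."""
    blocks = defaultdict(list)
    for source, name, org_type in rows:
        blocks[(org_type, normalize(name))].append((source, name))
    return blocks

# Illustrative rows: (source database, raw name, predicted type).
rows = [
    ("CORDIS", "Universität Wien", "HES"),
    ("Erasmus+", "UNIVERSITAT  WIEN", "HES"),
    ("LIFE", "Greenpeace International", "OTH"),
]

blocks = block_by_type(rows)
# Blocks with more than one row are candidate duplicates across databases.
duplicates = {key: members for key, members in blocks.items() if len(members) > 1}
print(duplicates)
```

Exact matching on the normalized name is only a starting point; a fuzzy string comparison within each block would be a natural next step.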

Institutional Research

  • Study European research and innovation ecosystems
  • Analyze public-private collaboration networks
  • Map educational and research infrastructure

Languages Supported

The model handles organization names in multiple European languages including:

  • English, German, French, Spanish, Italian
  • Dutch, Portuguese, Polish, Czech, Hungarian

Performance Metrics

Classification Report

Class   Precision   Recall   F1-Score   Support
PUB          0.78     0.75       0.76      1941
HES          0.88     0.89       0.89     15507
REC          0.87     0.90       0.89      7734
PRC          0.72     0.67       0.69      2175
OTH          0.59     0.45       0.51      1180

Overall Accuracy: 85%
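Because the classes are imbalanced, the overall figure is dominated by the largest classes. As a small worked example, the snippet below computes the macro-averaged and support-weighted F1 from the per-class values in the classification report above; the averages are derived here and are not figures reported by the model authors.

```python
# Per-class (F1, support) copied from the classification report.
report = {
    "PUB": (0.76, 1941),
    "HES": (0.89, 15507),
    "REC": (0.89, 7734),
    "PRC": (0.69, 2175),
    "OTH": (0.51, 1180),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"test set size: {total_support}")
print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The macro average comes out lower than the weighted one, which reflects the weaker scores on the smaller classes.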

Confusion Matrix

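Rows are true classes and columns are predicted classes. The row totals (2175, 1941, 1180, 7734, 15507) equal the Support values of PRC, PUB, OTH, REC, and HES in the classification report, so the axis labels printed here do not follow the same class order as the report; check them against the model's label mapping before reading per-class errors from this matrix.

       PUB   HES   REC   PRC   OTH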
PUB  1457    65    65    80   508
HES    38  1447    72    60   324
REC    83    44   535   129   389
PRC    62    22    37  6977   636
OTH   384   283   204   773  13863
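The per-class metrics and the overall accuracy can be recomputed from the matrix itself. The sketch below assumes rows are true classes and columns are predicted classes, and it indexes classes by row position rather than by name so that the result does not depend on the axis labels.

```python
# Confusion matrix copied from above; assumed layout: rows = true class,
# columns = predicted class.
matrix = [
    [1457,   65,   65,   80,   508],
    [  38, 1447,   72,   60,   324],
    [  83,   44,  535,  129,   389],
    [  62,   22,   37, 6977,   636],
    [ 384,  283,  204,  773, 13863],
]

n = len(matrix)
row_totals = [sum(row) for row in matrix]  # true count per class (support)
col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]  # predicted count
correct = [matrix[i][i] for i in range(n)]  # diagonal = correct predictions

recall = [correct[i] / row_totals[i] for i in range(n)]
precision = [correct[i] / col_totals[i] for i in range(n)]
accuracy = sum(correct) / sum(row_totals)

for i in range(n):
    print(f"row {i}: support={row_totals[i]:5d}  "
          f"precision={precision[i]:.2f}  recall={recall[i]:.2f}")
print(f"accuracy={accuracy:.3f}")
```

Comparing the printed support values with the Support column of the classification report is a quick way to confirm which row corresponds to which class.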

License

This model is released under the MIT License.
