EU Organization Classifier

A multilingual transformer model fine-tuned for classifying European organizations into standardized categories based on EU funding database schemas.

Model Description

This model is fine-tuned from intfloat/multilingual-e5-large to classify organizations from European Union funding databases (CORDIS, Erasmus+, LIFE, Creative Europe) into five standardized organization types:

PUB: Public bodies (governments, municipalities, agencies)
HES: Higher Education Sector (universities, schools, educational institutions)
REC: Research Organizations (research institutes, laboratories, R&D centers)
PRC: Private Companies (businesses, SMEs, corporations)
OTH: Other Organizations (NGOs, associations, foundations, cultural organizations)

Training Data

The model was trained on ~140,000 organization names from multiple EU funding databases:

CORDIS: European research and innovation database
Erasmus+: European education and training programs
LIFE: European environment and climate action program
Creative Europe: European cultural and creative sector programs

Note: INTERREG data was excluded from training due to data quality issues, but the model can still be used to classify INTERREG organizations.

Performance

Overall Accuracy: about 85% on held-out test data (per-class results are listed under Performance Metrics)
Multilingual Support: Trained on organization names in 20+ European languages
Domain Expertise: Specialized for European institutional and funding contexts

Usage

Direct Classification

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "SIRIS-Lab/org_type_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Organization types mapping
LABELS = {0: 'PUB', 1: 'HES', 2: 'REC', 3: 'PRC', 4: 'OTH'}

def classify_organization(org_name):
    """Return (label, confidence) for a single organization name."""
    # Tokenize input
    inputs = tokenizer(org_name, return_tensors="pt", truncation=True, max_length=128)

    # Get prediction; confidence is the softmax probability of the predicted class
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        confidence, prediction = probs.max(dim=-1)

    return LABELS[prediction.item()], confidence.item()

# Examples
examples = [
    "Universität Wien",
    "Ministry of Education and Science",
    "ACME Technology Solutions Ltd",
    "European Research Council",
    "Greenpeace International"
]

for org in examples:
    org_type, confidence = classify_organization(org)
    print(f"{org} → {org_type} (confidence: {confidence:.3f})")
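The hard-coded LABELS dictionary assumes a particular index order. As a cautious sketch, the helper below prefers the id2label mapping stored in the model configuration when it holds real class names, and falls back to the hard-coded mapping otherwise. The name resolve_labels is hypothetical, and whether this checkpoint ships a populated id2label is an assumption to verify before relying on it.

```python
def resolve_labels(id2label, fallback):
    """Return id2label if it holds real class names, else the fallback mapping."""
    if not id2label:
        return dict(fallback)
    names = [str(name) for name in id2label.values()]
    # transformers fills in placeholder names such as "LABEL_0" when none are saved
    if all(name.startswith("LABEL_") for name in names):
        return dict(fallback)
    # Normalize keys to int, since JSON configs may store them as strings
    return {int(idx): str(name) for idx, name in id2label.items()}


FALLBACK = {0: "PUB", 1: "HES", 2: "REC", 3: "PRC", 4: "OTH"}

# Placeholder names: keep the hard-coded mapping.
print(resolve_labels({0: "LABEL_0", 1: "LABEL_1"}, FALLBACK))
# Real names (hypothetical order, for illustration only): trust the configuration.
print(resolve_labels({"0": "PRC", "1": "PUB"}, FALLBACK))
```

With the model loaded as above, this could be applied as LABELS = resolve_labels(model.config.id2label, LABELS).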

Batch Processing

def classify_organizations_batch(org_names, batch_size=32):
    """Classify a list of names; reuses tokenizer, model, and LABELS from above."""
    results = []

    for i in range(0, len(org_names), batch_size):
        batch = org_names[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, 
                          padding=True, max_length=128)

        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)
            confidences = torch.softmax(outputs.logits, dim=-1).max(dim=-1).values

        # Convert to labels
        for pred, conf in zip(predictions, confidences):
            results.append((LABELS[pred.item()], conf.item()))

    return results
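Both functions return a softmax confidence alongside the label. Because some classes score noticeably lower than others (see Performance Metrics), it may be worth routing low-confidence predictions to manual review. The sketch below assumes the (label, confidence) tuples returned by classify_organizations_batch. The helper name split_by_confidence and the 0.7 threshold are illustrative choices, not values tuned for this model, and softmax scores are not guaranteed to be calibrated probabilities.

```python
def split_by_confidence(org_names, results, threshold=0.7):
    """Split (label, confidence) results into accepted and needs-review lists."""
    accepted, review = [], []
    for name, (label, confidence) in zip(org_names, results):
        target = accepted if confidence >= threshold else review
        target.append((name, label, confidence))
    return accepted, review


names = ["Universität Wien", "Greenpeace International"]
# Hypothetical outputs in the (label, confidence) format returned above
results = [("HES", 0.97), ("OTH", 0.52)]

accepted, review = split_by_confidence(names, results)
print("accepted:", accepted)
print("review:", review)
```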

Use Cases

EU Funding Analysis

Classify beneficiary organizations in EU funding databases
Analyze funding distribution across organization types
Identify research collaboration patterns
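
As a rough illustration of the funding-distribution use case, the sketch below sums funding per predicted organization type and reports each type's share. The record layout (org_type, funding_eur), the helper name funding_by_type, and the amounts are all hypothetical. In practice org_type would come from the classifier and the amounts from the funding database.

```python
from collections import defaultdict


def funding_by_type(records):
    """Sum funding per predicted organization type and return (amount, share)."""
    totals = defaultdict(float)
    for record in records:
        totals[record["org_type"]] += record["funding_eur"]
    grand_total = sum(totals.values())
    return {
        org_type: (amount, amount / grand_total if grand_total else 0.0)
        for org_type, amount in totals.items()
    }


# Hypothetical records; org_type would be the classifier's prediction
records = [
    {"name": "Universität Wien", "org_type": "HES", "funding_eur": 300_000.0},
    {"name": "ACME Technology Solutions Ltd", "org_type": "PRC", "funding_eur": 100_000.0},
    {"name": "Example University", "org_type": "HES", "funding_eur": 100_000.0},
]

for org_type, (amount, share) in funding_by_type(records).items():
    print(f"{org_type}: {amount:,.0f} EUR ({share:.0%})")
```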

Organization Deduplication

Standardize organization types for entity resolution
Improve clustering of similar organizations across databases
Enhance data quality in multi-source datasets
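
One way the predicted type might help entity resolution is as a blocking key: only names that share a type are compared, which shrinks the number of candidate pairs. The sketch below is a minimal version of that idea. The helper name candidate_pairs and the sample names are hypothetical, and the actual similarity comparison is left out.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(classified):
    """Yield name pairs that share a predicted type (a simple blocking key)."""
    blocks = defaultdict(list)
    for name, org_type in classified:
        blocks[org_type].append(name)
    for names in blocks.values():
        yield from combinations(names, 2)


# Hypothetical (name, predicted type) tuples
classified = [
    ("Universität Wien", "HES"),
    ("University of Vienna", "HES"),
    ("ACME Technology Solutions Ltd", "PRC"),
]

pairs = list(candidate_pairs(classified))
print(pairs)
```

A misclassified name ends up in the wrong block and is never compared with its true duplicates, so hard blocking on the predicted type trades some recall for speed.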

Institutional Research

Study European research and innovation ecosystems
Analyze public-private collaboration networks
Map educational and research infrastructure

Languages Supported

The model handles organization names in multiple European languages including:

English, German, French, Spanish, Italian
Dutch, Portuguese, Polish, Czech, Hungarian

Performance Metrics

Classification Report

Class	Precision	Recall	F1-Score	Support
PUB	0.78	0.75	0.76	1941
HES	0.88	0.89	0.89	15507
REC	0.87	0.90	0.89	7734
PRC	0.72	0.67	0.69	2175
OTH	0.59	0.45	0.51	1180

Overall Accuracy: 85%

Confusion Matrix

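Rows are true classes; columns are predicted classes. Row totals match the Support column of the classification report.

       PRC   PUB   OTH   REC   HES
PRC  1457    65    65    80   508
PUB    38  1447    72    60   324
OTH    83    44   535   129   389
REC    62    22    37  6977   636
HES   384   283   204   773  13863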
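
The matrix can be cross-checked against the classification report. Assuming rows are true classes and columns are predictions, each row total should equal one class's Support, the diagonal over the row total gives recall, and the diagonal over the column total gives precision. The sketch below matches each row to a class by its total rather than relying on the printed header. The numbers are copied from the tables in this section.

```python
# Values copied from the confusion matrix and the Support column
matrix = [
    [1457, 65, 65, 80, 508],
    [38, 1447, 72, 60, 324],
    [83, 44, 535, 129, 389],
    [62, 22, 37, 6977, 636],
    [384, 283, 204, 773, 13863],
]
support = {"PUB": 1941, "HES": 15507, "REC": 7734, "PRC": 2175, "OTH": 1180}

# Match each row to the class whose support equals the row total
by_support = {count: label for label, count in support.items()}
row_labels = [by_support.get(sum(row)) for row in matrix]

# Accuracy is the diagonal sum over the grand total
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

for i, (label, row) in enumerate(zip(row_labels, matrix)):
    recall = row[i] / sum(row)
    precision = row[i] / sum(r[i] for r in matrix)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.4f}")
```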

License

This model is released under the MIT License.