EU Organization Classifier
A multilingual transformer model fine-tuned for classifying European organizations into standardized categories based on EU funding database schemas.
Model Description
This model is fine-tuned from intfloat/multilingual-e5-large to classify organizations from European Union funding databases (CORDIS, Erasmus+, LIFE, Creative Europe) into five standardized organization types:
- PUB: Public bodies (governments, municipalities, agencies)
- HES: Higher Education Sector (universities, schools, educational institutions)
- REC: Research Organizations (research institutes, laboratories, R&D centers)
- PRC: Private Companies (businesses, SMEs, corporations)
- OTH: Other Organizations (NGOs, associations, foundations, cultural organizations)
Training Data
The model was trained on ~140,000 organization names from multiple EU funding databases:
- CORDIS: European research and innovation database
- Erasmus+: European education and training programs
- LIFE: European environment and climate action program
- Creative Europe: European cultural and creative sector programs
Note: INTERREG data was excluded from training because of data quality issues, but the model can still be applied to classify INTERREG organizations.
Performance
- Overall Accuracy: about 85% on held-out test data (per-class results are listed under Performance Metrics)
- Multilingual Support: Trained on organization names in 20+ European languages
- Domain Expertise: Specialized for European institutional and funding contexts
Usage
Direct Classification
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "SIRIS-Lab/org_type_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Organization types mapping
LABELS = {0: 'PUB', 1: 'HES', 2: 'REC', 3: 'PRC', 4: 'OTH'}
def classify_organization(org_name):
    # Tokenize input
    inputs = tokenizer(org_name, return_tensors="pt", truncation=True, max_length=128)
    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()
    confidence = torch.softmax(outputs.logits, dim=-1).max().item()
    return LABELS[prediction], confidence

# Examples
examples = [
    "Universität Wien",
    "Ministry of Education and Science",
    "ACME Technology Solutions Ltd",
    "European Research Council",
    "Greenpeace International"
]
for org in examples:
    org_type, confidence = classify_organization(org)
    print(f"{org} → {org_type} (confidence: {confidence:.3f})")
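The hard-coded LABELS dictionary above assumes a particular index order. Transformers checkpoints can also carry this mapping in `model.config.id2label`, which falls back to placeholder names such as `LABEL_0` when the author did not set it. The following is a minimal sketch of a helper that prefers the checkpoint's mapping when it holds real class names and otherwise keeps the hard-coded one. The name `resolve_labels` is hypothetical, and whether this particular checkpoint populates `id2label` is not confirmed here.

```python
def resolve_labels(id2label, fallback):
    """Return id2label if it holds real class names, else the fallback mapping.

    Hypothetical helper: `id2label` is expected to look like
    model.config.id2label, i.e. a dict from index to string name.
    """
    if not id2label:
        return dict(fallback)
    names = [str(name) for name in id2label.values()]
    # Placeholder names look like "LABEL_0", "LABEL_1", ...
    if all(name.startswith("LABEL_") for name in names):
        return dict(fallback)
    return {int(idx): str(name) for idx, name in id2label.items()}


FALLBACK_LABELS = {0: "PUB", 1: "HES", 2: "REC", 3: "PRC", 4: "OTH"}

# A placeholder mapping (unset config) falls back to the hard-coded labels.
placeholder = {i: f"LABEL_{i}" for i in range(5)}
print(resolve_labels(placeholder, FALLBACK_LABELS))
```

With the loaded model, this could be called as `LABELS = resolve_labels(model.config.id2label, LABELS)` before classifying.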
Batch Processing
def classify_organizations_batch(org_names, batch_size=32):
    results = []
    for i in range(0, len(org_names), batch_size):
        batch = org_names[i:i+batch_size]
        # Tokenize batch
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True, max_length=128)
        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)
        confidences = torch.softmax(outputs.logits, dim=-1).max(dim=-1).values
        # Convert to labels
        for pred, conf in zip(predictions, confidences):
            results.append((LABELS[pred.item()], conf.item()))
    return results
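Both functions return a softmax confidence alongside the label. One possible use is to route low-confidence predictions to manual review. The sketch below assumes results in the `(label, confidence)` format returned by `classify_organizations_batch`. The helper name `split_by_confidence` and the 0.6 threshold are illustrative assumptions, not values from the model authors.

```python
def split_by_confidence(org_names, results, threshold=0.6):
    """Split predictions into accepted and needs-review lists.

    `results` is a list of (label, confidence) tuples aligned with
    `org_names`. The default threshold is an arbitrary example value.
    """
    if len(org_names) != len(results):
        raise ValueError("org_names and results must have the same length")
    accepted, review = [], []
    for name, (label, confidence) in zip(org_names, results):
        target = accepted if confidence >= threshold else review
        target.append((name, label, confidence))
    return accepted, review


# Hand-written example values, not real model output.
names = ["Org A", "Org B", "Org C"]
preds = [("HES", 0.97), ("OTH", 0.41), ("PRC", 0.60)]
accepted, review = split_by_confidence(names, preds)
print(len(accepted), "accepted;", len(review), "for review")
```

A suitable threshold would need to be chosen on labelled data, since softmax scores are not guaranteed to be calibrated.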
Use Cases
EU Funding Analysis
- Classify beneficiary organizations in EU funding databases
- Analyze funding distribution across organization types
- Identify research collaboration patterns
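As a rough sketch of the funding-distribution analysis, predicted types can be aggregated with the standard library. The record layout (`name`, `funding_eur`) and the helper name `funding_by_type` are assumptions for illustration, and the figures are invented.

```python
from collections import defaultdict


def funding_by_type(records, predicted_types):
    """Sum funding per predicted organization type.

    `records` is a list of dicts with a hypothetical "funding_eur" key;
    `predicted_types` is a list of labels aligned with `records`.
    """
    totals = defaultdict(float)
    for record, org_type in zip(records, predicted_types):
        totals[org_type] += record["funding_eur"]
    grand_total = sum(totals.values())
    # Map each type to (amount, share of total); share is 0.0 if no funding.
    return {
        org_type: (amount, amount / grand_total if grand_total else 0.0)
        for org_type, amount in totals.items()
    }


# Invented example data; labels would normally come from the classifier.
records = [
    {"name": "Org A", "funding_eur": 300_000.0},
    {"name": "Org B", "funding_eur": 100_000.0},
    {"name": "Org C", "funding_eur": 100_000.0},
]
distribution = funding_by_type(records, ["HES", "PRC", "HES"])
for org_type, (amount, share) in sorted(distribution.items()):
    print(f"{org_type}: {amount:,.0f} EUR ({share:.0%})")
```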
Organization Deduplication
- Standardize organization types for entity resolution
- Improve clustering of similar organizations across databases
- Enhance data quality in multi-source datasets
Institutional Research
- Study European research and innovation ecosystems
- Analyze public-private collaboration networks
- Map educational and research infrastructure
Languages Supported
The model handles organization names in multiple European languages including:
- English, German, French, Spanish, Italian
- Dutch, Portuguese, Polish, Czech, Hungarian
Performance Metrics
Classification Report
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
PUB | 0.78 | 0.75 | 0.76 | 1941 |
HES | 0.88 | 0.89 | 0.89 | 15507 |
REC | 0.87 | 0.90 | 0.89 | 7734 |
PRC | 0.72 | 0.67 | 0.69 | 2175 |
OTH | 0.59 | 0.45 | 0.51 | 1180 |
Overall Accuracy: 85%
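The report lists per-class scores only. Macro and support-weighted averages can be derived from the table. Because the inputs are rounded to two decimals, the results are approximate. A minimal sketch:

```python
# Per-class F1 and support, copied from the classification report.
REPORT = {
    "PUB": (0.76, 1941),
    "HES": (0.89, 15507),
    "REC": (0.89, 7734),
    "PRC": (0.69, 2175),
    "OTH": (0.51, 1180),
}

total_support = sum(support for _, support in REPORT.values())
# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in REPORT.values()) / len(REPORT)
# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in REPORT.values()) / total_support

print(f"test examples: {total_support}")
print(f"macro F1:      {macro_f1:.2f}")
print(f"weighted F1:   {weighted_f1:.2f}")
```

The macro average (about 0.75) sits well below the weighted average (about 0.85) because the two largest classes, HES and REC, score highest while the smallest class, OTH, scores lowest.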
Confusion Matrix
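Rows are true classes and columns are predicted classes. The class order follows the row sums, which match the Support column of the classification report.
True \ Predicted | PRC | PUB | OTH | REC | HES |
---|---|---|---|---|---|
PRC | 1457 | 65 | 65 | 80 | 508 |
PUB | 38 | 1447 | 72 | 60 | 324 |
OTH | 83 | 44 | 535 | 129 | 389 |
REC | 62 | 22 | 37 | 6977 | 636 |
HES | 384 | 283 | 204 | 773 | 13863 |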
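The matrix can be cross-checked against the classification report. Assuming rows are true classes and columns are predictions, each row sum should equal one class's Support value, which identifies the class for that row regardless of how the rows are labelled. A minimal sketch that matches rows to classes this way and then recomputes precision, recall, and accuracy:

```python
# Matrix values in the order printed above (rows assumed to be true classes).
MATRIX = [
    [1457, 65, 65, 80, 508],
    [38, 1447, 72, 60, 324],
    [83, 44, 535, 129, 389],
    [62, 22, 37, 6977, 636],
    [384, 283, 204, 773, 13863],
]
# Support per class from the classification report.
SUPPORT = {"PUB": 1941, "HES": 15507, "REC": 7734, "PRC": 2175, "OTH": 1180}

# Identify each row's class by matching its row sum to a support value.
by_support = {count: label for label, count in SUPPORT.items()}
row_labels = [by_support[sum(row)] for row in MATRIX]

metrics = {}
for i, label in enumerate(row_labels):
    true_positive = MATRIX[i][i]
    predicted_total = sum(row[i] for row in MATRIX)  # column sum
    actual_total = sum(MATRIX[i])                    # row sum
    metrics[label] = (true_positive / predicted_total, true_positive / actual_total)

accuracy = sum(MATRIX[i][i] for i in range(len(MATRIX))) / sum(map(sum, MATRIX))

print("row order:", row_labels)
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The row sums match the Support column only in the order PRC, PUB, OTH, REC, HES, and with that order the recomputed precision and recall agree with the report.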
License
This model is released under the MIT License.