---
license: mit
---
# EU Organization Classifier
A multilingual transformer model fine-tuned for classifying European organizations into standardized categories based on EU funding database schemas.
## Model Description
This model is fine-tuned from `intfloat/multilingual-e5-large` to classify organizations from European Union funding databases (CORDIS, Erasmus+, LIFE, Creative Europe) into five standardized organization types:
- **PUB**: Public bodies (governments, municipalities, agencies)
- **HES**: Higher Education Sector (universities, schools, educational institutions)
- **REC**: Research Organizations (research institutes, laboratories, R&D centers)
- **PRC**: Private Companies (businesses, SMEs, corporations)
- **OTH**: Other Organizations (NGOs, associations, foundations, cultural organizations)
## Training Data
The model was trained on **~140,000 organization names** from multiple EU funding databases:
- **CORDIS**: European research and innovation database
- **Erasmus+**: European education and training programs
- **LIFE**: European environment and climate action program
- **Creative Europe**: European cultural and creative sector programs
**Note**: INTERREG data was excluded from training because of data quality issues, but the model can still be applied to classify INTERREG organizations.
## Performance
- **Overall Accuracy**: approximately 85% on held-out test data
- **Multilingual Support**: Trained on organization names in 20+ European languages
- **Domain Expertise**: Specialized for European institutional and funding contexts
## Usage
### Direct Classification
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "SIRIS-Lab/org_type_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Organization types mapping
LABELS = {0: 'PUB', 1: 'HES', 2: 'REC', 3: 'PRC', 4: 'OTH'}

def classify_organization(org_name):
    # Tokenize input
    inputs = tokenizer(org_name, return_tensors="pt", truncation=True, max_length=128)
    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=-1).item()
        confidence = torch.softmax(outputs.logits, dim=-1).max().item()
    return LABELS[prediction], confidence

# Examples
examples = [
    "Universität Wien",
    "Ministry of Education and Science",
    "ACME Technology Solutions Ltd",
    "European Research Council",
    "Greenpeace International"
]

for org in examples:
    # Converting the organization name to UPPERCASE is recommended,
    # since this is the most frequent format in EU databases.
    org_type, confidence = classify_organization(org.upper())
    print(f"{org} → {org_type} (confidence: {confidence:.3f})")
```
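The example above uppercases each name before classification. A minimal sketch of a normalization step that could run before the classifier is shown below. The helper name `normalize_org_name` and the whitespace handling are assumptions for illustration; the model card itself only recommends uppercasing.

```python
import re

def normalize_org_name(name: str) -> str:
    """Hypothetical helper: trim, collapse runs of whitespace, then uppercase."""
    collapsed = re.sub(r"\s+", " ", name.strip())
    return collapsed.upper()

print(normalize_org_name("  Universität   Wien "))
```

The normalized string can then be passed to `classify_organization` in place of `org.upper()`.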
### Batch Processing
```python
# Reuses `tokenizer`, `model`, and `LABELS` from the Direct Classification snippet.
def classify_organizations_batch(org_names, batch_size=32):
    results = []
    for i in range(0, len(org_names), batch_size):
        batch = org_names[i:i+batch_size]
        # Tokenize batch
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True, max_length=128)
        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)
            confidences = torch.softmax(outputs.logits, dim=-1).max(dim=-1).values
        # Convert to labels
        for pred, conf in zip(predictions, confidences):
            results.append((LABELS[pred.item()], conf.item()))
    return results
```
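The batch function returns one `(label, confidence)` tuple per input name. One possible way to use the confidence is to set aside uncertain predictions for manual review, sketched below. The function name `split_by_confidence`, the `0.7` threshold, and the sample values are all hypothetical; softmax confidences may not be well calibrated, so any threshold would need to be tuned on labeled data.

```python
def split_by_confidence(org_names, results, threshold=0.7):
    """Split predictions into accepted and needs-review lists.

    `results` is a list of (label, confidence) tuples in the same order
    as `org_names`. The threshold is an illustrative assumption.
    """
    accepted, needs_review = [], []
    for name, (label, conf) in zip(org_names, results):
        target = accepted if conf >= threshold else needs_review
        target.append((name, label, conf))
    return accepted, needs_review

# Made-up predictions, standing in for real model output
names = ["UNIVERSITÄT WIEN", "ACME TECHNOLOGY SOLUTIONS LTD"]
fake_results = [("HES", 0.97), ("OTH", 0.52)]

accepted, needs_review = split_by_confidence(names, fake_results)
print(accepted)
print(needs_review)
```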
## Use Cases
### EU Funding Analysis
- Classify beneficiary organizations in EU funding databases
- Analyze funding distribution across organization types
- Identify research collaboration patterns
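As a rough illustration of analyzing funding distribution, the sketch below sums funding per predicted organization type and reports each type's share. The record fields `org_type` and `funding_eur` and the sample amounts are assumptions made up for this example, not a schema from any EU database.

```python
from collections import defaultdict

def funding_by_type(records):
    """Return {org_type: (total_amount, share_of_total)} for classified records."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["org_type"]] += rec["funding_eur"]
    grand_total = sum(totals.values())
    return {
        org_type: (amount, amount / grand_total if grand_total else 0.0)
        for org_type, amount in totals.items()
    }

# Made-up records; org_type would come from the classifier
sample_records = [
    {"org_type": "HES", "funding_eur": 100000.0},
    {"org_type": "PRC", "funding_eur": 50000.0},
    {"org_type": "HES", "funding_eur": 50000.0},
]

distribution = funding_by_type(sample_records)
for org_type, (amount, share) in distribution.items():
    print(f"{org_type}: {amount:.0f} EUR ({share:.0%})")
```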
### Organization Deduplication
- Standardize organization types for entity resolution
- Improve clustering of similar organizations across databases
- Enhance data quality in multi-source datasets
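One way the predicted type could support entity resolution is as a blocking key, so that names are only compared with other names of the same type. The sketch below generates candidate pairs that way. The function name `candidate_pairs` and the sample records are hypothetical, and a misclassified name would land in the wrong block, so this trades some recall for fewer comparisons.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records):
    """Yield comparison candidates only within the same predicted org type.

    `records` is a list of (name, org_type) tuples.
    """
    blocks = defaultdict(list)
    for name, org_type in records:
        blocks[org_type].append(name)

    pairs = []
    for org_type, names in blocks.items():
        # Deduplicate exact matches, then pair up the remaining names
        for a, b in combinations(sorted(set(names)), 2):
            pairs.append((org_type, a, b))
    return pairs

# Made-up classified names
sample = [
    ("UNIVERSITAT WIEN", "HES"),
    ("UNIVERSITY OF VIENNA", "HES"),
    ("ACME LTD", "PRC"),
]

pairs = candidate_pairs(sample)
print(pairs)
```

Each candidate pair would still need a similarity check (string distance, address match, and so on) before being merged.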
### Institutional Research
- Study European research and innovation ecosystems
- Analyze public-private collaboration networks
- Map educational and research infrastructure
## Languages Supported
The model handles organization names in multiple European languages including:
- English, German, French, Spanish, Italian
- Dutch, Portuguese, Polish, Czech, Hungarian
## Performance Metrics
### Classification Report
| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **PUB** | 0.78 | 0.75 | 0.76 | 1941 |
| **HES** | 0.88 | 0.89 | 0.89 | 15507 |
| **REC** | 0.87 | 0.90 | 0.89 | 7734 |
| **PRC** | 0.72 | 0.67 | 0.69 | 2175 |
| **OTH** | 0.59 | 0.45 | 0.51 | 1180 |
**Overall Accuracy**: 85%
### Confusion Matrix
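Rows are true classes and columns are predicted classes. The classes are listed in the order in which each row total matches the Support column of the classification report.
```
        PRC    PUB    OTH    REC    HES
PRC    1457     65     65     80    508
PUB      38   1447     72     60    324
OTH      83     44    535    129    389
REC      62     22     37   6977    636
HES     384    283    204    773  13863
```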
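The per-class figures in the classification report can be recomputed from the confusion matrix counts, reading rows as true classes and columns as predictions. The sketch below does this in plain Python. The class order used here (PRC, PUB, OTH, REC, HES) is an assumption: it is the assignment under which each row sum equals the corresponding Support value in the classification report.

```python
# Assumed row/column order: chosen so row sums match the reported supports
labels = ["PRC", "PUB", "OTH", "REC", "HES"]
matrix = [
    [1457, 65, 65, 80, 508],
    [38, 1447, 72, 60, 324],
    [83, 44, 535, 129, 389],
    [62, 22, 37, 6977, 636],
    [384, 283, 204, 773, 13863],
]

def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1, support)} from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        support = sum(matrix[i])                   # row sum: true instances
        predicted = sum(row[i] for row in matrix)  # column sum: predicted instances
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1, support)
    return metrics

metrics = per_class_metrics(matrix, labels)
total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

for label, (p, r, f1, support) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f}")
```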
## License
This model is released under the [MIT License](LICENSE).