---
license: mit
---


# EU Organization Classifier

A multilingual transformer model fine-tuned for classifying European organizations into standardized categories based on EU funding database schemas.

## Model Description

This model is fine-tuned from `intfloat/multilingual-e5-large` to classify organizations from European Union funding databases (CORDIS, Erasmus+, LIFE, Creative Europe) into five standardized organization types:

- **PUB**: Public bodies (governments, municipalities, agencies)
- **HES**: Higher Education Sector (universities, schools, educational institutions)
- **REC**: Research Organizations (research institutes, laboratories, R&D centers)
- **PRC**: Private Companies (businesses, SMEs, corporations)
- **OTH**: Other Organizations (NGOs, associations, foundations, cultural organizations)

## Training Data

The model was trained on **~140,000 organization names** from multiple EU funding databases:

- **CORDIS**: European research and innovation database
- **Erasmus+**: European education and training programs
- **LIFE**: European environment and climate action program  
- **Creative Europe**: European cultural and creative sector programs

**Note**: INTERREG data was excluded from training due to data quality issues, but the model can still be applied to classify INTERREG organizations.

## Performance

- **Overall Accuracy**: ~85% on held-out test data (see the classification report below)
- **Multilingual Support**: Trained on organization names in 20+ European languages
- **Domain Expertise**: Specialized for European institutional and funding contexts

## Usage

### Direct Classification

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "SIRIS-Lab/org_type_classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Organization types mapping
LABELS = {0: 'PUB', 1: 'HES', 2: 'REC', 3: 'PRC', 4: 'OTH'}

def classify_organization(org_name):
    # Tokenize input
    inputs = tokenizer(org_name, return_tensors="pt", truncation=True, max_length=128)

    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=-1).item()
        confidence = torch.softmax(outputs.logits, dim=-1).max().item()

    return LABELS[prediction], confidence

# Examples
examples = [
    "Universität Wien",
    "Ministry of Education and Science",
    "ACME Technology Solutions Ltd",
    "European Research Council",
    "Greenpeace International"
]

for org in examples:
    # It is recommended to convert the organization name to UPPERCASE since this is the most frequent format in EU DBs.
    org_type, confidence = classify_organization(org.upper())
    print(f"{org} → {org_type} (confidence: {confidence:.3f})")
```
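
Alternatively, the high-level `pipeline` API can be used, as in the minimal sketch below. Note that the label strings it returns depend on the `id2label` mapping stored in the model config; if the config only contains generic ids (`LABEL_0`, ...), map them back with the `LABELS` dictionary above.

```python
from transformers import pipeline

# Minimal alternative using the Transformers text-classification pipeline.
classifier = pipeline("text-classification", model="SIRIS-Lab/org_type_classifier")

# Uppercasing is recommended, as noted above.
result = classifier("UNIVERSITÄT WIEN")
print(result)  # e.g. [{'label': ..., 'score': ...}]
```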

### Batch Processing

```python
# Reuses `tokenizer`, `model`, and `LABELS` from the snippet above.
def classify_organizations_batch(org_names, batch_size=32):
    results = []

    for i in range(0, len(org_names), batch_size):
        batch = org_names[i:i+batch_size]

        # Tokenize batch
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, 
                          padding=True, max_length=128)

        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)
            confidences = torch.softmax(outputs.logits, dim=-1).max(dim=-1).values

        # Convert to labels
        for pred, conf in zip(predictions, confidences):
            results.append((LABELS[pred.item()], conf.item()))

    return results
```
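
For example, assuming the same `tokenizer`, `model`, and `LABELS` objects defined above, the batch helper can be called like this (again uppercasing the names, as recommended above):

```python
org_names = ["Universität Wien", "ACME Technology Solutions Ltd", "Greenpeace International"]

results = classify_organizations_batch([name.upper() for name in org_names])
for name, (org_type, confidence) in zip(org_names, results):
    print(f"{name} → {org_type} (confidence: {confidence:.3f})")
```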

## Use Cases

### EU Funding Analysis
- Classify beneficiary organizations in EU funding databases
- Analyze funding distribution across organization types
- Identify research collaboration patterns

### Organization Deduplication  
- Standardize organization types for entity resolution (see the sketch below)
- Improve clustering of similar organizations across databases
- Enhance data quality in multi-source datasets
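
As a minimal sketch of the deduplication use case (assuming the `classify_organizations_batch` helper defined above and a hypothetical list of `(id, name)` records), the predicted type can serve as a blocking key so that candidate pairs are only formed between organizations of the same type:

```python
from collections import defaultdict

# Hypothetical records; any iterable of (id, name) pairs works here.
records = [
    ("r1", "UNIVERSITAET WIEN"),
    ("r2", "UNIVERSITY OF VIENNA"),
    ("r3", "ACME TECHNOLOGY SOLUTIONS LTD"),
]

# Classify all names in one pass, then group records by predicted type.
predictions = classify_organizations_batch([name for _, name in records])

blocks = defaultdict(list)
for (record_id, name), (org_type, confidence) in zip(records, predictions):
    blocks[org_type].append((record_id, name, confidence))

# Entity-resolution candidate pairs are then generated within each block only,
# e.g. the two university spellings would likely land in the same block,
# while the company name would not be compared against them.
for org_type, members in blocks.items():
    print(org_type, [record_id for record_id, _, _ in members])
```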

### Institutional Research
- Study European research and innovation ecosystems
- Analyze public-private collaboration networks
- Map educational and research infrastructure

## Languages Supported

The model handles organization names in multiple European languages including:
- English, German, French, Spanish, Italian
- Dutch, Portuguese, Polish, Czech, Hungarian

## Performance Metrics

### Classification Report

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **PUB** | 0.78 | 0.75 | 0.76 | 1941 |
| **HES** | 0.88 | 0.89 | 0.89 | 15507 |
| **REC** | 0.87 | 0.90 | 0.89 | 7734 |
| **PRC** | 0.72 | 0.67 | 0.69 | 2175 |
| **OTH** | 0.59 | 0.45 | 0.51 | 1180 |

**Overall Accuracy**: 85%

### Confusion Matrix

Rows are true classes and columns are predicted classes; row totals match the class supports above, and the diagonal accounts for the 85% overall accuracy.

```
        PUB    HES    REC    PRC    OTH
PUB    1447    324     60     38     72
HES     283  13863    773    384    204
REC      22    636   6977     62     37
PRC      65    508     80   1457     65
OTH      44    389    129     83    535
```

## License

This model is released under the [MIT License](LICENSE).