Update README.md
README.md
@@ -1,3 +1,155 @@
---
license: mit
---

# EU Organization Classifier

A multilingual transformer model fine-tuned to classify European organizations into standardized categories based on EU funding database schemas.

## Model Description

This model is fine-tuned from `intfloat/multilingual-e5-large` to classify organizations from European Union funding databases (CORDIS, Erasmus+, LIFE, Creative Europe) into five standardized organization types:

- **PUB**: Public bodies (governments, municipalities, agencies)
- **HES**: Higher Education Sector (universities, schools, educational institutions)
- **REC**: Research Organizations (research institutes, laboratories, R&D centers)
- **PRC**: Private Companies (businesses, SMEs, corporations)
- **OTH**: Other Organizations (NGOs, associations, foundations, cultural organizations)

## Training Data

The model was trained on **~140,000 organization names** from multiple EU funding databases:

- **CORDIS**: European research and innovation database
- **Erasmus+**: European education and training programs
- **LIFE**: European environment and climate action program
- **Creative Europe**: European cultural and creative sector programs

**Note**: INTERREG data was excluded from training because of data quality issues, but the model can still be applied to classify INTERREG organizations.

## Performance

- **Overall Accuracy**: 85% on held-out test data (per-class results are listed under Performance Metrics)
- **Multilingual Support**: Trained on organization names in 20+ European languages
- **Domain Expertise**: Specialized for European institutional and funding contexts

## Usage

### Direct Classification

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "your-username/eu-organization-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Organization types mapping (compare with model.config.id2label before relying on it)
LABELS = {0: 'PUB', 1: 'HES', 2: 'REC', 3: 'PRC', 4: 'OTH'}

def classify_organization(org_name):
    # Tokenize input
    inputs = tokenizer(org_name, return_tensors="pt", truncation=True, max_length=128)

    # Get prediction
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=-1).item()
        confidence = torch.softmax(outputs.logits, dim=-1).max().item()

    return LABELS[prediction], confidence

# Examples
examples = [
    "Universität Wien",
    "Ministry of Education and Science",
    "ACME Technology Solutions Ltd",
    "European Research Council",
    "Greenpeace International"
]

for org in examples:
    org_type, confidence = classify_organization(org)
    print(f"{org} → {org_type} (confidence: {confidence:.3f})")
```

### Batch Processing

```python
# Reuses torch, tokenizer, model, and LABELS from the Direct Classification example.
def classify_organizations_batch(org_names, batch_size=32):
    results = []

    for i in range(0, len(org_names), batch_size):
        batch = org_names[i:i+batch_size]

        # Tokenize batch (padding=True pads to the longest name in the batch)
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True, max_length=128)

        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)
            confidences = torch.softmax(outputs.logits, dim=-1).max(dim=-1).values

        # Convert to labels
        for pred, conf in zip(predictions, confidences):
            results.append((LABELS[pred.item()], conf.item()))

    return results
```

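Because each prediction comes with a softmax confidence, one option is to route low-confidence names to manual review instead of accepting every label. The following is a minimal sketch, assuming results in the `(label, confidence)` form returned by `classify_organizations_batch`. The helper name `split_by_confidence` and the 0.8 threshold are illustrative assumptions, not part of the model.

```python
def split_by_confidence(org_names, results, threshold=0.8):
    """Split predictions into accepted and needs-review lists.

    `results` is a list of (label, confidence) tuples aligned with `org_names`.
    The default threshold is an arbitrary starting point, not a tuned value.
    """
    if len(org_names) != len(results):
        raise ValueError("org_names and results must have the same length")

    accepted, needs_review = [], []
    for name, (label, confidence) in zip(org_names, results):
        target = accepted if confidence >= threshold else needs_review
        target.append((name, label, confidence))
    return accepted, needs_review


# Hand-written stand-ins for model output, for illustration only.
names = ["Universität Wien", "Greenpeace International"]
fake_results = [("HES", 0.97), ("OTH", 0.55)]

accepted, needs_review = split_by_confidence(names, fake_results)
print("accepted:", accepted)
print("needs review:", needs_review)
```

Softmax scores are not guaranteed to be calibrated probabilities, so a threshold like this is best chosen by checking it against a labeled sample.
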
## Use Cases

### EU Funding Analysis
- Classify beneficiary organizations in EU funding databases
- Analyze funding distribution across organization types
- Identify research collaboration patterns

### Organization Deduplication
- Standardize organization types for entity resolution
- Improve clustering of similar organizations across databases
- Enhance data quality in multi-source datasets

### Institutional Research
- Study European research and innovation ecosystems
- Analyze public-private collaboration networks
- Map educational and research infrastructure

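For the funding-distribution use case, predicted types can be aggregated with the standard library. This is a sketch only: it assumes each record is a hypothetical `(organization name, funding amount)` pair and that `predicted_types` holds one label per record. The function name `funding_by_type` and the amounts below are made up for illustration.

```python
from collections import defaultdict


def funding_by_type(records, predicted_types):
    """Sum funding per predicted organization type and compute each type's share."""
    totals = defaultdict(float)
    for (_name, amount), org_type in zip(records, predicted_types):
        totals[org_type] += amount

    grand_total = sum(totals.values())
    return {
        org_type: (total, total / grand_total if grand_total else 0.0)
        for org_type, total in totals.items()
    }


# Illustrative records; names and amounts are not real funding data.
records = [("Org A", 100_000.0), ("Org B", 300_000.0), ("Org C", 100_000.0)]
# Stand-in for labels produced by the classifier, one per record.
predicted_types = ["HES", "PRC", "HES"]

distribution = funding_by_type(records, predicted_types)
for org_type, (total, share) in sorted(distribution.items()):
    print(f"{org_type}: {total:,.0f} ({share:.0%})")
```
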
## Languages Supported

The model handles organization names in multiple European languages, including:
- English, German, French, Spanish, Italian
- Dutch, Portuguese, Polish, Czech, Hungarian

## Performance Metrics

### Classification Report

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **PUB** | 0.78 | 0.75 | 0.76 | 1941 |
| **HES** | 0.88 | 0.89 | 0.89 | 15507 |
| **REC** | 0.87 | 0.90 | 0.89 | 7734 |
| **PRC** | 0.72 | 0.67 | 0.69 | 2175 |
| **OTH** | 0.59 | 0.45 | 0.51 | 1180 |

**Overall Accuracy**: 85% (28,537 test examples in total)

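Overall accuracy is driven mostly by the two largest classes. As a rough worked check, macro and support-weighted F1 can be computed from the rounded values in the classification report. The variable names are illustrative, and because the inputs are rounded to two decimals the results are approximate.

```python
# (F1-score, support) per class, copied from the classification report.
REPORT = {
    "PUB": (0.76, 1941),
    "HES": (0.89, 15507),
    "REC": (0.89, 7734),
    "PRC": (0.69, 2175),
    "OTH": (0.51, 1180),
}

total_support = sum(support for _, support in REPORT.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in REPORT.values()) / len(REPORT)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in REPORT.values()) / total_support

print(f"test examples: {total_support}")
print(f"macro F1:      {macro_f1:.2f}")
print(f"weighted F1:   {weighted_f1:.2f}")
```

The macro average comes out near 0.75 and the weighted average near 0.85. The gap reflects the weaker scores on the smaller PRC and OTH rows.
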
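### Confusion Matrix

```
        PUB    HES    REC    PRC    OTH
PUB    1457     65     65     80    508
HES      38   1447     72     60    324
REC      83     44    535    129    389
PRC      62     22     37   6977    636
OTH     384    283    204    773  13863
```

**Note**: The row totals (2175, 1941, 1180, 7734, 15507) match the Support values of PRC, PUB, OTH, REC, and HES in the classification report, not those of the classes the rows are labeled with here. The class labels in one of the two tables are therefore misaligned and should be checked against the original evaluation output.
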
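One way to cross-check the matrix against the classification report is to recompute precision and recall from the raw counts. The sketch below pairs each matrix row with a report class by its support rather than by the printed header order. It assumes rows are true classes and columns are predicted classes, which fits the fact that the row totals equal the report's support values. The function names are illustrative.

```python
# Rows of the confusion matrix exactly as printed, top to bottom.
MATRIX = [
    [1457, 65, 65, 80, 508],
    [38, 1447, 72, 60, 324],
    [83, 44, 535, 129, 389],
    [62, 22, 37, 6977, 636],
    [384, 283, 204, 773, 13863],
]

# (precision, recall, support) per class, copied from the classification report.
REPORT = {
    "PUB": (0.78, 0.75, 1941),
    "HES": (0.88, 0.89, 15507),
    "REC": (0.87, 0.90, 7734),
    "PRC": (0.72, 0.67, 2175),
    "OTH": (0.59, 0.45, 1180),
}


def match_rows_to_classes(matrix, report):
    """Pair each matrix row with the report class whose support equals the row total."""
    by_support = {support: label for label, (_, _, support) in report.items()}
    return [by_support[sum(row)] for row in matrix]


def precision_recall(matrix):
    """Return (precision, recall) per index; rows = true class, columns = predicted."""
    column_totals = [sum(column) for column in zip(*matrix)]
    return [
        (row[i] / column_totals[i], row[i] / sum(row))
        for i, row in enumerate(matrix)
    ]


row_classes = match_rows_to_classes(MATRIX, REPORT)
metrics = precision_recall(MATRIX)
accuracy = sum(row[i] for i, row in enumerate(MATRIX)) / sum(map(sum, MATRIX))

for label, (precision, recall) in zip(row_classes, metrics):
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.3f}")
```

With these numbers the rows pair up with PRC, PUB, OTH, REC, and HES, in that order, and the recomputed precision and recall agree with the report to two decimals.
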
## License

This model is released under the [MIT License](LICENSE).