PabloAccuosto committed on
Commit df3e7d2 · verified · 1 Parent(s): a7b659b

Update README.md

Files changed (1)
  1. README.md +155 -3
README.md CHANGED
@@ -1,3 +1,155 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ ---
+
+
+ # EU Organization Classifier
+
+ A multilingual transformer model fine-tuned for classifying European organizations into standardized categories based on EU funding database schemas.
+
+ ## Model Description
+
+ This model is fine-tuned from `intfloat/multilingual-e5-large` to classify organizations from European Union funding databases (CORDIS, Erasmus+, LIFE, Creative Europe) into five standardized organization types:
+
+ - **PUB**: Public bodies (governments, municipalities, agencies)
+ - **HES**: Higher Education Sector (universities, schools, educational institutions)
+ - **REC**: Research Organizations (research institutes, laboratories, R&D centers)
+ - **PRC**: Private Companies (businesses, SMEs, corporations)
+ - **OTH**: Other Organizations (NGOs, associations, foundations, cultural organizations)
+
+ ## Training Data
+
+ The model was trained on **~140,000 organization names** from multiple EU funding databases:
+
+ - **CORDIS**: European research and innovation database
+ - **Erasmus+**: European education and training programs
+ - **LIFE**: European environment and climate action program
+ - **Creative Europe**: European cultural and creative sector programs
+
+ **Note**: INTERREG data was excluded from training due to data quality issues, but the model can still be applied to classify INTERREG organizations.
+
+ ## Performance
+
+ - **Overall Accuracy**: about 85% on held-out test data (per-class results are listed under Performance Metrics)
+ - **Multilingual Support**: Trained on organization names in 20+ European languages
+ - **Domain Expertise**: Specialized for European institutional and funding contexts
+
+ ## Usage
+
+ ### Direct Classification
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "your-username/eu-organization-classifier"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Organization types mapping
+ LABELS = {0: 'PUB', 1: 'HES', 2: 'REC', 3: 'PRC', 4: 'OTH'}
+
+ def classify_organization(org_name):
+     # Tokenize input
+     inputs = tokenizer(org_name, return_tensors="pt", truncation=True, max_length=128)
+
+     # Get prediction
+     with torch.no_grad():
+         outputs = model(**inputs)
+         prediction = torch.argmax(outputs.logits, dim=-1).item()
+         confidence = torch.softmax(outputs.logits, dim=-1).max().item()
+
+     return LABELS[prediction], confidence
+
+ # Examples
+ examples = [
+     "Universität Wien",
+     "Ministry of Education and Science",
+     "ACME Technology Solutions Ltd",
+     "European Research Council",
+     "Greenpeace International"
+ ]
+
+ for org in examples:
+     org_type, confidence = classify_organization(org)
+     print(f"{org} → {org_type} (confidence: {confidence:.3f})")
+ ```
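The confidence returned by `classify_organization` is the largest softmax probability over the five class logits. The following is a minimal standard-library sketch of that calculation, using made-up logits rather than real model output. The `min_confidence` cutoff and the `"UNSURE"` fallback are hypothetical additions for illustration and are not part of the model.

```python
import math

LABELS = {0: "PUB", 1: "HES", 2: "REC", 3: "PRC", 4: "OTH"}

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_threshold(logits, min_confidence=0.6):
    """Return (label, confidence); use 'UNSURE' when below the cutoff."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    label = LABELS[best] if confidence >= min_confidence else "UNSURE"
    return label, confidence

# Illustrative logits only (not produced by the model)
example_logits = [0.2, 3.1, 0.4, -1.0, 0.1]
label, confidence = label_with_threshold(example_logits)
print(label, round(confidence, 3))
```

A cutoff like this may be worth considering for names that would otherwise be assigned to a weakly predicted class, but a suitable value would have to be chosen on validation data.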
+
+ ### Batch Processing
+
+ ```python
+ # Reuses tokenizer, model, and LABELS from the Direct Classification example
+ def classify_organizations_batch(org_names, batch_size=32):
+     results = []
+
+     for i in range(0, len(org_names), batch_size):
+         # The tokenizer expects a plain list of strings
+         batch = list(org_names[i:i+batch_size])
+
+         # Tokenize batch
+         inputs = tokenizer(batch, return_tensors="pt", truncation=True,
+                            padding=True, max_length=128)
+
+         # Get predictions
+         with torch.no_grad():
+             outputs = model(**inputs)
+             predictions = torch.argmax(outputs.logits, dim=-1)
+             confidences = torch.softmax(outputs.logits, dim=-1).max(dim=-1).values
+
+         # Convert to labels
+         for pred, conf in zip(predictions, confidences):
+             results.append((LABELS[pred.item()], conf.item()))
+
+     return results
+ ```
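One possible way to use the batch function is to pair its output with the input names and write the result to CSV. The sketch below only assumes a callable with the same return shape as `classify_organizations_batch` (a list of `(label, confidence)` tuples). `fake_classify_batch` and `write_predictions` are hypothetical names, and the stand-in classifier exists only so the example runs without downloading the model.

```python
import csv
import io

def fake_classify_batch(org_names, batch_size=32):
    """Hypothetical stand-in returning (label, confidence) per name."""
    return [("HES" if "Univ" in name else "OTH", 0.5) for name in org_names]

def write_predictions(org_names, classify_batch, out_file):
    """Write one CSV row per organization: name, predicted type, confidence."""
    results = classify_batch(org_names)
    writer = csv.writer(out_file)
    writer.writerow(["organization", "predicted_type", "confidence"])
    for name, (label, confidence) in zip(org_names, results):
        writer.writerow([name, label, f"{confidence:.3f}"])
    return len(results)

names = ["Universität Wien", "Greenpeace International"]
# An in-memory buffer keeps the sketch self-contained; a real run would
# pass a file opened with newline="" and encoding="utf-8" instead.
buffer = io.StringIO()
rows_written = write_predictions(names, fake_classify_batch, buffer)
print(buffer.getvalue())
```

In real use, `classify_organizations_batch` would be passed in place of the stand-in.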
+
+ ## Use Cases
+
+ ### EU Funding Analysis
+ - Classify beneficiary organizations in EU funding databases
+ - Analyze funding distribution across organization types
+ - Identify research collaboration patterns
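As a rough illustration of the funding-distribution use case, the sketch below aggregates funding by predicted organization type. The records, including their types and amounts, are invented for the example, and `funding_by_type` is a hypothetical helper rather than part of this repository.

```python
from collections import defaultdict

# Hypothetical records: (organization name, predicted type, funding in EUR)
records = [
    ("Universität Wien", "HES", 1_200_000),
    ("ACME Technology Solutions Ltd", "PRC", 450_000),
    ("Hypothetical Startup GmbH", "PRC", 50_000),
    ("Ministry of Education and Science", "PUB", 300_000),
]

def funding_by_type(rows):
    """Return {type: (total_funding, share_of_all_funding)}."""
    totals = defaultdict(int)
    for _name, org_type, amount in rows:
        totals[org_type] += amount
    grand_total = sum(totals.values())
    return {t: (amount, amount / grand_total) for t, amount in totals.items()}

distribution = funding_by_type(records)
for org_type, (amount, share) in sorted(distribution.items()):
    print(f"{org_type}: {amount} EUR ({share:.0%})")
```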
+
+ ### Organization Deduplication
+ - Standardize organization types for entity resolution
+ - Improve clustering of similar organizations across databases
+ - Enhance data quality in multi-source datasets
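One way the predicted type could support entity resolution is as a blocking key, so that names are only compared with other names of the same type. This is a sketch under that assumption, using `difflib` from the standard library for string similarity. The sample names, their types, and the `min_similarity` value are illustrative, and `candidate_pairs` is a hypothetical helper.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def candidate_pairs(typed_names, min_similarity=0.85):
    """Group names by predicted type, then compare names within each group."""
    blocks = defaultdict(list)
    for name, org_type in typed_names:
        blocks[org_type].append(name)

    pairs = []
    for org_type, names in blocks.items():
        for a, b in combinations(names, 2):
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= min_similarity:
                pairs.append((org_type, a, b, score))
    return pairs

# Hypothetical (name, predicted type) inputs
typed_names = [
    ("Universität Wien", "HES"),
    ("Universitat Wien", "HES"),
    ("Politecnico di Milano", "HES"),
    ("Universität Wien Ventures", "PRC"),
]

pairs = candidate_pairs(typed_names)
for org_type, a, b, score in pairs:
    print(f"[{org_type}] {a!r} ~ {b!r} (similarity {score:.2f})")
```

Blocking reduces the number of comparisons, at the cost of missing duplicates whose predicted types differ.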
+
+ ### Institutional Research
+ - Study European research and innovation ecosystems
+ - Analyze public-private collaboration networks
+ - Map educational and research infrastructure
+
+ ## Languages Supported
+
+ The model handles organization names in multiple European languages, including:
+ - English, German, French, Spanish, Italian
+ - Dutch, Portuguese, Polish, Czech, Hungarian
+
+ ## Performance Metrics
+
+ ### Classification Report
+
+ | Class | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | **PUB** | 0.78 | 0.75 | 0.76 | 1941 |
+ | **HES** | 0.88 | 0.89 | 0.89 | 15507 |
+ | **REC** | 0.87 | 0.90 | 0.89 | 7734 |
+ | **PRC** | 0.72 | 0.67 | 0.69 | 2175 |
+ | **OTH** | 0.59 | 0.45 | 0.51 | 1180 |
+
+ **Overall Accuracy**: 85%
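The report lists per-class scores only. As a quick sketch, macro and support-weighted F1 can be derived from the table values. Because the inputs are already rounded to two decimals, the results should be read as approximate.

```python
# (F1-score, support) per class, copied from the classification report
report = {
    "PUB": (0.76, 1941),
    "HES": (0.89, 15507),
    "REC": (0.89, 7734),
    "PRC": (0.69, 2175),
    "OTH": (0.51, 1180),
}

total_support = sum(support for _f1, support in report.values())

# Macro F1: unweighted mean, so every class counts equally
macro_f1 = sum(f1 for f1, _support in report.values()) / len(report)

# Weighted F1: mean weighted by the number of test examples per class
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"support={total_support}, macro F1={macro_f1:.2f}, weighted F1={weighted_f1:.2f}")
```

The macro average comes out near 0.75 and the weighted average near 0.85. The gap reflects the weaker scores on the smaller classes.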
+
+ ### Confusion Matrix
+
+ ```
+         PUB    HES    REC    PRC    OTH
+ PUB    1457     65     65     80    508
+ HES      38   1447     72     60    324
+ REC      83     44    535    129    389
+ PRC      62     22     37   6977    636
+ OTH     384    283    204    773  13863
+ ```
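Overall accuracy can be recovered from the matrix as the sum of the diagonal divided by the sum of all cells. A short check in plain Python follows. It does not depend on which axis holds the true labels, which the matrix does not state.

```python
# Cell counts copied from the confusion matrix, in the order shown
matrix = [
    [1457, 65, 65, 80, 508],
    [38, 1447, 72, 60, 324],
    [83, 44, 535, 129, 389],
    [62, 22, 37, 6977, 636],
    [384, 283, 204, 773, 13863],
]

correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal cells
total = sum(sum(row) for row in matrix)                  # all cells
accuracy = correct / total

print(f"{correct} / {total} = {accuracy:.3f}")
```

This gives 24279 / 28537 ≈ 0.851, in line with the 85% overall accuracy reported with the classification report.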
+
+ ## License
+
+ This model is released under the [MIT License](LICENSE).