SIRIS-Lab/org_type_classifier

TensorBoard

Safetensors

xlm-roberta

PabloAccuosto committed 9 days ago

Commit df3e7d2 (verified), 1 parent: a7b659b

Update README.md

README.md +155 -3

@@ -1,3 +1,155 @@
----
-license: mit
----

+---
+license: mit
+---
+# EU Organization Classifier
+A multilingual transformer model fine-tuned for classifying European organizations into standardized categories based on EU funding database schemas.
+## Model Description
+This model is fine-tuned from `intfloat/multilingual-e5-large` to classify organizations from European Union funding databases (CORDIS, Erasmus+, LIFE, Creative Europe) into five standardized organization types:
+- **PUB**: Public bodies (governments, municipalities, agencies)
+- **HES**: Higher Education Sector (universities, schools, educational institutions)
+- **REC**: Research Organizations (research institutes, laboratories, R&D centers)
+- **PRC**: Private Companies (businesses, SMEs, corporations)
+- **OTH**: Other Organizations (NGOs, associations, foundations, cultural organizations)
+## Training Data
+The model was trained on **~140,000 organization names** from multiple EU funding databases:
+- **CORDIS**: European research and innovation database
+- **Erasmus+**: European education and training programs
+- **LIFE**: European environment and climate action program
+- **Creative Europe**: European cultural and creative sector programs
+**Note**: INTERREG data was excluded from training because of data quality issues, but the model can still be used to classify INTERREG organizations.
+## Performance
+- **Overall Accuracy**: about 85% on held-out test data
+- **Multilingual Support**: Trained on organization names in 20+ European languages
+- **Domain Expertise**: Specialized for European institutional and funding contexts
+## Usage
+### Direct Classification
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load model and tokenizer
+model_name = "SIRIS-Lab/org_type_classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Organization types mapping
+LABELS = {0: 'PUB', 1: 'HES', 2: 'REC', 3: 'PRC', 4: 'OTH'}
+def classify_organization(org_name):
+    # Tokenize input
+    inputs = tokenizer(org_name, return_tensors="pt", truncation=True, max_length=128)
+    # Get prediction
+    with torch.no_grad():
+        outputs = model(**inputs)
+        prediction = torch.argmax(outputs.logits, dim=-1).item()
+        confidence = torch.softmax(outputs.logits, dim=-1).max().item()
+    return LABELS[prediction], confidence
+# Examples
+examples = [
+    "Universität Wien",
+    "Ministry of Education and Science",
+    "ACME Technology Solutions Ltd",
+    "European Research Council",
+    "Greenpeace International"
+]
+for org in examples:
+    org_type, confidence = classify_organization(org)
+    print(f"{org} → {org_type} (confidence: {confidence:.3f})")
+```
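The confidence returned by `classify_organization` is the largest softmax probability over the five class logits. The following is a minimal pure-Python sketch of that step, not the model's own code: the logits are made up for illustration, and the `softmax` helper is a hypothetical stand-in for `torch.softmax`.

```python
import math

LABELS = {0: "PUB", 1: "HES", 2: "REC", 3: "PRC", 4: "OTH"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one organization name (not real model output).
logits = [0.2, 3.1, 1.4, -0.5, 0.0]

probs = softmax(logits)
prediction = max(range(len(probs)), key=probs.__getitem__)  # argmax
confidence = probs[prediction]

print(LABELS[prediction], round(confidence, 3))
```

A softmax maximum is a relative score rather than a calibrated probability, so thresholds on it are best chosen against held-out data.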
+### Batch Processing
+```python
+# Reuses torch, tokenizer, model, and LABELS from the Direct Classification example
+def classify_organizations_batch(org_names, batch_size=32):
+    results = []
+    for i in range(0, len(org_names), batch_size):
+        batch = org_names[i:i+batch_size]
+        # Tokenize batch
+        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
+                          padding=True, max_length=128)
+        # Get predictions
+        with torch.no_grad():
+            outputs = model(**inputs)
+            predictions = torch.argmax(outputs.logits, dim=-1)
+            confidences = torch.softmax(outputs.logits, dim=-1).max(dim=-1).values
+        # Convert to labels
+        for pred, conf in zip(predictions, confidences):
+            results.append((LABELS[pred.item()], conf.item()))
+    return results
+```
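Because `classify_organizations_batch` returns one `(label, confidence)` pair per name, low-confidence predictions can be set aside for manual review. The sketch below shows one possible way to do that. The function `split_by_confidence`, the 0.7 threshold, and the sample results are all hypothetical and not part of the model or its API.

```python
def split_by_confidence(org_names, results, threshold=0.7):
    """Split predictions into accepted and needs-review lists.

    `results` holds (label, confidence) tuples in the same order as
    `org_names`, the shape returned by classify_organizations_batch.
    """
    if len(org_names) != len(results):
        raise ValueError("org_names and results must have the same length")
    accepted, review = [], []
    for name, (label, confidence) in zip(org_names, results):
        target = accepted if confidence >= threshold else review
        target.append((name, label, confidence))
    return accepted, review


# Hypothetical results, made up for illustration (not real model predictions).
names = ["Universität Wien", "Greenpeace International"]
results = [("HES", 0.97), ("OTH", 0.55)]

accepted, review = split_by_confidence(names, results)
print("accepted:", accepted)
print("review:", review)
```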
+## Use Cases
+### EU Funding Analysis
+- Classify beneficiary organizations in EU funding databases
+- Analyze funding distribution across organization types
+- Identify research collaboration patterns
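As an illustration of the funding-distribution use case, predicted types can be used as grouping keys. This is a minimal sketch under stated assumptions: the records, amounts, and the organization "Example Polytechnic" are invented, and in practice the type column would come from the classifier's output.

```python
from collections import defaultdict

# Hypothetical records: (organization name, predicted type, funding in EUR).
records = [
    ("Universität Wien", "HES", 1_200_000.0),
    ("ACME Technology Solutions Ltd", "PRC", 450_000.0),
    ("Ministry of Education and Science", "PUB", 300_000.0),
    ("Example Polytechnic", "HES", 800_000.0),
]

# Sum funding per predicted organization type.
totals = defaultdict(float)
for _name, org_type, amount in records:
    totals[org_type] += amount

# Express each type's total as a share of all funding.
grand_total = sum(totals.values())
shares = {org_type: amount / grand_total for org_type, amount in totals.items()}

for org_type, amount in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{org_type}: {amount:,.0f} EUR ({shares[org_type]:.1%})")
```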
+### Organization Deduplication
+- Standardize organization types for entity resolution
+- Improve clustering of similar organizations across databases
+- Enhance data quality in multi-source datasets
+### Institutional Research
+- Study European research and innovation ecosystems
+- Analyze public-private collaboration networks
+- Map educational and research infrastructure
+## Languages Supported
+The model handles organization names in multiple European languages including:
+- English, German, French, Spanish, Italian
+- Dutch, Portuguese, Polish, Czech, Hungarian
+## Performance Metrics
+### Classification Report
+| Class | Precision | Recall | F1-Score | Support |
+|-------|-----------|--------|----------|---------|
+| **PUB** | 0.72 | 0.67 | 0.69 | 2175 |
+| **HES** | 0.78 | 0.75 | 0.76 | 1941 |
+| **REC** | 0.59 | 0.45 | 0.51 | 1180 |
+| **PRC** | 0.87 | 0.90 | 0.89 | 7734 |
+| **OTH** | 0.88 | 0.89 | 0.89 | 15507 |
+**Overall Accuracy**: 85%
+### Confusion Matrix
+Rows are true classes and columns are predicted classes.
+```
+       PUB   HES   REC   PRC   OTH
+PUB  1457    65    65    80   508
+HES    38  1447    72    60   324
+REC    83    44   535   129   389
+PRC    62    22    37  6977   636
+OTH   384   283   204   773  13863
+```
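The matrix can be checked against the reported accuracy with a few lines of Python. This is a minimal sketch that treats rows as true classes and columns as predicted classes; the counts and label order are copied from the matrix above, and the variable names are illustrative.

```python
# Counts copied from the confusion matrix above.
# Assumption: rows are true classes, columns are predicted classes.
labels = ["PUB", "HES", "REC", "PRC", "OTH"]
matrix = [
    [1457, 65, 65, 80, 508],
    [38, 1447, 72, 60, 324],
    [83, 44, 535, 129, 389],
    [62, 22, 37, 6977, 636],
    [384, 283, 204, 773, 13863],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

per_class = {}
for i, label in enumerate(labels):
    row_sum = sum(matrix[i])                 # examples whose true class is `label`
    col_sum = sum(row[i] for row in matrix)  # examples predicted as `label`
    per_class[label] = {
        "precision": matrix[i][i] / col_sum,
        "recall": matrix[i][i] / row_sum,
        "support": row_sum,
    }

print(f"accuracy: {accuracy:.3f}")
for label, m in per_class.items():
    print(f"{label}: precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} support={m['support']}")
```

The computed accuracy rounds to the reported 85%.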
+## License
+This model is released under the [MIT License](LICENSE).