Affiliation Clustering Model
A fine-tuned multilingual embedding model, based on Alibaba-NLP/gte-multilingual-base, for clustering and matching academic affiliations. The model was trained with a triplet loss to learn semantic representations that distinguish similar from dissimilar institutional affiliations.
Model Description
This model is specifically trained to understand academic affiliations and institutional names, making it useful for:
- Deduplicating author affiliations in academic databases
- Clustering similar institutional names
- Matching affiliations across different naming conventions
- Academic author disambiguation tasks
Architecture
- Base Model: Alibaba-NLP/gte-multilingual-base (12-layer transformer)
- Embedding Dimension: 768
- Training Objective: Triplet margin loss with L2 normalization
- Tokenization: Up to 8192 tokens with padding and truncation
Training Details
Training Data
- Dataset: cometadata/triplet_loss_for_embedding_affiliations_sample_1
- Training/Test Split: 90%/10%
- Triplet format: (anchor, positive, negative) affiliation texts
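For illustration, a single training example could look like the triplet below. The field names and affiliation strings are hypothetical and are not copied from the dataset.

# Hypothetical example of one (anchor, positive, negative) triplet.
# Field names and strings are illustrative only, not taken from the dataset.
triplet = {
    "anchor": "Dept. of Computer Science, Stanford University, Stanford, CA",
    "positive": "Stanford University, Computer Science Department",
    "negative": "Department of Computer Science, University of Oxford",
}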
Training Configuration
{
  "model": "Alibaba-NLP/gte-multilingual-base",
  "learning_rate": 1e-4,
  "batch_size": 32,
  "num_epochs": 3,
  "margin": 1.0,
  "warmup_steps": 30,
  "optimizer": "AdamW",
  "scheduler": "cosine"
}
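The sketch below shows how these hyperparameters could be wired into a minimal triplet-loss fine-tuning loop. It is an assumption-based reconstruction rather than the released training script: the batching helper, the use of torch.nn.TripletMarginLoss (which computes unsquared L2 distances), and the scheduler call are illustrative choices.

import torch
import torch.nn.functional as F
from torch.optim import AdamW
from transformers import AutoModel, AutoTokenizer, get_cosine_schedule_with_warmup

# Hyperparameters taken from the configuration above
MODEL_NAME = "Alibaba-NLP/gte-multilingual-base"
LR, MARGIN, WARMUP_STEPS = 1e-4, 1.0, 30

model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.train()

optimizer = AdamW(model.parameters(), lr=LR)
loss_fn = torch.nn.TripletMarginLoss(margin=MARGIN)  # standard (unsquared) L2 variant

def embed(texts):
    # [CLS] token embedding, L2-normalized (mirrors the inference code in Usage)
    tokens = tokenizer(texts, max_length=8192, padding=True,
                       truncation=True, return_tensors="pt")
    out = model(**tokens)
    return F.normalize(out.last_hidden_state[:, 0], p=2, dim=1)

def train(batches, total_steps):
    # `batches` is a placeholder: an iterable of (anchors, positives, negatives)
    # lists of strings, e.g. produced by a DataLoader over the triplet dataset.
    scheduler = get_cosine_schedule_with_warmup(optimizer, WARMUP_STEPS, total_steps)
    for anchors, positives, negatives in batches:
        loss = loss_fn(embed(anchors), embed(positives), embed(negatives))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()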
Hardware
- GPU: NVIDIA H100 80GB HBM3 (8x GPUs available)
- Training Time: ~2 minutes (120 seconds total)
- Platform: Linux SLURM cluster
Performance Metrics
- Final Training Loss: 0.179
- Test Loss: 0.192
- Training Steps: 686
- Total Runtime: 120 seconds
The model learns to separate positive from negative samples, as reflected in the distribution of positive and negative similarity scores on the held-out test set.
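One way to reproduce such score distributions, sketched under the assumption that held-out triplets are available as plain strings, is to compute anchor-positive and anchor-negative cosine similarities with the AffiliationEmbeddingModel wrapper defined in the Usage section below:

import torch

def score_distributions(model, test_triplets):
    # `model` is the AffiliationEmbeddingModel from the Usage section;
    # `test_triplets` is a hypothetical list of (anchor, positive, negative) strings.
    pos_scores, neg_scores = [], []
    with torch.no_grad():
        for anchor, positive, negative in test_triplets:
            a, p, n = model(**model.tokenize([anchor, positive, negative]))
            pos_scores.append((a @ p).item())  # anchor-positive cosine similarity
            neg_scores.append((a @ n).item())  # anchor-negative cosine similarity
    return pos_scores, neg_scores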
Usage
Installation
pip install torch transformers
Basic Usage
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
class AffiliationEmbeddingModel(torch.nn.Module):
    def __init__(self, model_path):
        super().__init__()
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.embedding_dim = 768

    def tokenize(self, input_texts):
        return self.tokenizer(
            input_texts,
            max_length=8192,
            padding=True,
            truncation=True,
            return_tensors='pt'
        )

    def forward(self, **inputs):
        outputs = self.model(**inputs)
        # Take the [CLS] token representation and keep the first 768 dimensions
        embeddings = outputs.last_hidden_state[:, 0, :self.embedding_dim]
        # L2-normalize so that dot products equal cosine similarities
        embeddings = F.normalize(embeddings, p=2, dim=1)
        return embeddings

# Load the model
model = AffiliationEmbeddingModel("cometadata/affiliation-clustering-0.3b")
model.eval()

# Example affiliations
affiliations = [
    "Stanford University Department of Computer Science",
    "Massachusetts Institute of Technology",
    "Stanford University",
    "University of California, Berkeley"
]

# Get embeddings
with torch.no_grad():
    tokens = model.tokenize(affiliations)
    embeddings = model(**tokens)

# Compute similarity of the first affiliation against the rest
# (dot products of L2-normalized embeddings are cosine similarities)
similarities = [
    (embeddings[0] @ embeddings[i]).item()
    for i in range(1, len(embeddings))
]
print(similarities)
# => [0.5692405700683594, 0.9963535666465759, 0.11983194202184677]
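Building on these embeddings, a minimal way to cluster affiliations is to group strings whose cosine similarity to a cluster representative exceeds a threshold. The greedy grouping and the 0.9 threshold below are illustrative choices, not part of the released model:

def cluster_affiliations(names, embeddings, threshold=0.9):
    # Greedy clustering: assign each name to the first cluster whose
    # representative embedding is similar enough, otherwise start a new cluster.
    clusters = []  # list of (representative index, member names)
    for i, name in enumerate(names):
        for rep_idx, members in clusters:
            if (embeddings[i] @ embeddings[rep_idx]).item() >= threshold:
                members.append(name)
                break
        else:
            clusters.append((i, [name]))
    return [members for _, members in clusters]

print(cluster_affiliations(affiliations, embeddings))
# The two Stanford variants should land in the same cluster.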
Model Architecture Details
The model uses the following forward pass:
- Tokenize the input text (up to 8192 tokens)
- Pass the tokens through the 12-layer transformer encoder
- Take the first ([CLS]) token representation, keeping the first 768 dimensions
- Apply L2 normalization to the embedding
Training Objective
The model was trained using triplet margin loss:
loss = max(0, ||f(a) - f(p)||² - ||f(a) - f(n)||² + margin)
Where:
- f(a) = anchor embedding (the target affiliation)
- f(p) = positive embedding (a similar affiliation)
- f(n) = negative embedding (a dissimilar affiliation)
- margin = 1.0
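As a sketch, the formula above maps directly onto a few lines of PyTorch. This implements the squared-distance form as written; the actual training run may instead have used torch.nn.TripletMarginLoss, which uses unsquared L2 distances.

import torch

def triplet_margin_loss(f_a, f_p, f_n, margin=1.0):
    # Squared Euclidean distances between anchor/positive and anchor/negative
    d_pos = ((f_a - f_p) ** 2).sum(dim=1)
    d_neg = ((f_a - f_n) ** 2).sum(dim=1)
    # Hinge: only penalize when the positive is not closer than the negative by `margin`
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()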
Limitations
- Optimized specifically for academic affiliations
- May not generalize well to other domains
- Performance depends on quality of affiliation text preprocessing
- Limited by base model's multilingual capabilities
Citation
If you use this model in your research, please cite:
@misc{affiliation-clustering-model,
  title={Fine-tuned Multilingual Embedding Model for Academic Affiliation Clustering},
  author={COMET},
  year={2025},
  howpublished={\url{https://huggingface.co/cometadata/affiliation-clustering-0.3b/}},
}