Affiliation Clustering Model

A fine-tuned multilingual embedding model based on Alibaba-NLP/gte-multilingual-base, designed for clustering and matching academic affiliations. The model was trained with triplet loss to learn semantic representations that distinguish between similar and dissimilar institutional affiliations.

Model Description

This model is specifically trained to understand academic affiliations and institutional names, making it useful for:

  • Deduplicating author affiliations in academic databases
  • Clustering similar institutional names
  • Matching affiliations across different naming conventions
  • Academic author disambiguation tasks

Architecture

  • Base Model: Alibaba-NLP/gte-multilingual-base (12-layer transformer)
  • Embedding Dimension: 768
  • Training Objective: Triplet margin loss with L2 normalization
  • Tokenization: Up to 8192 tokens with padding and truncation

Training Details

Training Data

  • Dataset: cometadata/triplet_loss_for_embedding_affiliations_sample_1
  • Training/Test Split: 90%/10%
  • Triplet format: (anchor, positive, negative) affiliation texts
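
A minimal sketch of loading the triplet data and reproducing the 90%/10% split with the datasets library (an extra dependency not listed under Installation). The column names anchor, positive, and negative and the split seed are assumptions based on the triplet format described above, not confirmed fields of the dataset.

from datasets import load_dataset

# Load the triplet dataset (column names below are assumed, not confirmed)
ds = load_dataset("cometadata/triplet_loss_for_embedding_affiliations_sample_1", split="train")

# Reproduce a 90%/10% train/test split (seed chosen arbitrarily here)
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]

# Each row is expected to hold (anchor, positive, negative) affiliation strings
print(train_ds[0])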

Training Configuration

{
  "model": "Alibaba-NLP/gte-multilingual-base",
  "learning_rate": 1e-4,
  "batch_size": 32,
  "num_epochs": 3,
  "margin": 1.0,
  "warmup_steps": 30,
  "optimizer": "AdamW",
  "scheduler": "cosine"
}
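
For reference, a condensed training-loop sketch consistent with this configuration, using torch.nn.TripletMarginLoss, AdamW, and a cosine schedule with warmup from transformers. The actual training script is not included in this repository, so the data handling, device placement, and exact loss variant are assumptions; AffiliationEmbeddingModel refers to the wrapper class defined in the Usage section below.

import torch
from torch.utils.data import DataLoader
from transformers import get_cosine_schedule_with_warmup

# train_ds: (anchor, positive, negative) affiliation triplets, as sketched above
model = AffiliationEmbeddingModel("Alibaba-NLP/gte-multilingual-base").cuda()
loader = DataLoader(train_ds, batch_size=32, shuffle=True)

criterion = torch.nn.TripletMarginLoss(margin=1.0)  # loss variant is an assumption
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=30, num_training_steps=len(loader) * 3
)

model.train()
for epoch in range(3):
    for batch in loader:
        # Embed each leg of the triplet with the shared encoder
        anchor = model(**model.tokenize(batch["anchor"]).to("cuda"))
        positive = model(**model.tokenize(batch["positive"]).to("cuda"))
        negative = model(**model.tokenize(batch["negative"]).to("cuda"))

        loss = criterion(anchor, positive, negative)
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()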

Hardware

  • GPU: NVIDIA H100 80GB HBM3 (8x GPUs available)
  • Training Time: ~2 minutes (120 seconds total)
  • Platform: Linux SLURM cluster

Performance Metrics

  • Final Training Loss: 0.179
  • Test Loss: 0.192
  • Training Steps: 686
  • Total Runtime: 120 seconds
[Figure: distribution of positive and negative triplet scores on the test set]

The model successfully learns to separate positive and negative samples, as shown in the distribution of positive and negative scores from the test set above.
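
As a sketch, this score distribution can be reproduced by embedding each test triplet and comparing anchor-positive against anchor-negative cosine similarity. The original evaluation script is not included here, so the snippet below is illustrative only; AffiliationEmbeddingModel and the triplet fields are the same assumptions used elsewhere in this card.

import torch

# model: an AffiliationEmbeddingModel (see Usage below); test_ds: held-out triplets
pos_scores, neg_scores = [], []
model.eval()
with torch.no_grad():
    for row in test_ds:
        emb = model(**model.tokenize([row["anchor"], row["positive"], row["negative"]]))
        pos_scores.append((emb[0] @ emb[1]).item())  # anchor vs. positive
        neg_scores.append((emb[0] @ emb[2]).item())  # anchor vs. negative

print(f"mean positive score: {sum(pos_scores) / len(pos_scores):.3f}")
print(f"mean negative score: {sum(neg_scores) / len(neg_scores):.3f}")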

Usage

Installation

pip install torch transformers

Basic Usage

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class AffiliationEmbeddingModel(torch.nn.Module):
    def __init__(self, model_path):
        super().__init__()
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.embedding_dim = 768

    def tokenize(self, input_texts):
        return self.tokenizer(
            input_texts,
            max_length=8192,
            padding=True,
            truncation=True,
            return_tensors='pt'
        )

    def forward(self, **inputs):
        outputs = self.model(**inputs)
        # CLS-token representation, truncated to the first 768 dimensions
        embeddings = outputs.last_hidden_state[:, 0][:, :self.embedding_dim]
        embeddings = F.normalize(embeddings, p=2, dim=1)
        return embeddings

# Load the model
model = AffiliationEmbeddingModel("cometadata/affiliation-clustering-0.3b")
model.eval()

# Example affiliations
affiliations = [
    "Stanford University Department of Computer Science",
    "Massachusetts Institute of Technology",
    "Stanford University",
    "University of California, Berkeley"
]

# Get embeddings
with torch.no_grad():
    tokens = model.tokenize(affiliations)
    embeddings = model(**tokens)

# Compute cosine similarity (a plain dot product, since the embeddings are L2-normalized)
similarities = [
    (embeddings[0] @ embeddings[i]).item()
    for i in range(1, len(embeddings))
]
print(similarities)
# => [0.5692405700683594, 0.9963535666465759, 0.11983194202184677]
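
Building on the embeddings above, affiliations can be grouped with an off-the-shelf clustering algorithm. The sketch below uses scikit-learn's AgglomerativeClustering (an extra dependency not listed under Installation); the distance threshold is an arbitrary illustrative value, not a tuned recommendation.

from sklearn.cluster import AgglomerativeClustering

# Cosine distance = 1 - cosine similarity (embeddings are L2-normalized)
clustering = AgglomerativeClustering(
    n_clusters=None,
    metric="cosine",
    linkage="average",
    distance_threshold=0.3,  # illustrative value; tune on your own data
)
labels = clustering.fit_predict(embeddings.numpy())

# Affiliations sharing a label are treated as the same institution
for affiliation, label in zip(affiliations, labels):
    print(label, affiliation)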

Model Architecture Details

The model uses the following forward pass:

  1. Tokenize input text (max 8192 tokens)
  2. Pass through 12-layer transformer encoder
  3. Extract the [CLS] token representation (first 768 dimensions)
  4. Apply L2 normalization

Training Objective

The model was trained using triplet margin loss:

loss = max(0, ||f(a) - f(p)||² - ||f(a) - f(n)||² + margin)

Where:

  • f(a) = anchor embedding (target affiliation)
  • f(p) = positive embedding (similar affiliation)
  • f(n) = negative embedding (dissimilar affiliation)
  • margin = 1.0
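
For concreteness, a literal PyTorch transcription of the formula above (squared L2 distances, as written); this is a sketch of the objective rather than the authors' exact loss implementation.

import torch.nn.functional as F

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # ||f(a) - f(p)||^2 and ||f(a) - f(n)||^2, per row
    pos_dist = (anchor - positive).pow(2).sum(dim=1)
    neg_dist = (anchor - negative).pow(2).sum(dim=1)
    # max(0, pos - neg + margin), averaged over the batch
    return F.relu(pos_dist - neg_dist + margin).mean()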

Limitations

  • Optimized specifically for academic affiliations
  • May not generalize well to other domains
  • Performance depends on quality of affiliation text preprocessing
  • Limited by base model's multilingual capabilities

Citation

If you use this model in your research, please cite:

@misc{affiliation-clustering-model,
  title={Fine-tuned Multilingual Embedding Model for Academic Affiliation Clustering},
  author={COMET},
  year={2025},
  howpublished={\url{https://huggingface.co/cometadata/affiliation-clustering-0.3b/}},
}