ALIGN-SIM: A Task-Free Test Bed for Evaluating and Interpreting Sentence Embeddings
ALIGN-SIM is a novel, task-free test bed for evaluating and interpreting sentence embeddings based on five intuitive semantic alignment criteria. It provides an alternative evaluation paradigm to popular task-specific benchmarks, offering deeper insights into whether sentence embeddings truly capture human-like semantic similarity.
Overview
Sentence embeddings are central to many NLP applications such as translation, question answering, and text classification. However, evaluating these dense vector representations in a way that reflects human semantic understanding remains challenging. ALIGN-SIM addresses this challenge by introducing a framework based on five semantic alignment criteria (a minimal illustrative sketch follows the list):
- Semantic Distinction: Measures the ability of an encoder to differentiate between semantically similar sentence pairs and unrelated (random) sentence pairs.
- Synonym Replacement: Tests if minor lexical changes (using synonyms) preserve the semantic similarity of the original sentence.
- Antonym Replacement (Paraphrase vs. Antonym): Compares how closely a paraphrase aligns with the original sentence compared to a sentence where a key word is replaced with its antonym.
- Paraphrase without Negation: Evaluates whether removing negation (and rephrasing) preserves the semantic meaning.
- Sentence Jumbling: Assesses the sensitivity of the embeddings to changes in word order, ensuring that a jumbled sentence is distinctly represented.
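In practice, each criterion reduces to comparing embedding similarities between an original sentence and a paired or perturbed variant. The sketch below is not part of the ALIGN-SIM codebase; the model name and sentences are purely illustrative, but it shows the kind of comparison the criteria make using SentenceTransformers cosine similarity:

# Minimal illustrative sketch (not the ALIGN-SIM evaluation code).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model; swap in any sentence encoder

original   = "The company reported higher profits this quarter."
paraphrase = "This quarter, the firm announced increased earnings."
antonym    = "The company reported lower profits this quarter."
jumbled    = "profits company quarter this reported the higher"

emb = model.encode([original, paraphrase, antonym, jumbled], convert_to_tensor=True)

# A well-aligned encoder should score the paraphrase closer to the original
# than either the antonym-substituted or the word-order-jumbled variant.
print("paraphrase:", util.cos_sim(emb[0], emb[1]).item())
print("antonym:   ", util.cos_sim(emb[0], emb[2]).item())
print("jumbled:   ", util.cos_sim(emb[0], emb[3]).item())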
ALIGN-SIM has been used to rigorously evaluate 13 sentence embedding models—including both classical encoders (e.g., SBERT, USE, SimCSE) and modern LLM-induced embeddings (e.g., GPT-3, LLaMA, Bloom)—across multiple datasets (QQP, PAWS-WIKI, MRPC, and AFIN).
Features
- Task-Free Evaluation: Evaluate sentence embeddings without relying on task-specific training data.
- Comprehensive Semantic Criteria: Assess embedding quality using five human-intuitive semantic alignment tests.
- Multiple Datasets: Benchmark on diverse datasets to ensure robustness.
- Comparative Analysis: Provides insights into both classical sentence encoders and LLM-induced embeddings.
- Extensive Experimental Results: Detailed analysis demonstrating that high performance on task-specific benchmarks (e.g., SentEval) does not necessarily imply semantic alignment with human expectations.
Installation
Requirements
- Python 3.7 or higher
- PyTorch
- Hugging Face Transformers
- SentenceTransformers
- Other dependencies as listed in requirements.txt (e.g., NumPy, SciPy, scikit-learn)
Setup
Clone the repository and install dependencies:
git clone https://github.com/BridgeAI-Lab/ALIGNSIM.git
cd ALIGNSIM
pip install -r requirements.txt
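As an optional sanity check, you can confirm that the core dependencies import cleanly:

python -c "import torch, transformers, sentence_transformers; print('dependencies OK')"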
Usage
Creating Sentence Perturbation Dataset
Perturbation datasets are provided for English and six other languages (fr, es, de, zh, ja, ko). If you want to work with a different dataset, run the command below; otherwise, skip this step:
python src/SentencePerturbation/sentence_perturbation.py \
--dataset_name mrpc \
--task anto \
--target_lang en \
--output_dir ./data/perturbed_dataset/ \
--save True \
--sample_size 3500
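After generating a new perturbation dataset, a quick way to inspect the output is to load whatever files the script writes to --output_dir. The sketch below assumes CSV output; the actual file and column names depend on the script, so adjust after listing the folder:

import glob
import pandas as pd

# Assumption: sentence_perturbation.py writes CSV files to the --output_dir given above.
files = glob.glob("./data/perturbed_dataset/*.csv")
print(files)

df = pd.read_csv(files[0])
print(df.columns.tolist())  # check the actual column names produced
print(df.head())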
Evaluating Sentence Encoders
Run the evaluation script to test a sentence encoder against the five semantic alignment criteria. You can use any Hugging Face model for evaluation. For example, to evaluate a LLaMA-3 encoder on the antonym task of the QQP dataset:
python src/evaluate.py --model llama3 \
--dataset qqp \
--task antonym \
--gpu auto \
--batch_size 16 \
--metric cosine \
--save True
The script supports different models (e.g., sbert, use, simcse, gpt3-ada, llama2, etc.) and datasets (e.g., qqp, paws_wiki, mrpc, afin). We evaluated models with two metrics: Cosine Similarity and Normalized Euclidean Distance (NED).
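For reference, the sketch below computes cosine similarity and one common formulation of NED between two embedding vectors. The exact NED definition used by evaluate.py may differ, so treat the formula as an assumption rather than the script's implementation:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; 1.0 means identical direction.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def normalized_euclidean_distance(u, v):
    # One common formulation: Euclidean distance between L2-normalized vectors,
    # rescaled to [0, 1]. The paper/script may define NED differently.
    u_hat = u / np.linalg.norm(u)
    v_hat = v / np.linalg.norm(v)
    return float(np.linalg.norm(u_hat - v_hat) / 2.0)

rng = np.random.default_rng(0)
u = rng.normal(size=384)            # e.g., an SBERT-sized embedding
v = u + 0.1 * rng.normal(size=384)  # a slightly perturbed copy
print(cosine_similarity(u, v), normalized_euclidean_distance(u, v))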
Citation
If you use ALIGN-SIM in your research, please cite our work:
@inproceedings{mahajan-etal-2024-align,
title = "{ALIGN}-{SIM}: A Task-Free Test Bed for Evaluating and Interpreting Sentence Embeddings through Semantic Similarity Alignment",
author = "Mahajan, Yash and
Bansal, Naman and
Blanco, Eduardo and
Karmaker, Santu",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.436/",
doi = "10.18653/v1/2024.findings-emnlp.436",
pages = "7393--7428",
}
Acknowledgments
This work has been partially supported by NSF Standard Grant Award #2302974 and AFOSR Cooperative Agreement Award #FA9550-23-1-0426. We also acknowledge the support from Auburn University College of Engineering and the Department of CSSE.