YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)

ALIGN-SIM: A Task-Free Test Bed for Evaluating and Interpreting Sentence Embeddings

ALIGN-SIM is a novel, task-free test bed for evaluating and interpreting sentence embeddings based on five intuitive semantic alignment criteria. It provides an alternative evaluation paradigm to popular task-specific benchmarks, offering deeper insights into whether sentence embeddings truly capture human-like semantic similarity.

Overview

Sentence embeddings are central to many NLP applications such as translation, question answering, and text classification. However, evaluating these dense vector representations in a way that reflects human semantic understanding remains challenging. ALIGN-SIM addresses this challenge by introducing a framework based on five semantic alignment criteria:

  • Semantic Distinction: Measures the ability of an encoder to differentiate between semantically similar sentence pairs and unrelated (random) sentence pairs.
  • Synonym Replacement: Tests if minor lexical changes (using synonyms) preserve the semantic similarity of the original sentence.
  • Antonym Replacement (Paraphrase vs. Antonym): Compares how closely a paraphrase aligns with the original sentence compared to a sentence where a key word is replaced with its antonym.
  • Paraphrase without Negation: Evaluates whether removing negation (and rephrasing) preserves the semantic meaning.
  • Sentence Jumbling: Assesses the sensitivity of the embeddings to changes in word order, ensuring that a jumbled sentence is distinctly represented.

ALIGN-SIM has been used to rigorously evaluate 13 sentence embedding models—including both classical encoders (e.g., SBERT, USE, SimCSE) and modern LLM-induced embeddings (e.g., GPT-3, LLaMA, Bloom)—across multiple datasets (QQP, PAWS-WIKI, MRPC, and AFIN).

Features

  • Task-Free Evaluation: Evaluate sentence embeddings without relying on task-specific training data.
  • Comprehensive Semantic Criteria: Assess embedding quality using five human-intuitive semantic alignment tests.
  • Multiple Datasets: Benchmark on diverse datasets to ensure robustness.
  • Comparative Analysis: Provides insights into both classical sentence encoders and LLM-induced embeddings.
  • Extensive Experimental Results: Detailed analysis demonstrating that high performance on task-specific benchmarks (e.g., SentEval) does not necessarily imply semantic alignment with human expectations.

Installation

Requirements

Setup

Clone the repository and install dependencies:

git clone https://github.com/BridgeAI-Lab/ALIGNSIM.git
cd ALIGN-SIM
pip install -r requirements.txt

Usage

Creating Sentence Perturbation Dataset

A dataset is available for English and six other languages [Fr, es, de, zh, ja, ko]. If you want to work with a different dataset, run the code below otherwise skip this step:

python  src/SentencePerturbation/sentence_perturbation.py \
        --dataset_name mrpc \
        --task anto \
        --target_lang en \
        --output_dir ./data/perturbed_dataset/ \
        --save True \
        --sample_size 3500

Evaluating Sentence Encoders

Run the evaluation script to test a sentence encoder against the five semantic alignment criteria. You can use any HuggingFace model for evaluaton. For example, to evaluate SBERT on the QQP dataset:

python src/evaluate.py --model llama3 
    --dataset qqp \
    --task antonym \
    --gpu auto \
    --batch_size 16 \
    --metric cosine \
    --save True

The script supports different models (e.g., sbert, use, simcse, gpt3-ada, llama2, etc.) and datasets (e.g., qqp, paws_wiki, mrpc, afin). We evalauted models on two metric Cosine Similarity and Normalized Euclidean Distance (NED)

Citation

If you use ALIGN-SIM in your research, please cite our work:

@inproceedings{mahajan-etal-2024-align,
    title = "{ALIGN}-{SIM}: A Task-Free Test Bed for Evaluating and Interpreting Sentence Embeddings through Semantic Similarity Alignment",
    author = "Mahajan, Yash  and
      Bansal, Naman  and
      Blanco, Eduardo  and
      Karmaker, Santu",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.436/",
    doi = "10.18653/v1/2024.findings-emnlp.436",
    pages = "7393--7428",
}

Acknowledgments

This work has been partially supported by NSF Standard Grant Award #2302974 and AFOSR Cooperative Agreement Award #FA9550-23-1-0426. We also acknowledge the support from Auburn University College of Engineering and the Department of CSSE.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support