File size: 5,438 Bytes
4f08d2c fdfb446 4f08d2c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 |
# ALIGN-SIM: A Task-Free Test Bed for Evaluating and Interpreting Sentence Embeddings
ALIGN-SIM is a novel, task-free test bed for evaluating and interpreting sentence embeddings based on five intuitive semantic alignment criteria. It provides an alternative evaluation paradigm to popular task-specific benchmarks, offering deeper insights into whether sentence embeddings truly capture human-like semantic similarity.
## Overview
Sentence embeddings are central to many NLP applications such as translation, question answering, and text classification. However, evaluating these dense vector representations in a way that reflects human semantic understanding remains challenging. ALIGN-SIM addresses this challenge by introducing a framework based on five semantic alignment criteria:
- **Semantic Distinction:** Measures the ability of an encoder to differentiate between semantically similar sentence pairs and unrelated (random) sentence pairs.
- **Synonym Replacement:** Tests if minor lexical changes (using synonyms) preserve the semantic similarity of the original sentence.
- **Antonym Replacement (Paraphrase vs. Antonym):** Compares how closely a paraphrase aligns with the original sentence compared to a sentence where a key word is replaced with its antonym.
- **Paraphrase without Negation:** Evaluates whether removing negation (and rephrasing) preserves the semantic meaning.
- **Sentence Jumbling:** Assesses the sensitivity of the embeddings to changes in word order, ensuring that a jumbled sentence is distinctly represented.
ALIGN-SIM has been used to rigorously evaluate 13 sentence embedding models—including both classical encoders (e.g., SBERT, USE, SimCSE) and modern LLM-induced embeddings (e.g., GPT-3, LLaMA, Bloom)—across multiple datasets (QQP, PAWS-WIKI, MRPC, and AFIN).
## Features
- **Task-Free Evaluation:** Evaluate sentence embeddings without relying on task-specific training data.
- **Comprehensive Semantic Criteria:** Assess embedding quality using five human-intuitive semantic alignment tests.
- **Multiple Datasets:** Benchmark on diverse datasets to ensure robustness.
- **Comparative Analysis:** Provides insights into both classical sentence encoders and LLM-induced embeddings.
- **Extensive Experimental Results:** Detailed analysis demonstrating that high performance on task-specific benchmarks (e.g., SentEval) does not necessarily imply semantic alignment with human expectations.
## Installation
### Requirements
- Python 3.7 or higher
- [PyTorch](https://pytorch.org/)
- [Hugging Face Transformers](https://huggingface.co/transformers/)
- [SentenceTransformers](https://www.sbert.net/)
- Other dependencies as listed in `requirements.txt` (e.g., NumPy, SciPy, scikit-learn)
### Setup
Clone the repository and install dependencies:
```bash
git clone https://github.com/BridgeAI-Lab/ALIGNSIM.git
cd ALIGN-SIM
pip install -r requirements.txt
```
# Usage
## Creating Sentence Perturbation Dataset
A dataset is available for English and six other languages [Fr, es, de, zh, ja, ko]. If you want to work with a different dataset, run the code below otherwise skip this step:
``` bash
python src/SentencePerturbation/sentence_perturbation.py \
--dataset_name mrpc \
--task anto \
--target_lang en \
--output_dir ./data/perturbed_dataset/ \
--save True \
--sample_size 3500
```
## Evaluating Sentence Encoders
Run the evaluation script to test a sentence encoder against the five semantic alignment criteria. You can use any HuggingFace model for evaluaton. For example, to evaluate SBERT on the QQP dataset:
```bash
python src/evaluate.py --model llama3
--dataset qqp \
--task antonym \
--gpu auto \
--batch_size 16 \
--metric cosine \
--save True
```
The script supports different models (e.g., sbert, use, simcse, gpt3-ada, llama2, etc.) and datasets (e.g., qqp, paws_wiki, mrpc, afin). We evalauted models on two metric **Cosine Similarity** and **Normalized Euclidean Distance (NED)**
[# Viewing Results
Evaluation results—such as similarity scores, normalized distances, and histograms—are saved in the `Results/`. Use the provided Jupyter notebooks in the `src/PlotAndTables.ipynb` folder to explore and visualize the performance of different models across the evaluation criteria.]: #
# Citation
If you use ALIGN-SIM in your research, please cite our work:
```bibtex
@inproceedings{mahajan-etal-2024-align,
title = "{ALIGN}-{SIM}: A Task-Free Test Bed for Evaluating and Interpreting Sentence Embeddings through Semantic Similarity Alignment",
author = "Mahajan, Yash and
Bansal, Naman and
Blanco, Eduardo and
Karmaker, Santu",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.436/",
doi = "10.18653/v1/2024.findings-emnlp.436",
pages = "7393--7428",
}
```
# Acknowledgments
This work has been partially supported by NSF Standard Grant Award #2302974 and AFOSR Cooperative Agreement Award #FA9550-23-1-0426. We also acknowledge the support from Auburn University College of Engineering and the Department of CSSE.
|