---
license: apache-2.0
---

## WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings [ACL 2023]

This repository contains the code and pre-trained models for our paper [WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2305.17746).

Our code is mainly based on the code of SimCSE. Please refer to their repository for more detailed information.

## Overview

We present a whitening-based contrastive learning method for sentence embedding learning (WhitenedCSE), which combines contrastive learning with a novel shuffled group whitening.

![](./figure/model.png)

## Train WhitenedCSE

In the following section, we describe how to train a WhitenedCSE model using our code.

### Requirements

First, install PyTorch by following the instructions from [the official website](https://pytorch.org). To faithfully reproduce our results, please use the `1.12.1` build that matches your platform and CUDA version. PyTorch versions higher than `1.12.1` should also work.

```bash
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
```

Then run the following command to install the remaining dependencies:

```bash
pip install -r requirements.txt
```

For unsupervised WhitenedCSE, we sample 1 million sentences from English Wikipedia. Run `data/download_wiki.sh` from inside the `data` directory to download the dataset:

```bash
cd data
./download_wiki.sh
```

### Evaluation

Our evaluation code for sentence embeddings is based on a modified version of [SentEval](https://github.com/facebookresearch/SentEval). It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks.
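As a rough illustration of what an STS score measures, the sketch below computes cosine similarities between pairs of sentence embeddings and compares them with human similarity ratings through Spearman rank correlation. It is not the SentEval implementation: the function names (`cosine`, `average_ranks`, `spearman`, `sts_score`) and the toy vectors are made up for this example, and the scaling by 100 assumes scores are reported as percentages.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def sts_score(pairs, gold):
    """pairs: list of (embedding_a, embedding_b); gold: human ratings.

    Returns Spearman correlation x 100 (assumed reporting convention).
    """
    predicted = [cosine(a, b) for a, b in pairs]
    return 100 * spearman(predicted, gold)

# Toy data (hypothetical): three sentence pairs with 2-d embeddings.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]),
             ([1.0, 0.0], [1.0, 1.0]),
             ([1.0, 0.0], [0.0, 1.0])]
toy_gold = [5.0, 3.0, 0.5]
print(round(sts_score(toy_pairs, toy_gold), 2))
```

Because only the ranking of the predicted similarities matters, the score is unchanged by any monotonic rescaling of the cosine values.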
Before evaluation, please download the evaluation datasets by running

```bash
cd SentEval/data/downstream/
bash download_dataset.sh
```

The following command trains an unsupervised WhitenedCSE model on `data/wiki1m_for_simcse.txt`, starting from `bert-base-uncased`. Replace `[gpu_ids]` with the IDs of the GPUs you want to use:

```bash
CUDA_VISIBLE_DEVICES=[gpu_ids] \
python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-whitenedcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 128 \
    --learning_rate 1e-5 \
    --num_pos 3 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --dup_type bpe \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"
```

Then, back in the root directory, you can evaluate any `transformers`-based pre-trained model using our evaluation code. For example (pass your model's name or path after `--model_name_or_path`):

```bash
python evaluation.py \
    --model_name_or_path \
    --pooler cls \
    --task_set sts \
    --mode test
```

which is expected to output the results in a tabular format:

```
------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 74.03 | 84.90 | 76.40 | 83.40 | 80.23 |    81.14     |      71.33      | 78.78 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
```

## Citation

Please cite our paper if you use WhitenedCSE in your work:

```bibtex
@inproceedings{zhuo2023whitenedcse,
  title={WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings},
  author={Zhuo, Wenjie and Sun, Yifan and Wang, Xiaohan and Zhu, Linchao and Yang, Yi},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={12135--12148},
  year={2023}
}
```