|
---
license: apache-2.0
---
|
## WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings [ACL 2023] |
|
|
|
This repository contains the code and pre-trained models for our paper [WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2305.17746). |
|
|
|
|
|
Our code is mainly based on the code of [SimCSE](https://github.com/princeton-nlp/SimCSE); please refer to their repository for more detailed information.
|
|
|
## Overview |
|
We present WhitenedCSE, a whitening-based contrastive learning method for sentence embeddings, which combines contrastive learning with a novel shuffled group whitening.
|
|
|
 |
|
|
|
|
|
|
|
## Train WhitenedCSE |
|
|
|
The following sections describe how to train a WhitenedCSE model using our code.
|
|
|
### Requirements |
|
|
|
First, install PyTorch by following the instructions on [the official website](https://pytorch.org). To faithfully reproduce our results, please use version `1.12.1` with the build matching your platform and CUDA version; PyTorch versions higher than `1.12.1` should also work.
|
|
|
```bash
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
```
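You can verify the install with a quick sanity check:

```python
import torch

# Confirm the expected build is installed and CUDA is visible.
print(torch.__version__)          # expect 1.12.1
print(torch.cuda.is_available())  # expect True on a GPU machine
```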
|
|
|
Then run the following command to install the remaining dependencies:
|
|
|
```bash
pip install -r requirements.txt
```
|
For unsupervised WhitenedCSE, we sample 1 million sentences from English Wikipedia; you can run `data/download_wiki.sh` to download the dataset.
|
|
|
Download the dataset:
```bash
cd data
bash download_wiki.sh
cd ..
```
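The corpus has one sentence per line; as a quick sanity check (the path matches `--train_file` in the training command below):

```python
# Count sentences in the downloaded Wikipedia corpus (one per line).
with open("data/wiki1m_for_simcse.txt", encoding="utf-8") as f:
    n_sentences = sum(1 for _ in f)
print(n_sentences)  # expect 1,000,000
```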
|
|
|
|
|
### Evaluation |
|
Our evaluation code for sentence embeddings is based on a modified version of [SentEval](https://github.com/facebookresearch/SentEval). It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. |
|
|
|
Before evaluation, please download the evaluation datasets by running |
|
```bash
cd SentEval/data/downstream/
bash download_dataset.sh
```
|
### Training

The training script evaluates on the STS-B development set every `--eval_steps` steps and keeps the best checkpoint by `stsb_spearman`, so the SentEval data above is needed before training. From the root directory, train an unsupervised WhitenedCSE model with:
|
```bash
CUDA_VISIBLE_DEVICES=[gpu_ids] \
python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-whitenedcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 128 \
    --learning_rate 1e-5 \
    --num_pos 3 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --dup_type bpe \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"
```
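For intuition on what `--num_pos` and `--temp` control, here is a hedged sketch of an InfoNCE-style objective with several positive views per sentence; `multi_positive_infonce` is a hypothetical name, and the repository's actual loss may differ in detail:

```python
import torch
import torch.nn.functional as F

def multi_positive_infonce(z: torch.Tensor, num_pos: int = 3, temp: float = 0.05) -> torch.Tensor:
    """z: (batch * num_pos, dim), the num_pos views of each sentence stored
    consecutively. Sketch only; not the repository's exact loss."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.T / temp                                   # cosine similarity / temperature
    n = z.size(0)
    sent_id = torch.arange(n, device=z.device) // num_pos  # which sentence each view belongs to
    pos_mask = sent_id.unsqueeze(0) == sent_id.unsqueeze(1)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask &= ~self_mask                                 # positives: same sentence, different view
    sim = sim.masked_fill(self_mask, float("-inf"))        # drop self-similarity
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)    # log-softmax over all other views
    return -log_prob[pos_mask].mean()
```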
|
|
|
|
|
After training, you can evaluate any `transformers`-based pre-trained model from the root directory using our evaluation code. For example:
|
```bash
python evaluation.py \
    --model_name_or_path <your_output_model_dir> \
    --pooler cls \
    --task_set sts \
    --mode test
```
|
which is expected to output the results in a tabular format: |
|
```
------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 74.03 | 84.90 | 76.40 | 83.40 | 80.23 |    81.14     |      71.33      | 78.78 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
```
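Beyond the SentEval script, the trained checkpoint can also be used directly with `transformers`. A minimal sketch, assuming the output directory from the training command above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Path produced by --output_dir in the training command above.
model_path = "result/my-unsup-whitenedcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
model.eval()

sentences = ["A man is playing a guitar.", "Someone is playing an instrument."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # [CLS] pooling, matching --pooler_type cls; with --mlp_only_train,
    # the projection MLP is dropped at inference time.
    embeddings = model(**batch).last_hidden_state[:, 0]

similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.4f}")
```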
|
|
|
|
|
|
|
|
|
|
|
## Citation |
|
|
|
Please cite our paper if you use WhitenedCSE in your work: |
|
|
|
```bibtex
@inproceedings{zhuo2023whitenedcse,
    title={WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings},
    author={Zhuo, Wenjie and Sun, Yifan and Wang, Xiaohan and Zhu, Linchao and Yang, Yi},
    booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    pages={12135--12148},
    year={2023}
}
```
|
|
|
|