---
license: apache-2.0
---

## WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings [ACL 2023]

This repository contains the code and pre-trained models for our paper [WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2305.17746).

Our code is mainly based on the code of [SimCSE](https://github.com/princeton-nlp/SimCSE); please refer to their repository for more detailed information.

## Overview

We present a whitening-based contrastive learning method for sentence embedding learning (WhitenedCSE), which combines contrastive learning with a novel shuffled group whitening.

![WhitenedCSE model overview](./figure/model.png)
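
To give a feel for the core operation, here is a minimal, self-contained sketch of shuffled group whitening applied to a batch of sentence embeddings. This is an illustration of the idea only, not the code from this repository; the function name, group count, and ZCA-style whitening are our assumptions.

```python
import torch

def shuffled_group_whitening(x: torch.Tensor, num_groups: int = 4) -> torch.Tensor:
    """Illustrative sketch (not the official implementation): randomly shuffle
    feature channels, split them into groups, and ZCA-whiten each group over
    the batch. Assumes the feature dim is divisible by num_groups."""
    n, d = x.shape
    perm = torch.randperm(d, device=x.device)          # random channel shuffle
    grouped = x[:, perm].reshape(n, num_groups, d // num_groups)
    whitened = []
    for g in range(num_groups):
        xg = grouped[:, g, :]                          # (n, d // num_groups)
        xg = xg - xg.mean(dim=0, keepdim=True)         # center over the batch
        cov = xg.T @ xg / (n - 1)                      # group covariance
        eigvals, eigvecs = torch.linalg.eigh(cov)      # symmetric eigendecomposition
        # ZCA whitening matrix: cov^(-1/2) = U diag(eigvals^(-1/2)) U^T
        w = eigvecs @ torch.diag(eigvals.clamp(min=1e-5).rsqrt()) @ eigvecs.T
        whitened.append(xg @ w)
    out = torch.cat(whitened, dim=1)
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(d, device=x.device)       # undo the shuffle
    return out[:, inv]
```

Because the channel shuffle is random, applying the operation several times to the same batch yields several differently-whitened views, which can serve as extra positives for contrastive learning.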

## Train WhitenedCSE

In the following section, we describe how to train a WhitenedCSE model using our code.

### Requirements

First, install PyTorch by following the instructions on [the official website](https://pytorch.org). To faithfully reproduce our results, please use version `1.12.1` built for your platform/CUDA version; PyTorch versions higher than `1.12.1` should also work.

```bash
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
```

Then run the following command to install the remaining dependencies:

```bash
pip install -r requirements.txt
```

For unsupervised WhitenedCSE, we sample 1 million sentences from English Wikipedia. You can download this dataset by running `data/download_wiki.sh` from within the `data/` directory:

```bash
./download_wiki.sh
```

### Training

Before training, please download the evaluation datasets (training evaluates the model on the STS-B development set every `--eval_steps` steps):

```bash
cd SentEval/data/downstream/
bash download_dataset.sh
```

Then come back to the root directory and run the following script to train the unsupervised WhitenedCSE model:
```bash
CUDA_VISIBLE_DEVICES=[gpu_ids] \
python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-whitenedcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 128 \
    --learning_rate 1e-5 \
    --num_pos 3 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --dup_type bpe \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"
```
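
Two arguments above are specific to WhitenedCSE: `--num_pos` sets the number of positive views per sentence, and `--temp` is the softmax temperature of the contrastive objective. The sketch below shows one way a multi-positive InfoNCE-style loss can combine those two settings; it is a simplified illustration under our own assumptions, not the loss function from this repository.

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(views: torch.Tensor, temp: float = 0.05) -> torch.Tensor:
    """Illustrative multi-positive contrastive loss (not the official code).

    views: (num_pos, batch, dim) -- num_pos embeddings per sentence, e.g. from
    several passes of shuffled group whitening over the same batch.
    """
    num_pos, batch, _ = views.shape                        # expects num_pos >= 2
    anchor = F.normalize(views[0], dim=-1)                 # (batch, dim)
    loss = 0.0
    for p in range(1, num_pos):
        pos = F.normalize(views[p], dim=-1)                # (batch, dim)
        logits = anchor @ pos.T / temp                     # (batch, batch) similarities
        labels = torch.arange(batch, device=views.device)  # diagonal = positives
        loss = loss + F.cross_entropy(logits, labels)      # in-batch negatives
    return loss / (num_pos - 1)
```

Under this reading, `--num_pos 3` would pair each anchor with two whitened positive views, with the rest of the batch serving as negatives.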

### Evaluation

Our evaluation code for sentence embeddings is based on a modified version of [SentEval](https://github.com/facebookresearch/SentEval). It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks.

You can evaluate any `transformers`-based pre-trained model using our evaluation code. For example,
```bash
python evaluation.py \
    --model_name_or_path <your_output_model_dir> \
    --pooler cls \
    --task_set sts \
    --mode test
```
which is expected to output the results in a tabular format:

```
------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 74.03 | 84.90 | 76.40 | 83.40 | 80.23 | 81.14        | 71.33           | 78.78 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
```
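
Once trained, the checkpoint is a standard `transformers` model, so you can also use it directly to embed sentences outside of SentEval. Below is a minimal sketch, assuming the model was trained with `--pooler_type cls` as above; the model path and sentences are placeholders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder path: point this at your own --output_dir from training.
model_name = "result/my-unsup-whitenedcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["A man is playing a guitar.", "Someone plays an instrument."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # [CLS] pooling, matching --pooler_type cls used during training.
    embeddings = model(**inputs).last_hidden_state[:, 0]

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim.item():.4f}")
```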

## Citation

Please cite our paper if you use WhitenedCSE in your work:

```bibtex
@inproceedings{zhuo2023whitenedcse,
  title={WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings},
  author={Zhuo, Wenjie and Sun, Yifan and Wang, Xiaohan and Zhu, Linchao and Yang, Yi},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={12135--12148},
  year={2023}
}
```