dangvantuan committed on
Commit
cff7f45
·
1 Parent(s): 667cead

Update README.md

  1. README.md +137 -4
README.md CHANGED
@@ -1,8 +1,141 @@
1
  ---
2
  pipeline_tag: sentence-similarity
3
  tags:
4
- - sentence-transformers
5
- - feature-extraction
6
- - sentence-similarity
7
- - transformers
8
1
  ---
2
  pipeline_tag: sentence-similarity
3
+ language: fr
4
+ datasets:
5
+ - stsb_multi_mt
6
  tags:
7
+ - Text
8
+ - Sentence Similarity
9
+ - Sentence-Embedding
10
+ - camembert-base
11
+ license: apache-2.0
12
+ model-index:
13
+ - name: sentence-camembert-base by Van Tuan DANG
14
+ results:
15
+ - task:
16
+ name: Sentence-Embedding
17
+ type: Text Similarity
18
+ dataset:
19
+ name: Text Similarity fr
20
+ type: stsb_multi_mt
21
+ args: fr
22
+ metrics:
23
+ - name: Test Pearson correlation coefficient
24
+ type: Pearson_correlation_coefficient
25
+ value: 83.46
26
+ ---
27
+
28
+ ## Pre-trained sentence embedding model with state-of-the-art sentence embeddings for French
29
+ This model improves on [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) by fine-tuning it with [Augmented SBERT](https://aclanthology.org/2021.naacl-main.28.pdf) on the [stsb](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) dataset, using pair sampling strategies with two models: [CrossEncoder-camembert-large](https://huggingface.co/dangvantuan/CrossEncoder-camembert-large) and [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large).
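
To make the Augmented SBERT idea more concrete, here is a minimal sketch of the data-augmentation step only: sample new sentence pairs, label them with a scoring model to get "silver" pairs, and merge them with the gold pairs before training a bi-encoder. Everything in it is an assumption for illustration: `toy_cross_encoder_score` is a hypothetical token-overlap stand-in for a real cross-encoder, the sampling is plain random sampling rather than the strategies used for this model, and the sentences are toy data.

```python
import itertools
import random


def toy_cross_encoder_score(a: str, b: str) -> float:
    """Hypothetical stand-in for a cross-encoder: Jaccard token overlap in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def build_silver_pairs(sentences, gold_pairs, n_pairs, seed=0):
    """Randomly sample unlabeled pairs (not already in gold) and label them."""
    rng = random.Random(seed)
    seen = {frozenset(pair[:2]) for pair in gold_pairs}
    candidates = [
        (a, b)
        for a, b in itertools.combinations(sentences, 2)
        if frozenset((a, b)) not in seen
    ]
    sampled = rng.sample(candidates, min(n_pairs, len(candidates)))
    return [(a, b, toy_cross_encoder_score(a, b)) for a, b in sampled]


# Toy gold data: (sentence1, sentence2, similarity in [0, 1]).
gold_pairs = [("Un homme joue de la flûte.", "Un homme joue d'une grande flûte.", 0.8)]
sentences = sorted(
    {s for a, b, _ in gold_pairs for s in (a, b)}
    | {"Un avion est en train de décoller.", "Une personne plie un morceau de papier."}
)

silver_pairs = build_silver_pairs(sentences, gold_pairs, n_pairs=3)
train_pairs = gold_pairs + silver_pairs  # the bi-encoder would be trained on this union
```

In a real pipeline the scoring function would be a trained cross-encoder and the candidate pairs would come from a more selective sampling strategy; the sketch only shows how gold and silver pairs are combined.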
30
+ ## Usage
31
+ The model can be used directly (without a language model) as follows:
32
+
33
+ ```python
34
+ from sentence_transformers import SentenceTransformer
35
+ model = SentenceTransformer("Lajavaness/sentence-camembert-base")
36
+
37
+ sentences = ["Un avion est en train de décoller.",
38
+              "Un homme joue d'une grande flûte.",
39
+              "Un homme étale du fromage râpé sur une pizza.",
40
+              "Une personne jette un chat au plafond.",
41
+              "Une personne est en train de plier un morceau de papier.",
42
+ ]
43
+
44
+ embeddings = model.encode(sentences)
45
+ ```
46
+
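
Once `embeddings` has been computed, a common way to compare two sentences is the cosine similarity of their vectors. The sketch below shows that computation in plain Python on toy vectors, which are an assumption standing in for real `model.encode` output so that the example runs without downloading the model. The `sentence_transformers.util.cos_sim` helper provides the same operation on real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (illustrative only; real ones are much larger).
toy_embeddings = [
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],  # same direction as the first vector
    [0.0, 3.0, 0.0],  # orthogonal to the first vector
]

sim_same_direction = cosine_similarity(toy_embeddings[0], toy_embeddings[1])
sim_orthogonal = cosine_similarity(toy_embeddings[0], toy_embeddings[2])
```

Cosine similarity ignores vector length, which is why the first two toy vectors count as identical in direction even though one is twice as long.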
47
+ ## Evaluation
48
+ The model can be evaluated as follows on the French test data of stsb.
49
+
50
+ ```python
51
+ from sentence_transformers import SentenceTransformer
52
+ from sentence_transformers.readers import InputExample
53
+ from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
54
+ from datasets import load_dataset
55
+ def convert_dataset(dataset):
56
+     dataset_samples = []
57
+     for df in dataset:
58
+         score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
59
+         inp_example = InputExample(texts=[df['sentence1'],
60
+                                           df['sentence2']], label=score)
61
+         dataset_samples.append(inp_example)
62
+     return dataset_samples
63
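+
+ # Load the model to evaluate (same checkpoint as in the Usage section)
+ model = SentenceTransformer("Lajavaness/sentence-camembert-base")
+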
64
+ # Loading the dataset for evaluation
65
+ df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
66
+ df_test = load_dataset("stsb_multi_mt", name="fr", split="test")
67
+
68
+ # Convert the dataset for evaluation
69
+
70
+ # For Dev set:
71
+ dev_samples = convert_dataset(df_dev)
72
+ val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
73
+ val_evaluator(model, output_path="./")
74
+
75
+ # For Test set:
76
+ test_samples = convert_dataset(df_test)
77
+ test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
78
+ test_evaluator(model, output_path="./")
79
+ ```
80
+
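
For readers who want to see what the two reported metrics measure, here is a minimal pure-Python sketch of Pearson and Spearman correlation between gold scores (normalized to 0 ... 1 as in `convert_dataset`) and predicted similarities. The numbers are toy values chosen for illustration, not results from this model, and the functions assume non-constant inputs. The tables in this README appear to report these correlations multiplied by 100.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: raw gold scores on the 0 ... 5 scale, and made-up predicted similarities.
gold = [s / 5.0 for s in [5.0, 3.2, 1.0, 0.0]]
predicted = [0.9, 0.8, 0.1, 0.2]

pearson_score = pearson(gold, predicted)
spearman_score = spearman(gold, predicted)
```

Pearson is sensitive to how linearly the predictions track the gold scores, while Spearman depends only on their ordering, which is why the two tables can rank models differently.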
81
+ **Test Result**:
82
+ The performance is measured using Pearson and Spearman correlation:
83
+ - On dev
84
+
85
+
86
+ | Model | Pearson correlation | Spearman correlation | #params |
87
+ | ------------- | ------------- | ------------- |------------- |
88
+ | [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base)| 86.88 |86.73 | 110M |
89
+ | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base)| 86.73 |86.54 | 110M |
90
+ | [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts)| 85.85 |85.71 | 137M |
91
+ | [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 79.22 | 79.16|135M |
92
+
93
+
94
+ - On test
95
+
96
+ **Pearson score**
97
+ | Model | STS-B | STS12-fr | STS13-fr | STS14-fr | STS15-fr | STS16-fr | SICK-fr | params |
98
+ |-----------------------------------------------------------|---------|----------|----------|----------|----------|----------|---------|--------|
99
+ | [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 83.46 | 84.49 | 84.61 | 83.94 | 86.94 | 75.20 | 82.86 | 110M |
100
+ | [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 82.82 | 84.79 | 85.76 | 82.81 | 85.38 | 74.05 | 82.23 | 137M |
101
+ | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 82.36 | 82.06 | 84.08 | 81.51 | 85.54 | 73.97 | 80.91 | 110M |
102
+ | [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased)| 78.63 | 72.51 | 67.25 | 70.12 | 79.93 | 66.67 | 77.76 | 135M |
103
+ | [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 78.38 | 79.00 | 77.61 | 76.56 | 79.03 | 71.22 | 80.58 | 137M |
104
+ | [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 76.97 | 71.43 | 73.50 | 70.56 | 78.44 | 71.23 | 77.62 | 110M |
105
+
106
+
107
+ **Spearman score**
108
+ | Model | STS-B | STS12-fr | STS13-fr | STS14-fr | STS15-fr | STS16-fr | SICK-fr | params |
109
+ |-----------------------------------------------------------|---------|----------|----------|----------|----------|----------|---------|--------|
110
+ | [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 83.07 | 77.34 | 85.88 | 80.96 | 85.70 | 76.43 | 77.00 | 137M |
111
+ | [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 82.92 | 77.71 | 84.19 | 81.83 | 87.04 | 76.81 | 76.36 | 110M |
112
+ | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 81.64 | 75.45 | 83.86 | 78.63 | 85.66 | 75.36 | 74.18 | 110M |
113
+ | [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 77.49 | 69.80 | 68.85 | 68.17 | 80.27 | 70.04 | 72.49 | 135M |
114
+ | [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 76.93 | 68.96 | 77.62 | 71.87 | 79.33 | 72.86 | 73.91 | 137M |
115
+ | [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 75.55 | 66.89 | 73.90 | 67.14 | 78.78 | 72.64 | 72.03 | 110M |
116
+
117
+
118
+ ## Citation
119
+
120
+
121
+ @article{reimers2019sentence,
122
+ title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
123
+ author={Reimers, Nils and Gurevych, Iryna},
124
+ journal={https://arxiv.org/abs/1908.10084},
125
+ year={2019}
126
+ }
127
+
128
 
129
+ @article{martin2020camembert,
130
+ title={CamemBERT: a Tasty French Language Model},
131
+ author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
132
+ journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
133
+ year={2020}
134
+ }
135
+ @article{thakur2020augmented,
136
+ title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
137
+ author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
138
+ journal={arXiv e-prints},
139
+ pages={arXiv--2010},
140
+ year={2020}
141
+ }