setu4993 committed
Commit abd4e32
1 Parent(s): 1cc3355

Add model card

Files changed (1):
  1. README.md +126 -0
README.md ADDED
@@ -0,0 +1,126 @@
---
language:
- ar
- de
- en
- es
- fr
- it
- ja
- ko
- nl
- pl
- pt
- ru
- th
- tr
- zh
tags:
- bert
- sentence_embedding
- multilingual
- google
- sentence-similarity
- labse
license: apache-2.0
datasets:
- CommonCrawl
- Wikipedia
---

# smaller-LaBSE

## Model description

Smaller Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model distilled from the [original LaBSE model](https://huggingface.co/setu4993/LaBSE) down to 15 languages (from the original 109) using the techniques described in the paper ['Load What You Need: Smaller Versions of Multilingual BERT'](https://arxiv.org/abs/2010.05609); the distillation itself was done by [Ukjae Jeong](https://github.com/jeongukjae/).

- Model: [HuggingFace's model hub](https://huggingface.co/setu4993/smaller-LaBSE).
- Original model: [TensorFlow Hub](https://tfhub.dev/jeongukjae/smaller_LaBSE_15lang/1).
- Distillation source: [GitHub](https://github.com/jeongukjae/smaller-labse).
- Conversion from TensorFlow to PyTorch: [GitHub](https://github.com/setu4993/convert-labse-tf-pt).

## Usage

Using the model:

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Load the tokenizer and model, and switch to inference mode.
tokenizer = BertTokenizerFast.from_pretrained("setu4993/smaller-LaBSE")
model = BertModel.from_pretrained("setu4993/smaller-LaBSE")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

# Disable gradient tracking since only the forward pass is needed.
with torch.no_grad():
    english_outputs = model(**english_inputs)
```
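
The tokenizer returns a dict-like `BatchEncoding` whose tensors feed directly into the model. A quick sanity-check sketch (not part of the original card) to see what it holds:

```python
# Inspect the tokenizer output: padded token IDs plus the matching
# attention (and token-type) tensors, one row per input sentence.
for name, tensor in english_inputs.items():
    print(name, tuple(tensor.shape))
# Expected keys for a BERT tokenizer: input_ids, token_type_ids,
# attention_mask, each shaped (batch_size, max_sequence_length).
```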

To get the sentence embeddings, use the pooler output:

```python
english_embeddings = english_outputs.pooler_output
```
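
Each sentence maps to one fixed-size vector. A quick shape check (a sketch; the 768-dimensional hidden size is an assumption carried over from the original LaBSE architecture):

```python
# One pooled embedding per input sentence: (batch_size, hidden_size).
print(english_embeddings.shape)  # assumed: torch.Size([3, 768])
```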

Embeddings for other languages are computed the same way:

```python
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    japanese_outputs = model(**japanese_inputs)

italian_embeddings = italian_outputs.pooler_output
japanese_embeddings = japanese_outputs.pooler_output
```

To compute sentence similarity, L2-normalize the embeddings before taking the dot product:

```python
import torch.nn.functional as F


def similarity(embeddings_1, embeddings_2):
    # L2-normalize each row so the dot product equals the cosine similarity.
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )


print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))
```
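
Since the sentence lists above are parallel translations, the largest scores should sit on the diagonal of each similarity matrix. A minimal retrieval sketch on top of the `similarity` helper (an illustration, not part of the original card):

```python
# Hypothetical sketch: match each English sentence to the Italian
# sentence with the highest cosine similarity.
scores = similarity(english_embeddings, italian_embeddings)
for i, j in enumerate(scores.argmax(dim=1).tolist()):
    print(f"{english_sentences[i]!r} -> {italian_sentences[j]!r}")
```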

## Details

Details about data, training, evaluation and performance metrics are available in the [original paper](https://arxiv.org/abs/2007.01852).

### BibTeX entry and citation info

```bibtex
@misc{feng2020languageagnostic,
    title={Language-agnostic BERT Sentence Embedding},
    author={Fangxiaoyu Feng and Yinfei Yang and Daniel Cer and Naveen Arivazhagan and Wei Wang},
    year={2020},
    eprint={2007.01852},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```