lhchan committed on
Commit fd3ffa7 · 1 Parent(s): db96cef

update README

Files changed (1)
  1. README.txt +38 -3
README.txt CHANGED
@@ -11,15 +11,50 @@ Finnish Sentence BERT trained from FinBERT
 
 ## Usage
 
-Please refer to the [HuggingFace documentation] (https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens)
+The same as in the [HuggingFace documentation](https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens). Either through `SentenceTransformer` or `HuggingFace Transformers`.
 
-Briefly, using the `SentenceTransformer` library,
+### SentenceTransformer
 
 ```
 from sentence_transformers import SentenceTransformer
 sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]
 
-model = SentenceTransformer('sbert-cased-finnish-paraphrase')
+model = SentenceTransformer('TurkuNLP/sbert-cased-finnish-paraphrase')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
+
+### HuggingFace Transformers
+
+```
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+
+# Mean Pooling - Take attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+# Sentences we want sentence embeddings for
+sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]
+
+# Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained('TurkuNLP/sbert-cased-finnish-paraphrase')
+model = AutoModel.from_pretrained('TurkuNLP/sbert-cased-finnish-paraphrase')
+
+# Tokenize sentences
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+# Compute token embeddings
+with torch.no_grad():
+    model_output = model(**encoded_input)
+
+# Perform pooling. In this case, mean pooling.
+sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+print("Sentence embeddings:")
+print(sentence_embeddings)
+```
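The mean-pooling helper added in this diff can be sanity-checked in isolation, without downloading the model. A minimal sketch with a toy batch (the tensor shapes and values below are invented for illustration; only `torch` is assumed):

```python
import torch

# Same pooling logic as the README's mean_pooling: average the token
# embeddings, ignoring positions where attention_mask is 0.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # (batch, seq_len, hidden)
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Toy batch: 1 sentence, 3 tokens (the last one is padding), hidden size 2.
token_emb = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

pooled = mean_pooling((token_emb,), attention_mask)
print(pooled)  # padding row ignored: mean of the first two tokens -> [[2., 3.]]

# Sentence embeddings are typically compared with cosine similarity.
sim = torch.nn.functional.cosine_similarity(pooled, pooled, dim=1)
print(sim)  # a vector compared with itself -> [1.]
```

The masked average matters: without the attention mask, the padding row (here `[99., 99.]`) would be averaged in and corrupt the sentence embedding.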