luismsgomes committed on
Commit
a08f89b
verified
1 Parent(s): bd17c90

added bibtex

Files changed (1)
  1. README.md +144 -131
README.md CHANGED

---
language: pt
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---

# Serafim 100m Portuguese (PT) Sentence Encoder

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

## Usage (Sentence-Transformers)

Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder')
embeddings = model.encode(sentences)
print(embeddings)
```
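
Since the encoder is mainly intended for semantic similarity, here is a minimal follow-up sketch (it assumes only the `util.cos_sim` helper that ships with sentence-transformers) showing how the resulting embeddings can be compared:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative only: encode two sentences and compare them with cosine similarity.
model = SentenceTransformer('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder')
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)

# util.cos_sim returns a tensor of pairwise cosine similarities.
print(util.cos_sim(embeddings[0], embeddings[1]))
```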

## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder')
model = AutoModel.from_pretrained('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
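
If you also need similarity scores from these raw embeddings, a small sketch continuing from the snippet above (plain PyTorch; nothing model-specific is assumed) is to L2-normalize them and take dot products:

```python
import torch.nn.functional as F

# Continuing from the snippet above: after L2-normalization, cosine similarity
# between all sentence pairs reduces to a matrix product.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
cosine_scores = normalized @ normalized.T
print(cosine_scores)
```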

## Evaluation Results

For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=PORTULAN/serafim-100m-portuguese-pt-sentence-encoder)
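
As a hedged sketch of how such an automated evaluation could be run locally with the `mteb` package (the task name below is only a placeholder; substitute tasks that are relevant for Portuguese):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Illustrative evaluation run; "STSBenchmark" is a placeholder task name.
model = SentenceTransformer('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder')
evaluation = MTEB(tasks=["STSBenchmark"])
results = evaluation.run(model, output_folder="results")
```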

## Training

The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 296 with parameters:
```
{'batch_size': 64, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.CoSENTLoss.CoSENTLoss` with parameters:
```
{'scale': 20.0, 'similarity_fct': 'pairwise_cos_sim'}
```

Parameters of the fit() method:
```
{
    "epochs": 20,
    "evaluation_steps": 30,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 1e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": 296,
    "warmup_steps": 592,
    "weight_decay": 0.01
}
```
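
The hyperparameters above correspond to the sentence-transformers `fit()` API. A hedged sketch of how such a setup could be reproduced follows; the training examples and the starting checkpoint are placeholders, not the data or base model actually used:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder training pairs with similarity labels in [0, 1]; not the actual training data.
train_examples = [
    InputExample(texts=["Uma frase de exemplo.", "Outra frase parecida."], label=0.8),
    InputExample(texts=["Uma frase de exemplo.", "Um assunto diferente."], label=0.1),
]

# Placeholder starting checkpoint; the README does not state which base model was used.
model = SentenceTransformer('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder')

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.CoSENTLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=20,
    scheduler="WarmupLinear",
    warmup_steps=592,
    optimizer_params={"lr": 1e-5},
    weight_decay=0.01,
    max_grad_norm=1,
)
```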

## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
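
These settings can also be checked programmatically; a small sketch using standard sentence-transformers attributes:

```python
from sentence_transformers import SentenceTransformer

# Inspect sequence length, embedding dimension and pooling mode of the loaded model.
model = SentenceTransformer('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder')
print(model.max_seq_length)                      # 128, per the architecture above
print(model.get_sentence_embedding_dimension())  # 768
print(model[1].get_pooling_mode_str())           # 'mean'
```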

## Citing & Authors

The article was presented at the EPIA 2024 conference, but the Springer proceedings are not yet available.
In the meantime, if you use this model, you may cite the arXiv preprint:

```bibtex
@misc{gomes2024opensentenceembeddingsportuguese,
      title={Open Sentence Embeddings for Portuguese with the Serafim PT* encoders family},
      author={Luís Gomes and António Branco and João Silva and João Rodrigues and Rodrigo Santos},
      year={2024},
      eprint={2407.19527},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.19527},
}
```