shail-2512 committed
Commit edeaaa5 · verified · 1 Parent(s): 52ea1f8

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,815 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:6300
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ base_model: nomic-ai/nomic-embed-text-v1.5
+ widget:
+ - source_sentence: Where in the Annual Report can one find a description of certain
+     legal matters and their impact on the company?
+   sentences:
+   - Apollo coordinates the delivery of new features, security updates, and platform
+     configurations, ensuring the continuous operation of systems in any environment.
+     It was introduced commercially in 2021.
+   - In the Annual Report on Form 10-K, 'Item 1A. Risk Factors' provides a further
+     description of certain legal matters and their impact on the company.
+   - During fiscal 2022, we opened four new stores in Mexico.
+ - source_sentence: How does the company assess uncertain tax positions?
+   sentences:
+   - We recognize tax benefits from uncertain tax positions only if we believe that
+     it is more likely than not that the tax position will be sustained on examination
+     by the taxing authorities based on the technical merits of the position.
+   - CMS uses a risk-adjustment model which adjusts premiums paid to Medicare Advantage,
+     or MA, plans according to health status of covered members. The risk-adjustment
+     model, which CMS implemented pursuant to the Balanced Budget Act of 1997 (BBA)
+     and the Benefits Improvement and Protection Act of 2000 (BIPA), generally pays
+     more where a plan's membership has higher expected costs. Under this model, rates
+     paid to MA plans are based on actuarially determined bids, which include a process
+     whereby our prospective payments are based on our estimated cost of providing
+     standard Medicare-covered benefits to an enrollee with a 'national average risk
+     profile.' That baseline payment amount is adjusted to account for certain demographic
+     characteristics and health status of our enrolled members.
+   - Walmart Inc. reported total revenues of $611,289 million for the fiscal year ended
+     January 31, 2023.
+ - source_sentence: When does the 364-day facility entered into in August 2023 expire,
+     and what is its total amount?
+   sentences:
+   - In 2023, the total revenue generated by Emgality amounted to 678.3.
+   - In August 2023, we entered into a new 364-day facility. The 364-day facility of
+     $3.15 billion expires in August 2024.
+   - Diluted EPS increased $0.09, or 2%, to $5.90 as the decrease in net earnings was
+     more than fully offset by a reduction in shares outstanding.
+ - source_sentence: What does the company believe adds significant value to its business
+     regarding intellectual property?
+   sentences:
+   - We believe that, to varying degrees, our trademarks, trade names, copyrights,
+     proprietary processes, trade secrets, trade dress, domain names and similar intellectual
+     property add significant value to our business
+   - Railroad operating revenues declined 6.9% in 2023 compared to 2022, reflecting
+     an overall volume decrease of 5.7% and a decrease in average revenue per car/unit
+     of 0.6%, primarily attributable to lower fuel surcharge revenue, partially offset
+     by favorable price and mix.
+   - Cash provided by operating activities increased from $26.413 billion in 2022 to
+     $28.501 billion in 2023, an increase of approximately $2.088 billion.
+ - source_sentence: How are government incentives treated in accounting according to
+     the given information?
+   sentences:
+   - The components of 'Other income (expense), net' for the year ended December 30,
+     2023, were $197 million; for December 31, 2022, they were $8 million; and for
+     December 25, 2021, they were $55 million.
+   - We are entitled to certain advanced manufacturing production credits under the
+     IRA, and government incentives are not accounted for or classified as an income
+     tax credit. We account for government incentives as a reduction of expense, a
+     reduction of the cost of the capital investment or other income based on the substance
+     of the incentive received. Benefits are generally recorded when there is reasonable
+     assurance of receipt or, as it relates with advanced manufacturing production
+     credits, upon the generation of the credit.
+   - Basic net income per share is computed by dividing net income attributable to
+     common stock by the weighted-average number of shares of common stock outstanding
+     during the period.
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ model-index:
+ - name: Nomic Embed Financial Matryoshka
+   results:
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 768
+       type: dim_768
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.7185714285714285
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.87
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.9014285714285715
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9357142857142857
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.7185714285714285
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.29
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.18028571428571427
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09357142857142857
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.7185714285714285
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.87
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.9014285714285715
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9357142857142857
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8337966812161252
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.8004784580498868
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.8030662019934727
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 512
+       type: dim_512
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.7157142857142857
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.8685714285714285
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.9028571428571428
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9342857142857143
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.7157142857142857
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.2895238095238095
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.18057142857142855
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09342857142857142
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.7157142857142857
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.8685714285714285
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.9028571428571428
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9342857142857143
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8320816465681472
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.7986201814058957
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.8013251784905495
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 256
+       type: dim_256
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.7028571428571428
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.86
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.8914285714285715
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9271428571428572
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.7028571428571428
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.2866666666666667
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.17828571428571427
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09271428571428571
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.7028571428571428
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.86
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.8914285714285715
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9271428571428572
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8208030315973883
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.7862023809523814
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.7893111186082761
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 128
+       type: dim_128
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.7
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.8428571428571429
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.8771428571428571
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9271428571428572
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.7
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.28095238095238095
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.1754285714285714
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09271428571428571
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.7
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.8428571428571429
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.8771428571428571
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9271428571428572
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8174548081454337
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.7820821995464855
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.7852661387487447
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 64
+       type: dim_64
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.69
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.83
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.8671428571428571
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9128571428571428
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.69
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.27666666666666667
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.1734285714285714
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09128571428571428
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.69
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.83
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.8671428571428571
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9128571428571428
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.804303333645382
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.769315192743764
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.7729055647510643
+       name: Cosine Map@100
+ ---
+
+ # Nomic Embed Financial Matryoshka
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) <!-- at revision d802ae16c9caed4d197895d27c6d529434cd8c6d -->
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Training Dataset:**
+     - json
+ - **Language:** en
+ - **License:** apache-2.0
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
+ ```
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("shail-2512/nomic-embed-financial-matryoshka")
+ # Run inference
+ sentences = [
+     'How are government incentives treated in accounting according to the given information?',
+     'We are entitled to certain advanced manufacturing production credits under the IRA, and government incentives are not accounted for or classified as an income tax credit. We account for government incentives as a reduction of expense, a reduction of the cost of the capital investment or other income based on the substance of the incentive received. Benefits are generally recorded when there is reasonable assurance of receipt or, as it relates with advanced manufacturing production credits, upon the generation of the credit.',
+     'Basic net income per share is computed by dividing net income attributable to common stock by the weighted-average number of shares of common stock outstanding during the period.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
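+
+ Because the model was trained with MatryoshkaLoss, the leading embedding dimensions carry most of the signal, so vectors can also be truncated to 512, 256, 128, or 64 dimensions for cheaper storage and retrieval. A minimal sketch, assuming a sentence-transformers version that supports the `truncate_dim` argument (2.7 or later):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Keep only the first 256 dimensions of every embedding (assumed setting for illustration).
+ model = SentenceTransformer(
+     "shail-2512/nomic-embed-financial-matryoshka",
+     truncate_dim=256,
+ )
+
+ sentences = [
+     "How does the company assess uncertain tax positions?",
+     "We recognize tax benefits from uncertain tax positions only if we believe that it is more likely than not that the tax position will be sustained on examination by the taxing authorities based on the technical merits of the position.",
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (2, 256)
+
+ # Cosine similarity on the truncated vectors
+ print(model.similarity(embeddings, embeddings))
+ ```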
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+
+ * Datasets: `dim_768`, `dim_512`, `dim_256`, `dim_128` and `dim_64`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | dim_768    | dim_512    | dim_256    | dim_128    | dim_64     |
+ |:--------------------|:-----------|:-----------|:-----------|:-----------|:-----------|
+ | cosine_accuracy@1   | 0.7186     | 0.7157     | 0.7029     | 0.7        | 0.69       |
+ | cosine_accuracy@3   | 0.87       | 0.8686     | 0.86       | 0.8429     | 0.83       |
+ | cosine_accuracy@5   | 0.9014     | 0.9029     | 0.8914     | 0.8771     | 0.8671     |
+ | cosine_accuracy@10  | 0.9357     | 0.9343     | 0.9271     | 0.9271     | 0.9129     |
+ | cosine_precision@1  | 0.7186     | 0.7157     | 0.7029     | 0.7        | 0.69       |
+ | cosine_precision@3  | 0.29       | 0.2895     | 0.2867     | 0.281      | 0.2767     |
+ | cosine_precision@5  | 0.1803     | 0.1806     | 0.1783     | 0.1754     | 0.1734     |
+ | cosine_precision@10 | 0.0936     | 0.0934     | 0.0927     | 0.0927     | 0.0913     |
+ | cosine_recall@1     | 0.7186     | 0.7157     | 0.7029     | 0.7        | 0.69       |
+ | cosine_recall@3     | 0.87       | 0.8686     | 0.86       | 0.8429     | 0.83       |
+ | cosine_recall@5     | 0.9014     | 0.9029     | 0.8914     | 0.8771     | 0.8671     |
+ | cosine_recall@10    | 0.9357     | 0.9343     | 0.9271     | 0.9271     | 0.9129     |
+ | **cosine_ndcg@10**  | **0.8338** | **0.8321** | **0.8208** | **0.8175** | **0.8043** |
+ | cosine_mrr@10       | 0.8005     | 0.7986     | 0.7862     | 0.7821     | 0.7693     |
+ | cosine_map@100      | 0.8031     | 0.8013     | 0.7893     | 0.7853     | 0.7729     |
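+
+ The per-dimension columns come from one `InformationRetrievalEvaluator` per Matryoshka dimension. A sketch of how such an evaluation can be reproduced, assuming the 3.x evaluator API (including its `truncate_dim` argument) and using illustrative placeholder data rather than the actual evaluation split:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.evaluation import InformationRetrievalEvaluator, SequentialEvaluator
+
+ # Placeholder query/corpus/relevance data; the real evaluation uses the 700-sample split.
+ queries = {"q1": "How does the company assess uncertain tax positions?"}
+ corpus = {"d1": "We recognize tax benefits from uncertain tax positions only if we believe that it is more likely than not that the tax position will be sustained on examination by the taxing authorities based on the technical merits of the position."}
+ relevant_docs = {"q1": {"d1"}}
+
+ model = SentenceTransformer("shail-2512/nomic-embed-financial-matryoshka")
+
+ # One evaluator per dimension; `truncate_dim` clips the embeddings before scoring.
+ evaluators = [
+     InformationRetrievalEvaluator(
+         queries=queries,
+         corpus=corpus,
+         relevant_docs=relevant_docs,
+         name=f"dim_{dim}",
+         truncate_dim=dim,
+     )
+     for dim in (768, 512, 256, 128, 64)
+ ]
+ results = SequentialEvaluator(evaluators)(model)
+ print(results)
+ ```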
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### json
+
+ * Dataset: json
+ * Size: 6,300 training samples
+ * Columns: <code>anchor</code> and <code>positive</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | anchor                                                                            | positive                                                                           |
+   |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
+   | type    | string                                                                            | string                                                                             |
+   | details | <ul><li>min: 2 tokens</li><li>mean: 20.65 tokens</li><li>max: 45 tokens</li></ul> | <ul><li>min: 2 tokens</li><li>mean: 46.29 tokens</li><li>max: 326 tokens</li></ul> |
+ * Samples:
+   | anchor | positive |
+   |:-------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+   | <code>Where is the Investor Relations office of Intuit Inc. located?</code> | <code>Copies of this Annual Report on Form 10-K may also be obtained without charge by contacting Investor Relations, Intuit Inc., P.O. Box 7850, Mountain View, California 94039-7850, calling 650-944-6000, or emailing [email protected].</code> |
+   | <code>Where is the Financial Statement Schedule located in the Form 10-K?</code> | <code>The Financial Statement Schedule is found on page S-1 of the Form 10-K.</code> |
+   | <code>What factors are considered when evaluating the realization of deferred tax assets?</code> | <code>Many factors are considered when assessing whether it is more likely than not that the deferred tax assets will be realized, including recent cumulative earnings, expectations of future taxable income, carryforward periods and other relevant quantitative and qualitative factors.</code> |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [
+           768,
+           512,
+           256,
+           128,
+           64
+       ],
+       "matryoshka_weights": [
+           1,
+           1,
+           1,
+           1,
+           1
+       ],
+       "n_dims_per_step": -1
+   }
+   ```
+
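+ A minimal sketch of how this loss configuration is typically assembled with the sentence-transformers 3.x API (loading the base model with `trust_remote_code=True` is an assumption needed for the custom NomicBert code):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
+
+ # Base model being fine-tuned; trust_remote_code is assumed for NomicBertModel.
+ model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
+
+ # MatryoshkaLoss wraps the inner ranking loss and applies it at each
+ # truncation dimension, with equal weights as listed above.
+ inner_loss = MultipleNegativesRankingLoss(model)
+ loss = MatryoshkaLoss(
+     model,
+     inner_loss,
+     matryoshka_dims=[768, 512, 256, 128, 64],
+ )
+ ```
+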
+ ### Evaluation Dataset
+
+ #### json
+
+ * Dataset: json
+ * Size: 700 evaluation samples
+ * Columns: <code>anchor</code> and <code>positive</code>
+ * Approximate statistics based on the first 700 samples:
+   |         | anchor                                                                            | positive                                                                           |
+   |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
+   | type    | string                                                                            | string                                                                             |
+   | details | <ul><li>min: 2 tokens</li><li>mean: 20.71 tokens</li><li>max: 45 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 46.74 tokens</li><li>max: 248 tokens</li></ul> |
+ * Samples:
+   | anchor | positive |
+   |:--------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+   | <code>What fiscal changes did Garmin make in January 2023?</code> | <code>The Company announced an organization realignment in January 2023, which combined the consumer auto operating segment with the outdoor operating segment.</code> |
+   | <code>Where are the details about 'Legal Matters' and 'Government Investigations, Audits and Reviews' located in the financial statements?</code> | <code>The information required by this Item 3 is incorporated herein by reference to the information set forth under the captions 'Legal Matters' and 'Government Investigations, Audits and Reviews' in Note 12 of the Notes to the Consolidated Financial Statements included in Part II, Item 8, 'Financial Statements and Supplementary Data'.</code> |
+   | <code>Are the pages of IBM's Management’s Discussion and Analysis section in the 2023 Annual Report included in the report itself?</code> | <code>In IBM’s 2023 Annual Report, the pages containing Management’s Discussion and Analysis of Financial Condition and Results of Operations (pages 6 through 40) are incorporated by reference.</code> |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [
+           768,
+           512,
+           256,
+           128,
+           64
+       ],
+       "matryoshka_weights": [
+           1,
+           1,
+           1,
+           1,
+           1
+       ],
+       "n_dims_per_step": -1
+   }
+   ```
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: epoch
+ - `gradient_accumulation_steps`: 8
+ - `learning_rate`: 2e-05
+ - `lr_scheduler_type`: cosine
+ - `warmup_ratio`: 0.1
+ - `bf16`: True
+ - `load_best_model_at_end`: True
+ - `optim`: adamw_torch_fused
+ - `batch_sampler`: no_duplicates
+
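+ These values map onto `SentenceTransformerTrainingArguments` roughly as in the sketch below; `output_dir` and `save_strategy` are illustrative assumptions (they are not listed above), and the remaining arguments keep their defaults:
+
+ ```python
+ from sentence_transformers import SentenceTransformerTrainingArguments
+ from sentence_transformers.training_args import BatchSamplers
+
+ args = SentenceTransformerTrainingArguments(
+     output_dir="nomic-embed-financial-matryoshka",  # illustrative
+     num_train_epochs=3,
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=8,
+     learning_rate=2e-5,
+     lr_scheduler_type="cosine",
+     warmup_ratio=0.1,
+     bf16=True,
+     eval_strategy="epoch",
+     save_strategy="epoch",  # assumed, so load_best_model_at_end can match eval_strategy
+     load_best_model_at_end=True,
+     optim="adamw_torch_fused",
+     batch_sampler=BatchSamplers.NO_DUPLICATES,
+ )
+ ```
+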
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: epoch
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 8
+ - `per_device_eval_batch_size`: 8
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 8
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 3
+ - `max_steps`: -1
+ - `lr_scheduler_type`: cosine
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: True
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: True
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: no_duplicates
+ - `multi_dataset_batch_sampler`: proportional
+
+ </details>
+
+ ### Training Logs
+ | Epoch   | Step   | Training Loss | Validation Loss | dim_768_cosine_ndcg@10 | dim_512_cosine_ndcg@10 | dim_256_cosine_ndcg@10 | dim_128_cosine_ndcg@10 | dim_64_cosine_ndcg@10 |
+ |:-------:|:------:|:-------------:|:---------------:|:----------------------:|:----------------------:|:----------------------:|:----------------------:|:---------------------:|
+ | 0.1015  | 10     | 0.2626        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.2030  | 20     | 0.1764        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.1015  | 10     | 0.0311        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.2030  | 20     | 0.0259        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.1015  | 10     | 0.0056        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.2030  | 20     | 0.0064        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.1015  | 10     | 0.0016        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.2030  | 20     | 0.0015        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.1015  | 10     | 0.0006        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.2030  | 20     | 0.0006        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.3046  | 30     | 0.1324        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.4061  | 40     | 0.113         | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.5076  | 50     | 0.128         | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.6091  | 60     | 0.1134        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.7107  | 70     | 0.056         | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.8122  | 80     | 0.1086        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 0.9137  | 90     | 0.1008        | -               | -                      | -                      | -                      | -                      | -                     |
+ | **1.0** | **99** | **-**         | **0.0771**      | **0.8286**             | **0.8306**             | **0.8266**             | **0.8197**             | **0.7955**            |
+ | 1.0102  | 100    | 0.0491        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 1.1117  | 110    | 0.0029        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 1.2132  | 120    | 0.0009        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 1.3147  | 130    | 0.0326        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 1.4162  | 140    | 0.0077        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 1.5178  | 150    | 0.0109        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 1.6193  | 160    | 0.0047        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 1.7208  | 170    | 0.004         | -               | -                      | -                      | -                      | -                      | -                     |
+ | 1.8223  | 180    | 0.0122        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 1.9239  | 190    | 0.0043        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.0     | 198    | -             | 0.0758          | 0.8296                 | 0.8330                 | 0.8222                 | 0.8169                 | 0.7998                |
+ | 2.0203  | 200    | 0.0032        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.1218  | 210    | 0.0002        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.2234  | 220    | 0.0002        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.3249  | 230    | 0.0097        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.4264  | 240    | 0.0012        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.5279  | 250    | 0.0012        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.6294  | 260    | 0.0009        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.7310  | 270    | 0.0007        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.8325  | 280    | 0.0019        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.9340  | 290    | 0.0009        | -               | -                      | -                      | -                      | -                      | -                     |
+ | 2.9746  | 294    | -             | 0.0744          | 0.8338                 | 0.8321                 | 0.8208                 | 0.8175                 | 0.8043                |
+
+ * The bold row denotes the saved checkpoint.
+
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 3.3.1
+ - Transformers: 4.47.0
+ - PyTorch: 2.5.1+cu121
+ - Accelerate: 1.1.1
+ - Datasets: 3.1.0
+ - Tokenizers: 0.21.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "_name_or_path": "nomic-ai/nomic-embed-text-v1.5",
+   "activation_function": "swiglu",
+   "architectures": [
+     "NomicBertModel"
+   ],
+   "attn_pdrop": 0.0,
+   "auto_map": {
+     "AutoConfig": "nomic-ai/nomic-bert-2048--configuration_hf_nomic_bert.NomicBertConfig",
+     "AutoModel": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertModel",
+     "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining"
+   },
+   "bos_token_id": null,
+   "causal": false,
+   "dense_seq_output": true,
+   "embd_pdrop": 0.0,
+   "eos_token_id": null,
+   "fused_bias_fc": true,
+   "fused_dropout_add_ln": true,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-12,
+   "max_trained_positions": 2048,
+   "mlp_fc1_bias": false,
+   "mlp_fc2_bias": false,
+   "model_type": "nomic_bert",
+   "n_embd": 768,
+   "n_head": 12,
+   "n_inner": 3072,
+   "n_layer": 12,
+   "n_positions": 8192,
+   "pad_vocab_size_multiple": 64,
+   "parallel_block": false,
+   "parallel_block_tied_norm": false,
+   "prenorm": false,
+   "qkv_proj_bias": false,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.0,
+   "rotary_emb_base": 1000,
+   "rotary_emb_fraction": 1.0,
+   "rotary_emb_interleaved": false,
+   "rotary_emb_scale_base": null,
+   "rotary_scaling_factor": null,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.0,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.47.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "use_flash_attn": true,
+   "use_rms_norm": false,
+   "use_xentropy": true,
+   "vocab_size": 30528
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.3.1",
+     "transformers": "4.47.0",
+     "pytorch": "2.5.1+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bd36670fd965dc9373726968a7600f730f69e5bf5b0f935170d70a7f1c8aa807
+ size 546938168
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 8192,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 8192,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff