lufercho committed on
Commit
239e10f
·
verified ·
1 Parent(s): 18dc022

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
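
The pooling settings above enable only `pooling_mode_mean_tokens`, so a sentence embedding would be the average of its token embeddings. The following is a minimal pure-Python sketch of that idea, assuming padding tokens are excluded through an attention mask. The function name `mean_pool` and the toy inputs are hypothetical and are not the library's implementation.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy input: three 2-dimensional token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

With the real model the vectors would have 768 entries instead of 2, matching `word_embedding_dimension`.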
README.md ADDED
@@ -0,0 +1,730 @@
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ - generated_from_trainer
7
+ - dataset_size:5000
8
+ - loss:MultipleNegativesRankingLoss
9
+ base_model: lufercho/my-finetuned-bert-mlm
10
+ widget:
11
+ - source_sentence: "A Comprehensive Approach to Universal Piecewise Nonlinear Regression\n\
12
+ \ Based on Trees"
13
+ sentences:
14
+ - " In sparse recovery we are given a matrix $A$ (the dictionary) and a vector\
15
+ \ of\nthe form $A X$ where $X$ is sparse, and the goal is to recover $X$. This\
16
+ \ is a\ncentral notion in signal processing, statistics and machine learning.\
17
+ \ But in\napplications such as sparse coding, edge detection, compression and\
18
+ \ super\nresolution, the dictionary $A$ is unknown and has to be learned from\
19
+ \ random\nexamples of the form $Y = AX$ where $X$ is drawn from an appropriate\n\
20
+ distribution --- this is the dictionary learning problem. In most settings, $A$\n\
21
+ is overcomplete: it has more columns than rows. This paper presents a\npolynomial-time\
22
+ \ algorithm for learning overcomplete dictionaries; the only\npreviously known\
23
+ \ algorithm with provable guarantees is the recent work of\nSpielman, Wang and\
24
+ \ Wright who gave an algorithm for the full-rank case, which\nis rarely the case\
25
+ \ in applications. Our algorithm applies to incoherent\ndictionaries which have\
26
+ \ been a central object of study since they were\nintroduced in seminal work of\
27
+ \ Donoho and Huo. In particular, a dictionary is\n$\\mu$-incoherent if each pair\
28
+ \ of columns has inner product at most $\\mu /\n\\sqrt{n}$.\n The algorithm makes\
29
+ \ natural stochastic assumptions about the unknown sparse\nvector $X$, which can\
30
+ \ contain $k \\leq c \\min(\\sqrt{n}/\\mu \\log n, m^{1/2\n-\\eta})$ non-zero\
31
+ \ entries (for any $\\eta > 0$). This is close to the best $k$\nallowable by the\
32
+ \ best sparse recovery algorithms even if one knows the\ndictionary $A$ exactly.\
33
+ \ Moreover, both the running time and sample complexity\ndepend on $\\log 1/\\\
34
+ epsilon$, where $\\epsilon$ is the target accuracy, and so\nour algorithms converge\
35
+ \ very quickly to the true dictionary. Our algorithm can\nalso tolerate substantial\
36
+ \ amounts of noise provided it is incoherent with\nrespect to the dictionary (e.g.,\
37
+ \ Gaussian). In the noisy setting, our running\ntime and sample complexity depend\
38
+ \ polynomially on $1/\\epsilon$, and this is\nnecessary.\n"
39
+ - ' In this paper, we investigate adaptive nonlinear regression and introduce
40
+
41
+ tree based piecewise linear regression algorithms that are highly efficient and
42
+
43
+ provide significantly improved performance with guaranteed upper bounds in an
44
+
45
+ individual sequence manner. We use a tree notion in order to partition the
46
+
47
+ space of regressors in a nested structure. The introduced algorithms adapt not
48
+
49
+ only their regression functions but also the complete tree structure while
50
+
51
+ achieving the performance of the "best" linear mixture of a doubly exponential
52
+
53
+ number of partitions, with a computational complexity only polynomial in the
54
+
55
+ number of nodes of the tree. While constructing these algorithms, we also avoid
56
+
57
+ using any artificial "weighting" of models (with highly data dependent
58
+
59
+ parameters) and, instead, directly minimize the final regression error, which
60
+
61
+ is the ultimate performance goal. The introduced methods are generic such that
62
+
63
+ they can readily incorporate different tree construction methods such as random
64
+
65
+ trees in their framework and can use different regressor or partitioning
66
+
67
+ functions as demonstrated in the paper.
68
+
69
+ '
70
+ - ' In this paper we propose a multi-task linear classifier learning problem
71
+
72
+ called D-SVM (Dictionary SVM). D-SVM uses a dictionary of parameter covariance
73
+
74
+ shared by all tasks to do multi-task knowledge transfer among different tasks.
75
+
76
+ We formally define the learning problem of D-SVM and show two interpretations
77
+
78
+ of this problem, from both the probabilistic and kernel perspectives. From the
79
+
80
+ probabilistic perspective, we show that our learning formulation is actually a
81
+
82
+ MAP estimation on all optimization variables. We also show its equivalence to
83
+ a
84
+
85
+ multiple kernel learning problem in which one is trying to find a re-weighting
86
+
87
+ kernel for features from a dictionary of basis (despite the fact that only
88
+
89
+ linear classifiers are learned). Finally, we describe an alternative
90
+
91
+ optimization scheme to minimize the objective function and present empirical
92
+
93
+ studies to valid our algorithm.
94
+
95
+ '
96
+ - source_sentence: "A Game-theoretic Machine Learning Approach for Revenue Maximization\
97
+ \ in\n Sponsored Search"
98
+ sentences:
99
+ - ' A learning algorithm based on primary school teaching and learning is
100
+
101
+ presented. The methodology is to continuously evaluate a student and to give
102
+
103
+ them training on the examples for which they repeatedly fail, until, they can
104
+
105
+ correctly answer all types of questions. This incremental learning procedure
106
+
107
+ produces better learning curves by demanding the student to optimally dedicate
108
+
109
+ their learning time on the failed examples. When used in machine learning, the
110
+
111
+ algorithm is found to train a machine on a data with maximum variance in the
112
+
113
+ feature space so that the generalization ability of the network improves. The
114
+
115
+ algorithm has interesting applications in data mining, model evaluations and
116
+
117
+ rare objects discovery.
118
+
119
+ '
120
+ - ' In this paper we extend temporal difference policy evaluation algorithms to
121
+
122
+ performance criteria that include the variance of the cumulative reward. Such
123
+
124
+ criteria are useful for risk management, and are important in domains such as
125
+
126
+ finance and process control. We propose both TD(0) and LSTD(lambda) variants
127
+
128
+ with linear function approximation, prove their convergence, and demonstrate
129
+
130
+ their utility in a 4-dimensional continuous state space problem.
131
+
132
+ '
133
+ - ' Sponsored search is an important monetization channel for search engines, in
134
+
135
+ which an auction mechanism is used to select the ads shown to users and
136
+
137
+ determine the prices charged from advertisers. There have been several pieces
138
+
139
+ of work in the literature that investigate how to design an auction mechanism
140
+
141
+ in order to optimize the revenue of the search engine. However, due to some
142
+
143
+ unrealistic assumptions used, the practical values of these studies are not
144
+
145
+ very clear. In this paper, we propose a novel \emph{game-theoretic machine
146
+
147
+ learning} approach, which naturally combines machine learning and game theory,
148
+
149
+ and learns the auction mechanism using a bilevel optimization framework. In
150
+
151
+ particular, we first learn a Markov model from historical data to describe how
152
+
153
+ advertisers change their bids in response to an auction mechanism, and then for
154
+
155
+ any given auction mechanism, we use the learnt model to predict its
156
+
157
+ corresponding future bid sequences. Next we learn the auction mechanism through
158
+
159
+ empirical revenue maximization on the predicted bid sequences. We show that the
160
+
161
+ empirical revenue will converge when the prediction period approaches infinity,
162
+
163
+ and a Genetic Programming algorithm can effectively optimize this empirical
164
+
165
+ revenue. Our experiments indicate that the proposed approach is able to produce
166
+
167
+ a much more effective auction mechanism than several baselines.
168
+
169
+ '
170
+ - source_sentence: Normalized Online Learning
171
+ sentences:
172
+ - " The Frank-Wolfe method (a.k.a. conditional gradient algorithm) for smooth\n\
173
+ optimization has regained much interest in recent years in the context of large\n\
174
+ scale optimization and machine learning. A key advantage of the method is that\n\
175
+ it avoids projections - the computational bottleneck in many applications -\n\
176
+ replacing it by a linear optimization step. Despite this advantage, the known\n\
177
+ convergence rates of the FW method fall behind standard first order methods for\n\
178
+ most settings of interest. It is an active line of research to derive faster\n\
179
+ linear optimization-based algorithms for various settings of convex\noptimization.\n\
180
+ \ In this paper we consider the special case of optimization over strongly\n\
181
+ convex sets, for which we prove that the vanila FW method converges at a rate\n\
182
+ of $\\frac{1}{t^2}$. This gives a quadratic improvement in convergence rate\n\
183
+ compared to the general case, in which convergence is of the order\n$\\frac{1}{t}$,\
184
+ \ and known to be tight. We show that various balls induced by\n$\\ell_p$ norms,\
185
+ \ Schatten norms and group norms are strongly convex on one hand\nand on the other\
186
+ \ hand, linear optimization over these sets is straightforward\nand admits a closed-form\
187
+ \ solution. We further show how several previous\nfast-rate results for the FW\
188
+ \ method follow easily from our analysis.\n"
189
+ - ' We introduce online learning algorithms which are independent of feature
190
+
191
+ scales, proving regret bounds dependent on the ratio of scales existent in the
192
+
193
+ data rather than the absolute scale. This has several useful effects: there is
194
+
195
+ no need to pre-normalize data, the test-time and test-space complexity are
196
+
197
+ reduced, and the algorithms are more robust.
198
+
199
+ '
200
+ - ' In order to achieve high efficiency of classification in intrusion detection,
201
+
202
+ a compressed model is proposed in this paper which combines horizontal
203
+
204
+ compression with vertical compression. OneR is utilized as horizontal
205
+
206
+ com-pression for attribute reduction, and affinity propagation is employed as
207
+
208
+ vertical compression to select small representative exemplars from large
209
+
210
+ training data. As to be able to computationally compress the larger volume of
211
+
212
+ training data with scalability, MapReduce based parallelization approach is
213
+
214
+ then implemented and evaluated for each step of the model compression process
215
+
216
+ abovementioned, on which common but efficient classification methods can be
217
+
218
+ directly used. Experimental application study on two publicly available
219
+
220
+ datasets of intrusion detection, KDD99 and CMDC2012, demonstrates that the
221
+
222
+ classification using the compressed model proposed can effectively speed up the
223
+
224
+ detection procedure at up to 184 times, most importantly at the cost of a
225
+
226
+ minimal accuracy difference with less than 1% on average.
227
+
228
+ '
229
+ - source_sentence: Bounds on the Bethe Free Energy for Gaussian Networks
230
+ sentences:
231
+ - ' We extend the Bayesian Information Criterion (BIC), an asymptotic
232
+
233
+ approximation for the marginal likelihood, to Bayesian networks with hidden
234
+
235
+ variables. This approximation can be used to select models given large samples
236
+
237
+ of data. The standard BIC as well as our extension punishes the complexity of
238
+ a
239
+
240
+ model according to the dimension of its parameters. We argue that the dimension
241
+
242
+ of a Bayesian network with hidden variables is the rank of the Jacobian matrix
243
+
244
+ of the transformation between the parameters of the network and the parameters
245
+
246
+ of the observable variables. We compute the dimensions of several networks
247
+
248
+ including the naive Bayes model with a hidden root node.
249
+
250
+ '
251
+ - ' Complex networks refer to large-scale graphs with nontrivial connection
252
+
253
+ patterns. The salient and interesting features that the complex network study
254
+
255
+ offer in comparison to graph theory are the emphasis on the dynamical
256
+
257
+ properties of the networks and the ability of inherently uncovering pattern
258
+
259
+ formation of the vertices. In this paper, we present a hybrid data
260
+
261
+ classification technique combining a low level and a high level classifier. The
262
+
263
+ low level term can be equipped with any traditional classification techniques,
264
+
265
+ which realize the classification task considering only physical features (e.g.,
266
+
267
+ geometrical or statistical features) of the input data. On the other hand, the
268
+
269
+ high level term has the ability of detecting data patterns with semantic
270
+
271
+ meanings. In this way, the classification is realized by means of the
272
+
273
+ extraction of the underlying network''s features constructed from the input
274
+
275
+ data. As a result, the high level classification process measures the
276
+
277
+ compliance of the test instances with the pattern formation of the training
278
+
279
+ data. Out of various high level perspectives that can be utilized to capture
280
+
281
+ semantic meaning, we utilize the dynamical features that are generated from a
282
+
283
+ tourist walker in a networked environment. Specifically, a weighted combination
284
+
285
+ of transient and cycle lengths generated by the tourist walk is employed for
286
+
287
+ that end. Interestingly, our study shows that the proposed technique is able to
288
+
289
+ further improve the already optimized performance of traditional classification
290
+
291
+ techniques.
292
+
293
+ '
294
+ - ' We address the problem of computing approximate marginals in Gaussian
295
+
296
+ probabilistic models by using mean field and fractional Bethe approximations.
297
+
298
+ As an extension of Welling and Teh (2001), we define the Gaussian fractional
299
+
300
+ Bethe free energy in terms of the moment parameters of the approximate
301
+
302
+ marginals and derive an upper and lower bound for it. We give necessary
303
+
304
+ conditions for the Gaussian fractional Bethe free energies to be bounded from
305
+
306
+ below. It turns out that the bounding condition is the same as the pairwise
307
+
308
+ normalizability condition derived by Malioutov et al. (2006) as a sufficient
309
+
310
+ condition for the convergence of the message passing algorithm. By giving a
311
+
312
+ counterexample, we disprove the conjecture in Welling and Teh (2001): even when
313
+
314
+ the Bethe free energy is not bounded from below, it can possess a local minimum
315
+
316
+ to which the minimization algorithms can converge.
317
+
318
+ '
319
+ - source_sentence: Multi-Armed Bandits in Metric Spaces
320
+ sentences:
321
+ - ' The paper presents a FrameNet-based information extraction and knowledge
322
+
323
+ representation framework, called FrameNet-CNL. The framework is used on natural
324
+
325
+ language documents and represents the extracted knowledge in a tailor-made
326
+
327
+ Frame-ontology from which unambiguous FrameNet-CNL paraphrase text can be
328
+
329
+ generated automatically in multiple languages. This approach brings together
330
+
331
+ the fields of information extraction and CNL, because a source text can be
332
+
333
+ considered belonging to FrameNet-CNL, if information extraction parser produces
334
+
335
+ the correct knowledge representation as a result. We describe a
336
+
337
+ state-of-the-art information extraction parser used by a national news agency
338
+
339
+ and speculate that FrameNet-CNL eventually could shape the natural language
340
+
341
+ subset used for writing the newswire articles.
342
+
343
+ '
344
+ - ' Applications such as face recognition that deal with high-dimensional data
345
+
346
+ need a mapping technique that introduces representation of low-dimensional
347
+
348
+ features with enhanced discriminatory power and a proper classifier, able to
349
+
350
+ classify those complex features. Most of traditional Linear Discriminant
351
+
352
+ Analysis suffer from the disadvantage that their optimality criteria are not
353
+
354
+ directly related to the classification ability of the obtained feature
355
+
356
+ representation. Moreover, their classification accuracy is affected by the
357
+
358
+ "small sample size" problem which is often encountered in FR tasks. In this
359
+
360
+ short paper, we combine nonlinear kernel based mapping of data called KDDA with
361
+
362
+ Support Vector machine classifier to deal with both of the shortcomings in an
363
+
364
+ efficient and cost effective manner. The proposed here method is compared, in
365
+
366
+ terms of classification accuracy, to other commonly used FR methods on UMIST
367
+
368
+ face database. Results indicate that the performance of the proposed method is
369
+
370
+ overall superior to those of traditional FR approaches, such as the Eigenfaces,
371
+
372
+ Fisherfaces, and D-LDA methods and traditional linear classifiers.
373
+
374
+ '
375
+ - ' In a multi-armed bandit problem, an online algorithm chooses from a set of
376
+
377
+ strategies in a sequence of trials so as to maximize the total payoff of the
378
+
379
+ chosen strategies. While the performance of bandit algorithms with a small
380
+
381
+ finite strategy set is quite well understood, bandit problems with large
382
+
383
+ strategy sets are still a topic of very active investigation, motivated by
384
+
385
+ practical applications such as online auctions and web advertisement. The goal
386
+
387
+ of such research is to identify broad and natural classes of strategy sets and
388
+
389
+ payoff functions which enable the design of efficient solutions. In this work
390
+
391
+ we study a very general setting for the multi-armed bandit problem in which the
392
+
393
+ strategies form a metric space, and the payoff function satisfies a Lipschitz
394
+
395
+ condition with respect to the metric. We refer to this problem as the
396
+
397
+ "Lipschitz MAB problem". We present a complete solution for the multi-armed
398
+
399
+ problem in this setting. That is, for every metric space (L,X) we define an
400
+
401
+ isometry invariant which bounds from below the performance of Lipschitz MAB
402
+
403
+ algorithms for X, and we present an algorithm which comes arbitrarily close to
404
+
405
+ meeting this bound. Furthermore, our technique gives even better results for
406
+
407
+ benign payoff functions.
408
+
409
+ '
410
+ pipeline_tag: sentence-similarity
411
+ library_name: sentence-transformers
412
+ ---
413
+
414
+ # SentenceTransformer based on lufercho/my-finetuned-bert-mlm
415
+
416
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [lufercho/my-finetuned-bert-mlm](https://huggingface.co/lufercho/my-finetuned-bert-mlm). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
417
+
418
+ ## Model Details
419
+
420
+ ### Model Description
421
+ - **Model Type:** Sentence Transformer
422
+ - **Base model:** [lufercho/my-finetuned-bert-mlm](https://huggingface.co/lufercho/my-finetuned-bert-mlm) <!-- at revision 8cf44893fd607477d06b067f1788b495abac1b2c -->
423
+ - **Maximum Sequence Length:** 512 tokens
424
+ - **Output Dimensionality:** 768 dimensions
425
+ - **Similarity Function:** Cosine Similarity
426
+ <!-- - **Training Dataset:** Unknown -->
427
+ <!-- - **Language:** Unknown -->
428
+ <!-- - **License:** Unknown -->
429
+
430
+ ### Model Sources
431
+
432
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
433
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
434
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
435
+
436
+ ### Full Model Architecture
437
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
+ ```
444
+
445
+ ## Usage
446
+
447
+ ### Direct Usage (Sentence Transformers)
448
+
449
+ First install the Sentence Transformers library:
450
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
454
+
455
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("lufercho/AxvBert-Sentente-Transformer")
+ # Run inference
+ sentences = [
+ 'Multi-Armed Bandits in Metric Spaces',
+ ' In a multi-armed bandit problem, an online algorithm chooses from a set of\nstrategies in a sequence of trials so as to maximize the total payoff of the\nchosen strategies. While the performance of bandit algorithms with a small\nfinite strategy set is quite well understood, bandit problems with large\nstrategy sets are still a topic of very active investigation, motivated by\npractical applications such as online auctions and web advertisement. The goal\nof such research is to identify broad and natural classes of strategy sets and\npayoff functions which enable the design of efficient solutions. In this work\nwe study a very general setting for the multi-armed bandit problem in which the\nstrategies form a metric space, and the payoff function satisfies a Lipschitz\ncondition with respect to the metric. We refer to this problem as the\n"Lipschitz MAB problem". We present a complete solution for the multi-armed\nproblem in this setting. That is, for every metric space (L,X) we define an\nisometry invariant which bounds from below the performance of Lipschitz MAB\nalgorithms for X, and we present an algorithm which comes arbitrarily close to\nmeeting this bound. Furthermore, our technique gives even better results for\nbenign payoff functions.\n',
+ ' Applications such as face recognition that deal with high-dimensional data\nneed a mapping technique that introduces representation of low-dimensional\nfeatures with enhanced discriminatory power and a proper classifier, able to\nclassify those complex features. Most of traditional Linear Discriminant\nAnalysis suffer from the disadvantage that their optimality criteria are not\ndirectly related to the classification ability of the obtained feature\nrepresentation. Moreover, their classification accuracy is affected by the\n"small sample size" problem which is often encountered in FR tasks. In this\nshort paper, we combine nonlinear kernel based mapping of data called KDDA with\nSupport Vector machine classifier to deal with both of the shortcomings in an\nefficient and cost effective manner. The proposed here method is compared, in\nterms of classification accuracy, to other commonly used FR methods on UMIST\nface database. Results indicate that the performance of the proposed method is\noverall superior to those of traditional FR approaches, such as the Eigenfaces,\nFisherfaces, and D-LDA methods and traditional linear classifiers.\n',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
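
The model card lists cosine similarity as the similarity function, so the 3 x 3 result above can be read as a matrix of pairwise cosine scores. As a rough illustration of what such a matrix contains, here is a pure-Python sketch on toy 2-dimensional vectors. The helper names `cosine_similarity` and `similarity_matrix` are hypothetical, and this is not the library's code.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarity between every pair of rows."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


# Toy stand-ins for the 768-dimensional embeddings returned by the model.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(score, 3) for score in row])
```

Each diagonal entry compares a vector with itself and is therefore 1 up to rounding, and the matrix is symmetric.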
476
+
477
+ <!--
478
+ ### Direct Usage (Transformers)
479
+
480
+ <details><summary>Click to see the direct usage in Transformers</summary>
481
+
482
+ </details>
483
+ -->
484
+
485
+ <!--
486
+ ### Downstream Usage (Sentence Transformers)
487
+
488
+ You can finetune this model on your own dataset.
489
+
490
+ <details><summary>Click to expand</summary>
491
+
492
+ </details>
493
+ -->
494
+
495
+ <!--
496
+ ### Out-of-Scope Use
497
+
498
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
499
+ -->
500
+
501
+ <!--
502
+ ## Bias, Risks and Limitations
503
+
504
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
505
+ -->
506
+
507
+ <!--
508
+ ### Recommendations
509
+
510
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
511
+ -->
512
+
513
+ ## Training Details
514
+
515
+ ### Training Dataset
516
+
517
+ #### Unnamed Dataset
518
+
519
+
520
+ * Size: 5,000 training samples
521
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
522
+ * Approximate statistics based on the first 1000 samples:
523
+ | | sentence_0 | sentence_1 |
524
+ |:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
525
+ | type | string | string |
526
+ | details | <ul><li>min: 4 tokens</li><li>mean: 13.29 tokens</li><li>max: 56 tokens</li></ul> | <ul><li>min: 26 tokens</li><li>mean: 202.49 tokens</li><li>max: 506 tokens</li></ul> |
527
+ * Samples:
528
+ | sentence_0 | sentence_1 |
529
+ |:-------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
530
+ | <code>Validation of nonlinear PCA</code> | <code> Linear principal component analysis (PCA) can be extended to a nonlinear PCA<br>by using artificial neural networks. But the benefit of curved components<br>requires a careful control of the model complexity. Moreover, standard<br>techniques for model selection, including cross-validation and more generally<br>the use of an independent test set, fail when applied to nonlinear PCA because<br>of its inherent unsupervised characteristics. This paper presents a new<br>approach for validating the complexity of nonlinear PCA models by using the<br>error in missing data estimation as a criterion for model selection. It is<br>motivated by the idea that only the model of optimal complexity is able to<br>predict missing values with the highest accuracy. While standard test set<br>validation usually favours over-fitted nonlinear PCA models, the proposed model<br>validation approach correctly selects the optimal model complexity.<br></code> |
531
+ | <code>Learning Attitudes and Attributes from Multi-Aspect Reviews</code> | <code> The majority of online reviews consist of plain-text feedback together with a<br>single numeric score. However, there are multiple dimensions to products and<br>opinions, and understanding the `aspects' that contribute to users' ratings may<br>help us to better understand their individual preferences. For example, a<br>user's impression of an audiobook presumably depends on aspects such as the<br>story and the narrator, and knowing their opinions on these aspects may help us<br>to recommend better products. In this paper, we build models for rating systems<br>in which such dimensions are explicit, in the sense that users leave separate<br>ratings for each aspect of a product. By introducing new corpora consisting of<br>five million reviews, rated with between three and six aspects, we evaluate our<br>models on three prediction tasks: First, we use our model to uncover which<br>parts of a review discuss which of the rated aspects. Second, we use our model<br>to summarize reviews, which for us means finding the sentences...</code> |
532
+ | <code>Bayesian Differential Privacy through Posterior Sampling</code> | <code> Differential privacy formalises privacy-preserving mechanisms that provide<br>access to a database. We pose the question of whether Bayesian inference itself<br>can be used directly to provide private access to data, with no modification.<br>The answer is affirmative: under certain conditions on the prior, sampling from<br>the posterior distribution can be used to achieve a desired level of privacy<br>and utility. To do so, we generalise differential privacy to arbitrary dataset<br>metrics, outcome spaces and distribution families. This allows us to also deal<br>with non-i.i.d or non-tabular datasets. We prove bounds on the sensitivity of<br>the posterior to the data, which gives a measure of robustness. We also show<br>how to use posterior sampling to provide differentially private responses to<br>queries, within a decision-theoretic framework. Finally, we provide bounds on<br>the utility and on the distinguishability of datasets. The latter are<br>complemented by a novel use of Le Cam's method to obtain lower bounds....</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+ ```json
+ {
+     "scale": 20.0,
+     "similarity_fct": "cos_sim"
+ }
+ ```
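
As a rough illustration of what the `scale` and `similarity_fct` parameters above control, here is a minimal pure-Python sketch of an in-batch ranking loss: each anchor's cosine similarity to every positive in the batch is multiplied by `scale` and fed to a softmax cross-entropy whose target is the anchor's own positive. The function names `cos_sim` and `mnr_loss` and the toy vectors are assumptions made for this example; this is not the library's implementation.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative (illustrative only)."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Hypothetical toy embeddings: two anchors and their two positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.2, 0.8]]

loss_aligned = mnr_loss(anchors, positives)          # pairs line up
loss_shuffled = mnr_loss(anchors, positives[::-1])   # pairs swapped
print(loss_aligned, loss_shuffled)
```

With the pairs aligned the loss is close to zero, and swapping the positives makes it much larger. A larger `scale` sharpens the softmax, so the same similarity gap yields a more confident ranking.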
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `num_train_epochs`: 2
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 2
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: False
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
+
+ ### Training Logs
+ | Epoch | Step | Training Loss |
+ |:------:|:----:|:-------------:|
+ | 1.5974 | 500 | 0.3039 |
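
The single logged row can be combined with the hyperparameters listed in this card to estimate the size of the run. The sketch below assumes training on a single device (not stated in the card) and uses the listed values `per_device_train_batch_size` 16, `gradient_accumulation_steps` 1, `dataloader_drop_last` False and `num_train_epochs` 2. The variable names are chosen for this example only.

```python
# Values taken from the training log and hyperparameter list.
logged_epoch = 1.5974
logged_step = 500
batch_size = 16   # per_device_train_batch_size; a single device is assumed
num_epochs = 2

# The logged epoch is step / steps_per_epoch, so invert it and round.
steps_per_epoch = round(logged_step / logged_epoch)
total_steps = steps_per_epoch * num_epochs

# With dataloader_drop_last False the last batch may be partial, so the
# training set size is only pinned down to a range.
max_samples = steps_per_epoch * batch_size
min_samples = (steps_per_epoch - 1) * batch_size + 1

print(steps_per_epoch, total_steps, (min_samples, max_samples))
```

Under these assumptions the run has 313 optimizer steps per epoch and 626 in total, and the training set holds between 4,993 and 5,008 pairs. If several devices were used, the sample estimate scales with the device count.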
+
+
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 3.3.1
+ - Transformers: 4.46.2
+ - PyTorch: 2.5.1+cu121
+ - Accelerate: 1.1.1
+ - Datasets: 3.1.0
+ - Tokenizers: 0.20.3
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "_name_or_path": "lufercho/my-finetuned-bert-mlm",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.46.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.3.1",
+     "transformers": "4.46.2",
+     "pytorch": "2.5.1+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5acb361ba7f01378e500c54cef73f7794364f98dec8469c218eb8b51f1d5ede8
+ size 437951328
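
The file size recorded in the LFS pointer can be sanity-checked against the `config.json` in this commit. The sketch below counts the parameters of a standard BERT encoder from the config values (`vocab_size` 30522, `hidden_size` 768, 12 layers, `intermediate_size` 3072, 512 positions, 2 token types) and multiplies by 4 bytes for `float32`. Whether the saved checkpoint includes the pooler head is an assumption here, and `bert_param_count` is a helper written for this example, not a library function.

```python
def bert_param_count(vocab_size=30522, hidden=768, layers=12,
                     intermediate=3072, max_pos=512, type_vocab=2):
    """Parameter count of a standard BERT encoder (pooler assumed present)."""
    layer_norm = 2 * hidden  # weight + bias
    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (with biases), then a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (with biases), then a LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

params = bert_param_count()
weight_bytes = params * 4      # torch_dtype is float32
file_size = 437951328          # from the LFS pointer above
overhead = file_size - weight_bytes

print(params, weight_bytes, overhead)
```

Under these assumptions the model has 109,482,240 parameters, which accounts for all but 22,368 bytes of the file. That remainder is small enough to be plausible as the safetensors header and metadata, though this sketch does not verify that.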
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
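
The two modules run in sequence: the Transformer produces one vector per token, and the Pooling module reduces them to a single sentence vector. Since `1_Pooling/config.json` in this commit sets `pooling_mode_mean_tokens` to true, the reduction is a mean over tokens. Below is a minimal pure-Python sketch of that step, assuming padding positions are excluded through the attention mask; `mean_pool` and the toy inputs are illustrative and not the library's code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                summed[j] += value
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]

# Hypothetical 2-dimensional token vectors; the last position is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)
```

The padded position is ignored, so the result is the average of the first two vectors only. In the real model each token vector has 768 dimensions, matching `word_embedding_dimension`.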
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff