metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:5000
- loss:TripletLoss
base_model: lufercho/my-finetuned-bert-mlm
widget:
- source_sentence: |-
Auto-WEKA: Combined Selection and Hyperparameter Optimization of
Classification Algorithms
sentences:
- >2
It has been a long time, since data mining technologies have made their ways
to the field of data management. Classification is one of the most
important
data mining tasks for label prediction, categorization of objects into
groups,
advertisement and data management. In this paper, we focus on the
standard
classification problem which is predicting unknown labels in Euclidean
space.
Most efforts in Machine Learning communities are devoted to methods that
use
probabilistic algorithms which are heavy on Calculus and Linear Algebra.
Most
of these techniques have scalability issues for big data, and are hardly
parallelizable if they are to maintain their high accuracies in their
standard
form. Sampling is a new direction for improving scalability, using many
small
parallel classifiers. In this paper, rather than conventional sampling
methods,
we focus on a discrete classification algorithm with O(n) expected
running
time. Our approach performs a similar task as sampling methods. However,
we use
column-wise sampling of data, rather than the row-wise sampling used in
the
literature. In either case, our algorithm is completely deterministic.
Our
algorithm, proposes a way of combining 2D convex hulls in order to
achieve high
classification accuracy as well as scalability in the same time. First,
we
thoroughly describe and prove our O(n) algorithm for finding the convex
hull of
a point set in 2D. Then, we show with experiments our classifier model
built
based on this idea is very competitive compared with existing
sophisticated
classification algorithms included in commercial statistical
applications such
as MATLAB.
- >2
Many different machine learning algorithms exist; taking into account each
algorithm's hyperparameters, there is a staggeringly large number of
possible
alternatives overall. We consider the problem of simultaneously
selecting a
learning algorithm and setting its hyperparameters, going beyond
previous work
that addresses these issues in isolation. We show that this problem can
be
addressed by a fully automated approach, leveraging recent innovations
in
Bayesian optimization. Specifically, we consider a wide range of feature
selection techniques (combining 3 search and 8 evaluator methods) and
all
classification approaches implemented in WEKA, spanning 2 ensemble
methods, 10
meta-methods, 27 base classifiers, and hyperparameter settings for each
classifier. On each of 21 popular datasets from the UCI repository, the
KDD Cup
09, variants of the MNIST dataset and CIFAR-10, we show classification
performance often much better than using standard
selection/hyperparameter
optimization methods. We hope that our approach will help non-expert
users to
more effectively identify machine learning algorithms and hyperparameter
settings appropriate to their applications, and hence to achieve
improved
performance.
- >2
Nonnegative matrix factorization (NMF) has become a ubiquitous tool for data
analysis. An important variant is the sparse NMF problem which arises
when we
explicitly require the learnt features to be sparse. A natural measure
of
sparsity is the L$_0$ norm, however its optimization is NP-hard. Mixed
norms,
such as L$_1$/L$_2$ measure, have been shown to model sparsity robustly,
based
on intuitive attributes that such measures need to satisfy. This is in
contrast
to computationally cheaper alternatives such as the plain L$_1$ norm.
However,
present algorithms designed for optimizing the mixed norm L$_1$/L$_2$
are slow
and other formulations for sparse NMF have been proposed such as those
based on
L$_1$ and L$_0$ norms. Our proposed algorithm allows us to solve the
mixed norm
sparsity constraints while not sacrificing computation time. We present
experimental evidence on real-world datasets that shows our new
algorithm
performs an order of magnitude faster compared to the current
state-of-the-art
solvers optimizing the mixed norm and is suitable for large-scale
datasets.
- source_sentence: |-
Effect of Different Distance Measures on the Performance of K-Means
Algorithm: An Experimental Study in Matlab
sentences:
- >2
The kernel method is a potential approach to analyzing structured data such
as sequences, trees, and graphs; however, unordered trees have not been
investigated extensively. Kimura et al. (2011) proposed a kernel
function for
unordered trees on the basis of their subpaths, which are vertical
substructures of trees responsible for hierarchical information in them.
Their
kernel exhibits practically good performance in terms of accuracy and
speed;
however, linear-time computation is not guaranteed theoretically, unlike
the
case of the other unordered tree kernel proposed by Vishwanathan and
Smola
(2003). In this paper, we propose a theoretically guaranteed linear-time
kernel
computation algorithm that is practically fast, and we present an
efficient
prediction algorithm whose running time depends only on the size of the
input
tree. Experimental results show that the proposed algorithms are quite
efficient in practice.
- >2
We express the classic ARMA time-series model as a directed graphical model.
In doing so, we find that the deterministic relationships in the model
make it
effectively impossible to use the EM algorithm for learning model
parameters.
To remedy this problem, we replace the deterministic relationships with
Gaussian distributions having a small variance, yielding the stochastic
ARMA
(ARMA) model. This modification allows us to use the EM algorithm to
learn
parmeters and to forecast,even in situations where some data is missing.
This
modification, in conjunction with the graphicalmodel approach, also
allows us
to include cross predictors in situations where there are multiple times
series
and/or additional nontemporal covariates. More surprising,experiments
suggest
that the move to stochastic ARMA yields improved accuracy through better
smoothing. We demonstrate improvements afforded by cross prediction and
better
smoothing on real data.
- >2
K-means algorithm is a very popular clustering algorithm which is famous for
its simplicity. Distance measure plays a very important rule on the
performance
of this algorithm. We have different distance measure techniques
available. But
choosing a proper technique for distance calculation is totally
dependent on
the type of the data that we are going to cluster. In this paper an
experimental study is done in Matlab to cluster the iris and wine data
sets
with different distance measures and thereby observing the variation of
the
performances shown.
- source_sentence: A Dynamic Near-Optimal Algorithm for Online Linear Programming
sentences:
- >2
Social media channels such as Twitter have emerged as popular platforms for
crowds to respond to public events such as speeches, sports and debates.
While
this promises tremendous opportunities to understand and make sense of
the
reception of an event from the social media, the promises come entwined
with
significant technical challenges. In particular, given an event and an
associated large scale collection of tweets, we need approaches to
effectively
align tweets and the parts of the event they refer to. This in turn
raises
questions about how to segment the event into smaller yet meaningful
parts, and
how to figure out whether a tweet is a general one about the entire
event or
specific one aimed at a particular segment of the event. In this work,
we
present ET-LDA, an effective method for aligning an event and its tweets
through joint statistical modeling of topical influences from the events
and
their associated tweets. The model enables the automatic segmentation of
the
events and the characterization of tweets into two categories: (1)
episodic
tweets that respond specifically to the content in the segments of the
events,
and (2) steady tweets that respond generally about the events. We
present an
efficient inference method for this model, and a comprehensive
evaluation of
its effectiveness over existing methods. In particular, through a user
study,
we demonstrate that users find the topics, the segments, the alignment,
and the
episodic tweets discovered by ET-LDA to be of higher quality and more
interesting as compared to the state-of-the-art, with improvements in
the range
of 18-41%.
- >2
A natural optimization model that formulates many online resource allocation
and revenue management problems is the online linear program (LP) in
which the
constraint matrix is revealed column by column along with the
corresponding
objective coefficient. In such a model, a decision variable has to be
set each
time a column is revealed without observing the future inputs and the
goal is
to maximize the overall objective function. In this paper, we provide a
near-optimal algorithm for this general class of online problems under
the
assumption of random order of arrival and some mild conditions on the
size of
the LP right-hand-side input. Specifically, our learning-based algorithm
works
by dynamically updating a threshold price vector at geometric time
intervals,
where the dual prices learned from the revealed columns in the previous
period
are used to determine the sequential decisions in the current period.
Due to
the feature of dynamic learning, the competitiveness of our algorithm
improves
over the past study of the same problem. We also present a worst-case
example
showing that the performance of our algorithm is near-optimal.
- >2
One of the biggest challenges in Multimedia information retrieval and
understanding is to bridge the semantic gap by properly modeling concept
semantics in context. The presence of out of vocabulary (OOV) concepts
exacerbates this difficulty. To address the semantic gap issues, we
formulate a
problem on learning contextualized semantics from descriptive terms and
propose
a novel Siamese architecture to model the contextualized semantics from
descriptive terms. By means of pattern aggregation and probabilistic
topic
models, our Siamese architecture captures contextualized semantics from
the
co-occurring descriptive terms via unsupervised learning, which leads to
a
concept embedding space of the terms in context. Furthermore, the
co-occurring
OOV concepts can be easily represented in the learnt concept embedding
space.
The main properties of the concept embedding space are demonstrated via
visualization. Using various settings in semantic priming, we have
carried out
a thorough evaluation by comparing our approach to a number of
state-of-the-art
methods on six annotation corpora in different domains, i.e., MagTag5K,
CAL500
and Million Song Dataset in the music domain as well as Corel5K, LabelMe
and
SUNDatabase in the image domain. Experimental results on semantic
priming
suggest that our approach outperforms those state-of-the-art methods
considerably in various aspects.
- source_sentence: Parallel Online Learning
sentences:
- >2
In our recent paper, we showed that in exponential family, contrastive
divergence (CD) with fixed learning rate will give asymptotically
consistent
estimates \cite{wu2016convergence}. In this paper, we establish
consistency and
convergence rate of CD with annealed learning rate $\eta_t$.
Specifically,
suppose CD-$m$ generates the sequence of parameters $\{\theta_t\}_{t \ge
0}$
using an i.i.d. data sample $\mathbf{X}_1^n \sim p_{\theta^*}$ of size
$n$,
then $\delta_n(\mathbf{X}_1^n) = \limsup_{t \to \infty} \Vert
\sum_{s=t_0}^t
\eta_s \theta_s / \sum_{s=t_0}^t \eta_s - \theta^* \Vert$ converges in
probability to 0 at a rate of $1/\sqrt[3]{n}$. The number ($m$) of MCMC
transitions in CD only affects the coefficient factor of convergence
rate. Our
proof is not a simple extension of the one in \cite{wu2016convergence}.
which
depends critically on the fact that $\{\theta_t\}_{t \ge 0}$ is a
homogeneous
Markov chain conditional on the observed sample $\mathbf{X}_1^n$. Under
annealed learning rate, the homogeneous Markov property is not available
and we
have to develop an alternative approach based on super-martingales.
Experiment
results of CD on a fully-visible $2\times 2$ Boltzmann Machine are
provided to
demonstrate our theoretical results.
- >2
This report outlines an approach to learning generative models from data. We
express models as probabilistic programs, which allows us to capture
abstract
patterns within the examples. By choosing our language for programs to
be an
extension of the algebraic data type of the examples, we can begin with
a
program that generates all and only the examples. We then introduce
greater
abstraction, and hence generalization, incrementally to the extent that
it
improves the posterior probability of the examples given the program.
Motivated
by previous approaches to model merging and program induction, we search
for
such explanatory abstractions using program transformations. We consider
two
types of transformation: Abstraction merges common subexpressions within
a
program into new functions (a form of anti-unification). Deargumentation
simplifies functions by reducing the number of arguments. We demonstrate
that
this approach finds key patterns in the domain of nested lists,
including
parameterized sub-functions and stochastic recursion.
- >2
In this work we study parallelization of online learning, a core primitive in
machine learning. In a parallel environment all known approaches for
parallel
online learning lead to delayed updates, where the model is updated
using
out-of-date information. In the worst case, or when examples are
temporally
correlated, delay can have a very adverse effect on the learning
algorithm.
Here, we analyze and present preliminary empirical results on a set of
learning
architectures based on a feature sharding approach that present various
tradeoffs between delay, degree of parallelism, representation power and
empirical performance.
- source_sentence: Maximin affinity learning of image segmentation
sentences:
- >2
Most existing approaches to hashing apply a single form of hash function, and
an optimization process which is typically deeply coupled to this
specific
form. This tight coupling restricts the flexibility of the method to
respond to
the data, and can result in complex optimization problems that are
difficult to
solve. Here we propose a flexible yet simple framework that is able to
accommodate different types of loss functions and hash functions. This
framework allows a number of existing approaches to hashing to be placed
in
context, and simplifies the development of new problem-specific hashing
methods. Our framework decomposes hashing learning problem into two
steps: hash
bit learning and hash function learning based on the learned bits. The
first
step can typically be formulated as binary quadratic problems, and the
second
step can be accomplished by training standard binary classifiers. Both
problems
have been extensively studied in the literature. Our extensive
experiments
demonstrate that the proposed framework is effective, flexible and
outperforms
the state-of-the-art.
- >2
Changes in the UK electricity market mean that domestic users will be
required to modify their usage behaviour in order that supplies can be
maintained. Clustering allows usage profiles collected at the household
level
to be clustered into groups and assigned a stereotypical profile which
can be
used to target marketing campaigns. Fuzzy C Means clustering extends
this by
allowing each household to be a member of many groups and hence provides
the
opportunity to make personalised offers to the household dependent on
their
degree of membership of each group. In addition, feedback can be
provided on
how user's changing behaviour is moving them towards more "green" or
cost
effective stereotypical usage.
- >2
Images can be segmented by first using a classifier to predict an affinity
graph that reflects the degree to which image pixels must be grouped
together
and then partitioning the graph to yield a segmentation. Machine
learning has
been applied to the affinity classifier to produce affinity graphs that
are
good in the sense of minimizing edge misclassification rates. However,
this
error measure is only indirectly related to the quality of segmentations
produced by ultimately partitioning the affinity graph. We present the
first
machine learning algorithm for training a classifier to produce affinity
graphs
that are good in the sense of producing segmentations that directly
minimize
the Rand index, a well known segmentation performance measure. The Rand
index
measures segmentation performance by quantifying the classification of
the
connectivity of image pixel pairs after segmentation. By using the
simple graph
partitioning algorithm of finding the connected components of the
thresholded
affinity graph, we are able to train an affinity classifier to directly
minimize the Rand index of segmentations resulting from the graph
partitioning.
Our learning algorithm corresponds to the learning of maximin affinities
between image pixel pairs, which are predictive of the pixel-pair
connectivity.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
SentenceTransformer based on lufercho/my-finetuned-bert-mlm
This is a sentence-transformers model finetuned from lufercho/my-finetuned-bert-mlm. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: lufercho/my-finetuned-bert-mlm
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
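In the Pooling configuration above, only `pooling_mode_mean_tokens` is enabled, so each sentence embedding is the average of its token embeddings. The following is a minimal pure-Python sketch of that idea, assuming padding positions are excluded through an attention mask. The function name `mean_pool` and the toy 2-dimensional vectors are illustrative only and are not the library's implementation.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, skipping positions whose mask value is 0 (padding)."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    if not token_embeddings:
        return []
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    kept = max(kept, 1)  # guard against an all-padding input
    return [total / kept for total in totals]


# Toy example: two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

In the real model each token vector has 768 dimensions, so the pooled output has the 768 dimensions listed in the model description.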
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("lufercho/my-finetuned-sentence-bert")
# Run inference
sentences = [
'Maximin affinity learning of image segmentation',
' Images can be segmented by first using a classifier to predict an affinity\ngraph that reflects the degree to which image pixels must be grouped together\nand then partitioning the graph to yield a segmentation. Machine learning has\nbeen applied to the affinity classifier to produce affinity graphs that are\ngood in the sense of minimizing edge misclassification rates. However, this\nerror measure is only indirectly related to the quality of segmentations\nproduced by ultimately partitioning the affinity graph. We present the first\nmachine learning algorithm for training a classifier to produce affinity graphs\nthat are good in the sense of producing segmentations that directly minimize\nthe Rand index, a well known segmentation performance measure. The Rand index\nmeasures segmentation performance by quantifying the classification of the\nconnectivity of image pixel pairs after segmentation. By using the simple graph\npartitioning algorithm of finding the connected components of the thresholded\naffinity graph, we are able to train an affinity classifier to directly\nminimize the Rand index of segmentations resulting from the graph partitioning.\nOur learning algorithm corresponds to the learning of maximin affinities\nbetween image pixel pairs, which are predictive of the pixel-pair connectivity.\n',
' Changes in the UK electricity market mean that domestic users will be\nrequired to modify their usage behaviour in order that supplies can be\nmaintained. Clustering allows usage profiles collected at the household level\nto be clustered into groups and assigned a stereotypical profile which can be\nused to target marketing campaigns. Fuzzy C Means clustering extends this by\nallowing each household to be a member of many groups and hence provides the\nopportunity to make personalised offers to the household dependent on their\ndegree of membership of each group. In addition, feedback can be provided on\nhow user\'s changing behaviour is moving them towards more "green" or cost\neffective stereotypical usage.\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
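The model description lists cosine similarity as the similarity function, so the `[3, 3]` result above holds pairwise cosine scores. The sketch below shows that computation in pure Python on toy vectors. The helper names `cosine_similarity` and `similarity_matrix` are hypothetical, and returning 0.0 for a zero vector is an assumption of this sketch rather than documented library behavior.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine scores; entry [i][j] compares embedding i with embedding j."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Three toy 2-dimensional "embeddings" standing in for the 768-dimensional ones.
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print([round(score, 3) for score in row])
```

The diagonal is 1 because every vector is compared with itself, and the matrix is symmetric.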
Training Details
Training Dataset
Unnamed Dataset
- Size: 5,000 training samples
- Columns: sentence_0, sentence_1, and sentence_2
- Approximate statistics based on the first 1000 samples:
Column | Type | Min | Mean | Max |
---|---|---|---|---|
sentence_0 | string | 4 tokens | 13.41 tokens | 38 tokens |
sentence_1 | string | 37 tokens | 201.32 tokens | 512 tokens |
sentence_2 | string | 24 tokens | 204.09 tokens | 512 tokens |
- Samples:
sentence_0 | sentence_1 | sentence_2 |
---|---|---|
Clustering with Transitive Distance and K-Means Duality | Recent spectral clustering methods are a propular and powerful technique for data clustering. These methods need to solve the eigenproblem whose computational complexity is $O(n^3)$, where $n$ is the number of data samples. In this paper, a non-eigenproblem based clustering method is proposed to deal with the clustering problem. Its performance is comparable to the spectral clustering algorithms but it is more efficient with computational complexity $O(n^2)$. We show that with a transitive distance and an observed property, called K-means duality, our algorithm can be used to handle data sets with complex cluster shapes, multi-scale clusters, and noise. Moreover, no parameters except the number of clusters need to be set in our algorithm. | We show that the log-likelihood of several probabilistic graphical models is Lipschitz continuous with respect to the lp-norm of the parameters. We discuss several implications of Lipschitz parametrization. We present an upper bound of the Kullback-Leibler divergence that allows understanding methods that penalize the lp-norm of differences of parameters as the minimization of that upper bound. The expected log-likelihood is lower bounded by the negative lp-norm, which allows understanding the generalization ability of probabilistic models. The exponential of the negative lp-norm is involved in the lower bound of the Bayes error rate, which shows that it is reasonable to use parameters as features in algorithms that rely on metric spaces (e.g. classification, dimensionality reduction, clustering). Our results do not rely on specific algorithms for learning the structure or parameters. We show preliminary results for activity recognition and temporal segmentation. |
Clustering Dynamic Web Usage Data | Most classification methods are based on the assumption that data conforms to a stationary distribution. The machine learning domain currently suffers from a lack of classification techniques that are able to detect the occurrence of a change in the underlying data distribution. Ignoring possible changes in the underlying concept, also known as concept drift, may degrade the performance of the classification model. Often these changes make the model inconsistent and regular updatings become necessary. Taking the temporal dimension into account during the analysis of Web usage data is a necessity, since the way a site is visited may indeed evolve due to modifications in the structure and content of the site, or even due to changes in the behavior of certain user groups. One solution to this problem, proposed in this article, is to update models using summaries obtained by means of an evolutionary approach based on an intelligent clustering approach. We carry out various clustering str... | Exponential family extensions of principal component analysis (EPCA) have received a considerable amount of attention in recent years, demonstrating the growing need for basic modeling tools that do not assume the squared loss or Gaussian distribution. We extend the EPCA model toolbox by presenting the first exponential family multi-view learning methods of the partial least squares and canonical correlation analysis, based on a unified representation of EPCA as matrix factorization of the natural parameters of exponential family. The models are based on a new family of priors that are generally usable for all such factorizations. We also introduce new inference strategies, and demonstrate how the methods outperform earlier ones when the Gaussianity assumption does not hold. |
Trading USDCHF filtered by Gold dynamics via HMM coupling | We devise a USDCHF trading strategy using the dynamics of gold as a filter. Our strategy involves modelling both USDCHF and gold using a coupled hidden Markov model (CHMM). The observations will be indicators, RSI and CCI, which will be used as triggers for our trading signals. Upon decoding the model in each iteration, we can get the next most probable state and the next most probable observation. Hopefully by taking advantage of intermarket analysis and the Markov property implicit in the model, trading with these most probable values will produce profitable results. | Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. Up to now, all previous work have assumed a threat model in which the adversary can feed data directly into the machine learning classifier. This is not always the case for systems operating in the physical world, for example those which are using signals from cameras and other sensors as an input. This paper shows that even in such physical world scenarios, machine learning systems are vul... |
- Loss: TripletLoss with these parameters: { "distance_metric": "TripletDistanceMetric.EUCLIDEAN", "triplet_margin": 5 }
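With a Euclidean distance metric and a margin of 5, the triplet objective pushes the anchor at least 5 units closer to the positive than to the negative. Judging from the samples, sentence_0 (a title) appears to act as the anchor, sentence_1 as the positive, and sentence_2 as the negative. The sketch below works through the per-triplet value max(d(a, p) - d(a, n) + margin, 0) on toy 2-dimensional vectors. The function names and vectors are illustrative, and batching and averaging are left out.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 5.0) -> float:
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
close_negative = [0.0, 6.0]  # distance 6: inside the margin, so it is penalized
far_negative = [9.0, 12.0]   # distance 15: beyond the margin, so the loss is zero

loss_close = triplet_loss(anchor, positive, close_negative)
loss_far = triplet_loss(anchor, positive, far_negative)
print(loss_close, loss_far)
```

A triplet stops contributing to the gradient once the negative is more than the margin farther from the anchor than the positive.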
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- num_train_epochs: 2
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | Training Loss |
---|---|---|
1.5974 | 500 | 0.8647 |
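The fractional epoch in the logged row follows from the dataset size and batch size reported above. The short calculation below reproduces it. It assumes training ran on a single device, which the card does not state; the other values (5,000 samples, batch size 16, no gradient accumulation, `dataloader_drop_last` False) come from the sections above.

```python
import math

dataset_size = 5000
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
num_train_epochs = 2
num_devices = 1  # assumption: not stated in the card


def steps_per_epoch(size: int, batch_size: int, devices: int, accumulation: int) -> int:
    """Optimizer steps per epoch when the last partial batch is kept."""
    effective_batch = batch_size * devices * accumulation
    return math.ceil(size / effective_batch)


steps = steps_per_epoch(dataset_size, per_device_train_batch_size,
                        num_devices, gradient_accumulation_steps)
total_steps = steps * num_train_epochs
epoch_at_step_500 = 500 / steps

print(steps)                        # 313
print(total_steps)                  # 626
print(round(epoch_at_step_500, 4))  # 1.5974
```

Under these assumptions the run has 626 optimizer steps in total, so step 500 falls a little more than halfway through the second epoch, matching the logged value of 1.5974.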
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.3.1
- Transformers: 4.46.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.1.1
- Datasets: 3.1.0
- Tokenizers: 0.20.3
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
TripletLoss
@misc{hermans2017defense,
title={In Defense of the Triplet Loss for Person Re-Identification},
author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
year={2017},
eprint={1703.07737},
archivePrefix={arXiv},
primaryClass={cs.CV}
}