---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:5000
  - loss:TripletLoss
base_model: lufercho/my-finetuned-bert-mlm
widget:
  - source_sentence: |-
      Auto-WEKA: Combined Selection and Hyperparameter Optimization of
        Classification Algorithms
    sentences:
      - >2
          It has been a long time, since data mining technologies have made their ways
        to the field of data management. Classification is one of the most
        important

        data mining tasks for label prediction, categorization of objects into
        groups,

        advertisement and data management. In this paper, we focus on the
        standard

        classification problem which is predicting unknown labels in Euclidean
        space.

        Most efforts in Machine Learning communities are devoted to methods that
        use

        probabilistic algorithms which are heavy on Calculus and Linear Algebra.
        Most

        of these techniques have scalability issues for big data, and are hardly

        parallelizable if they are to maintain their high accuracies in their
        standard

        form. Sampling is a new direction for improving scalability, using many
        small

        parallel classifiers. In this paper, rather than conventional sampling
        methods,

        we focus on a discrete classification algorithm with O(n) expected
        running

        time. Our approach performs a similar task as sampling methods. However,
        we use

        column-wise sampling of data, rather than the row-wise sampling used in
        the

        literature. In either case, our algorithm is completely deterministic.
        Our

        algorithm, proposes a way of combining 2D convex hulls in order to
        achieve high

        classification accuracy as well as scalability in the same time. First,
        we

        thoroughly describe and prove our O(n) algorithm for finding the convex
        hull of

        a point set in 2D. Then, we show with experiments our classifier model
        built

        based on this idea is very competitive compared with existing
        sophisticated

        classification algorithms included in commercial statistical
        applications such

        as MATLAB.
      - >2
          Many different machine learning algorithms exist; taking into account each
        algorithm's hyperparameters, there is a staggeringly large number of
        possible

        alternatives overall. We consider the problem of simultaneously
        selecting a

        learning algorithm and setting its hyperparameters, going beyond
        previous work

        that addresses these issues in isolation. We show that this problem can
        be

        addressed by a fully automated approach, leveraging recent innovations
        in

        Bayesian optimization. Specifically, we consider a wide range of feature

        selection techniques (combining 3 search and 8 evaluator methods) and
        all

        classification approaches implemented in WEKA, spanning 2 ensemble
        methods, 10

        meta-methods, 27 base classifiers, and hyperparameter settings for each

        classifier. On each of 21 popular datasets from the UCI repository, the
        KDD Cup

        09, variants of the MNIST dataset and CIFAR-10, we show classification

        performance often much better than using standard
        selection/hyperparameter

        optimization methods. We hope that our approach will help non-expert
        users to

        more effectively identify machine learning algorithms and hyperparameter

        settings appropriate to their applications, and hence to achieve
        improved

        performance.
      - >2
          Nonnegative matrix factorization (NMF) has become a ubiquitous tool for data
        analysis. An important variant is the sparse NMF problem which arises
        when we

        explicitly require the learnt features to be sparse. A natural measure
        of

        sparsity is the L$_0$ norm, however its optimization is NP-hard. Mixed
        norms,

        such as L$_1$/L$_2$ measure, have been shown to model sparsity robustly,
        based

        on intuitive attributes that such measures need to satisfy. This is in
        contrast

        to computationally cheaper alternatives such as the plain L$_1$ norm.
        However,

        present algorithms designed for optimizing the mixed norm L$_1$/L$_2$
        are slow

        and other formulations for sparse NMF have been proposed such as those
        based on

        L$_1$ and L$_0$ norms. Our proposed algorithm allows us to solve the
        mixed norm

        sparsity constraints while not sacrificing computation time. We present

        experimental evidence on real-world datasets that shows our new
        algorithm

        performs an order of magnitude faster compared to the current
        state-of-the-art

        solvers optimizing the mixed norm and is suitable for large-scale
        datasets.
  - source_sentence: |-
      Effect of Different Distance Measures on the Performance of K-Means
        Algorithm: An Experimental Study in Matlab
    sentences:
      - >2
          The kernel method is a potential approach to analyzing structured data such
        as sequences, trees, and graphs; however, unordered trees have not been

        investigated extensively. Kimura et al. (2011) proposed a kernel
        function for

        unordered trees on the basis of their subpaths, which are vertical

        substructures of trees responsible for hierarchical information in them.
        Their

        kernel exhibits practically good performance in terms of accuracy and
        speed;

        however, linear-time computation is not guaranteed theoretically, unlike
        the

        case of the other unordered tree kernel proposed by Vishwanathan and
        Smola

        (2003). In this paper, we propose a theoretically guaranteed linear-time
        kernel

        computation algorithm that is practically fast, and we present an
        efficient

        prediction algorithm whose running time depends only on the size of the
        input

        tree. Experimental results show that the proposed algorithms are quite

        efficient in practice.
      - >2
          We express the classic ARMA time-series model as a directed graphical model.
        In doing so, we find that the deterministic relationships in the model
        make it

        effectively impossible to use the EM algorithm for learning model
        parameters.

        To remedy this problem, we replace the deterministic relationships with

        Gaussian distributions having a small variance, yielding the stochastic
        ARMA

        (ARMA) model. This modification allows us to use the EM algorithm to
        learn

        parmeters and to forecast,even in situations where some data is missing.
        This

        modification, in conjunction with the graphicalmodel approach, also
        allows us

        to include cross predictors in situations where there are multiple times
        series

        and/or additional nontemporal covariates. More surprising,experiments
        suggest

        that the move to stochastic ARMA yields improved accuracy through better

        smoothing. We demonstrate improvements afforded by cross prediction and
        better

        smoothing on real data.
      - >2
          K-means algorithm is a very popular clustering algorithm which is famous for
        its simplicity. Distance measure plays a very important rule on the
        performance

        of this algorithm. We have different distance measure techniques
        available. But

        choosing a proper technique for distance calculation is totally
        dependent on

        the type of the data that we are going to cluster. In this paper an

        experimental study is done in Matlab to cluster the iris and wine data
        sets

        with different distance measures and thereby observing the variation of
        the

        performances shown.
  - source_sentence: A Dynamic Near-Optimal Algorithm for Online Linear Programming
    sentences:
      - >2
          Social media channels such as Twitter have emerged as popular platforms for
        crowds to respond to public events such as speeches, sports and debates.
        While

        this promises tremendous opportunities to understand and make sense of
        the

        reception of an event from the social media, the promises come entwined
        with

        significant technical challenges. In particular, given an event and an

        associated large scale collection of tweets, we need approaches to
        effectively

        align tweets and the parts of the event they refer to. This in turn
        raises

        questions about how to segment the event into smaller yet meaningful
        parts, and

        how to figure out whether a tweet is a general one about the entire
        event or

        specific one aimed at a particular segment of the event. In this work,
        we

        present ET-LDA, an effective method for aligning an event and its tweets

        through joint statistical modeling of topical influences from the events
        and

        their associated tweets. The model enables the automatic segmentation of
        the

        events and the characterization of tweets into two categories: (1)
        episodic

        tweets that respond specifically to the content in the segments of the
        events,

        and (2) steady tweets that respond generally about the events. We
        present an

        efficient inference method for this model, and a comprehensive
        evaluation of

        its effectiveness over existing methods. In particular, through a user
        study,

        we demonstrate that users find the topics, the segments, the alignment,
        and the

        episodic tweets discovered by ET-LDA to be of higher quality and more

        interesting as compared to the state-of-the-art, with improvements in
        the range

        of 18-41%.
      - >2
          A natural optimization model that formulates many online resource allocation
        and revenue management problems is the online linear program (LP) in
        which the

        constraint matrix is revealed column by column along with the
        corresponding

        objective coefficient. In such a model, a decision variable has to be
        set each

        time a column is revealed without observing the future inputs and the
        goal is

        to maximize the overall objective function. In this paper, we provide a

        near-optimal algorithm for this general class of online problems under
        the

        assumption of random order of arrival and some mild conditions on the
        size of

        the LP right-hand-side input. Specifically, our learning-based algorithm
        works

        by dynamically updating a threshold price vector at geometric time
        intervals,

        where the dual prices learned from the revealed columns in the previous
        period

        are used to determine the sequential decisions in the current period.
        Due to

        the feature of dynamic learning, the competitiveness of our algorithm
        improves

        over the past study of the same problem. We also present a worst-case
        example

        showing that the performance of our algorithm is near-optimal.
      - >2
          One of the biggest challenges in Multimedia information retrieval and
        understanding is to bridge the semantic gap by properly modeling concept

        semantics in context. The presence of out of vocabulary (OOV) concepts

        exacerbates this difficulty. To address the semantic gap issues, we
        formulate a

        problem on learning contextualized semantics from descriptive terms and
        propose

        a novel Siamese architecture to model the contextualized semantics from

        descriptive terms. By means of pattern aggregation and probabilistic
        topic

        models, our Siamese architecture captures contextualized semantics from
        the

        co-occurring descriptive terms via unsupervised learning, which leads to
        a

        concept embedding space of the terms in context. Furthermore, the
        co-occurring

        OOV concepts can be easily represented in the learnt concept embedding
        space.

        The main properties of the concept embedding space are demonstrated via

        visualization. Using various settings in semantic priming, we have
        carried out

        a thorough evaluation by comparing our approach to a number of
        state-of-the-art

        methods on six annotation corpora in different domains, i.e., MagTag5K,
        CAL500

        and Million Song Dataset in the music domain as well as Corel5K, LabelMe
        and

        SUNDatabase in the image domain. Experimental results on semantic
        priming

        suggest that our approach outperforms those state-of-the-art methods

        considerably in various aspects.
  - source_sentence: Parallel Online Learning
    sentences:
      - >2
          In our recent paper, we showed that in exponential family, contrastive
        divergence (CD) with fixed learning rate will give asymptotically
        consistent

        estimates \cite{wu2016convergence}. In this paper, we establish
        consistency and

        convergence rate of CD with annealed learning rate $\eta_t$.
        Specifically,

        suppose CD-$m$ generates the sequence of parameters $\{\theta_t\}_{t \ge
        0}$

        using an i.i.d. data sample $\mathbf{X}_1^n \sim p_{\theta^*}$ of size
        $n$,

        then $\delta_n(\mathbf{X}_1^n) = \limsup_{t \to \infty} \Vert
        \sum_{s=t_0}^t

        \eta_s \theta_s / \sum_{s=t_0}^t \eta_s - \theta^* \Vert$ converges in

        probability to 0 at a rate of $1/\sqrt[3]{n}$. The number ($m$) of MCMC

        transitions in CD only affects the coefficient factor of convergence
        rate. Our

        proof is not a simple extension of the one in \cite{wu2016convergence}.
        which

        depends critically on the fact that $\{\theta_t\}_{t \ge 0}$ is a
        homogeneous

        Markov chain conditional on the observed sample $\mathbf{X}_1^n$. Under

        annealed learning rate, the homogeneous Markov property is not available
        and we

        have to develop an alternative approach based on super-martingales.
        Experiment

        results of CD on a fully-visible $2\times 2$ Boltzmann Machine are
        provided to

        demonstrate our theoretical results.
      - >2
          This report outlines an approach to learning generative models from data. We
        express models as probabilistic programs, which allows us to capture
        abstract

        patterns within the examples. By choosing our language for programs to
        be an

        extension of the algebraic data type of the examples, we can begin with
        a

        program that generates all and only the examples. We then introduce
        greater

        abstraction, and hence generalization, incrementally to the extent that
        it

        improves the posterior probability of the examples given the program.
        Motivated

        by previous approaches to model merging and program induction, we search
        for

        such explanatory abstractions using program transformations. We consider
        two

        types of transformation: Abstraction merges common subexpressions within
        a

        program into new functions (a form of anti-unification). Deargumentation

        simplifies functions by reducing the number of arguments. We demonstrate
        that

        this approach finds key patterns in the domain of nested lists,
        including

        parameterized sub-functions and stochastic recursion.
      - >2
          In this work we study parallelization of online learning, a core primitive in
        machine learning. In a parallel environment all known approaches for
        parallel

        online learning lead to delayed updates, where the model is updated
        using

        out-of-date information. In the worst case, or when examples are
        temporally

        correlated, delay can have a very adverse effect on the learning
        algorithm.

        Here, we analyze and present preliminary empirical results on a set of
        learning

        architectures based on a feature sharding approach that present various

        tradeoffs between delay, degree of parallelism, representation power and

        empirical performance.
  - source_sentence: Maximin affinity learning of image segmentation
    sentences:
      - >2
          Most existing approaches to hashing apply a single form of hash function, and
        an optimization process which is typically deeply coupled to this
        specific

        form. This tight coupling restricts the flexibility of the method to
        respond to

        the data, and can result in complex optimization problems that are
        difficult to

        solve. Here we propose a flexible yet simple framework that is able to

        accommodate different types of loss functions and hash functions. This

        framework allows a number of existing approaches to hashing to be placed
        in

        context, and simplifies the development of new problem-specific hashing

        methods. Our framework decomposes hashing learning problem into two
        steps: hash

        bit learning and hash function learning based on the learned bits. The
        first

        step can typically be formulated as binary quadratic problems, and the
        second

        step can be accomplished by training standard binary classifiers. Both
        problems

        have been extensively studied in the literature. Our extensive
        experiments

        demonstrate that the proposed framework is effective, flexible and
        outperforms

        the state-of-the-art.
      - >2
          Changes in the UK electricity market mean that domestic users will be
        required to modify their usage behaviour in order that supplies can be

        maintained. Clustering allows usage profiles collected at the household
        level

        to be clustered into groups and assigned a stereotypical profile which
        can be

        used to target marketing campaigns. Fuzzy C Means clustering extends
        this by

        allowing each household to be a member of many groups and hence provides
        the

        opportunity to make personalised offers to the household dependent on
        their

        degree of membership of each group. In addition, feedback can be
        provided on

        how user's changing behaviour is moving them towards more "green" or
        cost

        effective stereotypical usage.
      - >2
          Images can be segmented by first using a classifier to predict an affinity
        graph that reflects the degree to which image pixels must be grouped
        together

        and then partitioning the graph to yield a segmentation. Machine
        learning has

        been applied to the affinity classifier to produce affinity graphs that
        are

        good in the sense of minimizing edge misclassification rates. However,
        this

        error measure is only indirectly related to the quality of segmentations

        produced by ultimately partitioning the affinity graph. We present the
        first

        machine learning algorithm for training a classifier to produce affinity
        graphs

        that are good in the sense of producing segmentations that directly
        minimize

        the Rand index, a well known segmentation performance measure. The Rand
        index

        measures segmentation performance by quantifying the classification of
        the

        connectivity of image pixel pairs after segmentation. By using the
        simple graph

        partitioning algorithm of finding the connected components of the
        thresholded

        affinity graph, we are able to train an affinity classifier to directly

        minimize the Rand index of segmentations resulting from the graph
        partitioning.

        Our learning algorithm corresponds to the learning of maximin affinities

        between image pixel pairs, which are predictive of the pixel-pair
        connectivity.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

SentenceTransformer based on lufercho/my-finetuned-bert-mlm

This is a sentence-transformers model finetuned from lufercho/my-finetuned-bert-mlm. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: lufercho/my-finetuned-bert-mlm
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
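
The Pooling module above mean-pools the BERT token embeddings (ignoring padding positions) to produce the 768-dimensional sentence vector. Purely as an illustration of that pooling step, here is a minimal sketch using the plain transformers API; it assumes the underlying BertModel weights can be loaded directly from this repository, as is standard for sentence-transformers models, and the input sentence is just an example.

from transformers import AutoTokenizer, AutoModel
import torch

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, masking out padding tokens
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, dim=1) / torch.clamp(mask.sum(dim=1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("lufercho/my-finetuned-sentence-bert")
model = AutoModel.from_pretrained("lufercho/my-finetuned-sentence-bert")

encoded = tokenizer(["Parallel Online Learning"], padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)
sentence_embedding = mean_pooling(output, encoded["attention_mask"])  # shape: [1, 768]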

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("lufercho/my-finetuned-sentence-bert")
# Run inference
sentences = [
    'Maximin affinity learning of image segmentation',
    '  Images can be segmented by first using a classifier to predict an affinity\ngraph that reflects the degree to which image pixels must be grouped together\nand then partitioning the graph to yield a segmentation. Machine learning has\nbeen applied to the affinity classifier to produce affinity graphs that are\ngood in the sense of minimizing edge misclassification rates. However, this\nerror measure is only indirectly related to the quality of segmentations\nproduced by ultimately partitioning the affinity graph. We present the first\nmachine learning algorithm for training a classifier to produce affinity graphs\nthat are good in the sense of producing segmentations that directly minimize\nthe Rand index, a well known segmentation performance measure. The Rand index\nmeasures segmentation performance by quantifying the classification of the\nconnectivity of image pixel pairs after segmentation. By using the simple graph\npartitioning algorithm of finding the connected components of the thresholded\naffinity graph, we are able to train an affinity classifier to directly\nminimize the Rand index of segmentations resulting from the graph partitioning.\nOur learning algorithm corresponds to the learning of maximin affinities\nbetween image pixel pairs, which are predictive of the pixel-pair connectivity.\n',
    '  Changes in the UK electricity market mean that domestic users will be\nrequired to modify their usage behaviour in order that supplies can be\nmaintained. Clustering allows usage profiles collected at the household level\nto be clustered into groups and assigned a stereotypical profile which can be\nused to target marketing campaigns. Fuzzy C Means clustering extends this by\nallowing each household to be a member of many groups and hence provides the\nopportunity to make personalised offers to the household dependent on their\ndegree of membership of each group. In addition, feedback can be provided on\nhow user\'s changing behaviour is moving them towards more "green" or cost\neffective stereotypical usage.\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
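
The embeddings can also be used for semantic search over a small corpus. A minimal sketch using the util helpers that ship with sentence-transformers (the corpus and query strings below are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("lufercho/my-finetuned-sentence-bert")
corpus = [
    "Parallel Online Learning",
    "A Dynamic Near-Optimal Algorithm for Online Linear Programming",
    "Maximin affinity learning of image segmentation",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("online learning in parallel environments", convert_to_tensor=True)

# Retrieve the closest corpus entries by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))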

Training Details

Training Dataset

Unnamed Dataset

  • Size: 5,000 training samples
  • Columns: sentence_0, sentence_1, and sentence_2
  • Approximate statistics based on the first 1000 samples:
    • sentence_0 (string): min: 4 tokens, mean: 13.41 tokens, max: 38 tokens
    • sentence_1 (string): min: 37 tokens, mean: 201.32 tokens, max: 512 tokens
    • sentence_2 (string): min: 24 tokens, mean: 204.09 tokens, max: 512 tokens
  • Samples:
    • Sample 1
      • sentence_0: Clustering with Transitive Distance and K-Means Duality
      • sentence_1: Recent spectral clustering methods are a popular and powerful technique for data clustering. These methods need to solve the eigenproblem whose computational complexity is $O(n^3)$, where $n$ is the number of data samples. In this paper, a non-eigenproblem based clustering method is proposed to deal with the clustering problem. Its performance is comparable to the spectral clustering algorithms but it is more efficient with computational complexity $O(n^2)$. We show that with a transitive distance and an observed property, called K-means duality, our algorithm can be used to handle data sets with complex cluster shapes, multi-scale clusters, and noise. Moreover, no parameters except the number of clusters need to be set in our algorithm.
      • sentence_2: We show that the log-likelihood of several probabilistic graphical models is Lipschitz continuous with respect to the lp-norm of the parameters. We discuss several implications of Lipschitz parametrization. We present an upper bound of the Kullback-Leibler divergence that allows understanding methods that penalize the lp-norm of differences of parameters as the minimization of that upper bound. The expected log-likelihood is lower bounded by the negative lp-norm, which allows understanding the generalization ability of probabilistic models. The exponential of the negative lp-norm is involved in the lower bound of the Bayes error rate, which shows that it is reasonable to use parameters as features in algorithms that rely on metric spaces (e.g. classification, dimensionality reduction, clustering). Our results do not rely on specific algorithms for learning the structure or parameters. We show preliminary results for activity recognition and temporal segmentation.
    • Sample 2
      • sentence_0: Clustering Dynamic Web Usage Data
      • sentence_1: Most classification methods are based on the assumption that data conforms to a stationary distribution. The machine learning domain currently suffers from a lack of classification techniques that are able to detect the occurrence of a change in the underlying data distribution. Ignoring possible changes in the underlying concept, also known as concept drift, may degrade the performance of the classification model. Often these changes make the model inconsistent and regular updatings become necessary. Taking the temporal dimension into account during the analysis of Web usage data is a necessity, since the way a site is visited may indeed evolve due to modifications in the structure and content of the site, or even due to changes in the behavior of certain user groups. One solution to this problem, proposed in this article, is to update models using summaries obtained by means of an evolutionary approach based on an intelligent clustering approach. We carry out various clustering str...
      • sentence_2: Exponential family extensions of principal component analysis (EPCA) have received a considerable amount of attention in recent years, demonstrating the growing need for basic modeling tools that do not assume the squared loss or Gaussian distribution. We extend the EPCA model toolbox by presenting the first exponential family multi-view learning methods of the partial least squares and canonical correlation analysis, based on a unified representation of EPCA as matrix factorization of the natural parameters of exponential family. The models are based on a new family of priors that are generally usable for all such factorizations. We also introduce new inference strategies, and demonstrate how the methods outperform earlier ones when the Gaussianity assumption does not hold.
    • Sample 3
      • sentence_0: Trading USDCHF filtered by Gold dynamics via HMM coupling
      • sentence_1: We devise a USDCHF trading strategy using the dynamics of gold as a filter. Our strategy involves modelling both USDCHF and gold using a coupled hidden Markov model (CHMM). The observations will be indicators, RSI and CCI, which will be used as triggers for our trading signals. Upon decoding the model in each iteration, we can get the next most probable state and the next most probable observation. Hopefully by taking advantage of intermarket analysis and the Markov property implicit in the model, trading with these most probable values will produce profitable results.
      • sentence_2: Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. Up to now, all previous work have assumed a threat model in which the adversary can feed data directly into the machine learning classifier. This is not always the case for systems operating in the physical world, for example those which are using signals from cameras and other sensors as an input. This paper shows that even in such physical world scenarios, machine learning systems are vul...
  • Loss: TripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
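
For reference, TripletLoss with the Euclidean distance metric pulls each anchor (title) towards its positive (matching abstract) and pushes it away from the negative (unrelated abstract) until the gap exceeds the margin of 5. A minimal PyTorch sketch of that objective (tensor names are illustrative, not part of this model's code):

import torch

def triplet_loss(anchor, positive, negative, margin=5.0):
    # anchor/positive/negative: embedding batches of shape [batch_size, 768]
    d_pos = torch.nn.functional.pairwise_distance(anchor, positive)  # Euclidean distances
    d_neg = torch.nn.functional.pairwise_distance(anchor, negative)
    return torch.relu(d_pos - d_neg + margin).mean()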
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 2
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
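
A hedged sketch of how a comparable run could be set up with the non-default hyperparameters above; the triplet dataset construction and example strings below are illustrative placeholders, not the actual training pipeline:

from datasets import Dataset
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import TripletLoss, TripletDistanceMetric

model = SentenceTransformer("lufercho/my-finetuned-bert-mlm")

# Columns follow the (anchor, positive, negative) layout shown in the samples above
train_dataset = Dataset.from_dict({
    "sentence_0": ["Parallel Online Learning"],
    "sentence_1": ["In this work we study parallelization of online learning ..."],
    "sentence_2": ["K-means algorithm is a very popular clustering algorithm ..."],
})

loss = TripletLoss(model, distance_metric=TripletDistanceMetric.EUCLIDEAN, triplet_margin=5)
args = SentenceTransformerTrainingArguments(
    output_dir="my-finetuned-sentence-bert",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    multi_dataset_batch_sampler="round_robin",
)
trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()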

Training Logs

Epoch Step Training Loss
1.5974 500 0.8647

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.46.2
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.1.1
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3
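
To approximately reproduce this environment, the listed versions can be pinned at install time (Python 3.10 assumed):

pip install "sentence-transformers==3.3.1" "transformers==4.46.2" "torch==2.5.1" "accelerate==1.1.1" "datasets==3.1.0" "tokenizers==0.20.3"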

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

TripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}