SentenceTransformer based on microsoft/deberta-v3-small

This is a sentence-transformers model finetuned from microsoft/deberta-v3-small on the stanfordnlp/snli dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: microsoft/deberta-v3-small
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: stanfordnlp/snli
  • Language: en

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DebertaV2Model 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
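
The Pooling module above has only pooling_mode_mean_tokens enabled, so a sentence vector is the average of the token vectors that are not padding. A minimal pure-Python sketch of that idea follows; the function name mean_pool and the toy inputs are illustrative assumptions, not the library's implementation.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy example: three 2-dimensional token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

In the real model each token vector has 768 entries, so the pooled sentence vector does too.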

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("bobox/DeBERTaV3-small-ST-AdaptiveLayers-ep2")
# Run inference
sentences = [
    'A wet child stands in chest deep ocean water.',
    'The child is playing on the beach.',
    'A woman paints a portrait of her best friend.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
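
Since the similarity function listed for this model is cosine similarity, the 3 x 3 matrix above can be understood as pairwise cosines between the embeddings. The sketch below reproduces that computation on toy 2-dimensional vectors in pure Python; the helper names are hypothetical and the toy vectors stand in for real embeddings.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Pairwise cosine similarities; entry [i][j] compares vector i with vector j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy stand-ins for sentence embeddings (assumed values, not model output).
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 4) for value in row])
```

The diagonal is 1 (up to floating-point rounding) because every vector points in the same direction as itself, and the matrix is symmetric.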

Evaluation

Metrics

Binary Classification

Metric Value
cosine_accuracy 0.6583
cosine_accuracy_threshold 0.6767
cosine_f1 0.7049
cosine_f1_threshold 0.6018
cosine_precision 0.6115
cosine_recall 0.8321
cosine_ap 0.6995
dot_accuracy 0.6272
dot_accuracy_threshold 163.2505
dot_f1 0.6976
dot_f1_threshold 119.2078
dot_precision 0.5639
dot_recall 0.9144
dot_ap 0.6437
manhattan_accuracy 0.6571
manhattan_accuracy_threshold 243.7545
manhattan_f1 0.7056
manhattan_f1_threshold 295.9595
manhattan_precision 0.5901
manhattan_recall 0.8773
manhattan_ap 0.7072
euclidean_accuracy 0.6591
euclidean_accuracy_threshold 12.1418
euclidean_f1 0.7037
euclidean_f1_threshold 14.1975
euclidean_precision 0.5997
euclidean_recall 0.8513
euclidean_ap 0.7035
max_accuracy 0.6591
max_accuracy_threshold 243.7545
max_f1 0.7056
max_f1_threshold 295.9595
max_precision 0.6115
max_recall 0.9144
max_ap 0.7072
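
The threshold metrics above can be read as follows: a pair is predicted positive when its similarity reaches the threshold, and accuracy, precision, recall and F1 are computed from the resulting confusion counts. The sketch below shows this on made-up scores and labels; the function name and the use of >= at the threshold are assumptions, not the evaluator's code.

```python
from typing import Dict, List


def binary_metrics(scores: List[float], labels: List[int], threshold: float) -> Dict[str, float]:
    """Score a similarity threshold: pairs at or above it are predicted positive."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up cosine scores and gold labels, evaluated at the reported cosine_f1_threshold.
toy_scores = [0.9, 0.7, 0.65, 0.3]
toy_labels = [1, 0, 1, 0]
toy = binary_metrics(toy_scores, toy_labels, threshold=0.6018)
print(toy)

# The reported cosine_precision and cosine_recall are consistent with cosine_f1.
reported_f1 = 2 * 0.6115 * 0.8321 / (0.6115 + 0.8321)
print(round(reported_f1, 4))  # 0.7049
```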

Semantic Similarity

Metric Value
pearson_cosine 0.7322
spearman_cosine 0.7345
pearson_manhattan 0.7537
spearman_manhattan 0.7551
pearson_euclidean 0.7468
spearman_euclidean 0.7485
pearson_dot 0.6143
spearman_dot 0.61
pearson_max 0.7537
spearman_max 0.7551
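
The Pearson rows measure linear agreement between predicted similarities and gold scores, while the Spearman rows apply the same formula to ranks, so they only reward getting the ordering right. A small pure-Python sketch under that definition follows; the helper names and toy numbers are illustrative assumptions.

```python
import math
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Covariance divided by the product of the standard deviations."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x: List[float], y: List[float]) -> float:
    """Pearson correlation computed on ranks instead of raw values."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy gold scores and predicted similarities with identical ordering.
gold = [1.0, 0.95, 0.2, 0.5]
predicted = [0.9, 0.8, 0.1, 0.3]
print(round(spearman(gold, predicted), 4), round(pearson(gold, predicted), 4))
```

Here the ordering matches exactly, so the Spearman value is 1 even though the Pearson value is below 1.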

Training Details

Training Dataset

stanfordnlp/snli

  • Dataset: stanfordnlp/snli at cdb5c3d
  • Size: 67,190 training samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
    • sentence1 (string): min 4 tokens, mean 21.19 tokens, max 133 tokens
    • sentence2 (string): min 4 tokens, mean 11.77 tokens, max 49 tokens
    • label (int): 0: 100.00%
  • Samples:
    • sentence1: Without a placebo group, we still won't know if any of the treatments are better than nothing and therefore worth giving.
      sentence2: It is necessary to use a controlled method to ensure the treatments are worthwhile.
      label: 0
    • sentence1: It was conducted in silence.
      sentence2: It was done silently.
      label: 0
    • sentence1: oh Lewisville any decent food in your cafeteria up there
      sentence2: Is there any decent food in your cafeteria up there in Lewisville?
      label: 0
  • Loss: AdaptiveLayerLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "n_layers_per_step": 1,
        "last_layer_weight": 1,
        "prior_layers_weight": 1,
        "kl_div_weight": 1,
        "kl_temperature": 1
    }
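
The inner loss named above, MultipleNegativesRankingLoss, treats every other sentence2 in a batch as a negative for a given sentence1 and applies cross-entropy over scaled cosine similarities. The following is a pure-Python sketch of that idea only: the scale of 20.0 is an assumed value, the function names are hypothetical, and the AdaptiveLayerLoss wrapper, which additionally applies the loss to earlier layers, is not modelled here.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_ranking_loss(anchors: List[List[float]], positives: List[List[float]],
                          scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, candidate) for candidate in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs: correctly aligned, then deliberately swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Larger batches supply more in-batch negatives per anchor, which is one reason the batch size matters for this kind of loss.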
    

Evaluation Dataset

stanfordnlp/snli

  • Dataset: stanfordnlp/snli at cdb5c3d
  • Size: 1,500 evaluation samples
  • Columns: sentence1, sentence2, and score
  • Approximate statistics based on the first 1000 samples:
    • sentence1 (string): min 5 tokens, mean 14.77 tokens, max 45 tokens
    • sentence2 (string): min 6 tokens, mean 14.74 tokens, max 49 tokens
    • score (float): min 0.0, mean 0.47, max 1.0
  • Samples:
    • sentence1: A man with a hard hat is dancing.
      sentence2: A man wearing a hard hat is dancing.
      score: 1.0
    • sentence1: A young child is riding a horse.
      sentence2: A child is riding a horse.
      score: 0.95
    • sentence1: A man is feeding a mouse to a snake.
      sentence2: The man is feeding a mouse to the snake.
      score: 1.0
  • Loss: AdaptiveLayerLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "n_layers_per_step": 1,
        "last_layer_weight": 1,
        "prior_layers_weight": 1,
        "kl_div_weight": 1,
        "kl_temperature": 1
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 42
  • per_device_eval_batch_size: 22
  • learning_rate: 3e-06
  • weight_decay: 1e-08
  • num_train_epochs: 2
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.5
  • save_safetensors: False
  • fp16: True
  • hub_model_id: bobox/DeBERTaV3-small-ST-AdaptiveLayers-ep2-tmp
  • hub_strategy: checkpoint
  • hub_private_repo: True
  • batch_sampler: no_duplicates
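
To make the scheduler settings concrete, the sketch below works out the step count and the learning rate implied by learning_rate 3e-06, lr_scheduler_type cosine and warmup_ratio 0.5. It assumes the usual linear-warmup-then-cosine-decay shape and derives the step count from the training set size and batch size listed in this card; the function name lr_at_step is hypothetical.

```python
import math

# 67,190 training samples at a batch size of 42, for 2 epochs.
steps_per_epoch = math.ceil(67190 / 42)
total_steps = steps_per_epoch * 2
print(steps_per_epoch, total_steps)  # 1600 3200


def lr_at_step(step: int, base_lr: float = 3e-06, total: int = total_steps,
               warmup_ratio: float = 0.5) -> float:
    """Linear warmup to base_lr, then a half-cosine decay towards zero."""
    warmup_steps = math.ceil(total * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at the quarter points of training.
for step in (0, 800, 1600, 2400, 3200):
    print(step, f"{lr_at_step(step):.2e}")
```

Under these assumptions the whole first epoch is warmup: the learning rate peaks at step 1600 and decays to zero by step 3200.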

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 42
  • per_device_eval_batch_size: 22
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 3e-06
  • weight_decay: 1e-08
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.5
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: False
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: bobox/DeBERTaV3-small-ST-AdaptiveLayers-ep2-tmp
  • hub_strategy: checkpoint
  • hub_private_repo: True
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Evaluation Loss max_ap spearman_cosine
0.1 160 4.6003 4.8299 0.6017 -
0.2 320 4.0659 4.3436 0.6168 -
0.3 480 3.4886 4.0840 0.6339 -
0.4 640 3.0592 3.6422 0.6611 -
0.5 800 2.5728 3.1927 0.6773 -
0.6 960 2.184 2.8322 0.6893 -
0.7 1120 1.8744 2.4892 0.6954 -
0.8 1280 1.757 2.4453 0.7002 -
0.9 1440 1.5872 2.2565 0.7010 -
1.0 1600 1.446 2.1391 0.7046 -
1.1 1760 1.3892 2.1236 0.7058 -
1.2 1920 1.2567 1.9738 0.7053 -
1.3 2080 1.2233 1.8925 0.7063 -
1.4 2240 1.1954 1.8392 0.7075 -
1.5 2400 1.1395 1.9081 0.7065 -
1.6 2560 1.1211 1.8080 0.7074 -
1.7 2720 1.0825 1.8408 0.7073 -
1.8 2880 1.1358 1.7363 0.7073 -
1.9 3040 1.0628 1.8936 0.7072 -
2.0 3200 1.1412 1.7846 0.7072 -
None 0 - 3.0121 0.7072 0.7345

Framework Versions

  • Python: 3.10.13
  • Sentence Transformers: 3.0.1
  • Transformers: 4.41.2
  • PyTorch: 2.1.2
  • Accelerate: 0.30.1
  • Datasets: 2.19.2
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

AdaptiveLayerLoss

@misc{li20242d,
    title={2D Matryoshka Sentence Embeddings}, 
    author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li},
    year={2024},
    eprint={2402.14776},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}