SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the bps-query-publication-similarity-pairs dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: bps-query-publication-similarity-pairs

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
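
The Pooling module above averages token embeddings (mean pooling) to produce one 768-dimensional vector per sentence. As a minimal sketch of what that computes, here is the equivalent operation written directly against 🤗 Transformers; the manual pooling code below is illustrative, not taken from this repository:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("yahyaabd/allstat-semantic-search-mpnet-base-v3-sts")
encoder = AutoModel.from_pretrained("yahyaabd/allstat-semantic-search-mpnet-base-v3-sts")

encoded = tokenizer(["Sistem neraca lingkungan dan ekonomi Indonesia"],
                    padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**encoded).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling: average the token embeddings, masking out padding positions
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embedding.shape)  # torch.Size([1, 768])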

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstat-semantic-search-mpnet-base-v3-sts")
# Run inference
sentences = [
    'Sistem neraca lingkungan dan ekonomi Indonesia, -',
    'Sistem Terintegrasi Neraca Lingkungan dan Ekonomi Indonesia -',
    'Distribusi Perdagangan Komoditas Minyak Goreng Indonesia ',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
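
Beyond pairwise similarity, the same embeddings support semantic search: encode a query and a set of publication titles, then rank the titles by cosine similarity. A short sketch (the query and titles below are illustrative, taken from the dataset samples later in this card):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yahyaabd/allstat-semantic-search-mpnet-base-v3-sts")

query = "Tren bisnis perikanan di Indonesia"
corpus = [
    "Statistik Perusahaan Perikanan",
    "Statistik Air Bersih",
    "Statistik Penyedia Makan Minum",
]
query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Rank corpus entries by similarity to the query (cosine by default for this model)
scores = model.similarity(query_embedding, corpus_embeddings)[0]
for idx in scores.argsort(descending=True):
    i = int(idx)
    print(f"{scores[i]:.3f}  {corpus[i]}")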

Evaluation

Metrics

Semantic Similarity

| Metric          | allstat-semantic-dev | allstat-semantic-test |
|:----------------|:--------------------:|:---------------------:|
| pearson_cosine  | 0.9672               | 0.9644                |
| spearman_cosine | 0.8714               | 0.8572                |
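
Scores of this form are what sentence_transformers.evaluation.EmbeddingSimilarityEvaluator reports. A minimal sketch of how they can be recomputed, assuming a handful of illustrative (query, title, gold score) triples rather than the full dev split of 2,482 pairs:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("yahyaabd/allstat-semantic-search-mpnet-base-v3-sts")

# Illustrative pairs only, taken from the dataset samples below
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["Tren bisnis perikanan di Indonesia",
                "Statistik APBDes",
                "Laporan Indikator Konstruksi semester 1"],
    sentences2=["Statistik Perusahaan Perikanan",
                "Statistik Perusahaan Peternakan Ternak Besar dan Kecil",
                "Statistik Air Bersih -"],
    scores=[0.88, 0.29, 0.25],  # gold similarity scores in [0, 1]
    name="allstat-semantic-dev",
)
results = evaluator(model)
print(results)  # includes keys like "allstat-semantic-dev_spearman_cosine"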

Training Details

Training Dataset

bps-query-publication-similarity-pairs

  • Dataset: bps-query-publication-similarity-pairs at cf2836e
  • Size: 44,668 training samples
  • Columns: query, doc_title, and score
  • Approximate statistics based on the first 1000 samples:
    |         | query                                            | doc_title                                         | score                          |
    |---------|--------------------------------------------------|---------------------------------------------------|--------------------------------|
    | type    | string                                           | string                                            | float                          |
    | details | min: 4 tokens, mean: 9.66 tokens, max: 36 tokens | min: 4 tokens, mean: 11.88 tokens, max: 49 tokens | min: 0.0, mean: 0.49, max: 1.0 |
  • Samples:
    | query                                   | doc_title                                              | score |
    |-----------------------------------------|--------------------------------------------------------|-------|
    | Tren bisnis perikanan di Indonesia      | Statistik Perusahaan Perikanan                         | 0.88  |
    | Statistik APBDes                        | Statistik Perusahaan Peternakan Ternak Besar dan Kecil | 0.29  |
    | Laporan Indikator Konstruksi semester 1 | Statistik Air Bersih -                                 | 0.25  |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
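
The CosineSimilarityLoss above fits the model so that the cosine similarity of each (query, doc_title) embedding pair matches the gold score. In plain PyTorch the objective reduces to the following (a sketch of the computation, not the library's code):

import torch
import torch.nn.functional as F

# u, v: embeddings of a batch of queries and document titles (random stand-ins here)
u = torch.randn(16, 768)
v = torch.randn(16, 768)
gold = torch.rand(16)  # gold similarity scores in [0, 1]

# MSE between predicted cosine similarities and the gold scores
loss = F.mse_loss(F.cosine_similarity(u, v), gold)
print(loss.item())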
    

Evaluation Dataset

bps-query-publication-similarity-pairs

  • Dataset: bps-query-publication-similarity-pairs at cf2836e
  • Size: 2,482 evaluation samples
  • Columns: query, doc_title, and score
  • Approximate statistics based on the first 1000 samples:
    |         | query                                            | doc_title                                         | score                          |
    |---------|--------------------------------------------------|---------------------------------------------------|--------------------------------|
    | type    | string                                           | string                                            | float                          |
    | details | min: 4 tokens, mean: 9.55 tokens, max: 31 tokens | min: 4 tokens, mean: 11.62 tokens, max: 36 tokens | min: 0.0, mean: 0.54, max: 1.0 |
  • Samples:
    | query                                     | doc_title                                                                       | score |
    |-------------------------------------------|---------------------------------------------------------------------------------|-------|
    | Dampak COVID-19 pada usaha mikro kecil    | Statistik Penyedia Makan Minum                                                  | 0.2   |
    | Sektor konstruksi Aceh, data UMKM         | Profil Usaha Konstruksi Perorangan Provinsi Aceh,                               | 0.88  |
    | SP2010: Statistik lansia Sumatera Selatan | Statistik Penduduk Lanjut Usia Provinsi Sumatera Selatan -Hasil Sensus Penduduk | 0.81  |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • fp16: True
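
Put together, the dataset, loss, and the non-default hyperparameters above correspond to a Sentence Transformers v3 training run along these lines. A hedged sketch: the dataset repository id and split names are assumptions, not taken from this card.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Assumed repository id; columns are (query, doc_title, score) as described above
dataset = load_dataset("yahyaabd/bps-query-publication-similarity-pairs")

loss = CosineSimilarityLoss(model)  # MSE on cosine similarity, as configured above

args = SentenceTransformerTrainingArguments(
    output_dir="allstat-semantic-search-mpnet-base-v3-sts",
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    warmup_ratio=0.1,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # split name assumed
    loss=loss,
)
trainer.train()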

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | Validation Loss | allstat-semantic-dev_spearman_cosine | allstat-semantic-test_spearman_cosine |
|:------|:-----|:--------------|:----------------|:-------------------------------------|:--------------------------------------|
| 0.0358 | 100 | 0.0498 | 0.0311 | 0.7840 | - |
| 0.0716 | 200 | 0.0294 | 0.0245 | 0.7970 | - |
| 0.1074 | 300 | 0.0241 | 0.0210 | 0.8040 | - |
| 0.1433 | 400 | 0.0215 | 0.0192 | 0.8078 | - |
| 0.1791 | 500 | 0.0208 | 0.0200 | 0.8091 | - |
| 0.2149 | 600 | 0.0208 | 0.0183 | 0.8183 | - |
| 0.2507 | 700 | 0.0216 | 0.0176 | 0.8177 | - |
| 0.2865 | 800 | 0.02 | 0.0177 | 0.8192 | - |
| 0.3223 | 900 | 0.0183 | 0.0180 | 0.8107 | - |
| 0.3582 | 1000 | 0.0197 | 0.0190 | 0.8058 | - |
| 0.3940 | 1100 | 0.0199 | 0.0176 | 0.8182 | - |
| 0.4298 | 1200 | 0.0207 | 0.0193 | 0.8097 | - |
| 0.4656 | 1300 | 0.0186 | 0.0190 | 0.8088 | - |
| 0.5014 | 1400 | 0.0197 | 0.0178 | 0.8122 | - |
| 0.5372 | 1500 | 0.0179 | 0.0177 | 0.8161 | - |
| 0.5731 | 1600 | 0.0171 | 0.0169 | 0.8197 | - |
| 0.6089 | 1700 | 0.0178 | 0.0162 | 0.8152 | - |
| 0.6447 | 1800 | 0.0152 | 0.0162 | 0.8234 | - |
| 0.6805 | 1900 | 0.0171 | 0.0162 | 0.8187 | - |
| 0.7163 | 2000 | 0.0179 | 0.0154 | 0.8194 | - |
| 0.7521 | 2100 | 0.0164 | 0.0158 | 0.8126 | - |
| 0.7880 | 2200 | 0.016 | 0.0149 | 0.8254 | - |
| 0.8238 | 2300 | 0.0164 | 0.0149 | 0.8193 | - |
| 0.8596 | 2400 | 0.0151 | 0.0139 | 0.8297 | - |
| 0.8954 | 2500 | 0.0151 | 0.0142 | 0.8306 | - |
| 0.9312 | 2600 | 0.0136 | 0.0143 | 0.8315 | - |
| 0.9670 | 2700 | 0.0157 | 0.0135 | 0.8342 | - |
| 1.0029 | 2800 | 0.0133 | 0.0135 | 0.8330 | - |
| 1.0387 | 2900 | 0.0116 | 0.0133 | 0.8369 | - |
| 1.0745 | 3000 | 0.0106 | 0.0132 | 0.8357 | - |
| 1.1103 | 3100 | 0.0113 | 0.0126 | 0.8395 | - |
| 1.1461 | 3200 | 0.0123 | 0.0131 | 0.8362 | - |
| 1.1819 | 3300 | 0.0117 | 0.0142 | 0.8289 | - |
| 1.2178 | 3400 | 0.0133 | 0.0135 | 0.8322 | - |
| 1.2536 | 3500 | 0.0113 | 0.0129 | 0.8358 | - |
| 1.2894 | 3600 | 0.0109 | 0.0132 | 0.8352 | - |
| 1.3252 | 3700 | 0.0107 | 0.0122 | 0.8394 | - |
| 1.3610 | 3800 | 0.0125 | 0.0128 | 0.8364 | - |
| 1.3968 | 3900 | 0.012 | 0.0126 | 0.8342 | - |
| 1.4327 | 4000 | 0.0123 | 0.0128 | 0.8364 | - |
| 1.4685 | 4100 | 0.0109 | 0.0127 | 0.8369 | - |
| 1.5043 | 4200 | 0.0108 | 0.0125 | 0.8385 | - |
| 1.5401 | 4300 | 0.011 | 0.0124 | 0.8416 | - |
| 1.5759 | 4400 | 0.0104 | 0.0120 | 0.8455 | - |
| 1.6117 | 4500 | 0.0107 | 0.0114 | 0.8498 | - |
| 1.6476 | 4600 | 0.0095 | 0.0114 | 0.8485 | - |
| 1.6834 | 4700 | 0.0114 | 0.0118 | 0.8457 | - |
| 1.7192 | 4800 | 0.0101 | 0.0118 | 0.8417 | - |
| 1.7550 | 4900 | 0.0127 | 0.0113 | 0.8466 | - |
| 1.7908 | 5000 | 0.0112 | 0.0114 | 0.8466 | - |
| 1.8266 | 5100 | 0.0095 | 0.0109 | 0.8485 | - |
| 1.8625 | 5200 | 0.0107 | 0.0114 | 0.8465 | - |
| 1.8983 | 5300 | 0.0113 | 0.0115 | 0.8454 | - |
| 1.9341 | 5400 | 0.0107 | 0.0116 | 0.8473 | - |
| 1.9699 | 5500 | 0.0102 | 0.0111 | 0.8526 | - |
| 2.0057 | 5600 | 0.0097 | 0.0109 | 0.8542 | - |
| 2.0415 | 5700 | 0.0082 | 0.0106 | 0.8534 | - |
| 2.0774 | 5800 | 0.0069 | 0.0107 | 0.8551 | - |
| 2.1132 | 5900 | 0.0077 | 0.0107 | 0.8533 | - |
| 2.1490 | 6000 | 0.0076 | 0.0109 | 0.8532 | - |
| 2.1848 | 6100 | 0.0071 | 0.0107 | 0.8515 | - |
| 2.2206 | 6200 | 0.0075 | 0.0104 | 0.8563 | - |
| 2.2564 | 6300 | 0.0074 | 0.0102 | 0.8567 | - |
| 2.2923 | 6400 | 0.0083 | 0.0105 | 0.8567 | - |
| 2.3281 | 6500 | 0.0075 | 0.0107 | 0.8515 | - |
| 2.3639 | 6600 | 0.007 | 0.0103 | 0.8546 | - |
| 2.3997 | 6700 | 0.0079 | 0.0103 | 0.8559 | - |
| 2.4355 | 6800 | 0.0072 | 0.0102 | 0.8550 | - |
| 2.4713 | 6900 | 0.0069 | 0.0098 | 0.8618 | - |
| 2.5072 | 7000 | 0.0082 | 0.0099 | 0.8611 | - |
| 2.5430 | 7100 | 0.0067 | 0.0101 | 0.8596 | - |
| 2.5788 | 7200 | 0.0062 | 0.0097 | 0.8593 | - |
| 2.6146 | 7300 | 0.0074 | 0.0094 | 0.8622 | - |
| 2.6504 | 7400 | 0.008 | 0.0093 | 0.8624 | - |
| 2.6862 | 7500 | 0.0066 | 0.0097 | 0.8610 | - |
| 2.7221 | 7600 | 0.0066 | 0.0098 | 0.8616 | - |
| 2.7579 | 7700 | 0.0066 | 0.0097 | 0.8593 | - |
| 2.7937 | 7800 | 0.0076 | 0.0099 | 0.8582 | - |
| 2.8295 | 7900 | 0.0078 | 0.0094 | 0.8625 | - |
| 2.8653 | 8000 | 0.0075 | 0.0092 | 0.8639 | - |
| 2.9011 | 8100 | 0.0077 | 0.0092 | 0.8620 | - |
| 2.9370 | 8200 | 0.0067 | 0.0092 | 0.8643 | - |
| 2.9728 | 8300 | 0.0069 | 0.0095 | 0.8625 | - |
| 3.0086 | 8400 | 0.0067 | 0.0095 | 0.8632 | - |
| 3.0444 | 8500 | 0.0051 | 0.0093 | 0.8652 | - |
| 3.0802 | 8600 | 0.0046 | 0.0094 | 0.8662 | - |
| 3.1160 | 8700 | 0.0046 | 0.0094 | 0.8669 | - |
| 3.1519 | 8800 | 0.0047 | 0.0095 | 0.8671 | - |
| 3.1877 | 8900 | 0.0049 | 0.0091 | 0.8688 | - |
| 3.2235 | 9000 | 0.0048 | 0.0090 | 0.8688 | - |
| 3.2593 | 9100 | 0.0047 | 0.0092 | 0.8697 | - |
| 3.2951 | 9200 | 0.0058 | 0.0092 | 0.8686 | - |
| 3.3309 | 9300 | 0.005 | 0.0091 | 0.8681 | - |
| 3.3668 | 9400 | 0.0049 | 0.0090 | 0.8694 | - |
| 3.4026 | 9500 | 0.0051 | 0.0091 | 0.8670 | - |
| 3.4384 | 9600 | 0.0048 | 0.0090 | 0.8666 | - |
| 3.4742 | 9700 | 0.0047 | 0.0089 | 0.8672 | - |
| 3.5100 | 9800 | 0.0046 | 0.0091 | 0.8658 | - |
| 3.5458 | 9900 | 0.0051 | 0.0090 | 0.8658 | - |
| 3.5817 | 10000 | 0.0054 | 0.0089 | 0.8681 | - |
| 3.6175 | 10100 | 0.0049 | 0.0089 | 0.8679 | - |
| 3.6533 | 10200 | 0.0042 | 0.0089 | 0.8681 | - |
| 3.6891 | 10300 | 0.0049 | 0.0089 | 0.8684 | - |
| 3.7249 | 10400 | 0.0046 | 0.0088 | 0.8692 | - |
| 3.7607 | 10500 | 0.0048 | 0.0088 | 0.8691 | - |
| 3.7966 | 10600 | 0.0042 | 0.0088 | 0.8704 | - |
| 3.8324 | 10700 | 0.0049 | 0.0088 | 0.8702 | - |
| 3.8682 | 10800 | 0.0045 | 0.0088 | 0.8709 | - |
| 3.9040 | 10900 | 0.0047 | 0.0088 | 0.8712 | - |
| 3.9398 | 11000 | 0.0046 | 0.0088 | 0.8711 | - |
| 3.9756 | 11100 | 0.0045 | 0.0088 | 0.8714 | - |
| 4.0 | 11168 | - | - | - | 0.8572 |

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.2.2+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0
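
To reproduce this environment, the versions above can be pinned at install time (a sketch; nearby versions are likely fine, and PyTorch with CUDA support is typically installed per the instructions on pytorch.org):

pip install sentence-transformers==3.3.1 transformers==4.47.1 accelerate==1.2.1 datasets==3.2.0 tokenizers==0.21.0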

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}