SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the bps-semantic-pairs-synthetic-dataset-v1 dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: bps-semantic-pairs-synthetic-dataset-v1
  • Language: primarily Indonesian (multilingual base model)

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
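
The pooling module above averages the token embeddings (ignoring padding) to produce the 768-dimensional sentence vector. For readers who prefer the plain 🤗 Transformers API, the sketch below reproduces that mean pooling manually; it assumes the repository's transformer and tokenizer weights load directly with AutoModel/AutoTokenizer, as is standard for Sentence Transformers checkpoints.

import torch
from transformers import AutoTokenizer, AutoModel

# Load the underlying XLM-RoBERTa encoder and tokenizer from the same repository
tokenizer = AutoTokenizer.from_pretrained("yahyaabd/allstats-semantic-mpnet-v1")
encoder = AutoModel.from_pretrained("yahyaabd/allstats-semantic-mpnet-v1")

def mean_pooling(token_embeddings, attention_mask):
    # Average token embeddings over real (non-padding) positions only
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

sentences = ["Direktori Perusahaan Kehutanan 2015"]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    output = encoder(**encoded)
embeddings = mean_pooling(output.last_hidden_state, encoded["attention_mask"])
print(embeddings.shape)  # torch.Size([1, 768])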

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-mpnet-v1")
# Run inference
sentences = [
    'Direktori perusahaan pengelola hutan 2015',
    'Direktori Perusahaan Kehutanan 2015',
    'Indeks Pembangunan Manusia (IPM) Indonesia tahun 2024 mencapai 75,02, meningkat 0,63 poin atau 0,85 persen dibandingkan tahun sebelumnya yang sebesar 74,39.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
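
Beyond pairwise scoring, the same embeddings support semantic search over a collection of publication titles. The sketch below is illustrative only: the mini-corpus is hypothetical, and retrieval uses the library's util.semantic_search helper with cosine similarity.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yahyaabd/allstats-semantic-mpnet-v1")

# Hypothetical mini-corpus of publication titles
corpus = [
    "Direktori Perusahaan Kehutanan 2015",
    "Indikator Ekonomi Oktober 2012",
    "Neraca Lembaga Non Profit yang Melayani Rumah Tangga 2011-2013",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "Direktori perusahaan pengelola hutan 2015"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the two most similar titles by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))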

Evaluation

Metrics

Semantic Similarity

Metric             allstats-semantic-mpnet-v1-eval   allstat-semantic-mpnet-v1-test
pearson_cosine     0.9722                            0.9715
spearman_cosine    0.877                             0.8697
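
Metrics of this form are what sentence-transformers' EmbeddingSimilarityEvaluator reports: the model's cosine similarities are correlated with the gold labels. The evaluator class here is an assumption based on the metric names; the sketch below runs it on a handful of pairs taken from the sample rows listed further down (results on such a tiny set are not meaningful).

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("yahyaabd/allstats-semantic-mpnet-v1")

# Toy (query, doc, label) triples in the same format as the training data
queries = [
    "Data bisnis Kalbar sensus 2016",
    "Informasi tentang pola pengeluaran masyarakat Bengkulu berdasarkan kelompok pendapatan?",
    "Laopran keuagnan lmebaga non proft 20112-013",
]
docs = [
    "Indikator Ekonomi Oktober 2012",
    "Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Bengkulu, 2018-2023",
    "Neraca Lembaga Non Profit yang Melayani Rumah Tangga 2011-2013",
]
labels = [0.1, 0.88, 0.93]

evaluator = EmbeddingSimilarityEvaluator(queries, docs, labels, name="toy-eval")
results = evaluator(model)
print(results)  # includes toy-eval_pearson_cosine and toy-eval_spearman_cosine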

Training Details

Training Dataset

bps-semantic-pairs-synthetic-dataset-v1

  • Dataset: bps-semantic-pairs-synthetic-dataset-v1 at 6656af9
  • Size: 73,392 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
      query (string): min 5 tokens, mean 11.28 tokens, max 34 tokens
      doc (string): min 5 tokens, mean 14.71 tokens, max 58 tokens
      label (float): min 0.0, mean 0.48, max 1.0
  • Samples:
      query: "Data bisnis Kalbar sensus 2016" | doc: "Indikator Ekonomi Oktober 2012" | label: 0.1
      query: "Informasi tentang pola pengeluaran masyarakat Bengkulu berdasarkan kelompok pendapatan?" | doc: "Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Bengkulu, 2018-2023" | label: 0.88
      query: "Laopran keuagnan lmebaga non proft 20112-013" | doc: "Neraca Lembaga Non Profit yang Melayani Rumah Tangga 2011-2013" | label: 0.93
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
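
CosineSimilarityLoss with an MSE loss_fct regresses the cosine similarity of the two sentence embeddings onto the gold label. A minimal illustration of that computation for a single pair, using random stand-in vectors, is:

import torch
import torch.nn.functional as F

# What CosineSimilarityLoss with an MSE objective optimizes for one pair:
# regress the cosine similarity of the two embeddings onto the gold label.
u = torch.randn(768)        # embedding of `query` (random stand-in)
v = torch.randn(768)        # embedding of `doc`   (random stand-in)
label = torch.tensor(0.88)  # gold similarity label

cos_sim = F.cosine_similarity(u, v, dim=0)
loss = F.mse_loss(cos_sim, label)
print(loss)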
    

Evaluation Dataset

bps-semantic-pairs-synthetic-dataset-v1

  • Dataset: bps-semantic-pairs-synthetic-dataset-v1 at 6656af9
  • Size: 15,726 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
      query (string): min 4 tokens, mean 11.52 tokens, max 37 tokens
      doc (string): min 5 tokens, mean 14.38 tokens, max 61 tokens
      label (float): min 0.0, mean 0.49, max 1.0
  • Samples:
      query: "Data transportasi bulan Februari 2021" | doc: "Tenaga Kerja Februari 2023" | label: 0.08
      query: "Sebear berspa prrsen eknaikan Inseks Hraga Predagangan eBsar (IHB) Umym Nasiona di aMret 202?" | doc: "Maret 2020, Indeks Harga Perdagangan Besar (IHPB) Umum Nasional naik 0,10 persen" | label: 1.0
      query: "Data ekspor dan moda transportasi tahun 2018-2019" | doc: "Indikator Pasar Tenaga Kerja Indonesia Agustus 2012" | label: 0.08
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 24
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_num_workers: 4
  • load_best_model_at_end: True
  • label_smoothing_factor: 0.01
  • eval_on_start: True
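
Putting the dataset, the CosineSimilarityLoss, and these non-default settings together, a comparable run could be launched with the Sentence Transformers v3 trainer. This is a hedged sketch, not the author's exact script: the Hub dataset ID, the split names, and the output directory are assumptions, and all other arguments are left at their defaults.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

# Start from the multilingual base model named above
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Hub ID and split names are assumptions; the card only names the dataset and a revision
dataset = load_dataset("yahyaabd/bps-semantic-pairs-synthetic-dataset-v1")

# CosineSimilarityLoss defaults to an MSE objective, matching the loss shown above
loss = CosineSimilarityLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-semantic-mpnet-v1",
    num_train_epochs=24,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    label_smoothing_factor=0.01,
    eval_on_start=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    loss=loss,
)
trainer.train()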

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 24
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.01
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-mpnet-v1-eval_spearman_cosine allstat-semantic-mpnet-v1-test_spearman_cosine
0 0 - 0.1031 0.6244 -
0.2180 250 0.064 0.0413 0.6958 -
0.4359 500 0.0381 0.0305 0.7301 -
0.6539 750 0.0284 0.0243 0.7651 -
0.8718 1000 0.025 0.0213 0.7656 -
1.0898 1250 0.0207 0.0201 0.7822 -
1.3078 1500 0.0188 0.0194 0.7805 -
1.5257 1750 0.0182 0.0177 0.7918 -
1.7437 2000 0.0177 0.0168 0.8098 -
1.9616 2250 0.0173 0.0173 0.7979 -
2.1796 2500 0.0151 0.0174 0.8010 -
2.3976 2750 0.014 0.0163 0.8005 -
2.6155 3000 0.0142 0.0159 0.8027 -
2.8335 3250 0.0137 0.0154 0.8074 -
3.0514 3500 0.013 0.0146 0.8173 -
3.2694 3750 0.0099 0.0138 0.8179 -
3.4874 4000 0.0105 0.0135 0.8138 -
3.7053 4250 0.0109 0.0145 0.8138 -
3.9233 4500 0.011 0.0145 0.8244 -
4.1412 4750 0.0086 0.0132 0.8327 -
4.3592 5000 0.0077 0.0129 0.8307 -
4.5772 5250 0.0081 0.0124 0.8380 -
4.7951 5500 0.0087 0.0128 0.8358 -
5.0131 5750 0.0076 0.0135 0.8280 -
5.2310 6000 0.0061 0.0122 0.8399 -
5.4490 6250 0.0062 0.0119 0.8344 -
5.6670 6500 0.007 0.0113 0.8432 -
5.8849 6750 0.0069 0.0117 0.8353 -
6.1029 7000 0.0056 0.0117 0.8333 -
6.3208 7250 0.0047 0.0114 0.8438 -
6.5388 7500 0.0059 0.0114 0.8429 -
6.7568 7750 0.0054 0.0113 0.8452 -
6.9747 8000 0.0059 0.0118 0.8477 -
7.1927 8250 0.0045 0.0109 0.8474 -
7.4106 8500 0.0042 0.0111 0.8532 -
7.6286 8750 0.0045 0.0114 0.8385 -
7.8466 9000 0.005 0.0111 0.8502 -
8.0645 9250 0.0045 0.0111 0.8496 -
8.2825 9500 0.0035 0.0109 0.8490 -
8.5004 9750 0.0038 0.0112 0.8519 -
8.7184 10000 0.0038 0.0112 0.8463 -
8.9364 10250 0.0039 0.0109 0.8556 -
9.1543 10500 0.0035 0.0110 0.8534 -
9.3723 10750 0.003 0.0111 0.8525 -
9.5902 11000 0.0039 0.0108 0.8593 -
9.8082 11250 0.0038 0.0112 0.8537 -
10.0262 11500 0.0033 0.0108 0.8553 -
10.2441 11750 0.0023 0.0104 0.8601 -
10.4621 12000 0.0025 0.0104 0.8571 -
10.6800 12250 0.0026 0.0106 0.8594 -
10.8980 12500 0.0026 0.0106 0.8627 -
11.1160 12750 0.0024 0.0105 0.8623 -
11.3339 13000 0.002 0.0104 0.8614 -
11.5519 13250 0.0021 0.0103 0.8622 -
11.7698 13500 0.0025 0.0106 0.8580 -
11.9878 13750 0.0023 0.0108 0.8613 -
12.2058 14000 0.0019 0.0106 0.8618 -
12.4237 14250 0.0017 0.0104 0.8641 -
12.6417 14500 0.0019 0.0103 0.8620 -
12.8596 14750 0.002 0.0104 0.8649 -
13.0776 15000 0.002 0.0102 0.8620 -
13.2956 15250 0.0014 0.0103 0.8631 -
13.5135 15500 0.0018 0.0104 0.8635 -
13.7315 15750 0.0018 0.0102 0.8661 -
13.9494 16000 0.0018 0.0104 0.8683 -
14.1674 16250 0.0014 0.0104 0.8691 -
14.3854 16500 0.0014 0.0103 0.8668 -
14.6033 16750 0.0015 0.0102 0.8673 -
14.8213 17000 0.0016 0.0102 0.8679 -
15.0392 17250 0.0016 0.0101 0.8688 -
15.2572 17500 0.0012 0.0102 0.8676 -
15.4752 17750 0.0012 0.0102 0.8712 -
15.6931 18000 0.0014 0.0102 0.8702 -
15.9111 18250 0.0013 0.0101 0.8718 -
16.1290 18500 0.0011 0.0100 0.8727 -
16.3470 18750 0.001 0.0101 0.8729 -
16.5650 19000 0.0012 0.0099 0.8714 -
16.7829 19250 0.0011 0.0101 0.8723 -
17.0009 19500 0.0012 0.0101 0.8679 -
17.2188 19750 0.0009 0.0103 0.8706 -
17.4368 20000 0.0009 0.0101 0.8722 -
17.6548 20250 0.0009 0.0100 0.8710 -
17.8727 20500 0.001 0.0101 0.8719 -
18.0907 20750 0.0009 0.0100 0.8728 -
18.3086 21000 0.0009 0.0100 0.8738 -
18.5266 21250 0.0008 0.0100 0.8720 -
18.7446 21500 0.0009 0.0100 0.8731 -
18.9625 21750 0.0009 0.0098 0.8738 -
19.1805 22000 0.0007 0.0100 0.8750 -
19.3984 22250 0.0007 0.0099 0.8730 -
19.6164 22500 0.0007 0.0100 0.8753 -
19.8344 22750 0.0007 0.0099 0.8753 -
20.0523 23000 0.0008 0.0100 0.8755 -
20.2703 23250 0.0006 0.0100 0.8747 -
20.4882 23500 0.0006 0.0101 0.8753 -
20.7062 23750 0.0007 0.0101 0.8738 -
20.9241 24000 0.0007 0.0101 0.8750 -
21.1421 24250 0.0006 0.0101 0.8760 -
21.3601 24500 0.0006 0.0101 0.8753 -
21.5780 24750 0.0006 0.0101 0.8759 -
21.7960 25000 0.0006 0.0100 0.8759 -
22.0139 25250 0.0006 0.0100 0.8762 -
22.2319 25500 0.0005 0.0100 0.8767 -
22.4499 25750 0.0005 0.0100 0.8772 -
22.6678 26000 0.0005 0.0099 0.8771 -
22.8858 26250 0.0005 0.0100 0.8769 -
23.1037 26500 0.0005 0.0100 0.8770 -
23.3217 26750 0.0005 0.0100 0.8769 -
23.5397 27000 0.0004 0.0100 0.8769 -
23.7576 27250 0.0005 0.0100 0.8769 -
23.9756 27500 0.0005 0.0100 0.8770 -
24.0 27528 - - - 0.8697
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
