metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:212940
  - loss:CosineSimilarityLoss
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
  - source_sentence: Ringkasan data strategis BPS 2012
    sentences:
      - >-
        Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut
        Provinsi dan Jenis Pekerjaan Utama, 2021
      - Laporan Perekonomian Indonesia 2007
      - Statistik Potensi Desa Provinsi Banten 2008
  - source_sentence: tahun berapa ekspor naik 2,37% dan impor naik 30,30%?
    sentences:
      - Bulan November 2006 Ekspor Naik 2,37 % dan Impor Naik 30,30 %
      - Indeks Harga Konsumen per Kelompok di 82 Kota <sup>1</sup> (2012=100)
      - >-
        Februari 2022: Tingkat Pengangguran Terbuka (TPT) sebesar 5,83 persen
        dan Rata-rata upah buruh sebesar 2,89 juta rupiah per bulan
  - source_sentence: akses air bersih di indonesia (2005-2009)
    sentences:
      - Desember 2016, Rupiah Terapresiasi 0,74 Persen Terhadap Dolar Amerika
      - Statistik Air Bersih 2005-2009
      - >-
        Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut
        Pendidikan Tertinggi yang Ditamatkan dan Lapangan Pekerjaan Utama di 17
        Sektor (rupiah), 2018
  - source_sentence: >-
      Tinjauan Regional Berdasarkan PDRB Kabupaten/Kota 2014-2018, Buku 2 Pulau
      Jawa dan Bali
    sentences:
      - Profil Migran Hasil Susenas 2011-2012
      - Statistik Gas Kota 2004-2008
      - >-
        Jumlah kunjungan wisman ke Indonesia melalui pintu masuk utama pada Juni
        2022 mencapai 345,44 ribu kunjungan dan Jumlah penumpang angkutan udara
        internasional pada Juni 2022 naik 23,28 persen
  - source_sentence: perubahan nilai tukar petani bulan mei 2017
    sentences:
      - Perkembangan Nilai Tukar Petani Mei 2017
      - NTP Naik 0,15%, Harga Gabah Kualitas GKG Naik 0,98%
      - Statistik Restoran/Rumah Makan Tahun 2014
datasets:
  - yahyaabd/allstats-semantic-search-synthetic-dataset-v1
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-mpnet-base-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic search v1 3 dev
          type: allstats-semantic-search-v1-3-dev
        metrics:
          - type: pearson_cosine
            value: 0.9958745183830993
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.96406478662103
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstat semantic search v1 3 test
          type: allstat-semantic-search-v1-3-test
        metrics:
          - type: pearson_cosine
            value: 0.9960950217535739
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9647914507837114
            name: Spearman Cosine

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-search-synthetic-dataset-v1 dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
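
The Pooling module above enables only `pooling_mode_mean_tokens`, so each sentence embedding is the average of the token vectors, with padding positions ignored. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the helper name `mean_pool` and the toy 2-dimensional vectors are illustrative assumptions.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention mask is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        # No real tokens: return a zero vector instead of dividing by zero.
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three real tokens plus one padding token, 2 dimensions instead of 768.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [3.0, 4.0]
```

The padding vector `[9.0, 9.0]` does not affect the result because its mask entry is 0.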

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-search-model-v1-3")
# Run inference
sentences = [
    'perubahan nilai tukar petani bulan mei 2017',
    'Perkembangan Nilai Tukar Petani Mei 2017',
    'Statistik Restoran/Rumah Makan Tahun 2014',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
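
For semantic search, the similarity scores are typically used to rank candidate documents against a query. The sketch below shows that ranking step in pure Python, assuming the embeddings have already been computed; `cosine`, `rank_documents`, and the toy 3-dimensional vectors are hypothetical stand-ins for the 768-dimensional output of `model.encode`.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted by descending cosine similarity."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for model.encode(...) output.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.9],  # nearly parallel to the query
    [1.0, 1.0, 1.0],  # partially aligned
]
ranking = rank_documents(query_vec, doc_vecs, top_k=2)
print([idx for idx, _ in ranking])  # [1, 2]
```

With real embeddings, `query_vec` would be the encoded query and `doc_vecs` the encoded document titles.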

Evaluation

Metrics

Semantic Similarity

| Metric | allstats-semantic-search-v1-3-dev | allstat-semantic-search-v1-3-test |
|:----------------|-------:|-------:|
| pearson_cosine | 0.9959 | 0.9961 |
| spearman_cosine | 0.9641 | 0.9648 |
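
Both metrics compare the model's cosine similarity scores with the gold labels: Pearson measures linear agreement between the values, while Spearman measures agreement between their rank orderings. The sketch below illustrates the two statistics in pure Python under that reading; it is not the evaluator's code, and the `gold` and `pred` values are made-up numbers for illustration only.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold labels and predicted cosine scores (not taken from the dataset).
gold = [0.0, 0.1, 0.5, 0.9, 1.0]
pred = [0.05, 0.02, 0.60, 0.80, 0.99]
print(round(spearman(gold, pred), 4))  # 0.9
```

In this toy case a single swapped pair at the low end lowers Spearman to 0.9 even though the values stay close to the labels, which is one plausible reason a Spearman score can sit below the Pearson score.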

Training Details

Training Dataset

allstats-semantic-search-synthetic-dataset-v1

  • Dataset: allstats-semantic-search-synthetic-dataset-v1 at b13c0a7
  • Size: 212,940 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    |         | query | doc | label |
    |:--------|:------|:----|:------|
    | type    | string | string | float |
    | details | min: 5 tokens; mean: 11.46 tokens; max: 34 tokens | min: 5 tokens; mean: 14.47 tokens; max: 54 tokens | min: 0.0; mean: 0.5; max: 1.05 |
  • Samples:
    | query | doc | label |
    |:------|:----|:------|
    | aDta industri besar dan sedang Indonesia 2008 | Statistik Industri Besar dan Sedang Indonesia 2008 | 0.9 |
    | profil bisnis konstruksi individu jawa barat 2022 | Statistik Industri Manufaktur Indonesia 2015 - Bahan Baku | 0.15 |
    | data statistik ekonomi indonesia | Nilai Tukar Valuta Asing di Indonesia 2014 | 0.08 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
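
With `MSELoss` as `loss_fct`, the training objective is the mean squared error between the cosine similarity of the two sentence embeddings and the target label. The following pure-Python sketch illustrates that computation on a toy batch; the function names and the 2-dimensional vectors are assumptions for illustration, not the library's code.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_mse_loss(
    pairs: List[Tuple[Sequence[float], Sequence[float]]],
    labels: Sequence[float],
) -> float:
    """Mean over the batch of (cosine(u, v) - label) ** 2."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Toy batch: (query embedding, doc embedding) pairs with hypothetical labels.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction, cosine = 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine = 0.0
]
labels = [0.9, 0.2]
loss = cosine_mse_loss(pairs, labels)
print(round(loss, 4))  # 0.025
```

The two squared errors are 0.01 and 0.04, so their mean is 0.025.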
    

Evaluation Dataset

allstats-semantic-search-synthetic-dataset-v1

  • Dataset: allstats-semantic-search-synthetic-dataset-v1 at b13c0a7
  • Size: 26,618 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    |         | query | doc | label |
    |:--------|:------|:----|:------|
    | type    | string | string | float |
    | details | min: 5 tokens; mean: 11.38 tokens; max: 34 tokens | min: 4 tokens; mean: 14.63 tokens; max: 55 tokens | min: 0.0; mean: 0.51; max: 1.0 |
  • Samples:
    | query | doc | label |
    |:------|:----|:------|
    | tahun berapa ekspor naik 2,37% dan impor naik 30,30%? | Bulan November 2006 Ekspor Naik 2,37 % dan Impor Naik 30,30 % | 1.0 |
    | Berapa produksi padi pada tahun 2023? | Produksi padi tahun lainnya | 0.0 |
    | data statistik solus per aqua 2015 | Statistik Solus Per Aqua (SPA) 2015 | 0.97 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 16
  • warmup_ratio: 0.1
  • fp16: True
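
These settings, together with the training set size, determine the length of the run and the warmup period. The sketch below works through that arithmetic and a linear warmup/decay schedule, assuming a single device and no gradient accumulation; the `learning_rate` of 5e-05 and the `linear` scheduler type come from the full hyperparameter list, and `linear_lr` is an illustrative helper rather than the trainer's own scheduler code.

```python
import math

TRAIN_SAMPLES = 212_940  # size of the training dataset
BATCH_SIZE = 64          # per_device_train_batch_size
EPOCHS = 16              # num_train_epochs
WARMUP_RATIO = 0.1       # warmup_ratio
BASE_LR = 5e-05          # learning_rate

# Assumption: one device, gradient_accumulation_steps = 1, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Linear warmup from 0 to BASE_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    return BASE_LR * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(steps_per_epoch, total_steps, warmup_steps)  # 3328 53248 5325
```

The computed total of 53,248 steps agrees with the final step recorded in the training logs, which supports the single-device assumption.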

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 16
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | Validation Loss | allstats-semantic-search-v1-3-dev_spearman_cosine | allstat-semantic-search-v1-3-test_spearman_cosine |
|:------|:-----|:--------------|:----------------|:--------------------------------------------------|:--------------------------------------------------|
| 0.1502 | 500 | 0.0579 | 0.0351 | 0.7132 | - |
| 0.3005 | 1000 | 0.03 | 0.0225 | 0.7589 | - |
| 0.4507 | 1500 | 0.0219 | 0.0185 | 0.7834 | - |
| 0.6010 | 2000 | 0.0181 | 0.0163 | 0.7946 | - |
| 0.7512 | 2500 | 0.0162 | 0.0147 | 0.7941 | - |
| 0.9014 | 3000 | 0.015 | 0.0147 | 0.8050 | - |
| 1.0517 | 3500 | 0.014 | 0.0131 | 0.7946 | - |
| 1.2019 | 4000 | 0.0119 | 0.0126 | 0.8038 | - |
| 1.3522 | 4500 | 0.0121 | 0.0128 | 0.8213 | - |
| 1.5024 | 5000 | 0.0117 | 0.0116 | 0.8268 | - |
| 1.6526 | 5500 | 0.0124 | 0.0117 | 0.8269 | - |
| 1.8029 | 6000 | 0.0111 | 0.0109 | 0.8421 | - |
| 1.9531 | 6500 | 0.0105 | 0.0108 | 0.8278 | - |
| 2.1034 | 7000 | 0.0091 | 0.0093 | 0.8460 | - |
| 2.2536 | 7500 | 0.0085 | 0.0091 | 0.8469 | - |
| 2.4038 | 8000 | 0.0079 | 0.0083 | 0.8595 | - |
| 2.5541 | 8500 | 0.0075 | 0.0085 | 0.8495 | - |
| 2.7043 | 9000 | 0.0073 | 0.0082 | 0.8614 | - |
| 2.8546 | 9500 | 0.0068 | 0.0077 | 0.8696 | - |
| 3.0048 | 10000 | 0.0066 | 0.0076 | 0.8669 | - |
| 3.1550 | 10500 | 0.0058 | 0.0072 | 0.8678 | - |
| 3.3053 | 11000 | 0.0056 | 0.0067 | 0.8703 | - |
| 3.4555 | 11500 | 0.0054 | 0.0067 | 0.8766 | - |
| 3.6058 | 12000 | 0.0054 | 0.0063 | 0.8678 | - |
| 3.7560 | 12500 | 0.0051 | 0.0061 | 0.8786 | - |
| 3.9062 | 13000 | 0.0052 | 0.0077 | 0.8699 | - |
| 4.0565 | 13500 | 0.005 | 0.0055 | 0.8859 | - |
| 4.2067 | 14000 | 0.0041 | 0.0054 | 0.8900 | - |
| 4.3570 | 14500 | 0.0038 | 0.0052 | 0.8892 | - |
| 4.5072 | 15000 | 0.0039 | 0.0050 | 0.8895 | - |
| 4.6575 | 15500 | 0.004 | 0.0052 | 0.8972 | - |
| 4.8077 | 16000 | 0.0042 | 0.0051 | 0.8927 | - |
| 4.9579 | 16500 | 0.0041 | 0.0052 | 0.8930 | - |
| 5.1082 | 17000 | 0.0034 | 0.0053 | 0.8998 | - |
| 5.2584 | 17500 | 0.003 | 0.0047 | 0.9023 | - |
| 5.4087 | 18000 | 0.0032 | 0.0045 | 0.9039 | - |
| 5.5589 | 18500 | 0.0032 | 0.0044 | 0.8996 | - |
| 5.7091 | 19000 | 0.0032 | 0.0041 | 0.9085 | - |
| 5.8594 | 19500 | 0.0032 | 0.0047 | 0.9072 | - |
| 6.0096 | 20000 | 0.0029 | 0.0037 | 0.9104 | - |
| 6.1599 | 20500 | 0.0024 | 0.0037 | 0.9112 | - |
| 6.3101 | 21000 | 0.0026 | 0.0039 | 0.9112 | - |
| 6.4603 | 21500 | 0.0024 | 0.0037 | 0.9157 | - |
| 6.6106 | 22000 | 0.0022 | 0.0038 | 0.9122 | - |
| 6.7608 | 22500 | 0.0025 | 0.0034 | 0.9170 | - |
| 6.9111 | 23000 | 0.0023 | 0.0034 | 0.9179 | - |
| 7.0613 | 23500 | 0.002 | 0.0031 | 0.9244 | - |
| 7.2115 | 24000 | 0.0019 | 0.0030 | 0.9250 | - |
| 7.3618 | 24500 | 0.0018 | 0.0032 | 0.9249 | - |
| 7.5120 | 25000 | 0.0022 | 0.0031 | 0.9162 | - |
| 7.6623 | 25500 | 0.0019 | 0.0030 | 0.9266 | - |
| 7.8125 | 26000 | 0.0019 | 0.0028 | 0.9297 | - |
| 7.9627 | 26500 | 0.0018 | 0.0028 | 0.9282 | - |
| 8.1130 | 27000 | 0.0015 | 0.0025 | 0.9324 | - |
| 8.2632 | 27500 | 0.0014 | 0.0027 | 0.9337 | - |
| 8.4135 | 28000 | 0.0015 | 0.0027 | 0.9327 | - |
| 8.5637 | 28500 | 0.0016 | 0.0027 | 0.9313 | - |
| 8.7139 | 29000 | 0.0016 | 0.0027 | 0.9333 | - |
| 8.8642 | 29500 | 0.0015 | 0.0025 | 0.9382 | - |
| 9.0144 | 30000 | 0.0014 | 0.0025 | 0.9375 | - |
| 9.1647 | 30500 | 0.0011 | 0.0024 | 0.9398 | - |
| 9.3149 | 31000 | 0.0012 | 0.0025 | 0.9384 | - |
| 9.4651 | 31500 | 0.0014 | 0.0025 | 0.9383 | - |
| 9.6154 | 32000 | 0.0013 | 0.0023 | 0.9410 | - |
| 9.7656 | 32500 | 0.0011 | 0.0023 | 0.9409 | - |
| 9.9159 | 33000 | 0.0012 | 0.0021 | 0.9432 | - |
| 10.0661 | 33500 | 0.0011 | 0.0021 | 0.9432 | - |
| 10.2163 | 34000 | 0.001 | 0.0021 | 0.9442 | - |
| 10.3666 | 34500 | 0.0009 | 0.0022 | 0.9436 | - |
| 10.5168 | 35000 | 0.001 | 0.0021 | 0.9468 | - |
| 10.6671 | 35500 | 0.001 | 0.0020 | 0.9471 | - |
| 10.8173 | 36000 | 0.001 | 0.0021 | 0.9467 | - |
| 10.9675 | 36500 | 0.0011 | 0.0021 | 0.9478 | - |
| 11.1178 | 37000 | 0.0008 | 0.0020 | 0.9493 | - |
| 11.2680 | 37500 | 0.0008 | 0.0019 | 0.9509 | - |
| 11.4183 | 38000 | 0.0008 | 0.0019 | 0.9504 | - |
| 11.5685 | 38500 | 0.0008 | 0.0019 | 0.9512 | - |
| 11.7188 | 39000 | 0.0008 | 0.0019 | 0.9516 | - |
| 11.8690 | 39500 | 0.0007 | 0.0019 | 0.9534 | - |
| 12.0192 | 40000 | 0.0007 | 0.0018 | 0.9539 | - |
| 12.1695 | 40500 | 0.0006 | 0.0018 | 0.9555 | - |
| 12.3197 | 41000 | 0.0006 | 0.0019 | 0.9551 | - |
| 12.4700 | 41500 | 0.0007 | 0.0019 | 0.9550 | - |
| 12.6202 | 42000 | 0.0008 | 0.0018 | 0.9552 | - |
| 12.7704 | 42500 | 0.0006 | 0.0017 | 0.9559 | - |
| 12.9207 | 43000 | 0.0006 | 0.0017 | 0.9568 | - |
| 13.0709 | 43500 | 0.0006 | 0.0017 | 0.9577 | - |
| 13.2212 | 44000 | 0.0005 | 0.0017 | 0.9581 | - |
| 13.3714 | 44500 | 0.0006 | 0.0017 | 0.9586 | - |
| 13.5216 | 45000 | 0.0005 | 0.0017 | 0.9587 | - |
| 13.6719 | 45500 | 0.0005 | 0.0017 | 0.9591 | - |
| 13.8221 | 46000 | 0.0006 | 0.0016 | 0.9600 | - |
| 13.9724 | 46500 | 0.0005 | 0.0016 | 0.9603 | - |
| 14.1226 | 47000 | 0.0005 | 0.0016 | 0.9609 | - |
| 14.2728 | 47500 | 0.0005 | 0.0016 | 0.9612 | - |
| 14.4231 | 48000 | 0.0005 | 0.0016 | 0.9611 | - |
| 14.5733 | 48500 | 0.0005 | 0.0016 | 0.9616 | - |
| 14.7236 | 49000 | 0.0004 | 0.0015 | 0.9625 | - |
| 14.8738 | 49500 | 0.0004 | 0.0016 | 0.9628 | - |
| 15.0240 | 50000 | 0.0004 | 0.0016 | 0.9631 | - |
| 15.1743 | 50500 | 0.0004 | 0.0016 | 0.9632 | - |
| 15.3245 | 51000 | 0.0004 | 0.0016 | 0.9633 | - |
| 15.4748 | 51500 | 0.0004 | 0.0016 | 0.9635 | - |
| 15.625 | 52000 | 0.0004 | 0.0015 | 0.9638 | - |
| 15.7752 | 52500 | 0.0004 | 0.0015 | 0.9640 | - |
| 15.9255 | 53000 | 0.0004 | 0.0015 | 0.9641 | - |
| 16.0 | 53248 | - | - | - | 0.9648 |

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.2.2+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}