metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:123637
  - loss:CosineSimilarityLoss
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
  - source_sentence: Analisis biaya hidup di tiga kota Banten thn 2018
    sentences:
      - Indikator Konstruksi Triwulan I-2007
      - Survei Biaya Hidup (SBH) 2018 Bengkulu
      - Indikator Ekonomi Februari 2002
  - source_sentence: >-
      Grafik ekspor hasil minyak Indonesia ke berbagai negara dari tahun 2000
      hingga 2023.
    sentences:
      - >-
        Sistem Neraca Sosial Ekonomi Indonesia Tahun 2022 dalam Format SNA 1968
        (65x65)
      - Harga Produsen Gabah dan Beras Januari 2020
      - Profil Usaha Konstruksi Perorangan Provinsi Papua 2016
  - source_sentence: Tren konstruksi Indonesia tahun 2007 Q4
    sentences:
      - Laporan Bulanan Data Sosial Ekonomi Desember 2018
      - Indeks Unit Value Ekspor Menurut Kode SITC Bulan Februari 2023
      - Inflasi Februari 2008 sebesar 0,5 persen
  - source_sentence: >-
      Informasi tentang kepemilikan dan penggunaan AC di rumah tangga Indonesia
      tahun 2013?
    sentences:
      - Data dan Informasi Kemiskinan Kabupaten/Kota Tahun 2014
      - >-
        Rata-rata Upah/Gaji Bersih Sebulan Buruh/Karyawan/Pegawai Menurut
        Kelompok Umur dan Jenis Pekerjaan, 2022-2023
      - Indikator Konstruksi, Triwulan II-2022
  - source_sentence: Statistik harga Ternate 2012
    sentences:
      - Statistik Perhubungan 2005
      - Indeks Unit Value Ekspor Menurut Kode SITC Bulan Januari 2019
      - Indikator Ekonomi Agustus 2002
datasets:
  - yahyaabd/allstats-semantic-synthetic-dataset-v1
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-mpnet-base-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic base v1 eval
          type: allstats-semantic-base-v1-eval
        metrics:
          - type: pearson_cosine
            value: 0.9868927327091045
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9277441071536588
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstat semantic base v1 test
          type: allstat-semantic-base-v1-test
        metrics:
          - type: pearson_cosine
            value: 0.9867639981224826
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9256998894451143
            name: Spearman Cosine

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-synthetic-dataset-v1 dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

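Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Training Dataset: allstats-semantic-synthetic-dataset-v1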

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
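
The Pooling module above has only pooling_mode_mean_tokens enabled, so each sentence embedding is the average of its token vectors. The following is a minimal NumPy sketch of that idea, not the library's implementation; the function name `mean_pool` and the random toy input are assumptions made for illustration.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim) transformer output
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # guard against empty rows
    return summed / counts

# Toy input (hypothetical): 2 sentences, 4 token positions, 768 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 768))
mask = np.array([[1, 1, 1, 0],   # last position is padding
                 [1, 1, 0, 0]])  # last two positions are padding
sentence_embeddings = mean_pool(tokens, mask)
print(sentence_embeddings.shape)  # (2, 768)
```

Masking before averaging matters because padded positions would otherwise pull short sentences toward the padding vector.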

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-base-v1-2")
# Run inference
sentences = [
    'Statistik harga Ternate 2012',
    'Indikator Ekonomi Agustus 2002',
    'Indeks Unit Value Ekspor Menurut Kode SITC Bulan Januari 2019',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
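
For semantic search, the similarity scores are typically used to rank candidate documents against a query. The sketch below assumes cosine similarity as the scoring function and uses tiny hand-written vectors in place of real 768-dimensional embeddings; `cosine_similarity_matrix` and `rank_documents` are hypothetical helper names, not part of the library.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar, with scores."""
    scores = cosine_similarity_matrix(query_vec[None, :], doc_vecs)[0]
    order = np.argsort(-scores)
    return order, scores[order]

# Toy 3-dimensional vectors standing in for model.encode(...) output.
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction, different length
    [1.0, 1.0, 0.0],  # 45 degrees from the query
])
order, scores = rank_documents(query, docs)
print(order)  # [1 2 0]
```

Because cosine similarity ignores vector length, the second document ranks first even though its norm differs from the query's.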

Evaluation

Metrics

Semantic Similarity

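| Metric          | allstats-semantic-base-v1-eval | allstat-semantic-base-v1-test |
|:----------------|:-------------------------------|:------------------------------|
| pearson_cosine  | 0.9869                         | 0.9868                        |
| spearman_cosine | 0.9277                         | 0.9257                        |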
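
The two metrics measure different things: pearson_cosine is the linear correlation between predicted cosine scores and gold labels, while spearman_cosine correlates their ranks. A minimal NumPy sketch of both follows; the helper names and the toy scores are assumptions for illustration, and the evaluator in the library may handle ties or edge cases differently.

```python
import numpy as np

def pearson(x, y):
    """Linear correlation coefficient between two score lists."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(np.corrcoef(x, y)[0, 1])

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Pearson correlation computed on ranks instead of raw values."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical gold labels and predicted cosine scores.
gold = [0.9, 0.1, 0.5, 0.7]
pred = [0.8, 0.2, 0.3, 0.95]
print(round(spearman(gold, pred), 4))  # 0.8

# A monotonic but non-linear relation: perfect Spearman, imperfect Pearson.
print(round(spearman([1, 2, 3], [1, 4, 9]), 4))  # 1.0
print(pearson([1, 2, 3], [1, 4, 9]) < 1.0)       # True
```

This is one way to read the table: a Pearson value above the Spearman value suggests the predicted scores track the label values closely overall while the fine-grained ordering of pairs is less exact.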

Training Details

Training Dataset

allstats-semantic-synthetic-dataset-v1

  • Dataset: allstats-semantic-synthetic-dataset-v1 at e73718f
  • Size: 123,637 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
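    |         | query                                            | doc                                              | label                           |
    |:--------|:-------------------------------------------------|:-------------------------------------------------|:--------------------------------|
    | type    | string                                           | string                                           | float                           |
    | details | min: 5 tokens; mean: 10.59 tokens; max: 34 tokens | min: 5 tokens; mean: 14.29 tokens; max: 56 tokens | min: 0.0; mean: 0.5; max: 1.0   |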
  • Samples:
    | query | doc | label |
    |:------|:----|:------|
    | Analisis upah tenaga kerja ekonomi kreatif | Upah Tenaga Kerja Ekonomi Kreatif 2011-2016 | 0.88 |
    | cari data persentase rumah tangga yang menggunakan listrik pln menurut provinsi dari 1993 sampai 2022. | Persentase Rumah Tangga menurut Provinsi dan Sumber Penerangan Listrik PLN, 1993-2022 | 0.93 |
    | apakah ada tabel yang menunjukkan ekspor minyak mentah ke negara tujuan utama tahun 2000-2023? | IHK dan Rata-rata Upah per Bulan Buruh Peternakan dan Perikanan di Bawah Mandor (Supervisor), 2012-2014 (2012=100) | 0.13 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
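
As a rough illustration of what this loss optimizes: the cosine similarity of each (query, doc) embedding pair is compared with the float label using mean squared error. The NumPy sketch below is an assumption-laden simplification (no gradients, no batching logic), and `cosine_similarity_loss` is a hypothetical name rather than the library class.

```python
import numpy as np

def cosine_similarity_loss(query_emb, doc_emb, labels):
    """Mean squared error between pairwise cosine similarity and gold labels."""
    q = np.asarray(query_emb, dtype=np.float64)
    d = np.asarray(doc_emb, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    # Row-wise cosine similarity: one score per (query, doc) pair.
    cos = (q * d).sum(axis=1) / (np.linalg.norm(q, axis=1) * np.linalg.norm(d, axis=1))
    return float(np.mean((cos - y) ** 2))

# Toy 2-dimensional embeddings (hypothetical values).
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [1.0, 0.0]])
labels = [0.9, 0.1]  # first pair relevant, second pair not

loss = cosine_similarity_loss(q, d, labels)
print(round(loss, 4))  # 0.01
```

Here the cosine scores are 1.0 and 0.0, each 0.1 away from its label, so the loss is the mean of two squared errors of 0.01.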
    

Evaluation Dataset

allstats-semantic-synthetic-dataset-v1

  • Dataset: allstats-semantic-synthetic-dataset-v1 at e73718f
  • Size: 26,494 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
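    |         | query                                            | doc                                              | label                           |
    |:--------|:-------------------------------------------------|:-------------------------------------------------|:--------------------------------|
    | type    | string                                           | string                                           | float                           |
    | details | min: 5 tokens; mean: 10.66 tokens; max: 31 tokens | min: 4 tokens; mean: 13.94 tokens; max: 70 tokens | min: 0.0; mean: 0.49; max: 1.0  |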
  • Samples:
    | query | doc | label |
    |:------|:----|:------|
    | SBH Aceh 2018: Meulaboh, Banda Aceh, Lhokseumawe | Survei Biaya Hidup (SBH) 2018 Meulaboh, Banda Aceh, dan Lhokseumawe | 0.9 |
    | ekspor produk indonesia juli 2018 per negara | Direktori Perusahaan Pertambangan Besar 2013 | 0.07 |
    | peternakan sapi di jawa tengah 2011 | Laporan Bulanan Data Sosial Ekonomi Juli 2024 | 0.07 |
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 24
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_num_workers: 4
  • load_best_model_at_end: True
  • label_smoothing_factor: 0.1
  • eval_on_start: True
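
These settings determine the length of the run and the learning-rate curve. The sketch below works through the arithmetic, assuming a single device, no gradient accumulation, the listed learning_rate of 5e-05 with a linear scheduler, and that warmup steps are rounded up from warmup_ratio; the function `linear_lr` is a hypothetical stand-in for the scheduler, not the Trainer's code.

```python
import math

TRAIN_SAMPLES = 123_637   # training set size from this card
BATCH_SIZE = 64           # per_device_train_batch_size
EPOCHS = 24               # num_train_epochs
WARMUP_RATIO = 0.1        # warmup_ratio
PEAK_LR = 5e-05           # learning_rate

# The last partial batch is kept (dataloader_drop_last is False), so round up.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def linear_lr(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)

print(steps_per_epoch, total_steps, warmup_steps)  # 1932 46368 4637
print(linear_lr(0), linear_lr(total_steps))        # 0.0 0.0
```

The computed total of 46368 steps agrees with the final step reported in the training logs, and 500 / 1932 ≈ 0.2588 matches the epoch value logged at step 500.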

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 24
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.1
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-base-v1-eval_spearman_cosine allstat-semantic-base-v1-test_spearman_cosine
0 0 - 0.0942 0.6574 -
0.2588 500 0.0449 0.0262 0.7353 -
0.5176 1000 0.0232 0.0185 0.7592 -
0.7764 1500 0.0172 0.0154 0.7760 -
1.0352 2000 0.0153 0.0137 0.7905 -
1.2940 2500 0.0124 0.0130 0.7920 -
1.5528 3000 0.0119 0.0120 0.8048 -
1.8116 3500 0.0121 0.0121 0.8021 -
2.0704 4000 0.0114 0.0112 0.8018 -
2.3292 4500 0.0093 0.0117 0.7996 -
2.5880 5000 0.0097 0.0105 0.8133 -
2.8468 5500 0.0092 0.0103 0.8137 -
3.1056 6000 0.0085 0.0094 0.8247 -
3.3644 6500 0.0068 0.0090 0.8326 -
3.6232 7000 0.0073 0.0092 0.8273 -
3.8820 7500 0.007 0.0084 0.8404 -
4.1408 8000 0.0061 0.0083 0.8381 -
4.3996 8500 0.0057 0.0082 0.8382 -
4.6584 9000 0.0056 0.0074 0.8458 -
4.9172 9500 0.0057 0.0073 0.8468 -
5.1760 10000 0.0045 0.0071 0.8508 -
5.4348 10500 0.0041 0.0069 0.8579 -
5.6936 11000 0.0047 0.0069 0.8471 -
5.9524 11500 0.0046 0.0067 0.8554 -
6.2112 12000 0.0034 0.0062 0.8616 -
6.4700 12500 0.0034 0.0063 0.8636 -
6.7288 13000 0.0036 0.0062 0.8649 -
6.9876 13500 0.0037 0.0063 0.8641 -
7.2464 14000 0.0027 0.0059 0.8691 -
7.5052 14500 0.0027 0.0060 0.8733 -
7.7640 15000 0.0031 0.0060 0.8748 -
8.0228 15500 0.0028 0.0058 0.8736 -
8.2816 16000 0.0023 0.0055 0.8785 -
8.5404 16500 0.0025 0.0054 0.8801 -
8.7992 17000 0.0024 0.0058 0.8809 -
9.0580 17500 0.0026 0.0058 0.8811 -
9.3168 18000 0.002 0.0055 0.8824 -
9.5756 18500 0.002 0.0053 0.8859 -
9.8344 19000 0.0021 0.0053 0.8851 -
10.0932 19500 0.0019 0.0055 0.8904 -
10.3520 20000 0.0016 0.0052 0.8946 -
10.6108 20500 0.0017 0.0057 0.8884 -
10.8696 21000 0.0019 0.0055 0.8889 -
11.1284 21500 0.0016 0.0052 0.8942 -
11.3872 22000 0.0014 0.0053 0.8961 -
11.6460 22500 0.0016 0.0053 0.8928 -
11.9048 23000 0.0017 0.0051 0.8947 -
12.1636 23500 0.0013 0.0050 0.9015 -
12.4224 24000 0.0012 0.0059 0.8886 -
12.6812 24500 0.0014 0.0051 0.9030 -
12.9400 25000 0.0014 0.0051 0.9012 -
13.1988 25500 0.0011 0.0050 0.9037 -
13.4576 26000 0.0011 0.0050 0.9053 -
13.7164 26500 0.0011 0.0049 0.9060 -
13.9752 27000 0.0011 0.0049 0.9086 -
14.2340 27500 0.001 0.0048 0.9063 -
14.4928 28000 0.001 0.0051 0.9056 -
14.7516 28500 0.001 0.0051 0.9079 -
15.0104 29000 0.0011 0.0049 0.9080 -
15.2692 29500 0.0008 0.0048 0.9126 -
15.5280 30000 0.0008 0.0049 0.9112 -
15.7867 30500 0.0008 0.0049 0.9123 -
16.0455 31000 0.0008 0.0048 0.9133 -
16.3043 31500 0.0006 0.0048 0.9103 -
16.5631 32000 0.0007 0.0049 0.9144 -
16.8219 32500 0.0008 0.0048 0.9143 -
17.0807 33000 0.0007 0.0048 0.9159 -
17.3395 33500 0.0007 0.0047 0.9174 -
17.5983 34000 0.0006 0.0048 0.9175 -
17.8571 34500 0.0007 0.0047 0.9163 -
18.1159 35000 0.0006 0.0046 0.9195 -
18.3747 35500 0.0006 0.0047 0.9190 -
18.6335 36000 0.0006 0.0047 0.9192 -
18.8923 36500 0.0006 0.0047 0.9204 -
19.1511 37000 0.0005 0.0047 0.9219 -
19.4099 37500 0.0004 0.0046 0.9218 -
19.6687 38000 0.0005 0.0047 0.9221 -
19.9275 38500 0.0005 0.0046 0.9230 -
20.1863 39000 0.0005 0.0046 0.9233 -
20.4451 39500 0.0004 0.0046 0.9240 -
20.7039 40000 0.0005 0.0047 0.9234 -
20.9627 40500 0.0004 0.0047 0.9241 -
21.2215 41000 0.0004 0.0046 0.9253 -
21.4803 41500 0.0004 0.0046 0.9259 -
21.7391 42000 0.0004 0.0046 0.9262 -
21.9979 42500 0.0004 0.0046 0.9263 -
22.2567 43000 0.0003 0.0046 0.9266 -
22.5155 43500 0.0003 0.0046 0.9266 -
22.7743 44000 0.0003 0.0046 0.9273 -
23.0331 44500 0.0003 0.0046 0.9273 -
23.2919 45000 0.0003 0.0046 0.9274 -
23.5507 45500 0.0003 0.0046 0.9277 -
23.8095 46000 0.0003 0.0046 0.9277 -
24.0 46368 - - - 0.9257
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}