tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:25580
  - loss:OnlineContrastiveLoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
widget:
  - source_sentence: ikhtisar arus kas triwulan 1, 2004 (miliar)
    sentences:
      - Balita (0-59 Bulan) Menurut Status Gizi, Tahun 1998-2005
      - >-
        Perbandingan Indeks dan Tingkat Inflasi Desember 2023 Kota-kota di Luar
        Pulau Jawa dan Sumatera dengan Nasional (2018=100)
      - >-
        Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi
        Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi
        Sulawesi Tengah, 2018-2023
  - source_sentence: >-
      BaIgaimana gambaran neraca arus dana dUi Indonesia pada kuartal kedua
      tahun 2015?
    sentences:
      - >-
        Jumlah Sekolah, Guru, dan Murid Sekolah Menengah Pertama (SMP) di Bawah
        Kementrian Pendidikan dan Kebudayaan Menurut Provinsi
      - Ringkasan Neraca Arus Dana Triwulan III Tahun 2003 (Miliar Rupiah)
      - >-
        Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi
        Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi
        Sulawesi Tenggara, 2018-2023
  - source_sentence: >-
      Berapa persen pengeluaran orang di kotaa untuk makanan vs non-makanan, per
      provinsi, 2018?
    sentences:
      - >-
        Ekspor Tanaman Obat, Aromatik, dan Rempah-Rempah menurut Negara Tujuan
        Utama, 2012-2023
      - >-
        Rata-rata Pendapatan Bersih Pekerja Bebas Menurut Provinsi dan
        Pendidikan Tertinggi yang Ditamatkan (ribu rupiah), 2017
      - >-
        IHK dan Rata-rata Upah per Bulan Buruh Industri di Bawah Mandor
        (Supervisor), 1996-2014 (1996=100)
  - source_sentence: Negara-negara asal impor crude oil dan produk turunannya tahun 2002-2023
    sentences:
      - >-
        Persentase Pengeluaran Rata-rata per Kapita Sebulan Menurut Kelompok
        Barang, Indonesia, 1999, 2002-2023
      - >-
        Rata-rata Pendapatan Bersih Berusaha Sendiri menurut Provinsi dan
        Pendidikan yang Ditamatkan (ribu rupiah), 2016
      - >-
        Perkembangan Beberapa Agregat Pendapatan dan Pendapatan per Kapita Atas
        Dasar Harga Berlaku, 2010-2016
  - source_sentence: Arus dana Q3 2006
    sentences:
      - >-
        Posisi Simpanan Berjangka Rupiah pada Bank Umum dan BPR Menurut Golongan
        Pemilik (miliar rupiah), 2005-2018
      - Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)
      - >-
        Rata-Rata Pengeluaran per Kapita Sebulan di Daerah Perkotaan Menurut
        Kelompok Barang dan Golongan Pengeluaran per Kapita Sebulan, 2000-2012
datasets:
  - yahyaabd/query-hard-pos-neg-doc-pairs-statictable
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic mini v1 eval
          type: allstats-semantic-mini-v1-eval
        metrics:
          - type: pearson_cosine
            value: 0.8664940363669927
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8063420000992144
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstat search mini v1 test
          type: allstat-search-mini-v1-test
        metrics:
          - type: pearson_cosine
            value: 0.877199276521204
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.809551340542674
            name: Spearman Cosine

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the query-hard-pos-neg-doc-pairs-statictable dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Training Dataset: query-hard-pos-neg-doc-pairs-statictable

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
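
The Pooling module above enables only pooling_mode_mean_tokens, so each sentence embedding is the average of its token embeddings. The sketch below illustrates that idea in plain Python with toy 3-dimensional vectors instead of 384. The function name mean_pool is hypothetical, and the use of an attention mask to skip padding tokens is an assumption about how mean pooling is usually done, not something this card states.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the embeddings of non-padding tokens (mask value 1)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [s / count for s in summed]


# Two real tokens followed by one padding token (hypothetical values).
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [100.0, 100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)
```

The padding vector is ignored, so only the first two token vectors contribute to the result.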


Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-multilingual-miniLM-v1")
# Run inference
sentences = [
    'Arus dana Q3 2006',
    'Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)',
    'Rata-Rata Pengeluaran per Kapita Sebulan di Daerah Perkotaan Menurut Kelompok Barang dan Golongan Pengeluaran per Kapita Sebulan, 2000-2012',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
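
For search, the similarity scores are typically used to rank candidate table titles against a query. The following is a minimal sketch of that ranking step in plain Python, assuming the scores are cosine similarities (consistent with the cosine-based metrics reported in this card). The helper names cosine_similarity and rank_documents and the toy vectors are hypothetical; in practice the vectors would come from model.encode.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: List[float], doc_vecs: List[List[float]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_documents(query, docs)
for index, score in ranking:
    print(index, round(score, 4))
```

Cosine similarity ignores vector length, so the second document ranks first even though its magnitude differs from the query's.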



Semantic Similarity

Metric allstats-semantic-mini-v1-eval allstat-search-mini-v1-test
pearson_cosine 0.8665 0.8772
spearman_cosine 0.8063 0.8096
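
The two metrics above correlate the model's cosine scores with the gold labels: Pearson measures linear agreement, Spearman measures agreement of rankings. As a rough illustration of how such numbers can be computed, here is a plain-Python sketch; the function names and the four example pairs are hypothetical, and the library's own evaluator may differ in details.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cosine scores and binary relevance labels for four pairs.
cosine_scores = [0.91, 0.15, 0.78, 0.30]
labels = [1, 0, 1, 0]
print(round(pearson(cosine_scores, labels), 4))
print(round(spearman(cosine_scores, labels), 4))
```

Tie handling matters here because the labels are binary, so many pairs share the same gold rank.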

Training Details

Training Dataset


  • Dataset: query-hard-pos-neg-doc-pairs-statictable at 7b28b96
  • Size: 25,580 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 7 tokens, mean 20.14 tokens, max 55 tokens
    • doc (string): min 5 tokens, mean 24.9 tokens, max 47 tokens
    • label (int): 0: ~70.80%, 1: ~29.20%
  • Samples:
    query | doc | label
    Status pekerjaan utama penduduk usia 15+ yang bekerja, 2020 | Jumlah Penghuni Lapas per Kanwil | 0
    status pekerjaan utama penduduk usia 15+ yang bekerja, 2020 | Jumlah Penghuni Lapas per Kanwil | 0
    STATUS PEKERJAAN UTAMA PENDUDUK USIA 15+ YANG BEKERJA, 2020 | Jumlah Penghuni Lapas per Kanwil | 0
  • Loss: OnlineContrastiveLoss

Evaluation Dataset


  • Dataset: query-hard-pos-neg-doc-pairs-statictable at 7b28b96
  • Size: 5,479 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 7 tokens, mean 20.78 tokens, max 52 tokens
    • doc (string): min 4 tokens, mean 26.28 tokens, max 43 tokens
    • label (int): 0: ~71.50%, 1: ~28.50%
  • Samples:
    query | doc | label
    Bagaimana perbandingan PNS pria dan wanita di berbagai golongan tahun 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0
    bagaimana perbandingan pns pria dan wanita di berbagai golongan tahun 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0
    BAGAIMANA PERBANDINGAN PNS PRIA DAN WANITA DI BERBAGAI GOLONGAN TAHUN 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0
  • Loss: OnlineContrastiveLoss

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • warmup_ratio: 0.05
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
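
Several of these values combine to determine the learning-rate curve: learning_rate 5e-05, lr_scheduler_type linear, warmup_ratio 0.05, num_train_epochs 3, and a batch size of 64 over 25,580 samples. The sketch below works through that arithmetic in plain Python. It assumes a single training device and a standard linear warmup followed by linear decay to zero; the function name linear_schedule_lr is hypothetical.

```python
import math


def linear_schedule_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float = 5e-05) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Values taken from this card; a single device is an assumption.
train_samples = 25580
batch_size = 64
epochs = 3
warmup_ratio = 0.05

# dataloader_drop_last is False, so the final partial batch still counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, total_steps, warmup_steps))
```

The resulting step counts line up with the training log, where epoch 1.0 falls at step 400 and epoch 3.0 at step 1200.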

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-mini-v1-eval_spearman_cosine allstat-search-mini-v1-test_spearman_cosine
0 0 - 2.3102 0.6855 -
0.05 20 1.7642 1.2458 0.7253 -
0.1 40 0.9637 0.6870 0.7751 -
0.15 60 0.4319 0.4890 0.7897 -
0.2 80 0.3251 0.4944 0.7899 -
0.25 100 0.2665 0.3988 0.7954 -
0.3 120 0.1938 0.3795 0.7972 -
0.35 140 0.1495 0.2839 0.8014 -
0.4 160 0.0681 0.3011 0.8021 -
0.45 180 0.1775 0.3116 0.8004 -
0.5 200 0.0829 0.2536 0.8028 -
0.55 220 0.2332 0.2887 0.8015 -
0.6 240 0.1171 0.2862 0.8021 -
0.65 260 0.1059 0.2467 0.8023 -
0.7 280 0.1089 0.2240 0.8033 -
0.75 300 0.0445 0.1772 0.8048 -
0.8 320 0.0633 0.2392 0.8030 -
0.85 340 0.0506 0.2440 0.8027 -
0.9 360 0.1086 0.1926 0.8054 -
0.95 380 0.064 0.2984 0.8025 -
1.0 400 0.0478 0.2764 0.8025 -
1.05 420 0.0508 0.2393 0.8038 -
1.1 440 0.0266 0.2295 0.8039 -
1.15 460 0.0236 0.2477 0.8032 -
1.2 480 0.0142 0.2077 0.8045 -
1.25 500 0.0128 0.1972 0.8047 -
1.3 520 0.0205 0.2116 0.8042 -
1.35 540 0.0447 0.2425 0.8033 -
1.4 560 0.0 0.1999 0.8045 -
1.45 580 0.0284 0.1989 0.8046 -
1.5 600 0.0222 0.1789 0.8049 -
1.55 620 0.0066 0.1957 0.8045 -
1.6 640 0.0187 0.1993 0.8046 -
1.65 660 0.0489 0.1901 0.8051 -
1.7 680 0.0236 0.1556 0.8058 -
1.75 700 0.0186 0.1597 0.8059 -
1.8 720 0.0475 0.1813 0.8053 -
1.85 740 0.0215 0.1689 0.8060 -
1.9 760 0.0066 0.1746 0.8057 -
1.95 780 0.0158 0.1808 0.8054 -
2.0 800 0.0412 0.1799 0.8050 -
2.05 820 0.0 0.1809 0.8049 -
2.1 840 0.0072 0.1519 0.8059 -
2.15 860 0.032 0.1538 0.8060 -
2.2 880 0.0 0.1605 0.8058 -
2.25 900 0.016 0.1812 0.8053 -
2.3 920 0.0216 0.1550 0.8060 -
2.35 940 0.0124 0.1533 0.8062 -
2.4 960 0.0087 0.1499 0.8064 -
2.45 980 0.0 0.1493 0.8064 -
2.5 1000 0.0063 0.1483 0.8063 -
2.55 1020 0.0 0.1505 0.8063 -
2.6 1040 0.0 0.1508 0.8063 -
2.65 1060 0.0 0.1508 0.8063 -
2.7 1080 0.0 0.1508 0.8063 -
2.75 1100 0.0191 0.1546 0.8062 -
2.8 1120 0.0073 0.1566 0.8063 -
2.85 1140 0.0095 0.1529 0.8063 -
2.9 1160 0.0065 0.1512 0.8064 -
2.95 1180 0.0 0.1508 0.8063 -
3.0 1200 0.0 0.1508 0.8063 -
-1 -1 - - - 0.8096
  • The bold row denotes the saved checkpoint.
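
With load_best_model_at_end set to True, the trainer reloads the checkpoint that scored best on the evaluation metric. The sketch below shows one way to locate such a checkpoint from the log, assuming the selection metric is the validation loss (the transformers default when metric_for_best_model is not set; this card does not list one). The function name best_checkpoint is hypothetical, only a handful of rows from the table are included, and the result is not necessarily the row the card marks as saved.

```python
from typing import List, Tuple

# (step, validation_loss) pairs copied from a few rows of the training log above.
eval_log: List[Tuple[int, float]] = [
    (0, 2.3102),
    (300, 0.1772),
    (700, 0.1597),
    (980, 0.1493),
    (1000, 0.1483),
    (1200, 0.1508),
]


def best_checkpoint(log: List[Tuple[int, float]]) -> Tuple[int, float]:
    """Return the (step, loss) pair with the lowest validation loss."""
    return min(log, key=lambda row: row[1])


best_step, best_loss = best_checkpoint(eval_log)
print(best_step, best_loss)
```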

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0



Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}