tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:2602
  - loss:ContrastiveLoss
base_model: denaya/indoSBERT-large
widget:
  - source_sentence: >-
      Data triwulanan GDS, investasi non-fin, pinjaman neto pemerintah (triliun)
      2010
    sentences:
      - 'Nilai Ekspor Menurut Pelabuhan Utama (Nilai FOB: juta US$) 2000-2023'
      - >-
        Suhu Minimum, Rata-Rata, dan Maksimum di Stasiun Pengamatan BMKG (oC),
        2011-2015
      - >-
        Nilai Ekspor Menurut Negara Tujuan Utama (Nilai FOB: juta US$),
        2000-2023
  - source_sentence: >-
      Data triwulanan GDS, investasi non-fin, pinjaman neto pemerintah (triliun)
      2010
    sentences:
      - >-
        Tabungan Bruto, Investasi Nonfinansial, dan Pinjaman Neto Triwulanan
        Sektor Pemerintahan Umum (triliun rupiah), 2009-2015
      - >-
        Produksi Perikanan Budidaya Menurut Provinsi dan Jenis Budidaya,
        2000-2020
      - >-
        Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan
        Kelompok Umur (ribu rupiah), 2017
  - source_sentence: Gaji bersih vs kelompok umur dan lapangan pekerjaan, 2023
    sentences:
      - Investasi Nonfinansial Menurut Sektor (triliun rupiah), 2008-2014
      - >-
        Posisi Kredit Usaha Mikro, Kecil, dan Menengah (UMKM) 1 pada Bank Umum
        (miliar rupiah), 2012-2016
      - Ringkasan Neraca Arus Dana, Triwulan I, 2013*), (Miliar Rupiah)
  - source_sentence: >-
      Data utang luar negeri Indonesia (pemerintah dan BI), detail kreditor dan
      syarat, tahun 2010
    sentences:
      - >-
        Angka Partisipasi Sekolah (APS) Penduduk Umur 7-18 Tahun Menurut
        Klasifikasi Desa, Jenis Kelamin, dan Kelompok Umur, 2009-2023
      - Indeks Integritas Ujian Nasional
      - >-
        Rekapitulasi Luas Penutupan Lahan Hutan dan Non Hutan Menurut Provinsi
        Tahun 2014-2022 (Ribu Ha)
  - source_sentence: Laporan keuangan perusahaan asuransi wajib & BPJS akhir 2015
    sentences:
      - Indeks Harga Konsumen Menurut Kelompok Pengeluaran, 2020-2023
      - Ringkasan Neraca Arus Dana, Triwulan I, 2013*), (Miliar Rupiah)
      - >-
        Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi
        Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Jawa
        Timur, 2018-2023
datasets:
  - yahyaabd/bps-statictable-query-title-pairs
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: SentenceTransformer based on denaya/indoSBERT-large
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic base v1 eval
          type: allstats-semantic-base-v1-eval
        metrics:
          - type: pearson_cosine
            value: 0.902671671573215
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.7797277576994545
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstat semantic base v1 test
          type: allstat-semantic-base-v1-test
        metrics:
          - type: pearson_cosine
            value: 0.9166324050239434
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8089661156615633
            name: Spearman Cosine

SentenceTransformer based on denaya/indoSBERT-large

This is a sentence-transformers model fine-tuned from denaya/indoSBERT-large on the bps-statictable-query-title-pairs dataset. It maps sentences and paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 256, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
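The three modules above can be read as a pipeline: token vectors from the BERT encoder are mean-pooled into one 1024-dimensional vector, which the Dense layer then projects to 256 dimensions through a Tanh activation. The following is a minimal pure-Python sketch of the pooling and projection steps only, using toy sizes (4 input dimensions, 2 output dimensions) and made-up weights; the names `mean_pool` and `dense_tanh` are illustrative and are not part of the library.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [t / max(count, 1) for t in total]

def dense_tanh(x, weight, bias):
    """Apply y = tanh(W x + b); weight has shape (out_features, in_features)."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weight, bias)
    ]

# Toy sizes (assumption): 3 tokens, the last one is padding; 4 dims -> 2 dims.
tokens = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 2.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
weight = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # made-up weights
bias = [0.0, 0.0]

pooled = mean_pool(tokens, mask)
sentence_embedding = dense_tanh(pooled, weight, bias)
print(pooled, sentence_embedding)
```

Because of the Tanh activation, every component of the final embedding lies strictly between -1 and 1.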

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-ir-indoSBERT-large-v1")
# Run inference
sentences = [
    'Laporan keuangan perusahaan asuransi wajib & BPJS akhir 2015',
    'Ringkasan Neraca Arus Dana, Triwulan I, 2013*), (Miliar Rupiah)',
    'Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Jawa Timur, 2018-2023',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 256)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
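For retrieval, the usual pattern is to embed one query and a list of candidate table titles, then sort the titles by similarity to the query. The sketch below shows only the ranking step in pure Python, assuming cosine similarity is the scoring function. The 3-dimensional vectors and the titles are hypothetical stand-ins for the 256-dimensional output of `model.encode(...)`, and `rank_titles` is an illustrative helper, not a library function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_titles(query_vec, title_vecs, titles, top_k=3):
    """Return the top_k (score, title) pairs, highest similarity first."""
    scored = [(cosine(query_vec, vec), title) for vec, title in zip(title_vecs, titles)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical toy embeddings; in practice these would come from model.encode(...)
titles = ["Title A", "Title B", "Title C"]
title_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.8, 0.6, 0.0]

ranking = rank_titles(query_vec, title_vecs, titles, top_k=2)
for score, title in ranking:
    print(f"{score:.2f}  {title}")
```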

Evaluation

Metrics

Semantic Similarity

| Metric | allstats-semantic-base-v1-eval | allstat-semantic-base-v1-test |
|:--|:--|:--|
| pearson_cosine | 0.9027 | 0.9166 |
| spearman_cosine | 0.7797 | 0.8090 |
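The two metrics compare the model's cosine scores with the gold labels: Pearson measures linear correlation of the raw values, while Spearman is the Pearson correlation of their ranks. The following pure-Python sketch illustrates both on made-up scores, assuming binary 0/1 labels like those in the dataset; the helper names are illustrative, and the actual evaluator used for this card may differ in details.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up cosine scores and binary labels (every positive outranks every negative)
scores = [0.91, 0.12, 0.85, 0.30, 0.05, 0.77]
labels = [1, 0, 1, 0, 0, 1]

print(round(pearson(scores, labels), 4), round(spearman(scores, labels), 4))
```

On this toy data the scores separate the two classes perfectly, yet Spearman comes out at about 0.88 rather than 1.0, because all items with the same label share one tied rank. With binary labels the Spearman value is therefore bounded below 1, which is worth keeping in mind when comparing it with the Pearson value.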

Training Details

Training Dataset

bps-statictable-query-title-pairs

  • Dataset: bps-statictable-query-title-pairs at c7df38f
  • Size: 2,602 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

    |         | query | doc | label |
    |:--|:--|:--|:--|
    | type    | string | string | int |
    | details | min: 4 tokens, mean: 16.78 tokens, max: 28 tokens | min: 3 tokens, mean: 21.01 tokens, max: 48 tokens | 0: ~66.50%, 1: ~33.50% |

  • Samples:

    | query | doc | label |
    |:--|:--|:--|
    | Pertumbuhan populasi provinsi di Indonesia 1971-2024 | Kecepatan Angin dan Kelembaban di Stasiun Pengamatan BMKG, 2000-2010 | 0 |
    | Perbandingan upah nominal dan riil pekerja pertanian di Indonesia (tahun dasar 2012), periode 2017. | Upah Nominal dan Riil Buruh Tani di Indonesia (Rupiah), 2009-2019 (2012=100) | 1 |
    | Laporan singkat cash flow statement Q4/2005 | Nilai Produksi dan Biaya Produksi per Hektar Usaha Tanaman Bawang Merah dan Cabai Merah, 2014 | 0 |

  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
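With the parameters listed above, the loss works on the cosine distance d = 1 - cosine_similarity of each pair: matching pairs (label 1) are penalised by d squared, and non-matching pairs (label 0) are penalised only while d is smaller than the margin. The sketch below is a pure-Python illustration of that formula as we understand it from the Hadsell et al. formulation and the Sentence Transformers implementation; the function name and the example similarities are made up.

```python
def contrastive_loss(cosine_sims, labels, margin=0.5, size_average=True):
    """Contrastive loss over pairs, using cosine distance d = 1 - similarity."""
    losses = []
    for sim, label in zip(cosine_sims, labels):
        distance = 1.0 - sim
        # Matching pair: pull the two embeddings together.
        pull = label * distance ** 2
        # Non-matching pair: push apart until the distance reaches the margin.
        push = (1 - label) * max(0.0, margin - distance) ** 2
        losses.append(0.5 * (pull + push))
    total = sum(losses)
    # size_average=True averages over pairs; otherwise the losses are summed.
    return total / len(losses) if size_average else total

# Made-up cosine similarities for three (query, doc) pairs
sims = [0.9, 0.8, 0.2]
labels = [1, 0, 0]

loss = contrastive_loss(sims, labels)
print(round(loss, 6))
```

With a margin of 0.5, a non-matching pair stops contributing to the loss once its cosine similarity drops to 0.5 or below, as the third pair in the example shows.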
    

Evaluation Dataset

bps-statictable-query-title-pairs

  • Dataset: bps-statictable-query-title-pairs at c7df38f
  • Size: 558 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 558 samples:

    |         | query | doc | label |
    |:--|:--|:--|:--|
    | type    | string | string | int |
    | details | min: 3 tokens, mean: 16.82 tokens, max: 30 tokens | min: 3 tokens, mean: 21.13 tokens, max: 48 tokens | 0: ~70.97%, 1: ~29.03% |

  • Samples:

    | query | doc | label |
    |:--|:--|:--|
    | Data pengeluaran makanan rata-rata warga Sulteng per minggu di tahun 2022, berdasarkan kelompok pendapatan | Sistem Neraca Sosial Ekonomi Indonesia Tahun 2022 (84 x 84) | 0 |
    | Konsumsi & belanja makanan per orang di NTB, beda kelompok pengeluaran, 2021 | Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Nusa Tenggara Barat, 2018-2023 | 1 |
    | Bagaimana perbandingan PNS pria dan wanita di berbagai golongan tahun 2014? | Penduduk Berumur 15 Tahun Ke Atas Menurut Provinsi dan Jenis Kegiatan Selama Seminggu yang Lalu, 2008 - 2024 | 0 |

  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | Validation Loss | allstats-semantic-base-v1-eval_spearman_cosine | allstat-semantic-base-v1-test_spearman_cosine |
|:--|:--|:--|:--|:--|:--|
| 0 | 0 | - | 0.0086 | 0.7549 | - |
| 0.1220 | 10 | 0.0082 | 0.0069 | 0.7610 | - |
| 0.2439 | 20 | 0.0058 | 0.0049 | 0.7688 | - |
| 0.3659 | 30 | 0.0047 | 0.0041 | 0.7686 | - |
| 0.4878 | 40 | 0.0034 | 0.0036 | 0.7682 | - |
| 0.6098 | 50 | 0.003 | 0.0034 | 0.7696 | - |
| 0.7317 | 60 | 0.0031 | 0.0027 | 0.7728 | - |
| 0.8537 | 70 | 0.0031 | 0.0029 | 0.7713 | - |
| 0.9756 | 80 | 0.003 | 0.0031 | 0.7731 | - |
| 1.0976 | 90 | 0.0011 | 0.0025 | 0.7746 | - |
| 1.2195 | 100 | 0.001 | 0.0023 | 0.7759 | - |
| 1.3415 | 110 | 0.0013 | 0.0021 | 0.7767 | - |
| 1.4634 | 120 | 0.0011 | 0.0021 | 0.7773 | - |
| 1.5854 | 130 | 0.0008 | 0.0021 | 0.7786 | - |
| 1.7073 | 140 | 0.0006 | 0.0021 | 0.7789 | - |
| 1.8293 | 150 | 0.0007 | 0.0020 | 0.7788 | - |
| 1.9512 | 160 | 0.0018 | 0.002 | 0.7799 | - |
| 2.0732 | 170 | 0.0006 | 0.0020 | 0.7800 | - |
| 2.1951 | 180 | 0.0004 | 0.0021 | 0.7795 | - |
| 2.3171 | 190 | 0.0006 | 0.0021 | 0.7796 | - |
| 2.4390 | 200 | 0.0004 | 0.0021 | 0.7798 | - |
| 2.5610 | 210 | 0.0003 | 0.0021 | 0.7799 | - |
| 2.6829 | 220 | 0.0003 | 0.0021 | 0.7798 | - |
| 2.8049 | 230 | 0.0004 | 0.0021 | 0.7797 | - |
| 2.9268 | 240 | 0.0007 | 0.0021 | 0.7798 | - |
| 3.0488 | 250 | 0.0003 | 0.0021 | 0.7798 | - |
| 3.1707 | 260 | 0.0002 | 0.0021 | 0.7796 | - |
| 3.2927 | 270 | 0.0003 | 0.0021 | 0.7797 | - |
| 3.4146 | 280 | 0.0002 | 0.0021 | 0.7797 | - |
| 3.5366 | 290 | 0.0002 | 0.0021 | 0.7797 | - |
| 3.6585 | 300 | 0.0002 | 0.0021 | 0.7797 | - |
| 3.7805 | 310 | 0.0004 | 0.0021 | 0.7797 | - |
| 3.9024 | 320 | 0.0003 | 0.0021 | 0.7797 | - |
| -1 | -1 | - | - | - | 0.8090 |

  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}