metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:25580
  - loss:OnlineContrastiveLoss
base_model: denaya/indoSBERT-large
widget:
  - source_sentence: ikhtisar arus kas triwulan 1, 2004 (miliar)
    sentences:
      - Balita (0-59 Bulan) Menurut Status Gizi, Tahun 1998-2005
      - >-
        Perbandingan Indeks dan Tingkat Inflasi Desember 2023 Kota-kota di Luar
        Pulau Jawa dan Sumatera dengan Nasional (2018=100)
      - >-
        Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi
        Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi
        Sulawesi Tengah, 2018-2023
  - source_sentence: >-
      BaIgaimana gambaran neraca arus dana dUi Indonesia pada kuartal kedua
      tahun 2015?
    sentences:
      - >-
        Jumlah Sekolah, Guru, dan Murid Sekolah Menengah Pertama (SMP) di Bawah
        Kementrian Pendidikan dan Kebudayaan Menurut Provinsi
        2011/2012-2015/2016
      - Ringkasan Neraca Arus Dana Triwulan III Tahun 2003 (Miliar Rupiah)
      - >-
        Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi
        Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi
        Sulawesi Tenggara, 2018-2023
  - source_sentence: >-
      Berapa persen pengeluaran orang di kotaa untuk makanan vs non-makanan, per
      provinsi, 2018?
    sentences:
      - >-
        Ekspor Tanaman Obat, Aromatik, dan Rempah-Rempah menurut Negara Tujuan
        Utama, 2012-2023
      - >-
        Rata-rata Pendapatan Bersih Pekerja Bebas Menurut Provinsi dan
        Pendidikan Tertinggi yang Ditamatkan (ribu rupiah), 2017
      - >-
        IHK dan Rata-rata Upah per Bulan Buruh Industri di Bawah Mandor
        (Supervisor), 1996-2014 (1996=100)
  - source_sentence: Negara-negara asal impor crude oil dan produk turunannya tahun 2002-2023
    sentences:
      - >-
        Persentase Pengeluaran Rata-rata per Kapita Sebulan Menurut Kelompok
        Barang, Indonesia, 1999, 2002-2023
      - >-
        Rata-rata Pendapatan Bersih Berusaha Sendiri menurut Provinsi dan
        Pendidikan yang Ditamatkan (ribu rupiah), 2016
      - >-
        Perkembangan Beberapa Agregat Pendapatan dan Pendapatan per Kapita Atas
        Dasar Harga Berlaku, 2010-2016
  - source_sentence: Arus dana Q3 2006
    sentences:
      - >-
        Posisi Simpanan Berjangka Rupiah pada Bank Umum dan BPR Menurut Golongan
        Pemilik (miliar rupiah), 2005-2018
      - Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)
      - >-
        Rata-Rata Pengeluaran per Kapita Sebulan di Daerah Perkotaan Menurut
        Kelompok Barang dan Golongan Pengeluaran per Kapita Sebulan, 2000-2012
datasets:
  - yahyaabd/query-hard-pos-neg-doc-pairs-statictable
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
  - cosine_accuracy_threshold
  - cosine_f1
  - cosine_f1_threshold
  - cosine_precision
  - cosine_recall
  - cosine_ap
  - cosine_mcc
model-index:
  - name: SentenceTransformer based on denaya/indoSBERT-large
    results:
      - task:
          type: binary-classification
          name: Binary Classification
        dataset:
          name: allstats semantic large v1 test
          type: allstats-semantic-large-v1_test
        metrics:
          - type: cosine_accuracy
            value: 0.9834364761558063
            name: Cosine Accuracy
          - type: cosine_accuracy_threshold
            value: 0.7773222327232361
            name: Cosine Accuracy Threshold
          - type: cosine_f1
            value: 0.9745739033249511
            name: Cosine F1
          - type: cosine_f1_threshold
            value: 0.7773222327232361
            name: Cosine F1 Threshold
          - type: cosine_precision
            value: 0.9748462828395752
            name: Cosine Precision
          - type: cosine_recall
            value: 0.9743016759776536
            name: Cosine Recall
          - type: cosine_ap
            value: 0.9959810762137397
            name: Cosine Ap
          - type: cosine_mcc
            value: 0.9622916280716365
            name: Cosine Mcc
      - task:
          type: binary-classification
          name: Binary Classification
        dataset:
          name: allstats semantic large v1 dev
          type: allstats-semantic-large-v1_dev
        metrics:
          - type: cosine_accuracy
            value: 0.9760905274685161
            name: Cosine Accuracy
          - type: cosine_accuracy_threshold
            value: 0.7572722434997559
            name: Cosine Accuracy Threshold
          - type: cosine_f1
            value: 0.9640997533570841
            name: Cosine F1
          - type: cosine_f1_threshold
            value: 0.7572722434997559
            name: Cosine F1 Threshold
          - type: cosine_precision
            value: 0.9386339381003201
            name: Cosine Precision
          - type: cosine_recall
            value: 0.9909859154929578
            name: Cosine Recall
          - type: cosine_ap
            value: 0.9953499585582108
            name: Cosine Ap
          - type: cosine_mcc
            value: 0.9469795586519781
            name: Cosine Mcc

SentenceTransformer based on denaya/indoSBERT-large

This is a sentence-transformers model fine-tuned from denaya/indoSBERT-large on the query-hard-pos-neg-doc-pairs-statictable dataset. It maps sentences and paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 256, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
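
The three modules above form a pipeline: the BERT encoder produces one vector per token, the pooling layer averages the vectors of non-padding tokens, and the dense layer with a tanh activation projects the result from 1024 to 256 dimensions. The following is a minimal pure-Python sketch of the pooling and projection steps on toy dimensions (4 inputs, 2 outputs instead of 1024 and 256). The helper names `mean_pool` and `dense_tanh` and all weights are illustrative assumptions, not the model's actual code or parameters.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def dense_tanh(x, weight, bias):
    """Apply y = tanh(W x + b); weight has shape (out_features, in_features)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]

# Toy input: 3 token positions, the last one is padding.
tokens = [[1.0, 2.0, 3.0, 4.0],
          [3.0, 2.0, 1.0, 0.0],
          [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)           # [2.0, 2.0, 2.0, 2.0]

# Hypothetical 2x4 projection weights, chosen only for illustration.
weight = [[0.1, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, -0.1]]
bias = [0.0, 0.0]

sentence_embedding = dense_tanh(pooled, weight, bias)
print(len(sentence_embedding))             # 2
```

Because of the tanh activation, every coordinate of the final embedding lies strictly between -1 and 1.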

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-large-v1-32-2")
# Run inference
sentences = [
    'Arus dana Q3 2006',
    'Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)',
    'Rata-Rata Pengeluaran per Kapita Sebulan di Daerah Perkotaan Menurut Kelompok Barang dan Golongan Pengeluaran per Kapita Sebulan, 2000-2012',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 256]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
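
For semantic search, similarity scores like these are typically used to rank candidate table titles against a query. The sketch below does this with plain cosine similarity on small hand-made vectors so that it runs without downloading the model; in practice the vectors would come from `model.encode(...)`. The functions `cosine` and `rank_by_cosine` are hypothetical helpers, not part of the Sentence Transformers API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up 2-dimensional "embeddings" standing in for model.encode() output.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]]

ranking = rank_by_cosine(query, docs)
print([index for index, _ in ranking])     # [1, 2, 0]
```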

Evaluation

Metrics

Binary Classification

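| Metric                    | allstats-semantic-large-v1_test | allstats-semantic-large-v1_dev |
|---------------------------|---------------------------------|--------------------------------|
| cosine_accuracy           | 0.9834                          | 0.9761                         |
| cosine_accuracy_threshold | 0.7773                          | 0.7573                         |
| cosine_f1                 | 0.9746                          | 0.9641                         |
| cosine_f1_threshold       | 0.7773                          | 0.7573                         |
| cosine_precision          | 0.9748                          | 0.9386                         |
| cosine_recall             | 0.9743                          | 0.991                          |
| cosine_ap                 | 0.996                           | 0.9953                         |
| cosine_mcc                | 0.9623                          | 0.947                          |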
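
The threshold metrics above are decision boundaries on cosine similarity: a query/document pair is predicted as a match when its score reaches the threshold. The sketch below applies the test-set threshold of 0.7773 to a handful of invented scores and labels and computes precision, recall, F1, and MCC from the resulting confusion counts. The data is made up for illustration, and treating a score exactly equal to the threshold as a match is an assumption.

```python
import math

def binary_metrics(scores, labels, threshold):
    """Threshold similarity scores and compute precision, recall, F1 and MCC."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Invented cosine scores and gold labels (1 = relevant, 0 = not relevant).
scores = [0.91, 0.80, 0.60, 0.78, 0.30, 0.10]
labels = [1, 1, 1, 0, 0, 0]

metrics = binary_metrics(scores, labels, threshold=0.7773)
print(metrics)
```

In this toy case one positive falls below the threshold and one negative lands above it, so precision, recall, and F1 all come out at 2/3 and MCC at 1/3.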

Training Details

Training Dataset

query-hard-pos-neg-doc-pairs-statictable

  • Dataset: query-hard-pos-neg-doc-pairs-statictable at 7b28b96
  • Size: 25,580 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    |         | query | doc | label |
    |---------|-------|-----|-------|
    | type    | string | string | int |
    | details | min: 6 tokens, mean: 17.12 tokens, max: 31 tokens | min: 5 tokens, mean: 20.47 tokens, max: 42 tokens | 0: ~70.80%, 1: ~29.20% |
  • Samples:
    | query | doc | label |
    |-------|-----|-------|
    | Status pekerjaan utama penduduk usia 15+ yang bekerja, 2020 | Jumlah Penghuni Lapas per Kanwil | 0 |
    | status pekerjaan utama penduduk usia 15+ yang bekerja, 2020 | Jumlah Penghuni Lapas per Kanwil | 0 |
    | STATUS PEKERJAAN UTAMA PENDUDUK USIA 15+ YANG BEKERJA, 2020 | Jumlah Penghuni Lapas per Kanwil | 0 |
  • Loss: OnlineContrastiveLoss
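
OnlineContrastiveLoss applies a contrastive loss only to the hard pairs in each batch: positive pairs that are farther apart than the closest negative pair, and negative pairs that are closer than the farthest positive pair. The pure-Python sketch below illustrates that selection on precomputed cosine distances (1 minus cosine similarity). It is a simplified illustration that assumes a margin of 0.5 and at least one pair of each label per batch; it is not the library's implementation, and the Sentence Transformers documentation remains the authoritative definition.

```python
def online_contrastive_loss(distances, labels, margin=0.5):
    """Sum of squared penalties over hard positive and hard negative pairs.

    distances: cosine distance per pair (0 = identical direction).
    labels:    1 for matching pairs, 0 for non-matching pairs.
    margin:    assumed value; negatives farther than this incur no penalty.
    """
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("this sketch needs at least one pair of each label")

    # Hard positives: farther apart than the closest negative pair.
    hard_pos = [d for d in pos if d > min(neg)]
    # Hard negatives: closer together than the farthest positive pair.
    hard_neg = [d for d in neg if d < max(pos)]

    positive_loss = sum(d ** 2 for d in hard_pos)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_neg)
    return positive_loss + negative_loss

# One hard positive (0.6) and one hard negative (0.3).
mixed = online_contrastive_loss([0.1, 0.6, 0.3, 0.9], [1, 1, 0, 0])

# Positives and negatives already well separated: no hard pairs remain.
separated = online_contrastive_loss([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
print(round(mixed, 4), separated)
```

When a batch contains no hard pairs, the sketch returns exactly zero, which may be one reason the training logs show a training loss of 0.0 at many steps in the second epoch.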

Evaluation Dataset

query-hard-pos-neg-doc-pairs-statictable

  • Dataset: query-hard-pos-neg-doc-pairs-statictable at 7b28b96
  • Size: 5,479 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    |         | query | doc | label |
    |---------|-------|-----|-------|
    | type    | string | string | int |
    | details | min: 7 tokens, mean: 17.85 tokens, max: 35 tokens | min: 3 tokens, mean: 21.2 tokens, max: 31 tokens | 0: ~71.50%, 1: ~28.50% |
  • Samples:
    | query | doc | label |
    |-------|-----|-------|
    | Bagaimana perbandingan PNS pria dan wanita di berbagai golongan tahun 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0 |
    | bagaimana perbandingan pns pria dan wanita di berbagai golongan tahun 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0 |
    | BAGAIMANA PERBANDINGAN PNS PRIA DAN WANITA DI BERBAGAI GOLONGAN TAHUN 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0 |
  • Loss: OnlineContrastiveLoss

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 2
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | Validation Loss | allstats-semantic-large-v1_test_cosine_ap | allstats-semantic-large-v1_dev_cosine_ap |
|-------|------|---------------|-----------------|-------------------------------------------|------------------------------------------|
| -1 | -1 | - | - | 0.9750 | - |
| 0 | 0 | - | 0.1850 | - | 0.9766 |
| 0.025 | 20 | 0.1581 | 0.1538 | - | 0.9789 |
| 0.05 | 40 | 0.1898 | 0.1200 | - | 0.9848 |
| 0.075 | 60 | 0.0647 | 0.1096 | - | 0.9855 |
| 0.1 | 80 | 0.118 | 0.1242 | - | 0.9831 |
| 0.125 | 100 | 0.0545 | 0.1301 | - | 0.9827 |
| 0.15 | 120 | 0.0646 | 0.1114 | - | 0.9862 |
| 0.175 | 140 | 0.0775 | 0.1005 | - | 0.9865 |
| 0.2 | 160 | 0.0664 | 0.1234 | - | 0.9840 |
| 0.225 | 180 | 0.067 | 0.1349 | - | 0.9850 |
| 0.25 | 200 | 0.0823 | 0.1032 | - | 0.9877 |
| 0.275 | 220 | 0.0895 | 0.1432 | - | 0.9808 |
| 0.3 | 240 | 0.0666 | 0.1389 | - | 0.9809 |
| 0.325 | 260 | 0.0872 | 0.1122 | - | 0.9844 |
| 0.35 | 280 | 0.0551 | 0.1435 | - | 0.9838 |
| 0.375 | 300 | 0.0919 | 0.1068 | - | 0.9886 |
| 0.4 | 320 | 0.0437 | 0.0903 | - | 0.9861 |
| 0.425 | 340 | 0.0619 | 0.1065 | - | 0.9850 |
| 0.45 | 360 | 0.0469 | 0.1346 | - | 0.9844 |
| 0.475 | 380 | 0.029 | 0.1351 | - | 0.9828 |
| 0.5 | 400 | 0.0511 | 0.1123 | - | 0.9843 |
| 0.525 | 420 | 0.0394 | 0.1434 | - | 0.9815 |
| 0.55 | 440 | 0.0178 | 0.1577 | - | 0.9769 |
| 0.575 | 460 | 0.047 | 0.1253 | - | 0.9796 |
| 0.6 | 480 | 0.0066 | 0.1262 | - | 0.9791 |
| 0.625 | 500 | 0.0383 | 0.1277 | - | 0.9814 |
| 0.65 | 520 | 0.0084 | 0.1361 | - | 0.9845 |
| 0.675 | 540 | 0.0409 | 0.1202 | - | 0.9872 |
| 0.7 | 560 | 0.0372 | 0.1245 | - | 0.9854 |
| 0.725 | 580 | 0.0353 | 0.1469 | - | 0.9817 |
| 0.75 | 600 | 0.0429 | 0.1225 | - | 0.9836 |
| 0.775 | 620 | 0.0595 | 0.1082 | - | 0.9862 |
| 0.8 | 640 | 0.0266 | 0.0886 | - | 0.9903 |
| 0.825 | 660 | 0.0178 | 0.0712 | - | 0.9918 |
| 0.85 | 680 | 0.0567 | 0.0511 | - | 0.9936 |
| 0.875 | 700 | 0.0142 | 0.0538 | - | 0.9916 |
| 0.9 | 720 | 0.0136 | 0.0726 | - | 0.9890 |
| 0.925 | 740 | 0.0192 | 0.0707 | - | 0.9884 |
| 0.95 | 760 | 0.0253 | 0.0937 | - | 0.9872 |
| 0.975 | 780 | 0.0149 | 0.0792 | - | 0.9878 |
| 1.0 | 800 | 0.0231 | 0.0912 | - | 0.9879 |
| 1.025 | 820 | 0.0 | 0.1030 | - | 0.9871 |
| 1.05 | 840 | 0.0096 | 0.0990 | - | 0.9876 |
| 1.075 | 860 | 0.0 | 0.1032 | - | 0.9868 |
| 1.1 | 880 | 0.0 | 0.1037 | - | 0.9866 |
| 1.125 | 900 | 0.0 | 0.1038 | - | 0.9866 |
| 1.15 | 920 | 0.0 | 0.1038 | - | 0.9866 |
| 1.175 | 940 | 0.0 | 0.1038 | - | 0.9866 |
| 1.2 | 960 | 0.0121 | 0.1030 | - | 0.9895 |
| 1.225 | 980 | 0.0 | 0.1035 | - | 0.9899 |
| 1.25 | 1000 | 0.0 | 0.1040 | - | 0.9898 |
| 1.275 | 1020 | 0.0 | 0.1049 | - | 0.9898 |
| 1.3 | 1040 | 0.0 | 0.1049 | - | 0.9898 |
| 1.325 | 1060 | 0.0067 | 0.1015 | - | 0.9903 |
| 1.35 | 1080 | 0.0 | 0.1048 | - | 0.9901 |
| 1.375 | 1100 | 0.0159 | 0.0956 | - | 0.9910 |
| 1.4 | 1120 | 0.0067 | 0.0818 | - | 0.9926 |
| 1.425 | 1140 | 0.0151 | 0.0838 | - | 0.9926 |
| 1.45 | 1160 | 0.0 | 0.0889 | - | 0.9920 |
| 1.475 | 1180 | 0.0 | 0.0894 | - | 0.9920 |
| 1.5 | 1200 | 0.023 | 0.0696 | - | 0.9935 |
| 1.525 | 1220 | 0.0 | 0.0693 | - | 0.9935 |
| 1.55 | 1240 | 0.0 | 0.0711 | - | 0.9935 |
| 1.575 | 1260 | 0.0 | 0.0711 | - | 0.9935 |
| 1.6 | 1280 | 0.0 | 0.0711 | - | 0.9935 |
| 1.625 | 1300 | 0.0176 | 0.0743 | - | 0.9936 |
| 1.65 | 1320 | 0.0 | 0.0806 | - | 0.9931 |
| 1.675 | 1340 | 0.0 | 0.0817 | - | 0.9931 |
| 1.7 | 1360 | 0.007 | 0.0809 | - | 0.9929 |
| 1.725 | 1380 | 0.0209 | 0.0700 | - | 0.9941 |
| 1.75 | 1400 | 0.0068 | 0.0605 | - | 0.9949 |
| 1.775 | 1420 | 0.0069 | 0.0564 | - | 0.9951 |
| 1.8 | 1440 | 0.0097 | 0.0559 | - | 0.9953 |
| 1.825 | 1460 | 0.0 | 0.0557 | - | 0.9953 |
| 1.85 | 1480 | 0.0 | 0.0557 | - | 0.9953 |
| 1.875 | 1500 | 0.0 | 0.0557 | - | 0.9953 |
| 1.9 | 1520 | 0.0 | 0.0557 | - | 0.9953 |
| 1.925 | 1540 | 0.0 | 0.0557 | - | 0.9953 |
| 1.95 | 1560 | 0.0089 | 0.0544 | - | 0.9953 |
| 1.975 | 1580 | 0.0 | 0.0544 | - | 0.9953 |
| 2.0 | 1600 | 0.0 | 0.0544 | - | 0.9953 |
| -1 | -1 | - | - | 0.9960 | - |
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}