SentenceTransformer based on denaya/indoSBERT-large

This is a sentence-transformers model fine-tuned from denaya/indoSBERT-large on the bps-statictable-query-title-pairs dataset. It maps sentences and paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 256, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
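
As a rough illustration of how the three modules listed above compose, the sketch below reimplements mean pooling followed by a Dense + Tanh projection in plain Python. The tiny dimensions (4 features in, 2 out), the token vectors, and the weights are all invented for readability; the real model pools 1024-dimensional token embeddings and projects them to 256 dimensions with learned weights.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def dense_tanh(vector, weight, bias):
    """Linear projection (one weight row per output feature) followed by tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weight, bias)
    ]

# Toy input: 3 token positions with 4 features each; the last position is padding.
tokens = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

# Hypothetical 4 -> 2 projection (the real layer is 1024 -> 256).
weight = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
bias = [0.0, 0.0]

pooled = mean_pool(tokens, mask)  # [2.0, 0.0, 1.0, 0.0]
sentence_embedding = dense_tanh(pooled, weight, bias)
print(pooled, sentence_embedding)
```

Because the final layer applies Tanh, every component of the output embedding lies strictly between -1 and 1.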

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-ir-indoSBERT-large-v1")
# Run inference
sentences = [
    'Laporan keuangan perusahaan asuransi wajib & BPJS akhir 2015',
    'Ringkasan Neraca Arus Dana, Triwulan I, 2013*), (Miliar Rupiah)',
    'Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Jawa Timur, 2018-2023',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 256]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
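
Since the model was trained on query and table-title pairs, a likely use is ranking candidate titles against a search query. The following is a minimal sketch of that ranking step only. `rank_titles` is a hypothetical helper, and the 3-dimensional vectors are invented stand-ins for the 256-dimensional arrays that `model.encode(...)` would return.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_titles(query_embedding, title_embeddings, titles, top_k=3):
    """Return the top_k (title, score) pairs, highest cosine similarity first."""
    scored = [
        (title, cosine(query_embedding, emb))
        for title, emb in zip(titles, title_embeddings)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy vectors standing in for model.encode(...) outputs (illustrative values only).
titles = ["title A", "title B", "title C"]
title_embeddings = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_embedding = [0.8, 0.6, 0.0]

ranking = rank_titles(query_embedding, title_embeddings, titles, top_k=2)
print(ranking)
```

With real embeddings, `model.similarity` from the snippet above computes the same pairwise scores in one call; the helper here only makes the sort-and-truncate step explicit.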

Evaluation

Metrics

Semantic Similarity

Metric            allstats-semantic-base-v1-eval   allstat-semantic-base-v1-test
pearson_cosine    0.9027                           0.9166
spearman_cosine   0.7797                           0.8090
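
The metric names suggest Pearson and Spearman correlations between the cosine similarity of each (query, doc) pair and its gold label. The card does not name the evaluator, so the sketch below only assumes that reading and shows how both correlations can be computed in plain Python. The scores and labels are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented cosine similarities and binary relevance labels for four pairs.
predicted = [0.91, 0.15, 0.78, 0.30]
labels = [1, 0, 1, 0]

print(pearson(predicted, labels), spearman(predicted, labels))
```

In this toy data the scores separate the two classes perfectly, yet the Spearman value stays below 1 because the binary labels are tied within each class.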

Training Details

Training Dataset

bps-statictable-query-title-pairs

  • Dataset: bps-statictable-query-title-pairs at c7df38f
  • Size: 2,602 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 4 tokens, mean 16.78 tokens, max 28 tokens
    • doc (string): min 3 tokens, mean 21.01 tokens, max 48 tokens
    • label (int): 0: ~66.50%, 1: ~33.50%
  • Samples:
    • query: Pertumbuhan populasi provinsi di Indonesia 1971-2024
      doc: Kecepatan Angin dan Kelembaban di Stasiun Pengamatan BMKG, 2000-2010
      label: 0
    • query: Perbandingan upah nominal dan riil pekerja pertanian di Indonesia (tahun dasar 2012), periode 2017.
      doc: Upah Nominal dan Riil Buruh Tani di Indonesia (Rupiah), 2009-2019 (2012=100)
      label: 1
    • query: Laporan singkat cash flow statement Q4/2005
      doc: Nilai Produksi dan Biaya Produksi per Hektar Usaha Tanaman Bawang Merah dan Cabai Merah, 2014
      label: 0
  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
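
To make the parameters above concrete, here is a small sketch of a contrastive loss with cosine distance, following the formulation of Hadsell et al. as it is commonly implemented: matching pairs (label 1) are penalized by their squared distance, and non-matching pairs (label 0) are penalized only when they are closer than the margin. This is a plain-Python approximation written for illustration, not the library's code, and the 2-dimensional embeddings are invented.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity, so identical directions give 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def contrastive_loss(pairs, labels, margin=0.5, size_average=True):
    """pairs: list of (embedding_a, embedding_b); labels: 1 = match, 0 = no match."""
    losses = []
    for (u, v), label in zip(pairs, labels):
        d = cosine_distance(u, v)
        positive_term = label * d ** 2
        negative_term = (1 - label) * max(0.0, margin - d) ** 2
        losses.append(0.5 * (positive_term + negative_term))
    # size_average=True averages over the batch; otherwise the losses are summed.
    return sum(losses) / len(losses) if size_average else sum(losses)

pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # matching pair, distance 0: no penalty
    ([1.0, 0.0], [0.0, 1.0]),  # non-matching pair beyond the margin: no penalty
    ([1.0, 0.0], [0.8, 0.6]),  # non-matching pair at distance 0.2: penalized
]
labels = [1, 0, 0]

print(contrastive_loss(pairs, labels))
```

Only the third pair contributes here, which shows the role of the margin: a non-matching pair stops producing gradient once its cosine distance reaches 0.5.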
    

Evaluation Dataset

bps-statictable-query-title-pairs

  • Dataset: bps-statictable-query-title-pairs at c7df38f
  • Size: 558 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 558 samples:
    • query (string): min 3 tokens, mean 16.82 tokens, max 30 tokens
    • doc (string): min 3 tokens, mean 21.13 tokens, max 48 tokens
    • label (int): 0: ~70.97%, 1: ~29.03%
  • Samples:
    • query: Data pengeluaran makanan rata-rata warga Sulteng per minggu di tahun 2022, berdasarkan kelompok pendapatan
      doc: Sistem Neraca Sosial Ekonomi Indonesia Tahun 2022 (84 x 84)
      label: 0
    • query: Konsumsi & belanja makanan per orang di NTB, beda kelompok pengeluaran, 2021
      doc: Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Nusa Tenggara Barat, 2018-2023
      label: 1
    • query: Bagaimana perbandingan PNS pria dan wanita di berbagai golongan tahun 2014?
      doc: Penduduk Berumur 15 Tahun Ke Atas Menurut Provinsi dan Jenis Kegiatan Selama Seminggu yang Lalu, 2008 - 2024
      label: 0
  • Loss: ContrastiveLoss with these parameters:
    {
        "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
        "margin": 0.5,
        "size_average": true
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True
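
The batch size and epoch count above, combined with the 2,602 training samples, determine how many optimizer steps the run takes. The arithmetic below is a sketch under two assumptions not stated in the card: training ran on a single device, and warm-up steps are rounded up from the ratio. The resulting step counts agree with the Epoch and Step columns in the Training Logs section.

```python
import math

train_samples = 2602   # from "Size: 2,602 training samples"
batch_size = 32        # per_device_train_batch_size
epochs = 4             # num_train_epochs
warmup_ratio = 0.1     # warmup_ratio

# Assumption: one device, gradient_accumulation_steps = 1, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: warm-up length is the ratio times total steps, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)

# Fractional epoch reached after 320 steps, for comparison with the training log.
print(round(320 / steps_per_epoch, 4))
```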

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
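
The combination of `lr_scheduler_type: linear`, `learning_rate: 5e-05`, and `warmup_ratio: 0.1` describes a learning rate that ramps up linearly and then decays linearly to zero. The sketch below illustrates that shape in plain Python. The defaults of 328 total steps (82 per epoch over 4 epochs) and 33 warm-up steps are inferred from the dataset size and batch size rather than stated in the card, and `linear_schedule_lr` is a hypothetical helper, not a library function.

```python
def linear_schedule_lr(step, base_lr=5e-05, warmup_steps=33, total_steps=328):
    """Learning rate after `step` updates under linear warm-up and linear decay."""
    if step < warmup_steps:
        # Warm-up phase: scale from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 10, 33, 180, 328):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak rate of 5e-05 is reached at step 33, partway through the first epoch, and the rate falls to zero at the final step.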

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-base-v1-eval_spearman_cosine allstat-semantic-base-v1-test_spearman_cosine
0 0 - 0.0086 0.7549 -
0.1220 10 0.0082 0.0069 0.7610 -
0.2439 20 0.0058 0.0049 0.7688 -
0.3659 30 0.0047 0.0041 0.7686 -
0.4878 40 0.0034 0.0036 0.7682 -
0.6098 50 0.003 0.0034 0.7696 -
0.7317 60 0.0031 0.0027 0.7728 -
0.8537 70 0.0031 0.0029 0.7713 -
0.9756 80 0.003 0.0031 0.7731 -
1.0976 90 0.0011 0.0025 0.7746 -
1.2195 100 0.001 0.0023 0.7759 -
1.3415 110 0.0013 0.0021 0.7767 -
1.4634 120 0.0011 0.0021 0.7773 -
1.5854 130 0.0008 0.0021 0.7786 -
1.7073 140 0.0006 0.0021 0.7789 -
1.8293 150 0.0007 0.0020 0.7788 -
1.9512 160 0.0018 0.002 0.7799 -
2.0732 170 0.0006 0.0020 0.7800 -
2.1951 180 0.0004 0.0021 0.7795 -
2.3171 190 0.0006 0.0021 0.7796 -
2.4390 200 0.0004 0.0021 0.7798 -
2.5610 210 0.0003 0.0021 0.7799 -
2.6829 220 0.0003 0.0021 0.7798 -
2.8049 230 0.0004 0.0021 0.7797 -
2.9268 240 0.0007 0.0021 0.7798 -
3.0488 250 0.0003 0.0021 0.7798 -
3.1707 260 0.0002 0.0021 0.7796 -
3.2927 270 0.0003 0.0021 0.7797 -
3.4146 280 0.0002 0.0021 0.7797 -
3.5366 290 0.0002 0.0021 0.7797 -
3.6585 300 0.0002 0.0021 0.7797 -
3.7805 310 0.0004 0.0021 0.7797 -
3.9024 320 0.0003 0.0021 0.7797 -
-1 -1 - - - 0.8090
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}