SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the query-hard-pos-neg-doc-pairs-statictable dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
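
The Pooling module above has only pooling_mode_mean_tokens enabled, so a sentence vector is the average of its token vectors. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The helper name mean_pool and the 2-dimensional toy vectors are illustrative assumptions (the real model uses 768 dimensions), as is the use of an attention mask to skip padding positions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (hypothetical helper)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [t / count for t in total]


# Toy input: two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padding vector is ignored, so only the two real tokens contribute to the average.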

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-multilingual-base-v1")
# Run inference
sentences = [
    'Arus dana Q3 2006',
    'Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)',
    'Rata-Rata Pengeluaran per Kapita Sebulan di Daerah Perkotaan Menurut Kelompok Barang dan Golongan Pengeluaran per Kapita Sebulan, 2000-2012',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
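
For semantic search, the same embeddings can be used to rank candidate table titles against a query. The sketch below assumes a hypothetical helper rank_documents that accepts any encode callable returning one vector per input string, so model.encode from the snippet above could be passed in. A toy keyword encoder (toy_encode, purely illustrative) stands in for the model so the example runs without downloading weights.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(encode, query, docs):
    """Return (doc, score) pairs sorted by descending cosine similarity."""
    vectors = encode([query] + list(docs))
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in zip(docs, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_encode(texts):
    """Stand-in encoder (assumption): one dimension per keyword."""
    vocab = ["arus", "dana", "2006", "pengeluaran"]
    return [[float(word in text.lower()) for word in vocab] for text in texts]


docs = [
    "Rata-Rata Pengeluaran per Kapita Sebulan di Daerah Perkotaan Menurut Kelompok Barang dan Golongan Pengeluaran per Kapita Sebulan, 2000-2012",
    "Ringkasan Neraca Arus Dana, Triwulan III, 2006, (Miliar Rupiah)",
]
ranking = rank_documents(toy_encode, "Arus dana Q3 2006", docs)
for doc, score in ranking:
    print(f"{score:.2f}  {doc}")
```

With the real model, replacing toy_encode with model.encode should give the same kind of ranking, with scores that reflect semantic rather than keyword overlap.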

Evaluation

Metrics

Semantic Similarity

  • Datasets: allstats-search-multilingual-base-v1-eval and allstats-search-multilingual-base-v1-test
  • Evaluated with EmbeddingSimilarityEvaluator
Metric          | allstats-search-multilingual-base-v1-eval | allstats-search-multilingual-base-v1-test
pearson_cosine  | 0.87                                      | 0.9023
spearman_cosine | 0.8062                                    | 0.8093
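
The two metrics compare the cosine similarity of each query/doc pair with its gold label: pearson_cosine measures linear correlation and spearman_cosine measures rank correlation. The sketch below is a pure-Python illustration of both, not the evaluator's code. The function names and the four score/label pairs are hypothetical values chosen only to show the calculation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cosine scores and binary gold labels (not taken from the dataset).
cosine_scores = [0.91, 0.15, 0.78, 0.05]
gold_labels = [1, 0, 1, 0]
print(round(pearson(cosine_scores, gold_labels), 4))
print(round(spearman(cosine_scores, gold_labels), 4))
```

In this toy data the scores separate the two classes perfectly, yet the Spearman value stays below 1 because the binary labels are tied within each class.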

Training Details

Training Dataset

query-hard-pos-neg-doc-pairs-statictable

  • Dataset: query-hard-pos-neg-doc-pairs-statictable at 7b28b96
  • Size: 25,580 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
            | query              | doc               | label
    type    | string             | string            | int
    details | min: 7 tokens      | min: 5 tokens     | 0: ~70.80%
            | mean: 20.14 tokens | mean: 24.9 tokens | 1: ~29.20%
            | max: 55 tokens     | max: 47 tokens    |
  • Samples:
    query | doc | label
    Status pekerjaan utama penduduk usia 15+ yang bekerja, 2020 | Jumlah Penghuni Lapas per Kanwil | 0
    status pekerjaan utama penduduk usia 15+ yang bekerja, 2020 | Jumlah Penghuni Lapas per Kanwil | 0
    STATUS PEKERJAAN UTAMA PENDUDUK USIA 15+ YANG BEKERJA, 2020 | Jumlah Penghuni Lapas per Kanwil | 0
  • Loss: OnlineContrastiveLoss
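
OnlineContrastiveLoss differs from plain contrastive loss in that it only penalises the hard pairs in each batch: positives that lie farther apart than the closest negative, and negatives that lie closer than the farthest positive. The sketch below is a simplified pure-Python illustration of that selection, not the library's code. It assumes that distances are cosine distances (1 - cosine similarity) and that the margin is 0.5; neither value is stated in this card. Its handling of batches that contain only one class is also a simplification.

```python
def online_contrastive_loss(distances, labels, margin=0.5):
    """Simplified sketch: sum squared penalties over hard pairs only."""
    positives = [d for d, y in zip(distances, labels) if y == 1]
    negatives = [d for d, y in zip(distances, labels) if y == 0]

    # Hard negatives: dissimilar pairs that sit closer than the farthest positive.
    hard_negatives = [d for d in negatives if d < max(positives)] if positives else negatives
    # Hard positives: similar pairs that sit farther than the closest negative.
    hard_positives = [d for d in positives if d > min(negatives)] if negatives else positives

    positive_loss = sum(d ** 2 for d in hard_positives)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_negatives)
    return positive_loss + negative_loss


# Hypothetical batch of four (query, doc) pairs.
distances = [0.1, 0.6, 0.2, 0.9]
labels = [1, 1, 0, 0]
loss = online_contrastive_loss(distances, labels)
print(round(loss, 4))  # 0.45
```

Here only the positive at distance 0.6 and the negative at distance 0.2 contribute; the easy pairs at 0.1 and 0.9 are ignored.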

Evaluation Dataset

query-hard-pos-neg-doc-pairs-statictable

  • Dataset: query-hard-pos-neg-doc-pairs-statictable at 7b28b96
  • Size: 5,479 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
            | query              | doc                | label
    type    | string             | string             | int
    details | min: 7 tokens      | min: 4 tokens      | 0: ~71.50%
            | mean: 20.78 tokens | mean: 26.28 tokens | 1: ~28.50%
            | max: 52 tokens     | max: 43 tokens     |
  • Samples:
    query | doc | label
    Bagaimana perbandingan PNS pria dan wanita di berbagai golongan tahun 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0
    bagaimana perbandingan pns pria dan wanita di berbagai golongan tahun 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0
    BAGAIMANA PERBANDINGAN PNS PRIA DAN WANITA DI BERBAGAI GOLONGAN TAHUN 2014? | Rata-rata Pendapatan Bersih Berusaha Sendiri Menurut Provinsi dan Lapangan Pekerjaan Utama (ribu rupiah), 2017 | 0
  • Loss: OnlineContrastiveLoss

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • warmup_ratio: 0.05
  • fp16: True
  • load_best_model_at_end: True
  • eval_on_start: True
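
These settings, combined with the dataset size, determine the length of the run and the shape of the learning-rate schedule. The arithmetic below is a rough sketch that assumes training on a single device; it uses the 25,580 training samples, the 3 epochs, the 5e-05 learning rate and the linear scheduler reported elsewhere in this card. The function linear_lr is a hypothetical helper that mirrors a linear warmup followed by linear decay.

```python
import math

train_samples = 25_580   # training set size from this card
batch_size = 64          # per_device_train_batch_size (single device assumed)
epochs = 3               # num_train_epochs
warmup_ratio = 0.05
base_lr = 5e-05          # learning_rate

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step):
    """Learning rate at a given optimizer step (linear warmup, then linear decay)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


print(steps_per_epoch, total_steps, warmup_steps)
print(linear_lr(warmup_steps), linear_lr(total_steps))
```

The resulting 400 steps per epoch and 1,200 total steps agree with the step counts in the training logs, where step 400 corresponds to epoch 1.0 and step 1200 to epoch 3.0.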

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-search-multilingual-base-v1-eval_spearman_cosine allstats-search-multilingual-base-v1-test_spearman_cosine
0 0 - 1.3012 0.7447 -
0.05 20 0.9548 0.3980 0.7961 -
0.1 40 0.3959 0.3512 0.7993 -
0.15 60 0.1949 0.3102 0.8016 -
0.2 80 0.2126 0.4306 0.7967 -
0.25 100 0.2228 0.2865 0.8026 -
0.3 120 0.1306 0.2476 0.8035 -
0.35 140 0.172 0.2592 0.8014 -
0.4 160 0.1619 0.2495 0.8037 -
0.45 180 0.1416 0.1890 0.8046 -
0.5 200 0.1041 0.1717 0.8059 -
0.55 220 0.2145 0.2165 0.8049 -
0.6 240 0.0459 0.2176 0.8036 -
0.65 260 0.0627 0.2670 0.8023 -
0.7 280 0.1132 0.2309 0.8041 -
0.75 300 0.1048 0.2623 0.8028 -
0.8 320 0.0524 0.2328 0.8031 -
0.85 340 0.034 0.2580 0.8024 -
0.9 360 0.0664 0.2309 0.8034 -
0.95 380 0.0623 0.1746 0.8053 -
1.0 400 0.0402 0.2126 0.8041 -
1.05 420 0.0459 0.1660 0.8062 -
1.1 440 0.0739 0.1487 0.8068 -
1.15 460 0.0191 0.1595 0.8066 -
1.2 480 0.0073 0.1509 0.8066 -
1.25 500 0.0265 0.1779 0.8062 -
1.3 520 0.0325 0.2646 0.8032 -
1.35 540 0.0536 0.2818 0.8030 -
1.4 560 0.0076 0.1768 0.8057 -
1.45 580 0.011 0.1866 0.8054 -
1.5 600 0.0181 0.1726 0.8057 -
1.55 620 0.032 0.1881 0.8052 -
1.6 640 0.0459 0.1482 0.8066 -
1.65 660 0.041 0.1571 0.8065 -
1.7 680 0.0228 0.1298 0.807 -
1.75 700 0.0275 0.1571 0.8067 -
1.8 720 0.0 0.1624 0.8066 -
1.85 740 0.0218 0.1537 0.8068 -
1.9 760 0.0241 0.1699 0.8062 -
1.95 780 0.0065 0.1841 0.8059 -
2.0 800 0.0073 0.1805 0.8061 -
2.05 820 0.0 0.1703 0.8064 -
2.1 840 0.0 0.1702 0.8064 -
2.15 860 0.0 0.1717 0.8064 -
2.2 880 0.0 0.1717 0.8064 -
2.25 900 0.0 0.1717 0.8064 -
2.3 920 0.0097 0.1875 0.8059 -
2.35 940 0.0148 0.1868 0.8060 -
2.4 960 0.0067 0.2205 0.8051 -
2.45 980 0.0 0.2295 0.8049 -
2.5 1000 0.0154 0.2238 0.8052 -
2.55 1020 0.0063 0.2125 0.8055 -
2.6 1040 0.0 0.2183 0.8053 -
2.65 1060 0.0 0.2188 0.8053 -
2.7 1080 0.0068 0.2082 0.8056 -
2.75 1100 0.0384 0.1770 0.8060 -
2.8 1120 0.0 0.1645 0.8061 -
2.85 1140 0.0105 0.1613 0.8061 -
2.9 1160 0.0 0.1601 0.8061 -
2.95 1180 0.0 0.1601 0.8062 -
3.0 1200 0.0 0.1601 0.8062 -
-1 -1 - - - 0.8093
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Model size: 278M params (Safetensors, tensor type F32)