SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-synthetic-dataset-v1 dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: allstats-semantic-synthetic-dataset-v1

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
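
The Transformer module truncates inputs at 128 tokens, and the Pooling module averages the token embeddings (pooling_mode_mean_tokens) into a single 768-dimensional sentence vector. As a rough sketch of that pooling step, assuming the checkpoint also loads with plain transformers:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("yahyaabd/allstats-semantic-base-v1-2")
encoder = AutoModel.from_pretrained("yahyaabd/allstats-semantic-base-v1-2")

batch = tokenizer(["Statistik harga Ternate 2012"], padding=True,
                  truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling: average the token embeddings, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])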

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-base-v1-2")
# Run inference
sentences = [
    'Statistik harga Ternate 2012',
    'Indikator Ekonomi Agustus 2002',
    'Indeks Unit Value Ekspor Menurut Kode SITC Bulan Januari 2019',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
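
The same embeddings support semantic search over a document collection. A minimal sketch using util.semantic_search (the two corpus titles below are just placeholders):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yahyaabd/allstats-semantic-base-v1-2")

corpus = [
    "Indikator Ekonomi Agustus 2002",
    "Indeks Unit Value Ekspor Menurut Kode SITC Bulan Januari 2019",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("Statistik harga Ternate 2012", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))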

Evaluation

Metrics

Semantic Similarity

Metric           allstats-semantic-base-v1-eval   allstat-semantic-base-v1-test
pearson_cosine   0.9869                           0.9868
spearman_cosine  0.9277                           0.9257
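
Both metrics correlate the model's cosine similarities with the gold 0-1 labels: pearson_cosine measures linear agreement, spearman_cosine rank agreement. A sketch of the computation, using three pairs taken from the evaluation samples further below:

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("yahyaabd/allstats-semantic-base-v1-2")

queries = [
    "SBH Aceh 2018: Meulaboh, Banda Aceh, Lhokseumawe",
    "ekspor produk indonesia juli 2018 per negara",
    "peternakan sapi di jawa tengah 2011",
]
docs = [
    "Survei Biaya Hidup (SBH) 2018 Meulaboh, Banda Aceh, dan Lhokseumawe",
    "Direktori Perusahaan Pertambangan Besar 2013",
    "Laporan Bulanan Data Sosial Ekonomi Juli 2024",
]
labels = np.array([0.9, 0.07, 0.07])  # gold similarity scores

q = model.encode(queries, normalize_embeddings=True)
d = model.encode(docs, normalize_embeddings=True)
cosine_scores = (q * d).sum(axis=1)  # cosine similarity per pair

print("pearson_cosine: ", pearsonr(cosine_scores, labels)[0])
print("spearman_cosine:", spearmanr(cosine_scores, labels)[0])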

Training Details

Training Dataset

allstats-semantic-synthetic-dataset-v1

  • Dataset: allstats-semantic-synthetic-dataset-v1 at e73718f
  • Size: 123,637 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

             query               doc                 label
    type     string              string              float
    details  min: 5 tokens       min: 5 tokens       min: 0.0
             mean: 10.59 tokens  mean: 14.29 tokens  mean: 0.5
             max: 34 tokens      max: 56 tokens      max: 1.0

  • Samples:
    query: Analisis upah tenaga kerja ekonomi kreatif
    doc:   Upah Tenaga Kerja Ekonomi Kreatif 2011-2016
    label: 0.88

    query: cari data persentase rumah tangga yang menggunakan listrik pln menurut provinsi dari 1993 sampai 2022.
    doc:   Persentase Rumah Tangga menurut Provinsi dan Sumber Penerangan Listrik PLN, 1993-2022
    label: 0.93

    query: apakah ada tabel yang menunjukkan ekspor minyak mentah ke negara tujuan utama tahun 2000-2023?
    doc:   IHK dan Rata-rata Upah per Bulan Buruh Peternakan dan Perikanan di Bawah Mandor (Supervisor), 2012-2014 (2012=100)
    label: 0.13
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
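
CosineSimilarityLoss computes the cosine similarity of each (query, doc) embedding pair and regresses it onto the float label with mean-squared error. A minimal sketch of the objective:

import torch
import torch.nn.functional as F

def cosine_similarity_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                           labels: torch.Tensor) -> torch.Tensor:
    # Predicted similarity for each pair, in [-1, 1]
    sims = F.cosine_similarity(query_emb, doc_emb, dim=-1)
    # Regress the predictions onto the 0-1 gold labels (MSELoss)
    return F.mse_loss(sims, labels)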
    

Evaluation Dataset

allstats-semantic-synthetic-dataset-v1

  • Dataset: allstats-semantic-synthetic-dataset-v1 at e73718f
  • Size: 26,494 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:

             query               doc                 label
    type     string              string              float
    details  min: 5 tokens       min: 4 tokens       min: 0.0
             mean: 10.66 tokens  mean: 13.94 tokens  mean: 0.49
             max: 31 tokens      max: 70 tokens      max: 1.0

  • Samples:
    query: SBH Aceh 2018: Meulaboh, Banda Aceh, Lhokseumawe
    doc:   Survei Biaya Hidup (SBH) 2018 Meulaboh, Banda Aceh, dan Lhokseumawe
    label: 0.9

    query: ekspor produk indonesia juli 2018 per negara
    doc:   Direktori Perusahaan Pertambangan Besar 2013
    label: 0.07

    query: peternakan sapi di jawa tengah 2011
    doc:   Laporan Bulanan Data Sosial Ekonomi Juli 2024
    label: 0.07
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 24
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_num_workers: 4
  • load_best_model_at_end: True
  • label_smoothing_factor: 0.1
  • eval_on_start: True
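
These non-default values plug directly into a SentenceTransformerTrainer run. A sketch of the training setup, assuming the dataset is loadable from the Hub under its card name (the Hub id and split names are assumptions):

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
dataset = load_dataset("yahyaabd/allstats-semantic-synthetic-dataset-v1",  # assumed Hub id
                       revision="e73718f")

args = SentenceTransformerTrainingArguments(
    output_dir="allstats-semantic-base-v1-2",
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=24,
    warmup_ratio=0.1,
    fp16=True,
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    label_smoothing_factor=0.1,
    eval_on_start=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # split name is an assumption
    loss=CosineSimilarityLoss(model),
)
trainer.train()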

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 24
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.1
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-base-v1-eval_spearman_cosine allstat-semantic-base-v1-test_spearman_cosine
0 0 - 0.0942 0.6574 -
0.2588 500 0.0449 0.0262 0.7353 -
0.5176 1000 0.0232 0.0185 0.7592 -
0.7764 1500 0.0172 0.0154 0.7760 -
1.0352 2000 0.0153 0.0137 0.7905 -
1.2940 2500 0.0124 0.0130 0.7920 -
1.5528 3000 0.0119 0.0120 0.8048 -
1.8116 3500 0.0121 0.0121 0.8021 -
2.0704 4000 0.0114 0.0112 0.8018 -
2.3292 4500 0.0093 0.0117 0.7996 -
2.5880 5000 0.0097 0.0105 0.8133 -
2.8468 5500 0.0092 0.0103 0.8137 -
3.1056 6000 0.0085 0.0094 0.8247 -
3.3644 6500 0.0068 0.0090 0.8326 -
3.6232 7000 0.0073 0.0092 0.8273 -
3.8820 7500 0.0070 0.0084 0.8404 -
4.1408 8000 0.0061 0.0083 0.8381 -
4.3996 8500 0.0057 0.0082 0.8382 -
4.6584 9000 0.0056 0.0074 0.8458 -
4.9172 9500 0.0057 0.0073 0.8468 -
5.1760 10000 0.0045 0.0071 0.8508 -
5.4348 10500 0.0041 0.0069 0.8579 -
5.6936 11000 0.0047 0.0069 0.8471 -
5.9524 11500 0.0046 0.0067 0.8554 -
6.2112 12000 0.0034 0.0062 0.8616 -
6.4700 12500 0.0034 0.0063 0.8636 -
6.7288 13000 0.0036 0.0062 0.8649 -
6.9876 13500 0.0037 0.0063 0.8641 -
7.2464 14000 0.0027 0.0059 0.8691 -
7.5052 14500 0.0027 0.0060 0.8733 -
7.7640 15000 0.0031 0.0060 0.8748 -
8.0228 15500 0.0028 0.0058 0.8736 -
8.2816 16000 0.0023 0.0055 0.8785 -
8.5404 16500 0.0025 0.0054 0.8801 -
8.7992 17000 0.0024 0.0058 0.8809 -
9.0580 17500 0.0026 0.0058 0.8811 -
9.3168 18000 0.0020 0.0055 0.8824 -
9.5756 18500 0.0020 0.0053 0.8859 -
9.8344 19000 0.0021 0.0053 0.8851 -
10.0932 19500 0.0019 0.0055 0.8904 -
10.3520 20000 0.0016 0.0052 0.8946 -
10.6108 20500 0.0017 0.0057 0.8884 -
10.8696 21000 0.0019 0.0055 0.8889 -
11.1284 21500 0.0016 0.0052 0.8942 -
11.3872 22000 0.0014 0.0053 0.8961 -
11.6460 22500 0.0016 0.0053 0.8928 -
11.9048 23000 0.0017 0.0051 0.8947 -
12.1636 23500 0.0013 0.0050 0.9015 -
12.4224 24000 0.0012 0.0059 0.8886 -
12.6812 24500 0.0014 0.0051 0.9030 -
12.9400 25000 0.0014 0.0051 0.9012 -
13.1988 25500 0.0011 0.0050 0.9037 -
13.4576 26000 0.0011 0.0050 0.9053 -
13.7164 26500 0.0011 0.0049 0.9060 -
13.9752 27000 0.0011 0.0049 0.9086 -
14.2340 27500 0.0010 0.0048 0.9063 -
14.4928 28000 0.0010 0.0051 0.9056 -
14.7516 28500 0.0010 0.0051 0.9079 -
15.0104 29000 0.0011 0.0049 0.9080 -
15.2692 29500 0.0008 0.0048 0.9126 -
15.5280 30000 0.0008 0.0049 0.9112 -
15.7867 30500 0.0008 0.0049 0.9123 -
16.0455 31000 0.0008 0.0048 0.9133 -
16.3043 31500 0.0006 0.0048 0.9103 -
16.5631 32000 0.0007 0.0049 0.9144 -
16.8219 32500 0.0008 0.0048 0.9143 -
17.0807 33000 0.0007 0.0048 0.9159 -
17.3395 33500 0.0007 0.0047 0.9174 -
17.5983 34000 0.0006 0.0048 0.9175 -
17.8571 34500 0.0007 0.0047 0.9163 -
18.1159 35000 0.0006 0.0046 0.9195 -
18.3747 35500 0.0006 0.0047 0.9190 -
18.6335 36000 0.0006 0.0047 0.9192 -
18.8923 36500 0.0006 0.0047 0.9204 -
19.1511 37000 0.0005 0.0047 0.9219 -
19.4099 37500 0.0004 0.0046 0.9218 -
19.6687 38000 0.0005 0.0047 0.9221 -
19.9275 38500 0.0005 0.0046 0.9230 -
20.1863 39000 0.0005 0.0046 0.9233 -
20.4451 39500 0.0004 0.0046 0.9240 -
20.7039 40000 0.0005 0.0047 0.9234 -
20.9627 40500 0.0004 0.0047 0.9241 -
21.2215 41000 0.0004 0.0046 0.9253 -
21.4803 41500 0.0004 0.0046 0.9259 -
21.7391 42000 0.0004 0.0046 0.9262 -
21.9979 42500 0.0004 0.0046 0.9263 -
22.2567 43000 0.0003 0.0046 0.9266 -
22.5155 43500 0.0003 0.0046 0.9266 -
22.7743 44000 0.0003 0.0046 0.9273 -
23.0331 44500 0.0003 0.0046 0.9273 -
23.2919 45000 0.0003 0.0046 0.9274 -
23.5507 45500 0.0003 0.0046 0.9277 -
23.8095 46000 0.0003 0.0046 0.9277 -
24.0 46368 - - - 0.9257
  • The bold row in the rendered card denotes the saved checkpoint; with load_best_model_at_end: True, the best evaluation checkpoint is kept rather than the last one.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0
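
To reproduce this environment, one option is to pin the versions listed above:

pip install sentence-transformers==3.3.1 transformers==4.47.1 accelerate==1.2.1 datasets==3.2.0 tokenizers==0.21.0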

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}