SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 on the allstats-semantic-search-synthetic-dataset-v2-mini dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: allstats-semantic-search-synthetic-dataset-v2-mini

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
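
The architecture above is a BERT token encoder followed by mean pooling. As a minimal sketch (not the author's training script; it only assumes the public base checkpoint), the same two-module stack can be assembled by hand:

from sentence_transformers import SentenceTransformer, models

# Token-level encoder: a BertModel that truncates inputs at 128 tokens.
word_embedding_model = models.Transformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    max_seq_length=128,
)
# Mean pooling over token embeddings yields one 384-dimensional sentence vector.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])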

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-search-mini-model-v2-2")
# Run inference
sentences = [
    'Perdagangan luar negeri, impor, Oktober 2020',
    'Indikator Ekonomi November 1999',
    'Indikator Ekonomi September 2005',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
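
For retrieval over a document collection, the embeddings can be used directly for semantic search. A minimal sketch, where the corpus is a hypothetical set of publication titles (substitute your own documents):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yahyaabd/allstats-semantic-search-mini-model-v2-2")

# Hypothetical corpus of publication titles.
corpus = [
    "Indikator Ekonomi November 1999",
    "Statistik Pendidikan 2006",
    "Booklet Survei Angkatan Kerja Nasional Agustus 2021",
]
query = "data angkatan kerja Agustus 2021"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank the corpus by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))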

Evaluation

Metrics

Semantic Similarity

Metric           allstats-semantic-search-mini-v2-eval   allstat-semantic-search-mini-v2-test
pearson_cosine   0.9617                                   0.9605
spearman_cosine  0.8518                                   0.8481
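
pearson_cosine and spearman_cosine are the Pearson and Spearman rank correlations between the cosine similarity of each (query, doc) embedding pair and its gold label. A sketch of how such metrics are computed with the library's evaluator; the pairs below are illustrative placeholders, not the actual evaluation split:

from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("yahyaabd/allstats-semantic-search-mini-model-v2-2")

# Placeholder pairs; in practice, pass the eval split's query, doc, and label columns.
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["Analisis data angkatan kerja Agustus 2021"],
    sentences2=["Booklet Survei Angkatan Kerja Nasional Agustus 2021"],
    scores=[0.9],
    main_similarity=SimilarityFunction.COSINE,
    name="allstats-semantic-search-mini-v2-eval",
)
results = evaluator(model)  # dict with keys like "..._pearson_cosine" and "..._spearman_cosine"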

Training Details

Training Dataset

allstats-semantic-search-synthetic-dataset-v2-mini

  • Dataset: allstats-semantic-search-synthetic-dataset-v2-mini at 8222b01
  • Size: 70,280 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 3 tokens, mean 10.92 tokens, max 50 tokens
    • doc (string): min 4 tokens, mean 14.68 tokens, max 59 tokens
    • label (float): min 0.0, mean 0.52, max 1.0
  • Samples:
    • query: "Statistik perusahaan pembudidaya tanaman kehutanan 2018" | doc: "Statistik Perusahaan Pembudidaya Tanaman Kehutanan 2018" | label: 0.97
    • query: "Berapa persen pertumbuhan PDB Indonesia pada Triwulan III Tahun 2002?" | doc: "Inflasi Bulan November 2002 Sebesar 1,85 %" | label: 0.0
    • query: "Perdagangan luar negeri Indonesia, impor 2019, jilid 2" | doc: "Pendataan Sapi Potong Sapi Perah (PSPK 2011) Sulawesi Barat" | label: 0.06
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
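
CosineSimilarityLoss computes the cosine similarity of each (query, doc) embedding pair and regresses it onto the float label with the MSE criterion shown above. A minimal sketch of the loss setup, starting from the base model as during finetuning:

from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# The default loss_fct is torch.nn.MSELoss, matching the parameters above.
train_loss = losses.CosineSimilarityLoss(model)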
    

Evaluation Dataset

allstats-semantic-search-synthetic-dataset-v2-mini

  • Dataset: allstats-semantic-search-synthetic-dataset-v2-mini at 8222b01
  • Size: 15,060 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 4 tokens, mean 10.96 tokens, max 48 tokens
    • doc (string): min 4 tokens, mean 14.74 tokens, max 70 tokens
    • label (float): min 0.0, mean 0.5, max 1.0
  • Samples:
    • query: "Review PDRB daerah di Pulau Sumatera 2010-2013" | doc: "Statistik Pendidikan 2006" | label: 0.12
    • query: "Analisis data angkatan kerja Agustus 2021" | doc: "Booklet Survei Angkatan Kerja Nasional Agustus 2021" | label: 0.9
    • query: "Berapa persen inflasi yang terjadi pada Juli 2015?" | doc: "Inflasi pada bulan lain tidak disebutkan" | label: 0.0
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 24
  • warmup_ratio: 0.1
  • bf16: True
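
These non-default values map one-to-one onto the library's training arguments. A minimal end-to-end sketch under stated assumptions: the output directory, the dataset repository id (yahyaabd/...), and the train/test split names are guesses, not taken from this card:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# Assumed dataset repository id; the revision matches the one cited above.
dataset = load_dataset(
    "yahyaabd/allstats-semantic-search-synthetic-dataset-v2-mini", revision="8222b01"
)

args = SentenceTransformerTrainingArguments(
    output_dir="models/allstats-semantic-search-mini-model-v2-2",  # hypothetical path
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=24,
    warmup_ratio=0.1,
    bf16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],  # split names assumed
    eval_dataset=dataset["test"],
    loss=losses.CosineSimilarityLoss(model),
)
trainer.train()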

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 24
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss allstats-semantic-search-mini-v2-eval_spearman_cosine allstat-semantic-search-mini-v2-test_spearman_cosine
0.4550 500 0.0643 0.0413 0.6996 -
0.9099 1000 0.0348 0.0280 0.7533 -
1.3649 1500 0.0254 0.0238 0.7737 -
1.8198 2000 0.0223 0.0205 0.7831 -
2.2748 2500 0.0181 0.0197 0.7894 -
2.7298 3000 0.0173 0.0184 0.7876 -
3.1847 3500 0.0152 0.0170 0.7954 -
3.6397 4000 0.0123 0.0175 0.7970 -
4.0946 4500 0.0125 0.0163 0.8118 -
4.5496 5000 0.01 0.0161 0.8047 -
5.0045 5500 0.0103 0.0157 0.8126 -
5.4595 6000 0.0079 0.0150 0.8224 -
5.9145 6500 0.0087 0.0156 0.8219 -
6.3694 7000 0.0071 0.0152 0.8145 -
6.8244 7500 0.0068 0.0153 0.8172 -
7.2793 8000 0.0061 0.0147 0.8216 -
7.7343 8500 0.0062 0.0146 0.8267 -
8.1893 9000 0.0055 0.0145 0.8325 -
8.6442 9500 0.005 0.0146 0.8335 -
9.0992 10000 0.0052 0.0143 0.8356 -
9.5541 10500 0.0043 0.0144 0.8313 -
10.0091 11000 0.0051 0.0144 0.8362 -
10.4641 11500 0.004 0.0145 0.8376 -
10.9190 12000 0.0039 0.0142 0.8331 -
11.3740 12500 0.0034 0.0141 0.8397 -
11.8289 13000 0.0033 0.0140 0.8398 -
12.2839 13500 0.0032 0.0143 0.8411 -
12.7389 14000 0.003 0.0141 0.8407 -
13.1938 14500 0.0031 0.0141 0.8379 -
13.6488 15000 0.0026 0.0141 0.8419 -
14.1037 15500 0.0028 0.0141 0.8442 -
14.5587 16000 0.0023 0.0138 0.8455 -
15.0136 16500 0.0025 0.0147 0.8359 -
15.4686 17000 0.0021 0.0141 0.8459 -
15.9236 17500 0.0023 0.0140 0.8433 -
16.3785 18000 0.002 0.0139 0.8465 -
16.8335 18500 0.002 0.0139 0.8461 -
17.2884 19000 0.0018 0.0139 0.8482 -
17.7434 19500 0.0018 0.0138 0.8477 -
18.1984 20000 0.0017 0.0138 0.8503 -
18.6533 20500 0.0016 0.0136 0.8493 -
19.1083 21000 0.0016 0.0139 0.8501 -
19.5632 21500 0.0015 0.0138 0.8478 -
20.0182 22000 0.0015 0.0139 0.8501 -
20.4732 22500 0.0013 0.0139 0.8508 -
20.9281 23000 0.0015 0.0139 0.8511 -
21.3831 23500 0.0013 0.0139 0.8517 -
21.8380 24000 0.0013 0.0139 0.8512 -
22.2930 24500 0.0012 0.0139 0.8512 -
22.7480 25000 0.0012 0.0138 0.8520 -
23.2029 25500 0.0012 0.0139 0.8520 -
23.6579 26000 0.0011 0.0139 0.8518 -
24.0 26376 - - - 0.8481

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.2.0
  • Tokenizers: 0.19.1
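
To approximately reproduce this environment, the versions above can be pinned at install time (an optional suggestion, not part of the original card):

pip install "sentence-transformers==3.3.1" "transformers==4.44.2" "accelerate==0.34.2" "datasets==3.2.0"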

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}