metadata
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
datasets:
  - yahyaabd/allstats-semantic-dataset-v4
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:73392
  - loss:CosineSimilarityLoss
widget:
  - source_sentence: >-
      Berapa persen kenaikan Indeks Harga Perdagangan Besar (IHPB) Umum Nasional
      pada bulan April 2021?
    sentences:
      - Statistik Kriminal 2023
      - Ekonomi Indonesia Triwulan I-2021 turun 0,74 persen (y-on-y)
      - Survei Biaya Hidup (SBH) 2018 Ambon dan Tual
  - source_sentence: Usaha pertanian sampingan di Indonesia tahun 2022
    sentences:
      - Analisis Hasil Survei Dampak Covid-19 Terhadap Pelaku Usaha
      - Direktori Usaha Pertanian Lainnya 2022
      - EksporImpor September 2018
  - source_sentence: Pertumbuhan industri Indonesia 2006-2009
    sentences:
      - Pertumbuhan Produksi IBS Triwulan III 2019 Naik 4,35 Persen
      - Indikator Ekonomi April 2000
      - Perkembangan Indeks Produksi Industri Besar dan Sedang 2006 - 2009
  - source_sentence: 'Sensus ekonomi Kalbar 2016: data usaha'
    sentences:
      - Pertumbuhan ekonomi Indonesia tahun 2022
      - Buletin Statistik Perdagangan Luar Negeri Impor November 2017
      - Data jumlah wisatawan mancanegara 2019
  - source_sentence: Direktori perusahaan pengelola hutan 2015
    sentences:
      - >-
        Buletin Statistik Perdagangan Luar Negeri Ekspor Menurut Kelompok
        Komoditi dan Negara, April 2017
      - Direktori Perusahaan Kehutanan 2015
      - >-
        Indeks Pembangunan Manusia (IPM) Indonesia tahun 2024 mencapai 75,02,
        meningkat 0,63 poin atau 0,85 persen dibandingkan tahun sebelumnya yang
        sebesar 74,39.
model-index:
  - name: >-
      SentenceTransformer based on
      sentence-transformers/paraphrase-multilingual-mpnet-base-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic mpnet v2 eval
          type: allstats-semantic-mpnet-v2-eval
        metrics:
          - type: pearson_cosine
            value: 0.9437560461787071
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.7866108512073439
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: allstats semantic mpnet v2 test
          type: allstats-semantic-mpnet-v2-test
        metrics:
          - type: pearson_cosine
            value: 0.9433638771860691
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.7869770777792755
            name: Spearman Cosine

SentenceTransformer based on sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on the allstats-semantic-dataset-v4 dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

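Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Training Dataset: allstats-semantic-dataset-v4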

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
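
The Pooling module above has pooling_mode_mean_tokens set to True, so the token embeddings from the transformer are averaged into a single sentence vector. The following is a minimal NumPy sketch of masked mean pooling, not the library's own code: the function name `mean_pool` and the tiny 4-dimensional toy batch are assumptions made for illustration (the real model produces 768 dimensions).

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim) float array
    attention_mask:   (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token slots, 4 dimensions.
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],   # last slot is padding
                 [1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)  # (2, 4)
```

Padding positions are masked out before averaging, so a short sentence is not diluted by the padding added to reach the batch length.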

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-semantic-mpnet-v2")
# Run inference
sentences = [
    'Direktori perusahaan pengelola hutan 2015',
    'Direktori Perusahaan Kehutanan 2015',
    'Indeks Pembangunan Manusia (IPM) Indonesia tahun 2024 mencapai 75,02, meningkat 0,63 poin atau 0,85 persen dibandingkan tahun sebelumnya yang sebesar 74,39.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
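
For the query-to-title matching shown in the examples, a common pattern is to rank candidate titles against a single query rather than build the full similarity matrix. The sketch below illustrates that ranking step with small hand-made vectors standing in for `model.encode(...)` output, so it runs without downloading the model. The helper name `rank_by_cosine` and the toy vectors are assumptions, not part of the library.

```python
import numpy as np

def rank_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray) -> list[tuple[int, float]]:
    """Return (doc_index, cosine_similarity) pairs, best match first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)         # descending by score
    return [(int(i), float(scores[i])) for i in order]

# Stand-in vectors; in practice these would come from model.encode(...).
query = np.array([1.0, 0.0, 1.0])
docs = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.0, 2.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # partially aligned
])
ranking = rank_by_cosine(query, docs)
print([i for i, _ in ranking])  # [1, 2, 0]
```

With the real model, `query` would be the embedding of the search text and `docs` the embeddings of the candidate titles.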

Evaluation

Metrics

Semantic Similarity

Metric           allstats-semantic-mpnet-v2-eval  allstats-semantic-mpnet-v2-test
pearson_cosine   0.9438                           0.9434
spearman_cosine  0.7866                           0.7870
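
Both metrics compare the model's cosine similarities with the gold labels: pearson_cosine measures linear agreement, while spearman_cosine measures agreement in rank order. The NumPy-only sketch below shows how the two can diverge. The scores and labels are made up for illustration (they are not taken from this model's evaluation), and `average_ranks`, `pearson`, and `spearman` are hypothetical helper names.

```python
import numpy as np

def average_ranks(values: np.ndarray) -> np.ndarray:
    """1-based ranks; tied values share the mean of their positions."""
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x: np.ndarray, y: np.ndarray) -> float:
    # Spearman correlation is Pearson correlation computed on the ranks.
    return pearson(average_ranks(x), average_ranks(y))

# Made-up values for illustration only.
cosine_scores = np.array([0.05, 0.10, 0.80, 0.90, 0.95])
gold_labels = np.array([0.10, 0.08, 0.88, 1.00, 0.93])

print(round(pearson(cosine_scores, gold_labels), 4))
print(round(spearman(cosine_scores, gold_labels), 4))
```

In this toy data the scores track the labels almost linearly, but two neighbouring pairs are in swapped order, which lowers the rank correlation more than the linear one. This is one way a gap like the one in the table can arise.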

Training Details

Training Dataset

allstats-semantic-dataset-v4

  • Dataset: allstats-semantic-dataset-v4 at 0e15dee
  • Size: 73,392 training samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 5 tokens, mean 11.28 tokens, max 34 tokens
    • doc (string): min 5 tokens, mean 14.71 tokens, max 58 tokens
    • label (float): min 0.0, mean 0.48, max 1.0
  • Samples:
    • query: Data bisnis Kalbar sensus 2016
      doc: Indikator Ekonomi Oktober 2012
      label: 0.1
    • query: Informasi tentang pola pengeluaran masyarakat Bengkulu berdasarkan kelompok pendapatan?
      doc: Rata-rata Konsumsi dan Pengeluaran Perkapita Seminggu Menurut Komoditi Makanan dan Golongan Pengeluaran per Kapita Seminggu di Provinsi Bengkulu, 2018-2023
      label: 0.88
    • query: Laopran keuagnan lmebaga non proft 20112-013
      doc: Neraca Lembaga Non Profit yang Melayani Rumah Tangga 2011-2013
      label: 0.93
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
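
With MSELoss as loss_fct, CosineSimilarityLoss compares the cosine similarity of each (query, doc) embedding pair with its float label using mean squared error. A minimal NumPy sketch of that computation on toy vectors follows. The function name `cosine_similarity_mse` and the example values are assumptions for illustration; this is not the library's implementation.

```python
import numpy as np

def cosine_similarity_mse(query_emb: np.ndarray, doc_emb: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared error between pairwise cosine similarity and gold labels."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    cos = (q * d).sum(axis=1)          # one similarity per (query, doc) pair
    return float(np.mean((cos - labels) ** 2))

# Two toy pairs: parallel vectors (cos = 1) and orthogonal vectors (cos = 0).
query_emb = np.array([[1.0, 0.0], [1.0, 0.0]])
doc_emb = np.array([[2.0, 0.0], [0.0, 3.0]])
labels = np.array([0.9, 0.1])

loss = cosine_similarity_mse(query_emb, doc_emb, labels)
print(round(loss, 4))  # 0.01
```

Each pair is off by 0.1 from its label, so the squared error is 0.01 for both and the mean is 0.01.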
    

Evaluation Dataset

allstats-semantic-dataset-v4

  • Dataset: allstats-semantic-dataset-v4 at 0e15dee
  • Size: 15,726 evaluation samples
  • Columns: query, doc, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 4 tokens, mean 11.52 tokens, max 37 tokens
    • doc (string): min 5 tokens, mean 14.38 tokens, max 61 tokens
    • label (float): min 0.0, mean 0.49, max 1.0
  • Samples:
    • query: Data transportasi bulan Februari 2021
      doc: Tenaga Kerja Februari 2023
      label: 0.08
    • query: Sebear berspa prrsen eknaikan Inseks Hraga Predagangan eBsar (IHB) Umym Nasiona di aMret 202?
      doc: Maret 2020, Indeks Harga Perdagangan Besar (IHPB) Umum Nasional naik 0,10 persen
      label: 1.0
    • query: Data ekspor dan moda transportasi tahun 2018-2019
      doc: Indikator Pasar Tenaga Kerja Indonesia Agustus 2012
      label: 0.08
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 0.5
  • warmup_ratio: 0.1
  • fp16: True
  • dataloader_num_workers: 4
  • load_best_model_at_end: True
  • label_smoothing_factor: 0.05
  • eval_on_start: True
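
With warmup_ratio 0.1, together with the linear scheduler and the 5e-05 learning rate listed under All Hyperparameters, the learning rate rises over roughly the first tenth of training and then decays linearly to zero. The sketch below approximates that shape. It assumes 1,147 total optimizer steps (the last step in the training log) and a warmup length rounded up to a whole step; `lr_at_step` is a hypothetical helper, not the trainer's code.

```python
import math

BASE_LR = 5e-05        # learning_rate
WARMUP_RATIO = 0.1     # warmup_ratio
TOTAL_STEPS = 1147     # assumption: last step reported in the training log

# Assumption: warmup length is the ratio times total steps, rounded up.
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)

def lr_at_step(step: int) -> float:
    """Linear warmup from 0 to BASE_LR, then linear decay back to 0."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

for step in (0, WARMUP_STEPS, 500, 1000, TOTAL_STEPS):
    print(step, f"{lr_at_step(step):.2e}")
```

Under these assumptions the peak learning rate is reached at step 115, well before the first logged evaluation at step 250.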

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 0.5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.05
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step  Training Loss  Validation Loss  allstats-semantic-mpnet-v2-eval_spearman_cosine  allstats-semantic-mpnet-v2-test_spearman_cosine
0       0     -              0.1031           0.6244                                           -
0.1090  250   0.0442         0.0321           0.7393                                           -
0.2180  500   0.0295         0.0248           0.7641                                           -
0.3269  750   0.0259         0.0224           0.7733                                           -
0.4359  1000  0.0208         0.0199           0.7866                                           -
0.5     1147  -              -                -                                                0.7870
  • The saved checkpoint is the row at step 1000 (shown in bold in the original table).
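
The Epoch and Step columns follow from the training set size and the batch size. The short calculation below reproduces them. It assumes training on a single device, with gradient_accumulation_steps of 1 and dataloader_drop_last set to False as listed under All Hyperparameters, so the final partial batch counts as a step.

```python
import math

TRAIN_SAMPLES = 73_392   # size of the training split
BATCH_SIZE = 32          # per_device_train_batch_size (single device assumed)
NUM_EPOCHS = 0.5         # num_train_epochs

# dataloader_drop_last is False, so the final partial batch is kept.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = math.ceil(NUM_EPOCHS * steps_per_epoch)

print(steps_per_epoch)  # 2294
print(total_steps)      # 1147

# Fractional epoch reached at each logged step.
for step in (250, 500, 750, 1000):
    print(step, round(step / steps_per_epoch, 4))
```

The result of 1,147 total steps matches the final row of the log, and the fractional epochs agree with the Epoch column to four decimal places.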

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.48.0
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}