---
language:
  - id
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:6198
  - loss:CoSENTLoss
base_model: BAAI/bge-m3
widget:
  - source_sentence: Seekor kucing hitam dan putih yang sedang bermain dengan keranjang wol.
    sentences:
      - Dua ekor anjing berlari melintasi lapangan berumput.
      - Seorang pria mengiris bawang.
      - Seekor kucing hitam dan putih yang sedang berbaring di atas selimut.
  - source_sentence: Bintang-bintang memang berotasi, tapi itu bukan penyebab kestabilannya.
    sentences:
      - Seorang pria sedang bernyanyi dan memainkan gitar.
      - Tingkat pertumbuhan Uni Soviet selama tahun 50-an tidak terlalu tinggi.
      - Bintang berotasi karena momentum sudut gas yang membentuknya.
  - source_sentence: Hal penting yang saya coba ingat adalah, hanya memperhatikan.
    sentences:
      - Tiga orang wanita sedang duduk di dekat dinding.
      - >-
        Saya telah membaca tentang topik ini sejak saya mengajukan pertanyaan
        ini.
      - >-
        Untuk melatih diri Anda menggunakan pintasan keyboard, cabutlah mouse
        Anda selama beberapa hari.
  - source_sentence: Mari kita asumsikan data untuk gugus bola setara dengan data M13.
    sentences:
      - Wanita itu mengiris dagingnya.
      - Sebuah laptop dan PC di stasiun kerja.
      - >-
        Gugus bola menempati tempat yang menarik dalam spektrum sistem bintang
        komposit.
  - source_sentence: >-
      Jawaban singkatnya adalah: kita terbuat dari "materi" yang disumbangkan
      oleh banyak bintang.
    sentences:
      - Sebuah band sedang bermain di atas panggung.
      - >-
        Sangat tidak mungkin bahwa kita terbuat dari benda-benda yang hanya
        terbuat dari satu bintang.
      - Seorang wanita sedang mengiris brokoli.
datasets:
  - Pustekhan-ITB/stsb-indo-edu
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: SentenceTransformer based on BAAI/bge-m3
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: stsb indo edu dev
          type: stsb-indo-edu-dev
        metrics:
          - type: pearson_cosine
            value: 0.8432609269312235
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8580118610725878
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: stsb indo edu test
          type: stsb-indo-edu-test
        metrics:
          - type: pearson_cosine
            value: 0.8442709665997649
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8602630711004111
            name: Spearman Cosine
---

SentenceTransformer based on BAAI/bge-m3

This is a sentence-transformers model finetuned from BAAI/bge-m3 on the stsb-indo-edu dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-m3
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: stsb-indo-edu
  • Language: id

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
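Because the model ends in a Normalize() module, every embedding is scaled to unit L2 norm, so cosine similarity reduces to a plain dot product. A quick sanity check (the example sentence is an arbitrary placeholder):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")
# CLS-token pooling produces one 1024-dim vector per input;
# the Normalize() module then scales it to unit L2 norm.
emb = model.encode(["Seekor kucing hitam dan putih."])  # arbitrary example
print(emb.shape)               # (1, 1024)
print(np.linalg.norm(emb[0]))  # ~1.0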

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")
# Run inference
sentences = [
    'Jawaban singkatnya adalah: kita terbuat dari "materi" yang disumbangkan oleh banyak bintang.',
    'Sangat tidak mungkin bahwa kita terbuat dari benda-benda yang hanya terbuat dari satu bintang.',
    'Sebuah band sedang bermain di atas panggung.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
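Beyond pairwise scoring, the same embeddings support semantic search. Below is a minimal sketch using util.semantic_search; the corpus is taken from the widget examples above, and the query is an illustrative placeholder:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")

corpus = [
    "Seorang pria sedang bernyanyi dan memainkan gitar.",
    "Dua ekor anjing berlari melintasi lapangan berumput.",
    "Sebuah band sedang bermain di atas panggung.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("Musisi tampil di konser.", convert_to_tensor=True)  # illustrative query

# Rank corpus sentences by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))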

Evaluation

Metrics

Semantic Similarity

| Metric          | stsb-indo-edu-dev | stsb-indo-edu-test |
|-----------------|-------------------|--------------------|
| pearson_cosine  | 0.8433            | 0.8443             |
| spearman_cosine | 0.8580            | 0.8603             |
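These figures can be reproduced with the library's EmbeddingSimilarityEvaluator. A minimal sketch, assuming the dataset exposes a test split with the sentence1/sentence2/score columns described under Training Details:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")
test = load_dataset("Pustekhan-ITB/stsb-indo-edu", split="test")  # assumed split name

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=test["sentence1"],
    sentences2=test["sentence2"],
    scores=test["score"],
    name="stsb-indo-edu-test",
)
print(evaluator(model))  # dict of Pearson/Spearman cosine correlations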

Training Details

Training Dataset

stsb-indo-edu

  • Dataset: stsb-indo-edu at 2c5aa12
  • Size: 6,198 training samples
  • Columns: sentence1, sentence2, and score
  • Approximate statistics based on the first 1000 samples:

    |         | sentence1                                         | sentence2                                         | score                          |
    |---------|---------------------------------------------------|---------------------------------------------------|--------------------------------|
    | type    | string                                            | string                                            | float                          |
    | details | min: 6 tokens, mean: 10.95 tokens, max: 28 tokens | min: 6 tokens, mean: 10.81 tokens, max: 30 tokens | min: 0.0, mean: 0.46, max: 1.0 |
  • Samples:

    | sentence1 | sentence2 | score |
    |-----------|-----------|-------|
    | Pelajaran menari daerah membantu siswa SD melestarikan kebudayaan lokal | Tarian ini sering dipentaskan saat perayaan hari besar | 0.76 |
    | Sebelum ujian sekolah, guru memberikan bimbingan belajar tambahan secara gratis | Upaya ini agar seluruh siswa siap menghadapi ujian | 0.85 |
    | Beberapa SD terletak di daerah pegunungan, sehingga siswa harus berjalan kaki cukup jauh | Ini melatih kemandirian dan fisik yang kuat | 0.63 |
  • Loss: CoSENTLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_cos_sim"
    }
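For reference, CoSENT is a pairwise ranking objective: whenever two pairs have gold scores with s(i, j) > s(k, l), the cosine similarity of the higher-scored pair should come out larger. A sketch of the published formulation, with λ corresponding to the scale of 20.0 configured above:

\mathcal{L} = \log\!\left(1 + \sum_{s(i,j) > s(k,l)} \exp\!\big(\lambda \left[\cos(u_k, u_l) - \cos(u_i, u_j)\right]\big)\right), \qquad \lambda = 20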
    

Evaluation Dataset

stsb-indo-edu

  • Dataset: stsb-indo-edu at 2c5aa12
  • Size: 1,536 evaluation samples
  • Columns: sentence1, sentence2, and score
  • Approximate statistics based on the first 1000 samples:

    |         | sentence1                                         | sentence2                                         | score                          |
    |---------|---------------------------------------------------|---------------------------------------------------|--------------------------------|
    | type    | string                                            | string                                            | float                          |
    | details | min: 5 tokens, mean: 15.96 tokens, max: 44 tokens | min: 6 tokens, mean: 15.97 tokens, max: 47 tokens | min: 0.0, mean: 0.42, max: 1.0 |
  • Samples:

    | sentence1 | sentence2 | score |
    |-----------|-----------|-------|
    | Seorang pria dengan topi keras sedang menari. | Seorang pria yang mengenakan topi keras sedang menari. | 1.0 |
    | Seorang anak kecil sedang menunggang kuda. | Seorang anak sedang menunggang kuda. | 0.95 |
    | Seorang pria sedang memberi makan seekor tikus kepada seekor ular. | Pria itu sedang memberi makan seekor tikus kepada ular. | 1.0 |
  • Loss: CoSENTLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "pairwise_cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates
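A minimal sketch of how these non-default hyperparameters map onto a SentenceTransformerTrainer run; the output directory and split names are assumptions, not taken from the original run:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CoSENTLoss
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    BatchSamplers,
)

model = SentenceTransformer("BAAI/bge-m3")
dataset = load_dataset("Pustekhan-ITB/stsb-indo-edu")
loss = CoSENTLoss(model, scale=20.0)  # pairwise_cos_sim is the default similarity_fct

args = SentenceTransformerTrainingArguments(
    output_dir="indoedubert-bge-m3-exp2",  # assumed
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # assumed split name
    loss=loss,
)
trainer.train()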

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch  | Step | Training Loss | Validation Loss | stsb-indo-edu-dev_spearman_cosine | stsb-indo-edu-test_spearman_cosine |
|--------|------|---------------|-----------------|-----------------------------------|------------------------------------|
| -1     | -1   | -             | -               | 0.8096                            | -                                  |
| 0.5155 | 100  | 6.0081        | 5.7898          | 0.8580                            | -                                  |
| -1     | -1   | -             | -               | -                                 | 0.8603                             |

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.5.1
  • Accelerate: 1.3.0
  • Datasets: 3.3.0
  • Tokenizers: 0.21.0
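To recreate this environment, the versions above can be pinned at install time (a convenience sketch; newer compatible releases should also load the model):

pip install sentence-transformers==3.4.1 transformers==4.48.3 torch==2.5.1 accelerate==1.3.0 datasets==3.3.0 tokenizers==0.21.0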

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}