---
language:
- id
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:6198
- loss:CoSENTLoss
base_model: BAAI/bge-m3
widget:
- source_sentence: Seekor kucing hitam dan putih yang sedang bermain dengan keranjang wol.
  sentences:
  - Dua ekor anjing berlari melintasi lapangan berumput.
  - Seorang pria mengiris bawang.
  - Seekor kucing hitam dan putih yang sedang berbaring di atas selimut.
- source_sentence: Bintang-bintang memang berotasi, tapi itu bukan penyebab kestabilannya.
  sentences:
  - Seorang pria sedang bernyanyi dan memainkan gitar.
  - Tingkat pertumbuhan Uni Soviet selama tahun 50-an tidak terlalu tinggi.
  - Bintang berotasi karena momentum sudut gas yang membentuknya.
- source_sentence: Hal penting yang saya coba ingat adalah, hanya memperhatikan.
  sentences:
  - Tiga orang wanita sedang duduk di dekat dinding.
  - >-
    Saya telah membaca tentang topik ini sejak saya mengajukan pertanyaan
    ini.
  - >-
    Untuk melatih diri Anda menggunakan pintasan keyboard, cabutlah mouse
    Anda selama beberapa hari.
- source_sentence: Mari kita asumsikan data untuk gugus bola setara dengan data M13.
  sentences:
  - Wanita itu mengiris dagingnya.
  - Sebuah laptop dan PC di stasiun kerja.
  - >-
    Gugus bola menempati tempat yang menarik dalam spektrum sistem bintang
    komposit.
- source_sentence: >-
    Jawaban singkatnya adalah: kita terbuat dari "materi" yang disumbangkan
    oleh banyak bintang.
  sentences:
  - Sebuah band sedang bermain di atas panggung.
  - >-
    Sangat tidak mungkin bahwa kita terbuat dari benda-benda yang hanya
    terbuat dari satu bintang.
  - Seorang wanita sedang mengiris brokoli.
datasets:
- Pustekhan-ITB/stsb-indo-edu
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
model-index:
- name: SentenceTransformer based on BAAI/bge-m3
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: stsb indo edu dev
      type: stsb-indo-edu-dev
    metrics:
    - type: pearson_cosine
      value: 0.8432609269312235
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8580118610725878
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: stsb indo edu test
      type: stsb-indo-edu-test
    metrics:
    - type: pearson_cosine
      value: 0.8442709665997649
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8602630711004111
      name: Spearman Cosine
---

# SentenceTransformer based on BAAI/bge-m3
This is a sentence-transformers model finetuned from BAAI/bge-m3 on the stsb-indo-edu dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details

### Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-m3
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset:
  - stsb-indo-edu
- Language: id
### Model Sources

- Documentation: [Sentence Transformers Documentation](https://www.sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
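The pipeline applies CLS-token pooling (`pooling_mode_cls_token: True`) followed by L2 normalization. A minimal sketch of reproducing this by hand with `transformers`, assuming the transformer weights sit at the repo root as sentence-transformers saves them:

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "Pustekhan-ITB/indoedubert-bge-m3-exp2"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)  # loads the underlying XLMRobertaModel

batch = tokenizer(["Seorang pria sedang bernyanyi."], return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # [batch, seq_len, 1024]

cls = hidden[:, 0]  # CLS-token pooling, per the Pooling module config
embedding = torch.nn.functional.normalize(cls, p=2, dim=1)  # the Normalize() step
print(embedding.shape)  # torch.Size([1, 1024])
```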
## Usage

### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")
# Run inference
sentences = [
    'Jawaban singkatnya adalah: kita terbuat dari "materi" yang disumbangkan oleh banyak bintang.',
    'Sangat tidak mungkin bahwa kita terbuat dari benda-benda yang hanya terbuat dari satu bintang.',
    'Sebuah band sedang bermain di atas panggung.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
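Since the final `Normalize()` module L2-normalizes every embedding, cosine similarity reduces to a dot product, which makes the model convenient for semantic search. A minimal sketch of ranking a small corpus against a query (the query is hypothetical; the candidates reuse the widget examples above):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")

# Hypothetical query; candidate sentences reuse the widget examples above
query = "Seekor kucing sedang bermain dengan benang wol."
corpus = [
    "Seekor kucing hitam dan putih yang sedang berbaring di atas selimut.",
    "Sebuah band sedang bermain di atas panggung.",
    "Seorang wanita sedang mengiris brokoli.",
]

query_embedding = model.encode(query)
corpus_embeddings = model.encode(corpus)

# model.similarity defaults to cosine similarity; with normalized
# embeddings this is equivalent to a dot product
scores = model.similarity(query_embedding, corpus_embeddings)[0]
best = scores.argmax().item()
print(corpus[best], scores[best].item())
```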
## Evaluation

### Metrics

#### Semantic Similarity
- Datasets: `stsb-indo-edu-dev` and `stsb-indo-edu-test`
- Evaluated with `EmbeddingSimilarityEvaluator`
| Metric | stsb-indo-edu-dev | stsb-indo-edu-test |
|---|---|---|
| pearson_cosine | 0.8433 | 0.8443 |
| spearman_cosine | 0.858 | 0.8603 |
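These scores come from `EmbeddingSimilarityEvaluator`, which correlates cosine similarities of the embedded pairs with the gold scores. A minimal sketch of re-running it on the dev split (the `validation` split name is an assumption; the column names follow the Training Details section below):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("Pustekhan-ITB/indoedubert-bge-m3-exp2")

# Split name "validation" is an assumption; columns follow the card
ds = load_dataset("Pustekhan-ITB/stsb-indo-edu", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=ds["sentence1"],
    sentences2=ds["sentence2"],
    scores=ds["score"],
    name="stsb-indo-edu-dev",
)
# Returns a dict including pearson_cosine and spearman_cosine
print(evaluator(model))
```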
## Training Details

### Training Dataset

#### stsb-indo-edu
- Dataset: stsb-indo-edu at 2c5aa12
- Size: 6,198 training samples
- Columns: `sentence1`, `sentence2`, and `score`
- Approximate statistics based on the first 1000 samples:
  | | sentence1 | sentence2 | score |
  |---|---|---|---|
  | type | string | string | float |
  | details | min: 6 tokens<br>mean: 10.95 tokens<br>max: 28 tokens | min: 6 tokens<br>mean: 10.81 tokens<br>max: 30 tokens | min: 0.0<br>mean: 0.46<br>max: 1.0 |
- Samples:
  | sentence1 | sentence2 | score |
  |---|---|---|
  | Pelajaran menari daerah membantu siswa SD melestarikan kebudayaan lokal | Tarian ini sering dipentaskan saat perayaan hari besar | 0.76 |
  | Sebelum ujian sekolah, guru memberikan bimbingan belajar tambahan secara gratis | Upaya ini agar seluruh siswa siap menghadapi ujian | 0.85 |
  | Beberapa SD terletak di daerah pegunungan, sehingga siswa harus berjalan kaki cukup jauh | Ini melatih kemandirian dan fisik yang kuat | 0.63 |
- Loss: `CoSENTLoss` with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```
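A minimal sketch of loading this dataset and constructing the loss as configured above (the `train` split name is an assumption; the revision pins the commit listed in the dataset entry):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss

model = SentenceTransformer("BAAI/bge-m3")

# "train" split is assumed; revision is the commit noted above
train_dataset = load_dataset(
    "Pustekhan-ITB/stsb-indo-edu", split="train", revision="2c5aa12"
)

# scale=20.0 with pairwise cosine similarity is CoSENTLoss's default,
# matching the parameters reported above
loss = CoSENTLoss(model, scale=20.0)
```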
### Evaluation Dataset

#### stsb-indo-edu

- Dataset: stsb-indo-edu at 2c5aa12
- Size: 1,536 evaluation samples
- Columns: `sentence1`, `sentence2`, and `score`
- Approximate statistics based on the first 1000 samples:
  | | sentence1 | sentence2 | score |
  |---|---|---|---|
  | type | string | string | float |
  | details | min: 5 tokens<br>mean: 15.96 tokens<br>max: 44 tokens | min: 6 tokens<br>mean: 15.97 tokens<br>max: 47 tokens | min: 0.0<br>mean: 0.42<br>max: 1.0 |
- Samples:
  | sentence1 | sentence2 | score |
  |---|---|---|
  | Seorang pria dengan topi keras sedang menari. | Seorang pria yang mengenakan topi keras sedang menari. | 1.0 |
  | Seorang anak kecil sedang menunggang kuda. | Seorang anak sedang menunggang kuda. | 0.95 |
  | Seorang pria sedang memberi makan seekor tikus kepada seekor ular. | Pria itu sedang memberi makan seekor tikus kepada ular. | 1.0 |
- Loss: `CoSENTLoss` with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "pairwise_cos_sim"
  }
  ```
### Training Hyperparameters

#### Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `num_train_epochs`: 1
- `warmup_ratio`: 0.1
- `fp16`: True
- `batch_sampler`: no_duplicates
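A minimal sketch of expressing these settings as sentence-transformers v3 training arguments (the output directory is hypothetical):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="indoedubert-bge-m3-exp2",  # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```

Together with the model, dataset, and loss from the previous sketch, these arguments would be passed to a `SentenceTransformerTrainer` to reproduce the run.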
#### All Hyperparameters

<details><summary>Click to expand</summary>
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional

</details>
### Training Logs
| Epoch | Step | Training Loss | Validation Loss | stsb-indo-edu-dev_spearman_cosine | stsb-indo-edu-test_spearman_cosine |
|---|---|---|---|---|---|
| -1 | -1 | - | - | 0.8096 | - |
| 0.5155 | 100 | 6.0081 | 5.7898 | 0.8580 | - |
| -1 | -1 | - | - | - | 0.8603 |
### Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1
- Accelerate: 1.3.0
- Datasets: 3.3.0
- Tokenizers: 0.21.0
## Citation

### BibTeX
#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### CoSENTLoss

```bibtex
@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}
```