SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
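
The Pooling and Normalize modules above can be illustrated with a minimal sketch. It assumes the pooling step simply takes the first ([CLS]) token vector, as `pooling_mode_cls_token: True` suggests, and that Normalize() rescales to unit L2 length. The function name and the toy 4-dimensional data are hypothetical; this is not the library's implementation.

```python
import math

def cls_pool_and_normalize(token_embeddings):
    """Take the first ([CLS]) token vector and scale it to unit L2 length.

    token_embeddings: per-token vectors for ONE input text (toy data here,
    standing in for the encoder's last hidden states).
    """
    cls_vector = token_embeddings[0]           # pooling_mode_cls_token: True
    norm = math.sqrt(sum(x * x for x in cls_vector))
    if norm == 0.0:
        return list(cls_vector)                # avoid division by zero
    return [x / norm for x in cls_vector]      # Normalize()

# Toy example: 3 tokens, 4 dimensions instead of 768.
toy_hidden_states = [
    [3.0, 0.0, 4.0, 0.0],   # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.2, 0.1, 0.9],
]
sentence_embedding = cls_pool_and_normalize(toy_hidden_states)
print(sentence_embedding)
```

Because the output has unit length, cosine similarity between two such embeddings reduces to a dot product.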

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-1-v16")
# Run inference
sentences = [
    "office of the \u200b\u200bFederal Commissioner for Data Protection and Freedom of Information, with its headquarters in the city of Bonn. It is led by a Federal Commissioner, elected via a vote by the German Bundestag. Eligibility criteria include being at least 35 years old, appropriate qualifications in the field of data protection law gained through relevant professional experience. The Commissioner's term is for five years, which can be extended once. The Commissioner has the responsibility to act as the primary office responsible for enforcing the Federal Data Protection Act within Germany. Some of the office's key responsibilities include: Advising the Bundestag, the Bundesrat, and the Federal Government on administrative and legislative measures related to data protection within the country; To oversee and implement both the GDPR and Federal Data Protection Act within Germany; To promote awareness within the public related to the risks, rules, safeguards, and rights concerning the processing of personal data; To handle all,  within Germany. It supplements and aligns with the requirements of the EU GDPR. Yes, Germany is covered by GDPR (General Data Protection Regulation). GDPR is a regulation that applies uniformly across all EU member states, including Germany. The Federal Data Protection Act established the office of the \u200b\u200bFederal Commissioner for Data Protection and Freedom of Information, with its headquarters in the city of Bonn. It is led by a Federal Commissioner, elected via a vote by the German Bundestag. Germany's interpretation is the Bundesdatenschutzgesetz (BDSG), the German Federal Data Protection Act. It mirrors the GDPR in all key areas while giving local German regulatory authorities the power to enforce it more efficiently nationally. ## Join Our Newsletter Get all the latest information, law updates and more delivered to your inbox ### Share Copy 14 ### More Stories that May Interest You View More",
    'What are the main responsibilities of the Federal Commissioner for Data Protection and Freedom of Information in enforcing data protection laws in Germany, including the GDPR and the Federal Data Protection Act?',
    'What is the collection and use of personal information by businesses?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
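
Because the model was trained with MatryoshkaLoss at 768, 512, 256, 128 and 64 dimensions (see Training Details), embeddings can plausibly be shortened to one of those sizes to save storage. The sketch below shows the usual recipe, keep the leading components and re-normalize, on toy 8-dimensional vectors. The helper names are hypothetical and the vectors are not real model outputs.

```python
import math

def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit L2 length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else list(head)

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 8-dimensional vectors standing in for 768-dimensional embeddings.
full_a = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
full_b = [0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

small_a = truncate_and_renormalize(full_a, 4)
small_b = truncate_and_renormalize(full_b, 4)
print(len(small_a), cosine(full_a, full_b), cosine(small_a, small_b))
```

With the real model, each row of the `model.encode(...)` output would take the place of the toy vectors. Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may be more convenient. Retrieval quality drops at smaller sizes, as the evaluation tables show.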

Evaluation

Metrics

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.6907
cosine_accuracy@3 0.8866
cosine_accuracy@5 0.9381
cosine_accuracy@10 0.9691
cosine_precision@1 0.6907
cosine_precision@3 0.2955
cosine_precision@5 0.1876
cosine_precision@10 0.0969
cosine_recall@1 0.6907
cosine_recall@3 0.8866
cosine_recall@5 0.9381
cosine_recall@10 0.9691
cosine_ndcg@10 0.8386
cosine_mrr@10 0.7956
cosine_map@100 0.7968

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.6907
cosine_accuracy@3 0.8763
cosine_accuracy@5 0.9278
cosine_accuracy@10 0.9691
cosine_precision@1 0.6907
cosine_precision@3 0.2921
cosine_precision@5 0.1856
cosine_precision@10 0.0969
cosine_recall@1 0.6907
cosine_recall@3 0.8763
cosine_recall@5 0.9278
cosine_recall@10 0.9691
cosine_ndcg@10 0.833
cosine_mrr@10 0.7889
cosine_map@100 0.7896

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.6907
cosine_accuracy@3 0.8557
cosine_accuracy@5 0.8969
cosine_accuracy@10 0.9381
cosine_precision@1 0.6907
cosine_precision@3 0.2852
cosine_precision@5 0.1794
cosine_precision@10 0.0938
cosine_recall@1 0.6907
cosine_recall@3 0.8557
cosine_recall@5 0.8969
cosine_recall@10 0.9381
cosine_ndcg@10 0.8162
cosine_mrr@10 0.777
cosine_map@100 0.7796

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.5979
cosine_accuracy@3 0.7732
cosine_accuracy@5 0.8247
cosine_accuracy@10 0.8866
cosine_precision@1 0.5979
cosine_precision@3 0.2577
cosine_precision@5 0.1649
cosine_precision@10 0.0887
cosine_recall@1 0.5979
cosine_recall@3 0.7732
cosine_recall@5 0.8247
cosine_recall@10 0.8866
cosine_ndcg@10 0.7462
cosine_mrr@10 0.701
cosine_map@100 0.7047

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.5155
cosine_accuracy@3 0.6804
cosine_accuracy@5 0.7113
cosine_accuracy@10 0.7732
cosine_precision@1 0.5155
cosine_precision@3 0.2268
cosine_precision@5 0.1423
cosine_precision@10 0.0773
cosine_recall@1 0.5155
cosine_recall@3 0.6804
cosine_recall@5 0.7113
cosine_recall@10 0.7732
cosine_ndcg@10 0.6463
cosine_mrr@10 0.6055
cosine_map@100 0.6128
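
In every table above, accuracy@k equals recall@k and precision@k is accuracy@k divided by k (for example, 0.8866 / 3 ≈ 0.2955). That pattern is what one would expect if each query has exactly one relevant passage. The sketch below computes the metrics under that assumption; the function name and toy rankings are hypothetical, and this is not the evaluator used to produce the reported numbers.

```python
def ir_metrics_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Compute accuracy@k, precision@k, recall@k and MRR@k.

    Assumes exactly ONE relevant document per query.
    """
    n = len(ranked_ids_per_query)
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query):
        top_k = ranked[:k]
        if relevant in top_k:
            hits += 1
            reciprocal_rank_sum += 1.0 / (top_k.index(relevant) + 1)
    accuracy = hits / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,   # at most one hit among k retrieved items
        "recall": accuracy,          # one relevant item per query
        "mrr": reciprocal_rank_sum / n,
    }

# Toy data: four queries, relevant document found at rank 1, 2, 3 and not at all.
rankings = [
    ["d1", "d2", "d3"],
    ["d5", "d4", "d6"],
    ["d9", "d8", "d7"],
    ["d3", "d1", "d2"],
]
relevant = ["d1", "d4", "d7", "d0"]
metrics = ir_metrics_at_k(rankings, relevant, k=3)
print(metrics)
```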

Training Details

Training Dataset

Unnamed Dataset

  • Size: 7,872 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    column type min mean max
    positive string 18 tokens 206.12 tokens 414 tokens
    anchor string 9 tokens 21.62 tokens 102 tokens
  • Samples:
    positive anchor
    Automation PrivacyCenter.Cloud Data Mapping
    positive: on both in terms of material and territorial scope. ### 1.1 Material Scope The Spanish data protection law affords blanket protection for all data that may have been collected on a data subject. There are only a handful of exceptions that include: Information subject to a pending legal case Information collected concerning the investigation of terrorism or organised crime Information classified as "Confidential" for matters related to Spain's national security ### 1.2 Territorial Scope The Spanish data protection law applies to all data handlers that are: Carrying out data collection activities in Spain Not established in Spain but carrying out data collection activities on Spanish territory Not established within the European Union but carrying out data collection activities on Spanish residents unless for data transit purposes only ## 2. Obligations for Organizations Under Spanish Data Protection Law The Spanish data protection law and GDPR lay out specific obligations for all data handlers. These obligations ensure, . ### 2.3 Privacy Policy Requirements Spain's data protection law requires all data handlers to inform the data subject of the following in their privacy policy: The purpose of collecting the data and the recipients of the information The obligatory or voluntary nature of the reply to the questions put to them The consequences of obtaining the data or of refusing to provide them The possibility of exercising rights of access, rectification, erasure, portability, and objection The identity and address of the controller or their local Spanish representative ### 2.4 Security Requirements Article 9 of Spain's Data Protection Law is direct and explicit in stating the responsibility of the data handler is to take adequate measures to ensure the protection of any data collected. It mandates all data handlers to adopt technical and organisational measures necessary to ensure the security of the personal data and prevent their alteration, loss, and unauthorised processing or access. Additionally, collection of any
    anchor: What are the requirements for organizations under the Spanish data protection law regarding privacy policies and security measures?
    positive: before the point of collection of their personal information. ## Right to Erasure The right to erasure gives consumers the right to request deleting all their data stored by the organization. Organizations are supposed to comply within 45 days and must deliver a report to the consumer confirming the deletion of their information. ## Right to Opt-in for Minors Personal information containing minors' personal information cannot be sold by a business unless the minor (age of 13 to 16 years) or the Parent/Guardian (if the minor is aged below 13 years) opt-ins to allow this sale. Businesses can be held liable for the sale of minors' personal information if they either knew or wilfully disregarded the consumer's status as a minor and the minor or Parent/Guardian had not willingly opted in. ## Right to Continued Protection Even when consumers choose to allow a business to collect and sell their personal information, businesses' must sign written
    anchor: What are the conditions under which businesses can sell minors' personal information?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
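
To make the loss configuration above more concrete, here is a rough sketch of the idea: an in-batch-negatives ranking loss is computed on the embeddings truncated to each Matryoshka dimension, and the weighted results are summed. The cosine similarity with a scale factor of 20 is an assumption about the inner loss, and the function names and toy 4-dimensional vectors are hypothetical. This is a plain-Python illustration, not the library's implementation.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy where anchor i should rank positive i above the
    other positives in the batch (in-batch negatives)."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [v[:dim] for v in anchors]
        truncated_positives = [v[:dim] for v in positives]
        total += weight * mnr_loss(truncated_anchors, truncated_positives)
    return total

# Toy batch of two (anchor, positive) pairs; dims [4, 2] stand in for [768, ..., 64].
anchors = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
positives = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]
loss_value = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss_value)
```

Training every prefix of the vector this way is what makes the shorter embeddings usable on their own.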
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • learning_rate: 2e-05
  • num_train_epochs: 2
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
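
The learning-rate settings above (learning_rate 2e-05, cosine scheduler, warmup_ratio 0.1) imply a linear warmup followed by a cosine decay. The sketch below reproduces that shape in plain Python. It assumes a half-cosine decay to zero and warmup steps rounded up from the ratio, and takes the 492 total steps from the final row of the training logs. The function name is hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 492  # last step reported in the training logs
for step in (0, 25, 50, 271, 492):
    print(step, lr_at_step(step, total_steps))
```

Under these assumptions the rate peaks at 2e-05 around step 50 and is back to roughly half that by step 271.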

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_128_cosine_map@100 dim_256_cosine_map@100 dim_512_cosine_map@100 dim_64_cosine_map@100 dim_768_cosine_map@100
0.0407 10 7.3954 - - - - -
0.0813 20 6.0944 - - - - -
0.1220 30 4.9443 - - - - -
0.1626 40 3.8606 - - - - -
0.2033 50 3.0961 - - - - -
0.2439 60 1.8788 - - - - -
0.2846 70 2.3815 - - - - -
0.3252 80 4.0698 - - - - -
0.3659 90 2.2183 - - - - -
0.4065 100 1.9142 - - - - -
0.4472 110 1.5149 - - - - -
0.4878 120 1.7036 - - - - -
0.5285 130 2.9528 - - - - -
0.5691 140 1.0596 - - - - -
0.6098 150 1.7619 - - - - -
0.6504 160 1.6529 - - - - -
0.6911 170 3.097 - - - - -
0.7317 180 1.3802 - - - - -
0.7724 190 1.9744 - - - - -
0.8130 200 5.1313 - - - - -
0.8537 210 1.405 - - - - -
0.8943 220 1.4389 - - - - -
0.9350 230 3.6439 - - - - -
0.9756 240 3.7227 - - - - -
1.0122 249 - 0.6623 0.7328 0.7549 0.5729 0.7572
1.0041 250 1.3183 - - - - -
1.0447 260 5.2631 - - - - -
1.0854 270 4.0516 - - - - -
1.1260 280 2.5487 - - - - -
1.1667 290 1.7379 - - - - -
1.2073 300 1.1724 - - - - -
1.2480 310 0.7885 - - - - -
1.2886 320 1.2341 - - - - -
1.3293 330 3.3722 - - - - -
1.3699 340 1.2227 - - - - -
1.4106 350 0.8475 - - - - -
1.4512 360 0.7605 - - - - -
1.4919 370 0.8954 - - - - -
1.5325 380 1.9712 - - - - -
1.5732 390 0.5607 - - - - -
1.6138 400 0.9671 - - - - -
1.6545 410 1.0024 - - - - -
1.6951 420 2.1374 - - - - -
1.7358 430 0.8213 - - - - -
1.7764 440 2.1253 - - - - -
1.8171 450 2.7885 - - - - -
1.8577 460 0.9053 - - - - -
1.8984 470 0.9261 - - - - -
1.9390 480 3.1218 - - - - -
1.9797 490 3.0135 - - - - -
1.9878 492 - 0.7047 0.7796 0.7896 0.6128 0.7968
  • The row at step 492 denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.0.1
  • Transformers: 4.41.2
  • PyTorch: 2.1.2+cu121
  • Accelerate: 0.31.0
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}