SentenceTransformer based on thenlper/gte-large

This is a sentence-transformers model finetuned from thenlper/gte-large. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: thenlper/gte-large
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: https://www.sbert.net
  • Repository: https://github.com/UKPLab/sentence-transformers
  • Hugging Face: https://huggingface.co/models?library=sentence-transformers

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
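
For illustration only (not part of the original card), the module stack above can be reproduced with plain 🤗 Transformers: a BERT forward pass, mean pooling over non-padding tokens, and L2 normalization. This is a hedged sketch of what the Pooling and Normalize modules do; the SentenceTransformer API shown under Usage is the intended way to run the model.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("JFernandoGRE/gtelarge-colombian-elitenames")
model = AutoModel.from_pretrained("JFernandoGRE/gtelarge-colombian-elitenames")

batch = tokenizer(
    ["EL MAR", "ZEA DE SPINEL MARIELENA"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 1024)

# Mean pooling over non-padding tokens (pooling_mode_mean_tokens=True)
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# L2 normalization (the Normalize() module), so dot product equals cosine similarity
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 1024])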

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("JFernandoGRE/gtelarge-colombian-elitenames")
# Run inference
sentences = [
    'EL MAR',
    'ZEA DE SPINEL MARIELENA',
    'FONSECA MEDINA FLOR MARIA',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
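
Because the training data (described below) consists of pairs of name strings with binary labels, one natural application is flagging near-duplicate spellings of the same name. The sketch below is illustrative and not from the original card; the 0.9 threshold is an arbitrary assumption, not a value validated by the model author.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("JFernandoGRE/gtelarge-colombian-elitenames")
names = [
    "ALVAROJIMENEZ LINARES",
    "ALVAROJIMENEZ LINAREZ",
    "FONSECA MEDINA FLOR MARIA",
]
# Mine all pairs, ranked by cosine similarity of the normalized embeddings
for score, i, j in util.paraphrase_mining(model, names):
    if score >= 0.9:  # illustrative threshold only
        print(f"{names[i]!r} ~ {names[j]!r} (cosine={score:.3f})")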

Training Details

Training Dataset

Unnamed Dataset

  • Size: 23,976 training samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
    • sentence1: string; min: 4 tokens, mean: 6.46 tokens, max: 26 tokens
    • sentence2: string; min: 4 tokens, mean: 8.24 tokens, max: 28 tokens
    • label: int; 0: ~87.50%, 1: ~12.50%
  • Samples:
    • sentence1: "CA M", sentence2: "YURLY ANGELICA MENDOZA MENDEZ", label: 0
    • sentence1: "JOSE MARIA", sentence2: "JOSE MARIA DOMINGUEZ VELASCO", label: 0
    • sentence1: "O GOMEZ", sentence2: "JOSE DEMETRIO GOMEZ NARVAEZ", label: 0
  • Loss: OnlineContrastiveLoss

Evaluation Dataset

Unnamed Dataset

  • Size: 5,995 evaluation samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
    • sentence1: string; min: 4 tokens, mean: 6.49 tokens, max: 39 tokens
    • sentence2: string; min: 4 tokens, mean: 8.18 tokens, max: 23 tokens
    • label: int; 0: ~89.20%, 1: ~10.80%
  • Samples:
    • sentence1: "MARIA CECILIA VILLAMIZAR ANGULO", sentence2: "MARY CECILIA VILLAMIZAR ANGULO", label: 0
    • sentence1: "GO GOMEZ", sentence2: "RAULARANGO GOMEZ", label: 0
    • sentence1: "ALVAROJIMENEZ LINARES", sentence2: "ALVAROJIMENEZ LINAREZ", label: 1
  • Loss: OnlineContrastiveLoss
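
Both splits use OnlineContrastiveLoss over (sentence1, sentence2, label) pairs. The following is a minimal sketch of how such a dataset and loss can be set up with the sentence-transformers v3 API; the toy rows are copied from the samples above, and the label convention (1 = positive pair) is the ContrastiveLoss default stated here as an assumption, not a fact about the original data.

from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import OnlineContrastiveLoss

model = SentenceTransformer("thenlper/gte-large")

# Toy stand-ins for the 23,976 training / 5,995 evaluation pairs
train_dataset = Dataset.from_dict({
    "sentence1": ["O GOMEZ", "ALVAROJIMENEZ LINARES"],
    "sentence2": ["JOSE DEMETRIO GOMEZ NARVAEZ", "ALVAROJIMENEZ LINAREZ"],
    "label": [0, 1],
})
eval_dataset = Dataset.from_dict({
    "sentence1": ["GO GOMEZ"],
    "sentence2": ["RAULARANGO GOMEZ"],
    "label": [0],
})

# OnlineContrastiveLoss keeps only the hard positives/negatives within each batch
loss = OnlineContrastiveLoss(model)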

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 1e-05
  • num_train_epochs: 5
  • warmup_ratio: 0.182
  • fp16: True
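
As a hedged sketch continuing the dataset/loss snippet above, the non-default values listed here map directly onto SentenceTransformerTrainingArguments; the output directory is a placeholder, not the author's actual path.

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="gtelarge-colombian-elitenames",  # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=5,
    warmup_ratio=0.182,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,                  # from the sketch above
    args=args,
    train_dataset=train_dataset,  # from the sketch above
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()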

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.182
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss
0.0667 100 0.5037 0.5726
0.1334 200 0.3707 0.5032
0.2001 300 0.1852 0.5350
0.2668 400 0.1513 0.4760
0.3336 500 0.1414 0.4331
0.4003 600 0.1268 0.4587
0.4670 700 0.1474 0.3920
0.5337 800 0.106 0.3886
0.6004 900 0.1205 0.2674
0.6671 1000 0.1186 0.4007
0.7338 1100 0.1089 0.3114
0.8005 1200 0.098 0.3934
0.8672 1300 0.0903 0.3584
0.9340 1400 0.1162 0.3693
1.0007 1500 0.12 0.3155
1.0674 1600 0.1145 0.3847
1.1341 1700 0.0987 0.2464
1.2008 1800 0.0908 0.2814
1.2675 1900 0.098 0.3297
1.3342 2000 0.0761 0.3088
1.4009 2100 0.0883 0.2902
1.4676 2200 0.1037 0.2578
1.5344 2300 0.0848 0.3500
1.6011 2400 0.0701 0.2834
1.6678 2500 0.0912 0.2429
1.7345 2600 0.0815 0.2146
1.8012 2700 0.0804 0.2155
1.8679 2800 0.0729 0.2373
1.9346 2900 0.0734 0.2314
2.0013 3000 0.0804 0.2570
2.0680 3100 0.0524 0.3019
2.1348 3200 0.0602 0.2900
2.2015 3300 0.0561 0.2553
2.2682 3400 0.0457 0.2436
2.3349 3500 0.0626 0.3225
2.4016 3600 0.0576 0.2204
2.4683 3700 0.0644 0.2630
2.5350 3800 0.0556 0.2038
2.6017 3900 0.0593 0.2694
2.6684 4000 0.0499 0.2262
2.7352 4100 0.0611 0.1960
2.8019 4200 0.0554 0.2043
2.8686 4300 0.0495 0.1858
2.9353 4400 0.0772 0.2147
3.0020 4500 0.0656 0.2513
3.0687 4600 0.0322 0.1809
3.1354 4700 0.0354 0.1908
3.2021 4800 0.0552 0.1639
3.2688 4900 0.0513 0.2011
3.3356 5000 0.0423 0.2323
3.4023 5100 0.0396 0.1624
3.4690 5200 0.0411 0.2187
3.5357 5300 0.0499 0.1867
3.6024 5400 0.0345 0.1755
3.6691 5500 0.0312 0.1708
3.7358 5600 0.0558 0.1832
3.8025 5700 0.0342 0.2056
3.8692 5800 0.0513 0.1858
3.9360 5900 0.0449 0.1792
4.0027 6000 0.044 0.1815
4.0694 6100 0.0329 0.1693
4.1361 6200 0.0481 0.1707
4.2028 6300 0.0328 0.1696
4.2695 6400 0.0269 0.1766
4.3362 6500 0.0299 0.1815
4.4029 6600 0.0374 0.2109
4.4696 6700 0.0449 0.2033
4.5364 6800 0.0277 0.2103
4.6031 6900 0.039 0.2088
4.6698 7000 0.0261 0.2045
4.7365 7100 0.0258 0.2051
4.8032 7200 0.0405 0.2069
4.8699 7300 0.0313 0.2051
4.9366 7400 0.0384 0.2039

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.47.1
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}