SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-m
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 tokens
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("gmedrano/snowflake-arctic-embed-m-finetuned")
# Run inference
sentences = [
'How does the AI Bill of Rights protect individual privacy?',
'principles for managing information about individuals have been incorporated into data privacy laws and \npolicies across the globe.5 The Blueprint for an AI Bill of Rights embraces elements of the FIPPs that are \nparticularly relevant to automated systems, without articulating a specific set of FIPPs or scoping \napplicability or the interests served to a single particular domain, like privacy, civil rights and civil liberties,',
'harmful \nuses. \nThe \nNIST \nframework \nwill \nconsider \nand \nencompass \nprinciples \nsuch \nas \ntransparency, accountability, and fairness during pre-design, design and development, deployment, use, \nand testing and evaluation of AI technologies and systems. It is expected to be released in the winter of 2022-23. \n21',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
Evaluation
Metrics
Semantic Similarity
- Dataset:
val
- Evaluated with
EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.6585 |
spearman_cosine | 0.7 |
pearson_manhattan | 0.5827 |
spearman_manhattan | 0.6 |
pearson_euclidean | 0.6723 |
spearman_euclidean | 0.7 |
pearson_dot | 0.6585 |
spearman_dot | 0.7 |
pearson_max | 0.6723 |
spearman_max | 0.7 |
Semantic Similarity
- Dataset:
test
- Evaluated with
EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.7463 |
spearman_cosine | 0.8 |
pearson_manhattan | 0.7475 |
spearman_manhattan | 0.8 |
pearson_euclidean | 0.7592 |
spearman_euclidean | 0.8 |
pearson_dot | 0.7463 |
spearman_dot | 0.8 |
pearson_max | 0.7592 |
spearman_max | 0.8 |
Training Details
Training Dataset
Unnamed Dataset
- Size: 40 training samples
- Columns:
sentence_0
,sentence_1
, andlabel
- Approximate statistics based on the first 40 samples:
sentence_0 sentence_1 label type string string float details - min: 12 tokens
- mean: 14.43 tokens
- max: 18 tokens
- min: 41 tokens
- mean: 80.55 tokens
- max: 117 tokens
- min: 0.53
- mean: 0.61
- max: 0.76
- Samples:
sentence_0 sentence_1 label What should business leaders understand about AI risk management?
57
National Institute of Standards and Technology (2023) AI Risk Management Framework, Appendix B:
How AI Risks Differ from Traditional Software Risks.
https://airc.nist.gov/AI_RMF_Knowledge_Base/AI_RMF/Appendices/Appendix_B
National Institute of Standards and Technology (2023) AI RMF Playbook.
https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook
National Institue of Standards and Technology (2023) Framing Risk0.5692041097520776
What kind of data protection measures are required under current AI regulations?
GOVERN 1.1: Legal and regulatory requirements involving AI are understood, managed, and documented.
Action ID
Suggested Action
GAI Risks
GV-1.1-001 Align GAI development and use with applicable laws and regulations, including
those related to data privacy, copyright and intellectual property law.
Data Privacy; Harmful Bias and
Homogenization; Intellectual
Property
AI Actor Tasks: Governance and Oversight0.5830958798587019
What are the implications of AI in decision-making processes?
state of the science of AI measurement and safety today. This document focuses on risks for which there
is an existing empirical evidence base at the time this profile was written; for example, speculative risks
that may potentially arise in more advanced, future GAI systems are not considered. Future updates may
incorporate additional risks or provide further details on the risks identified below.0.5317174553776045
- Loss:
CosineSimilarityLoss
with these parameters:{ "loss_fct": "torch.nn.modules.loss.MSELoss" }
Training Hyperparameters
Non-Default Hyperparameters
eval_strategy
: stepsper_device_train_batch_size
: 16per_device_eval_batch_size
: 16multi_dataset_batch_sampler
: round_robin
All Hyperparameters
Click to expand
overwrite_output_dir
: Falsedo_predict
: Falseeval_strategy
: stepsprediction_loss_only
: Trueper_device_train_batch_size
: 16per_device_eval_batch_size
: 16per_gpu_train_batch_size
: Noneper_gpu_eval_batch_size
: Nonegradient_accumulation_steps
: 1eval_accumulation_steps
: Nonetorch_empty_cache_steps
: Nonelearning_rate
: 5e-05weight_decay
: 0.0adam_beta1
: 0.9adam_beta2
: 0.999adam_epsilon
: 1e-08max_grad_norm
: 1num_train_epochs
: 3max_steps
: -1lr_scheduler_type
: linearlr_scheduler_kwargs
: {}warmup_ratio
: 0.0warmup_steps
: 0log_level
: passivelog_level_replica
: warninglog_on_each_node
: Truelogging_nan_inf_filter
: Truesave_safetensors
: Truesave_on_each_node
: Falsesave_only_model
: Falserestore_callback_states_from_checkpoint
: Falseno_cuda
: Falseuse_cpu
: Falseuse_mps_device
: Falseseed
: 42data_seed
: Nonejit_mode_eval
: Falseuse_ipex
: Falsebf16
: Falsefp16
: Falsefp16_opt_level
: O1half_precision_backend
: autobf16_full_eval
: Falsefp16_full_eval
: Falsetf32
: Nonelocal_rank
: 0ddp_backend
: Nonetpu_num_cores
: Nonetpu_metrics_debug
: Falsedebug
: []dataloader_drop_last
: Falsedataloader_num_workers
: 0dataloader_prefetch_factor
: Nonepast_index
: -1disable_tqdm
: Falseremove_unused_columns
: Truelabel_names
: Noneload_best_model_at_end
: Falseignore_data_skip
: Falsefsdp
: []fsdp_min_num_params
: 0fsdp_config
: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}fsdp_transformer_layer_cls_to_wrap
: Noneaccelerator_config
: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}deepspeed
: Nonelabel_smoothing_factor
: 0.0optim
: adamw_torchoptim_args
: Noneadafactor
: Falsegroup_by_length
: Falselength_column_name
: lengthddp_find_unused_parameters
: Noneddp_bucket_cap_mb
: Noneddp_broadcast_buffers
: Falsedataloader_pin_memory
: Truedataloader_persistent_workers
: Falseskip_memory_metrics
: Trueuse_legacy_prediction_loop
: Falsepush_to_hub
: Falseresume_from_checkpoint
: Nonehub_model_id
: Nonehub_strategy
: every_savehub_private_repo
: Falsehub_always_push
: Falsegradient_checkpointing
: Falsegradient_checkpointing_kwargs
: Noneinclude_inputs_for_metrics
: Falseeval_do_concat_batches
: Truefp16_backend
: autopush_to_hub_model_id
: Nonepush_to_hub_organization
: Nonemp_parameters
:auto_find_batch_size
: Falsefull_determinism
: Falsetorchdynamo
: Noneray_scope
: lastddp_timeout
: 1800torch_compile
: Falsetorch_compile_backend
: Nonetorch_compile_mode
: Nonedispatch_batches
: Nonesplit_batches
: Noneinclude_tokens_per_second
: Falseinclude_num_input_tokens_seen
: Falseneftune_noise_alpha
: Noneoptim_target_modules
: Nonebatch_eval_metrics
: Falseeval_on_start
: Falseeval_use_gather_object
: Falsebatch_sampler
: batch_samplermulti_dataset_batch_sampler
: round_robin
Training Logs
Epoch | Step | test_spearman_max | val_spearman_max |
---|---|---|---|
1.0 | 3 | - | 0.6 |
2.0 | 6 | - | 0.7 |
3.0 | 9 | 0.8000 | 0.7 |
Framework Versions
- Python: 3.11.9
- Sentence Transformers: 3.1.1
- Transformers: 4.44.2
- PyTorch: 2.2.2
- Accelerate: 0.34.2
- Datasets: 3.0.0
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
- Downloads last month
- 13
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social
visibility and check back later, or deploy to Inference Endpoints (dedicated)
instead.
Model tree for gmedrano/snowflake-arctic-embed-m-finetuned
Base model
Snowflake/snowflake-arctic-embed-mEvaluation results
- Pearson Cosine on valself-reported0.659
- Spearman Cosine on valself-reported0.700
- Pearson Manhattan on valself-reported0.583
- Spearman Manhattan on valself-reported0.600
- Pearson Euclidean on valself-reported0.672
- Spearman Euclidean on valself-reported0.700
- Pearson Dot on valself-reported0.659
- Spearman Dot on valself-reported0.700
- Pearson Max on valself-reported0.672
- Spearman Max on valself-reported0.700