
SentenceTransformer based on Snowflake/snowflake-arctic-embed-l-v2.0
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l-v2.0 on clustered datasets. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity and semantic search.
The Snowflake/snowflake-arctic-embed-l-v2.0 model has been further trained with Korean data to enhance its performance in Korean retrieval tasks. It is a powerful model that achieves state-of-the-art (SOTA) performance across multiple retrieval benchmarks.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-l-v2.0
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
- Training Datasets:
  - AI Hub Dataset
    - Machine Reading Comprehension for Administrative Documents (행정 문서 대상 기계 독해)
    - Machine Reading Comprehension (기계 독해)
    - News Article Machine Reading Comprehension (뉴스 기사 기계독해)
    - Book Machine Reading Comprehension (도서 자료 기계독해)
    - Numerical Reasoning Machine Reading Comprehension (숫자 연산 기계독해)
    - Financial and Legal Document Machine Reading Comprehension (금융 법률 문서 기계독해)
- Language: Korean, English
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
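This is the module stack printed by Sentence Transformers. A minimal sketch for loading the model and confirming these settings, using attributes exposed by the library:

from sentence_transformers import SentenceTransformer

# Load the model and inspect the module stack (Transformer -> CLS Pooling -> Normalize)
model = SentenceTransformer("dragonkue/snowflake-arctic-embed-l-v2.0-ko")
print(model)                                     # prints the architecture shown above
print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 1024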
Usage
First, install the Sentence Transformers and xformers libraries:
pip install -U sentence-transformers
pip install xformers
Then you can load this model and run inference.
Using Sentence Transformers
from sentence_transformers import SentenceTransformer
# Load the model
model_name = 'dragonkue/snowflake-arctic-embed-l-v2.0-ko'
model = SentenceTransformer(model_name)
# Define the queries and documents
queries = ['대한민국의 수도는 어디인가?', '한글을 만든 사람은 누구인가?']
documents = ['대한민국의 수도는 서울이다.', '한글은 세종대왕이 창제하였다.']
# Compute embeddings: use `prompt_name="query"` to encode queries!
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)
# Compute cosine similarity scores
scores = model.similarity(query_embeddings, document_embeddings)
# Output the results
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
Using Hugging Face Transformers
You can also use the Hugging Face transformers package directly, as shown below. For optimal retrieval quality, use the CLS token as the embedding of each text and apply the query prefix below (to queries only).
import torch
from transformers import AutoModel, AutoTokenizer
model_name = 'dragonkue/snowflake-arctic-embed-l-v2.0-ko'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, add_pooling_layer=False)
model.eval()
query_prefix = 'query: '
queries = ['대한민국의 수도는 어디인가?', '한글을 만든 사람은 누구인가?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=8192)
documents = ['대한민국의 수도는 서울이다.', '한글은 세종대왕이 창제하였다.']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=8192)
# Compute token embeddings
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))

for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
Evaluation
- This evaluation references the KURE GitHub repository (https://github.com/nlpai-lab/KURE).
- We conducted an evaluation on all Korean Retrieval Benchmarks registered in MTEB.
Korean Retrieval Benchmark
- Ko-StrategyQA: A Korean ODQA multi-hop retrieval dataset, translated from StrategyQA.
- AutoRAGRetrieval: A Korean document retrieval dataset constructed by parsing PDFs from five domains: finance, public, medical, legal, and commerce.
- MIRACLRetrieval: A Korean document retrieval dataset based on Wikipedia.
- PublicHealthQA: A retrieval dataset focused on medical and public health domains in Korean.
- BelebeleRetrieval: A Korean document retrieval dataset based on FLORES-200.
- MrTidyRetrieval: A Wikipedia-based Korean document retrieval dataset.
- MultiLongDocRetrieval: A long-document retrieval dataset covering various domains in Korean.
- XPQARetrieval: A cross-domain Korean document retrieval dataset.
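For reference, the sketch below shows how one of these retrieval tasks could be run with the mteb package. The task identifier is taken from the list above and the output folder is illustrative; exact task names and APIs may differ across mteb versions.

import mteb
from sentence_transformers import SentenceTransformer

# Load the model to evaluate
model = SentenceTransformer("dragonkue/snowflake-arctic-embed-l-v2.0-ko")

# Select one of the Korean retrieval tasks listed above (identifier assumed to match the MTEB registry)
tasks = mteb.get_tasks(tasks=["AutoRAGRetrieval"], languages=["kor"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/korean-retrieval")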
Metrics
- Standard metric: NDCG@10
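For reference, the sketch below computes NDCG@10 for a single query from a ranked list of binary relevance labels. It is the standard formulation of the metric, not the exact evaluation code used for the benchmarks.

import math

def dcg_at_k(relevances, k=10):
    # DCG@k = sum over positions i (starting at 1) of rel_i / log2(i + 1)
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (descending) ranking
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the single relevant document is ranked 2nd among the top-10 results
print(ndcg_at_k([0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # 1 / log2(3) ≈ 0.6309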
Information Retrieval
- Achieves state-of-the-art (SOTA) performance across various benchmarks.
- For each benchmark, the highest score is highlighted in bold, and the second-highest score is italicized.
Model | MrTidyRetrieval | MIRACLRetrieval | XPQARetrieval | BelebeleRetrieval | PublicHealthQA | AutoRAGRetrieval | Ko-StrategyQA | Average |
---|---|---|---|---|---|---|---|---|
dragonkue/snowflake-arctic-embed-l-v2.0-ko | 0.57121 | 0.66846 | 0.4436 | 0.95177 | 0.83374 | 0.90927 | 0.80498 | 0.740433 |
dragonkue/BGE-m3-ko | 0.60992 | 0.68331 | 0.38131 | 0.95027 | 0.81545 | 0.87379 | 0.7959 | 0.729993 |
nlpai-lab/KURE-v1 | 0.59092 | 0.68157 | 0.38158 | 0.95019 | 0.81925 | 0.87076 | 0.7999 | 0.727739 |
BAAI/bge-m3 | 0.64708 | 0.70146 | 0.36075 | 0.93164 | 0.80412 | 0.83008 | 0.79405 | 0.724169 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.59071 | 0.66077 | 0.43018 | 0.9271 | 0.81679 | 0.83863 | 0.80455 | 0.724104 |
intfloat/multilingual-e5-large | 0.64211 | 0.66486 | 0.3571 | 0.94499 | 0.82534 | 0.81337 | 0.80348 | 0.721607 |
nlpai-lab/KoE5 | 0.58411 | 0.62347 | 0.35086 | 0.94251 | 0.83507 | 0.84339 | 0.80008 | 0.711356 |
BAAI/bge-multilingual-gemma2 | 0.47521 | 0.70315 | 0.37446 | 0.95001 | 0.87102 | 0.76535 | 0.79072 | 0.704274 |
jinaai/jina-embeddings-v3 | 0.55759 | 0.63716 | 0.41272 | 0.91203 | 0.83059 | 0.76104 | 0.79807 | 0.701314 |
intfloat/multilingual-e5-large-instruct | 0.52877 | 0.59914 | 0.39712 | 0.936 | 0.84967 | 0.77996 | 0.79793 | 0.69837 |
nomic-ai/nomic-embed-text-v2-moe | 0.53766 | 0.65913 | 0.36871 | 0.93636 | 0.78448 | 0.80682 | 0.76325 | 0.693773 |
intfloat/multilingual-e5-base | 0.58082 | 0.6227 | 0.3607 | 0.92868 | 0.77203 | 0.79752 | 0.76355 | 0.689429 |
intfloat/e5-mistral-7b-instruct | 0.52444 | 0.58709 | 0.39159 | 0.92403 | 0.88733 | 0.67849 | 0.79317 | 0.683734 |
Alibaba-NLP/gte-Qwen2-7B-instruct | 0.46571 | 0.53375 | 0.37866 | 0.94808 | 0.85844 | 0.76682 | 0.8108 | 0.680323 |
Alibaba-NLP/gte-multilingual-base | 0.56464 | 0.62697 | 0.30702 | 0.8796 | 0.74584 | 0.77108 | 0.75121 | 0.663766 |
openai/text-embedding-3-large | 0.44728 | 0.56248 | 0.37423 | 0.89451 | 0.85617 | 0.76466 | 0.73634 | 0.662239 |
upskyy/bge-m3-korean | 0.55011 | 0.59892 | 0.31695 | 0.8731 | 0.77559 | 0.72946 | 0.75277 | 0.6567 |
Salesforce/SFR-Embedding-2_R | 0.40347 | 0.55798 | 0.37371 | 0.91747 | 0.8605 | 0.70782 | 0.77042 | 0.65591 |
ibm-granite/granite-embedding-278m-multilingual | nan | 0.59216 | 0.23058 | 0.83231 | 0.77668 | 0.70226 | 0.71762 | 0.641935 |
jhgan/ko-sroberta-multitask | 0.29475 | 0.36698 | 0.27961 | 0.81636 | 0.69212 | 0.58332 | 0.65097 | 0.526301 |
Capabilities Beyond Benchmarks
This model is designed to handle various retrieval scenarios that are not directly measured in benchmarks:
- Supports phrase-based queries in addition to full-sentence queries.
  - Example: "What products does Samsung sell?" or "Samsung's products"
- Trained to handle diverse query formats, regardless of phrasing variations.
  - Example: "Tell me about Samsung.", "I'm curious about Samsung.", "What is Samsung?"
- Optimized for Markdown table search, allowing retrieval of answers embedded within tables when present in documents.
- Efficient clustering without hard negatives (see the sketch at the end of this section):
  - Samples within the same batch are clustered together.
  - Uses efficient embedding formation for clustering by truncating embeddings from the Snowflake/snowflake-arctic-embed-l-v2.0 model to 256 dimensions.
  - The clustering approach is inspired by the findings in the following papers:
    - Embedding And Clustering Your Data Can Improve Contrastive Pretraining
    - Contextual Document Embeddings
- Strong performance across different domains:
  - The Arctic-Embed 2.0: Multilingual Retrieval Without Compromise paper states: "While models like mE5, mGTE, and BGE-M3 excel on MIRACL, their performance on CLEF is notably weaker compared to ours and closed-source offerings, suggesting the potential of overfitting to MIRACL or its Wikipedia-based domain."
  - In my own experience, Snowflake/snowflake-arctic-embed-l-v2.0 has consistently outperformed BGE-M3 across different domains, further supporting this observation.
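As referenced above, the sketch below illustrates batch clustering on truncated embeddings. The 256-dimension truncation follows the description above; the use of scikit-learn KMeans, the toy corpus, and the cluster count are illustrative assumptions rather than the exact training pipeline.

import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Embed a (toy) corpus with the base model
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
corpus = ["document about finance", "document about healthcare", "another finance document"]
embeddings = model.encode(corpus, normalize_embeddings=True)

# Truncate to 256 dimensions and re-normalize before clustering
truncated = embeddings[:, :256]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Cluster the corpus; samples from the same cluster can then be placed in the same batch
labels = KMeans(n_clusters=2, random_state=42).fit_predict(truncated)
print(labels)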
Bias, Risks and Limitations
To prevent excessive GPU usage costs, the model was trained with a maximum sequence length of 1300 tokens. As a result, its performance may degrade on benchmarks like MultiLongDocRetrieval (MLDR).
The previous model, BGE-m3-ko, was trained with a token length of 1024, which imposed limitations on its MLDR benchmark performance.
For snowflake-arctic-embed-l-v2.0-ko, if the document length exceeds 1300 tokens (approximately 2,500 characters), it is recommended to consider the following models instead.
Model | MultiLongDocRetrieval |
---|---|
Alibaba-NLP/gte-multilingual-base | 0.48402 |
nlpai-lab/KURE-v1 | 0.47528 |
dragonkue/snowflake-arctic-embed-l-v2.0-ko | 0.4459 |
BAAI/bge-m3 | 0.43011 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.40401 |
dragonkue/BGE-m3-ko | 0.40135 |
openai/text-embedding-3-large | 0.31108 |
BAAI/bge-multilingual-gemma2 | 0.31021 |
nlpai-lab/KoE5 | 0.30869 |
jinaai/jina-embeddings-v3 | 0.30512 |
Alibaba-NLP/gte-Qwen2-7B-instruct | 0.30313 |
intfloat/multilingual-e5-large-instruct | 0.27973 |
nomic-ai/nomic-embed-text-v2-moe | 0.27135 |
intfloat/e5-mistral-7b-instruct | 0.2583 |
intfloat/multilingual-e5-large | 0.24596 |
Salesforce/SFR-Embedding-2_R | 0.24346 |
intfloat/multilingual-e5-base | 0.23766 |
upskyy/bge-m3-korean | 0.21968 |
ibm-granite/granite-embedding-278m-multilingual | 0.20781 |
jhgan/ko-sroberta-multitask | 0.20416 |
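To check whether a document fits within the 1300-token training length mentioned above, a simple tokenizer-based check can be used. This is a convenience sketch, not part of the model itself.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dragonkue/snowflake-arctic-embed-l-v2.0-ko")

def exceeds_training_length(text, limit=1300):
    # Count tokens (including special tokens) and compare against the training-time limit
    n_tokens = len(tokenizer(text, add_special_tokens=True)["input_ids"])
    return n_tokens > limit

print(exceeds_training_length("대한민국의 수도는 서울이다."))  # False for short documents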
Training Details
- Loss: CachedGISTEmbedLoss
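The exact loss parameters are not listed here. The sketch below only shows how CachedGISTEmbedLoss is typically instantiated in Sentence Transformers; the guide model and mini_batch_size are assumptions for illustration, not the values used to train this model.

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
# The guide model scores in-batch pairs to filter false negatives; the choice here is an assumption
guide = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
loss = CachedGISTEmbedLoss(model=model, guide=guide, mini_batch_size=64)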
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 20000
- per_device_eval_batch_size: 4096
- learning_rate: 2e-05
- num_train_epochs: 2
- lr_scheduler_type: warmup_stable_decay
- lr_scheduler_kwargs: {'num_decay_steps': 160}
- warmup_ratio: 0.05
- bf16: True
- batch_sampler: no_duplicates
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 10000
- per_device_eval_batch_size: 4096
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: warmup_stable_decay
- lr_scheduler_kwargs: {'num_decay_steps': 160}
- warmup_ratio: 0.05
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: True
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084"
}
KURE
@misc{KURE,
publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
year = {2024},
url = {https://github.com/nlpai-lab/KURE}
}
Arctic-Embed 2.0
@article{yu2024arcticembed,
title = "Arctic-Embed 2.0: Multilingual Retrieval Without Compromise",
author = "Puxuan Yu, Luke Merrick, Gaurav Nuti, Daniel Campos",
journal = "arXiv preprint arXiv:2412.04506",
year = "2024",
url = "https://arxiv.org/abs/2412.04506"
}
Embedding And Clustering Your Data Can Improve Contrastive Pretraining
@article{merrick2024embedding,
title = "Embedding And Clustering Your Data Can Improve Contrastive Pretraining",
author = "Luke Merrick",
journal = "arXiv preprint arXiv:2407.18887",
year = "2024",
url = "https://arxiv.org/abs/2407.18887"
}
Contextual Document Embeddings
@article{morris2024contextual,
title = "Contextual Document Embeddings",
author = "John X. Morris, Alexander M. Rush",
journal = "arXiv preprint arXiv:2410.02525",
year = "2024",
url = "https://arxiv.org/abs/2410.02525"
}
License
Arctic is licensed under the Apache-2.0 license. The released models can be used for commercial purposes free of charge.