---
language:
- en
- ko
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model_id: datalama/kanana-nano-2.1b-embedding
repo: datalama/kanana-nano-2.1b-embedding
developers: datalama
license: cc-by-nc-4.0
---
# Sentence-Transformers Compatible Kanana-Nano-2.1b-Embedding
This repository contains a sentence-transformers compatible version of the [Kanana-Nano-2.1b-Embedding](https://huggingface.co/kakaocorp/kanana-nano-2.1b-embedding) model developed by Kakao.
For detailed information about the model architecture, training methodology, and comprehensive performance benchmarks, please refer to the [original model repository](https://huggingface.co/kakaocorp/kanana-nano-2.1b-embedding) and the [Kanana technical report](https://arxiv.org/abs/2502.18934).
## Key Adaptations
This version has been modified to work seamlessly with the sentence-transformers library, with the following changes:
* Implemented `KananaEmbeddingWrapper` module to enable loading via SentenceTransformer
* Added L2 normalization within the `KananaEmbeddingWrapper`'s forward method
* `max_seq_length` is fixed at 8192.
* The query prompt template is embedded in the model configuration, so queries can be encoded via `prompt_name="query"` (verified in the sketch below).
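As a quick sanity check of these adaptations (a minimal sketch, assuming the wrapper behaves as described above), you can confirm the fixed sequence length and the unit norm of the output embeddings:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True
)

# max_seq_length is fixed at 8192 by the wrapper.
print(model.max_seq_length)  # 8192

# L2 normalization happens inside the forward pass, so norms should be ~1.0.
embeddings = model.encode(["sanity check"])
print(np.linalg.norm(embeddings, axis=1))  # ~[1.0]
```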
## Usage
### Installation
```bash
pip install sentence-transformers
```
### Basic Usage
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True)
# Encode sentences
sentences = [
"์ด ๋ฌธ์ฅ์ ํ๊ตญ์ด๋ก ์์ฑ๋์์ต๋๋ค.",
"This sentence is written in English."
]
embeddings = model.encode(sentences)
```
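As a small follow-up, you can score the Korean and English sentences against each other with the `cos_sim` helper that ships with sentence-transformers:
```python
from sentence_transformers import util

# Pairwise cosine similarity between the two sentences encoded above.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```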
### Advanced Usage with Query/Passage Format
* You can pass either `prompt_name` or `prompt` to `model.encode()`:
```python
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True)
# For retrieval tasks
instruction = "Given a question, retrieve passages that answer the question"
queries = [
"are judo throws allowed in wrestling?",
"how to become a radiology technician in michigan?",
]
# Encode the queries with prompt_name, using the predefined prompt template.
embedding_a = model.encode(queries, prompt_name="query")
# Alternatively, encode the queries by passing the prompt directly.
prompt_template = """Instruct: {instruction}\nQuery: """
embedding_b = model.encode(queries, prompt=prompt_template.format(instruction=instruction))
# Both approaches yield the same embeddings.
np.allclose(embedding_a, embedding_b)
# True
```
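The predefined template used by `prompt_name="query"` is stored on the model's `prompts` attribute (available in recent sentence-transformers releases), so you can inspect it directly:
```python
# The prompts dictionary embedded in the model configuration.
print(model.prompts)
# e.g. {'query': 'Instruct: Given a question, retrieve passages that answer the question\nQuery: '}
```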
* Comparing embeddings against the original model's encoding method:
```python
import torch.nn.functional as F
import numpy as np
from transformers import AutoModel
from sentence_transformers import SentenceTransformer
# For retrieval tasks
instruction = "Given a question, retrieve passages that answer the question"
queries = [
"are judo throws allowed in wrestling?",
"how to become a radiology technician in michigan?",
]
passages = [
"Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
"Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan.",
]
# Compare the original model with this sentence-transformers version.
model_a = AutoModel.from_pretrained("kakaocorp/kanana-nano-2.1b-embedding", trust_remote_code=True).to("cpu")
model_b = SentenceTransformer("datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True)
# Original encoding method (manual L2 normalization required).
max_length = 512
query_embeddings = model_a.encode(queries, instruction=instruction, max_length=max_length)
passage_embeddings = model_a.encode(passages, instruction="", max_length=max_length)
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)
scores_a = (query_embeddings @ passage_embeddings.T) * 100
# sentence-transformers compatible encoding method (normalization is built in).
query_embeddings = model_b.encode(queries, prompt_name="query")
passage_embeddings = model_b.encode(passages)
scores_b = (query_embeddings @ passage_embeddings.T) * 100
# The two methods produce the same similarity scores.
np.allclose(scores_a.cpu().numpy(), scores_b)
# True
```
Note: Unlike the original model, you don't need to perform L2 normalization manually; it is handled by the `KananaEmbeddingWrapper` module during the forward pass.
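Because the outputs are already unit-length, cosine similarity reduces to the raw dot product used above. If you are on sentence-transformers >= 3.0 (an assumption; older releases lack this helper), the model's built-in `similarity` method gives the same result, reusing the variables from the comparison snippet:
```python
# Cosine similarity via the built-in helper (sentence-transformers >= 3.0).
# Equivalent to query_embeddings @ passage_embeddings.T here, because the
# wrapper already L2-normalizes every embedding.
scores = model_b.similarity(query_embeddings, passage_embeddings)
print(scores)
```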
## License
This model is licensed under [CC-BY-NC-4.0](https://spdx.org/licenses/CC-BY-NC-4.0).
## Citation
If you use this model, please cite the original work:
```bibtex
@misc{kananallmteam2025kananacomputeefficientbilinguallanguage,
title={Kanana: Compute-efficient Bilingual Language Models},
author={Kanana LLM Team and Yunju Bak and Hojin Lee and Minho Ryu and Jiyeon Ham and Seungjae Jung and Daniel Wontae Nam and Taegyeong Eo and Donghun Lee and Doohae Jung and Boseop Kim and Nayeon Kim and Jaesun Park and Hyunho Kim and Hyunwoong Ko and Changmin Lee and Kyoung-Woon On and Seulye Baeg and Junrae Cho and Sunghee Jung and Jieun Kang and EungGyun Kim and Eunhwa Kim and Byeongil Ko and Daniel Lee and Minchul Lee and Miok Lee and Shinbok Lee and Gaeun Seo},
year={2025},
eprint={2502.18934},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18934},
}
```
## Acknowledgements
- Original model developed by the Kanana LLM Team at Kakao
- Adaptation to sentence-transformers format by datalama