---
language:
- en
- ko
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model_id: datalama/kanana-nano-2.1b-embedding
repo: datalama/kanana-nano-2.1b-embedding
developers: datalama
license: cc-by-nc-4.0
---

# Sentence-Transformers Compatible Kanana-Nano-2.1b-Embedding

This repository contains a sentence-transformers compatible version of the [Kanana-Nano-2.1b-Embedding](https://huggingface.co/kakaocorp/kanana-nano-2.1b-embedding) model developed by Kakao.

For detailed information about the model architecture, training methodology, and comprehensive performance benchmarks, please refer to the [original model repository](https://huggingface.co/kakaocorp/kanana-nano-2.1b-embedding) and the [Kanana technical report](https://arxiv.org/abs/2502.18934).

## Key Adaptations

This version has been modified to work with the sentence-transformers library. The changes are:

* Implemented a `KananaEmbeddingWrapper` module so the model can be loaded via `SentenceTransformer`.
* Added L2 normalization inside the `KananaEmbeddingWrapper` forward method.
* Fixed `max_seq_length` at 8192.
* Embedded the query prompt template into the model, so queries can be encoded with `prompt_name="query"`.
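
To make the L2 normalization step concrete, here is a minimal NumPy sketch. It is not the `KananaEmbeddingWrapper` implementation; the `l2_normalize` helper and the toy vectors are assumptions made for illustration. It shows why normalized embeddings are convenient: once every row has unit length, a plain dot product equals cosine similarity.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


# Toy vectors standing in for raw (unnormalized) model outputs.
raw = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
unit = l2_normalize(raw)

# Every row now has length 1.
row_norms = np.linalg.norm(unit, axis=1)

# For unit vectors, the dot product is the cosine similarity.
dot_scores = unit @ unit.T
raw_norms = np.linalg.norm(raw, axis=1)
cosine_scores = (raw @ raw.T) / (raw_norms[:, None] * raw_norms[None, :])

print(np.allclose(dot_scores, cosine_scores))
```

Because the wrapper normalizes in its forward pass, embeddings returned by `model.encode` can be compared with a matrix product directly.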

## Usage

### Installation

```bash
pip install sentence-transformers
```

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True)

# Encode sentences
sentences = [
    "이 문장은 한국어로 작성되었습니다.",  # "This sentence is written in Korean."
    "This sentence is written in English.",
]

embeddings = model.encode(sentences)
```

### Advanced Usage with Query/Passage Format

* Queries can be encoded with either `prompt_name` (the predefined template) or `prompt` (an explicit prompt string).

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True)

# For retrieval tasks
instruction = "Given a question, retrieve passages that answer the question"
queries = [
    "are judo throws allowed in wrestling?",
    "how to become a radiology technician in michigan?",
]

# Option 1: encode queries by prompt_name, using the predefined prompt template.
embedding_a = model.encode(queries, prompt_name="query")

# Option 2: encode queries by passing the prompt string directly.
prompt_template = """Instruct: {instruction}\nQuery: """
embedding_b = model.encode(queries, prompt=prompt_template.format(instruction=instruction))

# Both options should produce the same embeddings.
print(np.allclose(embedding_a, embedding_b))
# True
```
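
The relationship between `prompt` and the input text can be sketched in plain Python. In sentence-transformers, a prompt is prepended to each input string before tokenization; the helper names below (`build_query_prompt`, `apply_prompt`) are hypothetical and only mimic that step. The sketch shows the text the tokenizer receives, not how the wrapper treats instruction tokens afterwards.

```python
# Template copied from the example above; the helper names are hypothetical.
PROMPT_TEMPLATE = "Instruct: {instruction}\nQuery: "


def build_query_prompt(instruction: str) -> str:
    """Fill the query template with a task instruction."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def apply_prompt(texts: list[str], prompt: str) -> list[str]:
    """Mimic prompt prepending: the prompt is placed in front of every input."""
    return [prompt + text for text in texts]


instruction = "Given a question, retrieve passages that answer the question"
queries = ["are judo throws allowed in wrestling?"]

prompted = apply_prompt(queries, build_query_prompt(instruction))
print(prompted[0])
```

Passages are encoded without a prompt, which in this sketch corresponds to an empty prompt string.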

* Compare the embeddings with those from the original code.

```python
import torch.nn.functional as F
import numpy as np
from transformers import AutoModel
from sentence_transformers import SentenceTransformer

# For retrieval tasks
instruction = "Given a question, retrieve passages that answer the question"
queries = [
    "are judo throws allowed in wrestling?",
    "how to become a radiology technician in michigan?",
]

passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan.",
]

# Load the original model and this sentence-transformers compatible model.
model_a = AutoModel.from_pretrained("kakaocorp/kanana-nano-2.1b-embedding", trust_remote_code=True).to("cpu")
model_b = SentenceTransformer("datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True)

# Original encoding method: embeddings must be L2-normalized manually.
max_length = 512
query_embeddings = model_a.encode(queries, instruction=instruction, max_length=max_length)
passage_embeddings = model_a.encode(passages, instruction="", max_length=max_length)

query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

scores_a = (query_embeddings @ passage_embeddings.T) * 100

# sentence-transformers compatible encoding method: normalization is built in.
query_embeddings = model_b.encode(queries, prompt_name="query")
passage_embeddings = model_b.encode(passages)

scores_b = (query_embeddings @ passage_embeddings.T) * 100

# Compare the similarity scores from both models.
print(np.allclose(scores_a.detach().cpu().numpy(), scores_b))
# True
```

Note: Unlike with the original model, you do not need to apply L2 normalization manually, because the `KananaEmbeddingWrapper` module handles it during the forward pass.
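
For retrieval, a score matrix like `scores_a` or `scores_b` is typically turned into a per-query ranking of passages. The following is a minimal sketch of that step, assuming already L2-normalized embeddings; the vectors are hypothetical toy values chosen for readability, not model output.

```python
import numpy as np

# Toy, already L2-normalized embeddings (hypothetical values, not model output).
query_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
passage_embeddings = np.array([[0.6, 0.8], [0.8, 0.6], [0.0, 1.0]])

# Same scoring as above: cosine similarity scaled by 100.
scores = (query_embeddings @ passage_embeddings.T) * 100

# Passage indices for each query, best match first.
ranking = np.argsort(-scores, axis=1)
best_passage = ranking[:, 0]

print(best_passage.tolist())
```

Each row of `ranking` lists passage indices in descending score order for one query, so `ranking[:, :k]` gives the top `k` candidates.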

## License

This model is licensed under [CC-BY-NC-4.0](https://spdx.org/licenses/CC-BY-NC-4.0).

## Citation

If you use this model, please cite the original work:

```
@misc{kananallmteam2025kananacomputeefficientbilinguallanguage,
      title={Kanana: Compute-efficient Bilingual Language Models},
      author={Kanana LLM Team and Yunju Bak and Hojin Lee and Minho Ryu and Jiyeon Ham and Seungjae Jung and Daniel Wontae Nam and Taegyeong Eo and Donghun Lee and Doohae Jung and Boseop Kim and Nayeon Kim and Jaesun Park and Hyunho Kim and Hyunwoong Ko and Changmin Lee and Kyoung-Woon On and Seulye Baeg and Junrae Cho and Sunghee Jung and Jieun Kang and EungGyun Kim and Eunhwa Kim and Byeongil Ko and Daniel Lee and Minchul Lee and Miok Lee and Shinbok Lee and Gaeun Seo},
      year={2025},
      eprint={2502.18934},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18934},
}
```

## Acknowledgements

- Original model developed by the Kanana LLM Team at Kakao
- Adaptation to sentence-transformers format by datalama