	---
	language:
	- en
	- ko
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	model_id: datalama/kanana-nano-2.1b-embedding
	repo: datalama/kanana-nano-2.1b-embedding
	developers: datalama
	license: cc-by-nc-4.0
	---

	# Sentence-Transformers Compatible Kanana-Nano-2.1b-Embedding

	This repository contains a sentence-transformers compatible version of the [Kanana-Nano-2.1b-Embedding](https://huggingface.co/kakaocorp/kanana-nano-2.1b-embedding) model developed by Kakao.

	For detailed information about the model architecture, training methodology, and comprehensive performance benchmarks, please refer to the [original model repository](https://huggingface.co/kakaocorp/kanana-nano-2.1b-embedding) and the [Kanana technical report](https://arxiv.org/abs/2502.18934).

	## Key Adaptations

	This version works with the sentence-transformers library through the following changes:

	* Implemented a `KananaEmbeddingWrapper` module so the model can be loaded via `SentenceTransformer`
	* Added L2 normalization within the `KananaEmbeddingWrapper`'s forward method
	* Fixed `max_seq_length` at 8192
	* Built the query prompt template into the model, so queries can be encoded with `prompt_name="query"`
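
As an illustration of the L2 normalization mentioned above, here is a minimal NumPy sketch. It is not the wrapper's actual code; the `l2_normalize` helper and the toy vectors are assumptions made for this example. The point it shows is that once every row has unit length, a plain dot product equals cosine similarity.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper, for illustration)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return vectors / np.maximum(norms, eps)


# Toy 3-dimensional "embeddings"; real model outputs are much larger.
raw = np.array([[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]])
unit = l2_normalize(raw)

# Every row now has length 1.
row_norms = np.linalg.norm(unit, axis=1)

# For unit vectors, the dot product is the cosine similarity.
dot = float(unit[0] @ unit[1])
cosine = float(raw[0] @ raw[1] / (np.linalg.norm(raw[0]) * np.linalg.norm(raw[1])))
```

Because the normalization happens inside the model, similarity scores can be computed with a matrix product of the returned embeddings.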

	## Usage

	### Installation

	```bash
	pip install sentence-transformers
	```

	### Basic Usage

	```python
	from sentence_transformers import SentenceTransformer

	# Load the model
	model = SentenceTransformer("datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True)

	# Encode sentences
	sentences = [
	    "이 문장은 한국어로 작성되었습니다.",
	    "This sentence is written in English.",
	]

	embeddings = model.encode(sentences)
	```

	### Advanced Usage with Query/Passage Format

	* Queries can be encoded with either `prompt_name` (the predefined template) or `prompt` (an explicit prompt string).

	```python
	import numpy as np
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True)

	# For retrieval tasks
	instruction = "Given a question, retrieve passages that answer the question"
	queries = [
	    "are judo throws allowed in wrestling?",
	    "how to become a radiology technician in michigan?",
	]

	# Encode the queries by prompt_name, using the predefined prompt template.
	embedding_a = model.encode(queries, prompt_name="query")

	# Or pass the prompt string directly.
	prompt_template = """Instruct: {instruction}\nQuery: """
	embedding_b = model.encode(queries, prompt=prompt_template.format(instruction=instruction))

	# Both approaches produce the same embeddings.
	np.allclose(embedding_a, embedding_b)
	# True
	```
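
The two calls agree because sentence-transformers prepends the prompt string to each input text before tokenization, and `prompt_name` looks that string up in the model's prompt dictionary. The sketch below mimics the lookup in plain Python. The `resolve_inputs` helper and the `PROMPTS` dictionary are hypothetical stand-ins, and the template assumed for the `query` entry is the one used in the example above.

```python
from typing import Dict, List, Optional

# Assumed contents of the model's prompt dictionary (illustrative only).
INSTRUCTION = "Given a question, retrieve passages that answer the question"
PROMPTS: Dict[str, str] = {"query": f"Instruct: {INSTRUCTION}\nQuery: "}


def resolve_inputs(
    texts: List[str],
    prompt_name: Optional[str] = None,
    prompt: Optional[str] = None,
) -> List[str]:
    """Return the strings that would be tokenized (hypothetical helper)."""
    if prompt is None and prompt_name is not None:
        prompt = PROMPTS[prompt_name]  # look up the predefined template
    if prompt is None:
        return list(texts)  # passages: no prompt is added
    return [prompt + text for text in texts]


queries = ["are judo throws allowed in wrestling?"]
by_name = resolve_inputs(queries, prompt_name="query")
by_prompt = resolve_inputs(queries, prompt=PROMPTS["query"])
```

Passages are encoded without any prompt, which corresponds to calling the helper with neither argument.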

	* Compare the embeddings with those produced by the original model code.

	```python
	import torch.nn.functional as F
	import numpy as np
	from transformers import AutoModel
	from sentence_transformers import SentenceTransformer

	# For retrieval tasks
	instruction = "Given a question, retrieve passages that answer the question"
	queries = [
	    "are judo throws allowed in wrestling?",
	    "how to become a radiology technician in michigan?",
	]

	passages = [
	    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
	    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan.",
	]

	# Compare the original model and this model.
	model_a = AutoModel.from_pretrained("kakaocorp/kanana-nano-2.1b-embedding", trust_remote_code=True).to("cpu")
	model_b = SentenceTransformer("datalama/kanana-nano-2.1b-embedding", device="cpu", trust_remote_code=True)

	# original encoding method.
	max_length = 512
	query_embeddings = model_a.encode(queries, instruction=instruction, max_length=max_length)
	passage_embeddings = model_a.encode(passages, instruction="", max_length=max_length)

	query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
	passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

	scores_a = (query_embeddings @ passage_embeddings.T) * 100

	# sentence_transformers compatible encoding method.
	query_embeddings = model_b.encode(queries, prompt_name="query")
	passage_embeddings = model_b.encode(passages)

	scores_b = (query_embeddings @ passage_embeddings.T) * 100

	# Compare the similarity scores from both models.
	np.allclose(scores_a.cpu().numpy(), scores_b)
	# True
	```

	Note: Unlike the original model, you don't need to manually perform L2 normalization as this is handled by the `KananaEmbeddingWrapper` module during the forward pass.
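
Once a score matrix such as `scores_b` is available, the best passage for each query can be read off row by row. The sketch below uses a hand-written toy matrix rather than real model output, and `rank_passages` is a hypothetical helper introduced only for this example.

```python
import numpy as np


def rank_passages(scores: np.ndarray, top_k: int = 1) -> np.ndarray:
    """Return, per query row, passage indices sorted by descending score."""
    # argsort is ascending, so negate the scores to get descending order.
    order = np.argsort(-scores, axis=1)
    return order[:, :top_k]


# Toy scores (rows: queries, columns: passages); not actual model output.
toy_scores = np.array([
    [82.0, 31.0],
    [28.0, 77.0],
])

best = rank_passages(toy_scores, top_k=1)
for query_idx, passage_idx in enumerate(best[:, 0]):
    print(f"query {query_idx} -> passage {passage_idx}")
```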

	## License

	This model is licensed under [CC-BY-NC-4.0](https://spdx.org/licenses/CC-BY-NC-4.0).

	## Citation

	If you use this model, please cite the original work:

	```
	@misc{kananallmteam2025kananacomputeefficientbilinguallanguage,
	title={Kanana: Compute-efficient Bilingual Language Models},
	author={Kanana LLM Team and Yunju Bak and Hojin Lee and Minho Ryu and Jiyeon Ham and Seungjae Jung and Daniel Wontae Nam and Taegyeong Eo and Donghun Lee and Doohae Jung and Boseop Kim and Nayeon Kim and Jaesun Park and Hyunho Kim and Hyunwoong Ko and Changmin Lee and Kyoung-Woon On and Seulye Baeg and Junrae Cho and Sunghee Jung and Jieun Kang and EungGyun Kim and Eunhwa Kim and Byeongil Ko and Daniel Lee and Minchul Lee and Miok Lee and Shinbok Lee and Gaeun Seo},
	year={2025},
	eprint={2502.18934},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2502.18934},
	}
	```

	## Acknowledgements

	- Original model developed by the Kanana LLM Team at Kakao
	- Adaptation to sentence-transformers format by datalama