RetrievaEmbedding-01: AMBER

AMBER (Adaptive Multitask Bilingual Embedding Representations) is a text embedding model trained by Retrieva, Inc. It is primarily designed for Japanese but also supports English, and was trained on a variety of Japanese and English datasets.

The model has 315M parameters (large size).

Model Details

Model Description

The AMBER model is a text embedding model based on the sbintuitions/modernbert-ja-310m architecture, designed primarily for Japanese text. It was trained on a variety of Japanese datasets together with English datasets, so it can be used for English text as well. Natural-language prompts (instructions) were included during training, allowing the model to generate embeddings tailored to specific tasks.

  • Developed by: Retrieva, Inc.
  • Model type: Sentence Transformer based on the ModernBERT architecture
  • Language(s) (NLP): Primarily Japanese, with English support
  • License: Apache 2.0
  • Finetuned from model: sbintuitions/modernbert-ja-310m
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
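
These properties can be verified after loading the model; a minimal sketch using the sentence-transformers API (the printed values reflect the specifications above):

from sentence_transformers import SentenceTransformer

# Load the model and inspect the properties listed above.
model = SentenceTransformer("retrieva-jp/amber-large")
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768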

Uses

How to Get Started with the Model

Install Library

First, install the required Python libraries using pip:

pip install sentence-transformers sentencepiece

Run Inference

Then you can load the model and run inference.

You can specify a prompt at inference time via the prompt_name argument of model.encode (for the named prompts stored with the model) or the prompt argument (for a raw instruction string). The prompts used in the Japanese benchmark are described in jmteb/tasks, and the prompts used in the English benchmark are described in mteb/models/retrieva_en.py.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("retrieva-jp/amber-large")
# Run inference
queries = [
    "自然言語処理とはなんですか?",  # "What is natural language processing?"
    "株式会社レトリバについて教えて",  # "Tell me about Retrieva, Inc."
]
documents = [
    "自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。",  # Definition of NLP
    "株式会社レトリバは、自然言語処理と機械学習を核としたAI技術で組織の課題解決を支援するテクノロジー企業である。",  # Description of Retrieva, Inc.
]

# Queries and passages use different task-specific prompts.
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Cosine similarity between every query and every document.
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities.shape)  # torch.Size([2, 2])
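
If none of the named prompts fits your task, model.encode also accepts a raw instruction string via the prompt argument. The instruction text below is a hypothetical example, not one of the model's built-in prompts:

# Pass a raw instruction string instead of a named prompt.
# The instruction text here is a hypothetical example.
embedding = model.encode(
    ["自然言語処理とはなんですか?"],  # "What is natural language processing?"
    prompt="次の質問に関連する文章を検索してください: ",  # "Retrieve passages relevant to the question: "
)
print(embedding.shape)  # (1, 768)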

Training Details

Training Data

We used multiple datasets to train this model. For Japanese, we selected datasets from llm-jp-eval, llm-japanese-dataset, and hpprc/emb. For English, we mainly used some of the datasets utilized in Asai et al. (2023), supplemented with English datasets from the sentence-transformers repository and kilt-tasks. To encourage cross-lingual alignment between Japanese and English, we also used Japanese–English translation datasets.

For Japanese, we also used synthetic data created by an LLM to ensure a sufficient amount of training data.

Evaluation

We evaluated the model on the following benchmarks. Unless otherwise noted, all scores in the tables below were calculated by us.

Japanese Benchmark: JMTEB

Note that Mean (TaskType) in the table below corresponds to Avg. on the original JMTEB leaderboard.

The files used for evaluation are stored in the jmteb directory.

| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|---|
| base models (< 300M) | | | | | | | | | |
| cl-nagoya/ruri-base | 111M | 72.60 | 71.56 | 69.53 | 82.87 | 75.49 | 92.91 | 52.40 | 62.38 |
| AMBER-base | 130M | 72.12 | 72.12 | 73.40 | 77.81 | 76.14 | 93.27 | 48.05 | 64.03 |
| pkshatech/GLuCoSE-base-ja-v2 | 133M | 72.89 | 72.47 | 73.03 | 82.96 | 74.02 | 93.01 | 51.96 | 62.37 |
| pkshatech/RoSEtta-base-ja | 190M | 72.49 | 72.05 | 73.14 | 81.39 | 72.37 | 92.69 | 53.60 | 61.74 |
| intfloat/multilingual-e5-base | 278M | 71.11 | 69.72 | 69.45 | 80.45 | 69.86 | 92.90 | 51.62 | 62.35 |
| large models (> 300M) | | | | | | | | | |
| AMBER-large (this model) | 315M | 72.52 | 73.22 | 75.40 | 79.32 | 77.14 | 93.54 | 48.73 | 60.97 |
| cl-nagoya/ruri-large | 337M | 73.20 | 73.06 | 72.86 | 83.14 | 77.15 | 93.00 | 50.78 | 62.29 |
| intfloat/multilingual-e5-large | 560M | 72.06 | 71.29 | 71.71 | 80.87 | 72.45 | 93.29 | 51.59 | 62.42 |
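
To make the two aggregate columns concrete: Mean (Task) averages over all individual tasks, while Mean (TaskType) first averages within each task type and then averages those per-type means. A minimal sketch with hypothetical scores:

# Hypothetical per-task scores grouped by task type (not real JMTEB numbers).
scores = {
    "Retrieval": [75.4, 71.2],
    "STS": [79.3],
    "Classification": [77.1, 76.8, 74.0],
}

all_scores = [s for group in scores.values() for s in group]
mean_task = sum(all_scores) / len(all_scores)

type_means = [sum(group) / len(group) for group in scores.values()]
mean_tasktype = sum(type_means) / len(type_means)

print(f"Mean (Task): {mean_task:.2f}")          # 75.63
print(f"Mean (TaskType): {mean_tasktype:.2f}")  # 76.19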

Japanese Retrieval Tasks: JQaRA, JaCWIR, MLDR Japanese Subset

The files used for MLDR are stored in the mldr directory.

The prompts used for JQaRA and JaCWIR are the Retrieval-query and Retrieval-passage prompts defined in config_sentence_transformers.json.
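
As a reference for how these retrieval tasks use the model, the following sketch ranks a small corpus with those prompt names (the documents are illustrative placeholders; the top-10 cutoff mirrors the nDCG@10 and MAP@10 columns below):

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-large")

# Illustrative placeholder corpus; the real benchmarks use thousands of passages.
corpus = [
    "自然言語処理は人工知能の一分野である。",  # "NLP is a subfield of AI."
    "東京は日本の首都である。",  # "Tokyo is the capital of Japan."
]
corpus_embeddings = model.encode(corpus, prompt_name="Retrieval-passage")

query_embedding = model.encode(["自然言語処理とは?"], prompt_name="Retrieval-query")

# Rank the corpus by cosine similarity and keep the top-10 hits.
scores = model.similarity(query_embedding, corpus_embeddings)[0]
top = torch.topk(scores, k=min(10, len(corpus)))
for score, idx in zip(top.values, top.indices):
    print(f"{score:.4f}  {corpus[int(idx)]}")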

| Model | # Parameters | JQaRA (nDCG@10) | JaCWIR (MAP@10) | MLDR Japanese Subset (nDCG@10) |
|---|---|---|---|---|
| base models (< 300M) | | | | |
| cl-nagoya/ruri-base | 111M | 58.4 | 83.3 | 32.77 |
| AMBER-base | 130M | 57.1 | 81.6 | 35.69 |
| pkshatech/GLuCoSE-base-ja-v2 | 133M | 60.6 | 85.3 | 33.99 |
| intfloat/multilingual-e5-base | 278M | 47.1 | 85.3 | 25.46 |
| large models (> 300M) | | | | |
| AMBER-large (this model) | 315M | 62.5 | 82.4 | 34.57 |
| cl-nagoya/ruri-large | 337M | 62.8 | 82.5 | 34.78 |
| intfloat/multilingual-e5-large | 560M | 55.4 | 87.3 | 29.95 |
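
For readers unfamiliar with the metric, nDCG@10 rewards placing relevant documents near the top of the ranking, discounting gains logarithmically by rank and normalizing by the ideal ordering. A minimal reference implementation (not the benchmarks' official scoring code):

import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query, given relevance grades in ranked order."""
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# The only relevant document is ranked second: nDCG@10 ≈ 0.63.
print(ndcg_at_k([0, 1, 0, 0]))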

English Benchmark: MTEB(eng, v2)

The files used for evaluation are stored in the mteb directory.

| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification | Summarization |
|---|---|---|---|---|---|---|---|---|---|---|
| base models (< 300M) | | | | | | | | | | |
| AMBER-base | 130M | 54.75 | 58.20 | 40.11 | 81.29 | 70.39 | 42.98 | 42.27 | 80.12 | 26.08 |
| intfloat/multilingual-e5-base | 278M | 56.21 | 59.75 | 43.22 | 80.50 | 73.84 | 43.87 | 42.19 | 83.74 | 26.10 |
| large models (> 300M) | | | | | | | | | | |
| AMBER-large (this model) | 315M | 56.08 | 59.13 | 41.04 | 81.52 | 72.23 | 43.83 | 42.71 | 81.00 | 30.21 |
| intfloat/multilingual-e5-large | 560M | 57.06 | 60.84 | 46.17 | 81.11 | 74.88 | 44.31 | 41.91 | 84.33 | 26.67 |

More Information

TBA

Model Card Authors

Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba

Model Card Contact

pr[at]retrieva.jp
