RetrievaEmbedding-01: AMBER

AMBER (Adaptive Multitask Bilingual Embedding Representations) is a text embedding model trained by Retrieva, Inc. It is primarily designed for Japanese but also supports English, and was trained on a variety of Japanese and English datasets.

The model has 315M parameters (large size).

Model Details

Model Description

The AMBER model is a text embedding model based on the sbintuitions/modernbert-ja-310m architecture, designed primarily for Japanese text. It was trained on a variety of Japanese datasets together with English datasets, so it can be used for English text as well. Natural-language prompts (instructions) were included during training, allowing the model to generate embeddings tailored to specific tasks.

  • Developed by: Retrieva, Inc.
  • Model type: Sentence Transformer based on the ModernBERT architecture
  • Language(s) (NLP): Primarily Japanese, with English support
  • License: Apache 2.0
  • Finetuned from model: sbintuitions/modernbert-ja-310m
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
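
These properties can be verified after loading the model; a minimal sketch using the sentence-transformers API (the printed values reflect the specifications above):

from sentence_transformers import SentenceTransformer

# Load the model and inspect the properties listed above.
model = SentenceTransformer("retrieva-jp/amber-large")
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768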

Uses

How to Get Started with the Model

Install Library

First, install the required Python libraries using pip:

pip install sentence-transformers sentencepiece

Run Inference

Then you can load the model and run inference.

You can specify a prompt at inference time via the prompt_name argument of model.encode (for the named prompts stored with the model) or the prompt argument (for a raw instruction string). The prompts used in the Japanese benchmark are described in jmteb/tasks, and the prompts used in the English benchmark are described in mteb/models/retrieva_en.py.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("retrieva-jp/amber-large")
# Run inference
queries = [
    "自然言語処理とはなんですか?",  # "What is natural language processing?"
    "株式会社レトリバについて教えて",  # "Tell me about Retrieva, Inc."
]
documents = [
    "自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。",  # Definition of NLP
    "株式会社レトリバは、自然言語処理と機械学習を核としたAI技術で組織の課題解決を支援するテクノロジー企業である。",  # Description of Retrieva, Inc.
]

# Queries and passages use different task-specific prompts.
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Cosine similarity between every query and every document.
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities.shape)  # torch.Size([2, 2])
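
If none of the named prompts fits your task, model.encode also accepts a raw instruction string via the prompt argument. The instruction text below is a hypothetical example, not one of the model's built-in prompts:

# Pass a raw instruction string instead of a named prompt.
# The instruction text here is a hypothetical example.
embedding = model.encode(
    ["自然言語処理とはなんですか?"],  # "What is natural language processing?"
    prompt="次の質問に関連する文章を検索してください: ",  # "Retrieve passages relevant to the question: "
)
print(embedding.shape)  # (1, 768)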

Training Details

Training Data

We used multiple datasets to train this model. For Japanese, we selected datasets from llm-jp-eval, llm-japanese-dataset, and hpprc/emb. For English, we mainly used some of the datasets utilized in Asai et al. (2023), supplemented with English datasets from the sentence-transformers repository and kilt-tasks. To encourage cross-lingual alignment between Japanese and English, we also used Japanese–English translation datasets.

For Japanese, we also used synthetic data created by an LLM to ensure a sufficient amount of training data.

Evaluation

We evaluated the model on the following benchmarks. Unless otherwise noted, all scores in the tables below were calculated by us.

Japanese Benchmark: JMTEB

Note that Mean (TaskType) in the table below corresponds to Avg. on the original JMTEB leaderboard.

The files used for evaluation are stored in the jmteb directory.

| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|---|
| base models (< 300M) | | | | | | | | | |
| cl-nagoya/ruri-base | 111M | 72.60 | 71.56 | 69.53 | 82.87 | 75.49 | 92.91 | 52.40 | 62.38 |
| AMBER-base | 130M | 72.12 | 72.12 | 73.40 | 77.81 | 76.14 | 93.27 | 48.05 | 64.03 |
| pkshatech/GLuCoSE-base-ja-v2 | 133M | 72.89 | 72.47 | 73.03 | 82.96 | 74.02 | 93.01 | 51.96 | 62.37 |
| pkshatech/RoSEtta-base-ja | 190M | 72.49 | 72.05 | 73.14 | 81.39 | 72.37 | 92.69 | 53.60 | 61.74 |
| intfloat/multilingual-e5-base | 278M | 71.11 | 69.72 | 69.45 | 80.45 | 69.86 | 92.90 | 51.62 | 62.35 |
| large models (> 300M) | | | | | | | | | |
| AMBER-large (this model) | 315M | 72.52 | 73.22 | 75.40 | 79.32 | 77.14 | 93.54 | 48.73 | 60.97 |
| cl-nagoya/ruri-large | 337M | 73.20 | 73.06 | 72.86 | 83.14 | 77.15 | 93.00 | 50.78 | 62.29 |
| intfloat/multilingual-e5-large | 560M | 72.06 | 71.29 | 71.71 | 80.87 | 72.45 | 93.29 | 51.59 | 62.42 |
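
To make the two aggregate columns concrete: Mean (Task) averages over all individual tasks, while Mean (TaskType) first averages within each task type and then averages those per-type means. A minimal sketch with hypothetical scores:

# Hypothetical per-task scores grouped by task type (not real JMTEB numbers).
scores = {
    "Retrieval": [75.4, 71.2],
    "STS": [79.3],
    "Classification": [77.1, 76.8, 74.0],
}

all_scores = [s for group in scores.values() for s in group]
mean_task = sum(all_scores) / len(all_scores)

type_means = [sum(group) / len(group) for group in scores.values()]
mean_tasktype = sum(type_means) / len(type_means)

print(f"Mean (Task): {mean_task:.2f}")          # 75.63
print(f"Mean (TaskType): {mean_tasktype:.2f}")  # 76.19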

Japanese Retrieval Tasks: JQaRA, JaCWIR, MLDR Japanese Subset

The files used for MLDR are stored in the mldr directory.

The prompts used for JQaRA and JaCWIR are the Retrieval-query and Retrieval-passage prompts defined in config_sentence_transformers.json.
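
As a reference for how these retrieval tasks use the model, the following sketch ranks a small corpus with those prompt names (the documents are illustrative placeholders; the top-10 cutoff mirrors the nDCG@10 and MAP@10 columns below):

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-large")

# Illustrative placeholder corpus; the real benchmarks use thousands of passages.
corpus = [
    "自然言語処理は人工知能の一分野である。",  # "NLP is a subfield of AI."
    "東京は日本の首都である。",  # "Tokyo is the capital of Japan."
]
corpus_embeddings = model.encode(corpus, prompt_name="Retrieval-passage")

query_embedding = model.encode(["自然言語処理とは?"], prompt_name="Retrieval-query")

# Rank the corpus by cosine similarity and keep the top-10 hits.
scores = model.similarity(query_embedding, corpus_embeddings)[0]
top = torch.topk(scores, k=min(10, len(corpus)))
for score, idx in zip(top.values, top.indices):
    print(f"{score:.4f}  {corpus[int(idx)]}")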

| Model | # Parameters | JQaRA (nDCG@10) | JaCWIR (MAP@10) | MLDR Japanese Subset (nDCG@10) |
|---|---|---|---|---|
| base models (< 300M) | | | | |
| cl-nagoya/ruri-base | 111M | 58.4 | 83.3 | 32.77 |
| AMBER-base | 130M | 57.1 | 81.6 | 35.69 |
| pkshatech/GLuCoSE-base-ja-v2 | 133M | 60.6 | 85.3 | 33.99 |
| intfloat/multilingual-e5-base | 278M | 47.1 | 85.3 | 25.46 |
| large models (> 300M) | | | | |
| AMBER-large (this model) | 315M | 62.5 | 82.4 | 34.57 |
| cl-nagoya/ruri-large | 337M | 62.8 | 82.5 | 34.78 |
| intfloat/multilingual-e5-large | 560M | 55.4 | 87.3 | 29.95 |
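
For readers unfamiliar with the metric, nDCG@10 rewards placing relevant documents near the top of the ranking, discounting gains logarithmically by rank and normalizing by the ideal ordering. A minimal reference implementation (not the benchmarks' official scoring code):

import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query, given relevance grades in ranked order."""
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# The only relevant document is ranked second: nDCG@10 ≈ 0.63.
print(ndcg_at_k([0, 1, 0, 0]))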

English Benchmark: MTEB(eng, v2)

The files used for evaluation are stored in the mteb directory.

| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification | Summarization |
|---|---|---|---|---|---|---|---|---|---|---|
| base models (< 300M) | | | | | | | | | | |
| AMBER-base | 130M | 54.75 | 58.20 | 40.11 | 81.29 | 70.39 | 42.98 | 42.27 | 80.12 | 26.08 |
| intfloat/multilingual-e5-base | 278M | 56.21 | 59.75 | 43.22 | 80.50 | 73.84 | 43.87 | 42.19 | 83.74 | 26.10 |
| large models (> 300M) | | | | | | | | | | |
| AMBER-large (this model) | 315M | 56.08 | 59.13 | 41.04 | 81.52 | 72.23 | 43.83 | 42.71 | 81.00 | 30.21 |
| intfloat/multilingual-e5-large | 560M | 57.06 | 60.84 | 46.17 | 81.11 | 74.88 | 44.31 | 41.91 | 84.33 | 26.67 |

More Information

TBA

Model Card Authors

Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba

Model Card Contact

pr[at]retrieva.jp
