---
license: apache-2.0
language:
- en
inference: false
---

<br><br>

<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The text embedding suite trained by Jina AI's Finetuner team.</b>
</p>


## Intended Usage & Model Info

`jina-embedding-b-en-v1` is a language model trained on Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs,
drawn from various domains and carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which Linnaeus-Clean is derived, originally contained 1.6 billion sentence pairs.

The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.

With a standard size of 110 million parameters,
the model enables fast inference while delivering better performance than our small model.
It is recommended to use a single GPU for inference.
Additionally, we provide the following options:

- `jina-embedding-s-en-v1`: 35 million parameters.
- `jina-embedding-b-en-v1`: 110 million parameters  **(you are here)**.
- `jina-embedding-l-en-v1`: 330 million parameters.
- `jina-embedding-xl-en-v1`: 1.2 billion parameters (soon).
- `jina-embedding-xxl-en-v1`: 6 billion parameters (soon).

## Data & Parameters

More information will be released together with the technical report.

## Metrics

We compared the model against `all-minilm-l6-v2`/`all-mpnet-base-v2` from sbert and `text-embedding-ada-002` from OpenAI:

|Name|Parameters|Context length|
|------------------------------|-----|------|
|all-minilm-l6-v2|33M|128|
|all-mpnet-base-v2|110M|128|
|text-embedding-ada-002|unknown (OpenAI API)|8192|
|jina-embedding-s-en-v1|35M|512|
|jina-embedding-b-en-v1|110M|512|
|jina-embedding-l-en-v1|330M|512|


|Name|STS12|STS13|STS14|STS15|STS16|STS17|TRECOVID|Quora|SciFact|
|------------------------------|-----|-----|-----|-----|-----|-----|--------|-----|-----|
|all-minilm-l6-v2|0.724|0.806|0.756|0.854|0.790|0.876|0.473|**0.876**|0.645|
|all-mpnet-base-v2|0.726|**0.835**|**0.780**|0.857|0.800|**0.906**|0.513|0.875|0.656|
|text-embedding-ada-002|0.698|0.833|0.761|**0.861**|**0.860**|0.903|**0.685**|**0.876**|**0.726**|
|jina-embedding-s-en-v1|**0.738**|0.781|0.732|0.833|0.785|0.859|0.471|0.852|0.567|
|jina-embedding-b-en-v1|0.736|0.804|0.745|0.844|0.793|0.873|0.481|0.870|0.616|
|jina-embedding-l-en-v1|0.736|0.832|0.762|0.846|0.805|0.885|0.477|**0.876**|0.650|

For more tasks and metrics, please check out the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) benchmark.
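
To reproduce individual tasks locally, a minimal sketch using the [`mteb`](https://github.com/embeddings-benchmark/mteb) package is shown below. The small wrapper class is an assumption of ours: it simply adapts the `finetuner` encoding calls (see the Usage section) to the `encode` method MTEB expects.

```python
# Minimal sketch: run one MTEB task against this model.
# Assumes `pip install mteb finetuner`.
import numpy as np
import finetuner
from mteb import MTEB


class JinaEmbeddingModel:
    """Adapter exposing the `encode` interface required by MTEB."""

    def __init__(self, name: str = 'jinaai/jina-embedding-b-en-v1'):
        self.model = finetuner.build_model(name)

    def encode(self, sentences, **kwargs):
        # Return one embedding per input sentence as a numpy array.
        return np.asarray(finetuner.encode(model=self.model, data=list(sentences)))


evaluation = MTEB(tasks=['STS12'])
evaluation.run(JinaEmbeddingModel(), output_folder='results')
```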

## Usage

```python
# Install the package first: pip install finetuner
import finetuner

model = finetuner.build_model('jinaai/jina-embedding-b-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
```
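
Building on the same `finetuner` calls, the sketch below ranks a handful of documents against a query by cosine similarity; the corpus texts are made up purely for illustration:

```python
import finetuner

model = finetuner.build_model('jinaai/jina-embedding-b-en-v1')

# Toy corpus, invented for this example.
docs = [
    'The weather is sunny with a light breeze.',
    'Stock markets closed higher on Friday.',
    'Heavy rain is expected over the weekend.',
]
query = 'what will the weather be like'

doc_embeddings = finetuner.encode(model=model, data=docs)
query_embedding = finetuner.encode(model=model, data=[query])[0]

# Rank documents by cosine similarity to the query, highest first.
scored = [(finetuner.cos_sim(query_embedding, emb), doc)
          for emb, doc in zip(doc_embeddings, docs)]
for score, doc in sorted(scored, key=lambda t: t[0], reverse=True):
    print(f'{score:.3f}  {doc}')
```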

## Fine-tuning

To fine-tune this model on your own data, please consider [Finetuner](https://github.com/jina-ai/finetuner).
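
Below is a rough sketch of what a fine-tuning run might look like. The dataset name is a placeholder, and the exact `fit` parameters may differ between Finetuner versions, so treat this as an assumption and consult the Finetuner documentation:

```python
import finetuner

finetuner.login()  # Finetuner runs in Jina AI Cloud; requires an account

# 'my-train-data.csv' is a placeholder for your own labeled pairs.
run = finetuner.fit(
    model='jinaai/jina-embedding-b-en-v1',
    train_data='my-train-data.csv',
    loss='TripletMarginLoss',  # assumed loss name; check the Finetuner docs
)
print(run.name)
print(run.status())
```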

## Plans

1. The development of `jina-embedding-s-en-v2` is currently underway, with two main objectives: improving performance and increasing the maximum sequence length.
2. We are also working on a bilingual English-German embedding model. The upcoming models will be called `jina-embedding-s/b/l-de-v1`.

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.