---
license: apache-2.0
language:
- en
inference: false
---
<br><br>
<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>
<p align="center">
<b>The text embedding suite trained by Jina AI's Finetuner team.</b>
</p>
## Intended Usage & Model Info
`jina-embedding-b-en-v1` is a language model trained on Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs,
drawn from a variety of domains and carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which Linnaeus-Clean is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
With a standard size of 110 million parameters,
the model enables fast inference while delivering better performance than our small model.
It is recommended to use a single GPU for inference.
Additionally, we provide the following options:
- `jina-embedding-s-en-v1`: 35 million parameters.
- `jina-embedding-b-en-v1`: 110 million parameters **(you are here)**.
- `jina-embedding-l-en-v1`: 330 million parameters.
- `jina-embedding-xl-en-v1`: 1.2 billion parameters (soon).
- `jina-embedding-xxl-en-v1`: 6 billion parameters (soon).
## Data & Parameters
More details will be released together with the technical report.
## Metrics
We compared the model against `all-minilm-l6-v2`/`all-mpnet-base-v2` from sbert and `text-embedding-ada-002` from OpenAI:
|Name|Parameters|Context length|
|------------------------------|-----|------|
|all-minilm-l6-v2|33m|128|
|all-mpnet-base-v2|110m|128|
|text-embedding-ada-002|Unknown (OpenAI API)|8192|
|jina-embedding-s-en-v1|35m|512|
|jina-embedding-b-en-v1|110m|512|
|jina-embedding-l-en-v1|330m|512|
|Name|STS12|STS13|STS14|STS15|STS16|STS17|TRECOVID|Quora|SciFact|
|------------------------------|-----|-----|-----|-----|-----|-----|--------|-----|-----|
|all-minilm-l6-v2|0.724|0.806|0.756|0.854|0.79 |0.876|0.473 |**0.876**|0.645 |
|all-mpnet-base-v2|0.726|**0.835**|**0.78** |0.857|0.8 |**0.906**|0.513 |0.875|0.656 |
|text-embedding-ada-002|0.698|0.833|0.761|**0.861**|**0.86** |0.903|**0.685** |**0.876**|**0.726** |
|jina-embedding-s-en-v1|**0.738**|0.781|0.732|0.833|0.785|0.859|0.471 |0.852|0.567 |
|jina-embedding-b-en-v1|0.736|0.804|0.745|0.844|0.793|0.873|0.481 |0.87|0.616 |
|jina-embedding-l-en-v1|0.736|0.832|0.762|0.846|0.805|0.885|0.477 |**0.876**|0.65 |
For more tasks and metrics, please check out the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) benchmark.
## Usage
```python
!pip install finetuner

import finetuner

model = finetuner.build_model('jinaai/jina-embedding-b-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
```
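The score printed above is plain cosine similarity between the two embedding vectors. As a minimal standalone sketch (our own NumPy version, not the Finetuner implementation; the function name simply mirrors the `cos_sim` call above):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of the two L2-normalized vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings:
a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.3, 0.4])
print(cos_sim(a, a))  # identical vectors -> 1.0
print(cos_sim(a, b))
```

Semantically similar sentences such as the two weather queries above should yield a score close to 1.0.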
## Fine-tuning
Please consider [Finetuner](https://github.com/jina-ai/finetuner).
## Plans
1. The development of `jina-embedding-s-en-v2` is currently underway with two main objectives: improving performance and increasing the maximum sequence length.
2. We are currently working on a bilingual embedding model that combines English and German. The upcoming models will be called `jina-embedding-s/b/l-de-v1`.
## Contact
Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.