|
|
|
|
# Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co. |
|
|
|
This framework provides an easy method to compute dense vector representations for **sentences**, **paragraphs**, and **images**. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance on various tasks. Text is embedded in a vector space such that similar texts are close together and can efficiently be found using cosine similarity.
|
|
|
We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases. |
|
|
|
Further, this framework allows easy **[fine-tuning of custom embedding models](https://www.sbert.net/docs/training/overview.html)** to achieve maximal performance on your specific task.
|
|
|
For the **full documentation**, see **[www.SBERT.net](https://www.sbert.net)**. |
|
|
|
The following publications are integrated in this framework: |
|
|
|
- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019) |
|
- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020) |
|
- [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (NAACL 2021) |
|
- [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes](https://arxiv.org/abs/2012.14210) (arXiv 2020) |
|
- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) (arXiv 2021) |
|
- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) (arXiv 2021) |
|
|
|
## Installation |
|
|
|
We recommend **Python 3.6** or higher, **[PyTorch 1.6.0](https://pytorch.org/get-started/locally/)** or higher and **[transformers v4.6.0](https://github.com/huggingface/transformers)** or higher. The code does **not** work with Python 2.7. |
|
|
|
**Install with pip** |
|
|
|
Install *sentence-transformers* with `pip`:
|
|
|
```
pip install -U sentence-transformers
```
|
|
|
**Install with conda** |
|
|
|
You can install *sentence-transformers* with `conda`:
|
|
|
```
conda install -c conda-forge sentence-transformers
```
|
|
|
**Install from source**
|
|
|
Alternatively, you can clone the latest version from the [repository](https://github.com/UKPLab/sentence-transformers) and install it directly from the source code:
|
|
|
```
git clone https://github.com/UKPLab/sentence-transformers
cd sentence-transformers
pip install -e .
```
|
|
|
**PyTorch with CUDA** |
|
|
|
If you want to use a GPU / CUDA, you must install PyTorch with a matching CUDA version. Follow [PyTorch - Get Started](https://pytorch.org/get-started/locally/) for further details on how to install PyTorch.
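
As a minimal sketch: `SentenceTransformer` picks a GPU automatically when one is available, and the optional `device` argument makes the choice explicit.

```python
import torch
from sentence_transformers import SentenceTransformer

# Use the GPU if a CUDA-enabled PyTorch build finds one, otherwise the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
print("Running on:", device)
```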
|
|
|
## Getting Started |
|
|
|
See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documentation.
|
|
|
[This example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/computing-embeddings/computing_embeddings.py) shows you how to use an already trained Sentence Transformer model to embed sentences for another task. |
|
|
|
First, download a pretrained model:
|
|
|
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
```
|
|
|
Then provide some sentences to the model:
|
|
|
```python
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of strings.',
             'The quick brown fox jumps over the lazy dog.']

sentence_embeddings = model.encode(sentences)
```
|
|
|
And that's it. We now have a list of NumPy arrays with the embeddings:
|
|
|
```python
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
```
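
Since similar texts end up close together in vector space, you can compare two of these embeddings with cosine similarity. A minimal sketch using the `util.cos_sim` helper (older releases expose the same function as `util.pytorch_cos_sim`):

```python
from sentence_transformers import util

# Cosine similarity between the first two embeddings from above;
# values close to 1 mean the texts are semantically similar.
similarity = util.cos_sim(sentence_embeddings[0], sentence_embeddings[1])
print("Similarity:", similarity.item())
```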
|
|
|
## Pre-Trained Models |
|
|
|
We provide a large list of [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`. |
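
For example, loading a general-purpose English model and a multilingual one (both listed on the pretrained models page) differs only in the model name:

```python
from sentence_transformers import SentenceTransformer

# General-purpose English model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Multilingual model covering 50+ languages
multilingual_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
```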
|
|
|
[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html) |
|
|
|
## Training |
|
|
|
This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get the best possible sentence embeddings for your specific task.
|
|
|
See [Training Overview](https://www.sbert.net/docs/training/overview.html) for an introduction to training your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) of how to train models on various datasets.
|
|
|
Some highlights are: |
|
- Support for various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
- Multilingual and multi-task learning
- Evaluation during training to find the optimal model
- [10+ loss functions](https://www.sbert.net/docs/package_reference/losses.html), including triplet and contrastive losses, that let you tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, and clustering
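
As a minimal sketch of what fine-tuning looks like, here is a small example using `CosineSimilarityLoss`; the sentence pairs and labels below are toy placeholders, see the Training Overview for real datasets:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from a pretrained model and fine-tune it on sentence pairs
# annotated with similarity scores (placeholder data below).
model = SentenceTransformer('all-MiniLM-L6-v2')
train_examples = [
    InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# One fine-tuning pass; warmup_steps linearly ramps up the learning rate.
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=100)
```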
|
|
|
## Performance |
|
|
|
Our models are evaluated extensively on 15+ datasets, including challenging domains like tweets, Reddit posts, and emails. They achieve by far the **best performance** among all available sentence embedding methods. Further, we provide several **smaller models** that are **optimized for speed**.
|
|
|
[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html) |
|
|
|
## Application Examples |
|
|
|
You can use this framework for: |
|
|
|
- [Computing Sentence Embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html) |
|
- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html) |
|
- [Clustering](https://www.sbert.net/examples/applications/clustering/README.html) |
|
- [Paraphrase Mining](https://www.sbert.net/examples/applications/paraphrase-mining/README.html) |
|
- [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html) |
|
- [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html) |
|
- [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) |
|
- [Text Summarization](https://www.sbert.net/examples/applications/text-summarization/README.html) |
|
- [Multilingual Image Search, Clustering & Duplicate Detection](https://www.sbert.net/examples/applications/image-search/README.html) |
|
|
|
and many more use-cases. |
|
|
|
For all examples, see [examples/applications](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications). |
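
As one example, here is a minimal semantic search sketch using the `util.semantic_search` helper (the corpus and query below are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = ['A man is eating food.',
          'A man is riding a horse.',
          'A monkey is playing drums.']
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode('Someone on horseback', convert_to_tensor=True)

# Retrieve the top 2 most similar corpus entries for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], "(score: %.4f)" % hit['score'])
```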
|
|
|
## Citing & Authors |
|
|
|
If you find this repository helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084): |
|
|
|
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
|
|
|
If you use one of the multilingual models, feel free to cite our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813): |
|
|
|
```bibtex
@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
```
|
|
|
Please have a look at [Publications](https://www.sbert.net/docs/publications.html) for the full list of our publications that are integrated into SentenceTransformers.
|
|
|
Contact person: [Nils Reimers](https://www.nils-reimers.de), [[email protected]](mailto:[email protected]) |
|
|
|
https://www.ukp.tu-darmstadt.de/ |
|
|
|
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
|
|
|
> This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication. |
|
|