---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- Italian
---
# ItaLegalEmb_v2 🇮🇹
ItaLegalEmb_v2 is the second version of the ItaLegalEmb family of embedding models. Like its predecessor, it is a specialized embedding model trained on
a corpus of Italian legal documents.
ItaLegalEmb_v2 is based on **BAAI/bge-m3**, a state-of-the-art embedding model with strong multilingual capabilities.
Features:
- Dimensions: 1024
- Sequence length: 8192

**Please note:** <mark>any access request made using an organizational email address automatically grants us permission to list your organization as a user of our products and services on our website. If you do not agree with this policy, we ask that you refrain from requesting access to our materials</mark>.
## Evaluation Results
In our evaluations on this specific domain, **ItaLegalEmb_v2** scores **93%**, compared with 79% for OpenAI and **85%** for **ItaLegalEmb**.
Since the llama.cpp team released (in early August 2024) a version that supports **XLMRoberta** embedding models, the family ItaLegalEmb_v2 belongs to, a GGUF Q8 version
of the model is also included here 😉.
This is a [sentence-transformers](https://www.SBERT.net) model: it maps text to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.
## Usage (Sentence-Transformers)
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
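from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]
# Hugging Face repository id of this model (access must be requested first).
model = SentenceTransformer("Kleva-ai/ItaLegalEmb_v2")
# encode() returns one 1024-dimensional vector per input sentence.
embeddings = model.encode(sentences)
print(embeddings)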
```
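For the semantic search use case mentioned above, a minimal sketch follows. It assumes only that `model` exposes an `encode()` method returning one vector per input string, as a loaded `SentenceTransformer` does. The helper names `cosine_similarity` and `semantic_search` are illustrative and not part of any library. Plain Python is used so the arithmetic stays visible; on real tensors, `sentence_transformers.util.cos_sim` would be the usual choice.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    return dot / (norm_u * norm_v)


def semantic_search(model, query, documents, top_k=3):
    """Rank `documents` by similarity to `query`.

    `model` is any object with an `encode(list_of_str)` method that returns
    one vector per input string (hypothetical interface for this sketch).
    """
    vectors = model.encode([query] + list(documents))
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [
        (doc, cosine_similarity(query_vec, vec))
        for doc, vec in zip(documents, doc_vecs)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With the model loaded as above, `semantic_search(model, query, documents)` would return up to three `(document, score)` pairs, where `query` and `documents` are your own Italian legal passages.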
## Training
The model was trained with the following parameters:
**DataLoader**:
`torch.utils.data.dataloader.DataLoader` of length 190 with parameters:
```
{'batch_size': 10, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
**Loss**:
`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```
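To make the loss settings concrete, here is a small sketch of what a multiple-negatives ranking loss computes with `scale=20.0` and cosine similarity: for each anchor, the paired positive is the correct "class" and every other positive in the batch serves as a negative. This is an illustration in plain Python under that reading, not the library's own code; the function names are chosen for this example.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and all other positives[j] in the batch act as in-batch negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarities of this anchor against every candidate in the batch.
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the true pair.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

With a batch size of 10, each anchor is therefore contrasted against 9 in-batch negatives per step. The scale factor sharpens the softmax, since cosine similarities alone only span the range from -1 to 1.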
**Parameters of the `fit()` method**:
```
{
"epochs": 3,
"evaluation_steps": 50,
"evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 57,
"weight_decay": 0.01
}
```
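The learning-rate schedule implied by these parameters can be sketched as below. Two assumptions are made here: that `WarmupLinear` means a linear rise from 0 to `lr` over `warmup_steps` followed by a linear decay to 0, and that the total number of steps is 190 batches × 3 epochs = 570 because `steps_per_epoch` is `null`. The function name is illustrative.

```python
def warmup_linear_lr(step, base_lr=2e-05, warmup_steps=57, total_steps=570):
    """Learning rate at optimizer step `step` under linear warmup, then linear decay.

    total_steps=570 is an assumption: 190 batches per epoch times 3 epochs.
    """
    if step < warmup_steps:
        # Warmup phase: rise linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

Under these assumptions the warmup covers the first 10% of training (57 of 570 steps), and the peak rate of 2e-05 is reached at step 57.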
## Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
(2): Normalize()
)
```
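The last two modules of the architecture are simple enough to sketch by hand: CLS pooling keeps only the first token's vector, and `Normalize()` scales it to unit L2 norm. The snippet below is a plain-Python illustration for a single sequence; the function name and the list-of-lists input format are assumptions of this example, not the library's interface.

```python
import math


def cls_pool_and_normalize(token_embeddings):
    """Take the first (CLS) token vector of one sequence and scale it to unit L2 norm.

    `token_embeddings` is a list of per-token vectors; the first entry is
    assumed to be the CLS token, as in XLM-RoBERTa style models.
    """
    cls_vector = token_embeddings[0]
    norm = math.sqrt(sum(x * x for x in cls_vector))
    if norm == 0.0:
        # Avoid division by zero for a degenerate all-zero vector.
        return list(cls_vector)
    return [x / norm for x in cls_vector]
```

Because the output vectors have unit length, the dot product of two embeddings equals their cosine similarity, which is convenient for vector stores that only support inner-product search.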
## Citing & Authors
@misc{ItaLegalEmb,
title = {Kleva-ai/ItaLegalEmb_v2: An embedding model fine-tuned on Italian legal documents.},
author = {Obiactum},
year = {2024},
publisher = {Kleva-ai},
journal = {HuggingFace repository},
howpublished = {\url{https://huggingface.co/Kleva-ai/ItaLegalEmb_v2}},
}