---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- Italian
---

# ItaLegalEmb_v2 🇮🇹

ItaLegalEmb_v2 is the second version of the ItaLegalEmb family of embedding models. Like its predecessor, it is a specialized embedding model trained on a corpus of Italian legal documents.

ItaLegalEmb_v2 is based on **BAAI/bge-m3**, a state-of-the-art embedding model with strong multilingual capabilities.

Features:

- Dimensions: 1024
- Sequence length: 8192 tokens

**Please note:** <mark>any access request made using an organizational email address automatically grants us permission to list your organization as a user of our products and services on our website. If you do not agree with this policy, we ask that you refrain from requesting access to our materials</mark>.

## Evaluation Results

In our evaluations on this specific domain, **ItaLegalEmb_v2** scores **93%**, while OpenAI embeddings reach 79% and the original **ItaLegalEmb** reaches **85%**.

Since the llama.cpp team released (in early August 2024) a version that supports **XLMRoberta** embedding models, the family ItaLegalEmb_v2 belongs to, a GGUF Q8 version of the model is also included here.

This is a [sentence-transformers](https://www.SBERT.net) model. It can be used for tasks such as clustering or semantic search.

## Usage (Sentence-Transformers)

Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# Example input sentences to embed
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub (repository id taken from the citation URL)
model = SentenceTransformer('Kleva-ai/ItaLegalEmb_v2')

# Encode each sentence into a 1024-dimensional embedding
embeddings = model.encode(sentences)
print(embeddings)
```
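
For semantic search, the embeddings are typically compared by similarity. Because the model's pipeline ends with a `Normalize()` module (see the Full Model Architecture section), its embeddings should have unit length, so a plain dot product equals cosine similarity. The following is a minimal sketch of that ranking step, not the authors' evaluation code: `rank_by_similarity` is a hypothetical helper, and the hand-written unit vectors stand in for real 1024-dimensional `model.encode(...)` output.

```python
def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending dot product.

    Assumes all vectors are already L2-normalized, so the dot product
    is the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy unit vectors standing in for model.encode(...) output (assumption)
query = [0.6, 0.8, 0.0]
docs = [
    [0.0, 0.0, 1.0],    # orthogonal to the query
    [0.8, 0.6, 0.0],    # close to the query
    [-0.6, -0.8, 0.0],  # opposite to the query
]

ranking = rank_by_similarity(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.2f}")
```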
## Training

The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 190 with parameters:
```
{'batch_size': 10, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```
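
To make the loss parameters concrete: this loss treats each (anchor, positive) pair in a batch as the correct match and every other positive in the same batch as a negative, then applies cross-entropy over the scaled cosine similarities. The sketch below is a plain-Python illustration under that reading, using the `scale` of 20.0 listed above. The function names and the 2-dimensional toy vectors are hypothetical, and this is not the library's implementation.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and all other positives act as in-batch negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, pos) for pos in positives]
        # Numerically stable log-sum-exp over the candidate scores
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Negative log-probability assigned to the correct candidate
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive matches its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives paired with the wrong anchor

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```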

Parameters of the `fit()` method:
```
{
    "epochs": 3,
    "evaluation_steps": 50,
    "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 57,
    "weight_decay": 0.01
}
```
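
As a rough illustration of what the `WarmupLinear` scheduler implies for these values, the sketch below computes the learning rate per step: a linear ramp from 0 to `lr` over the warmup steps, then a linear decay to 0. It assumes that the total number of steps is the DataLoader length times the number of epochs (190 × 3 = 570), which is how a `null` value for `steps_per_epoch` is usually resolved. The function name `warmup_linear_lr` is hypothetical.

```python
BASE_LR = 2e-05        # "lr" from optimizer_params
WARMUP_STEPS = 57      # "warmup_steps"
TOTAL_STEPS = 190 * 3  # assumption: DataLoader length x epochs


def warmup_linear_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS,
                     total_steps=TOTAL_STEPS):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 28, 57, 300, 570):
    print(step, warmup_linear_lr(step))
```

Under that assumption, the 57 warmup steps correspond to 10% of the training run.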

## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```
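
The pooling configuration above enables only `pooling_mode_cls_token`, so the sentence embedding is taken from the first (CLS) token's vector and then L2-normalized by the `Normalize()` module. A minimal sketch of those two steps follows. The helper name `cls_pool_and_normalize` and the 2-dimensional toy token vectors are hypothetical stand-ins for the transformer's 1024-dimensional output.

```python
import math


def cls_pool_and_normalize(token_embeddings):
    """Pick the CLS (first) token vector and scale it to unit length.

    token_embeddings: list of per-token vectors, with index 0 being CLS.
    """
    cls_vec = token_embeddings[0]
    norm = math.sqrt(sum(x * x for x in cls_vec))
    return [x / norm for x in cls_vec]


# Toy transformer output for a 3-token input (assumption)
tokens = [[3.0, 4.0], [10.0, 10.0], [-5.0, 2.0]]
sentence_embedding = cls_pool_and_normalize(tokens)
print(sentence_embedding)
```

Only the first vector contributes to the result; the other token vectors are ignored by this pooling mode.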

## Citing & Authors

@misc{ItaLegalEmb,
  title = {Kleva-ai/ItaLegalEmb_v2: An embedding model fine-tuned on Italian legal documents.},
  author = {Obiactum},
  year = {2024},
  publisher = {Kleva-ai},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/Kleva-ai/ItaLegalEmb_v2}},
}