---
license: apache-2.0
datasets:
- ArchitRastogi/Italian-BERT-FineTuning-Embeddings
language:
- it
metrics:
- Recall@1
- Recall@100
- Recall@1000
- Average Precision
- NDCG@10
- NDCG@100
- NDCG@1000
- MRR@10
- MRR@100
- MRR@1000
base_model:
- dbmdz/bert-base-italian-xxl-uncased
new_version: "true"
pipeline_tag: feature-extraction
library_name: transformers
tags:
- information-retrieval
- contrastive-learning
- embeddings
- italian
- fine-tuned
- bert
- retrieval-augmented-generation
model-index:
- name: bert-base-italian-embeddings
  results:
  - task:
      type: information-retrieval
    dataset:
      name: mMARCO
      type: mMARCO
    metrics:
    - name: Recall@1000
      type: Recall
      value: 0.9719
    source:
      name: Fine-tuned Italian BERT Model Evaluation
      url: https://github.com/unicamp-dl/mMARCO
---

# bert-base-italian-embeddings: A Fine-Tuned Italian BERT Model for IR and RAG Applications

## Model Overview

This model is a fine-tuned version of [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased), tailored for Italian-language Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) tasks. It was trained with a contrastive-learning objective to produce high-quality embeddings suitable for both industry and academic applications.

## Model Size

- **Size**: Approximately 450 MB

## Training Details

- **Base Model**: [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased)
- **Dataset**: [Italian-BERT-FineTuning-Embeddings](https://huggingface.co/datasets/ArchitRastogi/Italian-BERT-FineTuning-Embeddings)
  - Derived from the C4 dataset using sliding-window segmentation and in-document sampling
  - **Size**: ~5 GB (4.5 GB train, 0.5 GB test)
- **Training Configuration**:
  - **Hardware**: NVIDIA A40 GPU
  - **Epochs**: 3
  - **Total Steps**: 922,958
  - **Training Time**: Approximately 5 days, 2 hours, and 23 minutes
- **Training Objective**: Contrastive learning (a sketch of the setup follows this list)
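
The exact training script is not published with this card; the following is a minimal sketch of the setup described above, assuming adjacent sliding-window segments of the same document serve as positive pairs and other in-batch examples act as negatives (an InfoNCE-style objective). All function names and hyperparameters here are illustrative.

```python
import torch
import torch.nn.functional as F

def sliding_window_pairs(tokens, window=128, stride=64):
    """Segment one document with a sliding window; adjacent segments
    from the same document act as (anchor, positive) pairs."""
    starts = range(0, max(len(tokens) - window, 0) + 1, stride)
    segments = [tokens[i:i + window] for i in starts]
    return list(zip(segments, segments[1:]))

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss: row i of `anchors` should score highest
    against row i of `positives`; all other rows act as negatives."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature  # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```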

## Evaluation Metrics

Evaluations were performed using the [mMARCO](https://github.com/unicamp-dl/mMARCO) dataset, a multilingual version of MS MARCO. The model was assessed on 6,980 queries.

### Results Comparison

| Metric | Base Model (`dbmdz/bert-base-italian-xxl-uncased`) | `facebook/mcontriever-msmarco` | **Fine-Tuned Model** |
|---------------------|----------------------------------------------------|--------------------------------|----------------------|
| **Recall@1** | 0.0026 | 0.0828 | **0.2106** |
| **Recall@100** | 0.0417 | 0.5028 | **0.8356** |
| **Recall@1000** | 0.2061 | 0.8049 | **0.9719** |
| **Average Precision** | 0.0050 | 0.1397 | **0.3173** |
| **NDCG@10** | 0.0043 | 0.1591 | **0.3601** |
| **NDCG@100** | 0.0108 | 0.2086 | **0.4218** |
| **NDCG@1000** | 0.0299 | 0.2454 | **0.4391** |
| **MRR@10** | 0.0036 | 0.1299 | **0.3047** |
| **MRR@100** | 0.0045 | 0.1385 | **0.3167** |
| **MRR@1000** | 0.0050 | 0.1397 | **0.3173** |

**Note**: The fine-tuned model significantly outperforms both the base model and `facebook/mcontriever-msmarco` across all metrics.
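
For readers reproducing these numbers, standard TREC-style tooling (e.g. `pytrec_eval`) is the usual route. As a rough illustration of what Recall@k and MRR@k measure, a minimal sketch (assuming per-query ranked document IDs and a set of relevant IDs, averaged over all 6,980 queries) might look like:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```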

## Usage

You can load the model with the Hugging Face Transformers library. Since this is an embedding model, load it with `AutoModel` (rather than a task-specific head) and pool the token embeddings into a sentence vector:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model = AutoModel.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")

# Example usage
text = "Stanchi di non riuscire a trovare il partner perfetto?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings (masking padding) into one sentence vector;
# other pooling strategies (e.g. the [CLS] token) are also common.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
```
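
Once sentences are embedded this way, retrieval is plain cosine similarity between a query vector and passage vectors. A small illustrative example (the `embed` helper below simply wraps the pooling code above and is not part of the released model):

```python
import torch
import torch.nn.functional as F

def embed(texts):
    """Mean-pooled sentence embeddings for a batch of texts."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

query = embed(["Come trovare il partner perfetto?"])  # "How to find the perfect partner?"
passages = embed([
    "Stanchi di non riuscire a trovare il partner perfetto?",
    "La ricetta tradizionale della carbonara romana.",
])
scores = F.cosine_similarity(query, passages)  # higher score = more relevant
```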

## Intended Use

This model is intended for:

- Information Retrieval (IR): enhancing search engines and retrieval systems in the Italian language.
- Retrieval-Augmented Generation (RAG): improving the quality of generated content by providing relevant context (see the sketch below).

It is suitable for both industry applications and academic research.
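
In a RAG pipeline, the top-scoring passages retrieved with this model are prepended to the generator's prompt. A minimal sketch, reusing the illustrative `embed` helper from the Usage section (the generator itself is out of scope here):

```python
import torch.nn.functional as F

passage_texts = [
    "Stanchi di non riuscire a trovare il partner perfetto?",
    "La ricetta tradizionale della carbonara romana.",
]
question = "Come trovare il partner perfetto?"

# Rank the passages against the question and keep the best one
scores = F.cosine_similarity(embed([question]), embed(passage_texts))
best = scores.argmax().item()

# Prepend the retrieved passage as context for any Italian-capable generator
prompt = f"Contesto:\n{passage_texts[best]}\n\nDomanda: {question}\nRisposta:"
```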

## Limitations

- The model may inherit biases present in the C4 dataset.
- Performance is primarily evaluated on mMARCO; results may vary with other datasets.

---

## Contact

**Archit Rastogi**

📧 [email protected]