---
license: apache-2.0
datasets:
- ArchitRastogi/Italian-BERT-FineTuning-Embeddings
language:
- it
metrics:
- Recall@1
- Recall@100
- Recall@1000
- Average Precision
- NDCG@10
- NDCG@100
- NDCG@1000
- MRR@10
- MRR@100
- MRR@1000
base_model:
- dbmdz/bert-base-italian-xxl-uncased
new_version: "true"
pipeline_tag: feature-extraction
library_name: transformers
tags:
- information-retrieval
- contrastive-learning
- embeddings
- italian
- fine-tuned
- bert
- retrieval-augmented-generation
model-index:
  - name: bert-base-italian-embeddings
    results:
      - task:
          type: information-retrieval
        dataset:
          name: mMARCO
          type: mMARCO
        metrics:
          - name: Recall@1000
            type: Recall
            value: 0.9719
        source:
          name: Fine-tuned Italian BERT Model Evaluation
          url: https://github.com/unicamp-dl/mMARCO
---


# bert-base-italian-embeddings: A Fine-Tuned Italian BERT Model for IR and RAG Applications

## Model Overview

This model is a fine-tuned version of [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased), tailored for Italian-language Information Retrieval (IR) and Retrieval-Augmented Generation (RAG). It was trained with contrastive learning to produce sentence embeddings suitable for both industry and academic applications.

## Model Size

- **Size**: Approximately 450 MB

## Training Details

- **Base Model**: [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased)
- **Dataset**: [Italian-BERT-FineTuning-Embeddings](https://huggingface.co/datasets/ArchitRastogi/Italian-BERT-FineTuning-Embeddings)
  - Derived from the C4 dataset using sliding-window segmentation and in-document sampling (see the sketch after this list).
  - **Size**: ~5GB (4.5GB train, 0.5GB test)
- **Training Configuration**:
  - **Hardware**: NVIDIA A40 GPU
  - **Epochs**: 3
  - **Total Steps**: 922,958
  - **Training Time**: Approximately 5 days, 2 hours, and 23 minutes
- **Training Objective**: Contrastive Learning
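
The pair construction and training objective above can be illustrated with a minimal sketch. It assumes a standard InfoNCE-style objective (the card only states "contrastive learning"), and the window, stride, and temperature values are illustrative, not the actual training settings:

```python
import random
import torch
import torch.nn.functional as F

def sliding_windows(text, window=128, stride=64):
    """Split a document into overlapping word-level segments."""
    words = text.split()
    last_start = max(len(words) - window, 0)
    for start in range(0, last_start + 1, stride):
        yield " ".join(words[start:start + window])

def in_document_pair(document):
    """Sample two segments of the same document as a positive pair."""
    segments = list(sliding_windows(document))
    return random.sample(segments, 2) if len(segments) >= 2 else None

def info_nce_loss(queries, keys, temperature=0.05):
    """InfoNCE loss: row i of `queries` should match row i of `keys`;
    all other rows in the batch act as in-batch negatives."""
    logits = F.normalize(queries, dim=-1) @ F.normalize(keys, dim=-1).T
    labels = torch.arange(queries.size(0))
    return F.cross_entropy(logits / temperature, labels)
```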

## Evaluation Metrics

Evaluations were performed using the [mMARCO](https://github.com/unicamp-dl/mMARCO) dataset, a multilingual version of MS MARCO. The model was assessed on 6,980 queries.
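
For reference, the cutoff metrics reported below can be computed per query and averaged, as in this minimal sketch (binary relevance judgments assumed, as in MS MARCO-style evaluation; toolkits such as `pytrec_eval` implement the full set):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents retrieved in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """NDCG at k with binary relevance."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    idcg = sum(1.0 / math.log2(rank + 1)
               for rank in range(1, min(len(relevant), k) + 1))
    return dcg / idcg if idcg > 0 else 0.0
```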

### Results Comparison

| Metric              | Base Model (`dbmdz/bert-base-italian-xxl-uncased`) | `facebook/mcontriever-msmarco` | **Fine-Tuned Model** |
|---------------------|----------------------------------------------------|--------------------------------|----------------------|
| **Recall@1**        | 0.0026                                             | 0.0828                         | **0.2106**           |
| **Recall@100**      | 0.0417                                             | 0.5028                         | **0.8356**           |
| **Recall@1000**     | 0.2061                                             | 0.8049                         | **0.9719**           |
| **Average Precision** | 0.0050                                          | 0.1397                         | **0.3173**           |
| **NDCG@10**         | 0.0043                                             | 0.1591                         | **0.3601**           |
| **NDCG@100**        | 0.0108                                             | 0.2086                         | **0.4218**           |
| **NDCG@1000**       | 0.0299                                             | 0.2454                         | **0.4391**           |
| **MRR@10**          | 0.0036                                             | 0.1299                         | **0.3047**           |
| **MRR@100**         | 0.0045                                             | 0.1385                         | **0.3167**           |
| **MRR@1000**        | 0.0050                                             | 0.1397                         | **0.3173**           |

**Note**: The fine-tuned model substantially outperforms both the base model and `facebook/mcontriever-msmarco` on every reported metric.

## Usage

You can load the model with the Hugging Face Transformers library and pool its token embeddings into a sentence embedding:
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model = AutoModel.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")

# Example usage
text = "Stanchi di non riuscire a trovare il partner perfetto?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings (ignoring padding) into one sentence vector;
# mean pooling is a common choice for contrastively trained encoders
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
```

## Intended Use
This model is intended for:

- Information Retrieval (IR): Enhancing search engines and retrieval systems in the Italian language.
- Retrieval-Augmented Generation (RAG): Improving the quality of generated content by providing relevant context.

Suitable for both industry applications and academic research.
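
As a sketch of the IR/RAG use case, passages can be ranked by cosine similarity between their embeddings and the query embedding; the top passage then serves as context for generation. Mean pooling follows the usage snippet above, and the query and passages are made-up examples:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model = AutoModel.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")

def embed(texts):
    """Mean-pooled sentence embeddings for a batch of texts."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

query = "Qual è la capitale d'Italia?"
passages = [
    "Roma è la capitale della Repubblica Italiana.",
    "Il Po è il fiume più lungo d'Italia.",
]
scores = F.normalize(embed([query]), dim=-1) @ F.normalize(embed(passages), dim=-1).T
best = passages[scores.argmax().item()]  # highest-scoring passage feeds the RAG prompt
```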

## Limitations
- The model may inherit biases present in the C4 dataset.
- Performance is primarily evaluated on mMARCO; results may vary with other datasets.

---

## Contact

**Archit Rastogi**  
📧 [email protected]