---
license: apache-2.0
datasets:
- ArchitRastogi/Italian-BERT-FineTuning-Embeddings
language:
- it
metrics:
- Recall@1
- Recall@100
- Recall@1000
- Average Precision
- NDCG@10
- NDCG@100
- NDCG@1000
- MRR@10
- MRR@100
- MRR@1000
base_model:
- dbmdz/bert-base-italian-xxl-uncased
new_version: "true"
pipeline_tag: feature-extraction
library_name: transformers
tags:
- information-retrieval
- contrastive-learning
- embeddings
- italian
- fine-tuned
- bert
- retrieval-augmented-generation
model-index:
  - name: bert-base-italian-embeddings
    results:
      - task:
          type: information-retrieval
        dataset:
          name: mMARCO
          type: mMARCO
        metrics:
          - name: Recall@1000
            type: Recall
            value: 0.9719
          - name: NDCG@1000
            type: Normalized Discounted Cumulative Gain
            value: 0.4391
          - name: Average Precision
            type: Average Precision
            value: 0.3173
        source:
          name: Fine-tuned Italian BERT Model Evaluation
          url: https://github.com/unicamp-dl/mMARCO
---


# bert-base-italian-embeddings: A Fine-Tuned Italian BERT Model for IR and RAG Applications

## Model Overview

This model is a fine-tuned version of [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased), tailored for Italian-language Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) tasks. It was trained with contrastive learning to produce high-quality embeddings suitable for both industry and academic applications.

## Model Size

- **Size**: Approximately 450 MB

## Training Details

- **Base Model**: [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased)
- **Dataset**: [Italian-BERT-FineTuning-Embeddings](https://huggingface.co/datasets/ArchitRastogi/Italian-BERT-FineTuning-Embeddings)
  - Derived from the C4 dataset using sliding window segmentation and in-document sampling.
  - **Size**: ~5GB (4.5GB train, 0.5GB test)
- **Training Configuration**:
  - **Hardware**: NVIDIA A40 GPU
  - **Epochs**: 3
  - **Total Steps**: 922,958
  - **Training Time**: Approximately 5 days, 2 hours, and 23 minutes
- **Training Objective**: Contrastive learning (see the sketch below)
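
The card states the objective only as contrastive learning. As one concrete illustration (not the exact training code), a common recipe for embedding models is an InfoNCE-style loss with in-batch negatives, where every other passage in the batch acts as a negative for a given query:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb, p_emb, temperature=0.05):
    """Illustrative contrastive loss with in-batch negatives.

    q_emb, p_emb: (batch, dim) embeddings of positive query/passage pairs.
    The temperature value is a typical default, not taken from this model's
    training setup.
    """
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                        # (batch, batch) similarities
    targets = torch.arange(q.size(0), device=logits.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```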

## Evaluation Metrics

Evaluations were performed using the [mMARCO](https://github.com/unicamp-dl/mMARCO) dataset, a multilingual version of MS MARCO. The model was assessed on 6,980 queries.

### Results Comparison

| Metric              | Base Model (`dbmdz/bert-base-italian-xxl-uncased`) | `facebook/mcontriever-msmarco` | **Fine-Tuned Model** |
|---------------------|----------------------------------------------------|--------------------------------|----------------------|
| **Recall@1**        | 0.0026                                             | 0.0828                         | **0.2106**           |
| **Recall@100**      | 0.0417                                             | 0.5028                         | **0.8356**           |
| **Recall@1000**     | 0.2061                                             | 0.8049                         | **0.9719**           |
| **Average Precision** | 0.0050                                          | 0.1397                         | **0.3173**           |
| **NDCG@10**         | 0.0043                                             | 0.1591                         | **0.3601**           |
| **NDCG@100**        | 0.0108                                             | 0.2086                         | **0.4218**           |
| **NDCG@1000**       | 0.0299                                             | 0.2454                         | **0.4391**           |
| **MRR@10**          | 0.0036                                             | 0.1299                         | **0.3047**           |
| **MRR@100**         | 0.0045                                             | 0.1385                         | **0.3167**           |
| **MRR@1000**        | 0.0050                                             | 0.1397                         | **0.3173**           |

**Note**: The fine-tuned model significantly outperforms both the base model and `facebook/mcontriever-msmarco` across all metrics.
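
For reference, the rank-based metrics above can be computed directly from per-query rankings. A minimal sketch, assuming each query comes with a ranked list of document IDs and a set of relevant IDs (hypothetical helpers, not the evaluation code used here):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document in the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Corpus-level scores average the per-query values, e.g.:
# MRR@10 = mean(mrr_at_k(run[q], qrels[q], 10) for q in queries)
```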

## Usage

You can load and use the model directly with the Hugging Face Transformers library:
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model = AutoModel.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")

# Example usage ("Tired of not being able to find the perfect partner?")
text = "Stanchi di non riuscire a trovare il partner perfetto?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The card does not specify a pooling strategy; mean pooling over the
# attention mask is a common default for sentence embeddings.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```
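
For retrieval, texts are compared by the similarity of their pooled embeddings. The sketch below wraps the pooling code above in a hypothetical `embed` helper and ranks made-up example passages against a query by cosine similarity:

```python
import torch.nn.functional as F

def embed(texts):
    # Tokenize a batch and mean-pool the last hidden states, as above.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

query = "Qual è la capitale d'Italia?"               # "What is the capital of Italy?"
passages = [
    "Roma è la capitale d'Italia.",                  # "Rome is the capital of Italy."
    "Il Po è il fiume più lungo d'Italia.",          # "The Po is Italy's longest river."
]

scores = F.cosine_similarity(embed([query]), embed(passages))
best = passages[int(scores.argmax())]                # expected: the passage about Rome
```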

## Intended Use
This model is intended for:

- Information Retrieval (IR): Enhancing search engines and retrieval systems in the Italian language.
- Retrieval-Augmented Generation (RAG): Improving the quality of generated content by providing relevant context.

Suitable for both industry applications and academic research.

## Limitations
- The model may inherit biases present in the C4 dataset.
- Performance is primarily evaluated on mMARCO; results may vary with other datasets.

---

## Contact

**Archit Rastogi**  
📧 [email protected]