---
license: apache-2.0
datasets:
- ArchitRastogi/Italian-BERT-FineTuning-Embeddings
language:
- it
base_model:
- dbmdz/bert-base-italian-xxl-uncased
---

# bert-base-italian-embeddings: A Fine-Tuned Italian BERT Model for IR and RAG Applications

## Model Overview

This model is a fine-tuned version of [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased) tailored for Italian-language Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) tasks. It was trained with a contrastive objective and produces high-quality sentence embeddings suitable for both industry and academic applications.

## Model Size

- **Size**: Approximately 450 MB

## Training Details

- **Base Model**: [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased)
- **Dataset**: [Italian-BERT-FineTuning-Embeddings](https://huggingface.co/datasets/ArchitRastogi/Italian-BERT-FineTuning-Embeddings)
  - Derived from the C4 dataset using sliding-window segmentation and in-document sampling.
  - **Size**: ~5 GB (4.5 GB train, 0.5 GB test)
- **Training Configuration**:
  - **Hardware**: NVIDIA A40 GPU
  - **Epochs**: 3
  - **Total Steps**: 922,958
  - **Training Time**: Approximately 5 days, 2 hours, and 23 minutes
- **Training Objective**: Contrastive learning (an illustrative sketch follows below)
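
The exact loss function is not published in this card; the snippet below is a minimal sketch of one common contrastive objective (InfoNCE with in-batch negatives), where positive pairs would come from the dataset's in-document sampling. The function name `info_nce_loss` and the temperature value are illustrative assumptions, not the actual training code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of passage_emb is the
    positive for row i of query_emb; all other rows act as negatives.
    (Illustrative sketch; temperature is an assumed hyperparameter.)"""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```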

## Evaluation Metrics

Evaluations were performed using the [mMARCO](https://github.com/unicamp-dl/mMARCO) dataset, a multilingual version of MS MARCO. The model was assessed on 6,980 queries.

### Results Comparison

| Metric              | Base Model (`dbmdz/bert-base-italian-xxl-uncased`) | `facebook/mcontriever-msmarco` | **Fine-Tuned Model** |
|---------------------|----------------------------------------------------|--------------------------------|----------------------|
| **Recall@1**        | 0.0026                                             | 0.0828                         | **0.2106**           |
| **Recall@100**      | 0.0417                                             | 0.5028                         | **0.8356**           |
| **Recall@1000**     | 0.2061                                             | 0.8049                         | **0.9719**           |
| **Average Precision** | 0.0050                                          | 0.1397                         | **0.3173**           |
| **NDCG@10**         | 0.0043                                             | 0.1591                         | **0.3601**           |
| **NDCG@100**        | 0.0108                                             | 0.2086                         | **0.4218**           |
| **NDCG@1000**       | 0.0299                                             | 0.2454                         | **0.4391**           |
| **MRR@10**          | 0.0036                                             | 0.1299                         | **0.3047**           |
| **MRR@100**         | 0.0045                                             | 0.1385                         | **0.3167**           |
| **MRR@1000**        | 0.0050                                             | 0.1397                         | **0.3173**           |

**Note**: The fine-tuned model significantly outperforms both the base model and `facebook/mcontriever-msmarco` across all metrics.
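For reference, here is a minimal sketch of how two of the reported metrics can be computed for a single query; these helpers are illustrative, not the evaluation script used to produce the table above. The reported numbers average the per-query values over all 6,980 queries.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of this query's relevant documents found in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant document within the top-k, else 0.
    MRR@k is the mean of this value over all queries."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```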

## Usage

You can load and use the model directly with the Hugging Face Transformers library:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the encoder itself (AutoModel, not a task head) to access hidden states
tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model = AutoModel.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")

# Example usage: embed one sentence
text = "Stanchi di non riuscire a trovare il partner perfetto?"  # "Tired of not finding the perfect partner?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state over non-padding tokens
# (the card does not specify a pooling strategy; mean pooling is a common default)
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```
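
Building on the snippet above (it reuses `tokenizer` and `model`), the following is a hedged sketch of ranking candidate passages against a query by cosine similarity; the `embed` helper is an illustrative name, not part of the model's API.

```python
import torch
import torch.nn.functional as F

def embed(texts):
    """Illustrative helper: batch-encode texts into mean-pooled embeddings."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

query = embed(["Come posso trovare il partner perfetto?"])  # "How can I find the perfect partner?"
passages = embed([
    "Stanchi di non riuscire a trovare il partner perfetto?",
    "La ricetta tradizionale della pasta alla carbonara.",
])
scores = F.cosine_similarity(query, passages)  # higher score = more relevant
print(scores)
```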

## Intended Use

This model is intended for:

- Information Retrieval (IR): Enhancing search engines and retrieval systems in the Italian language.
- Retrieval-Augmented Generation (RAG): Improving the quality of generated content by providing relevant context.

It is suitable for both industry applications and academic research.

## Limitations
- The model may inherit biases present in the C4 dataset.
- Performance is primarily evaluated on mMARCO; results may vary with other datasets.

---

## Contact

**Archit Rastogi**  
📧 [email protected]