---
license: apache-2.0
datasets:
- ArchitRastogi/Italian-BERT-FineTuning-Embeddings
language:
- it
metrics:
- Recall@1
- Recall@100
- Recall@1000
- Average Precision
- NDCG@10
- NDCG@100
- NDCG@1000
- MRR@10
- MRR@100
- MRR@1000
base_model:
- dbmdz/bert-base-italian-xxl-uncased
new_version: "true"
pipeline_tag: feature-extraction
library_name: transformers
tags:
- information-retrieval
- contrastive-learning
- embeddings
- italian
- fine-tuned
- bert
- retrieval-augmented-generation
model-index:
- name: bert-base-italian-embeddings
results:
- task:
type: information-retrieval
dataset:
name: mMARCO
type: mMARCO
metrics:
- name: Recall@1000
type: Recall
value: 0.9719
- name: NDCG@1000
type: Normalized Discounted Cumulative Gain
value: 0.4391
    - name: Average Precision
      type: Average Precision
      value: 0.3173
source:
name: Fine-tuned Italian BERT Model Evaluation
url: https://github.com/unicamp-dl/mMARCO
---
# bert-base-italian-embeddings: A Fine-Tuned Italian BERT Model for IR and RAG Applications
## Model Overview
This model is a fine-tuned version of [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased), tailored for Italian-language Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) tasks. It leverages contrastive learning to generate high-quality embeddings suitable for both industry and academic applications.
## Model Size
- **Size**: Approximately 450 MB
## Training Details
- **Base Model**: [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased)
- **Dataset**: [Italian-BERT-FineTuning-Embeddings](https://huggingface.co/datasets/ArchitRastogi/Italian-BERT-FineTuning-Embeddings)
- Derived from the C4 dataset using sliding-window segmentation and in-document sampling (see the sketch after this list).
- **Size**: ~5GB (4.5GB train, 0.5GB test)
- **Training Configuration**:
- **Hardware**: NVIDIA A40 GPU
- **Epochs**: 3
- **Total Steps**: 922,958
- **Training Time**: Approximately 5 days, 2 hours, and 23 minutes
- **Training Objective**: Contrastive Learning
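The dataset construction code is not included in this card; the following is a minimal, hypothetical sketch of sliding-window segmentation with in-document positive sampling. Window size, stride, and sampling strategy are illustrative assumptions, not the values actually used:

```python
import random

def sliding_windows(text, window=128, stride=64):
    # Split a document into overlapping token windows. The window and
    # stride values here are illustrative, not the training settings.
    tokens = text.split()
    return [" ".join(tokens[i:i + window])
            for i in range(0, max(len(tokens) - window + 1, 1), stride)]

def in_document_pair(document):
    # Two segments from the same document form a positive pair: they
    # share topical context, which contrastive learning pulls together.
    segments = sliding_windows(document)
    if len(segments) < 2:
        return None
    return tuple(random.sample(segments, 2))

# Segments from different documents act as negatives; with in-batch
# negatives, every other pair in a training batch serves that role.
```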
## Evaluation Metrics
Evaluations were performed using the [mMARCO](https://github.com/unicamp-dl/mMARCO) dataset, a multilingual version of MS MARCO. The model was assessed on 6,980 queries.
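For reference, Recall@k and MRR@k can be computed per query as below (a simplified sketch assuming a single relevant passage per query, as in MS MARCO-style evaluation; the reported numbers are means over all 6,980 queries):

```python
def recall_at_k(ranked_ids, relevant_id, k):
    # 1.0 if the relevant passage is retrieved in the top k, else 0.0
    # (single-relevant-passage simplification).
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr_at_k(ranked_ids, relevant_id, k):
    # Reciprocal rank of the first relevant hit within the top k.
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid == relevant_id:
            return 1.0 / rank
    return 0.0
```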
### Results Comparison
| Metric | Base Model (`dbmdz/bert-base-italian-xxl-uncased`) | `facebook/mcontriever-msmarco` | **Fine-Tuned Model** |
|---------------------|----------------------------------------------------|--------------------------------|----------------------|
| **Recall@1** | 0.0026 | 0.0828 | **0.2106** |
| **Recall@100** | 0.0417 | 0.5028 | **0.8356** |
| **Recall@1000** | 0.2061 | 0.8049 | **0.9719** |
| **Average Precision** | 0.0050 | 0.1397 | **0.3173** |
| **NDCG@10** | 0.0043 | 0.1591 | **0.3601** |
| **NDCG@100** | 0.0108 | 0.2086 | **0.4218** |
| **NDCG@1000** | 0.0299 | 0.2454 | **0.4391** |
| **MRR@10** | 0.0036 | 0.1299 | **0.3047** |
| **MRR@100** | 0.0045 | 0.1385 | **0.3167** |
| **MRR@1000** | 0.0050 | 0.1397 | **0.3173** |
**Note**: The fine-tuned model significantly outperforms both the base model and `facebook/mcontriever-msmarco` across all metrics.
## Usage
You can load the model with the Hugging Face Transformers library and pool its hidden states into sentence embeddings:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the encoder. Use AutoModel rather than AutoModelForMaskedLM:
# embeddings come from the hidden states, not masked-token logits.
tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model = AutoModel.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")

# Example usage: embed a sentence
text = "Stanchi di non riuscire a trovare il partner perfetto?"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token states into a single vector, masking padding.
# (Mean pooling is a common choice for contrastive encoders; the
# pooling used during training is not specified here.)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```
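For retrieval, embed queries and passages the same way and rank by cosine similarity. A minimal sketch building on the block above (the `embed` helper is hypothetical shorthand for the tokenize-encode-pool steps):

```python
import torch.nn.functional as F

def embed(texts):
    # Hypothetical helper wrapping the tokenize -> encode -> pool steps above.
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**enc)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    return (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

query = embed(["Come trovare il partner perfetto?"])
passages = embed([
    "Stanchi di non riuscire a trovare il partner perfetto?",
    "La ricetta tradizionale della pasta alla carbonara.",
])

# Cosine similarity between the query and each passage; higher = more relevant.
scores = F.cosine_similarity(query, passages)
```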
## Intended Use
This model is intended for:
- Information Retrieval (IR): Enhancing search engines and retrieval systems in the Italian language.
- Retrieval-Augmented Generation (RAG): Improving the quality of generated content by providing relevant context.
Suitable for both industry applications and academic research.
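As an illustration of the RAG use case, retrieved passages can be packed into a prompt for any Italian-capable generator. A hypothetical sketch reusing the `embed` helper from the usage section (prompt format and `top_k` are illustrative choices):

```python
def build_rag_prompt(question, passages, top_k=3):
    # Rank candidate passages against the question with the embedding
    # model, then place the best ones in the prompt as context.
    q = embed([question])
    p = embed(passages)
    scores = F.cosine_similarity(q, p)
    k = min(top_k, len(passages))
    best = [passages[i] for i in scores.topk(k).indices]
    context = "\n".join(f"- {passage}" for passage in best)
    return f"Contesto:\n{context}\n\nDomanda: {question}\nRisposta:"
```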
## Limitations
- The model may inherit biases present in the C4 dataset.
- Performance is primarily evaluated on mMARCO; results may vary with other datasets.
---
## Contact
**Archit Rastogi**
📧 [email protected]