bert-base-italian-embeddings: A Fine-Tuned Italian BERT Model for IR and RAG Applications

Model Overview

This model is a fine-tuned version of dbmdz/bert-base-italian-xxl-uncased, tailored for Italian-language Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) tasks. It was trained with a contrastive objective to produce dense sentence embeddings suitable for both industry and academic applications.

Model Size

  • Size: approximately 450 MB on disk (~111M parameters, FP32 safetensors)

Training Details

  • Base Model: dbmdz/bert-base-italian-xxl-uncased
  • Dataset: Italian-BERT-FineTuning-Embeddings
    • Derived from the C4 dataset using sliding-window segmentation and in-document sampling (see the preprocessing sketch after this list).
    • Size: ~5 GB (4.5 GB train, 0.5 GB test)
  • Training Configuration:
    • Hardware: NVIDIA A40 GPU
    • Epochs: 3
    • Total Steps: 922,958
    • Training Time: Approximately 5 days, 2 hours, and 23 minutes
  • Training Objective: Contrastive learning (an illustrative loss sketch appears below)
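
The dataset bullets above mention sliding-window segmentation and in-document sampling but do not give the exact parameters. The sketch below shows one plausible construction; the window size, stride, and pairing strategy are illustrative assumptions, not documented values.

import random

def sliding_windows(tokens, size=128, stride=64):
    # Split a tokenized document into overlapping windows.
    # size and stride are illustrative assumptions; the values used to
    # build Italian-BERT-FineTuning-Embeddings are not documented.
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - size, 0) + 1, stride)]

def sample_positive_pair(windows):
    # In-document sampling: two windows drawn from the same document can
    # serve as a positive pair for contrastive training (an inference
    # from the dataset description, not a documented procedure).
    return random.sample(windows, 2)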

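The training objective is listed only as contrastive learning. A common instantiation for embedding models is the InfoNCE loss with in-batch negatives, sketched below; whether this exact loss and temperature were used for this model is an assumption, as the card does not specify them.

import torch
import torch.nn.functional as F

def info_nce_loss(q, p, temperature=0.05):
    # q, p: (batch, dim) L2-normalized embeddings of paired text segments.
    # Each query q[i] is scored against every passage p[j]; the diagonal
    # entries are the positives, and all other in-batch passages act as
    # negatives. The temperature value is an assumption.
    logits = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
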
Evaluation Metrics

Evaluations were performed using the mMARCO dataset, a multilingual version of MS MARCO. The model was assessed on 6,980 queries.

Results Comparison

Metric            | Base Model (dbmdz/bert-base-italian-xxl-uncased) | facebook/mcontriever-msmarco | Fine-Tuned Model
Recall@1          | 0.0026 | 0.0828 | 0.2106
Recall@100        | 0.0417 | 0.5028 | 0.8356
Recall@1000       | 0.2061 | 0.8049 | 0.9719
Average Precision | 0.0050 | 0.1397 | 0.3173
NDCG@10           | 0.0043 | 0.1591 | 0.3601
NDCG@100          | 0.0108 | 0.2086 | 0.4218
NDCG@1000         | 0.0299 | 0.2454 | 0.4391
MRR@10            | 0.0036 | 0.1299 | 0.3047
MRR@100           | 0.0045 | 0.1385 | 0.3167
MRR@1000          | 0.0050 | 0.1397 | 0.3173

Note: The fine-tuned model significantly outperforms both the base model and facebook/mcontriever-msmarco across all metrics.
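
For reference: Recall@k is the fraction of relevant passages found in the top k results, and MRR@k is the reciprocal rank of the first relevant hit within the top k, averaged over queries. The sketch below illustrates both for a single query's ranking; it is not the evaluation code behind the numbers above.

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the relevant passages that appear in the top k results.
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    # Reciprocal rank of the first relevant passage within the top k,
    # or 0 if none is retrieved.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0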

Usage

You can load the model with the Hugging Face Transformers library. Because it is an embedding model, load it as a plain encoder and pool the token-level hidden states into a sentence vector (mean pooling is shown below; the card does not state which pooling was used during training):

# Load the model as a plain encoder; AutoModel exposes the hidden states
# needed for embeddings (AutoModelForMaskedLM would return MLM logits instead)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model = AutoModel.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")

# Example usage: embed one sentence
text = "Stanchi di non riuscire a trovare il partner perfetto?"  # "Tired of not finding the perfect partner?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings, masking out padding, to get one vector
# per sentence (mean pooling is an assumption; the card does not specify)
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
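
Once pooled embeddings are available, retrieval reduces to nearest-neighbor search over normalized vectors. A minimal sketch, reusing the tokenizer and model loaded above (the embed helper is hypothetical, not part of the model's API):

import torch
import torch.nn.functional as F

def embed(texts):
    # Hypothetical helper: batch-encode, mean-pool, and L2-normalize.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    mask = enc["attention_mask"].unsqueeze(-1)
    emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    return F.normalize(emb, dim=-1)

query = embed(["Qual è la capitale d'Italia?"])           # "What is the capital of Italy?"
docs = embed(["Roma è la capitale d'Italia.",             # "Rome is the capital of Italy."
              "Il Po è il fiume più lungo d'Italia."])    # "The Po is the longest river in Italy."
scores = query @ docs.T        # cosine similarity, since the vectors are normalized
best = scores.argmax(dim=-1)   # index of the best-matching passage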

Intended Use

This model is intended for:

  • Information Retrieval (IR): Enhancing search engines and retrieval systems in the Italian language.
  • Retrieval-Augmented Generation (RAG): Improving the quality of generated content by providing relevant context.

It is suitable for both industry applications and academic research.

Limitations

  • The model may inherit biases present in the C4 dataset.
  • Performance is primarily evaluated on mMARCO; results may vary with other datasets.

Contact

Archit Rastogi
📧 [email protected]
