---
license: apache-2.0
datasets:
- ArchitRastogi/Italian-BERT-FineTuning-Embeddings
language:
- it
metrics:
- Recall@1
- Recall@100
- Recall@1000
- Average Precision
- NDCG@10
- NDCG@100
- NDCG@1000
- MRR@10
- MRR@100
- MRR@1000
base_model:
- dbmdz/bert-base-italian-xxl-uncased
new_version: "true"
pipeline_tag: feature-extraction
library_name: transformers
tags:
- information-retrieval
- contrastive-learning
- embeddings
- italian
- fine-tuned
- bert
- retrieval-augmented-generation
model-index:
- name: bert-base-italian-embeddings
  results:
  - task:
      type: information-retrieval
    dataset:
      name: mMARCO
      type: mMARCO
    metrics:
    - name: Recall@1000
      type: Recall
      value: 0.9719
    source:
      name: Fine-tuned Italian BERT Model Evaluation
      url: https://github.com/unicamp-dl/mMARCO
---

# bert-base-italian-embeddings: A Fine-Tuned Italian BERT Model for IR and RAG Applications

## Model Overview

This model is a fine-tuned version of [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased) tailored for Italian language Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) tasks. It leverages contrastive learning to generate high-quality embeddings suitable for both industry and academic applications.

## Model Size

- **Size**: Approximately 450 MB

## Training Details

- **Base Model**: [dbmdz/bert-base-italian-xxl-uncased](https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased)
- **Dataset**: [Italian-BERT-FineTuning-Embeddings](https://huggingface.co/datasets/ArchitRastogi/Italian-BERT-FineTuning-Embeddings)
  - Derived from the C4 dataset using sliding window segmentation and in-document sampling.
  - **Size**: ~5 GB (4.5 GB train, 0.5 GB test)
- **Training Configuration**:
  - **Hardware**: NVIDIA A40 GPU
  - **Epochs**: 3
  - **Total Steps**: 922,958
  - **Training Time**: Approximately 5 days, 2 hours, and 23 minutes
  - **Training Objective**: Contrastive Learning

## Evaluation Metrics

Evaluations were performed using the [mMARCO](https://github.com/unicamp-dl/mMARCO) dataset, a multilingual version of MS MARCO. The model was assessed on 6,980 queries.

### Results Comparison

| Metric                | Base Model (`dbmdz/bert-base-italian-xxl-uncased`) | `facebook/mcontriever-msmarco` | **Fine-Tuned Model** |
|-----------------------|----------------------------------------------------|--------------------------------|----------------------|
| **Recall@1**          | 0.0026                                             | 0.0828                         | **0.2106**           |
| **Recall@100**        | 0.0417                                             | 0.5028                         | **0.8356**           |
| **Recall@1000**       | 0.2061                                             | 0.8049                         | **0.9719**           |
| **Average Precision** | 0.0050                                             | 0.1397                         | **0.3173**           |
| **NDCG@10**           | 0.0043                                             | 0.1591                         | **0.3601**           |
| **NDCG@100**          | 0.0108                                             | 0.2086                         | **0.4218**           |
| **NDCG@1000**         | 0.0299                                             | 0.2454                         | **0.4391**           |
| **MRR@10**            | 0.0036                                             | 0.1299                         | **0.3047**           |
| **MRR@100**           | 0.0045                                             | 0.1385                         | **0.3167**           |
| **MRR@1000**          | 0.0050                                             | 0.1397                         | **0.3173**           |

**Note**: The fine-tuned model outperforms both the base model and `facebook/mcontriever-msmarco` on every metric in the table. For example, Recall@1000 rises from 0.2061 (base model) and 0.8049 (`facebook/mcontriever-msmarco`) to 0.9719.

## Usage

You can load and use the model directly with the Hugging Face Transformers library:

```python
# Load the tokenizer and model directly from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")
model = AutoModelForMaskedLM.from_pretrained("ArchitRastogi/bert-base-italian-embeddings")

# Example usage
text = "Stanchi di non riuscire a trovare il partner perfetto?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```

## Intended Use

This model is intended for:

- Information Retrieval (IR): Enhancing search engines and retrieval systems in the Italian language.
- Retrieval-Augmented Generation (RAG): Improving the quality of generated content by providing relevant context.

Suitable for both industry applications and academic research.

## Limitations

- The model may inherit biases present in the C4 dataset.
- Performance is primarily evaluated on mMARCO; results may vary with other datasets.

---

## Contact

**Archit Rastogi**
📧 architrastogi20@gmail.com
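
---

## Appendix: Metric Definitions (Illustrative)

The results table reports Recall@k and MRR@k without spelling out how they are computed. The sketch below uses the common textbook definitions of both metrics on toy data. It is not the evaluation script behind the reported numbers (the card does not say which one was used), and the function names, query IDs, and passage IDs are all hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant passage in the top-k, else 0."""
    relevant = set(relevant_ids)
    for rank, passage_id in enumerate(ranked_ids[:k], start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs, qrels, k):
    """Average a per-query metric over every query listed in `qrels`."""
    scores = [metric(runs.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy relevance judgments and ranked results (hypothetical, not from mMARCO).
qrels = {"q1": ["p3"], "q2": ["p7", "p9"]}
runs = {"q1": ["p1", "p3", "p5"], "q2": ["p7", "p2", "p4"]}

print("Recall@1:", mean_over_queries(recall_at_k, runs, qrels, k=1))
print("Recall@3:", mean_over_queries(recall_at_k, runs, qrels, k=3))
print("MRR@10:  ", mean_over_queries(mrr_at_k, runs, qrels, k=10))
```

In this toy run, `q1` finds its only relevant passage at rank 2, while `q2` finds one of its two relevant passages at rank 1. Recall@k therefore grows with k, whereas MRR@k depends only on the position of the first hit.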