DeepMount00 committed on
Commit 45b423b · verified · 1 Parent(s): f0c04f1

Update README.md

Files changed (1)
  1. README.md +48 -29
README.md CHANGED
@@ -5,18 +5,37 @@ tags:
  - information-retrieval
  - semantic-search
  widget:
- - source_sentence: "Descrivi dettagliatamente il processo chimico e fisico che avviene durante la preparazione di un impasto per crostata"
  sentences:
- - "## La Magia Chimica e Fisica nell'Impasto della Crostata: Un Viaggio Dagli Ingredienti Secchi al Trionfo del Forno\n\nLa preparazione di una crostata, apparentemente un gesto semplice e familiare, cela in realtà un affascinante balletto di reazioni chimiche e trasformazioni fisiche..."
- - "## L'Arte Effimera: Creare un Dolce Paesaggio Invernale\n\nImmergiamoci nel cuore pulsante della pasticceria festiva, dove l'arte culinaria si fonde con la creatività artistica..."
- - "Le piattaforme di comunicazione digitale, con la loro ubiquità crescente, si configurano come un'arma a doppio taglio nel panorama sociale contemporaneo..."
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---

- # Fine-tuned Qwen3-Embedding for Italian-English Cross-Lingual Semantic Retrieval

- This model is a specialized fine-tuned version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) optimized for cross-lingual semantic retrieval tasks, with particular emphasis on Italian query understanding and multilingual document ranking.

  ## Model Description
 
@@ -24,16 +43,16 @@ This model is a specialized fine-tuned version of [Qwen/Qwen3-Embedding-0.6B](ht
  - **Base Model**: [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
  - **Output Dimensionality**: 1,024-dimensional dense vectors
  - **Maximum Sequence Length**: 32,768 tokens
- - **Primary Languages**: Italian, English
  - **Similarity Function**: Cosine similarity

  ## Capabilities

- ### Cross-Lingual Retrieval
- The model demonstrates strong performance in matching Italian queries to English documents and vice versa, particularly effective in technical and academic domains.

  ### Domain Coverage
- Trained on diverse knowledge domains including:
  - **Medical & Health Sciences**: Diagnostic imaging, clinical procedures, medical terminology
  - **STEM Fields**: Physics, computer science, geology, engineering
  - **Professional Domains**: Finance, law, agriculture, software development
@@ -41,18 +60,18 @@ Trained on diverse knowledge domains including:

  ### Query Understanding
  Enhanced comprehension of:
- - Conversational and informal query patterns
- - Technical terminology across domains
- - Cross-lingual semantic concepts
- - Complex multi-faceted questions

  ## Training Data

- The model was fine-tuned on a curated corpus of Italian-English cross-lingual data, featuring high-quality triplets designed to capture semantic nuances across multiple domains. The dataset emphasizes:

  - **Hard negative mining**: Strategic inclusion of semantically related but incorrect documents
- - **Cross-lingual alignment**: Balanced representation of Italian-English language pairs
- - **Domain diversity**: Comprehensive coverage of academic, professional, and conversational contexts
  - **Quality curation**: Manual review and automated filtering for coherence and relevance

  ## Usage
@@ -63,12 +82,12 @@ from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("your-model-name")

- # Cross-lingual query-document matching
  query = "Come si distingue una faglia trascorrente da una normale?"
  documents = [
- "Strike-slip faults are characterized by horizontal movement...",
- "Normal faults occur due to extensional stress...",
- "Investment portfolio management strategies..."
  ]

  query_embedding = model.encode(query, prompt="Represent this search query for finding relevant passages: ")
@@ -83,17 +102,17 @@ The model is optimized for specific prompt templates:

  ## Applications

- - **Cross-lingual information retrieval systems**
- - **Academic and technical document search**
- - **Multilingual question-answering platforms**
- - **Educational content recommendation**
- - **Professional knowledge base systems**

  ## Limitations

- - **Language coverage**: Primarily optimized for Italian-English pairs
  - **Domain specificity**: Performance may vary on highly specialized domains not represented in training
- - **Cultural context**: Reflects primarily Western/European knowledge perspectives
  - **Computational requirements**: Dense representations require significant storage for large-scale deployment

  ## Model Architecture
@@ -110,7 +129,7 @@ SentenceTransformer(

  ```bibtex
  @misc{qwen3-italian-retrieval-2024,
- title={Fine-tuned Qwen3-Embedding for Italian-English Cross-Lingual Semantic Retrieval},
  year={2024},
  howpublished={\\url{https://huggingface.co/your-model-name}}
  }
 
@@ -5,18 +5,37 @@ tags:
  - information-retrieval
  - semantic-search
  widget:
+ - source_sentence: >-
+ Descrivi dettagliatamente il processo chimico e fisico che avviene durante
+ la preparazione di un impasto per crostata
  sentences:
+ - >-
+ ## La Magia Chimica e Fisica nell'Impasto della Crostata: Un Viaggio Dagli
+ Ingredienti Secchi al Trionfo del Forno
+
+
+ La preparazione di una crostata, apparentemente un gesto semplice e
+ familiare, cela in realtà un affascinante balletto di reazioni chimiche e
+ trasformazioni fisiche...
+ - >-
+ ## L'Arte Effimera: Creare un Dolce Paesaggio Invernale
+
+
+ Immergiamoci nel cuore pulsante della pasticceria festiva, dove l'arte
+ culinaria si fonde con la creatività artistica...
+ - >-
+ Le piattaforme di comunicazione digitale, con la loro ubiquità crescente, si
+ configurano come un'arma a doppio taglio nel panorama sociale
+ contemporaneo...
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
+ language:
+ - it
  ---

+ # Fine-tuned Qwen3-Embedding for Italian Semantic Retrieval

+ This model is a specialized fine-tuned version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) optimized for Italian semantic retrieval tasks, with particular emphasis on Italian query understanding and document ranking.

  ## Model Description
 
 
@@ -24,16 +43,16 @@ This model is a specialized fine-tuned version of [Qwen/Qwen3-Embedding-0.6B](ht
  - **Base Model**: [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
  - **Output Dimensionality**: 1,024-dimensional dense vectors
  - **Maximum Sequence Length**: 32,768 tokens
+ - **Primary Language**: Italian
  - **Similarity Function**: Cosine similarity

  ## Capabilities

+ ### Italian Semantic Retrieval
+ The model demonstrates strong performance in matching Italian queries to Italian documents, particularly effective in technical and academic domains within the Italian language context.

  ### Domain Coverage
+ Trained on diverse Italian knowledge domains including:
  - **Medical & Health Sciences**: Diagnostic imaging, clinical procedures, medical terminology
  - **STEM Fields**: Physics, computer science, geology, engineering
  - **Professional Domains**: Finance, law, agriculture, software development
 
@@ -41,18 +60,18 @@ Trained on diverse knowledge domains including:

  ### Query Understanding
  Enhanced comprehension of:
+ - Conversational and informal Italian query patterns
+ - Technical terminology in Italian across domains
+ - Italian semantic concepts and nuances
+ - Complex multi-faceted questions in Italian

  ## Training Data

+ The model was fine-tuned on a curated corpus of Italian semantic data, featuring high-quality triplets designed to capture semantic nuances across multiple domains. The dataset emphasizes:

  - **Hard negative mining**: Strategic inclusion of semantically related but incorrect documents
+ - **Italian language focus**: Comprehensive representation of Italian language patterns
+ - **Domain diversity**: Comprehensive coverage of academic, professional, and conversational contexts in Italian
  - **Quality curation**: Manual review and automated filtering for coherence and relevance

  ## Usage
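
Note on the Training Data hunk above: the README mentions triplets with hard negatives but does not say which loss was used during fine-tuning. As an illustration only, the sketch below shows a generic triplet margin loss over cosine distance and why a hard negative produces a training signal where an easy one does not. The function names, the margin value, and the toy vectors are assumptions, not details from the model card.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 0.0 for identical directions, 1.0 for orthogonal vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet margin loss (illustrative; margin=0.2 is an assumed value).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy 2-dimensional "embeddings" (hypothetical values).
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]   # unrelated document: already far from the anchor
hard_negative = [1.0, 0.1]   # related but incorrect document: close to the anchor

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
print(easy_loss, hard_loss)
```

The easy negative already satisfies the margin and contributes no loss, while the hard negative does not, which is the usual motivation for mining hard negatives.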
 
@@ -63,12 +82,12 @@ from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("your-model-name")

+ # Italian query-document matching
  query = "Come si distingue una faglia trascorrente da una normale?"
  documents = [
+ "Le faglie trascorrenti sono caratterizzate da movimento orizzontale...",
+ "Le faglie normali si verificano a causa di stress estensionale...",
+ "Le strategie di gestione del portafoglio di investimenti..."
  ]

  query_embedding = model.encode(query, prompt="Represent this search query for finding relevant passages: ")
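
Note on the Usage hunk above: the diff cuts off after the query is encoded, so the rest of the README's example is not visible here. The sketch below shows one plausible way to rank documents by cosine similarity, the similarity function named in the model description. It uses small stand-in vectors instead of real embeddings so that it runs without downloading a model; `rank_by_cosine` and the vectors are hypothetical.

```python
import numpy as np

def rank_by_cosine(query_embedding, document_embeddings):
    """Return (indices, scores) with documents sorted by descending cosine similarity."""
    q = np.asarray(query_embedding, dtype=np.float64)
    d = np.asarray(document_embeddings, dtype=np.float64)
    q = q / np.linalg.norm(q)                         # unit-length query vector
    d = d / np.linalg.norm(d, axis=1, keepdims=True)  # unit-length document rows
    scores = d @ q                                    # cosine similarity per document
    order = np.argsort(-scores)                       # highest score first
    return order, scores[order]

# Stand-in 4-dimensional vectors (hypothetical). With the real model these would
# be the 1,024-dimensional outputs of model.encode(...) for the query and documents.
query_vec = np.array([1.0, 0.0, 1.0, 0.0])
doc_vecs = np.array([
    [0.9, 0.1, 0.8, 0.0],  # closely aligned with the query
    [0.5, 0.5, 0.5, 0.5],  # partially aligned
    [0.0, 1.0, 0.0, 1.0],  # orthogonal to the query
])

order, scores = rank_by_cosine(query_vec, doc_vecs)
for idx, score in zip(order, scores):
    print(f"document {idx}: cosine similarity {score:.4f}")
```

Normalizing both sides first reduces cosine similarity to a single matrix-vector product, which is convenient when many documents are scored against one query.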
 
@@ -83,17 +102,17 @@ The model is optimized for specific prompt templates:

  ## Applications

+ - **Italian information retrieval systems**
+ - **Academic and technical document search in Italian**
+ - **Italian question-answering platforms**
+ - **Educational content recommendation for Italian speakers**
+ - **Professional knowledge base systems in Italian**

  ## Limitations

+ - **Language coverage**: Specifically optimized for Italian language
  - **Domain specificity**: Performance may vary on highly specialized domains not represented in training
+ - **Cultural context**: Reflects primarily Italian/European knowledge perspectives
  - **Computational requirements**: Dense representations require significant storage for large-scale deployment

  ## Model Architecture
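
Note on the Limitations hunk above: the storage cost of dense representations can be estimated with simple arithmetic. The sketch below assumes uncompressed 32-bit floats (4 bytes per value), which the README does not state, and ignores any index overhead. Only the 1,024-dimensional output size comes from the model description; the helper name and corpus sizes are illustrative.

```python
def index_size_bytes(num_documents, dimensions=1024, bytes_per_value=4):
    """Raw size of a dense embedding matrix, assuming float32 values by default."""
    return num_documents * dimensions * bytes_per_value

# Illustrative corpus sizes (hypothetical).
for n in (100_000, 1_000_000, 10_000_000):
    gib = index_size_bytes(n) / 2**30
    print(f"{n:>10,} documents -> {gib:.2f} GiB")
# prints 0.38 GiB, 3.81 GiB and 38.15 GiB respectively
```

Each vector takes 4,096 bytes under these assumptions, so storage grows linearly with corpus size.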
 
@@ -110,7 +129,7 @@ SentenceTransformer(

  ```bibtex
  @misc{qwen3-italian-retrieval-2024,
+ title={Fine-tuned Qwen3-Embedding for Italian Semantic Retrieval},
  year={2024},
  howpublished={\\url{https://huggingface.co/your-model-name}}
  }