---
tags:
- sentence-transformers
- sentence-similarity
- information-retrieval
- semantic-search
widget:
- source_sentence: "Descrivi dettagliatamente il processo chimico e fisico che avviene durante la preparazione di un impasto per crostata"
  sentences:
  - "## La Magia Chimica e Fisica nell'Impasto della Crostata: Un Viaggio Dagli Ingredienti Secchi al Trionfo del Forno\n\nLa preparazione di una crostata, apparentemente un gesto semplice e familiare, cela in realtà un affascinante balletto di reazioni chimiche e trasformazioni fisiche..."
  - "## L'Arte Effimera: Creare un Dolce Paesaggio Invernale\n\nImmergiamoci nel cuore pulsante della pasticceria festiva, dove l'arte culinaria si fonde con la creatività artistica..."
  - "Le piattaforme di comunicazione digitale, con la loro ubiquità crescente, si configurano come un'arma a doppio taglio nel panorama sociale contemporaneo..."
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# Fine-tuned Qwen3-Embedding for Italian-English Cross-Lingual Semantic Retrieval

This model is a fine-tuned version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) optimized for cross-lingual semantic retrieval, with particular emphasis on Italian query understanding and multilingual document ranking.

## Model Description

- **Model Type**: Dense embedding model for semantic retrieval
- **Base Model**: [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
- **Output Dimensionality**: 1,024-dimensional dense vectors
- **Maximum Sequence Length**: 32,768 tokens
- **Primary Languages**: Italian, English
- **Similarity Function**: Cosine similarity
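
Because the model L2-normalizes its outputs (the `Normalize()` module under Model Architecture below), cosine similarity reduces to a plain dot product at search time. A minimal numpy sketch of this property, purely illustrative and independent of the library:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# For unit-length embeddings (as this model produces), cosine similarity
# equals the raw dot product.
rng = np.random.default_rng(0)
v = rng.standard_normal(1024)
w = rng.standard_normal(1024)
v_hat = v / np.linalg.norm(v)
w_hat = w / np.linalg.norm(w)
assert abs(cosine_similarity(v, w) - float(np.dot(v_hat, w_hat))) < 1e-9
```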

## Capabilities

### Cross-Lingual Retrieval
The model matches Italian queries to English documents and vice versa, and is particularly effective in technical and academic domains.

### Domain Coverage
Trained on diverse knowledge domains, including:
- **Medical & Health Sciences**: Diagnostic imaging, clinical procedures, medical terminology
- **STEM Fields**: Physics, computer science, geology, engineering
- **Professional Domains**: Finance, law, agriculture, software development
- **Educational Content**: Historical studies, culinary arts, general knowledge

### Query Understanding
Enhanced comprehension of:
- Conversational and informal query patterns
- Technical terminology across domains
- Cross-lingual semantic concepts
- Complex multi-faceted questions

## Training Data

The model was fine-tuned on a curated corpus of Italian-English cross-lingual data, built from high-quality triplets designed to capture semantic nuances across multiple domains. The dataset emphasizes:

- **Hard negative mining**: Strategic inclusion of semantically related but incorrect documents
- **Cross-lingual alignment**: Balanced representation of Italian-English language pairs
- **Domain diversity**: Comprehensive coverage of academic, professional, and conversational contexts
- **Quality curation**: Manual review and automated filtering for coherence and relevance
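
Triplet data with mined hard negatives is typically optimized with a contrastive, InfoNCE-style objective, where the positive passage must outscore the negatives under softmax. The exact loss used here is not specified, so the following numpy sketch is illustrative only (`info_nce_loss` is a hypothetical helper, not a library function):

```python
import numpy as np

def info_nce_loss(query: np.ndarray, positive: np.ndarray,
                  negatives: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine scores: the positive document
    (index 0) competes against the negatives."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = normalize(query)
    docs = normalize(np.vstack([positive, negatives]))   # positive at index 0
    scores = scale * (docs @ q)                          # scaled cosine scores
    log_probs = scores - np.log(np.sum(np.exp(scores)))  # log-softmax
    return float(-log_probs[0])

q = np.array([1.0, 0.0, 0.0, 0.0])
easy = info_nce_loss(q, positive=q, negatives=np.array([[0.0, 1.0, 0.0, 0.0],
                                                        [0.0, 0.0, 1.0, 0.0]]))
hard = info_nce_loss(q, positive=q, negatives=np.array([[1.0, 0.1, 0.0, 0.0],
                                                        [1.0, 0.0, 0.1, 0.0]]))
# Hard negatives (nearly parallel to the positive) yield a larger loss,
# which is what makes them a stronger training signal than random negatives.
assert hard > easy
```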

## Usage

### Basic Retrieval

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-model-name")

# Cross-lingual query-document matching: an Italian query against
# English passages (plus one off-topic distractor)
query = "Come si distingue una faglia trascorrente da una normale?"
documents = [
    "Strike-slip faults are characterized by horizontal movement...",
    "Normal faults occur due to extensional stress...",
    "Investment portfolio management strategies...",
]

query_embedding = model.encode(query, prompt="Represent this search query for finding relevant passages: ")
doc_embeddings = model.encode(documents, prompt="Represent this passage for retrieval: ")
similarities = model.similarity(query_embedding, doc_embeddings)
```
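
To turn similarity scores into a ranking, sort them in descending order. A self-contained numpy sketch (the scores below are made-up stand-ins for real model output, and `rank_documents` is a hypothetical helper):

```python
import numpy as np

def rank_documents(similarities: np.ndarray, documents: list) -> list:
    """Return (document, score) pairs sorted from most to least similar."""
    order = np.argsort(similarities)[::-1]  # indices sorted by descending score
    return [(documents[i], float(similarities[i])) for i in order]

# Stand-in scores for the three documents in the snippet above.
scores = np.array([0.71, 0.64, 0.08])
docs = ["strike-slip passage", "normal-fault passage", "finance passage"]
ranked = rank_documents(scores, docs)
assert ranked[0][0] == "strike-slip passage"
```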

### Prompt Templates
The model is optimized for two prompt templates:
- **Queries**: `"Represent this search query for finding relevant passages: "`
- **Documents**: `"Represent this passage for retrieval: "`

## Applications

- Cross-lingual information retrieval systems
- Academic and technical document search
- Multilingual question-answering platforms
- Educational content recommendation
- Professional knowledge base systems

## Limitations

- **Language coverage**: Primarily optimized for Italian-English pairs
- **Domain specificity**: Performance may vary on highly specialized domains not represented in the training data
- **Cultural context**: Reflects primarily Western/European knowledge perspectives
- **Computational requirements**: Dense representations require significant storage for large-scale deployment

## Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'architecture': 'Qwen3Model'})
  (1): Pooling({'pooling_mode_lasttoken': True, 'include_prompt': True})
  (2): Normalize()
)
```
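
The `Pooling` module takes the hidden state of the sequence's last token, and `Normalize()` scales it to unit length. A numpy sketch of those two steps, assuming right-padded batches (`embed_from_hidden_states` is a hypothetical helper, not part of the library):

```python
import numpy as np

def embed_from_hidden_states(hidden_states: np.ndarray,
                             attention_mask: np.ndarray) -> np.ndarray:
    """Last-token pooling followed by L2 normalization, mirroring the
    Pooling(lasttoken) + Normalize() modules above."""
    # Index of the last non-padding token in each sequence of the batch.
    last_idx = attention_mask.sum(axis=1) - 1
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last_idx]
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy batch: 2 sequences, 4 positions, hidden size 6.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 4, 6))
mask = np.array([[1, 1, 1, 0],   # 3 real tokens, 1 pad
                 [1, 1, 1, 1]])  # 4 real tokens
emb = embed_from_hidden_states(hidden, mask)
assert emb.shape == (2, 6)
assert np.allclose(np.linalg.norm(emb, axis=1), 1.0)
```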

## Citation

```bibtex
@misc{qwen3-italian-retrieval-2024,
  title={Fine-tuned Qwen3-Embedding for Italian-English Cross-Lingual Semantic Retrieval},
  year={2024},
  howpublished={\url{https://huggingface.co/your-model-name}}
}
```

## Acknowledgments

This work builds upon the Qwen3-Embedding architecture and advances in contrastive learning for dense retrieval. We acknowledge the contributions of the Qwen team and the sentence-transformers community.

---

**License**: Inherits the licensing terms of the base Qwen/Qwen3-Embedding-0.6B model.