PartAI/Tooka-SBERT-V2-Small

+---
+language:
+- fa
+library_name: sentence-transformers
+pipeline_tag: sentence-similarity
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- loss:CachedMultipleNegativesRankingLoss
+widget:
+- source_sentence: درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.
+  sentences:
+  - >-
+    درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار
+    می‌روند.
+  - درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.
+  - ایران برای بار دیگر توانست به مدال طلا دست یابد.
+- source_sentence: در زمستان هوای تهران بسیار آلوده است.
+  sentences:
+  - تهران هوای پاکی در فصل زمستان دارد.
+  - مشهد و تهران شلوغ‌ترین شهرهای ایران هستند.
+  - در زمستان‌ها هوای تهران پاک نیست.
+- source_sentence: یادگیری زبان خارجی فرصت‌های شغلی را افزایش می‌دهد.
+  sentences:
+  - تسلط بر چند زبان، شانس استخدام در شرکت‌های بین‌المللی را بالا می‌برد.
+  - دانستن زبان‌های خارجی تأثیری در موفقیت شغلی ندارد.
+  - دمای هوا در قطب جنوب به پایین‌ترین حد خود در 50 سال اخیر رسید.
+- source_sentence: سفر کردن باعث گسترش دیدگاه‌های فرهنگی می‌شود.
+  sentences:
+  - بازدید از کشورهای مختلف به درک بهتر تنوع فرهنگی کمک می‌کند.
+  - سفر کردن هیچ تأثیری بر دیدگاه‌های فرهنگی افراد ندارد
+  - دمای هوا در قطب جنوب به پایین‌ترین حد خود در 50 سال اخیر رسید.
+base_model:
+- PartAI/TookaBERT-Base
+---
+# TookaSBERT-Base1
+This is a Sentence Transformers model trained for semantic textual similarity and embedding tasks. It maps sentences and paragraphs to a dense vector space in which semantically similar texts lie close together.
+The model is trained in two sizes: **Base** and **Large**.
+## Usage
+### Direct Usage (Sentence Transformers)
+First install the Sentence Transformers library:
+```bash
+pip install sentence-transformers==3.4.1
+```
+Then you can load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+# Download from the 🤗 Hub
+model = SentenceTransformer("PartAI/TookaSBERT-Base1")
+# Run inference
+sentences = [
+    'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',
+    'درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار می‌روند.',
+    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.'
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# (3, embedding_dim): one vector per sentence; the width depends on the model size
+# Get the similarity scores for the embeddings
+similarities = model.similarity(embeddings, embeddings)
+print(similarities.shape)
+# torch.Size([3, 3])
+```
+## 🛠️ Training Details
+The training is performed in two stages:
+1. **Pretraining** on the *Targoman News* dataset
+2. **Fine-tuning** on multiple synthetic datasets
+### Stage 1: Pretraining
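+- We use an **asymmetric** setup: each news title acts as the query and its article text as the matching document.
+- Input formatting:
+  - Titles are prepended with `"سوال: "` ("Question: ")
+  - Texts are prepended with `"متن: "` ("Text: ")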
+- Loss function: `CachedMultipleNegativesRankingLoss`
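
The input formatting above can be sketched as follows, assuming each news item is a (title, text) pair. The helper name `format_pair` and the sample strings are illustrative assumptions, not part of the original training code; only the two prefixes come from this card.

```python
# Prefixes quoted from the model card; everything else here is illustrative.
QUERY_PREFIX = "سوال: "  # "Question: ", prepended to titles
DOC_PREFIX = "متن: "  # "Text: ", prepended to article texts


def format_pair(title: str, text: str) -> tuple[str, str]:
    """Return an (anchor, positive) pair with the asymmetric prefixes applied."""
    return QUERY_PREFIX + title.strip(), DOC_PREFIX + text.strip()


# Hypothetical news item, used only to show the resulting strings.
anchor, positive = format_pair(
    "آلودگی هوای تهران",
    "در زمستان هوای تهران بسیار آلوده است.",
)
print(anchor)
print(positive)
```

Pairs built this way are the kind of (anchor, positive) input that a ranking loss with in-batch negatives expects.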
+### Stage 2: Fine-tuning
+- Loss functions:
+  - `CachedMultipleNegativesRankingLoss`
+  - `CoSENTLoss`
+- These losses are applied across multiple synthetic datasets
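
For intuition, the objective behind `CachedMultipleNegativesRankingLoss` is a cross-entropy over in-batch similarities: each anchor should score its own positive higher than every other positive in the batch. The sketch below computes that quantity in plain Python for toy vectors. It is a simplified illustration, not the library implementation: it omits the gradient caching that gives the cached variant its name, and the function name `in_batch_negatives_loss` and the `scale` value of 20.0 are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d embeddings: aligned pairs give a low loss, swapped pairs a high one.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]
print(in_batch_negatives_loss(anchors, aligned))
print(in_batch_negatives_loss(anchors, swapped))
```

Because every other positive in the batch serves as a negative, larger batches give a harder contrastive task, which is the motivation for the memory-saving cached variant cited at the end of this card.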
+## 📊 Evaluation
+We evaluate our model on the [**PTEB Benchmark**](https://huggingface.co/spaces/PartAI/pteb-leaderboard). On average across PTEB tasks, our model **outperforms mE5-Base** (`multilingual-e5-base`; compare the Tasks-Avg column in the table below).
+For *Retrieval* and *Reranking* tasks, we follow the same asymmetric structure as in pretraining, prepending:
+- `"سوال: "` ("Question: ") to queries
+- `"متن: "` ("Text: ") to documents
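
A sketch of how this prefixing might be applied at query time. `rank_documents` is a hypothetical helper, not part of any library; it accepts any `encode` callable that maps a list of strings to a list of vectors, so with the real model one would pass `model.encode`. The `toy_encode` function is only a stand-in (a character-bigram hashing embedder) so the example runs without downloading weights, and its scores say nothing about the model's actual behavior.

```python
import math

QUERY_PREFIX = "سوال: "  # prepended to queries
DOC_PREFIX = "متن: "  # prepended to documents


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query, documents, encode):
    """Return (score, document) pairs sorted from most to least similar."""
    query_vec = encode([QUERY_PREFIX + query])[0]
    doc_vecs = encode([DOC_PREFIX + d for d in documents])
    scored = [(cosine(query_vec, v), d) for v, d in zip(doc_vecs, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_encode(texts, dim=64):
    """Stand-in encoder: counts of character bigrams hashed into `dim` buckets."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for a, b in zip(text, text[1:]):
            vec[(ord(a) * 31 + ord(b)) % dim] += 1.0
        vectors.append(vec)
    return vectors


docs = [
    "در زمستان‌ها هوای تهران پاک نیست.",
    "ایران برای بار دیگر توانست به مدال طلا دست یابد.",
]
for score, doc in rank_documents("هوای تهران در زمستان", docs, toy_encode):
    print(f"{score:.3f}  {doc}")
```

Note that the prefixes are added only to the strings passed to the encoder; the returned pairs keep the original, unprefixed document text.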
+| Model                                                                          | #Params | Pair-Classification-Avg | Classification-Avg | Retrieval-Avg | Reranking-Avg | Tasks-Avg |
+|--------------------------------------------------------------------------------|:-------:|-------------------------|--------------------|---------------|---------------|-----------|
+| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)   |  278M   | 70.76                   | 69.71              | 63.90         | 76.01         | 70.09     |
+| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) |  560M   | 72.55                   | 72.18              | **65.36**     | **78.52**     | **72.15** |
+| [jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)         |  572M   | 71.88                   | **79.27**          | 65.18         | 64.62         | 70.24     |
+| tooka-sbert-large-v1                                                           |  353M   | **81.52**               | 71.54              | 45.61         | 60.44         | 64.78     |
+| tooka-sbert-base-v2                                                            |  123M   | 75.69                   | 72.16              | 61.24         | 73.40         | 70.62     |
+| tooka-sbert-large-v2                                                           |  353M   | 80.24                   | 74.73              | 59.80         | 73.44         | 72.05     |
+### Task-Specific Datasets in PTEB
+- **Pair-Classification**:
+  - FarsTail
+- **Classification**:
+  - MassiveIntentClassification
+  - MassiveScenarioClassification
+  - MultilingualSentimentClassification
+  - PersianFoodSentimentClassification
+- **Retrieval**:
+  - MIRACLRetrieval
+  - NeuCLIR2023Retrieval
+  - WikipediaRetrievalMultilingual
+- **Reranking**:
+  - MIRACLReranking
+  - WikipediaRerankingMultilingual
+## Citation
+### BibTeX
+#### Sentence Transformers
+```bibtex
+@inproceedings{reimers-2019-sentence-bert,
+    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+    author = "Reimers, Nils and Gurevych, Iryna",
+    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+    month = "11",
+    year = "2019",
+    publisher = "Association for Computational Linguistics",
+    url = "https://arxiv.org/abs/1908.10084",
+}
+```
+#### CachedMultipleNegativesRankingLoss
+```bibtex
+@misc{gao2021scaling,
+    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
+    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
+    year={2021},
+    eprint={2101.06983},
+    archivePrefix={arXiv},
+    primaryClass={cs.LG}
+}
+```