	---
	language:
	- fa
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- loss:CachedMultipleNegativesRankingLoss
	widget:
	- source_sentence: درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.
	  sentences:
	  - >-
	    درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار
	    می‌روند.
	  - درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.
	  - ایران برای بار دیگر توانست به مدال طلا دست یابد.
	- source_sentence: در زمستان هوای تهران بسیار آلوده است.
	  sentences:
	  - تهران هوای پاکی در فصل زمستان دارد.
	  - مشهد و تهران شلوغ‌ترین شهرهای ایران هستند.
	  - در زمستان‌ها هوای تهران پاک نیست.
	- source_sentence: یادگیری زبان خارجی فرصت‌های شغلی را افزایش می‌دهد.
	  sentences:
	  - تسلط بر چند زبان، شانس استخدام در شرکت‌های بین‌المللی را بالا می‌برد.
	  - دانستن زبان‌های خارجی تأثیری در موفقیت شغلی ندارد.
	  - دمای هوا در قطب جنوب به پایین‌ترین حد خود در 50 سال اخیر رسید.
	- source_sentence: سفر کردن باعث گسترش دیدگاه‌های فرهنگی می‌شود.
	  sentences:
	  - بازدید از کشورهای مختلف به درک بهتر تنوع فرهنگی کمک می‌کند.
	  - سفر کردن هیچ تأثیری بر دیدگاه‌های فرهنگی افراد ندارد
	  - دمای هوا در قطب جنوب به پایین‌ترین حد خود در 50 سال اخیر رسید.
	base_model:
	- PartAI/TookaBERT-Base
	---

	# Tooka-SBERT-V2-Small


	This model is a Sentence Transformers model trained for semantic textual similarity and embedding tasks. It maps sentences and paragraphs to a dense vector space, where semantically similar texts are close together.

	The model is available in two sizes: [Small](https://huggingface.co/PartAI/Tooka-SBERT-V2-Small/) and [Large](https://huggingface.co/PartAI/Tooka-SBERT-V2-Large/).

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install sentence-transformers==3.4.1
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("PartAI/Tooka-SBERT-V2-Small")
	# Run inference
	sentences = [
	    'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',
	    'درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار می‌روند.',
	    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.',
	]
	# encode() returns a NumPy array with one row per sentence
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 768)

	# Get the pairwise similarity scores for the embeddings (a torch tensor)
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# torch.Size([3, 3])
	```

	## 🛠️ Training Details
	The training is performed in two stages:

	1. Pretraining on the Targoman News dataset
	2. Fine-tuning on multiple synthetic datasets

	### Stage 1: Pretraining
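	- We use an asymmetric setup in which news titles are treated as queries and article texts as documents.
	- Input formatting:
	  - Titles are prepended with `"سوال: "` ("question: ")
	  - Texts are prepended with `"متن: "` ("text: ")
	- Loss function: `CachedMultipleNegativesRankingLoss`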
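
The input formatting above can be sketched as a small helper. This is a minimal illustration, not the authors' training code: the function name `format_pair`, the `anchor`/`positive` key names, and the sample news item are assumptions introduced here.

```python
# Prefixes taken from the Stage 1 description above.
QUERY_PREFIX = "سوال: "
PASSAGE_PREFIX = "متن: "


def format_pair(title: str, text: str) -> dict:
    """Build one (anchor, positive) pair with the asymmetric prefixes.

    The "anchor"/"positive" key names are illustrative; they are not taken
    from the original training pipeline.
    """
    return {
        "anchor": QUERY_PREFIX + title.strip(),
        "positive": PASSAGE_PREFIX + text.strip(),
    }


# Hypothetical news item, used only to show the resulting strings.
pair = format_pair("آلودگی هوای تهران", "در زمستان هوای تهران بسیار آلوده است.")
print(pair["anchor"])
print(pair["positive"])
```

Pairs shaped like this are the kind of input a ranking loss such as `CachedMultipleNegativesRankingLoss` consumes, with the other positives in the batch acting as negatives.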

	### Stage 2: Fine-tuning
	- Loss functions, applied across multiple synthetic datasets:
	  - `CachedMultipleNegativesRankingLoss`
	  - `CoSENTLoss`
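
For intuition about the ranking objective named in both stages, the following is a simplified sketch of an in-batch negatives loss: each anchor should score its own positive higher than every other positive in the batch. It is not the library implementation. `CachedMultipleNegativesRankingLoss` additionally uses gradient caching to make large batches fit in memory (see the citation at the end of this card), and the `scale` value and toy vectors here are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy 2-D embeddings, made up for illustration.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive points the same way as its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong anchor

print(in_batch_negatives_loss(anchors, aligned))
print(in_batch_negatives_loss(anchors, swapped))
```

The loss is close to zero when every anchor is nearest to its own positive and grows when another item in the batch is closer.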


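	## 📊 Evaluation
	We evaluate our model on the [PTEB Benchmark](https://huggingface.co/spaces/PartAI/pteb-leaderboard). Tooka-SBERT-V2-Small scores higher than multilingual-e5-base on the cross-task average (70.62 vs. 70.09), although multilingual-e5-base remains ahead on the Retrieval and Reranking averages.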

	For Retrieval and Reranking tasks, we follow the same asymmetric structure, prepending:
	- `"سوال: "` to queries
	- `"متن: "` to documents
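
A minimal sketch of applying these prefixes at query time is shown below. The helper name `rank_documents` is hypothetical; it relies only on the `encode` and `similarity` methods already used in the usage example, and it is not the evaluation code behind the reported scores.

```python
# Prefixes as described above for Retrieval and Reranking.
QUERY_PREFIX = "سوال: "
PASSAGE_PREFIX = "متن: "


def rank_documents(model, query, documents):
    """Return (document, score) pairs sorted by similarity to the query, best first.

    `model` is expected to behave like a SentenceTransformer instance, e.g.
    SentenceTransformer("PartAI/Tooka-SBERT-V2-Small").
    """
    query_emb = model.encode([QUERY_PREFIX + query])
    doc_embs = model.encode([PASSAGE_PREFIX + d for d in documents])
    # similarity() yields a 1 x len(documents) score matrix; take its only row.
    scores = model.similarity(query_emb, doc_embs)[0]
    order = sorted(range(len(documents)), key=lambda i: float(scores[i]), reverse=True)
    return [(documents[i], float(scores[i])) for i in order]
```

With the model loaded as in the usage section, `rank_documents(model, query, documents)` returns the candidate documents ordered from most to least similar.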


	| Model | #Params | Pair-Classification-Avg | Classification-Avg | Retrieval-Avg | Reranking-Avg | CrossTasks-Avg |
	|-------|:-------:|-------------------------|--------------------|---------------|---------------|----------------|
	| [Tooka-SBERT-V2-Large](https://huggingface.co/PartAI/Tooka-SBERT-V2-Large) | 353M | 80.24 | 74.73 | 59.80 | 73.44 | 72.05 |
	| [Tooka-SBERT-V2-Small](https://huggingface.co/PartAI/Tooka-SBERT-V2-Small) | 123M | 75.69 | 72.16 | 61.24 | 73.40 | 70.62 |
	| [jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 572M | 71.88 | 79.27 | 65.18 | 64.62 | 70.24 |
	| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 70.76 | 69.71 | 63.90 | 76.01 | 70.09 |
	| [Tooka-SBERT-V1-Large](https://huggingface.co/PartAI/Tooka-SBERT) | 353M | 81.52 | 71.54 | 45.61 | 60.44 | 64.78 |
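
The CrossTasks-Avg column appears to be the unweighted mean of the four task averages. The card does not state this; it is an inference from the numbers, and the short check below, using values copied from the table, is consistent with it up to rounding.

```python
# Values copied from the table above:
# (Pair-Classification, Classification, Retrieval, Reranking, reported CrossTasks-Avg)
rows = {
    "Tooka-SBERT-V2-Large": (80.24, 74.73, 59.80, 73.44, 72.05),
    "Tooka-SBERT-V2-Small": (75.69, 72.16, 61.24, 73.40, 70.62),
    "jina-embeddings-v3": (71.88, 79.27, 65.18, 64.62, 70.24),
    "multilingual-e5-base": (70.76, 69.71, 63.90, 76.01, 70.09),
    "Tooka-SBERT-V1-Large": (81.52, 71.54, 45.61, 60.44, 64.78),
}


def cross_task_mean(task_avgs):
    """Unweighted mean of the per-task averages (assumed aggregation)."""
    return sum(task_avgs) / len(task_avgs)


for name, (*task_avgs, reported) in rows.items():
    mean = cross_task_mean(task_avgs)
    print(f"{name}: computed={mean:.3f}, reported={reported}")
```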


	### Task-Specific Datasets in PTEB

	- Pair-Classification:
	  - FarsTail

	- Classification:
	  - MassiveIntentClassification
	  - MassiveScenarioClassification
	  - MultilingualSentimentClassification
	  - PersianFoodSentimentClassification

	- Retrieval:
	  - MIRACLRetrieval
	  - NeuCLIR2023Retrieval
	  - WikipediaRetrievalMultilingual

	- Reranking:
	  - MIRACLReranking
	  - WikipediaRerankingMultilingual


	## Citation

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	author = "Reimers, Nils and Gurevych, Iryna",
	booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	month = "11",
	year = "2019",
	publisher = "Association for Computational Linguistics",
	url = "https://arxiv.org/abs/1908.10084",
	}
	```

	#### CachedMultipleNegativesRankingLoss
	```bibtex
	@misc{gao2021scaling,
	title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
	author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
	year={2021},
	eprint={2101.06983},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
	}
	```