	---
	library_name: transformers
	language:
	- he
	---


	## Model Details

	### Model Description

	This is the model card for a Hebrew cross-encoder published on the 🤗 Hub. It was fine-tuned from DictaBERT and is loaded through the `sentence_transformers` library.

	- Model type: CrossEncoder
	- Language: Hebrew
	- License: [More Information Needed]
	- Fine-tuned from model: [DictaBERT](https://huggingface.co/dicta-il/dictabert)


	## Uses

	The model was trained for a ranking task as part of a Hebrew semantic search engine. Given a query and a candidate passage, it outputs a relevance score for the pair.

	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from sentence_transformers import CrossEncoder


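	# English: "What did David Ben-Gurion refuse to give up?"
	query = "על מה לא הסכים דוד בן גוריון לוותר?"
	# doc1 (relevant): the aftermath of the Sinai War and Ben-Gurion's stance on the Straits of Tiran.
	doc1 = """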
	מלחמת סיני הסתיימה בתבוסה של הכוחות המצריים, אך ברית המועצות וארצות הברית הפעילו לחץ כבד על ישראל לסגת מחצי האי סיני.
	ראש ממשלת ישראל, דוד בן-גוריון, הסכים, בעקבות הלחץ של שתי המעצמות,
	לפנות את חצי האי סיני ורצועת עזה בתהליך שהסתיים במרץ 1957,
	אך הודיע שסגירה של מצרי טיראן לשיט ישראלי תהווה עילה למלחמה.
	ארצות הברית התחייבה לדאוג להבטחת חופש המעבר של ישראל במצרי טיראן.
	כוח חירום בינלאומי של האו"ם הוצב בצד המצרי של הגבול עם ישראל ובשארם א-שייח' וכתוצאה מכך נשאר נתיב השיט במפרץ אילת פתוח לשיט הישראלי.
	"""
	# doc2 (not relevant): tourism along the Red Sea coast.
	doc2 = """
	ים סוף מהווה מוקד חשוב לתיירות מרחבי העולם.
	מזג האוויר הנוח בעונת החורף, החופים היפים, הים הצלול ואתרי הצלילה המרהיבים לחופי סיני,
	מצרים, וסודאן הופכים את חופי ים סוף ליעד תיירות מבוקש.
	ראס מוחמד והחור הכחול בסיני, ידועים כאתרי צלילה מהמרהיבים בעולם.
	מאז הסכם השלום בין ישראל למצרים פיתחה מצרים מאוד את התיירות לאורך חופי ים סוף,
	ובמיוחד בסיני, ובנתה עשרות אתרי תיירות ומאות מלונות וכפרי נופש.
	תיירות זו נפגעה קשות מאז המהפכה של 2011 במצרים,
	עם עלייה חדה בתקריות טרור מצד ארגונים אסלאמיים קיצוניים בסיני.
	"""

	model = CrossEncoder("haguy77/dictabert-ce")

	# The query must ALWAYS be the first element of each [query, document] pair.
	scores = model.predict([[query, doc1], [query, doc2]])
	# array([0.02000629, 0.00031683], dtype=float32)

	# rank() sorts the documents by score; corpus_id is the index in the list that was passed in.
	results = model.rank(query, [doc2, doc1])
	# [{'corpus_id': 1, 'score': 0.020006292}, {'corpus_id': 0, 'score': 0.00031683326}]
	```
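
Since the model is meant for ranking inside a search engine, a typical pattern is to score a short list of candidate passages against the query and keep only the best few. The following is a minimal sketch of that step, not the original search engine's code. The `rerank` helper, its `score_pairs` parameter, and the `toy_scorer` stand-in are all hypothetical names introduced here so the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[List[str]]], Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (candidate_index, score) for the top_k candidates, best first."""
    if not candidates:
        return []
    # The query goes first in every pair, matching the input order the model expects.
    pairs = [[query, doc] for doc in candidates]
    scores = [float(s) for s in score_pairs(pairs)]
    ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical stand-in scorer (an assumption, not the real model):
# it counts how many query words occur in the document.
def toy_scorer(pairs: List[List[str]]) -> List[float]:
    return [float(sum(word in doc for word in q.split())) for q, doc in pairs]


top = rerank(
    "red sea diving",
    ["sinai war history", "diving in the red sea", "sea tourism"],
    toy_scorer,
    top_k=2,
)
print(top)  # [(1, 3.0), (2, 1.0)]
```

With the real model, `model.predict` from the snippet above could be passed as `score_pairs`. Recent versions of `sentence_transformers` also accept a `top_k` argument on `model.rank`, which may make a helper like this unnecessary.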

	## Training Data

	The model was fine-tuned on the [Hebrew Question Answering Dataset (HeQ)](https://github.com/NNLP-IL/Hebrew-Question-Answering-Dataset).

	## Citation

	BibTeX:

	```bibtex
	@misc{shmidman2023dictabert,
	title={DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew},
	author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel},
	year={2023},
	eprint={2308.16687},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```
	```bibtex
	@inproceedings{cohen2023heq,
	title={{HeQ}: a Large and Diverse {Hebrew} Reading Comprehension Benchmark},
	author={Cohen, Amir and Merhav-Fine, Hilla and Goldberg, Yoav and Tsarfaty, Reut},
	booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
	pages={13693--13705},
	year={2023}
	}
	```

	APA:
	```apa
	Shmidman, S., Shmidman, A., & Koppel, M. (2023). DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew. arXiv preprint arXiv:2308.16687.

	Cohen, A., Merhav-Fine, H., Goldberg, Y., & Tsarfaty, R. (2023, December). HeQ: A large and diverse Hebrew reading comprehension benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 13693-13705).
	```