# Legal-HeBERT
Legal-HeBERT is a BERT model for the Hebrew legal and legislative domains. It is intended to advance legal NLP research and tool development in Hebrew. We release two versions of Legal-HeBERT. The first version is a fine-tuned [HeBERT](https://github.com/avichaychriqui/HeBERT) model, trained further on legal and legislative documents. The second version follows [HeBERT](https://github.com/avichaychriqui/HeBERT)'s architecture guidelines to train a BERT model from scratch. <br>
We continue to collect legal data, examine different architectural designs, and build tagged datasets and legal tasks for evaluating and developing Hebrew legal tools.

## Training Data
Our training datasets are:
| Name | Hebrew Description | Size (GB) | Documents | Sentences | Words | Notes |
|---|---|---|---|---|---|---|
| The Israeli Law Book | ספר החוקים הישראלי | 0.05 | 2,338 | 293,352 | 4,851,063 | |
| Judgments of the Supreme Court | מאגר פסקי הדין של בית המשפט העליון | 0.7 | 212,348 | 5,790,138 | 79,672,415 | |
| Decisions of the Custody Courts | החלטות בתי הדין למשמורת | 2.46 | 169,708 | 8,555,893 | 213,050,492 | |
| Law memoranda, drafts of secondary legislation, and drafts of support tests distributed for public comment | תזכירי חוק, טיוטות חקיקת משנה וטיוטות מבחני תמיכה שהופצו להערות הציבור | 0.4 | 3,291 | 294,752 | 7,218,960 | |
| Judgments of the Supervisors of Land Registration | מאגר פסקי דין של המפקחים על רישום המקרקעין | 0.02 | 559 | 67,639 | 1,785,446 | |
| Decisions of the Employment Service Tribunal – Corona | מאגר החלטות בית הדין לעניין שירות התעסוקה – קורונה | 0.001 | 146 | 3,505 | 60,195 | |
| Decisions of the Israel Lands Council | החלטות מועצת מקרקעי ישראל | | 118 | 11,283 | 162,692 | aggregate file |
| Judgments of the Disciplinary Tribunal and the Appeals Tribunal of the Israel Police | פסקי דין של בית הדין למשמעת ובית הדין לערעורים של משטרת ישראל | 0.02 | 54 | 83,724 | 1,743,419 | aggregate files |
| Disciplinary Appeals Committee of the Ministry of Health | ועדת ערר לדין משמעתי במשרד הבריאות | 0.004 | 252 | 21,010 | 429,807 | 465 scanned files could not be parsed |
| Attorney General's Positions | מאגר התייצבויות היועץ המשפטי לממשלה | 0.008 | 281 | 32,724 | 813,877 | |
| Legal opinions of the Attorney General | מאגר חוות דעת היועץ המשפטי לממשלה | 0.002 | 44 | 7,132 | 188,053 | |
| Total | | 3.665 | 389,139 | 15,161,152 | 309,976,419 | |

We thank <b>Yair Gardin</b> for referring us to the government data, <b>Elhanan Schwarts</b> for collecting and parsing the Israeli Law Book, and <b>Jonathan Schler</b> for collecting the judgments of the Supreme Court.

## Training process
The main pre-training settings (a configuration sketch follows this list):
* Vocabulary size: 50,000 tokens
* 4 epochs (~1M steps)
* Learning rate: 5e-5
* MLM masking probability: 0.15
* Batch size: 32 per GPU
* Hardware: NVIDIA GeForce RTX 2080 Ti + NVIDIA GeForce RTX 3090 (about one week of training)
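
The settings above map directly onto a standard masked-language-modeling setup. The following is a minimal sketch of such a configuration using the Hugging Face `Trainer`; the corpus file `legal_corpus.txt` and the reuse of the released tokenizer are illustrative assumptions, not the authors' exact training script.
```
# Illustrative MLM pre-training configuration; 'legal_corpus.txt' is a
# hypothetical placeholder for the corpora described in the table above.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# For illustration we reuse the released 50,000-token tokenizer.
tokenizer = BertTokenizerFast.from_pretrained('avichr/Legal-heBERT')

dataset = load_dataset('text', data_files={'train': 'legal_corpus.txt'})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

train_set = dataset['train'].map(tokenize, batched=True, remove_columns=['text'])

# Fresh BERT weights, as in the from-scratch variant.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# Randomly masks 15% of tokens, matching mlm_probability above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir='legal-hebert-mlm',
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=32,  # per GPU
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=train_set).train()
```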

### Additional training settings
<b>Fine-tuned [HeBERT](https://github.com/avichaychriqui/HeBERT) model:</b> the first eight layers were frozen, as [Lee et al. (2019)](https://arxiv.org/abs/1911.03090) suggest (see the sketch below). <br>
<b>Legal-HeBERT trained from scratch:</b> the training process is similar to [HeBERT](https://github.com/avichaychriqui/HeBERT)'s and is inspired by [Chalkidis et al. (2020)](https://arxiv.org/abs/2010.02559). <br>
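
Layer freezing of this kind takes only a few lines in `transformers`. A minimal sketch, assuming `avichr/heBERT` as the base checkpoint; the authors' exact freezing code is not published:
```
# Sketch of freezing the first eight encoder layers before fine-tuning;
# layers 8-11 (and the MLM head) remain trainable. Whether the embedding
# layer was also frozen is not specified in the description above.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained('avichr/heBERT')  # assumed base checkpoint

for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

# Sanity check: count trainable vs. total parameters.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f'{trainable}/{total} parameters will be updated')
```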

## How to use
The models are available on the Hugging Face Hub and can be fine-tuned for any downstream task:
```
# !pip install transformers==4.14.1
from transformers import AutoTokenizer, AutoModel

model_name = 'avichr/Legal-heBERT_ft'  # the fine-tuned HeBERT model
# model_name = 'avichr/Legal-heBERT'   # uncomment for the model trained from scratch

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model=model_name,
)
# "Corona took the [MASK] and we have nothing left."
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
```
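
For a downstream task, a classification head can be attached in the usual `transformers` way. A minimal sketch of fine-tuning for binary classification; the two-label setup and the placeholder examples are hypothetical:
```
# Illustrative fine-tuning sketch; replace the toy dataset with a real
# tagged legal corpus and an appropriate label set.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = 'avichr/Legal-heBERT_ft'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny placeholder dataset (two Hebrew example sentences).
data = Dataset.from_dict({
    'text': ['דוגמה ראשונה', 'דוגמה שנייה'],
    'label': [0, 1],
})
data = data.map(
    lambda b: tokenizer(b['text'], truncation=True, padding='max_length', max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='legal-hebert-cls', num_train_epochs=1),
    train_dataset=data,
)
trainer.train()
```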

## Stay tuned!
We are still working on our models and datasets, and we will update this page as we progress. We are open to collaborations.

## Contact us
[Avichay Chriqui](mailto:[email protected]), The Coller AI Lab <br>
[Inbal Yahav](mailto:[email protected]), The Coller AI Lab <br>
[Ittai Bar-Siman-Tov](mailto:[email protected]), the BIU Innovation Lab for Law, Data-Science and Digital Ethics <br>

Thank you, תודה, شكرا <br>