Commit · 48d18ef
1 Parent(s): d6e37fb

Update README.md

README.md CHANGED
@@ -41,7 +41,7 @@ We introduce LEGAL-ROBERTA, which is a domain-specific language representation model
 | legalBert | torture | detention | arrest | rape | death |
 | legalRoberta | torture | abuse | insanity | cruelty | confinement |

-'Establishing a system for the identification and registration of \<mask\> animals and regarding the
+'Establishing a system for the identification and registration of \<mask\> animals and regarding the labeling of beef and beef products .':

 | Model | top1 | top2 | top3 | top4 | top5 |
 | ------------ | ---- | --- | --- | --- | -------- |
@@ -49,9 +49,18 @@ We introduce LEGAL-ROBERTA, which is a domain-specific language representation model
 | legalBert | live | beef | farm | pet | dairy |
 | legalRoberta | domestic | all | beef | wild | registered |

+## Load Pretrained Model
+
+```python
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("saibo/legal-roberta-base")
+model = AutoModel.from_pretrained("saibo/legal-roberta-base")
+```
+
 ## Training data

-The
+The training data consists of 3 origins:

 1. Patent Litigations (https://www.kaggle.com/uspto/patent-litigations): This dataset covers over 74k cases across 52 years and over 5 million relevant documents. 5 different files detail the litigating parties, their attorneys, results, locations, and dates.
 1. *1.57GB*
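The new `## Load Pretrained Model` snippet only loads the encoder; the top-5 tables above come from masked-token prediction. Below is a minimal sketch of how such predictions can be reproduced with the Hugging Face fill-mask pipeline. The checkpoint name and prompt are taken from the README; the `top_k` value and the printing loop are assumptions, not the script used to build the tables.

```python
from transformers import pipeline

# Fill-mask pipeline over the released checkpoint; <mask> is RoBERTa's mask token.
fill_mask = pipeline("fill-mask", model="saibo/legal-roberta-base")

prompt = (
    "Establishing a system for the identification and registration of <mask> "
    "animals and regarding the labeling of beef and beef products ."
)

# Print the five most likely fillers and their scores.
for prediction in fill_mask(prompt, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 4))
```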
@@ -59,7 +68,7 @@ The training data consists of 3 origins:
 3. *clean 1.1GB*


-2. Caselaw Access Project (CAP) (https://case.law/): Following 360 years of United States
+2. Caselaw Access Project (CAP) (https://case.law/): Following 360 years of United States case law, the Caselaw Access Project (CAP) API and bulk data services include 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
 1. *raw 5.6GB*
 2. abbrev: CAP
 3. *clean 2.8GB*
@@ -69,7 +78,7 @@ The training data consists of 3 origins:
 3. *clean 1GB*

 ## Training procedure
-We start from a pretrained ROBERTA-BASE model and fine-tune it on legal corpus.
+We start from a pretrained ROBERTA-BASE model and fine-tune it on the legal corpus.

 Fine-tuning configuration:
 - lr = 5e-5 (with lr decay, ends at 4.95e-8)
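For readers who want a comparable run, here is a rough sketch of masked-language-model fine-tuning with the Hugging Face `Trainer`, starting from `roberta-base` as described above. This is not the actual training script of the release: the learning rate matches the configuration above, while the corpus file name, epoch count, batch size, and linear scheduler are assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical local corpus file; the real legal corpus is the one described above.
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM collator: randomly masks 15% of the tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="legal-roberta-base",
    learning_rate=5e-5,             # matches the configuration above; decays toward ~5e-8
    lr_scheduler_type="linear",     # assumption: decay schedule not stated in this hunk
    num_train_epochs=3,             # assumption
    per_device_train_batch_size=8,  # assumption
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```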
@@ -128,8 +137,8 @@ training time per epoch (including validation):


 ## Limitations:
-In the Masked Language Model showroom, the tokens have
-I know in case of BPE tokenizer(ROBERTA's tokenizer), the symbol Ġ means the end of a new token and the majority of tokens in vocabs of pre-trained tokenizers start with Ġ.
+In the Masked Language Model showroom, the tokens have the prefix **Ġ**. This seems weird, but I haven't yet been able to fix it.
+I know that in the case of a BPE tokenizer (ROBERTA's tokenizer), the symbol Ġ marks the beginning of a new token, and the majority of tokens in the vocabularies of pre-trained tokenizers start with Ġ.

 For example
 ```python
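# Editor's note: the Ġ behaviour described above can be checked directly on
# RoBERTa's byte-level BPE tokenizer. This is a small illustrative sketch,
# separate from the README's own example (which is not shown in this hunk);
# the sample sentence is arbitrary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("The court dismissed the appeal"))
# Ġ encodes the leading space, so tokens that start a new word are Ġ-prefixed,
# e.g. something like ['The', 'Ġcourt', 'Ġdismissed', 'Ġthe', 'Ġappeal']
# (the exact split depends on the vocabulary).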
@@ -150,4 +159,3 @@ So I think this is not fundamentally linked to the model itself.



-