Saibo-creator committed
Commit 48d18ef · 1 Parent(s): d6e37fb

Update README.md

Files changed (1):
  1. README.md +15 -7
README.md CHANGED
@@ -41,7 +41,7 @@ We introduce LEGAL-ROBERTA, which is a domain-specific language representation m
 | legalBert | torture | detention | arrest | rape | death |
 | legalRoberta | torture | abuse | insanity | cruelty | confinement |
 
-'Establishing a system for the identification and registration of \<mask\> animals and regarding the labelling of beef and beef products .':
+'Establishing a system for the identification and registration of \<mask\> animals and regarding the labeling of beef and beef products .':
 
 | Model | top1 | top2 | top3 | top4 | top5 |
 | ------------ | ---- | --- | --- | --- | -------- |
@@ -49,9 +49,18 @@ We introduce LEGAL-ROBERTA, which is a domain-specific language representation m
 | legalBert | live | beef | farm | pet | dairy |
 | legalRoberta | domestic | all | beef | wild | registered |
 
+## Load Pretrained Model
+
+```python
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("saibo/legal-roberta-base")
+model = AutoModel.from_pretrained("saibo/legal-roberta-base")
+```
+
 ## Training data
 
-The tranining data consists of 3 origins:
+The training data consists of 3 origins:
 
 1. Patent Litigations (https://www.kaggle.com/uspto/patent-litigations): This dataset covers over 74k cases across 52 years and over 5 million relevant documents. 5 different files detail the litigating parties, their attorneys, results, locations, and dates.
    1. *1.57GB*
@@ -59,7 +68,7 @@ The tranining data consists of 3 origins:
    3. *clean 1.1GB*
 
 
-2. Caselaw Access Project (CAP) (https://case.law/): Following 360 years of United States caselaw, Caselaw Access Project (CAP) API and bulk data services includes 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
+2. Caselaw Access Project (CAP) (https://case.law/): Following 360 years of United States case law, Caselaw Access Project (CAP) API and bulk data services includes 40 million pages of U.S. court decisions and almost 6.5 million individual cases.
    1. *raw 5.6*
    2. abbrev:CAP
    3. *clean 2.8GB*
@@ -69,7 +78,7 @@ The tranining data consists of 3 origins:
    3. *clean 1GB*
 
 ## Training procedure
-We start from a pretrained ROBERTA-BASE model and fine-tune it on legal corpus.
+We start from a pretrained ROBERTA-BASE model and fine-tune it on the legal corpus.
 
 Fine-tuning configuration:
 - lr = 5e-5(with lr decay, ends at 4.95e-8)
@@ -128,8 +137,8 @@ tranining time per epoch(including validation ):
 
 
 ## Limitations:
-In the Masked Language Model showroom, the tokens have a prefix **Ġ**. This seems to be wired but I haven't yet been able to fix it.
-I know in case of BPE tokenizer(ROBERTA's tokenizer), the symbol Ġ means the end of a new token and the majority of tokens in vocabs of pre-trained tokenizers start with Ġ.
+In the Masked Language Model showroom, the tokens have the prefix **Ġ**. This seems to be wired but I haven't yet been able to fix it.
+I know in the case of BPE tokenizer(ROBERTA's tokenizer), the symbol Ġ means the end of a new token, and the majority of tokens in the vocabs of pre-trained tokenizers start with Ġ.
 
 For example
 ```python
@@ -150,4 +159,3 @@ So I think this is not fundamentally linked to the model itself.
 
 
 
-
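The top1 to top5 prediction tables in the diff above are masked-token predictions. As a minimal sketch (not part of the commit), the Hugging Face `fill-mask` pipeline can reproduce that kind of comparison; the sentence below is the one quoted in the README, and `top_k=5` simply mirrors the five columns of the tables:

```python
from transformers import pipeline

# Masked-token prediction with the checkpoint referenced in the README.
fill_mask = pipeline("fill-mask", model="saibo/legal-roberta-base")

# Example sentence from the README; RoBERTa-style models use the <mask> token.
sentence = ("Establishing a system for the identification and registration of "
            "<mask> animals and regarding the labeling of beef and beef products .")

# top_k=5 mirrors the top1..top5 columns of the comparison tables.
for pred in fill_mask(sentence, top_k=5):
    print(f"{pred['token_str']!r}  score={pred['score']:.4f}")
```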
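The training-procedure hunk only shows the learning-rate line (lr = 5e-5 with decay, ending at 4.95e-8). Below is a hedged sketch of how such a masked-language-model fine-tuning run could look with the `transformers` Trainer; the linear schedule, masking probability, batch size, epoch count, and the toy corpus are all assumptions for illustration, not settings taken from the commit:

```python
from transformers import (AutoTokenizer, RobertaForMaskedLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# Start from pretrained ROBERTA-BASE, as the Training procedure section describes.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Toy stand-in for the tokenized legal corpus (Patent Litigations, CAP, ...).
texts = ["The court granted the motion to dismiss.",
         "The patent claim was rejected as obvious."]
train_dataset = [tokenizer(t, truncation=True, max_length=64) for t in texts]

# Standard masked-language-model objective; 15% masking is an assumed default.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="legal-roberta-base",
    learning_rate=5e-5,             # from the README
    lr_scheduler_type="linear",     # assumption: decays toward ~0, consistent with "ends at 4.95e-8"
    num_train_epochs=3,             # assumption
    per_device_train_batch_size=8,  # assumption
)

trainer = Trainer(model=model, args=args,
                  data_collator=collator, train_dataset=train_dataset)
trainer.train()
```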
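The Limitations hunk discusses the Ġ prefix produced by RoBERTa's byte-level BPE tokenizer. A small illustration of that behaviour, assuming the `saibo/legal-roberta-base` checkpoint loaded in the README's example: the tokenizer marks word-initial tokens with Ġ, and converting the tokens back to a string removes the marker, which supports the README's point that the prefix is a tokenizer artifact rather than a property of the model.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("saibo/legal-roberta-base")

tokens = tokenizer.tokenize("The court dismissed the appeal.")
print(tokens)
# Word-initial tokens carry the byte-level BPE space marker,
# e.g. ['The', 'Ġcourt', 'Ġdismissed', 'Ġthe', 'Ġappeal', '.']

# Converting tokens back to a string strips the Ġ markers again.
print(tokenizer.convert_tokens_to_string(tokens))
```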