Files changed (1)
  1. README.md +151 -41
README.md CHANGED
@@ -1,64 +1,174 @@
1
  ---
2
  tags:
3
- - generated_from_trainer
4
  model-index:
5
  - name: legal-croatian-roberta-base
6
  results: []
 
 
 
7
  ---
8
 
9
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
10
- should probably proofread and complete it, then remove this comment. -->
11
 
12
- # legal-croatian-roberta-base
13
 
14
- This model was trained from scratch on an unknown dataset.
15
- It achieves the following results on the evaluation set:
16
- - Loss: 0.4008
17
 
18
- ## Model description
19
 
20
- More information needed
 
 
 
21
 
22
- ## Intended uses & limitations
23
 
24
- More information needed
25
 
26
- ## Training and evaluation data
27
 
28
- More information needed
29
 
30
- ## Training procedure
31
 
32
- ### Training hyperparameters
33
 
34
- The following hyperparameters were used during training:
35
- - learning_rate: 0.0001
36
- - train_batch_size: 16
37
- - eval_batch_size: 16
38
- - seed: 42
39
- - distributed_type: tpu
40
- - num_devices: 8
41
- - gradient_accumulation_steps: 4
42
- - total_train_batch_size: 512
43
- - total_eval_batch_size: 128
44
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
45
- - lr_scheduler_type: cosine
46
- - lr_scheduler_warmup_ratio: 0.05
47
- - training_steps: 200000
48
 
49
- ### Training results
50
 
51
- | Training Loss | Epoch | Step | Validation Loss |
52
- |:-------------:|:-----:|:------:|:---------------:|
53
- | 0.8722 | 32.0 | 50000 | 0.5036 |
54
- | 0.7612 | 64.0 | 100000 | 0.4363 |
55
- | 0.6916 | 96.0 | 150000 | 0.4080 |
56
- | 0.6656 | 128.0 | 200000 | 0.4008 |
57
 
 
58
 
59
- ### Framework versions
60
 
61
- - Transformers 4.20.1
62
- - Pytorch 1.12.0+cu102
63
- - Datasets 2.9.0
64
- - Tokenizers 0.12.0
 
1
  ---
2
  tags:
3
+ - legal
4
  model-index:
5
  - name: legal-croatian-roberta-base
6
  results: []
7
+ license: cc
8
+ language:
9
+ - hr
10
  ---
11
 
12
+ # Model Card for joelito/legal-croatian-roberta-base
 
13
 
14
+ This is a monolingual Croatian language model pretrained on legal data. It is based on XLM-R ([base](https://huggingface.co/xlm-roberta-base) and [large](https://huggingface.co/xlm-roberta-large)). For pretraining, we used the Croatian portion of [Multi Legal Pile](https://huggingface.co/datasets/joelito/Multi_Legal_Pile) ([Niklaus et al. 2023](https://arxiv.org/abs/2306.02069?utm_source=tldrai)), a multilingual dataset drawn from various legal sources and covering 24 languages.
15
 
16
+ ## Model Details
 
 
17
 
18
+ ### Model Description
19
 
20
+ - **Developed by:** Joel Niklaus: [huggingface](https://huggingface.co/joelito); [email](mailto:[email protected])
21
+ - **Model type:** Transformer-based language model (RoBERTa)
22
+ - **Language(s) (NLP):** Croatian
23
+ - **License:** CC BY-SA
24
 
25
+ ## Uses
26
 
27
+ ### Direct Use and Downstream Use
28
 
29
+ You can use the raw model for masked language modeling. Next sentence prediction is not available because the model was not trained with that objective. The model's main purpose, however, is to be fine-tuned for downstream tasks.
30
 
31
+ This model is primarily designed for fine-tuning on tasks that use the entire sentence, potentially with masked elements, to make decisions. Examples include sequence classification, token classification, and question answering. For text generation tasks, models like GPT-2 are more suitable.
32
 
33
+ Additionally, the model is specifically trained on legal data, aiming to deliver strong performance in that domain. Its performance may vary when applied to non-legal data.
34
 
35
+ ### Out-of-Scope Use
36
 
37
+ For tasks such as text generation, you should look at models like GPT-2.
 
 
 
 
 
 
 
 
 
 
 
 
 
38
 
39
+ The model should not be used to intentionally create hostile or alienating environments for people. The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.
40
 
41
+ ## Bias, Risks, and Limitations
 
 
 
 
 
42
 
43
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
44
 
45
+ ### Recommendations
46
 
47
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
48
+
49
+ ## How to Get Started with the Model
50
+
51
+ See [huggingface tutorials](https://huggingface.co/learn/nlp-course/chapter7/1?fw=pt). For masked word prediction see [this tutorial](https://huggingface.co/tasks/fill-mask).
52
+
53
+ ## Training Details
54
+
55
+ This model was pretrained on [Multi Legal Pile](https://huggingface.co/datasets/joelito/Multi_Legal_Pile) ([Niklaus et al. 2023](https://arxiv.org/abs/2306.02069?utm_source=tldrai)).
56
+
57
+ Our pretraining procedure includes the following key steps:
58
+
59
+ (a) Warm-starting: We initialize our models from the original XLM-R checkpoints ([base](https://huggingface.co/xlm-roberta-base) and [large](https://huggingface.co/xlm-roberta-large)) of [Conneau et al. (2019)](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf) to benefit from a well-trained base.
60
+
61
+ (b) Tokenization: We train a new tokenizer of 128K BPEs to cover legal language better. However, we reuse the original XLM-R embeddings for lexically overlapping tokens and use random embeddings for the rest.
62
+
63
+ (c) Pretraining: We continue pretraining on Multi Legal Pile with batches of 512 samples for an additional 1M/500K steps for the base/large model. We use warm-up steps, a linearly increasing learning rate, and cosine decay scheduling. During the warm-up phase, only the embeddings are updated, and a higher masking rate and percentage of predictions based on masked tokens are used compared to [Devlin et al. (2019)](https://aclanthology.org/N19-1423).
64
+
65
+ (d) Sentence Sampling: We employ a sentence sampler with exponential smoothing to handle disparate token proportions across cantons and languages, preserving per-canton and language capacity.
66
+
67
+ (e) Mixed Cased Models: Our models cover both upper- and lowercase letters, similar to recently developed large PLMs.
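
The embedding initialization described in step (b) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `init_embeddings`, the toy vocabularies, and the standard deviation of 0.02 for the random rows are all assumptions made for illustration.

```python
import random


def init_embeddings(new_vocab, old_vocab, old_embeddings, dim, seed=0):
    """Build an embedding table for new_vocab.

    Tokens that also exist in old_vocab reuse the old embedding row;
    all other tokens get a randomly initialized row.
    """
    rng = random.Random(seed)
    table = []
    reused = 0
    for token in new_vocab:
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Lexically overlapping token: copy the pretrained vector.
            table.append(list(old_embeddings[old_id]))
            reused += 1
        else:
            # New token: random init (std 0.02 is an assumed value).
            table.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return table, reused


# Hypothetical toy data standing in for the XLM-R vocabulary and embeddings.
old_vocab = {"<s>": 0, "▁zakon": 1, "▁sud": 2}
old_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
new_vocab = ["<s>", "▁zakon", "▁presuda"]

table, reused = init_embeddings(new_vocab, old_vocab, old_embeddings, dim=2)
print(f"reused {reused} of {len(new_vocab)} embeddings")
```

In this toy case two of the three new tokens overlap with the old vocabulary, so only one row starts from random values.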
68
+
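
One common way to realize the exponential smoothing mentioned in step (d) is to raise each subset's share of tokens to a power `alpha` below 1 and renormalize, which boosts small subsets. The sketch below assumes that formulation; the value `alpha=0.5` and the subset names are illustrative and do not come from the model card.

```python
def smoothed_sampling_probs(token_counts, alpha=0.5):
    """Return sampling probabilities proportional to (share of tokens) ** alpha.

    alpha = 1.0 reproduces the raw proportions; smaller values flatten
    the distribution so that small subsets are sampled more often.
    """
    total = sum(token_counts.values())
    weights = {name: (count / total) ** alpha for name, count in token_counts.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


# Hypothetical token counts for two subsets of a corpus.
counts = {"legislation": 900, "caselaw": 100}

raw = smoothed_sampling_probs(counts, alpha=1.0)
smoothed = smoothed_sampling_probs(counts, alpha=0.5)
print(raw)
print(smoothed)
```

With these toy counts the smaller subset holds 10% of the tokens but is sampled about 25% of the time after smoothing.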
69
+ ### Training Data
70
+
71
+ This model was pretrained on the Croatian portion of [Multi Legal Pile](https://huggingface.co/datasets/joelito/Multi_Legal_Pile) ([Niklaus et al. 2023](https://arxiv.org/abs/2306.02069?utm_source=tldrai)).
72
+
73
+ #### Preprocessing
74
+
75
+ For further details, see [Niklaus et al. 2023](https://arxiv.org/abs/2306.02069?utm_source=tldrai).
76
+
77
+ #### Training Hyperparameters
78
+
79
+ - Batch size: 512 samples
80
+ - Number of steps: 1M/500K for the base/large model
81
+ - Warm-up steps for the first 5\% of the total training steps
82
+ - Learning rate: (linearly increasing up to) 1e-4
83
+ - Word masking: increased 20/30\% masking rate for base/large models respectively
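
As a rough illustration of how the warm-up and learning-rate settings above combine with the cosine decay mentioned in the pretraining steps, the following sketch computes the learning rate at a given step. The function name `learning_rate` and the decay to exactly zero at the final step are assumptions; the card only states linear warm-up over the first 5% of steps, a peak of 1e-4, and cosine decay.

```python
import math


def learning_rate(step, total_steps=1_000_000, peak_lr=1e-4, warmup_ratio=0.05):
    """Linear warm-up to peak_lr, then cosine decay (assumed to end at zero)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear increase from 0 to peak_lr during warm-up.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warm-up phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Base-model setting from the list above: 1M steps, so 50,000 warm-up steps.
for step in (0, 25_000, 50_000, 525_000, 1_000_000):
    print(step, learning_rate(step))
```

For the large model, the same function would be called with `total_steps=500_000`.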
84
+
85
+ ### Model Architecture and Objective
86
+
87
+ It is a RoBERTa-based model. Run the following code to view the architecture:
88
+
89
+ ```
90
+ from transformers import AutoModel
91
+ model = AutoModel.from_pretrained('joelito/legal-croatian-roberta-base')
92
+ print(model)
93
+ # Output of print(model):
94
+ RobertaModel(
95
+ (embeddings): RobertaEmbeddings(
96
+ (word_embeddings): Embedding(32000, 768, padding_idx=0)
97
+ (position_embeddings): Embedding(514, 768, padding_idx=0)
98
+ (token_type_embeddings): Embedding(1, 768)
99
+ (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
100
+ (dropout): Dropout(p=0.1, inplace=False)
101
+ )
102
+ (encoder): RobertaEncoder(
103
+ (layer): ModuleList(
104
+ (0-11): 12 x RobertaLayer(
105
+ (attention): RobertaAttention(
106
+ (self): RobertaSelfAttention(
107
+ (query): Linear(in_features=768, out_features=768, bias=True)
108
+ (key): Linear(in_features=768, out_features=768, bias=True)
109
+ (value): Linear(in_features=768, out_features=768, bias=True)
110
+ (dropout): Dropout(p=0.1, inplace=False)
111
+ )
112
+ (output): RobertaSelfOutput(
113
+ (dense): Linear(in_features=768, out_features=768, bias=True)
114
+ (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
115
+ (dropout): Dropout(p=0.1, inplace=False)
116
+ )
117
+ )
118
+ (intermediate): RobertaIntermediate(
119
+ (dense): Linear(in_features=768, out_features=3072, bias=True)
120
+ (intermediate_act_fn): GELUActivation()
121
+ )
122
+ (output): RobertaOutput(
123
+ (dense): Linear(in_features=3072, out_features=768, bias=True)
124
+ (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
125
+ (dropout): Dropout(p=0.1, inplace=False)
126
+ )
127
+ )
128
+ )
129
+ )
130
+ (pooler): RobertaPooler(
131
+ (dense): Linear(in_features=768, out_features=768, bias=True)
132
+ (activation): Tanh()
133
+ )
134
+ )
135
+
136
+ ```
137
+
138
+ ### Compute Infrastructure
139
+
140
+ Google TPU.
141
+
142
+ #### Hardware
143
+
144
+ Google TPU v3-8
145
+
146
+ #### Software
147
+
148
+ PyTorch, Transformers.
149
+
150
+ ## Citation
151
+
152
+ ```
153
+
154
+ @article{Niklaus2023MultiLegalPileA6,
155
+ title={MultiLegalPile: A 689GB Multilingual Legal Corpus},
156
+ author={Joel Niklaus and Veton Matoshi and Matthias St{\"u}rmer and Ilias Chalkidis and Daniel E. Ho},
157
+ journal={ArXiv},
158
+ year={2023},
159
+ volume={abs/2306.02069}
160
+ }
161
+
162
+ ```
163
+
164
+ ## Model Card Authors
165
+
166
+ Joel Niklaus: [huggingface](https://huggingface.co/joelito); [email](mailto:[email protected])
167
+
168
+ Veton Matoshi: [huggingface](https://huggingface.co/kapllan); [email](mailto:[email protected])
169
+
170
+ ## Model Card Contact
171
+
172
+ Joel Niklaus: [huggingface](https://huggingface.co/joelito); [email](mailto:[email protected])
173
+
174
+ Veton Matoshi: [huggingface](https://huggingface.co/kapllan); [email](mailto:[email protected])