umarbutler committed on
Commit 80abc11 • 1 Parent(s): 5adcb84

Create README.md

  1. README.md +87 -0
README.md ADDED
@@ -0,0 +1,87 @@
---
language:
- en
license: apache-2.0
library_name: transformers
base_model: distilgpt2
tags:
- law
- legal
- australia
- generated_from_trainer
datasets:
- umarbutler/open-australian-legal-corpus
widget:
- text: "Under the Crimes Act"
- text: "Section 51 of the Constitution provides"
- text: '"Unsatisfactory professional conduct" includes'
---

# Open Australian Legal DistilGPT2 ⚖️
Open Australian Legal DistilGPT2 is a DistilGPT2 model trained on Australian law.

Naturally, as a finetune of [DistilGPT2](https://huggingface.co/distilgpt2), the model may be used for any of the tasks for which [DistilGPT2](https://huggingface.co/distilgpt2) and its parent model, [GPT2](https://huggingface.co/gpt2), are suitable, including text generation, text completion and question answering.

Trained on 37,560 laws and regulations, comprising 635,482,112 tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), the model is intended specifically to be finetuned for downstream natural language processing tasks applied to the Australian legal domain.

To ensure its accessibility to as wide an audience as possible, the model is issued under the same licence as [DistilGPT2](https://huggingface.co/distilgpt2), namely the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

A larger, non-distilled version of the model, trained on the same dataset, is available [here](https://huggingface.co/umarbutler/open-australian-legal-gpt2).

## Usage 👩‍💻
The code snippet below demonstrates just one of the many ways in which the model may be accessed:
```python
>>> from transformers import pipeline, set_seed

>>> set_seed(42) # We set a seed for reproducibility.
>>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-distilgpt2')
>>> generator('Under the', max_length=20, num_return_sequences=5)
[{'generated_text': 'Under the purposes of Part 6 Division 2 of the Act, regulations may confer power on an applicant for'},
 {'generated_text': 'Under the circumstances, in deciding which person to whom a protected information request may be made, the AP'},
 {'generated_text': 'Under the provisions of this Act, an offence against section 51 or 52 of the Act that relates to'},
 {'generated_text': 'Under the definition of State or Territory, the State or Territory in section 8 of the A New Tax'},
 {'generated_text': 'Under the Act, a person is taken to be an occupier of premises if—\n\t('}]
```

## Creation 🧪
37,560 documents were sampled from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) by filtering for primary and secondary legislation that, when stripped of whitespace, was not empty. Such documents were then randomly shuffled and packed into blocks 1,024 tokens long, with GPT2's end-of-sequence token (`<|endoftext|>`) used both as a delimiter between documents and as padding for the end of the final block, resulting in a training dataset of 620,588 blocks, or 635,482,112 tokens.
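
The shuffling and packing step described above can be sketched in plain Python. This is only an illustration, not the author's training code: the function name `pack_into_blocks` is hypothetical, the documents are assumed to be tokenised already, and placing the end-of-sequence token after every document (rather than strictly between documents) is an assumption. The ID 50256 is the `<|endoftext|>` token in GPT2's vocabulary.

```python
import random

EOS_ID = 50256     # GPT2's '<|endoftext|>' token ID.
BLOCK_SIZE = 1024  # Sequence length reported in the model card.


def pack_into_blocks(tokenised_docs, eos_id=EOS_ID, block_size=BLOCK_SIZE, seed=None):
    """Shuffle documents, delimit them with EOS, and cut fixed-size blocks.

    Hypothetical helper: the final block is padded with EOS so that
    every block has exactly `block_size` tokens.
    """
    docs = list(tokenised_docs)
    random.Random(seed).shuffle(docs)

    # Concatenate all documents into one stream, appending EOS after each
    # (assumption: the original may have delimited documents differently).
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)

    # Slice the stream into consecutive blocks of block_size tokens.
    blocks = [stream[i:i + block_size] for i in range(0, len(stream), block_size)]

    # Pad the last block with EOS if it came up short.
    if blocks and len(blocks[-1]) < block_size:
        blocks[-1].extend([eos_id] * (block_size - len(blocks[-1])))
    return blocks


# Toy example with made-up token IDs and a tiny block size.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
toy_blocks = pack_into_blocks(toy_docs, block_size=4, seed=0)
```

With the toy input, 10 document tokens plus 3 delimiters give a 13-token stream, so four blocks of 4 tokens are produced and the last one is filled out with the end-of-sequence token.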

The training dataset was subsequently fed to [DistilGPT2](https://huggingface.co/distilgpt2) via [`transformers.Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) with the following hyperparameters:
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 1,024 |
| Epochs | 3 |
| Optimiser | AdamW |
| Learning rate | 1e-5 |
| Learning rate scheduler | Linear with warmup |
| Batch size per device | 4 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |
| Gradient accumulation steps | 1 |
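
To make the "Linear with warmup" row concrete, the sketch below computes the learning rate at a given step from the peak learning rate and warmup ratio in the table, together with the total of 465,441 steps reported in this card. It is a stand-alone approximation of the usual linear schedule, not code taken from the training run; the helper names are hypothetical, and rounding the number of warmup steps up is an assumption.

```python
import math

PEAK_LR = 1e-5         # Learning rate from the table.
WARMUP_RATIO = 0.06    # Warmup ratio from the table.
TOTAL_STEPS = 465_441  # Total optimiser steps reported in the card.

# Assumption: the warmup length is the ratio times total steps, rounded up.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def linear_warmup_factor(step, total_steps, warmup_steps):
    """Return the multiplier applied to the peak learning rate at `step`."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 over the warmup phase.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def learning_rate_at(step):
    """Hypothetical helper: learning rate at `step` under the settings above."""
    return PEAK_LR * linear_warmup_factor(step, TOTAL_STEPS, warmup_steps)
```

Under these assumptions the learning rate starts at zero, reaches 1e-5 at the end of warmup, and falls back to zero at the final step.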

After training for 3 epochs, or 465,441 steps, over a period of ~40 hours on a single GeForce RTX 2080 Ti, the model achieved a loss of 0.65.
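
The token and step counts quoted in this section follow from the block count and the hyperparameters. The short check below reproduces them, assuming a single device (as stated above) and that each optimiser step consumes one batch of 4 blocks.

```python
import math

BLOCKS = 620_588    # Training blocks reported in the card.
BLOCK_SIZE = 1_024  # Tokens per block.
EPOCHS = 3
BATCH_SIZE = 4      # Per-device batch size.
GRAD_ACCUM = 1      # Gradient accumulation steps.
DEVICES = 1         # Single GeForce RTX 2080 Ti.

total_tokens = BLOCKS * BLOCK_SIZE
steps_per_epoch = math.ceil(BLOCKS / (BATCH_SIZE * GRAD_ACCUM * DEVICES))
total_steps = steps_per_epoch * EPOCHS

print(total_tokens)     # 635482112
print(steps_per_epoch)  # 155147
print(total_steps)      # 465441
```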

## Licence 📜
The model is issued under the same licence as [DistilGPT2](https://huggingface.co/distilgpt2), namely the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

## Citation 🔖
If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2023-open-australian-legal-distilgpt2,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal DistilGPT2},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-distilgpt2}
}
```

## Acknowledgements 🙏
In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of [DistilGPT2](https://huggingface.co/distilgpt2) and [GPT2](https://huggingface.co/gpt2), which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.