Mainak Manna committed on
Commit
969a6e8
·
1 Parent(s): 6444d85

First version of the model

Files changed (1)
  1. README.md +67 -0
README.md ADDED
@@ -0,0 +1,67 @@
1
+
2
+ ---
3
+ language: Czech Swedish
4
+ tags:
5
+ - translation Czech Swedish model
6
+ datasets:
7
+ - dcep europarl jrc-acquis
8
+ ---
9
+
10
+ # legal_t5_small_trans_cs_sv model
11
+
12
+ Pretrained model for translating legal text from Czech to Swedish. It was first released in
13
+ [this repository](https://github.com/agemagician/LegalTrans). This model was trained on three parallel corpora: JRC-Acquis, Europarl and DCEP.
14
+
15
+
16
+ ## Model description
17
+
18
+ legal_t5_small_trans_cs_sv is based on the `t5-small` model and was trained on a large corpus of parallel text. It is a smaller model that scales down the T5 baseline by using `dmodel = 512`, `dff = 2,048`, 8-headed attention, and only 6 layers each in the encoder and decoder. This variant has about 60 million parameters.
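As a rough sanity check on the "about 60 million parameters" figure, the sketch below estimates the weight count from the dimensions listed above. It is a back-of-the-envelope calculation, not the library's own accounting: the function name `t5_param_estimate` is hypothetical, the vocabulary size of 32,128 is an assumption taken from the stock `t5-small` configuration (this model's vocabulary may differ), and layer norms and relative position biases are ignored.

```python
def t5_param_estimate(d_model=512, d_ff=2048, n_layers=6, vocab_size=32128):
    """Approximate weight count of a T5-style encoder-decoder.

    Assumes n_heads * d_kv == d_model (8 * 64 = 512 for t5-small) and
    ignores layer norms and relative position biases.
    vocab_size is an assumption: the stock t5-small value.
    """
    attn = 4 * d_model * d_model              # Q, K, V and output projections
    ffn = 2 * d_model * d_ff                  # two feed-forward matrices
    encoder = n_layers * (attn + ffn)         # self-attention + feed-forward
    decoder = n_layers * (2 * attn + ffn)     # self- and cross-attention + feed-forward
    embedding = vocab_size * d_model          # shared input/output embedding
    return embedding + encoder + decoder


print(f"{t5_param_estimate():,}")  # 60,489,728
```

Under these assumptions the estimate lands close to 60 million, with the shared embedding table accounting for roughly a quarter of the total.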
19
+
20
+ ## Intended uses & limitations
21
+
22
+ The model can be used to translate legal texts from Czech to Swedish.
23
+
24
+ ### How to use
25
+
26
+ Here is how to use this model to translate legal text from Czech to Swedish in PyTorch:
27
+
28
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TranslationPipeline
+
+ MODEL_NAME = "SEBIS/legal_t5_small_trans_cs_sv"
+
+ # T5 is an encoder-decoder model, so load it with the seq2seq auto class
+ # (AutoModelWithLMHead is deprecated in recent transformers releases).
+ pipeline = TranslationPipeline(
+     model=AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME),
+     tokenizer=AutoTokenizer.from_pretrained(
+         MODEL_NAME, do_lower_case=False, skip_special_tokens=True
+     ),
+     device=0,  # first GPU; use device=-1 to run on CPU
+ )
+
+ # Text to translate (sample string kept from the original example).
+ cs_text = "Slutomröstning: närvarande ledamöter"
+
+ print(pipeline([cs_text], max_length=512))
42
+ ```
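The example above hard-codes `device=0`, which assumes a CUDA GPU is present. A minimal sketch of choosing the device at run time is shown below; the helper name `pick_pipeline_device` is hypothetical, and it relies on the pipeline convention that `-1` selects the CPU.

```python
def pick_pipeline_device() -> int:
    """Return 0 (first CUDA GPU) when one is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU to select; fall back to the CPU index.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_pipeline_device()
print(device)
```

The returned value can be passed as the `device` argument of `TranslationPipeline` in place of the literal `0`.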
43
+
44
+ ## Training data
45
+
46
+ The legal_t5_small_trans_cs_sv model was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), [EUROPARL](https://www.statmt.org/europarl/), and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, which together consist of 5 million parallel texts.
47
+
48
+ ## Training procedure
49
+
50
+ ### Preprocessing
51
+
52
+ ### Pretraining
53
+ A unigram model with 88M parameters was trained over the complete parallel corpus to build the vocabulary (with byte pair encoding) that is used with this model.
54
+
55
+
56
+ ## Evaluation results
57
+
58
+ When evaluated on the translation test dataset, the model achieves the following results:
59
+
60
+ Test results:
61
+
62
+ | Model | BLEU score |
+ |:-----:|:-----:|
+ | legal_t5_small_trans_cs_sv | 47.9 |
65
+
66
+
67
+ ### BibTeX entry and citation info