Gerwin committed · Commit 0b80012 · 1 Parent(s): f3daaf1

test table, change headers

Files changed (1)
  1. README.md +6 -8
README.md CHANGED
@@ -13,9 +13,7 @@ metrics:
---

# Legal BERT model applicable for Dutch and English
- A legal BERT model further trained from [mBERT](https://huggingface.co/bert-base-multilingual-uncased).
-
- The thesis can be downloaded using this [link](https://www.ru.nl/publish/pages/769526/gerwin_de_kruijf.pdf)
+ A BERT model further trained from [mBERT](https://huggingface.co/bert-base-multilingual-uncased) on legal documents. The thesis can be downloaded [here](https://www.ru.nl/publish/pages/769526/gerwin_de_kruijf.pdf).

## Data
The model is further trained in the same way as [EurlexBERT](https://huggingface.co/nlpaueb/bert-base-uncased-eurlex): regulations, decisions, directives, and parliamentary questions were acquired in both Dutch and English. A total of 184k documents, around 295M words, was used to further train the model. This is less than 9% of the data used to train the original BERT model.
@@ -32,7 +30,7 @@ model = TFAutoModel.from_pretrained("Gerwin/legal-bert-dutch-english") # Tensor
## Benchmarks
The thesis lists various benchmarks. Here are a couple of comparisons between popular BERT models and this model. The fine-tuning procedures for these benchmarks are identical for each pre-trained model and are explained in more detail in the thesis. You may be able to achieve higher scores for individual models by optimizing the fine-tuning procedure. The tables show weighted F1-scores.

- ### Legal Topic Classification
+ ### Legal topic classification
| Model | [Multi-EURLEX (NL)](https://huggingface.co/datasets/multi_eurlex) |
| ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **legal-bert-dutch-english** | **0.786** |
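
An aside on the hunk header above: it cuts off the README's usage line mid-comment. For reference, a minimal loading sketch consistent with that line, using only standard `transformers` calls (the example sentence and comments are mine, not taken from the README):

```python
from transformers import AutoTokenizer, AutoModel

# Load the checkpoint named in the README's usage section (PyTorch backend)
tokenizer = AutoTokenizer.from_pretrained("Gerwin/legal-bert-dutch-english")
model = AutoModel.from_pretrained("Gerwin/legal-bert-dutch-english")

# With TensorFlow installed, the truncated context line loads it instead as:
# from transformers import TFAutoModel
# model = TFAutoModel.from_pretrained("Gerwin/legal-bert-dutch-english")

# Quick sanity check: encode a Dutch sentence and get contextual embeddings
inputs = tokenizer("Dit is een juridisch document.", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state holds token embeddings
```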
@@ -51,7 +49,7 @@ The thesis lists various benchmarks. Here are a couple of comparisons between po
### Multi-class classification (Rabobank)
This dataset is not open-source, but it is still an interesting case, since it contains long legal documents in both Dutch and English that have to be classified. The dataset consists of only 8000 documents (2000 Dutch & 6000 English) across 30 classes. Using a combined architecture of a Dutch and an English BERT model was not beneficial, since documents from both languages could belong to the same class.

- | Model | Rabobank |
- | ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
- | **legal-bert-dutch-english** | **0.732** |
- | [mBERT](https://huggingface.co/bert-base-multilingual-uncased) | 0.713 |
+ | Model | Rabobank |
+ | ---------------------------------- | ---------------------------------- |
+ | **legal-bert-dutch-english** | **0.732** |
+ | [mBERT](https://huggingface.co/bert-base-multilingual-uncased) | 0.713 |
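
Both benchmark sections rest on the claim that only the pre-trained checkpoint varies while the fine-tuning procedure stays fixed. A minimal sketch of such a comparison run, assuming a standard `transformers` sequence-classification fine-tune; the toy corpus, hyperparameters, and 30-label head are illustrative stand-ins, not the thesis's exact setup:

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "Gerwin/legal-bert-dutch-english"  # swap in the mBERT baseline to compare
NUM_LABELS = 30  # e.g. the 30-class Rabobank task; set per benchmark

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=NUM_LABELS)

# Toy stand-in corpus: one Dutch and one English document with dummy labels.
# The real benchmarks use the datasets described in the tables above.
data = Dataset.from_dict({
    "text": ["Verordening betreffende de interne markt.",
             "Regulation concerning the internal market."],
    "label": [0, 1],
})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True)

def weighted_f1(eval_pred):
    # The benchmark tables report weighted F1, so evaluate the same way
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"weighted_f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=2e-5)

trainer = Trainer(model=model, args=args, train_dataset=data,
                  eval_dataset=data, tokenizer=tokenizer,
                  compute_metrics=weighted_f1)
trainer.train()
print(trainer.evaluate())
```

Holding everything except `MODEL_ID` fixed is what makes the weighted-F1 columns above comparable across checkpoints.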
 