ybracke committed
Commit 5e8da84
1 Parent(s): c4f2c13

Update README.md

Add sections on use

Files changed (1)
  1. README.md +30 -4
README.md CHANGED
@@ -29,11 +29,28 @@ library_name: transformers
 
# Transnormer 19th century (beta v02)
 
- This model generates a normalized version of historical input text for German from the 19th (and late 18th) century.
- The base model [google/byt5-small](https://huggingface.co/google/byt5-small) was fine-tuned on a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (see section [Training and evaluation data](#training-and-evaluation-data)).
+ This model generates a normalized version of historical input text for German from the 19th century.
 
## Model description
 
+ `Transnormer` is a byte-level sequence-to-sequence model for normalizing historical German text.
+ This model was trained on text from the 19th and late 18th century by fine-tuning [google/byt5-small](https://huggingface.co/google/byt5-small).
+ The fine-tuning data was a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (see section [Training and evaluation data](#training-and-evaluation-data)).
+
+ ## Uses
+
+ This model is intended for users who work with historical text and need a normalized version, i.e. a version that comes closer to modern spelling.
+ Historical text typically contains spelling variations and extinct spellings that differ from contemporary text.
+ This can be a drawback when working with historical text: the variation can impair the performance of NLP tools (POS tagging, etc.) that were trained on contemporary language,
+ and full-text search becomes more tedious due to the numerous spelling options for the same search term.
+ Historical text normalization, as offered by this model, can mitigate these problems to some extent.
+
+ Note that this model is intended for the normalization of *historical German text from a specific time period*.
+ It is *not intended* for other types of text that may require normalization (e.g. computer-mediated communication), for languages other than German, or for other time frames.
+ There may be other models available for that on the [Hub](https://huggingface.co/models).
+
+ The model can be further fine-tuned to adapt or improve it, e.g. as described in the [Transformers](https://huggingface.co/docs/transformers/training) tutorials; a rough sketch follows after the diff.
+
### Demo Usage
 
```python
 
@@ -52,12 +69,21 @@ Or use this model with the [pipeline API](https://huggingface.co/transformers/ma
 
```python
from transformers import pipeline
- transnormer = pipeline('text2text-generation', model='ybracke/transnormer-19c-beta-v02')
+
+ transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
- print(transnormer(sentence))
+ print(transnormer(sentence, num_beams=4, max_length=128))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
```
 
+ ### Recommendations
+
+ The model was trained using a maximum input length of 512 bytes (~70 words).
+ Inference is generally possible for longer sequences, but output quality may be worse than for shorter sequences.
+ Shorter sequences also make inference faster and less computationally expensive.
+ Consider splitting long sequences into shorter ones and processing them separately (see the sketch after the diff).
+
+
## Training and evaluation data
 
The model was fine-tuned and evaluated on splits derived from the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/), a parallel corpus of 121 texts from the Deutsches Textarchiv (German Text Archive). The corpus was originally created by aligning historic prints in original spelling with an edition in contemporary orthography.
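The added "Uses" section points to the Transformers fine-tuning tutorials for adapting the model further. A minimal sketch of such further fine-tuning is given below; the JSONL files, the `orig`/`norm` column names, and the hyperparameters are placeholder assumptions, not taken from the model card.

```python
# Hedged sketch: further fine-tuning the released checkpoint on your own parallel data.
# "train.jsonl"/"dev.jsonl" and the "orig"/"norm" columns are hypothetical placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "ybracke/transnormer-19c-beta-v02"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Assumed format: one historical sentence ("orig") per normalized sentence ("norm").
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "dev.jsonl"})

def preprocess(batch):
    # ByT5 works directly on bytes, so no special pre-tokenization is needed.
    model_inputs = tokenizer(batch["orig"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["norm"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="transnormer-finetuned",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The 512-byte maximum length mirrors the training setup described in the Recommendations; everything else should be tuned for the target data.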
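The Recommendations section suggests splitting long input and normalizing the pieces separately. A minimal sketch of that idea follows; the regex-based sentence split and the second example sentence are illustrative assumptions, not part of the model card.

```python
# Hedged sketch: normalize a longer passage by splitting it into shorter chunks
# and running the pipeline on each chunk separately.
import re

from transformers import pipeline

transnormer = pipeline(model="ybracke/transnormer-19c-beta-v02")

long_text = (
    "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune. "
    "Sie blickte hinab auf die Verſammlung des Volkes."
)

# Naive split after sentence-final punctuation; keeps each chunk well below 512 bytes.
chunks = [c for c in re.split(r"(?<=[.!?])\s+", long_text) if c.strip()]

normalized = []
for chunk in chunks:
    result = transnormer(chunk, num_beams=4, max_length=128)
    normalized.append(result[0]["generated_text"])

print(" ".join(normalized))
```

Any splitter that keeps chunks comfortably under the 512-byte training length should work; sentence boundaries are just a convenient choice here.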