gremid committed on
Commit
9d50f56
·
verified ·
1 Parent(s): c296641

Upload folder using huggingface_hub

Files changed (16)
  1. .gitattributes +5 -0
  2. BUILT +1 -0
  3. GIT_REV +1 -0
  4. GIT_REV_LEX +1 -0
  5. README.md +409 -0
  6. finite.a +3 -0
  7. finite.ca +0 -0
  8. index.a +0 -0
  9. index.ca +0 -0
  10. index.csv.lzma +0 -0
  11. lemma.a +3 -0
  12. lemma.ca +0 -0
  13. morph.a +3 -0
  14. morph.ca +0 -0
  15. root.a +3 -0
  16. root.ca +3 -0
.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ finite.a filter=lfs diff=lfs merge=lfs -text
37
+ lemma.a filter=lfs diff=lfs merge=lfs -text
38
+ morph.a filter=lfs diff=lfs merge=lfs -text
39
+ root.a filter=lfs diff=lfs merge=lfs -text
40
+ root.ca filter=lfs diff=lfs merge=lfs -text
BUILT ADDED
@@ -0,0 +1 @@
 
 
1
+ 2024-11-28T13:37:11.407789
GIT_REV ADDED
@@ -0,0 +1 @@
 
 
1
+ 76701cc
GIT_REV_LEX ADDED
@@ -0,0 +1 @@
 
 
1
+ 76701cc
README.md ADDED
@@ -0,0 +1,409 @@
1
+ ---
2
+ language: de
3
+ library_name: sfst
4
+ license: gpl-2.0
5
+ tags:
6
+ - sfst
7
+ - dwdsmor
8
+ - token-classification
9
+ - lemmatisation
10
+ model-index:
11
+ - name: dwdsmor
12
+ results:
13
+ - task:
14
+ type: token-classification
15
+ name: Lemmatisation
16
+ dataset:
17
+ name: Universal Dependencies Treebank (de-hdt)
18
+ type: universal_dependencies
19
+ config: de_hdt
20
+ split: train
21
+ metrics:
22
+ - type: coverage
23
+ value: 84.15293963067323
24
+ name: Lemma
25
+ - type: coverage
26
+ value: 100.0
27
+ name: Lemma ($()
28
+ - type: coverage
29
+ value: 100.0
30
+ name: Lemma ($,)
31
+ - type: coverage
32
+ value: 99.99580703997988
33
+ name: Lemma ($.)
34
+ - type: coverage
35
+ value: 77.40301552167969
36
+ name: Lemma (ADJA)
37
+ - type: coverage
38
+ value: 75.48407611333322
39
+ name: Lemma (ADJD)
40
+ - type: coverage
41
+ value: 96.82621529723873
42
+ name: Lemma (ADV)
43
+ - type: coverage
44
+ value: 99.89939637826963
45
+ name: Lemma (APPO)
46
+ - type: coverage
47
+ value: 93.08645050358152
48
+ name: Lemma (APPR)
49
+ - type: coverage
50
+ value: 99.67651071695788
51
+ name: Lemma (APPRART)
52
+ - type: coverage
53
+ value: 79.16666666666666
54
+ name: Lemma (APZR)
55
+ - type: coverage
56
+ value: 99.99603964317186
57
+ name: Lemma (ART)
58
+ - type: coverage
59
+ value: 96.13524039049265
60
+ name: Lemma (CARD)
61
+ - type: coverage
62
+ value: 13.320473120462967
63
+ name: Lemma (FM)
64
+ - type: coverage
65
+ value: 71.42857142857143
66
+ name: Lemma (ITJ)
67
+ - type: coverage
68
+ value: 100.0
69
+ name: Lemma (KOKOM)
70
+ - type: coverage
71
+ value: 99.95274949083503
72
+ name: Lemma (KON)
73
+ - type: coverage
74
+ value: 100.0
75
+ name: Lemma (KOUI)
76
+ - type: coverage
77
+ value: 98.58579967925354
78
+ name: Lemma (KOUS)
79
+ - type: coverage
80
+ value: 6.1808081211782095
81
+ name: Lemma (NE)
82
+ - type: coverage
83
+ value: 74.40482047389456
84
+ name: Lemma (NN)
85
+ - type: coverage
86
+ value: 97.99275737196068
87
+ name: Lemma (PDAT)
88
+ - type: coverage
89
+ value: 99.95682832062167
90
+ name: Lemma (PDS)
91
+ - type: coverage
92
+ value: 98.79094306440976
93
+ name: Lemma (PIAT)
94
+ - type: coverage
95
+ value: 100.0
96
+ name: Lemma (PIDAT)
97
+ - type: coverage
98
+ value: 99.51910051476564
99
+ name: Lemma (PIS)
100
+ - type: coverage
101
+ value: 99.9888876541838
102
+ name: Lemma (PPER)
103
+ - type: coverage
104
+ value: 100.0
105
+ name: Lemma (PPOSAT)
106
+ - type: coverage
107
+ value: 100.0
108
+ name: Lemma (PPOSS)
109
+ - type: coverage
110
+ value: 100.0
111
+ name: Lemma (PRELAT)
112
+ - type: coverage
113
+ value: 100.0
114
+ name: Lemma (PRELS)
115
+ - type: coverage
116
+ value: 100.0
117
+ name: Lemma (PRF)
118
+ - type: coverage
119
+ value: 98.61938278289118
120
+ name: Lemma (PROAV)
121
+ - type: coverage
122
+ value: 30.821337849280273
123
+ name: Lemma (PTKA)
124
+ - type: coverage
125
+ value: 100.0
126
+ name: Lemma (PTKANT)
127
+ - type: coverage
128
+ value: 100.0
129
+ name: Lemma (PTKNEG)
130
+ - type: coverage
131
+ value: 77.05097087378641
132
+ name: Lemma (PTKVZ)
133
+ - type: coverage
134
+ value: 0.0
135
+ name: Lemma (PTKZU)
136
+ - type: coverage
137
+ value: 95.51166965888689
138
+ name: Lemma (PWAT)
139
+ - type: coverage
140
+ value: 99.37264742785446
141
+ name: Lemma (PWAV)
142
+ - type: coverage
143
+ value: 99.46524064171123
144
+ name: Lemma (PWS)
145
+ - type: coverage
146
+ value: 100.0
147
+ name: Lemma (VAFIN)
148
+ - type: coverage
149
+ value: 100.0
150
+ name: Lemma (VAIMP)
151
+ - type: coverage
152
+ value: 100.0
153
+ name: Lemma (VAINF)
154
+ - type: coverage
155
+ value: 100.0
156
+ name: Lemma (VAPP)
157
+ - type: coverage
158
+ value: 100.0
159
+ name: Lemma (VMFIN)
160
+ - type: coverage
161
+ value: 100.0
162
+ name: Lemma (VMINF)
163
+ - type: coverage
164
+ value: 100.0
165
+ name: Lemma (VMPP)
166
+ - type: coverage
167
+ value: 88.6487187323461
168
+ name: Lemma (VVFIN)
169
+ - type: coverage
170
+ value: 95.96122778675283
171
+ name: Lemma (VVIMP)
172
+ - type: coverage
173
+ value: 82.1453501900256
174
+ name: Lemma (VVINF)
175
+ - type: coverage
176
+ value: 82.9683698296837
177
+ name: Lemma (VVIZU)
178
+ - type: coverage
179
+ value: 79.96866513473992
180
+ name: Lemma (VVPP)
181
+ - type: coverage
182
+ value: 41.48471615720524
183
+ name: Lemma (XY)
184
+ ---
185
+
186
+ # DWDSmor
187
+
188
+ _SFST/SMOR/DWDS-based German morphology_
189
+
190
+
191
+
192
+
193
+
194
+ DWDSmor implements the lemmatisation and morphological analysis of
195
+ word forms as well as the generation of paradigms of lexical words in
196
+ written German.
197
+
198
+ ## Usage
199
+
200
+ DWDSmor is available via PyPI:
201
+
202
+ ``` plaintext
203
+ pip install dwdsmor
204
+ ```
205
+
206
+ For lemmatisation:
207
+
208
+ ``` python-console
209
+ >>> import dwdsmor
210
+ >>> lemmatizer = dwdsmor.lemmatizer()
211
+ >>> assert lemmatizer("getestet", pos={"+V"}) == "testen"
212
+ >>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
213
+ ```
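The call shown above can be applied token by token. The following is a minimal sketch that assumes only what the example shows: the lemmatizer is a callable taking a word form and a `pos` set. `lemmatize_tokens` and `stub_lemmatizer` are hypothetical names. The stub stands in for `dwdsmor.lemmatizer()` so that the snippet runs without the transducers, and falling back to the surface form when no lemma is returned is an assumption, not documented DWDSmor behaviour.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def lemmatize_tokens(lemmatizer: Callable[..., Optional[str]],
                     tagged_tokens: Iterable[Tuple[str, str]]) -> List[str]:
    """Lemmatise (form, pos) pairs, falling back to the form itself."""
    lemmas = []
    for form, pos in tagged_tokens:
        lemma = lemmatizer(form, pos={pos})
        # Assumption: a falsy result means "no analysis found"; keep the form.
        lemmas.append(lemma if lemma else form)
    return lemmas


def stub_lemmatizer(form, pos=None):
    """Hypothetical stand-in for dwdsmor.lemmatizer(); knows two readings only."""
    known = {("getestet", "+V"): "testen", ("getestet", "+ADJ"): "getestet"}
    for tag in pos or ():
        if (form, tag) in known:
            return known[(form, tag)]
    return None


tokens = [("getestet", "+V"), ("getestet", "+ADJ"), ("Xyz", "+NN")]
print(lemmatize_tokens(stub_lemmatizer, tokens))
# prints ['testen', 'getestet', 'Xyz']
```

With the real package installed, passing `dwdsmor.lemmatizer()` instead of the stub should work as long as its call signature matches the example above.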
214
+
215
+
216
+
217
+ ## Development
218
+
219
+ This repository provides source code for building DWDSmor lexica and transducers
220
+ as well as for using DWDSmor transducers for morphological analysis and paradigm
221
+ generation:
222
+
223
+ * `dwdsmor/` contains Python packages for using DWDSmor, including
224
+ scripts for morphological analysis and for paradigm generation by
225
+ means of DWDSmor transducers.
226
+ * `share/` contains XSLT stylesheets for extracting lexical entries in SMORLemma
227
+ format from XML sources of DWDS articles. Sample inputs and outputs can be
228
+ found in `samples/`.
229
+ * `lexicon/dwds/` contains scripts for building DWDSmor lexica by means of the
230
+ XSLT stylesheets in `share/` and DWDS sources in `lexicon/dwds/wb/`, which are
231
+ not part of this repository.
232
+ * `lexicon/sample/` contains scripts for building sample DWDSmor lexica by means
233
+ of the XSLT stylesheets in `share/` and the sample lexicon in
234
+ `lexicon/sample/wb/`.
235
+ * `grammar/` contains an FST grammar derived from SMORLemma, providing the
236
+ morphology for building DWDSmor automata from DWDSmor lexica.
237
+ * `test/` implements a test suite for the DWDSmor transducers.
238
+
239
+ DWDSmor is under active development. In its current stage, DWDSmor supports most
240
+ inflection classes and some productive word-formation patterns of written
241
+ German. Note that the sample lexicon in `lexicon/sample/wb/` only covers a
242
+ sketchy subset of the German vocabulary, and so do the DWDSmor automata compiled
243
+ from it.
244
+
245
+
246
+ ## Prerequisites
247
+
248
+ [GNU/Linux](https://www.debian.org/)
249
+ : Development, builds and tests of DWDSmor are performed
250
+ on [Debian GNU/Linux](https://debian.org/). While other UNIX-like operating
251
+ systems such as macOS should work, too, they are not actively supported.
252
+
253
+ [Python >= v3.9](https://www.python.org/)
254
+ : DWDSmor targets Python as its primary runtime environment. The DWDSmor
255
+ transducers can be used via SFST's command-line tools, queried in Python
256
+ applications via language-specific
257
+ [bindings](https://github.com/gremid/sfst-transduce), or used by the Python
258
+ scripts `dwdsmor.py` and `paradigm.py` for morphological analysis and for
259
+ paradigm generation.
260
+
261
+ [Saxon-HE](https://www.saxonica.com/)
262
+ : The extraction of lexical entries from XML sources of DWDS articles is
263
+ implemented in XSLT 2, for which Saxon-HE is used as the runtime environment.
264
+
265
+ [Java (JDK) >= v8](https://openjdk.java.net/)
266
+ : Saxon requires a Java runtime.
267
+
268
+ [SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/)
269
+ : a C++ library and toolbox for finite-state transducers (FSTs); please take a
270
+ look at its homepage for installation and usage instructions.
271
+
272
+ On a Debian-based distribution, install the following packages:
273
+
274
+ ```sh
275
+ apt install python3 default-jdk libsaxonhe-java sfst
276
+ ```
277
+
278
+ Set up a virtual environment for project builds, for example via Python's `venv`:
279
+
280
+ ```sh
281
+ python3 -m venv .venv
282
+ source .venv/bin/activate
283
+ ```
284
+
285
+ Then run the DWDSmor setup routine in order to install Python dependencies:
286
+
287
+ ```sh
288
+ pip install -e .[dev]
289
+ ```
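Before building, it can help to confirm that the prerequisites listed above are reachable. The following is a rough sketch, not part of DWDSmor: `check_prerequisites` is a hypothetical helper, and the executable names `java` and `fst-infl2` are assumptions about what the JDK and SFST packages put on `PATH`. Saxon-HE is not checked because it ships as a Java library rather than an executable.

```python
import shutil
import sys

# Assumed executable names: `java` from the JDK and `fst-infl2` from SFST.
REQUIRED_TOOLS = ("java", "fst-infl2")


def check_prerequisites(tools=REQUIRED_TOOLS, min_python=(3, 9)):
    """Return a mapping from requirement name to whether it is satisfied."""
    status = {"python": sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[tool] = shutil.which(tool) is not None
    return status


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```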
290
+
291
+
292
+ ## Building DWDSmor lexica and transducers
293
+
294
+ For building DWDSmor lexica and transducers, run:
295
+
296
+ ```sh
297
+ make all
298
+ ```
299
+
300
+ Alternatively, you can run:
301
+
302
+ ```sh
303
+ make dwds && make dwds-install && make dwdsmor
304
+ ```
305
+
306
+ Note that these commands require DWDS sources in `lexicon/dwds/wb/`, which are
307
+ not part of this repository.
308
+
309
+ Alternatively, you can build sample DWDSmor lexica and transducers from the
310
+ sample lexicon in `lexicon/sample/wb/` by running:
311
+
312
+ ```sh
313
+ make sample && make sample-install && make dwdsmor
314
+ ```
315
+
316
+ After building DWDSmor transducers, install them into `lib/`, where the
317
+ Python scripts `dwdsmor` and `dwdsmor-paradigm` expect them by default:
318
+
319
+ ```sh
320
+ make install
321
+ ```
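The build sequences above can also be scripted. The following sketch only assembles the `make` invocations named in this section and prints them. `build_commands` and `run_build` are hypothetical helpers, and the commands are executed only when `dry_run=False` is passed from the repository root.

```python
import subprocess

# Make targets as listed above; "dwds" needs sources in lexicon/dwds/wb/,
# which are not part of the repository.
LEXICON_TARGETS = {
    "dwds": ["dwds", "dwds-install", "dwdsmor"],
    "sample": ["sample", "sample-install", "dwdsmor"],
}


def build_commands(lexicon="sample", install=True):
    """Return the make invocations for the chosen lexicon as argv lists."""
    targets = list(LEXICON_TARGETS[lexicon])
    if install:
        targets.append("install")
    return [["make", target] for target in targets]


def run_build(lexicon="sample", install=True, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in build_commands(lexicon, install):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing step, like `&&` in the shell.
            subprocess.run(command, check=True)


run_build("sample")
```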
322
+
323
+ The installed DWDSmor transducers are:
324
+
325
+ * `lib/dwdsmor.{a,ca}`: transducer with inflection and word-formation
326
+ components, for lemmatisation and morphological analysis of word forms in
327
+ terms of grammatical categories
328
+ * `lib/dwdsmor-morph.{a,ca}`: transducer with inflection and word-formation
329
+ components, for the generation of morphologically segmented word forms
330
+ * `lib/dwdsmor-finite.{a,ca}`: transducer with an inflection component and a
331
+ finite word-formation component, for testing purposes
332
+ * `lib/dwdsmor-root.{a,ca}`: transducer with inflection and word-formation
333
+ components, for lexical analysis of word forms in terms of root lemmas (i.e.,
334
+ lemmas of ultimate word-formation bases), word-formation process,
335
+ word-formation means, and grammatical categories in terms of the
336
+ Pattern-and-Restriction Theory of word formation (Nolda 2022)
337
+ * `lib/dwdsmor-index.{a,ca}`: transducer with an inflection component only and
338
+ DWDS homographic lemma indices, for paradigm generation
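To check which of the transducers listed above are present after `make install`, a small helper along the following lines may be useful. It is a sketch under assumptions: `transducer_files` and `missing_transducers` are hypothetical names, the short keys mirror the file stems uploaded in this commit (`lemma`, `morph`, `finite`, `root`, `index`), and pairing `lemma` with `lib/dwdsmor.{a,ca}` is a guess.

```python
from pathlib import Path

# Short key -> file stem in lib/; each transducer comes as .a and .ca
# (in SFST, the standard and the compact transducer format).
TRANSDUCERS = {
    "lemma": "dwdsmor",  # assumed pairing, see lead-in
    "morph": "dwdsmor-morph",
    "finite": "dwdsmor-finite",
    "root": "dwdsmor-root",
    "index": "dwdsmor-index",
}


def transducer_files(lib_dir="lib"):
    """Map each key to its expected (.a, .ca) file paths."""
    lib = Path(lib_dir)
    return {
        key: (lib / f"{stem}.a", lib / f"{stem}.ca")
        for key, stem in TRANSDUCERS.items()
    }


def missing_transducers(lib_dir="lib"):
    """Return the keys for which at least one of the two files is absent."""
    return sorted(
        key
        for key, paths in transducer_files(lib_dir).items()
        if not all(path.is_file() for path in paths)
    )


print(missing_transducers("lib"))
```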
339
+
340
+
341
+ ## Testing DWDSmor
342
+
343
+ Run
344
+
345
+ pytest
346
+
347
+ in order to test basic transducer usage and to check for potential regressions.
348
+
349
+ ## Contact
350
+
351
+ Feel free to contact Andreas Nolda for
352
+ questions regarding the lexicon or the grammar and
353
+ Gregor Middell for questions related
354
+ to the integration of DWDSmor into your corpus-annotation pipeline.
355
+
356
+
357
+ ## License
358
+
359
+ Like the original SMOR and SMORLemma grammars, the DWDSmor grammar is
360
+ licensed under the GNU General Public License v2.0. The same applies
361
+ to the rest of this project.
362
+
363
+ ## Credits
364
+
365
+ DWDSmor is based on the following software and datasets:
366
+
367
+ 1. [SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/), a C++ library
368
+ and toolbox for finite-state transducers (FSTs) (Schmid 2006)
369
+ 2. [SMORLemma](https://github.com/rsennrich/SMORLemma) (Sennrich and Kunz 2014),
370
+ a modified version of the Stuttgart Morphology
371
+ ([SMOR](https://www.cis.lmu.de/~schmid/tools/SMOR/)) (Schmid, Fitschen, and
372
+ Heid 2004) with an alternative lemmatisation component
373
+ 3. the [DWDS dictionary](https://www.dwds.de/) (BBAW n.d.) replacing the
374
+ [IMSLex](https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/imslex/)
375
+ (Fitschen 2004) as the lexical data source for German words, their grammatical
376
+ categories, and their morphological properties.
377
+
378
+ ## Bibliography
379
+
380
+ * Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.).
381
+ DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur
382
+ deutschen Sprache in Geschichte und Gegenwart.
383
+ https://www.dwds.de
384
+ * Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes
385
+ System. Ph.D. thesis, Universität Stuttgart.
386
+ [PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf).
387
+ * Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on
388
+ compounding and blending in German. In *Headedness and/or Grammatical
389
+ Anarchy?*, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically
390
+ Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press,
391
+ 343–376.
392
+ [PDF](https://zenodo.org/record/7142720/files/336-FreywaldSimonMüller-2022-11.pdf).
393
+ * Schmid, Helmut (2006). A programming language for finite state transducers. In
394
+ *Finite-State Methods and Natural Language Processing: 5th International
395
+ Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005*, ed. by Anssi
396
+ Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial
397
+ Intelligence 4002, Berlin: Springer, 308–309.
398
+ [PDF](https://www.cis.uni-muenchen.de/~schmid/papers/SFST-PL.pdf).
399
+ * Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German
400
+ computational morphology covering derivation, composition, and inflection. In
401
+ LREC 2004: Fourth International Conference on Language Resources and
402
+ Evaluation, ed. by Maria T. Lino *et al.*, European Language Resources
403
+ Association, 1263–1266.
404
+ [PDF](http://www.lrec-conf.org/proceedings/lrec2004/pdf/468.pdf).
405
+ * Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon
406
+ extracted from Wiktionary. In LREC 2014: Ninth International Conference on
407
+ Language Resources and Evaluation, ed. by Nicoletta Calzolari *et al.*,
408
+ European Language Resources Association, 1063–1067.
409
+ [PDF](http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf).
finite.a ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a90dbeb10d36b610bb58a58c8720d1d0793508459796f38e85877a9b65b78316
3
+ size 1134309
finite.ca ADDED
Binary file (523 kB). View file
 
index.a ADDED
Binary file (308 kB). View file
 
index.ca ADDED
Binary file (144 kB). View file
 
index.csv.lzma ADDED
Binary file (982 kB). View file
 
lemma.a ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4b41ad7a8e276d80c356cb29e2a2d4a71e0ab3947a407274174b24c9420ce86f
3
+ size 1233648
lemma.ca ADDED
Binary file (572 kB). View file
 
morph.a ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:11153b2ad849789b455ba27f7c801604007b38e6e4eba223184d855268fa039c
3
+ size 1241182
morph.ca ADDED
Binary file (575 kB). View file
 
root.a ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:de16583efbd2441ca98d843cb2bd519bf4c925ac405d65396b6572a4bd3c9bdf
3
+ size 6980498
root.ca ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c73e66ea1433a035e929e0d06d7081a6b03a2c5d60b74f5950d99f98871a55c7
3
+ size 3632222