gremid committed on
Commit
4a6f8b8
·
verified ·
1 Parent(s): 20acaa2

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +118 -58
README.md CHANGED
@@ -183,7 +183,7 @@ model-index:
183
  name: Coverage (XY)
184
  ---
185
 
186
- # DWDSmor – German morphology
187
 
188
 
189
 
@@ -205,14 +205,33 @@ The automata are compiled and traversed via
205
  library and toolbox for finite-state transducers (FSTs). Their
206
  coverage of the German language depends on
207
 
208
- 1. the DWDSmor grammar, defining the rules by which word formation happens, and
209
- 1. a lexicon, assigning inflection classes to lexical words.
 
 
210
 
211
- While the grammar, derived from
212
  [SMORLemma](https://github.com/rsennrich/SMORLemma) and providing the
213
  morphology for building automata from lexica, is common to all DWDSmor
214
- installations and published as open source, there are **2 lexicon
215
- editions**:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
216
 
217
  ## Usage
218
 
@@ -222,7 +241,7 @@ DWDSmor as a Python library is available via the package index PyPI:
222
  pip install dwdsmor
223
  ```
224
 
225
- For lemmatisation:
226
 
227
  ``` python-console
228
  >>> import dwdsmor
@@ -231,13 +250,51 @@ For lemmatisation:
231
  >>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
232
  ```
233
 
234
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
235
 
236
  ## Development
237
 
238
- DWDSmor is in active development. In its current stage, DWDSmor
239
- supports most inflection classes and some productive word-formation
240
- patterns of written German.
241
 
242
 
243
  ### Prerequisites
@@ -266,7 +323,7 @@ patterns of written German.
266
  On a Debian-based distribution, the following command installs the
267
  required software:
268
 
269
- ```sh
270
  apt-get install python3 default-jdk libsaxonhe-java sfst
271
  ```
272
 
@@ -275,66 +332,69 @@ apt-get install python3 default-jdk libsaxonhe-java sfst
275
  Optionally, set up a Python virtual environment for project builds,
276
  e.g. via Python's `venv`:
277
 
278
- ```sh
279
  python3 -m venv .venv
280
  source .venv/bin/activate
281
  ```
282
 
283
  Then install DWDSmor, including development dependencies:
284
 
285
- ```sh
286
  pip install -U pip setuptools && pip install -e '.[dev]'
287
  ```
288
 
289
 
290
  ### Building lexica and automata
291
 
292
- Building different editions is facilitated via a build script:
293
 
294
 
 
 
 
 
295
 
296
- ```sh
297
- make all
298
- ```
299
 
300
- Alternatively, you can run:
 
301
 
302
- ```sh
303
- make dwds && make dwds-install && make dwdsmor
 
 
 
 
 
 
304
  ```
305
 
306
- Note that these commands require DWDS sources in `lexicon/dwds/wb/`, which are
307
- not part of this repository.
308
 
309
- Alternatively, you can build sample DWDSmor lexica and transducers from the
310
- sample lexicon in `lexicon/sample/wb/` by running:
311
-
312
- ```sh
313
- make sample && make sample-install && make dwdsmor
314
  ```
315
 
316
- After building DWDSmor transducers, install them into `lib/`, where the
317
- Python scripts `dwdsmor` and `dwdsmor-paradigm` expect them by default:
318
-
319
- ```sh
320
- make install
321
- ```
322
 
323
- The installed DWDSmor transducers are:
324
 
325
- * `lib/dwdsmor.{a,ca}`: transducer with inflection and word-formation
326
- components, for lemmatisation and morphological analysis of word forms in
327
- terms of grammatical categories
328
- * `lib/dwdsmor-morph.{a,ca}`: transducer with inflection and word-formation
329
- components, for the generation of morphologically segmented word forms
330
- * `lib/dwdsmor-finite.{a,ca}`: transducer with an inflection component and a
 
331
  finite word-formation component, for testing purposes
332
- * `lib/dwdsmor-root.{a,ca}`: transducer with inflection and word-formation
333
- components, for lexical analysis of word forms in terms of root lemmas (i.e.,
334
- lemmas of ultimate word-formation bases), word-formation process,
335
- word-formation means, and grammatical categories in terms of the
336
- Pattern-and-Restriction Theory of word formation (Nolda 2022)
337
- * `lib/dwdsmor-index.{a,ca}`: transducer with an inflection component only with
 
338
  DWDS homographic lemma indices, for paradigm generation
339
 
340
 
@@ -344,19 +404,20 @@ In order to test basic transducer usage and for potential regressions, run
344
 
345
  pytest
346
 
347
- ## Contact
348
 
349
- Feel free to contact [Andreas Nolda](mailto:[email protected]) for
350
- questions regarding the lexicon or the grammar and
351
- [Gregor Middell](mailto:gregor.middell@bbaw.de) for questions related
352
- to the integration of DWDSmor into your corpus-annotation pipeline.
353
 
 
 
 
354
 
355
- ## License
356
 
357
- Like the original SMOR and SMORLemma grammars, the DWDSmor grammar is
358
- licensed under the GNU General Public Licence v2.0. The same applies
359
- to the rest of this project.
360
 
361
  ## Credits
362
 
@@ -377,8 +438,7 @@ DWSDmor is based on the following software and datasets:
377
 
378
  * Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.).
379
  DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur
380
- deutschen Sprache in Geschichte und Gegenwart.
381
- https://www.dwds.de
382
  * Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes
383
  System. Ph.D. thesis, Universität Stuttgart.
384
  [PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf)
 
183
  name: Coverage (XY)
184
  ---
185
 
186
+ # DWDSmor – German Morphology
187
 
188
 
189
 
 
205
  library and toolbox for finite-state transducers (FSTs). Their
206
  coverage of the German language depends on
207
 
208
+ 1. the DWDSmor grammar, defining the rules by which word formation
209
+ happens, and
210
+ 1. a lexicon, declaring inflection classes and other morphological
211
+ properties for covered lexical words.
212
 
213
+ The grammar, derived from
214
  [SMORLemma](https://github.com/rsennrich/SMORLemma) and providing the
215
  morphology for building automata from lexica, is common to all DWDSmor
216
+ installations and published as open source. In contrast, we provide
217
+ **multiple lexica**, resulting in different editions of DWDSmor:
218
+
219
+ 1. the **DWDS Edition**, derived from the complete lexical dataset of
220
+ the [DWDS dictionary](https://www.dwds.de/) and available upon
221
+ request for research purposes,
222
+ 1. the **Open Edition**, based on a subset of the DWDS, covering the
223
+ most common word forms and released freely with the grammar for
224
+ general use and experiments.
225
+
226
+ Depending on the edition and word class, coverage ranges from 70% to
227
+ 100%, with the notable exceptions of foreign-language words and named
228
+ entities. Both classes are generally not part of the underlying DWDS
229
+ dictionary and are thus barely covered by DWDSmor. Current overall
230
+ coverage measured against the [German Universal Dependencies
231
+ treebank](https://universaldependencies.org/treebanks/de_hdt/index.html)
232
+ is documented on the respective [Hugging Face Hub
233
+ page](https://huggingface.co/zentrum-lexikographie) of each edition.
234
+
235
 
236
  ## Usage
237
 
 
241
  pip install dwdsmor
242
  ```
243
 
244
+ The library can be used for lemmatisation:
245
 
246
  ``` python-console
247
  >>> import dwdsmor
 
250
  >>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
251
  ```
252
 
253
+ In addition to the Python API, the package provides a simple command-line
254
+ interface named `dwdsmor`. To analyze a word form, pass it as an
255
+ argument:
256
+
257
+ ```plaintext
258
+ $ dwdsmor getestet
259
+ | Wordform | Lemma | Analysis | POS | Degree | Function | Nonfinite | Tense | Auxiliary |
260
+ |------------|----------|-------------------------------------|-------|----------|------------|-------------|---------|-------------|
261
+ | getestet | getestet | ge<~>test<~>et<+ADJ><Pos><Pred/Adv> | +ADJ | Pos | Pred/Adv | | | |
262
+ | getestet | testen | test<~>en<+V><Part><Perf><haben> | +V | | | Part | Perf | haben |
263
+ ```
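The `Analysis` column above packs the segmented lemma and its category tags into one string. The following is a minimal sketch of how such a string could be taken apart in plain Python. It assumes that `<~>` separates segments and that every other `<...>` is a category tag, as the sample output suggests. `parse_analysis` is a hypothetical helper for illustration, not part of the `dwdsmor` package.

```python
import re

# Hypothetical helper, not part of the dwdsmor package.
# Matches any angle-bracket tag such as <~>, <+V> or <Pred/Adv>.
TAG = re.compile(r"<([^<>]*)>")


def parse_analysis(analysis):
    """Split an analysis string into (segments, category tags)."""
    tags = TAG.findall(analysis)
    # Assumption: '~' marks a segment boundary, all other tags are categories.
    categories = [tag for tag in tags if tag != "~"]
    # Remove category tags but keep the boundary markers for splitting.
    segmented = TAG.sub(lambda m: "<~>" if m.group(1) == "~" else "", analysis)
    return segmented.split("<~>"), categories


segments, categories = parse_analysis("test<~>en<+V><Part><Perf><haben>")
print(segments)    # ['test', 'en']
print(categories)  # ['+V', 'Part', 'Perf', 'haben']
```

Keeping segments and tags separate makes it easy to filter analyses by part of speech, for example by checking whether `"+V"` is among the returned tags.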
264
+
265
+ To generate all word forms for a lexical word, pass it (or a form
266
+ which can be analyzed as the lexical word) as an argument together
267
+ with the option `-g`:
268
+
269
+ ``` plaintext
270
+ $ dwdsmor -g getestet
271
+ […]
272
+ | Wordform | Lemma | Analysis | POS | Subcategory | Degree | Function | Person | Gender | Case | Number | Nonfinite | Tense | Mood | Auxiliary | Inflection |
273
+ |------------|----------|-------------------------------------------------------------|-------|---------------|----------|------------|----------|----------|--------|----------|-------------|---------|--------|-------------|--------------|
274
+ | getestete | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Acc | Sg | | | | | St |
275
+ | getestete | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Acc | Sg | | | | | Wk |
276
+ | getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Dat | Sg | | | | | St |
277
+ | getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Dat | Sg | | | | | Wk |
278
+ | getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Gen | Sg | | | | | St |
279
+ | getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Gen | Sg | | | | | Wk |
280
+ […]
281
+ | testeten | testen | test<~>en<+V><1><Pl><Past><Ind> | +V | | | | 1 | | | Pl | | Past | Ind | | |
282
+ | testeten | testen | test<~>en<+V><1><Pl><Past><Subj> | +V | | | | 1 | | | Pl | | Past | Subj | | |
283
+ | testen | testen | test<~>en<+V><1><Pl><Pres><Ind> | +V | | | | 1 | | | Pl | | Pres | Ind | | |
284
+ | testen | testen | test<~>en<+V><1><Pl><Pres><Subj> | +V | | | | 1 | | | Pl | | Pres | Subj | | |
285
+ | testete | testen | test<~>en<+V><1><Sg><Past><Ind> | +V | | | | 1 | | | Sg | | Past | Ind | | |
286
+ | testete | testen | test<~>en<+V><1><Sg><Past><Subj> | +V | | | | 1 | | | Sg | | Past | Subj | | |
287
+ | teste | testen | test<~>en<+V><1><Sg><Pres><Ind> | +V | | | | 1 | | | Sg | | Pres | Ind | | |
288
+ | teste | testen | test<~>en<+V><1><Sg><Pres><Subj> | +V | | | | 1 | | | Sg | | Pres | Subj | | |
289
+ | testetet | testen | test<~>en<+V><2><Pl><Past><Ind> | +V | | | | 2 | | | Pl | | Past | Ind | | |
290
+ […]
291
+ ```
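The tables printed by the command-line interface are pipe-delimited, so they can be post-processed with a few lines of Python. The sketch below assumes the layout shown above (one header row, one separator row, one row per analysis). `parse_table` is a hypothetical helper for illustration and not part of the `dwdsmor` package.

```python
def parse_table(text):
    """Parse a pipe-delimited table into a list of dicts keyed by column header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip shell prompts, elision markers and blank lines.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        # Drop the empty strings before the first and after the last pipe.
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
        elif all(set(cell) <= {"-", ":"} for cell in cells):
            continue  # separator row between header and data
        else:
            rows.append(dict(zip(header, cells)))
    return rows


sample = """
| Wordform | Lemma    | Analysis                         | POS  |
|----------|----------|----------------------------------|------|
| testeten | testen   | test<~>en<+V><1><Pl><Past><Ind>  | +V   |
| teste    | testen   | test<~>en<+V><1><Sg><Pres><Ind>  | +V   |
"""
rows = parse_table(sample)
print([row["Wordform"] for row in rows])  # ['testeten', 'teste']
```

The sample table is shortened to four columns for readability. The function does not depend on the number of columns.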
292
 
293
  ## Development
294
 
295
+ DWDSmor is in active development. In its current stage, it supports
296
+ most inflection classes and some productive word-formation patterns of
297
+ written German.
298
 
299
 
300
  ### Prerequisites
 
323
  On a Debian-based distribution, the following command installs the
324
  required software:
325
 
326
+ ```plaintext
327
  apt-get install python3 default-jdk libsaxonhe-java sfst
328
  ```
329
 
 
332
  Optionally, set up a Python virtual environment for project builds,
333
  e.g. via Python's `venv`:
334
 
335
+ ```plaintext
336
  python3 -m venv .venv
337
  source .venv/bin/activate
338
  ```
339
 
340
  Then install DWDSmor, including development dependencies:
341
 
342
+ ```plaintext
343
  pip install -U pip setuptools && pip install -e '.[dev]'
344
  ```
345
 
346
 
347
  ### Building lexica and automata
348
 
349
+ The script `build-dwdsmor` builds the different editions:
350
 
351
 
352
+ ```plaintext
353
+ $ ./build-dwdsmor --help
354
+ usage: cli.py [-h] [--automaton AUTOMATON] [--force] [--with-metrics] [--release] [--tag]
355
+ [editions ...]
356
 
357
+ Build DWDSmor.
 
 
358
 
359
+ positional arguments:
360
+ editions Editions to build (all by default)
361
 
362
+ options:
363
+ -h, --help show this help message and exit
364
+ --automaton AUTOMATON
365
+ Automaton type to build (all by default)
366
+ --force Force building (also current targets)
367
+ --with-metrics Measure UD/de-hdt coverage
368
+ --release Push automata to HF hub
369
+ --tag Tag HF hub release with current version
370
  ```
371
 
372
+ To build all editions available in the current git checkout, run:
 
373
 
374
+ ```plaintext
375
+ ./build-dwdsmor
 
 
 
376
  ```
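For scripted builds, the options listed in the help text can be assembled programmatically. The sketch below only uses flags that appear in the `--help` output above. The edition name `open` and the automaton type `lemma` in the usage line are assumptions chosen for illustration, and both functions are hypothetical wrappers rather than part of the project.

```python
import subprocess


def build_command(editions=(), automaton=None, force=False, with_metrics=False):
    """Assemble an argument list for ./build-dwdsmor from the documented options."""
    cmd = ["./build-dwdsmor"]
    if automaton is not None:
        cmd += ["--automaton", automaton]
    if force:
        cmd.append("--force")
    if with_metrics:
        cmd.append("--with-metrics")
    # Without positional arguments the script builds all editions.
    cmd += list(editions)
    return cmd


def run_build(**options):
    """Run the build from the repository root; raises CalledProcessError on failure."""
    subprocess.run(build_command(**options), check=True)


# 'open' and 'lemma' are assumed names, not confirmed by the help text.
print(build_command(editions=["open"], automaton="lemma", force=True))
# ['./build-dwdsmor', '--automaton', 'lemma', '--force', 'open']
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed build surface as an exception.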
377
 
378
+ The build result can be found in `build/` with one subdirectory per
379
+ edition. Each edition contains several automata types in standard and
380
+ compact format:
 
 
 
381
 
 
382
 
383
+ * `lemma.{a,ca}`: transducer with inflection and word-formation
384
+ components, for lemmatisation and morphological analysis of word
385
+ forms in terms of grammatical categories
386
+ * `morph.{a,ca}`: transducer with inflection and word-formation
387
+ components, for the generation of morphologically segmented word
388
+ forms
389
+ * `finite.{a,ca}`: transducer with an inflection component and a
390
  finite word-formation component, for testing purposes
391
+ * `root.{a,ca}`: transducer with inflection and word-formation
392
+ components, for lexical analysis of word forms in terms of root
393
+ lemmas (i.e., lemmas of ultimate word-formation bases),
394
+ word-formation process, word-formation means, and grammatical
395
+ categories in terms of the Pattern-and-Restriction Theory of word
396
+ formation (Nolda 2022)
397
+ * `index.{a,ca}`: transducer with an inflection component only with
398
  DWDS homographic lemma indices, for paradigm generation
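To check that a build produced every automaton listed above, the expected file names can be derived from the list. This is a minimal sketch that assumes the layout `build/<edition>/<type>.{a,ca}`, which is one reading of the description above. The edition name used in the usage line is hypothetical.

```python
from pathlib import Path

# Automaton types and formats as listed above (.a standard, .ca compact).
AUTOMATON_TYPES = ("lemma", "morph", "finite", "root", "index")
FORMATS = ("a", "ca")


def expected_automata(build_dir, edition):
    """Return the paths expected for one edition, assuming build/<edition>/<type>.<format>."""
    root = Path(build_dir) / edition
    return [root / f"{kind}.{ext}" for kind in AUTOMATON_TYPES for ext in FORMATS]


def missing_automata(build_dir, edition):
    """Return the expected paths that do not exist as files."""
    return [path for path in expected_automata(build_dir, edition) if not path.is_file()]


# 'open' is an assumed edition directory name.
print(len(expected_automata("build", "open")))  # 10
```

An empty result from `missing_automata` indicates that all five automaton types are present in both formats.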
399
 
400
 
 
404
 
405
  pytest
406
 
407
+ ## License
408
 
409
+ Like the original SMOR and SMORLemma grammars, the DWDSmor grammar and
410
+ Python library are licensed under the GNU General Public License
411
+ v2.0. The same applies to the open edition of the DWDSmor lexicon.
 
412
 
413
+ For the DWDS edition based on the complete DWDS dictionary, all rights
414
+ are reserved and individual license terms apply. If you are interested
415
+ in the DWDS edition, please contact us.
416
 
417
+ ## Contact
418
 
419
+ Feel free to contact [Andreas Nolda](mailto:[email protected]) for any
420
+ question about this project.
 
421
 
422
  ## Credits
423
 
 
438
 
439
  * Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.).
440
  DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur
441
+ deutschen Sprache in Geschichte und Gegenwart. [Online](https://www.dwds.de/)
 
442
  * Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes
443
  System. Ph.D. thesis, Universität Stuttgart.
444
  [PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf)