---
language: de
library_name: sfst
license: gpl-2.0
tags:
- sfst
- dwdsmor
- token-classification
- lemmatisation
model-index:
- name: dwdsmor
  results:
  - task:
      type: token-classification
      name: Lemmatisation
    dataset:
      name: Universal Dependencies Treebank (de-hdt)
      type: universal_dependencies
      config: de_hdt
      split: train
    metrics:
    - type: coverage
      value: 0.8415324536382167
      name: Coverage
    - type: coverage
      value: 1.0
      name: Coverage ($()
    - type: coverage
      value: 1.0
      name: Coverage ($,)
    - type: coverage
      value: 0.9999580703997988
      name: Coverage ($.)
    - type: coverage
      value: 0.7740509710590406
      name: Coverage (ADJA)
    - type: coverage
      value: 0.7548407611333322
      name: Coverage (ADJD)
    - type: coverage
      value: 0.9682621529723873
      name: Coverage (ADV)
    - type: coverage
      value: 0.9989939637826962
      name: Coverage (APPO)
    - type: coverage
      value: 0.9308645050358152
      name: Coverage (APPR)
    - type: coverage
      value: 0.9967651071695788
      name: Coverage (APPRART)
    - type: coverage
      value: 0.7916666666666666
      name: Coverage (APZR)
    - type: coverage
      value: 0.9999603964317185
      name: Coverage (ART)
    - type: coverage
      value: 0.9613524039049266
      name: Coverage (CARD)
    - type: coverage
      value: 0.13320473120462967
      name: Coverage (FM)
    - type: coverage
      value: 0.7142857142857143
      name: Coverage (ITJ)
    - type: coverage
      value: 1.0
      name: Coverage (KOKOM)
    - type: coverage
      value: 0.9995274949083504
      name: Coverage (KON)
    - type: coverage
      value: 1.0
      name: Coverage (KOUI)
    - type: coverage
      value: 0.9858579967925354
      name: Coverage (KOUS)
    - type: coverage
      value: 0.0618080812117821
      name: Coverage (NE)
    - type: coverage
      value: 0.7440593189565299
      name: Coverage (NN)
    - type: coverage
      value: 0.9799275737196068
      name: Coverage (PDAT)
    - type: coverage
      value: 0.9995682832062167
      name: Coverage (PDS)
    - type: coverage
      value: 0.9879094306440976
      name: Coverage (PIAT)
    - type: coverage
      value: 1.0
      name: Coverage (PIDAT)
    - type: coverage
      value: 0.9951910051476565
      name: Coverage (PIS)
    - type: coverage
      value: 0.999888876541838
      name: Coverage (PPER)
    - type: coverage
      value: 1.0
      name: Coverage (PPOSAT)
    - type: coverage
      value: 1.0
      name: Coverage (PPOSS)
    - type: coverage
      value: 1.0
      name: Coverage (PRELAT)
    - type: coverage
      value: 1.0
      name: Coverage (PRELS)
    - type: coverage
      value: 1.0
      name: Coverage (PRF)
    - type: coverage
      value: 0.9861938278289117
      name: Coverage (PROAV)
    - type: coverage
      value: 0.3082133784928027
      name: Coverage (PTKA)
    - type: coverage
      value: 1.0
      name: Coverage (PTKANT)
    - type: coverage
      value: 1.0
      name: Coverage (PTKNEG)
    - type: coverage
      value: 0.7705097087378641
      name: Coverage (PTKVZ)
    - type: coverage
      value: 0.0
      name: Coverage (PTKZU)
    - type: coverage
      value: 0.9551166965888689
      name: Coverage (PWAT)
    - type: coverage
      value: 0.9937264742785445
      name: Coverage (PWAV)
    - type: coverage
      value: 0.9946524064171123
      name: Coverage (PWS)
    - type: coverage
      value: 1.0
      name: Coverage (VAFIN)
    - type: coverage
      value: 1.0
      name: Coverage (VAIMP)
    - type: coverage
      value: 1.0
      name: Coverage (VAINF)
    - type: coverage
      value: 1.0
      name: Coverage (VAPP)
    - type: coverage
      value: 1.0
      name: Coverage (VMFIN)
    - type: coverage
      value: 1.0
      name: Coverage (VMINF)
    - type: coverage
      value: 1.0
      name: Coverage (VMPP)
    - type: coverage
      value: 0.886487187323461
      name: Coverage (VVFIN)
    - type: coverage
      value: 0.9596122778675282
      name: Coverage (VVIMP)
    - type: coverage
      value: 0.8214535019002559
      name: Coverage (VVINF)
    - type: coverage
      value: 0.829683698296837
      name: Coverage (VVIZU)
    - type: coverage
      value: 0.7996866513473992
      name: Coverage (VVPP)
    - type: coverage
      value: 0.4148471615720524
      name: Coverage (XY)
---

# DWDSmor – German Morphology


DWDSmor implements the **lemmatisation and morphological analysis** of
word forms as well as the **generation of paradigms of lexical words**
in **written German**. Finite state transducers (automata) map word
forms to specifications of corresponding lexical words and tagging
which represents morphological properties. By traversing such
transducers

1. a given word form can be analysed and lemmatised, or
1. a lexical word together with a set of morphological tags will
   generate corresponding inflected word forms.
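
The two traversal directions can be illustrated with a toy relation. The sketch below is not how SFST works internally: a real transducer encodes the relation as an automaton, whereas this example merely lists a few form/analysis pairs copied from the example output in the Usage section. The names `PAIRS`, `analyse` and `generate` are made up for illustration and are not part of DWDSmor.

```python
from collections import defaultdict

# A handful of (word form, analysis) pairs; a real transducer would encode
# many such pairs compactly as an automaton instead of listing them.
PAIRS = [
    ("getestet", "ge<~>test<~>et<+ADJ><Pos><Pred/Adv>"),
    ("getestet", "test<~>en<+V><Part><Perf><haben>"),
    ("teste", "test<~>en<+V><1><Sg><Pres><Ind>"),
    ("teste", "test<~>en<+V><1><Sg><Pres><Subj>"),
    ("testete", "test<~>en<+V><1><Sg><Past><Ind>"),
]

# Index the same relation in both directions.
_BY_FORM = defaultdict(list)
_BY_ANALYSIS = defaultdict(list)
for form, analysis in PAIRS:
    _BY_FORM[form].append(analysis)
    _BY_ANALYSIS[analysis].append(form)


def analyse(form):
    """Direction 1: word form -> all known analyses."""
    return list(_BY_FORM.get(form, []))


def generate(analysis):
    """Direction 2: analysis -> all known word forms."""
    return list(_BY_ANALYSIS.get(analysis, []))
```

Looking up `"getestet"` with `analyse` yields both the adjective and the participle reading, while `generate` maps a single analysis back to its word form.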

The automata are compiled and traversed via
[SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/), a C++
library and toolbox for finite-state transducers (FSTs). Their
coverage of the German language depends on

1. the DWDSmor grammar, defining the rules by which word formation
   happens, and
1. a lexicon, declaring inflection classes and other morphological
   properties for covered lexical words.

The grammar, derived from
[SMORLemma](https://github.com/rsennrich/SMORLemma) and providing the
morphology for building automata from lexica, is common to all DWDSmor
installations and published as open source. In contrast, we provide
**multiple lexica** resulting in different editions of DWDSmor:

1. the **DWDS Edition**, derived from the complete lexical dataset of
   the [DWDS dictionary](https://www.dwds.de/) and available upon
   request for research purposes,
1. the **Open Edition**, based on a subset of the DWDS, covering the
   most common word forms and released freely with the grammar for
   general use and experiments.

Depending on the edition and word class, coverage ranges from 70% to
100%, with the notable exceptions of foreign-language words and named
entities: both classes are generally not part of the underlying DWDS
dictionary and are thus barely covered by DWDSmor. Current overall
coverage measured against the [German Universal Dependencies
treebank](https://universaldependencies.org/treebanks/de_hdt/index.html)
is documented on the respective [Hugging Face Hub
page](https://huggingface.co/zentrum-lexikographie) of each edition.
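
How the published coverage figures are computed is not spelled out here. As a rough sketch, and assuming that coverage means the share of tokens for which an analysis is found, overall and per-tag values could be derived as follows. The function name `coverage_by_tag` and the `is_covered` callback are hypothetical; the callback stands in for whatever check an actual evaluation uses.

```python
from collections import Counter


def coverage_by_tag(tokens, is_covered):
    """Compute overall and per-tag coverage.

    tokens: iterable of (word form, tag) pairs, e.g. from a treebank.
    is_covered: callable(form, tag) -> bool; an assumed stand-in for
        "the transducer yields a suitable analysis for this token".
    """
    total = Counter()
    covered = Counter()
    for form, tag in tokens:
        total[tag] += 1
        if is_covered(form, tag):
            covered[tag] += 1
    per_tag = {tag: covered[tag] / count for tag, count in total.items()}
    n_tokens = sum(total.values())
    overall = sum(covered.values()) / n_tokens if n_tokens else 0.0
    return overall, per_tag
```

Because named entities and foreign-language material are frequent in newspaper text, low values for those tags pull the overall figure down even when most other word classes are close to fully covered.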


## Usage

DWDSmor as a Python library is available via the package index PyPI:

``` plaintext
pip install dwdsmor
```

The library can be used for lemmatisation:

``` python-console
>>> import dwdsmor
>>> lemmatizer = dwdsmor.lemmatizer()
>>> assert lemmatizer("getestet", pos={"+V"}) == "testen"
>>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
```
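
For running text, the lemmatizer can be wrapped in a small helper that falls back to the word form itself when no lemma is found. This is a minimal sketch: it assumes that the callable accepts a word form plus optional keyword criteria such as `pos` (as in the calls above) and returns a falsy value such as `None` for forms it cannot analyse. That return behaviour is an assumption, and `lemmatise_tokens` is a hypothetical helper, not part of the library.

```python
def lemmatise_tokens(lemmatizer, tokens, **criteria):
    """Return (token, lemma) pairs for a sequence of word forms.

    lemmatizer: a callable like the one returned by dwdsmor.lemmatizer().
    criteria: keyword arguments passed through unchanged, e.g. pos={"+V"}.
    Tokens without a lemma (assumed to yield a falsy result) are kept as-is.
    """
    pairs = []
    for token in tokens:
        lemma = lemmatizer(token, **criteria)
        pairs.append((token, lemma if lemma else token))
    return pairs


# With the library installed, usage might look like this:
#   import dwdsmor
#   lemmatise_tokens(dwdsmor.lemmatizer(), ["getestet", "Haus"])
```

Keeping the original token as a fallback is one possible design choice; depending on the application, marking unanalysed tokens explicitly may be preferable.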

In addition to the Python API, the package provides a simple command line
interface named `dwdsmor`. To analyze a word form, pass it as an
argument:

```plaintext
$ dwdsmor getestet
| Wordform   | Lemma    | Analysis                            | POS   | Degree   | Function   | Nonfinite   | Tense   | Auxiliary   |
|------------|----------|-------------------------------------|-------|----------|------------|-------------|---------|-------------|
| getestet   | getestet | ge<~>test<~>et<+ADJ><Pos><Pred/Adv> | +ADJ  | Pos      | Pred/Adv   |             |         |             |
| getestet   | testen   | test<~>en<+V><Part><Perf><haben>    | +V    |          |            | Part        | Perf    | haben       |
```
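
The strings in the `Analysis` column can be taken apart with a few lines of Python. The sketch below only relies on the notation visible in the table above: segments separated by `<~>`, followed by tags in angle brackets. The full analysis format may contain further markers that this example does not handle, and `split_analysis` is a hypothetical helper name.

```python
import re

# Either a marker in angle brackets or a run of plain characters.
_TOKEN = re.compile(r"<[^<>]*>|[^<>]+")
BOUNDARY = "<~>"  # separates segments in the examples above


def split_analysis(analysis):
    """Split an analysis string into (joined segments, segments, tags)."""
    segments, tags = [], []
    for token in _TOKEN.findall(analysis):
        if token == BOUNDARY:
            continue  # boundary markers carry no tag information
        if token.startswith("<"):
            tags.append(token[1:-1])  # drop the angle brackets
        else:
            segments.append(token)
    return "".join(segments), segments, tags
```

For the two rows shown above, joining the segments reproduces the value of the `Lemma` column (`getestet` and `testen`), and the tag list corresponds to the remaining columns.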

To generate all word forms for a lexical word, pass it (or a form
which can be analyzed as the lexical word) as an argument together
with the option `-g`:

``` plaintext
$ dwdsmor -g getestet
[…]
| Wordform   | Lemma    | Analysis                                                    | POS   | Subcategory   | Degree   | Function   |   Person | Gender   | Case   | Number   | Nonfinite   | Tense   | Mood   | Auxiliary   | Inflection   |
|------------|----------|-------------------------------------------------------------|-------|---------------|----------|------------|----------|----------|--------|----------|-------------|---------|--------|-------------|--------------|
| getestete  | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St>     | +ADJ  |               | Pos      | Attr/Subst |          | Fem      | Acc    | Sg       |             |         |        |             | St           |
| getestete  | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><Wk>     | +ADJ  |               | Pos      | Attr/Subst |          | Fem      | Acc    | Sg       |             |         |        |             | Wk           |
| getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><St>     | +ADJ  |               | Pos      | Attr/Subst |          | Fem      | Dat    | Sg       |             |         |        |             | St           |
| getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><Wk>     | +ADJ  |               | Pos      | Attr/Subst |          | Fem      | Dat    | Sg       |             |         |        |             | Wk           |
| getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><St>     | +ADJ  |               | Pos      | Attr/Subst |          | Fem      | Gen    | Sg       |             |         |        |             | St           |
| getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><Wk>     | +ADJ  |               | Pos      | Attr/Subst |          | Fem      | Gen    | Sg       |             |         |        |             | Wk           |
[…]
| testeten   | testen   | test<~>en<+V><1><Pl><Past><Ind>                             | +V    |               |          |            |        1 |          |        | Pl       |             | Past    | Ind    |             |              |
| testeten   | testen   | test<~>en<+V><1><Pl><Past><Subj>                            | +V    |               |          |            |        1 |          |        | Pl       |             | Past    | Subj   |             |              |
| testen     | testen   | test<~>en<+V><1><Pl><Pres><Ind>                             | +V    |               |          |            |        1 |          |        | Pl       |             | Pres    | Ind    |             |              |
| testen     | testen   | test<~>en<+V><1><Pl><Pres><Subj>                            | +V    |               |          |            |        1 |          |        | Pl       |             | Pres    | Subj   |             |              |
| testete    | testen   | test<~>en<+V><1><Sg><Past><Ind>                             | +V    |               |          |            |        1 |          |        | Sg       |             | Past    | Ind    |             |              |
| testete    | testen   | test<~>en<+V><1><Sg><Past><Subj>                            | +V    |               |          |            |        1 |          |        | Sg       |             | Past    | Subj   |             |              |
| teste      | testen   | test<~>en<+V><1><Sg><Pres><Ind>                             | +V    |               |          |            |        1 |          |        | Sg       |             | Pres    | Ind    |             |              |
| teste      | testen   | test<~>en<+V><1><Sg><Pres><Subj>                            | +V    |               |          |            |        1 |          |        | Sg       |             | Pres    | Subj   |             |              |
| testetet   | testen   | test<~>en<+V><2><Pl><Past><Ind>                             | +V    |               |          |            |        2 |          |        | Pl       |             | Past    | Ind    |             |              |
[…]
```
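
Since the command line output is a pipe-delimited table, it can be post-processed with plain Python. The following sketch assumes the layout shown above (a header row, a separator row of dashes, then data rows) and that no cell contains a `|` character. `parse_pipe_table` is a hypothetical helper, not part of the package.

```python
def parse_pipe_table(text):
    """Parse a pipe-delimited table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip shell prompts, elision markers and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(set(cell) <= set("-:") for cell in cells):
            continue  # skip the separator row below the header
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows
```

Each resulting dict maps a column name such as `Wordform` or `Tense` to the cell value, with empty cells represented as empty strings.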

## Development

DWDSmor is in active development. In its current stage, it supports
most inflection classes and some productive word-formation patterns of
written German.


### Prerequisites

* [GNU/Linux](https://www.debian.org/): Development, builds and tests
  of DWDSmor are performed on [Debian
  GNU/Linux](https://debian.org/). While other UNIX-like operating
  systems such as macOS should work, too, they are not actively
  supported.
* [SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/): a C++
  library and toolbox for finite-state transducers (FSTs); please take
  a look at its homepage for installation and usage instructions.
* [Python >= v3.9](https://www.python.org/): DWDSmor targets Python as
  its primary runtime environment. The DWDSmor transducers can be used
  via SFST's command-line tools, queried in Python applications via
  language-specific
  [bindings](https://github.com/gremid/sfst-transduce), or used by the
  Python scripts `dwdsmor.py` and `paradigm.py` for morphological
  analysis and for paradigm generation.
* [Saxon-HE](https://www.saxonica.com/): The extraction of lexical
  entries from XML sources of DWDS articles is implemented in XSLT 2,
  for which Saxon-HE is used as the runtime environment. Saxon
  requires [Java](https://openjdk.java.net/) as a runtime
  environment.

On a Debian-based distribution, the following command installs the
required software:

```plaintext
apt-get install python3 default-jdk libsaxonhe-java sfst
```

### Project setup

Optionally, set up a Python virtual environment for project builds,
e.g. via Python's `venv`:

```plaintext
python3 -m venv .venv
source .venv/bin/activate
```

Then install DWDSmor, including development dependencies:

```plaintext
pip install -U pip setuptools && pip install -e '.[dev]'
```


### Building lexica and automata

The different editions are built with the script `build-dwdsmor`:


```plaintext
$ ./build-dwdsmor --help
usage: cli.py [-h] [--automaton AUTOMATON] [--force] [--with-metrics] [--release] [--tag]
              [editions ...]

Build DWDSmor.

positional arguments:
  editions              Editions to build (all by default)

options:
  -h, --help            show this help message and exit
  --automaton AUTOMATON
                        Automaton type to build (all by default)
  --force               Force building (also current targets)
  --with-metrics        Measure UD/de-hdt coverage
  --release             Push automata to HF hub
  --tag                 Tag HF hub release with current version
```

To build all editions available in the current git checkout, run:

```plaintext
./build-dwdsmor
```

The build result can be found in `build/` with one subdirectory per
edition. Each edition contains several automata types in standard and
compact format:


* `lemma.{a,ca}`: transducer with inflection and word-formation
  components, for lemmatisation and morphological analysis of word
  forms in terms of grammatical categories
* `morph.{a,ca}`: transducer with inflection and word-formation
  components, for the generation of morphologically segmented word
  forms
* `finite.{a,ca}`: transducer with an inflection component and a
  finite word-formation component, for testing purposes
* `root.{a,ca}`: transducer with inflection and word-formation
  components, for lexical analysis of word forms in terms of root
  lemmas (i.e., lemmas of ultimate word-formation bases),
  word-formation process, word-formation means, and grammatical
  categories in terms of the Pattern-and-Restriction Theory of word
  formation (Nolda 2022)
* `index.{a,ca}`: transducer with an inflection component only and
  with DWDS homographic lemma indices, for paradigm generation
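
To check that a build produced every automaton listed above, the expected file names can be enumerated and compared with the contents of an edition directory. This is a sketch under stated assumptions: the edition directory name `open` in the usage comment is a guess, and `missing_automata` is a hypothetical helper rather than part of the build script.

```python
from pathlib import Path

AUTOMATON_TYPES = ("lemma", "morph", "finite", "root", "index")
SUFFIXES = (".a", ".ca")  # standard and compact format


def missing_automata(edition_dir):
    """Return the expected automaton file names absent from edition_dir."""
    edition_dir = Path(edition_dir)
    expected = [kind + suffix for kind in AUTOMATON_TYPES for suffix in SUFFIXES]
    return [name for name in expected if not (edition_dir / name).is_file()]


# Possible usage, assuming an edition directory called "open":
#   missing_automata("build/open")
```

An empty result means that all five automaton types are present in both formats.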


### Testing

To test basic transducer usage and to check for potential regressions, run

    pytest

## License

Like the original SMOR and SMORLemma grammars, the DWDSmor grammar and
Python library are licensed under the GNU General Public License
v2.0. The same applies to the open edition of the DWDSmor lexicon.

For the DWDS edition based on the complete DWDS dictionary, all rights
are reserved and individual license terms apply. If you are interested
in the DWDS edition, please contact us.

## Contact

Feel free to contact [Andreas Nolda](mailto:andreas.nolda@bbaw.de) with any
questions about this project.

## Credits

DWDSmor is based on the following software and datasets:

1. [SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/), a C++ library
   and toolbox for finite-state transducers (FSTs) (Schmid 2006)
2. [SMORLemma](https://github.com/rsennrich/SMORLemma) (Sennrich and Kunz 2014),
   a modified version of the Stuttgart Morphology
   ([SMOR](https://www.cis.lmu.de/~schmid/tools/SMOR/)) (Schmid, Fitschen, and
   Heid 2004) with an alternative lemmatisation component
3. the [DWDS dictionary](https://www.dwds.de/) (BBAW n.d.) replacing the
   [IMSLex](https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/imslex/)
   (Fitschen 2004) as the lexical data source for German words, their grammatical
   categories, and their morphological properties.

## References

* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.).
  DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur
  deutschen Sprache in Geschichte und Gegenwart. [Online](https://www.dwds.de/)
* Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes
  System. Ph.D. thesis, Universität Stuttgart.
  [PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf)
* Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on
  compounding and blending in German. In *Headedness and/or Grammatical
  Anarchy?*, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically
  Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press,
  343–376.
  [PDF](https://zenodo.org/record/7142720/files/336-FreywaldSimonMüller-2022-11.pdf).
* Schmid, Helmut (2006). A programming language for finite state transducers. In
  *Finite-State Methods and Natural Language Processing: 5th International
  Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005*, ed. by Anssi
  Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial
  Intelligence 4002, Berlin: Springer, 308–309.
  [PDF](https://www.cis.uni-muenchen.de/~schmid/papers/SFST-PL.pdf).
* Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German
  computational morphology covering derivation, composition, and inflection. In
  *LREC 2004: Fourth International Conference on Language Resources and
  Evaluation*, ed. by Maria T. Lino *et al.*, European Language Resources
  Association, 1263–1266.
  [PDF](http://www.lrec-conf.org/proceedings/lrec2004/pdf/468.pdf)
* Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon
  extracted from Wiktionary. In *LREC 2014: Ninth International Conference on
  Language Resources and Evaluation*, ed. by Nicoletta Calzolari *et al.*,
  European Language Resources Association, 1063–1067.
  [PDF](http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf).