Upload folder using huggingface_hub
README.md
CHANGED
@@ -183,7 +183,7 @@ model-index:
|
|
183 |
name: Coverage (XY)
|
184 |
---
|
185 |
|
186 |
-
# DWDSmor – German
|
187 |
|
188 |
|
189 |
|
@@ -205,14 +205,33 @@ The automata are compiled and traversed via
|
|
205 |
library and toolbox for finite-state transducers (FSTs). Their
|
206 |
coverage of the German language depends on
|
207 |
|
208 |
-
1. the DWDSmor grammar, defining the rules by which word formation
|
209 |
-
|
|
|
|
|
210 |
|
211 |
-
|
212 |
[SMORLemma](https://github.com/rsennrich/SMORLemma) and providing the
|
213 |
morphology for building automata from lexica, is common to all DWDSmor
|
214 |
-
installations and published as open source
|
215 |
-
editions
|
216 |
|
217 |
## Usage
|
218 |
|
@@ -222,7 +241,7 @@ DWDSmor as a Python library is available via the package index PyPI:
|
|
222 |
pip install dwdsmor
|
223 |
```
|
224 |
|
225 |
-
|
226 |
|
227 |
``` python-console
|
228 |
>>> import dwdsmor
|
@@ -231,13 +250,51 @@ For lemmatisation:
|
|
231 |
>>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
|
232 |
```
|
233 |
|
234 |
-
|
235 |
|
236 |
## Development
|
237 |
|
238 |
-
DWDSmor is in active development. In its current stage,
|
239 |
-
|
240 |
-
|
241 |
|
242 |
|
243 |
### Prerequisites
|
@@ -266,7 +323,7 @@ patterns of written German.
|
|
266 |
On a Debian-based distribution, the following command installs the
|
267 |
required software:
|
268 |
|
269 |
-
```
|
270 |
apt-get install python3 default-jdk libsaxonhe-java sfst
|
271 |
```
|
272 |
|
@@ -275,66 +332,69 @@ apt-get install python3 default-jdk libsaxonhe-java sfst
|
|
275 |
Optionally, set up a Python virtual environment for project builds,
|
276 |
e.g. via Python's `venv`:
|
277 |
|
278 |
-
```
|
279 |
python3 -m venv .venv
|
280 |
source .venv/bin/activate
|
281 |
```
|
282 |
|
283 |
Then install DWDSmor, including development dependencies:
|
284 |
|
285 |
-
```
|
286 |
pip install -U pip setuptools && pip install -e '.[dev]'
|
287 |
```
|
288 |
|
289 |
|
290 |
### Building lexica and automata
|
291 |
|
292 |
-
Building different editions is facilitated via
|
293 |
|
294 |
|
|
|
|
|
|
|
|
|
295 |
|
296 |
-
|
297 |
-
make all
|
298 |
-
```
|
299 |
|
300 |
-
|
|
|
301 |
|
302 |
-
|
303 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
304 |
```
|
305 |
|
306 |
-
|
307 |
-
not part of this repository.
|
308 |
|
309 |
-
|
310 |
-
|
311 |
-
|
312 |
-
```sh
|
313 |
-
make sample && make sample-install && make dwdsmor
|
314 |
```
|
315 |
|
316 |
-
|
317 |
-
|
318 |
-
|
319 |
-
```sh
|
320 |
-
make install
|
321 |
-
```
|
322 |
|
323 |
-
The installed DWDSmor transducers are:
|
324 |
|
325 |
-
* `
|
326 |
-
components, for lemmatisation and morphological analysis of word
|
327 |
-
terms of grammatical categories
|
328 |
-
* `
|
329 |
-
components, for the generation of morphologically segmented word
|
330 |
-
|
|
|
331 |
finite word-formation component, for testing purposes
|
332 |
-
* `
|
333 |
-
components, for lexical analysis of word forms in terms of root
|
334 |
-
lemmas of ultimate word-formation bases),
|
335 |
-
word-formation means, and grammatical
|
336 |
-
Pattern-and-Restriction Theory of word
|
337 |
-
|
|
|
338 |
DWDS homographic lemma indices, for paradigm generation
|
339 |
|
340 |
|
@@ -344,19 +404,20 @@ In order to test basic transducer usage and for potential regressions, run
|
|
344 |
|
345 |
pytest
|
346 |
|
347 |
-
##
|
348 |
|
349 |
-
|
350 |
-
|
351 |
-
|
352 |
-
to the integration of DWDSmor into your corpus-annotation pipeline.
|
353 |
|
|
|
|
|
|
|
354 |
|
355 |
-
##
|
356 |
|
357 |
-
|
358 |
-
|
359 |
-
to the rest of this project.
|
360 |
|
361 |
## Credits
|
362 |
|
@@ -377,8 +438,7 @@ DWDSmor is based on the following software and datasets:
|
|
377 |
|
378 |
* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.).
|
379 |
DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur
|
380 |
-
deutschen Sprache in Geschichte und Gegenwart.
|
381 |
-
https://www.dwds.de
|
382 |
* Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes
|
383 |
System. Ph.D. thesis, Universität Stuttgart.
|
384 |
[PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf)
|
|
|
183 |
name: Coverage (XY)
|
184 |
---
|
185 |
|
186 |
+
# DWDSmor – German Morphology
|
187 |
|
188 |
|
189 |
|
|
|
205 |
library and toolbox for finite-state transducers (FSTs). Their
|
206 |
coverage of the German language depends on
|
207 |
|
208 |
+
1. the DWDSmor grammar, defining the rules by which word formation
|
209 |
+
happens, and
|
210 |
+
1. a lexicon, declaring inflection classes and other morphological
|
211 |
+
properties for covered lexical words.
|
212 |
|
213 |
+
The grammar, derived from
|
214 |
[SMORLemma](https://github.com/rsennrich/SMORLemma) and providing the
|
215 |
morphology for building automata from lexica, is common to all DWDSmor
|
216 |
+
installations and published as open source. In contrast, we provide
|
217 |
+
**multiple lexica** resulting in different editions of DWDSmor:
|
218 |
+
|
219 |
+
1. the **DWDS Edition**, derived from the complete lexical dataset of
|
220 |
+
the [DWDS dictionary](https://www.dwds.de/) and available upon
|
221 |
+
request for research purposes,
|
222 |
+
1. the **Open Edition**, based on a subset of the DWDS, covering the
|
223 |
+
most common word forms and released freely with the grammar for
|
224 |
+
general use and experiments.
|
225 |
+
|
226 |
+
Depending on the edition and word class, coverage ranges from 70 to
|
227 |
+
100%, with the notable exceptions of foreign-language words and named
|
228 |
+
entities: Generally, both classes are not part of the underlying DWDS
|
229 |
+
dictionary and thus barely covered by DWDSmor. Current overall
|
230 |
+
coverage measured against the [German Universal Dependencies
|
231 |
+
treebank](https://universaldependencies.org/treebanks/de_hdt/index.html)
|
232 |
+
is documented on the respective [Hugging Face Hub
|
233 |
+
page](https://huggingface.co/zentrum-lexikographie) of each edition.
|
234 |
+
|
235 |
|
236 |
## Usage
|
237 |
|
|
|
241 |
pip install dwdsmor
|
242 |
```
|
243 |
|
244 |
+
The library can be used for lemmatisation:
|
245 |
|
246 |
``` python-console
|
247 |
>>> import dwdsmor
|
|
|
250 |
>>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
|
251 |
```
|
252 |
|
253 |
+
In addition to the Python API, the package provides a simple command line
|
254 |
+
interface named `dwdsmor`. To analyze a word form, pass it as an
|
255 |
+
argument:
|
256 |
+
|
257 |
+
```plaintext
|
258 |
+
$ dwdsmor getestet
|
259 |
+
| Wordform | Lemma | Analysis | POS | Degree | Function | Nonfinite | Tense | Auxiliary |
|
260 |
+
|------------|----------|-------------------------------------|-------|----------|------------|-------------|---------|-------------|
|
261 |
+
| getestet | getestet | ge<~>test<~>et<+ADJ><Pos><Pred/Adv> | +ADJ | Pos | Pred/Adv | | | |
|
262 |
+
| getestet | testen | test<~>en<+V><Part><Perf><haben> | +V | | | Part | Perf | haben |
|
263 |
+
```
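
The tabular output shown above is a plain pipe-delimited table, so it can be post-processed with a few lines of Python. The following is a minimal sketch, not part of DWDSmor: the function name `parse_pipe_table` and the constant `SAMPLE` are hypothetical, and the sample text is copied from the output above with its column padding shortened.

```python
# Sample CLI output, copied from the example above (padding shortened).
SAMPLE = """
| Wordform | Lemma    | Analysis                            | POS  | Degree | Function | Nonfinite | Tense | Auxiliary |
|----------|----------|-------------------------------------|------|--------|----------|-----------|-------|-----------|
| getestet | getestet | ge<~>test<~>et<+ADJ><Pos><Pred/Adv> | +ADJ | Pos    | Pred/Adv |           |       |           |
| getestet | testen   | test<~>en<+V><Part><Perf><haben>    | +V   |        |          | Part      | Perf  | haben     |
"""


def parse_pipe_table(text):
    """Parse a pipe-delimited table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip prompts, ellipsis lines and blank lines
        if line.endswith("|"):
            line = line[:-1]  # drop the closing pipe, keep empty cells
        cells = [cell.strip() for cell in line[1:].split("|")]
        if all(cell and set(cell) <= {"-", ":"} for cell in cells):
            continue  # separator row between header and body
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


rows = parse_pipe_table(SAMPLE)
for row in rows:
    print(row["Wordform"], row["Lemma"], row["POS"])
```

Empty cells are kept as empty strings, so every row has the same keys as the header.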
|
264 |
+
|
265 |
+
To generate all word forms for a lexical word, pass it (or a form
|
266 |
+
which can be analyzed as the lexical word) as an argument together
|
267 |
+
with the option `-g`:
|
268 |
+
|
269 |
+
``` plaintext
|
270 |
+
$ dwdsmor -g getestet
|
271 |
+
[…]
|
272 |
+
| Wordform | Lemma | Analysis | POS | Subcategory | Degree | Function | Person | Gender | Case | Number | Nonfinite | Tense | Mood | Auxiliary | Inflection |
|
273 |
+
|------------|----------|-------------------------------------------------------------|-------|---------------|----------|------------|----------|----------|--------|----------|-------------|---------|--------|-------------|--------------|
|
274 |
+
| getestete | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Acc | Sg | | | | | St |
|
275 |
+
| getestete | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Acc | Sg | | | | | Wk |
|
276 |
+
| getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Dat | Sg | | | | | St |
|
277 |
+
| getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Dat | Sg | | | | | Wk |
|
278 |
+
| getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Gen | Sg | | | | | St |
|
279 |
+
| getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Gen | Sg | | | | | Wk |
|
280 |
+
[…]
|
281 |
+
| testeten | testen | test<~>en<+V><1><Pl><Past><Ind> | +V | | | | 1 | | | Pl | | Past | Ind | | |
|
282 |
+
| testeten | testen | test<~>en<+V><1><Pl><Past><Subj> | +V | | | | 1 | | | Pl | | Past | Subj | | |
|
283 |
+
| testen | testen | test<~>en<+V><1><Pl><Pres><Ind> | +V | | | | 1 | | | Pl | | Pres | Ind | | |
|
284 |
+
| testen | testen | test<~>en<+V><1><Pl><Pres><Subj> | +V | | | | 1 | | | Pl | | Pres | Subj | | |
|
285 |
+
| testete | testen | test<~>en<+V><1><Sg><Past><Ind> | +V | | | | 1 | | | Sg | | Past | Ind | | |
|
286 |
+
| testete | testen | test<~>en<+V><1><Sg><Past><Subj> | +V | | | | 1 | | | Sg | | Past | Subj | | |
|
287 |
+
| teste | testen | test<~>en<+V><1><Sg><Pres><Ind> | +V | | | | 1 | | | Sg | | Pres | Ind | | |
|
288 |
+
| teste | testen | test<~>en<+V><1><Sg><Pres><Subj> | +V | | | | 1 | | | Sg | | Pres | Subj | | |
|
289 |
+
| testetet | testen | test<~>en<+V><2><Pl><Past><Ind> | +V | | | | 2 | | | Pl | | Past | Ind | | |
|
290 |
+
[…]
|
291 |
+
```
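
The `Analysis` column in the tables above mixes string segments with angle-bracket tags. As a rough sketch, such a string can be split with a regular expression. The helper `split_analysis` is hypothetical and not part of the DWDSmor API, and reading `<~>` as a segment boundary is an assumption inferred from the sample output.

```python
import re

# Matches one angle-bracket tag such as <~>, <+V> or <Pred/Adv>.
TAG_PATTERN = re.compile(r"(<[^<>]*>)")


def split_analysis(analysis):
    """Split an analysis string into (segments, tags).

    Assumption: "<~>" only marks a boundary between segments;
    every other "<...>" token is treated as a category tag.
    """
    segments = []
    tags = []
    for token in TAG_PATTERN.split(analysis):
        if not token or token == "<~>":
            continue  # empty split artefacts and boundary markers
        if token.startswith("<"):
            tags.append(token[1:-1])  # strip the angle brackets
        else:
            segments.append(token)
    return segments, tags


segments, tags = split_analysis("test<~>en<+V><Part><Perf><haben>")
print(segments, tags)
```

In the two sample analyses for `getestet`, joining the segments reproduces the value of the `Lemma` column. This sketch does not verify whether that holds for all analyses.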
|
292 |
|
293 |
## Development
|
294 |
|
295 |
+
DWDSmor is in active development. In its current stage, it supports
|
296 |
+
most inflection classes and some productive word-formation patterns of
|
297 |
+
written German.
|
298 |
|
299 |
|
300 |
### Prerequisites
|
|
|
323 |
On a Debian-based distribution, the following command installs the
|
324 |
required software:
|
325 |
|
326 |
+
```plaintext
|
327 |
apt-get install python3 default-jdk libsaxonhe-java sfst
|
328 |
```
|
329 |
|
|
|
332 |
Optionally, set up a Python virtual environment for project builds,
|
333 |
e.g. via Python's `venv`:
|
334 |
|
335 |
+
```plaintext
|
336 |
python3 -m venv .venv
|
337 |
source .venv/bin/activate
|
338 |
```
|
339 |
|
340 |
Then install DWDSmor, including development dependencies:
|
341 |
|
342 |
+
```plaintext
|
343 |
pip install -U pip setuptools && pip install -e '.[dev]'
|
344 |
```
|
345 |
|
346 |
|
347 |
### Building lexica and automata
|
348 |
|
349 |
+
Building different editions is facilitated via the script `build-dwdsmor`:
|
350 |
|
351 |
|
352 |
+
```plaintext
|
353 |
+
$ ./build-dwdsmor --help
|
354 |
+
usage: cli.py [-h] [--automaton AUTOMATON] [--force] [--with-metrics] [--release] [--tag]
|
355 |
+
[editions ...]
|
356 |
|
357 |
+
Build DWDSmor.
|
|
|
|
|
358 |
|
359 |
+
positional arguments:
|
360 |
+
editions Editions to build (all by default)
|
361 |
|
362 |
+
options:
|
363 |
+
-h, --help show this help message and exit
|
364 |
+
--automaton AUTOMATON
|
365 |
+
Automaton type to build (all by default)
|
366 |
+
--force Force building (also current targets)
|
367 |
+
--with-metrics Measure UD/de-hdt coverage
|
368 |
+
--release Push automata to HF hub
|
369 |
+
--tag Tag HF hub release with current version
|
370 |
```
|
371 |
|
372 |
+
To build all editions available in the current git checkout, run:
|
|
|
373 |
|
374 |
+
```plaintext
|
375 |
+
./build-dwdsmor
|
|
|
|
|
|
|
376 |
```
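
When the build is driven from another script, the command line can be assembled programmatically from the options listed in the help text above. This is only a sketch: the helper `build_command` is hypothetical, the edition name `open` and the automaton value `lemma` are assumed placeholders that the help text does not confirm, and the command is only printed here, not executed.

```python
import shlex


def build_command(editions=(), automaton=None, force=False, with_metrics=False):
    """Assemble an argument list for ./build-dwdsmor.

    Only flags that appear in the help text are used; the values
    passed for `editions` and `automaton` are up to the caller.
    """
    cmd = ["./build-dwdsmor"]
    if automaton is not None:
        cmd += ["--automaton", automaton]
    if force:
        cmd.append("--force")
    if with_metrics:
        cmd.append("--with-metrics")
    cmd.extend(editions)  # positional arguments come last
    return cmd


# "open" and "lemma" are assumed example values, not documented names.
cmd = build_command(["open"], automaton="lemma", force=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the root of a git checkout.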
|
377 |
|
378 |
+
The build result can be found in `build/` with one subdirectory per
|
379 |
+
edition. Each edition contains several automata types in standard and
|
380 |
+
compact format:
|
|
|
|
|
|
|
381 |
|
|
|
382 |
|
383 |
+
* `lemma.{a,ca}`: transducer with inflection and word-formation
|
384 |
+
components, for lemmatisation and morphological analysis of word
|
385 |
+
forms in terms of grammatical categories
|
386 |
+
* `morph.{a,ca}`: transducer with inflection and word-formation
|
387 |
+
components, for the generation of morphologically segmented word
|
388 |
+
forms
|
389 |
+
* `finite.{a,ca}`: transducer with an inflection component and a
|
390 |
finite word-formation component, for testing purposes
|
391 |
+
* `root.{a,ca}`: transducer with inflection and word-formation
|
392 |
+
components, for lexical analysis of word forms in terms of root
|
393 |
+
lemmas (i.e., lemmas of ultimate word-formation bases),
|
394 |
+
word-formation process, word-formation means, and grammatical
|
395 |
+
categories in terms of the Pattern-and-Restriction Theory of word
|
396 |
+
formation (Nolda 2022)
|
397 |
+
* `index.{a,ca}`: transducer with an inflection component only with
|
398 |
DWDS homographic lemma indices, for paradigm generation
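
The naming scheme in the list above can be expanded into concrete file paths, for example to check that a build is complete. The sketch below assumes that the edition subdirectory is called `open` (a hypothetical name) and that `.a` denotes the standard and `.ca` the compact format, which the order in "standard and compact" suggests but the text does not state outright.

```python
from pathlib import Path

AUTOMATON_TYPES = ("lemma", "morph", "finite", "root", "index")
FORMATS = ("a", "ca")  # assumed: "a" = standard, "ca" = compact


def expected_automata(edition, build_dir="build"):
    """List the automaton files expected for one edition."""
    return [
        Path(build_dir) / edition / f"{automaton}.{fmt}"
        for automaton in AUTOMATON_TYPES
        for fmt in FORMATS
    ]


# "open" is an assumed edition directory name.
paths = expected_automata("open")
missing = [path for path in paths if not path.exists()]
print(f"{len(paths) - len(missing)} of {len(paths)} automata present")
```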
|
399 |
|
400 |
|
|
|
404 |
|
405 |
pytest
|
406 |
|
407 |
+
## License
|
408 |
|
409 |
+
Like the original SMOR and SMORLemma grammars, the DWDSmor grammar and
|
410 |
+
Python library are licensed under the GNU General Public License
|
411 |
+
v2.0. The same applies to the open edition of the DWDSmor lexicon.
|
|
|
412 |
|
413 |
+
For the DWDS edition based on the complete DWDS dictionary, all rights
|
414 |
+
are reserved and individual license terms apply. If you are interested
|
415 |
+
in the DWDS edition, please contact us.
|
416 |
|
417 |
+
## Contact
|
418 |
|
419 |
+
Feel free to contact [Andreas Nolda](mailto:[email protected]) for any
|
420 |
+
question about this project.
|
|
|
421 |
|
422 |
## Credits
|
423 |
|
|
|
438 |
|
439 |
* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.).
|
440 |
DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur
|
441 |
+
deutschen Sprache in Geschichte und Gegenwart. [Online](https://www.dwds.de/)
|
|
|
442 |
* Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes
|
443 |
System. Ph.D. thesis, Universität Stuttgart.
|
444 |
[PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf)
|