de_STTS2_folk tagger

This is a spaCy model that assigns part-of-speech tags from the Stuttgart-Tübingen Tagset (STTS) version 2.0, a tagset designed for transcripts of conversational speech in German. The model may be useful for tagging ASR transcripts such as those collected in the CoGS corpus.

The model was trained on the tag annotations from the FOLK corpus (FOLK-Gold, https://agd.ids-mannheim.de/folk-gold.shtml), using an 80/20 training/test split. Tokens in the training data were converted to lower case before training to match the format of automatic speech recognition transcripts on YouTube as of early 2023.
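The preprocessing described above can be sketched roughly as follows. This is a minimal illustration and not the author's training script: the function names, the shuffling, the seed, and the toy sentences are all assumptions, and the source does not say whether the split was made per sentence, per transcript, or per token.

```python
import random
from typing import List, Sequence, Tuple

# A sentence is a list of (token, tag) pairs; this layout is an assumption.
Sentence = List[Tuple[str, str]]


def lowercase_tokens(sentence: Sentence) -> Sentence:
    """Lowercase every token and leave its tag unchanged."""
    return [(token.lower(), tag) for token, tag in sentence]


def train_test_split(
    sentences: Sequence[Sentence], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Sentence], List[Sentence]]:
    """Shuffle the sentences and hold out `test_fraction` of them for testing."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Toy data for illustration only, not taken from FOLK-Gold.
    toy = [[("Wir", "PPER"), ("Sollen", "VMFIN")] for _ in range(10)]
    lowered = [lowercase_tokens(s) for s in toy]
    train, test = train_test_split(lowered, test_fraction=0.2)
    print(len(train), len(test))
```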

Usage example:

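# Install the model wheel. The leading "!" is notebook syntax (Jupyter/Colab);
# drop it when running pip from a shell.
!pip install https://huggingface.co/stcoats/de_STTS2_folk/resolve/main/de_STTS2_folk-any-py3-none-any.whl
import spacy
import de_STTS2_folk

# Load the installed pipeline (tok2vec + tagger).
nlp = de_STTS2_folk.load()

# Input is lowercase and unpunctuated, like the training data.
doc = nlp("ach so meinst du wir sollen es jetzt tun")

# Print each token with its STTS 2.0 tag.
for token in doc:
    print(token.text, token.tag_)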
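Because the training tokens were lowercased, it is probably sensible to lowercase input text before tagging it. The helper below is a small sketch of that idea; `tag_lowercased` is a hypothetical name, not part of the model package, and whether lowercasing actually improves accuracy on a given dataset has not been measured here.

```python
from typing import Callable, List, Tuple


def tag_lowercased(nlp: Callable, text: str) -> List[Tuple[str, str]]:
    """Lowercase `text`, run it through `nlp`, and return (token, tag) pairs.

    `nlp` is expected to be a loaded spaCy pipeline, for example the object
    returned by de_STTS2_folk.load().
    """
    doc = nlp(text.lower())
    return [(token.text, token.tag_) for token in doc]
```

With the model installed, `tag_lowercased(nlp, "Ach so, meinst du")` would pass `"ach so, meinst du"` to the pipeline.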

References

Coats, Steven. (2023). A new corpus of geolocated ASR transcripts from Germany. Language Resources and Evaluation. https://doi.org/10.1007/s10579-023-09686-9

Westpfahl, Swantje and Thomas Schmidt. (2016). FOLK-Gold – A GOLD standard for Part-of-Speech-Tagging of Spoken German. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia.


tags:
- spacy
- token-classification
language:
- de
model-index:
- name: de_STTS2_folk
  results:
  - task:
      name: TAG
      type: token-classification
    metrics:
    - name: TAG (XPOS) Accuracy
      type: accuracy
      value: 0.9191333537

| Feature | Description |
| --- | --- |
| Name | de_STTS2_folk |
| Version | 0.0.1 |
| spaCy | >=3.5.1,<3.6.0 |
| Default Pipeline | tok2vec, tagger |
| Components | tok2vec, tagger |
| Vectors | 0 keys, 0 unique vectors (0 dimensions) |
| Sources | Swantje Westpfahl and Thomas Schmidt, FOLK-Gold, https://agd.ids-mannheim.de/folk-gold.shtml |
| License | CC-BY 4.0 |
| Author | Steven Coats |

Label Scheme

The label scheme has 62 labels for 1 component.

| Component | Labels |
| --- | --- |
| tagger | $., AB, ADJA, ADJD, ADV, APPO, APPR, APPRART, APZR, ART, CARD, FM, KOKOM, KON, KOUI, KOUS, NE, NGAKW, NGHES, NGIRR, NGONO, NN, ORD, PDAT, PDS, PIAT, PIDAT, PIDS, PIS, PPER, PPOSAT, PPOSS, PRELAT, PRELS, PRF, PTKA, PTKIFG, PTKMA, PTKMWL, PTKNEG, PTKVZ, PTKZU, PWAT, PWAV, PWS, SEDM, SEQU, SPELL, TRUNC, UI, VAFIN, VAIMP, VAINF, VAPP, VMFIN, VMINF, VVFIN, VVIMP, VVINF, VVIZU, VVPP, XY |

Accuracy

| Type | Score |
| --- | --- |
| TAG_ACC | 91.91 |
| TOK2VEC_LOSS | 478891.28 |
| TAGGER_LOSS | 402526.03 |
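TAG_ACC is a token-level accuracy: the share of tokens whose predicted tag equals the gold tag, reported here as a percentage. The sketch below shows that calculation in plain Python. It is an illustration of the metric, not spaCy's scorer, and the function name and the toy tag sequences are assumptions.

```python
from typing import Sequence


def tag_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of positions where the predicted tag matches gold."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


if __name__ == "__main__":
    # Toy tag sequences for illustration; three of four tags match.
    gold = ["PPER", "VMFIN", "PPER", "ADV"]
    predicted = ["PPER", "VMFIN", "PPER", "NN"]
    print(f"{100 * tag_accuracy(gold, predicted):.2f}")
```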