metadata
language:
  - nl
tags:
  - kenlm
license: apache-2.0

KenLM (arpa) models for Dutch based on Wikipedia

This repository contains 5-gram KenLM models for Dutch, trained on the Dutch portion of Wikipedia after sentence segmentation (one sentence per line). Separate models are provided for tokens, part-of-speech tags, dependency labels, and lemmas, all produced with spaCy's nl_core_news_sm pipeline:

  • wiki_nl_token.arpa[.bin]: token
  • wiki_nl_pos.arpa[.bin]: part-of-speech tag
  • wiki_nl_dep.arpa[.bin]: dependency label
  • wiki_nl_lemma.arpa[.bin]: lemma

Both regular .arpa files and the more efficient KenLM binary files (.arpa.bin) are provided. The binary versions load faster, so you probably want to use those.

Usage from within Python

Make sure to install dependencies:

pip install huggingface_hub
pip install https://github.com/kpu/kenlm/archive/master.zip

# If you want to use spaCy preprocessing
pip install spacy
python -m spacy download nl_core_news_sm

We can then use the huggingface_hub library to download and cache the model file that we want, and load it directly with KenLM.

import kenlm
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_nl", filename="wiki_nl_token.arpa.bin")
model = kenlm.Model(model_file)

text = "Ik eet graag koekjes !"  # pre-tokenized
model.perplexity(text)
# 1790.5033832700467
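
For reference, the perplexity value can be related to the log10 sentence score. The sketch below assumes that the score is the total log10 probability of the sentence including the end-of-sentence token, and that perplexity is normalised by the number of whitespace-separated tokens plus one (for `</s>`). The helper name `perplexity_from_score` is hypothetical and only serves as an illustration.

```python
def perplexity_from_score(log10_prob: float, sentence: str) -> float:
    """Convert a total log10 probability into a per-token perplexity."""
    # Count whitespace-separated tokens, plus one for the end-of-sentence marker.
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_prob / n_tokens)


# Usage with the model loaded above (requires the downloaded model file):
# score = model.score(text, bos=True, eos=True)
# perplexity_from_score(score, text)  # expected to agree with model.perplexity(text)
```

Because the score is a log10 probability, a higher (less negative) score yields a lower perplexity.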

It is recommended to use spaCy as a preprocessor, so that your input automatically gets the same tokenization and tagsets that were used when creating the LMs.

import kenlm
import spacy
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_nl", filename="wiki_nl_pos.arpa.bin")  # pos file
model = kenlm.Model(model_file)

nlp = spacy.load("nl_core_news_sm")

text = "Ik eet graag koekjes!"  # raw text; spaCy takes care of tokenization
pos_sequence = " ".join([token.pos_ for token in nlp(text)])
# 'PRON VERB ADV NOUN PUNCT'
model.perplexity(pos_sequence)
# 6.190638021041525
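
The same pattern should carry over to the other models. The sketch below assumes that the dependency and lemma models were trained on spaCy's `dep_` and `lemma_` attributes joined by single spaces, without further normalisation such as lowercasing; the repository does not confirm this. The helper `to_sequence` and the `ATTRIBUTES` mapping are hypothetical names.

```python
# Maps each model file suffix to the spaCy token attribute assumed to match it.
ATTRIBUTES = {
    "token": "text",
    "pos": "pos_",
    "dep": "dep_",
    "lemma": "lemma_",
}


def to_sequence(doc, level: str) -> str:
    """Join one attribute of every token in `doc` into a space-separated string."""
    attr = ATTRIBUTES[level]
    return " ".join(getattr(token, attr) for token in doc)


# Usage with the objects defined above:
# dep_file = hf_hub_download(repo_id="BramVanroy/kenlm_wikipedia_nl", filename="wiki_nl_dep.arpa.bin")
# dep_model = kenlm.Model(dep_file)
# dep_model.perplexity(to_sequence(nlp(text), "dep"))
```

Keeping the mapping in one place makes it harder to accidentally score, say, a lemma sequence with the POS model.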

Reproduction

The models were built with KenLM's lmplz and build_binary tools. Example for the lemma model:

bin/lmplz -o 5 -S 75% -T ../data/tmp/ < ../data/wikipedia/nl/wiki_nl_processed_lemma_dedup.txt > ../data/wikipedia/nl/models/wiki_nl_lemma.arpa
bin/build_binary ../data/wikipedia/nl/models/wiki_nl_lemma.arpa ../data/wikipedia/nl/models/wiki_nl_lemma.arpa.bin

For the class-based LMs (POS and DEP), the --discount_fallback option was used and the parsed data was not deduplicated. For the token and lemma models, the data was deduplicated at the sentence level.
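
Sentence-level deduplication of the kind mentioned above can be approximated in a few lines. This is a minimal sketch, not the script that was used for these models: it assumes exact-match deduplication after stripping surrounding whitespace and keeps the first occurrence of each sentence. The function name and the file paths in the usage comment are hypothetical.

```python
from typing import Iterable, Iterator


def dedup_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty sentence once, in order of first appearance."""
    seen = set()
    for line in lines:
        sentence = line.strip()
        if sentence and sentence not in seen:
            seen.add(sentence)
            yield sentence


# Usage (hypothetical paths):
# with open("wiki_nl_processed_lemma.txt", encoding="utf-8") as fin, \
#      open("wiki_nl_processed_lemma_dedup.txt", "w", encoding="utf-8") as fout:
#     for sentence in dedup_sentences(fin):
#         fout.write(sentence + "\n")
```

Note that this keeps every unique sentence in memory, which may be a concern for very large corpora.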