Commit f045c49 by drewThomasson (1 parent: 30fa9fe)
Upload 115 files
This view is limited to 50 files because it contains too many changes.
- nltk_data/tokenizers/punkt.zip +3 -0
- nltk_data/tokenizers/punkt/PY3/README +98 -0
- nltk_data/tokenizers/punkt/PY3/czech.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/danish.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/dutch.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/english.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/estonian.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/finnish.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/french.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/german.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/greek.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/italian.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/malayalam.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/norwegian.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/polish.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/portuguese.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/russian.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/slovene.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/spanish.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/swedish.pickle +3 -0
- nltk_data/tokenizers/punkt/PY3/turkish.pickle +3 -0
- nltk_data/tokenizers/punkt/README +98 -0
- nltk_data/tokenizers/punkt/czech.pickle +3 -0
- nltk_data/tokenizers/punkt/danish.pickle +3 -0
- nltk_data/tokenizers/punkt/dutch.pickle +3 -0
- nltk_data/tokenizers/punkt/english.pickle +3 -0
- nltk_data/tokenizers/punkt/estonian.pickle +3 -0
- nltk_data/tokenizers/punkt/finnish.pickle +3 -0
- nltk_data/tokenizers/punkt/french.pickle +3 -0
- nltk_data/tokenizers/punkt/german.pickle +3 -0
- nltk_data/tokenizers/punkt/greek.pickle +3 -0
- nltk_data/tokenizers/punkt/italian.pickle +3 -0
- nltk_data/tokenizers/punkt/malayalam.pickle +3 -0
- nltk_data/tokenizers/punkt/norwegian.pickle +3 -0
- nltk_data/tokenizers/punkt/polish.pickle +3 -0
- nltk_data/tokenizers/punkt/portuguese.pickle +3 -0
- nltk_data/tokenizers/punkt/russian.pickle +3 -0
- nltk_data/tokenizers/punkt/slovene.pickle +3 -0
- nltk_data/tokenizers/punkt/spanish.pickle +3 -0
- nltk_data/tokenizers/punkt/swedish.pickle +3 -0
- nltk_data/tokenizers/punkt/turkish.pickle +3 -0
- nltk_data/tokenizers/punkt_tab.zip +3 -0
- nltk_data/tokenizers/punkt_tab/README +98 -0
- nltk_data/tokenizers/punkt_tab/czech/abbrev_types.txt +118 -0
- nltk_data/tokenizers/punkt_tab/czech/collocations.tab +96 -0
- nltk_data/tokenizers/punkt_tab/czech/ortho_context.tab +0 -0
- nltk_data/tokenizers/punkt_tab/czech/sent_starters.txt +54 -0
- nltk_data/tokenizers/punkt_tab/danish/abbrev_types.txt +211 -0
- nltk_data/tokenizers/punkt_tab/danish/collocations.tab +101 -0
- nltk_data/tokenizers/punkt_tab/danish/ortho_context.tab +0 -0
nltk_data/tokenizers/punkt.zip
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:51c3078994aeaf650bfc8e028be4fb42b4a0d177d41c012b6a983979653660ec
+size 13905355
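The binary files in this commit are stored as Git LFS pointers: three `key value` lines giving the spec version, a SHA-256 object id, and the size in bytes. The sketch below shows one way to parse such a pointer and check a downloaded blob against it. The names `LfsPointer`, `parse_lfs_pointer`, and `matches_pointer` are hypothetical helpers written for this illustration; they are not part of Git LFS or NLTK.

```python
import hashlib
from typing import NamedTuple


class LfsPointer(NamedTuple):
    version: str
    oid: str   # hex digest, without the "sha256:" prefix
    size: int  # size of the real file in bytes


def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return LfsPointer(fields["version"], digest, int(fields["size"]))


def matches_pointer(data: bytes, pointer: LfsPointer) -> bool:
    """Return True if data has the size and SHA-256 digest the pointer records."""
    return len(data) == pointer.size and hashlib.sha256(data).hexdigest() == pointer.oid


# The pointer for nltk_data/tokenizers/punkt.zip, copied from the hunk above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:51c3078994aeaf650bfc8e028be4fb42b4a0d177d41c012b6a983979653660ec\n"
    "size 13905355\n"
)
print(pointer.size)
```

After fetching the real archive, `matches_pointer(open("punkt.zip", "rb").read(), pointer)` would confirm that the download is the object this commit refers to.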
nltk_data/tokenizers/punkt/PY3/README
ADDED
@@ -0,0 +1,98 @@
+Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
+
+Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
+been contributed by various people using NLTK for sentence boundary detection.
+
+For information about how to use these models, please confer the tokenization HOWTO:
+http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
+and chapter 3.8 of the NLTK book:
+http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
+
+There are pretrained tokenizers for the following languages:
+
+File                Language        Source                              Contents                       Size of training corpus(in tokens)   Model contributed by
+=======================================================================================================================================================================
+czech.pickle        Czech           Multilingual Corpus 1 (ECI)         Lidove Noviny                  ~345,000                             Jan Strunk / Tibor Kiss
+                                                                        Literarni Noviny
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+danish.pickle       Danish          Avisdata CD-Rom Ver. 1.1. 1995      Berlingske Tidende             ~550,000                             Jan Strunk / Tibor Kiss
+                                    (Berlingske Avisdata, Copenhagen)   Weekend Avisen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+dutch.pickle        Dutch           Multilingual Corpus 1 (ECI)         De Limburger                   ~340,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+english.pickle      English         Penn Treebank (LDC)                 Wall Street Journal            ~469,000                             Jan Strunk / Tibor Kiss
+                    (American)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+estonian.pickle     Estonian        University of Tartu, Estonia        Eesti Ekspress                 ~359,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+finnish.pickle      Finnish         Finnish Parole Corpus, Finnish      Books and major national       ~364,000                             Jan Strunk / Tibor Kiss
+                                    Text Bank (Suomen Kielen            newspapers
+                                    Tekstipankki)
+                                    Finnish Center for IT Science
+                                    (CSC)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+french.pickle       French          Multilingual Corpus 1 (ECI)         Le Monde                       ~370,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+german.pickle       German          Neue Zürcher Zeitung AG             Neue Zürcher Zeitung           ~847,000                             Jan Strunk / Tibor Kiss
+                    (Switzerland)   CD-ROM
+                                    (Uses "ss"
+                                    instead of "ß")
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+greek.pickle        Greek           Efstathios Stamatatos               To Vima (TO BHMA)              ~227,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+italian.pickle      Italian         Multilingual Corpus 1 (ECI)         La Stampa, Il Mattino          ~312,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+norwegian.pickle    Norwegian       Centre for Humanities               Bergens Tidende                ~479,000                             Jan Strunk / Tibor Kiss
+                    (Bokmål and     Information Technologies,
+                    Nynorsk)        Bergen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+polish.pickle       Polish          Polish National Corpus              Literature, newspapers, etc.   ~1,000,000                           Krzysztof Langner
+                                    (http://www.nkjp.pl/)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+portuguese.pickle   Portuguese      CETENFolha Corpus                   Folha de São Paulo             ~321,000                             Jan Strunk / Tibor Kiss
+                    (Brazilian)     (Linguateca)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+slovene.pickle      Slovene         TRACTOR                             Delo                           ~354,000                             Jan Strunk / Tibor Kiss
+                                    Slovene Academy for Arts
+                                    and Sciences
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+spanish.pickle      Spanish         Multilingual Corpus 1 (ECI)         Sur                            ~353,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+swedish.pickle      Swedish         Multilingual Corpus 1 (ECI)         Dagens Nyheter                 ~339,000                             Jan Strunk / Tibor Kiss
+                                                                        (and some other texts)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+turkish.pickle      Turkish         METU Turkish Corpus                 Milliyet                       ~333,000                             Jan Strunk / Tibor Kiss
+                                    (Türkçe Derlem Projesi)
+                                    University of Ankara
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
+The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
+Unicode using the codecs module.
+
+Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+Computational Linguistics 32: 485-525.
+
+---- Training Code ----
+
+# import punkt
+import nltk.tokenize.punkt
+
+# Make a new Tokenizer
+tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
+
+# Read in training corpus (one example: Slovene)
+import codecs
+text = codecs.open("slovene.plain","Ur","iso-8859-2").read()
+
+# Train tokenizer
+tokenizer.train(text)
+
+# Dump pickled tokenizer
+import pickle
+out = open("slovene.pickle","wb")
+pickle.dump(tokenizer, out)
+out.close()
+
+---------
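The training code in this README dates from Python 2. In particular, the `"Ur"` mode string is unlikely to work on current interpreters, since the `U` flag was deprecated in Python 3 and, as far as I know, removed in 3.11. Below is a minimal sketch of the same steps for modern Python. It assumes NLTK is installed and that `PunktSentenceTokenizer.train` behaves as the README shows. The helper names `read_corpus` and `train_and_pickle` are hypothetical, and `slovene.plain` is the README's own example file.

```python
import pickle
from pathlib import Path


def read_corpus(path, encoding="iso-8859-2"):
    """Read a plain-text training corpus and decode it to str."""
    # Text mode already applies universal-newline handling by default,
    # which is what the legacy "U" flag used to request.
    return Path(path).read_text(encoding=encoding)


def train_and_pickle(corpus_path, model_path, encoding="iso-8859-2"):
    """Train a Punkt sentence tokenizer on one corpus and pickle it."""
    # Imported lazily so read_corpus() stays usable without NLTK installed.
    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    tokenizer.train(read_corpus(corpus_path, encoding))
    # The context manager closes the file even if pickling fails.
    with open(model_path, "wb") as out:
        pickle.dump(tokenizer, out)
    return tokenizer
```

A call such as `train_and_pickle("slovene.plain", "slovene.pickle")` would mirror the README's Slovene example. Pickled models should only be loaded from sources you trust, because unpickling can execute arbitrary code.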
nltk_data/tokenizers/punkt/PY3/czech.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:64b0734b6fbe8e8d7cac79f48d1dd9f853824e57c4e3594dadd74ba2c1d97f50
+size 1119050
nltk_data/tokenizers/punkt/PY3/danish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6189c7dd254e29e2bd406a7f6a4336297c8953214792466a790ea4444223ceb3
+size 1191710
nltk_data/tokenizers/punkt/PY3/dutch.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fda0d6a13f02e8898daec7fe923da88e25abe081bcfa755c0e015075c215fe4c
+size 693759
nltk_data/tokenizers/punkt/PY3/english.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5cad3758596392364e3be9803dbd7ebeda384b68937b488a01365f5551bb942c
+size 406697
nltk_data/tokenizers/punkt/PY3/estonian.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b364f72538d17b146a98009ad239a8096ce6c0a8b02958c0bc776ecd0c58a25f
+size 1499502
nltk_data/tokenizers/punkt/PY3/finnish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6a4b5ff5500ee851c456f9dd40d5fc0d8c1859c88eb3178de1317d26b7d22833
+size 1852226
nltk_data/tokenizers/punkt/PY3/french.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:28e3a4cd2971989b3cb9fd3433a6f15d17981e464db2be039364313b5de94f29
+size 553575
nltk_data/tokenizers/punkt/PY3/german.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ddcbbe85e2042a019b1a6e37fd8c153286c38ba201fae0f5bfd9a3f74abae25c
+size 1463575
nltk_data/tokenizers/punkt/PY3/greek.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:85dabc44ab90a5f208ef37ff6b4892ebe7e740f71fb4da47cfd95417ca3e22fd
+size 876006
nltk_data/tokenizers/punkt/PY3/italian.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:68a94007b1e4ffdc4d1a190185ca5442c3dafeb17ab39d30329e84cd74a43947
+size 615089
nltk_data/tokenizers/punkt/PY3/malayalam.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1f8cf58acbdb7f472ac40affc13663be42dafb47c15030c11ade0444c9e0e53d
+size 221207
nltk_data/tokenizers/punkt/PY3/norwegian.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4ff7a46d1438b311457d15d7763060b8d3270852c1850fd788c5cee194dc4a1d
+size 1181271
nltk_data/tokenizers/punkt/PY3/polish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:624900ae3ddfb4854a98c5d3b8b1c9bb719975f33fee61ce1441dab9f8a00718
+size 1738386
nltk_data/tokenizers/punkt/PY3/portuguese.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:02a0b7b25c3c7471e1791b66a31bbb530afbb0160aee4fcecf0107652067b4a1
+size 611919
nltk_data/tokenizers/punkt/PY3/russian.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:549762f8190024d89b511472df21a3a135eee5d9233e63ac244db737c2c61d7e
+size 33020
nltk_data/tokenizers/punkt/PY3/slovene.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:52ef2cc0ed27d79b3aa635cbbc40ad811883a75a4b8a8be1ae406972870fd864
+size 734444
nltk_data/tokenizers/punkt/PY3/spanish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:164a50fadc5a49f8ec7426eae11d3111ee752b48a3ef373d47745011192a5984
+size 562337
nltk_data/tokenizers/punkt/PY3/swedish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b0f7d538bfd5266633b09e842cd92e9e0ac10f1d923bf211e1497972ddc47318
+size 979681
nltk_data/tokenizers/punkt/PY3/turkish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ae68ef5863728ac5332e87eb1f6bae772ff32a13a4caa2b01a5c68103e853c5b
+size 1017038
nltk_data/tokenizers/punkt/README
ADDED
@@ -0,0 +1,98 @@
+Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
+
+Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
+been contributed by various people using NLTK for sentence boundary detection.
+
+For information about how to use these models, please confer the tokenization HOWTO:
+http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
+and chapter 3.8 of the NLTK book:
+http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
+
+There are pretrained tokenizers for the following languages:
+
+File                Language        Source                              Contents                       Size of training corpus(in tokens)   Model contributed by
+=======================================================================================================================================================================
+czech.pickle        Czech           Multilingual Corpus 1 (ECI)         Lidove Noviny                  ~345,000                             Jan Strunk / Tibor Kiss
+                                                                        Literarni Noviny
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+danish.pickle       Danish          Avisdata CD-Rom Ver. 1.1. 1995      Berlingske Tidende             ~550,000                             Jan Strunk / Tibor Kiss
+                                    (Berlingske Avisdata, Copenhagen)   Weekend Avisen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+dutch.pickle        Dutch           Multilingual Corpus 1 (ECI)         De Limburger                   ~340,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+english.pickle      English         Penn Treebank (LDC)                 Wall Street Journal            ~469,000                             Jan Strunk / Tibor Kiss
+                    (American)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+estonian.pickle     Estonian        University of Tartu, Estonia        Eesti Ekspress                 ~359,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+finnish.pickle      Finnish         Finnish Parole Corpus, Finnish      Books and major national       ~364,000                             Jan Strunk / Tibor Kiss
+                                    Text Bank (Suomen Kielen            newspapers
+                                    Tekstipankki)
+                                    Finnish Center for IT Science
+                                    (CSC)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+french.pickle       French          Multilingual Corpus 1 (ECI)         Le Monde                       ~370,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+german.pickle       German          Neue Zürcher Zeitung AG             Neue Zürcher Zeitung           ~847,000                             Jan Strunk / Tibor Kiss
+                    (Switzerland)   CD-ROM
+                                    (Uses "ss"
+                                    instead of "ß")
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+greek.pickle        Greek           Efstathios Stamatatos               To Vima (TO BHMA)              ~227,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+italian.pickle      Italian         Multilingual Corpus 1 (ECI)         La Stampa, Il Mattino          ~312,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+norwegian.pickle    Norwegian       Centre for Humanities               Bergens Tidende                ~479,000                             Jan Strunk / Tibor Kiss
+                    (Bokmål and     Information Technologies,
+                    Nynorsk)        Bergen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+polish.pickle       Polish          Polish National Corpus              Literature, newspapers, etc.   ~1,000,000                           Krzysztof Langner
+                                    (http://www.nkjp.pl/)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+portuguese.pickle   Portuguese      CETENFolha Corpus                   Folha de São Paulo             ~321,000                             Jan Strunk / Tibor Kiss
+                    (Brazilian)     (Linguateca)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+slovene.pickle      Slovene         TRACTOR                             Delo                           ~354,000                             Jan Strunk / Tibor Kiss
+                                    Slovene Academy for Arts
+                                    and Sciences
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+spanish.pickle      Spanish         Multilingual Corpus 1 (ECI)         Sur                            ~353,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+swedish.pickle      Swedish         Multilingual Corpus 1 (ECI)         Dagens Nyheter                 ~339,000                             Jan Strunk / Tibor Kiss
+                                                                        (and some other texts)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+turkish.pickle      Turkish         METU Turkish Corpus                 Milliyet                       ~333,000                             Jan Strunk / Tibor Kiss
+                                    (Türkçe Derlem Projesi)
+                                    University of Ankara
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
+The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
+Unicode using the codecs module.
+
+Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+Computational Linguistics 32: 485-525.
+
+---- Training Code ----
+
+# import punkt
+import nltk.tokenize.punkt
+
+# Make a new Tokenizer
+tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
+
+# Read in training corpus (one example: Slovene)
+import codecs
+text = codecs.open("slovene.plain","Ur","iso-8859-2").read()
+
+# Train tokenizer
+tokenizer.train(text)
+
+# Dump pickled tokenizer
+import pickle
+out = open("slovene.pickle","wb")
+pickle.dump(tokenizer, out)
+out.close()
+
+---------
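The README's claim that the corpora held "about 400,000 tokens on average" can be checked against its own table. The sketch below copies the approximate sizes from the table and computes two means: one over all seventeen rows, and one over the sixteen rows credited to Jan Strunk / Tibor Kiss. Reading the README's figure as referring to those sixteen rows is an assumption made here, based on the Polish model being credited to a different contributor. Note that `malayalam.pickle` and `russian.pickle` are added by this commit but have no row in the table, so they are not included.

```python
# Approximate training-corpus sizes in tokens, copied from the README table.
corpus_tokens = {
    "czech": 345_000,
    "danish": 550_000,
    "dutch": 340_000,
    "english": 469_000,
    "estonian": 359_000,
    "finnish": 364_000,
    "french": 370_000,
    "german": 847_000,
    "greek": 227_000,
    "italian": 312_000,
    "norwegian": 479_000,
    "polish": 1_000_000,
    "portuguese": 321_000,
    "slovene": 354_000,
    "spanish": 353_000,
    "swedish": 339_000,
    "turkish": 333_000,
}

# Rows credited to Jan Strunk / Tibor Kiss, i.e. everything except Polish.
strunk_kiss = {lang: n for lang, n in corpus_tokens.items() if lang != "polish"}

mean_all = sum(corpus_tokens.values()) / len(corpus_tokens)
mean_strunk_kiss = sum(strunk_kiss.values()) / len(strunk_kiss)

print(round(mean_all))          # 433059
print(round(mean_strunk_kiss))  # 397625
```

Under that assumption the sixteen-row mean of roughly 398,000 tokens agrees with the README's rounded figure, while including the one-million-token Polish corpus raises the mean to about 433,000.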
nltk_data/tokenizers/punkt/czech.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5ba73d293c7d7953956bcf02f3695ec5c1f0d527f2a3c38097f5593394fa1690
+size 1265552
nltk_data/tokenizers/punkt/danish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ea29760a0a9197f52ca59e78aeafc5a6f55d05258faf7db1709b2b9eb321ef20
+size 1264725
nltk_data/tokenizers/punkt/dutch.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4a8e26b3d68c45c38e594d19e2d5677447bfdcaa636d3b1e7acfed0e9272d73c
+size 742624
nltk_data/tokenizers/punkt/english.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:dda37972ae88998a6fd3e3ec002697a6bd362b32d050fda7d7ca5276873092aa
+size 433305
nltk_data/tokenizers/punkt/estonian.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3867fee26a36bdb197c64362aa13ac683f5f33fa4d0d225a5d56707582a55a1d
+size 1596714
nltk_data/tokenizers/punkt/finnish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1a9e17b3d5b4df76345d812b8a65b1da0767eda5086eadcc11e625eef0942835
+size 1951656
nltk_data/tokenizers/punkt/french.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:de05f3d5647d3d2296626fb83f68428e4c6ad6e05a00ed4694c8bdc8f2f197ee
+size 583482
nltk_data/tokenizers/punkt/german.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:eab497fa085413130c8fd0fb13b929128930afe2f6a26ea8715c95df7088e97c
+size 1526714
nltk_data/tokenizers/punkt/greek.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:21752a6762fad5cfe46fb5c45fad9a85484a0e8e81c67e6af6fb973cfc27d67c
+size 1953106
nltk_data/tokenizers/punkt/italian.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:dcb2717d7be5f26e860a92e05acf69b1123a5f4527cd7a269a9ab9e9e668c805
+size 658331
nltk_data/tokenizers/punkt/malayalam.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1f8cf58acbdb7f472ac40affc13663be42dafb47c15030c11ade0444c9e0e53d
+size 221207
nltk_data/tokenizers/punkt/norwegian.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e4a97f8f9a03a0338dd746bcc89a0ae0f54ae43b835fa37d83e279e1ca794faf
+size 1259779
nltk_data/tokenizers/punkt/polish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:16127b6d10933427a3e90fb20e9be53e1fb371ff79a730c1030734ed80b90c92
+size 2042451
nltk_data/tokenizers/punkt/portuguese.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bb01bf7c79a4eadc2178bbd209665139a0e4b38f2d1c44fef097de93955140e0
+size 649051
nltk_data/tokenizers/punkt/russian.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bc984432fbe31f7000014f8047502476889169c60f09be5413ca09276b16c909
+size 33027
nltk_data/tokenizers/punkt/slovene.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7dac650212b3787b39996c01bd2084115493e6f6ec390bab61f767525b08b8ea
+size 832867
nltk_data/tokenizers/punkt/spanish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:271dc6027c4aae056f72a9bfab5645cf67e198bf4f972895844e40f5989ccdc3
+size 597831
nltk_data/tokenizers/punkt/swedish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:40d50ebdad6caa87715f2e300b1217ec92c42de205a543cc4a56903bd2c9acfa
+size 1034496
nltk_data/tokenizers/punkt/turkish.pickle
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d3ae47d76501d027698809d12e75292c9c392910488543342802f95db9765ccc
+size 1225013
nltk_data/tokenizers/punkt_tab.zip
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c2b16c23d738effbdc5789d7aa601397c13ba2819bf922fb904687f3f16657ed
+size 4259017
nltk_data/tokenizers/punkt_tab/README
ADDED
@@ -0,0 +1,98 @@
+Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
+
+Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
+been contributed by various people using NLTK for sentence boundary detection.
+
+For information about how to use these models, please confer the tokenization HOWTO:
+http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
+and chapter 3.8 of the NLTK book:
+http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
+
+There are pretrained tokenizers for the following languages:
+
+File                Language        Source                              Contents                       Size of training corpus(in tokens)   Model contributed by
+=======================================================================================================================================================================
+czech.pickle        Czech           Multilingual Corpus 1 (ECI)         Lidove Noviny                  ~345,000                             Jan Strunk / Tibor Kiss
+                                                                        Literarni Noviny
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+danish.pickle       Danish          Avisdata CD-Rom Ver. 1.1. 1995      Berlingske Tidende             ~550,000                             Jan Strunk / Tibor Kiss
+                                    (Berlingske Avisdata, Copenhagen)   Weekend Avisen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+dutch.pickle        Dutch           Multilingual Corpus 1 (ECI)         De Limburger                   ~340,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+english.pickle      English         Penn Treebank (LDC)                 Wall Street Journal            ~469,000                             Jan Strunk / Tibor Kiss
+                    (American)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+estonian.pickle     Estonian        University of Tartu, Estonia        Eesti Ekspress                 ~359,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+finnish.pickle      Finnish         Finnish Parole Corpus, Finnish      Books and major national       ~364,000                             Jan Strunk / Tibor Kiss
+                                    Text Bank (Suomen Kielen            newspapers
+                                    Tekstipankki)
+                                    Finnish Center for IT Science
+                                    (CSC)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+french.pickle       French          Multilingual Corpus 1 (ECI)         Le Monde                       ~370,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+german.pickle       German          Neue Zürcher Zeitung AG             Neue Zürcher Zeitung           ~847,000                             Jan Strunk / Tibor Kiss
+                    (Switzerland)   CD-ROM
+                                    (Uses "ss"
+                                    instead of "ß")
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+greek.pickle        Greek           Efstathios Stamatatos               To Vima (TO BHMA)              ~227,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+italian.pickle      Italian         Multilingual Corpus 1 (ECI)         La Stampa, Il Mattino          ~312,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+norwegian.pickle    Norwegian       Centre for Humanities               Bergens Tidende                ~479,000                             Jan Strunk / Tibor Kiss
+                    (Bokmål and     Information Technologies,
+                    Nynorsk)        Bergen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+polish.pickle       Polish          Polish National Corpus              Literature, newspapers, etc.   ~1,000,000                           Krzysztof Langner
+                                    (http://www.nkjp.pl/)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+portuguese.pickle   Portuguese      CETENFolha Corpus                   Folha de São Paulo             ~321,000                             Jan Strunk / Tibor Kiss
+                    (Brazilian)     (Linguateca)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+slovene.pickle      Slovene         TRACTOR                             Delo                           ~354,000                             Jan Strunk / Tibor Kiss
+                                    Slovene Academy for Arts
+                                    and Sciences
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+spanish.pickle      Spanish         Multilingual Corpus 1 (ECI)         Sur                            ~353,000                             Jan Strunk / Tibor Kiss
+                    (European)
62 |
+
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
63 |
+
swedish.pickle Swedish Multilingual Corpus 1 (ECI) Dagens Nyheter ~339,000 Jan Strunk / Tibor Kiss
|
64 |
+
(and some other texts)
|
65 |
+
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
66 |
+
turkish.pickle Turkish METU Turkish Corpus Milliyet ~333,000 Jan Strunk / Tibor Kiss
|
67 |
+
(Türkçe Derlem Projesi)
|
68 |
+
University of Ankara
|
69 |
+
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
70 |
+
|
71 |
+
The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
|
72 |
+
Unicode using the codecs module.
|
73 |
+
|
74 |
+
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
|
75 |
+
Computational Linguistics 32: 485-525.
|
76 |
+
|
77 |
+
---- Training Code ----
|
78 |
+
|
79 |
+
# import punkt
|
80 |
+
import nltk.tokenize.punkt
|
81 |
+
|
82 |
+
# Make a new Tokenizer
|
83 |
+
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
|
84 |
+
|
85 |
+
# Read in training corpus (one example: Slovene)
|
86 |
+
import codecs
|
87 |
+
text = codecs.open("slovene.plain","r","iso-8859-2").read()
|
88 |
+
|
89 |
+
# Train tokenizer
|
90 |
+
tokenizer.train(text)
|
91 |
+
|
92 |
+
# Dump pickled tokenizer
|
93 |
+
import pickle
|
94 |
+
out = open("slovene.pickle","wb")
|
95 |
+
pickle.dump(tokenizer, out)
|
96 |
+
out.close()
|
97 |
+
|
98 |
+
---------
|
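The reading step in the training code above can be sketched for current Python 3 using only the standard library. This is a minimal sketch, not the original authors' script: the helper name `read_training_text` and the sample sentence are hypothetical, and the temporary file merely stands in for `slovene.plain`. It assumes the corpus is stored as ISO-8859-2 bytes, as in the README example. The old "U" (universal newlines) mode flag is no longer accepted by recent Python versions; text mode already normalises line endings on read.

```python
import os
import tempfile


def read_training_text(path, encoding="iso-8859-2"):
    """Return the corpus as a str, decoded from the given legacy encoding."""
    # Python 3 text mode translates "\r\n" and "\r" to "\n" on read by
    # default, which is what the old "U" mode flag used to request.
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Hypothetical stand-in for slovene.plain, written as ISO-8859-2 bytes.
sample = "Dr. Novak je prišel.\r\nČas je."
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "slovene.plain")
    with open(corpus_path, "wb") as handle:
        handle.write(sample.encode("iso-8859-2"))
    text = read_training_text(corpus_path)
```

The resulting `text` is an ordinary Unicode string and could be passed to `tokenizer.train(text)` exactly as in the README code.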
nltk_data/tokenizers/punkt_tab/czech/abbrev_types.txt
ADDED
@@ -0,0 +1,118 @@
|
1 |
+
t
|
2 |
+
množ
|
3 |
+
např
|
4 |
+
j.h
|
5 |
+
man
|
6 |
+
ú
|
7 |
+
jug
|
8 |
+
dr
|
9 |
+
bl
|
10 |
+
ml
|
11 |
+
okr
|
12 |
+
st
|
13 |
+
uh
|
14 |
+
šp
|
15 |
+
judr
|
16 |
+
u.s.a
|
17 |
+
p
|
18 |
+
arg
|
19 |
+
žitě
|
20 |
+
st.celsia
|
21 |
+
etc
|
22 |
+
p.s
|
23 |
+
t.r
|
24 |
+
lok
|
25 |
+
mil
|
26 |
+
ict
|
27 |
+
n
|
28 |
+
tl
|
29 |
+
min
|
30 |
+
č
|
31 |
+
d
|
32 |
+
al
|
33 |
+
ravenně
|
34 |
+
mj
|
35 |
+
nar
|
36 |
+
plk
|
37 |
+
s.p
|
38 |
+
a.g
|
39 |
+
roč
|
40 |
+
b
|
41 |
+
zdi
|
42 |
+
r.s.c
|
43 |
+
přek
|
44 |
+
m
|
45 |
+
gen
|
46 |
+
csc
|
47 |
+
mudr
|
48 |
+
vic
|
49 |
+
š
|
50 |
+
sb
|
51 |
+
resp
|
52 |
+
tzn
|
53 |
+
iv
|
54 |
+
s.r.o
|
55 |
+
mar
|
56 |
+
w
|
57 |
+
čs
|
58 |
+
vi
|
59 |
+
tzv
|
60 |
+
ul
|
61 |
+
pen
|
62 |
+
zv
|
63 |
+
str
|
64 |
+
čp
|
65 |
+
org
|
66 |
+
rak
|
67 |
+
sv
|
68 |
+
pplk
|
69 |
+
u.s
|
70 |
+
prof
|
71 |
+
c.k
|
72 |
+
op
|
73 |
+
g
|
74 |
+
vii
|
75 |
+
kr
|
76 |
+
ing
|
77 |
+
j.o
|
78 |
+
drsc
|
79 |
+
m3
|
80 |
+
l
|
81 |
+
tr
|
82 |
+
ceo
|
83 |
+
ch
|
84 |
+
fuk
|
85 |
+
vl
|
86 |
+
viii
|
87 |
+
líp
|
88 |
+
hl.m
|
89 |
+
t.zv
|
90 |
+
phdr
|
91 |
+
o.k
|
92 |
+
tis
|
93 |
+
doc
|
94 |
+
kl
|
95 |
+
ard
|
96 |
+
čkd
|
97 |
+
pok
|
98 |
+
apod
|
99 |
+
r
|
100 |
+
př
|
101 |
+
a.s
|
102 |
+
j
|
103 |
+
jr
|
104 |
+
i.m
|
105 |
+
e
|
106 |
+
kupř
|
107 |
+
f
|
108 |
+
tř
|
109 |
+
xvi
|
110 |
+
mir
|
111 |
+
atď
|
112 |
+
vr
|
113 |
+
r.i.v
|
114 |
+
hl
|
115 |
+
kv
|
116 |
+
t.j
|
117 |
+
y
|
118 |
+
q.p.r
|
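The entries listed above are lowercase and carry no trailing period (for example `např`, `s.r.o`). A minimal sketch of how such a list could be consulted follows. It assumes one abbreviation per line, as the diff shows; the helper names `load_abbreviations` and `is_known_abbreviation` are hypothetical and are not NLTK functions.

```python
def load_abbreviations(text):
    """Build a set of abbreviation types from one-entry-per-line text."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def is_known_abbreviation(token, abbreviations):
    """Check a period-final token against the lowercase, period-less entries."""
    if not token.endswith("."):
        return False
    # Drop the final period and lowercase to match the stored form.
    return token[:-1].lower() in abbreviations


# A few entries copied from the Czech list above.
abbrevs = load_abbreviations("např\ndr\ns.r.o\n")
```

With this set, `is_known_abbreviation("Např.", abbrevs)` is true, while an ordinary sentence-final word such as `"Praha."` is not matched.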
nltk_data/tokenizers/punkt_tab/czech/collocations.tab
ADDED
@@ -0,0 +1,96 @@
|
1 |
+
i dejmala
|
2 |
+
##number## prosince
|
3 |
+
h steina
|
4 |
+
##number## listopadu
|
5 |
+
a dvořák
|
6 |
+
v klaus
|
7 |
+
i čnhl
|
8 |
+
##number## wladyslawowo
|
9 |
+
##number## letech
|
10 |
+
a jiráska
|
11 |
+
a dubček
|
12 |
+
##number## štrasburk
|
13 |
+
##number## juniorské
|
14 |
+
##number## století
|
15 |
+
##number## kola
|
16 |
+
##number## pád
|
17 |
+
##number## května
|
18 |
+
##number## týdne
|
19 |
+
v dlouhý
|
20 |
+
k design
|
21 |
+
##number## červenec
|
22 |
+
i ligy
|
23 |
+
##number## kolo
|
24 |
+
z svěrák
|
25 |
+
##number## mája
|
26 |
+
##number## šimková
|
27 |
+
a bělého
|
28 |
+
a bradáč
|
29 |
+
##number## ročníku
|
30 |
+
##number## dubna
|
31 |
+
a vivaldiho
|
32 |
+
v mečiara
|
33 |
+
c carrićre
|
34 |
+
##number## sjezd
|
35 |
+
##number## výroční
|
36 |
+
##number## kole
|
37 |
+
##number## narozenin
|
38 |
+
k maleevová
|
39 |
+
i čnfl
|
40 |
+
##number## pádě
|
41 |
+
##number## září
|
42 |
+
##number## výročí
|
43 |
+
a dvořáka
|
44 |
+
h g.
|
45 |
+
##number## ledna
|
46 |
+
a dvorský
|
47 |
+
h měsíc
|
48 |
+
##number## srpna
|
49 |
+
##number## tř.
|
50 |
+
a mozarta
|
51 |
+
##number## sudetoněmeckých
|
52 |
+
o sokolov
|
53 |
+
k škrach
|
54 |
+
v benda
|
55 |
+
##number## symfonie
|
56 |
+
##number## července
|
57 |
+
x šalda
|
58 |
+
c abrahama
|
59 |
+
a tichý
|
60 |
+
##number## místo
|
61 |
+
k bielecki
|
62 |
+
v havel
|
63 |
+
##number## etapu
|
64 |
+
a dubčeka
|
65 |
+
i liga
|
66 |
+
##number## světový
|
67 |
+
v klausem
|
68 |
+
##number## ženy
|
69 |
+
##number## létech
|
70 |
+
##number## minutě
|
71 |
+
##number## listopadem
|
72 |
+
##number## místě
|
73 |
+
o vlček
|
74 |
+
k peteraje
|
75 |
+
i sponzor
|
76 |
+
##number## června
|
77 |
+
##number## min.
|
78 |
+
##number## oprávněnou
|
79 |
+
##number## květnu
|
80 |
+
##number## aktu
|
81 |
+
##number## květnem
|
82 |
+
##number## října
|
83 |
+
i rynda
|
84 |
+
##number## února
|
85 |
+
i snfl
|
86 |
+
a mozart
|
87 |
+
z košler
|
88 |
+
a dvorskému
|
89 |
+
v marhoul
|
90 |
+
v mečiar
|
91 |
+
##number## ročník
|
92 |
+
##number## máje
|
93 |
+
v havla
|
94 |
+
k gott
|
95 |
+
s bacha
|
96 |
+
##number## ad
|
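Each collocation entry above is a pair of tokens, with `##number##` standing in for numeric tokens. The following is a minimal sketch of reading such a file into a set of pairs. It assumes the two fields on each line are separated by whitespace (a tab in the `.tab` file; the rendered diff only shows a gap), and the function name `parse_collocations` is hypothetical rather than part of NLTK.

```python
def parse_collocations(tab_text):
    """Parse lines holding two whitespace-separated fields into a set of pairs."""
    pairs = set()
    for line in tab_text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines instead of guessing at their meaning.
        if len(fields) == 2:
            pairs.add((fields[0], fields[1]))
    return pairs


# Three entries copied from the Czech list above, assumed tab-separated.
sample = "i\tdejmala\n##number##\tprosince\nv\tklaus\n"
collocations = parse_collocations(sample)
```

A set of tuples keeps lookups cheap when checking whether a token pair spanning a period belongs together.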
nltk_data/tokenizers/punkt_tab/czech/ortho_context.tab
ADDED
The diff for this file is too large to render.
See raw diff
|
|
nltk_data/tokenizers/punkt_tab/czech/sent_starters.txt
ADDED
@@ -0,0 +1,54 @@
|
1 |
+
já
|
2 |
+
milena
|
3 |
+
tomáš
|
4 |
+
oznámila
|
5 |
+
podle
|
6 |
+
my
|
7 |
+
vyplývá
|
8 |
+
hlavní
|
9 |
+
jelikož
|
10 |
+
musíme
|
11 |
+
kdyby
|
12 |
+
foto
|
13 |
+
rozptylové
|
14 |
+
snad
|
15 |
+
zároveň
|
16 |
+
jaroslav
|
17 |
+
po
|
18 |
+
v
|
19 |
+
kromě
|
20 |
+
pokud
|
21 |
+
toto
|
22 |
+
jenže
|
23 |
+
oba
|
24 |
+
jak
|
25 |
+
zatímco
|
26 |
+
ten
|
27 |
+
myslím
|
28 |
+
navíc
|
29 |
+
dušan
|
30 |
+
zdá
|
31 |
+
dnes
|
32 |
+
přesto
|
33 |
+
tato
|
34 |
+
ti
|
35 |
+
bratislava
|
36 |
+
ale
|
37 |
+
když
|
38 |
+
nicméně
|
39 |
+
tento
|
40 |
+
mirka
|
41 |
+
přitom
|
42 |
+
dokud
|
43 |
+
jan
|
44 |
+
bohužel
|
45 |
+
ta
|
46 |
+
díky
|
47 |
+
prohlásil
|
48 |
+
praha
|
49 |
+
jestliže
|
50 |
+
jde
|
51 |
+
vždyť
|
52 |
+
moskva
|
53 |
+
proto
|
54 |
+
to
|
nltk_data/tokenizers/punkt_tab/danish/abbrev_types.txt
ADDED
@@ -0,0 +1,211 @@
|
1 |
+
t
|
2 |
+
tlf
|
3 |
+
b.p
|
4 |
+
evt
|
5 |
+
j.h
|
6 |
+
lenz
|
7 |
+
mht
|
8 |
+
gl
|
9 |
+
bl
|
10 |
+
stud.polit
|
11 |
+
e.j
|
12 |
+
st
|
13 |
+
o
|
14 |
+
dec
|
15 |
+
mag
|
16 |
+
h.b
|
17 |
+
p
|
18 |
+
adm
|
19 |
+
el.lign
|
20 |
+
e.s
|
21 |
+
saalba
|
22 |
+
styrt
|
23 |
+
nr
|
24 |
+
m.a.s.h
|
25 |
+
etc
|
26 |
+
pharm
|
27 |
+
hg
|
28 |
+
j.j
|
29 |
+
dj
|
30 |
+
mountainb
|
31 |
+
f.kr
|
32 |
+
h.r
|
33 |
+
cand.jur
|
34 |
+
sp
|
35 |
+
osv
|
36 |
+
s.g
|
37 |
+
ndr
|
38 |
+
inc
|
39 |
+
b.i.g
|
40 |
+
dk-sver
|
41 |
+
sl
|
42 |
+
v.s.o.d
|
43 |
+
cand.mag
|
44 |
+
d.v.s
|
45 |
+
v.i
|
46 |
+
bøddel
|
47 |
+
fr
|
48 |
+
ø«
|
49 |
+
dr.phil
|
50 |
+
chr
|
51 |
+
p.d
|
52 |
+
bj
|
53 |
+
fhv
|
54 |
+
tilskudsforhold
|
55 |
+
m.a
|
56 |
+
sek
|
57 |
+
p.g.a
|
58 |
+
int
|
59 |
+
pokalf
|
60 |
+
ik
|
61 |
+
dir
|
62 |
+
em-lodtrækn
|
63 |
+
a.h
|
64 |
+
o.lign
|
65 |
+
p.t
|
66 |
+
m.v
|
67 |
+
n.j
|
68 |
+
m.h.t
|
69 |
+
m.m
|
70 |
+
a.p
|
71 |
+
pers
|
72 |
+
4-bakketurn
|
73 |
+
dr.med
|
74 |
+
w.ø
|
75 |
+
polit
|
76 |
+
fremsættes
|
77 |
+
techn
|
78 |
+
tidl
|
79 |
+
o.g
|
80 |
+
i.c.i
|
81 |
+
mill
|
82 |
+
skt
|
83 |
+
m.fl
|
84 |
+
cand.merc
|
85 |
+
kbh
|
86 |
+
indiv
|
87 |
+
stk
|
88 |
+
dk-maked
|
89 |
+
memorandum
|
90 |
+
mestersk
|
91 |
+
mag.art
|
92 |
+
kitzb
|
93 |
+
h
|
94 |
+
lic
|
95 |
+
fig
|
96 |
+
dressurst
|
97 |
+
sportsg
|
98 |
+
r.e.m
|
99 |
+
d.u.m
|
100 |
+
sct
|
101 |
+
kld
|
102 |
+
bl.a
|
103 |
+
hf
|
104 |
+
g.a
|
105 |
+
corp
|
106 |
+
w
|
107 |
+
konk
|
108 |
+
zoeterm
|
109 |
+
b.t
|
110 |
+
a.d
|
111 |
+
l.b
|
112 |
+
jf
|
113 |
+
s.b
|
114 |
+
kgl
|
115 |
+
ill
|
116 |
+
beck
|
117 |
+
tosset
|
118 |
+
afd
|
119 |
+
johs
|
120 |
+
pct
|
121 |
+
k.b
|
122 |
+
sv
|
123 |
+
verbalt
|
124 |
+
kgs
|
125 |
+
l.m.k
|
126 |
+
j.l
|
127 |
+
aus
|
128 |
+
superl
|
129 |
+
t.v
|
130 |
+
mia
|
131 |
+
kr
|
132 |
+
pr
|
133 |
+
præmien
|
134 |
+
j.b.s
|
135 |
+
j.o
|
136 |
+
o.s.v
|
137 |
+
edb-oplysninger
|
138 |
+
o.m.a
|
139 |
+
ca
|
140 |
+
1b
|
141 |
+
f.eks
|
142 |
+
rens
|
143 |
+
ch
|
144 |
+
mr
|
145 |
+
schw
|
146 |
+
d.c
|
147 |
+
utraditionelt
|
148 |
+
idrætsgym
|
149 |
+
hhv
|
150 |
+
e.l
|
151 |
+
s.s
|
152 |
+
eks
|
153 |
+
f.o.m
|
154 |
+
dk-storbrit
|
155 |
+
dk-jugo
|
156 |
+
n.z
|
157 |
+
derivater
|
158 |
+
c
|
159 |
+
pt
|
160 |
+
vm-kval
|
161 |
+
kl
|
162 |
+
hr
|
163 |
+
cand
|
164 |
+
jur
|
165 |
+
sav
|
166 |
+
h.c
|
167 |
+
arab.-danm
|
168 |
+
d.a.d
|
169 |
+
fl
|
170 |
+
o.a
|
171 |
+
a.s
|
172 |
+
cand.polit
|
173 |
+
grundejerform
|
174 |
+
j
|
175 |
+
faglærte
|
176 |
+
cr
|
177 |
+
a.a
|
178 |
+
mou
|
179 |
+
f.r.i
|
180 |
+
årh
|
181 |
+
o.m.m
|
182 |
+
sve
|
183 |
+
c.a
|
184 |
+
engl
|
185 |
+
sikkerhedssystemerne
|
186 |
+
m.f
|
187 |
+
j.k
|
188 |
+
phil
|
189 |
+
f
|
190 |
+
vet
|
191 |
+
mio
|
192 |
+
k.e
|
193 |
+
m.k
|
194 |
+
atla
|
195 |
+
idrætsg
|
196 |
+
n.n
|
197 |
+
4-bakketur
|
198 |
+
dvs
|
199 |
+
sdr
|
200 |
+
s.j
|
201 |
+
hol
|
202 |
+
s.h
|
203 |
+
pei
|
204 |
+
kbhvn
|
205 |
+
aa
|
206 |
+
m.g.i
|
207 |
+
fvt
|
208 |
+
i«
|
209 |
+
b.c
|
210 |
+
th
|
211 |
+
lrs
|
nltk_data/tokenizers/punkt_tab/danish/collocations.tab
ADDED
@@ -0,0 +1,101 @@
|
1 |
+
##number## skak
|
2 |
+
##number## speedway
|
3 |
+
##number## rally
|
4 |
+
##number## april
|
5 |
+
##number## dm-fin
|
6 |
+
##number## viceformand
|
7 |
+
m jensen
|
8 |
+
##number## kano/kajak
|
9 |
+
##number## bowling
|
10 |
+
##number## dm-finale
|
11 |
+
##number## årh.
|
12 |
+
##number## januar
|
13 |
+
##number## august
|
14 |
+
##number## marathon
|
15 |
+
##number## kamp
|
16 |
+
##number## skihop
|
17 |
+
##number## etage
|
18 |
+
##number## tennis
|
19 |
+
##number## cykling
|
20 |
+
e andersen
|
21 |
+
##number## december
|
22 |
+
g h.
|
23 |
+
##number## neb
|
24 |
+
##number## sektion
|
25 |
+
##number## afd.
|
26 |
+
##number## klasse
|
27 |
+
##number## trampolin
|
28 |
+
##number## bordtennis
|
29 |
+
##number## formel
|
30 |
+
##number## århundredes
|
31 |
+
##number## dm-semifin
|
32 |
+
##number## heks
|
33 |
+
##number## taekwondo
|
34 |
+
##number## galop
|
35 |
+
##number## basketball
|
36 |
+
##number## dm
|
37 |
+
m skræl
|
38 |
+
##number## trav
|
39 |
+
##number## provins
|
40 |
+
##number## triathlon
|
41 |
+
k axel
|
42 |
+
##number## rugby
|
43 |
+
s h.
|
44 |
+
##number## klaverkoncert
|
45 |
+
a p.
|
46 |
+
e løgstrup
|
47 |
+
k telefax
|
48 |
+
##number## gyldendal
|
49 |
+
##number## fodbold
|
50 |
+
e rosenfeldt
|
51 |
+
##number## oktober
|
52 |
+
k o.
|
53 |
+
##number## september
|
54 |
+
##number## dec.
|
55 |
+
##number## juledag
|
56 |
+
##number## badminton
|
57 |
+
##number## sejlsport
|
58 |
+
##number## håndbold
|
59 |
+
r førsund
|
60 |
+
e jørgensen
|
61 |
+
d ##number##
|
62 |
+
k e
|
63 |
+
##number## alp.ski
|
64 |
+
##number## judo
|
65 |
+
##number## roning
|
66 |
+
##number## november
|
67 |
+
##number## atletik
|
68 |
+
##number## århundrede
|
69 |
+
##number## ridning
|
70 |
+
##number## marts
|
71 |
+
m andersen
|
72 |
+
d roosevelt
|
73 |
+
##number## brydning
|
74 |
+
s kr.
|
75 |
+
##number## runde
|
76 |
+
##number## division
|
77 |
+
##number## sal
|
78 |
+
##number## boksning
|
79 |
+
##number## minut
|
80 |
+
##number## golf
|
81 |
+
##number## juni
|
82 |
+
##number## symfoni
|
83 |
+
##number## hurtigløb
|
84 |
+
k jørgensen
|
85 |
+
##number## jörgen
|
86 |
+
##number## klasses
|
87 |
+
e jacobsen
|
88 |
+
k jensen
|
89 |
+
##number## februar
|
90 |
+
k nielsen
|
91 |
+
##number## volleyball
|
92 |
+
##number## maj
|
93 |
+
##number## verdenskrig
|
94 |
+
##number## juli
|
95 |
+
##number## ishockey
|
96 |
+
##number## kunstskøjteløb
|
97 |
+
b jørgensen
|
98 |
+
##number## gymnastik
|
99 |
+
##number## svømning
|
100 |
+
##number## tw
|
101 |
+
i pedersens
|
nltk_data/tokenizers/punkt_tab/danish/ortho_context.tab
ADDED
The diff for this file is too large to render.
See raw diff
|
|