ebook2audiobookXTTS

drewThomasson committed on Oct 7, 2024

Commit f045c49 (verified)

1 Parent(s): 30fa9fe

Upload 115 files

This view is limited to 50 files because it contains too many changes.

Files changed (50)

nltk_data/tokenizers/punkt.zip +3 -0
nltk_data/tokenizers/punkt/PY3/README +98 -0
nltk_data/tokenizers/punkt/PY3/czech.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/danish.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/dutch.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/english.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/estonian.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/finnish.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/french.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/german.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/greek.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/italian.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/malayalam.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/norwegian.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/polish.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/portuguese.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/russian.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/slovene.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/spanish.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/swedish.pickle +3 -0
nltk_data/tokenizers/punkt/PY3/turkish.pickle +3 -0
nltk_data/tokenizers/punkt/README +98 -0
nltk_data/tokenizers/punkt/czech.pickle +3 -0
nltk_data/tokenizers/punkt/danish.pickle +3 -0
nltk_data/tokenizers/punkt/dutch.pickle +3 -0
nltk_data/tokenizers/punkt/english.pickle +3 -0
nltk_data/tokenizers/punkt/estonian.pickle +3 -0
nltk_data/tokenizers/punkt/finnish.pickle +3 -0
nltk_data/tokenizers/punkt/french.pickle +3 -0
nltk_data/tokenizers/punkt/german.pickle +3 -0
nltk_data/tokenizers/punkt/greek.pickle +3 -0
nltk_data/tokenizers/punkt/italian.pickle +3 -0
nltk_data/tokenizers/punkt/malayalam.pickle +3 -0
nltk_data/tokenizers/punkt/norwegian.pickle +3 -0
nltk_data/tokenizers/punkt/polish.pickle +3 -0
nltk_data/tokenizers/punkt/portuguese.pickle +3 -0
nltk_data/tokenizers/punkt/russian.pickle +3 -0
nltk_data/tokenizers/punkt/slovene.pickle +3 -0
nltk_data/tokenizers/punkt/spanish.pickle +3 -0
nltk_data/tokenizers/punkt/swedish.pickle +3 -0
nltk_data/tokenizers/punkt/turkish.pickle +3 -0
nltk_data/tokenizers/punkt_tab.zip +3 -0
nltk_data/tokenizers/punkt_tab/README +98 -0
nltk_data/tokenizers/punkt_tab/czech/abbrev_types.txt +118 -0
nltk_data/tokenizers/punkt_tab/czech/collocations.tab +96 -0
nltk_data/tokenizers/punkt_tab/czech/ortho_context.tab +0 -0
nltk_data/tokenizers/punkt_tab/czech/sent_starters.txt +54 -0
nltk_data/tokenizers/punkt_tab/danish/abbrev_types.txt +211 -0
nltk_data/tokenizers/punkt_tab/danish/collocations.tab +101 -0
nltk_data/tokenizers/punkt_tab/danish/ortho_context.tab +0 -0

nltk_data/tokenizers/punkt.zip ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:51c3078994aeaf650bfc8e028be4fb42b4a0d177d41c012b6a983979653660ec
+size 13905355
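
Note: the three `+` lines above are a Git LFS pointer, not the archive itself; the same three-field layout (`version`, `oid`, `size`) recurs for every `.zip` and `.pickle` entry in this commit. As a minimal sketch, a pointer could be parsed and a downloaded file checked against its recorded size and digest roughly as follows. The helper names `parse_lfs_pointer` and `blob_matches_pointer` are hypothetical and are not part of Git LFS or of this repository.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for raw in text.splitlines():
        line = raw.lstrip("+").strip()  # tolerate a diff's leading '+'
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def blob_matches_pointer(blob_path, pointer: dict) -> bool:
    """Return True if the file's size and hash both match the pointer."""
    path = Path(blob_path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as fh:
        # Hash in 1 MiB chunks so large archives are not loaded at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer text copied from the punkt.zip entry above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:51c3078994aeaf650bfc8e028be4fb42b4a0d177d41c012b6a983979653660ec
size 13905355
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Checking the size before hashing is only a cheap early exit; the digest comparison is what actually establishes that the file matches.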

nltk_data/tokenizers/punkt/PY3/README ADDED

	@@ -0,0 +1,98 @@

+Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
+Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
+been contributed by various people using NLTK for sentence boundary detection.
+For information about how to use these models, please confer the tokenization HOWTO:
+http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
+and chapter 3.8 of the NLTK book:
+http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
+There are pretrained tokenizers for the following languages:
+File                Language            Source                             Contents                Size of training corpus(in tokens)           Model contributed by
+=======================================================================================================================================================================
+czech.pickle        Czech               Multilingual Corpus 1 (ECI)        Lidove Noviny                   ~345,000                             Jan Strunk / Tibor Kiss
+                                                                           Literarni Noviny
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+danish.pickle       Danish              Avisdata CD-Rom Ver. 1.1. 1995     Berlingske Tidende              ~550,000                             Jan Strunk / Tibor Kiss
+                                        (Berlingske Avisdata, Copenhagen)  Weekend Avisen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+dutch.pickle        Dutch               Multilingual Corpus 1 (ECI)        De Limburger                    ~340,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+english.pickle      English             Penn Treebank (LDC)                Wall Street Journal             ~469,000                             Jan Strunk / Tibor Kiss
+                    (American)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+estonian.pickle     Estonian            University of Tartu, Estonia       Eesti Ekspress                  ~359,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+finnish.pickle      Finnish             Finnish Parole Corpus, Finnish     Books and major national        ~364,000                             Jan Strunk / Tibor Kiss
+                                        Text Bank (Suomen Kielen           newspapers
+                                        Tekstipankki)
+                                        Finnish Center for IT Science
+                                        (CSC)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+french.pickle       French              Multilingual Corpus 1 (ECI)        Le Monde                        ~370,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+german.pickle       German              Neue Zürcher Zeitung AG            Neue Zürcher Zeitung            ~847,000                             Jan Strunk / Tibor Kiss
+                    (Switzerland)       CD-ROM
+                    (Uses "ss"
+                     instead of "ß")
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+greek.pickle        Greek               Efstathios Stamatatos              To Vima (TO BHMA)               ~227,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+italian.pickle      Italian             Multilingual Corpus 1 (ECI)        La Stampa, Il Mattino           ~312,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+norwegian.pickle    Norwegian           Centre for Humanities              Bergens Tidende                 ~479,000                             Jan Strunk / Tibor Kiss
+                    (Bokmål and         Information Technologies,
+                     Nynorsk)           Bergen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+polish.pickle       Polish              Polish National Corpus             Literature, newspapers, etc.  ~1,000,000                             Krzysztof Langner
+                                        (http://www.nkjp.pl/)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+portuguese.pickle   Portuguese          CETENFolha Corpus                  Folha de São Paulo              ~321,000                             Jan Strunk / Tibor Kiss
+                    (Brazilian)         (Linguateca)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+slovene.pickle      Slovene             TRACTOR                            Delo                            ~354,000                             Jan Strunk / Tibor Kiss
+                                        Slovene Academy for Arts
+                                        and Sciences
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+spanish.pickle      Spanish             Multilingual Corpus 1 (ECI)        Sur                             ~353,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+swedish.pickle      Swedish             Multilingual Corpus 1 (ECI)        Dagens Nyheter                  ~339,000                             Jan Strunk / Tibor Kiss
+                                                                           (and some other texts)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+turkish.pickle      Turkish             METU Turkish Corpus                Milliyet                        ~333,000                             Jan Strunk / Tibor Kiss
+                                        (Türkçe Derlem Projesi)
+                                        University of Ankara
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
+Unicode using the codecs module.
+Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+Computational Linguistics 32: 485-525.
+---- Training Code ----
+# import punkt
+import nltk.tokenize.punkt
+# Make a new Tokenizer
+tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
+# Read in training corpus (one example: Slovene)
+import codecs
+text = codecs.open("slovene.plain","Ur","iso-8859-2").read()
+# Train tokenizer
+tokenizer.train(text)
+# Dump pickled tokenizer
+import pickle
+out = open("slovene.pickle","wb")
+pickle.dump(tokenizer, out)
+out.close()
+---------
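
Note: the training snippet in this README predates current Python 3. In particular, `codecs.open(..., "Ur", ...)` relies on the `U` mode flag, which was removed in Python 3.11. A minimal Python 3 sketch of the same steps follows, assuming NLTK is installed and a plain-text corpus such as `slovene.plain` exists locally. The helper names `read_corpus` and `train_and_pickle` are hypothetical, and `nltk` is imported lazily so that the file-reading helper can be used on its own.

```python
import pickle
from pathlib import Path


def read_corpus(path, encoding="iso-8859-2"):
    """Read the training corpus and decode it to a Unicode str."""
    # read_text opens the file in text mode, where universal-newline
    # handling is the default; that is what the old "U" flag requested.
    return Path(path).read_text(encoding=encoding)


def train_and_pickle(corpus_path, model_path, encoding="iso-8859-2"):
    """Train a Punkt sentence tokenizer and pickle it (requires NLTK)."""
    from nltk.tokenize.punkt import PunktSentenceTokenizer  # lazy import

    tokenizer = PunktSentenceTokenizer()
    tokenizer.train(read_corpus(corpus_path, encoding))
    # The context manager closes the output file even if dumping fails.
    with open(model_path, "wb") as out:
        pickle.dump(tokenizer, out)
    return tokenizer


# Hypothetical usage, mirroring the README's Slovene example:
# train_and_pickle("slovene.plain", "slovene.pickle")
```

Only unpickle model files from sources you trust, since `pickle.load` can execute arbitrary code.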

nltk_data/tokenizers/punkt/PY3/czech.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:64b0734b6fbe8e8d7cac79f48d1dd9f853824e57c4e3594dadd74ba2c1d97f50
+size 1119050

nltk_data/tokenizers/punkt/PY3/danish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6189c7dd254e29e2bd406a7f6a4336297c8953214792466a790ea4444223ceb3
+size 1191710

nltk_data/tokenizers/punkt/PY3/dutch.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fda0d6a13f02e8898daec7fe923da88e25abe081bcfa755c0e015075c215fe4c
+size 693759

nltk_data/tokenizers/punkt/PY3/english.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5cad3758596392364e3be9803dbd7ebeda384b68937b488a01365f5551bb942c
+size 406697

nltk_data/tokenizers/punkt/PY3/estonian.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b364f72538d17b146a98009ad239a8096ce6c0a8b02958c0bc776ecd0c58a25f
+size 1499502

nltk_data/tokenizers/punkt/PY3/finnish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6a4b5ff5500ee851c456f9dd40d5fc0d8c1859c88eb3178de1317d26b7d22833
+size 1852226

nltk_data/tokenizers/punkt/PY3/french.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:28e3a4cd2971989b3cb9fd3433a6f15d17981e464db2be039364313b5de94f29
+size 553575

nltk_data/tokenizers/punkt/PY3/german.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ddcbbe85e2042a019b1a6e37fd8c153286c38ba201fae0f5bfd9a3f74abae25c
+size 1463575

nltk_data/tokenizers/punkt/PY3/greek.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:85dabc44ab90a5f208ef37ff6b4892ebe7e740f71fb4da47cfd95417ca3e22fd
+size 876006

nltk_data/tokenizers/punkt/PY3/italian.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:68a94007b1e4ffdc4d1a190185ca5442c3dafeb17ab39d30329e84cd74a43947
+size 615089

nltk_data/tokenizers/punkt/PY3/malayalam.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1f8cf58acbdb7f472ac40affc13663be42dafb47c15030c11ade0444c9e0e53d
+size 221207

nltk_data/tokenizers/punkt/PY3/norwegian.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4ff7a46d1438b311457d15d7763060b8d3270852c1850fd788c5cee194dc4a1d
+size 1181271

nltk_data/tokenizers/punkt/PY3/polish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:624900ae3ddfb4854a98c5d3b8b1c9bb719975f33fee61ce1441dab9f8a00718
+size 1738386

nltk_data/tokenizers/punkt/PY3/portuguese.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:02a0b7b25c3c7471e1791b66a31bbb530afbb0160aee4fcecf0107652067b4a1
+size 611919

nltk_data/tokenizers/punkt/PY3/russian.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:549762f8190024d89b511472df21a3a135eee5d9233e63ac244db737c2c61d7e
+size 33020

nltk_data/tokenizers/punkt/PY3/slovene.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:52ef2cc0ed27d79b3aa635cbbc40ad811883a75a4b8a8be1ae406972870fd864
+size 734444

nltk_data/tokenizers/punkt/PY3/spanish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:164a50fadc5a49f8ec7426eae11d3111ee752b48a3ef373d47745011192a5984
+size 562337

nltk_data/tokenizers/punkt/PY3/swedish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b0f7d538bfd5266633b09e842cd92e9e0ac10f1d923bf211e1497972ddc47318
+size 979681

nltk_data/tokenizers/punkt/PY3/turkish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ae68ef5863728ac5332e87eb1f6bae772ff32a13a4caa2b01a5c68103e853c5b
+size 1017038

nltk_data/tokenizers/punkt/README ADDED

	@@ -0,0 +1,98 @@

+Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
+Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
+been contributed by various people using NLTK for sentence boundary detection.
+For information about how to use these models, please confer the tokenization HOWTO:
+http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
+and chapter 3.8 of the NLTK book:
+http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
+There are pretrained tokenizers for the following languages:
+File                Language            Source                             Contents                Size of training corpus(in tokens)           Model contributed by
+=======================================================================================================================================================================
+czech.pickle        Czech               Multilingual Corpus 1 (ECI)        Lidove Noviny                   ~345,000                             Jan Strunk / Tibor Kiss
+                                                                           Literarni Noviny
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+danish.pickle       Danish              Avisdata CD-Rom Ver. 1.1. 1995     Berlingske Tidende              ~550,000                             Jan Strunk / Tibor Kiss
+                                        (Berlingske Avisdata, Copenhagen)  Weekend Avisen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+dutch.pickle        Dutch               Multilingual Corpus 1 (ECI)        De Limburger                    ~340,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+english.pickle      English             Penn Treebank (LDC)                Wall Street Journal             ~469,000                             Jan Strunk / Tibor Kiss
+                    (American)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+estonian.pickle     Estonian            University of Tartu, Estonia       Eesti Ekspress                  ~359,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+finnish.pickle      Finnish             Finnish Parole Corpus, Finnish     Books and major national        ~364,000                             Jan Strunk / Tibor Kiss
+                                        Text Bank (Suomen Kielen           newspapers
+                                        Tekstipankki)
+                                        Finnish Center for IT Science
+                                        (CSC)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+french.pickle       French              Multilingual Corpus 1 (ECI)        Le Monde                        ~370,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+german.pickle       German              Neue Zürcher Zeitung AG            Neue Zürcher Zeitung            ~847,000                             Jan Strunk / Tibor Kiss
+                    (Switzerland)       CD-ROM
+                    (Uses "ss"
+                     instead of "ß")
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+greek.pickle        Greek               Efstathios Stamatatos              To Vima (TO BHMA)               ~227,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+italian.pickle      Italian             Multilingual Corpus 1 (ECI)        La Stampa, Il Mattino           ~312,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+norwegian.pickle    Norwegian           Centre for Humanities              Bergens Tidende                 ~479,000                             Jan Strunk / Tibor Kiss
+                    (Bokmål and         Information Technologies,
+                     Nynorsk)           Bergen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+polish.pickle       Polish              Polish National Corpus             Literature, newspapers, etc.  ~1,000,000                             Krzysztof Langner
+                                        (http://www.nkjp.pl/)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+portuguese.pickle   Portuguese          CETENFolha Corpus                  Folha de São Paulo              ~321,000                             Jan Strunk / Tibor Kiss
+                    (Brazilian)         (Linguateca)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+slovene.pickle      Slovene             TRACTOR                            Delo                            ~354,000                             Jan Strunk / Tibor Kiss
+                                        Slovene Academy for Arts
+                                        and Sciences
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+spanish.pickle      Spanish             Multilingual Corpus 1 (ECI)        Sur                             ~353,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+swedish.pickle      Swedish             Multilingual Corpus 1 (ECI)        Dagens Nyheter                  ~339,000                             Jan Strunk / Tibor Kiss
+                                                                           (and some other texts)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+turkish.pickle      Turkish             METU Turkish Corpus                Milliyet                        ~333,000                             Jan Strunk / Tibor Kiss
+                                        (Türkçe Derlem Projesi)
+                                        University of Ankara
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
+Unicode using the codecs module.
+Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+Computational Linguistics 32: 485-525.
+---- Training Code ----
+# import punkt
+import nltk.tokenize.punkt
+# Make a new Tokenizer
+tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
+# Read in training corpus (one example: Slovene)
+import codecs
+text = codecs.open("slovene.plain","Ur","iso-8859-2").read()
+# Train tokenizer
+tokenizer.train(text)
+# Dump pickled tokenizer
+import pickle
+out = open("slovene.pickle","wb")
+pickle.dump(tokenizer, out)
+out.close()
+---------

nltk_data/tokenizers/punkt/czech.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5ba73d293c7d7953956bcf02f3695ec5c1f0d527f2a3c38097f5593394fa1690
+size 1265552

nltk_data/tokenizers/punkt/danish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ea29760a0a9197f52ca59e78aeafc5a6f55d05258faf7db1709b2b9eb321ef20
+size 1264725

nltk_data/tokenizers/punkt/dutch.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4a8e26b3d68c45c38e594d19e2d5677447bfdcaa636d3b1e7acfed0e9272d73c
+size 742624

nltk_data/tokenizers/punkt/english.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dda37972ae88998a6fd3e3ec002697a6bd362b32d050fda7d7ca5276873092aa
+size 433305

nltk_data/tokenizers/punkt/estonian.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3867fee26a36bdb197c64362aa13ac683f5f33fa4d0d225a5d56707582a55a1d
+size 1596714

nltk_data/tokenizers/punkt/finnish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1a9e17b3d5b4df76345d812b8a65b1da0767eda5086eadcc11e625eef0942835
+size 1951656

nltk_data/tokenizers/punkt/french.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:de05f3d5647d3d2296626fb83f68428e4c6ad6e05a00ed4694c8bdc8f2f197ee
+size 583482

nltk_data/tokenizers/punkt/german.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eab497fa085413130c8fd0fb13b929128930afe2f6a26ea8715c95df7088e97c
+size 1526714

nltk_data/tokenizers/punkt/greek.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:21752a6762fad5cfe46fb5c45fad9a85484a0e8e81c67e6af6fb973cfc27d67c
+size 1953106

nltk_data/tokenizers/punkt/italian.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dcb2717d7be5f26e860a92e05acf69b1123a5f4527cd7a269a9ab9e9e668c805
+size 658331

nltk_data/tokenizers/punkt/malayalam.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1f8cf58acbdb7f472ac40affc13663be42dafb47c15030c11ade0444c9e0e53d
+size 221207

nltk_data/tokenizers/punkt/norwegian.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e4a97f8f9a03a0338dd746bcc89a0ae0f54ae43b835fa37d83e279e1ca794faf
+size 1259779

nltk_data/tokenizers/punkt/polish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:16127b6d10933427a3e90fb20e9be53e1fb371ff79a730c1030734ed80b90c92
+size 2042451

nltk_data/tokenizers/punkt/portuguese.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bb01bf7c79a4eadc2178bbd209665139a0e4b38f2d1c44fef097de93955140e0
+size 649051

nltk_data/tokenizers/punkt/russian.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bc984432fbe31f7000014f8047502476889169c60f09be5413ca09276b16c909
+size 33027

nltk_data/tokenizers/punkt/slovene.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7dac650212b3787b39996c01bd2084115493e6f6ec390bab61f767525b08b8ea
+size 832867

nltk_data/tokenizers/punkt/spanish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:271dc6027c4aae056f72a9bfab5645cf67e198bf4f972895844e40f5989ccdc3
+size 597831

nltk_data/tokenizers/punkt/swedish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:40d50ebdad6caa87715f2e300b1217ec92c42de205a543cc4a56903bd2c9acfa
+size 1034496

nltk_data/tokenizers/punkt/turkish.pickle ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d3ae47d76501d027698809d12e75292c9c392910488543342802f95db9765ccc
+size 1225013

nltk_data/tokenizers/punkt_tab.zip ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c2b16c23d738effbdc5789d7aa601397c13ba2819bf922fb904687f3f16657ed
+size 4259017

nltk_data/tokenizers/punkt_tab/README ADDED

	@@ -0,0 +1,98 @@

+Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
+Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
+been contributed by various people using NLTK for sentence boundary detection.
+For information about how to use these models, please confer the tokenization HOWTO:
+http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
+and chapter 3.8 of the NLTK book:
+http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
+There are pretrained tokenizers for the following languages:
+File                Language            Source                             Contents                Size of training corpus(in tokens)           Model contributed by
+=======================================================================================================================================================================
+czech.pickle        Czech               Multilingual Corpus 1 (ECI)        Lidove Noviny                   ~345,000                             Jan Strunk / Tibor Kiss
+                                                                           Literarni Noviny
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+danish.pickle       Danish              Avisdata CD-Rom Ver. 1.1. 1995     Berlingske Tidende              ~550,000                             Jan Strunk / Tibor Kiss
+                                        (Berlingske Avisdata, Copenhagen)  Weekend Avisen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+dutch.pickle        Dutch               Multilingual Corpus 1 (ECI)        De Limburger                    ~340,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+english.pickle      English             Penn Treebank (LDC)                Wall Street Journal             ~469,000                             Jan Strunk / Tibor Kiss
+                    (American)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+estonian.pickle     Estonian            University of Tartu, Estonia       Eesti Ekspress                  ~359,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+finnish.pickle      Finnish             Finnish Parole Corpus, Finnish     Books and major national        ~364,000                             Jan Strunk / Tibor Kiss
+                                        Text Bank (Suomen Kielen           newspapers
+                                        Tekstipankki)
+                                        Finnish Center for IT Science
+                                        (CSC)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+french.pickle       French              Multilingual Corpus 1 (ECI)        Le Monde                        ~370,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+german.pickle       German              Neue Zürcher Zeitung AG            Neue Zürcher Zeitung            ~847,000                             Jan Strunk / Tibor Kiss
+                    (Switzerland)       CD-ROM
+                    (Uses "ss"
+                     instead of "ß")
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+greek.pickle        Greek               Efstathios Stamatatos              To Vima (TO BHMA)               ~227,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+italian.pickle      Italian             Multilingual Corpus 1 (ECI)        La Stampa, Il Mattino           ~312,000                             Jan Strunk / Tibor Kiss
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+norwegian.pickle    Norwegian           Centre for Humanities              Bergens Tidende                 ~479,000                             Jan Strunk / Tibor Kiss
+                    (Bokmål and         Information Technologies,
+                     Nynorsk)           Bergen
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+polish.pickle       Polish              Polish National Corpus             Literature, newspapers, etc.  ~1,000,000                             Krzysztof Langner
+                                        (http://www.nkjp.pl/)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+portuguese.pickle   Portuguese          CETENFolha Corpus                  Folha de São Paulo              ~321,000                             Jan Strunk / Tibor Kiss
+                    (Brazilian)         (Linguateca)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+slovene.pickle      Slovene             TRACTOR                            Delo                            ~354,000                             Jan Strunk / Tibor Kiss
+                                        Slovene Academy for Arts
+                                        and Sciences
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+spanish.pickle      Spanish             Multilingual Corpus 1 (ECI)        Sur                             ~353,000                             Jan Strunk / Tibor Kiss
+                    (European)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+swedish.pickle      Swedish             Multilingual Corpus 1 (ECI)        Dagens Nyheter                  ~339,000                             Jan Strunk / Tibor Kiss
+                                                                           (and some other texts)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+turkish.pickle      Turkish             METU Turkish Corpus                Milliyet                        ~333,000                             Jan Strunk / Tibor Kiss
+                                        (Türkçe Derlem Projesi)
+                                        University of Ankara
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
+Unicode using the codecs module.
+Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+Computational Linguistics 32: 485-525.
+---- Training Code ----
+# Import the Punkt module
+import nltk.tokenize.punkt
+# Make a new tokenizer
+tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
+# Read in the training corpus (one example: Slovene, encoded as ISO-8859-2)
+import codecs
+with codecs.open("slovene.plain", "r", "iso-8859-2") as f:
+    text = f.read()
+# Train the tokenizer
+tokenizer.train(text)
+# Dump the pickled tokenizer
+import pickle
+with open("slovene.pickle", "wb") as out:
+    pickle.dump(tokenizer, out)
+---------
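
The README states that the corpora were converted to Unicode using the codecs module. A minimal sketch of that conversion step follows, assuming ISO-8859-2 input (the encoding named in the training code). The helper name `to_unicode` and the sample sentence are made up for illustration and are not part of the README.

```python
import codecs


def to_unicode(raw: bytes, encoding: str = "iso-8859-2") -> str:
    """Decode bytes in a legacy encoding into a Unicode string."""
    return codecs.decode(raw, encoding)


# Made-up sample text, encoded as ISO-8859-2 to stand in for a corpus file.
raw = "Živjo. Kako si?".encode("iso-8859-2")
text = to_unicode(raw)

# Once decoded, the text can be re-encoded as UTF-8 for storage.
utf8_bytes = text.encode("utf-8")
```

The decoded string is what a trainer would consume; the byte lengths differ between the two encodings because "Ž" takes one byte in ISO-8859-2 and two in UTF-8.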

nltk_data/tokenizers/punkt_tab/czech/abbrev_types.txt ADDED

	@@ -0,0 +1,118 @@

+t
+množ
+např
+j.h
+man
+ú
+jug
+dr
+bl
+ml
+okr
+st
+uh
+šp
+judr
+u.s.a
+p
+arg
+žitě
+st.celsia
+etc
+p.s
+t.r
+lok
+mil
+ict
+n
+tl
+min
+č
+d
+al
+ravenně
+mj
+nar
+plk
+s.p
+a.g
+roč
+b
+zdi
+r.s.c
+přek
+m
+gen
+csc
+mudr
+vic
+š
+sb
+resp
+tzn
+iv
+s.r.o
+mar
+w
+čs
+vi
+tzv
+ul
+pen
+zv
+str
+čp
+org
+rak
+sv
+pplk
+u.s
+prof
+c.k
+op
+g
+vii
+kr
+ing
+j.o
+drsc
+m3
+l
+tr
+ceo
+ch
+fuk
+vl
+viii
+líp
+hl.m
+t.zv
+phdr
+o.k
+tis
+doc
+kl
+ard
+čkd
+pok
+apod
+r
+př
+a.s
+j
+jr
+i.m
+e
+kupř
+f
+tř
+xvi
+mir
+atď
+vr
+r.i.v
+hl
+kv
+t.j
+y
+q.p.r
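
The file above lists one lowercased abbreviation type per line, without its final period. As a rough sketch of how such a list could be consulted, the snippet below loads a few entries copied from the list and checks whether a token ending in a period is a listed abbreviation. The function names are hypothetical, and this is not a description of how NLTK itself reads or applies the file.

```python
# A few entries copied from the list above; in practice the whole file is read.
ABBREV_SAMPLE = "dr\nprof\nnapř\ns.r.o\ntzv\n"


def load_abbrev_types(text):
    """Return the set of abbreviation types, one per non-empty line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def is_known_abbreviation(token, abbrev_types):
    """True if the token, minus its final period, is a listed type."""
    if not token.endswith("."):
        return False
    return token[:-1].lower() in abbrev_types


abbrevs = load_abbrev_types(ABBREV_SAMPLE)
```

Under these assumptions, a period after "Prof." or "s.r.o." would be treated as part of an abbreviation, while a period after "Praha." would not.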

nltk_data/tokenizers/punkt_tab/czech/collocations.tab ADDED

	@@ -0,0 +1,96 @@

+i	dejmala
+##number##	prosince
+h	steina
+##number##	listopadu
+a	dvořák
+v	klaus
+i	čnhl
+##number##	wladyslawowo
+##number##	letech
+a	jiráska
+a	dubček
+##number##	štrasburk
+##number##	juniorské
+##number##	století
+##number##	kola
+##number##	pád
+##number##	května
+##number##	týdne
+v	dlouhý
+k	design
+##number##	červenec
+i	ligy
+##number##	kolo
+z	svěrák
+##number##	mája
+##number##	šimková
+a	bělého
+a	bradáč
+##number##	ročníku
+##number##	dubna
+a	vivaldiho
+v	mečiara
+c	carrićre
+##number##	sjezd
+##number##	výroční
+##number##	kole
+##number##	narozenin
+k	maleevová
+i	čnfl
+##number##	pádě
+##number##	září
+##number##	výročí
+a	dvořáka
+h	g.
+##number##	ledna
+a	dvorský
+h	měsíc
+##number##	srpna
+##number##	tř.
+a	mozarta
+##number##	sudetoněmeckých
+o	sokolov
+k	škrach
+v	benda
+##number##	symfonie
+##number##	července
+x	šalda
+c	abrahama
+a	tichý
+##number##	místo
+k	bielecki
+v	havel
+##number##	etapu
+a	dubčeka
+i	liga
+##number##	světový
+v	klausem
+##number##	ženy
+##number##	létech
+##number##	minutě
+##number##	listopadem
+##number##	místě
+o	vlček
+k	peteraje
+i	sponzor
+##number##	června
+##number##	min.
+##number##	oprávněnou
+##number##	květnu
+##number##	aktu
+##number##	květnem
+##number##	října
+i	rynda
+##number##	února
+i	snfl
+a	mozart
+z	košler
+a	dvorskému
+v	marhoul
+v	mečiar
+##number##	ročník
+##number##	máje
+v	havla
+k	gott
+s	bacha
+##number##	ad
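
Each line of the file above holds two tab-separated token types, with `##number##` standing in for numeric tokens. The sketch below parses a few lines copied from the list and checks whether the token before a period and the token after it form a listed pair. The function names and the matching rule are assumptions made for illustration, not NLTK's actual loading or lookup code.

```python
NUMBER = "##number##"

# A few lines copied from the list above (fields are tab-separated).
SAMPLE = "##number##\tprosince\na\tdvořák\nv\thavel\n"


def load_collocations(text):
    """Parse tab-separated pairs into a set of (first, second) tuples."""
    pairs = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        first, second = line.split("\t", 1)
        pairs.add((first, second))
    return pairs


def is_collocation(before, after, pairs):
    """Check whether (token before the period, next token) is a listed pair."""
    first = before[:-1] if before.endswith(".") else before
    # Numeric tokens are represented by the placeholder type.
    first = NUMBER if first.isdigit() else first.lower()
    return (first, after.lower()) in pairs


pairs = load_collocations(SAMPLE)
```

With this sample, "24. prosince" and "V. Havel" both match a listed pair, which suggests the period between them does not end a sentence.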

nltk_data/tokenizers/punkt_tab/czech/ortho_context.tab ADDED

The diff for this file is too large to render. See raw diff

nltk_data/tokenizers/punkt_tab/czech/sent_starters.txt ADDED

	@@ -0,0 +1,54 @@

+já
+milena
+tomáš
+oznámila
+podle
+my
+vyplývá
+hlavní
+jelikož
+musíme
+kdyby
+foto
+rozptylové
+snad
+zároveň
+jaroslav
+po
+v
+kromě
+pokud
+toto
+jenže
+oba
+jak
+zatímco
+ten
+myslím
+navíc
+dušan
+zdá
+dnes
+přesto
+tato
+ti
+bratislava
+ale
+když
+nicméně
+tento
+mirka
+přitom
+dokud
+jan
+bohužel
+ta
+díky
+prohlásil
+praha
+jestliže
+jde
+vždyť
+moskva
+proto
+to

nltk_data/tokenizers/punkt_tab/danish/abbrev_types.txt ADDED

	@@ -0,0 +1,211 @@

+t
+tlf
+b.p
+evt
+j.h
+lenz
+mht
+gl
+bl
+stud.polit
+e.j
+st
+o
+dec
+mag
+h.b
+p
+adm
+el.lign
+e.s
+saalba
+styrt
+nr
+m.a.s.h
+etc
+pharm
+hg
+j.j
+dj
+mountainb
+f.kr
+h.r
+cand.jur
+sp
+osv
+s.g
+ndr
+inc
+b.i.g
+dk-sver
+sl
+v.s.o.d
+cand.mag
+d.v.s
+v.i
+bøddel
+fr
+ø«
+dr.phil
+chr
+p.d
+bj
+fhv
+tilskudsforhold
+m.a
+sek
+p.g.a
+int
+pokalf
+ik
+dir
+em-lodtrækn
+a.h
+o.lign
+p.t
+m.v
+n.j
+m.h.t
+m.m
+a.p
+pers
+4-bakketurn
+dr.med
+w.ø
+polit
+fremsættes
+techn
+tidl
+o.g
+i.c.i
+mill
+skt
+m.fl
+cand.merc
+kbh
+indiv
+stk
+dk-maked
+memorandum
+mestersk
+mag.art
+kitzb
+h
+lic
+fig
+dressurst
+sportsg
+r.e.m
+d.u.m
+sct
+kld
+bl.a
+hf
+g.a
+corp
+w
+konk
+zoeterm
+b.t
+a.d
+l.b
+jf
+s.b
+kgl
+ill
+beck
+tosset
+afd
+johs
+pct
+k.b
+sv
+verbalt
+kgs
+l.m.k
+j.l
+aus
+superl
+t.v
+mia
+kr
+pr
+præmien
+j.b.s
+j.o
+o.s.v
+edb-oplysninger
+o.m.a
+ca
+1b
+f.eks
+rens
+ch
+mr
+schw
+d.c
+utraditionelt
+idrætsgym
+hhv
+e.l
+s.s
+eks
+f.o.m
+dk-storbrit
+dk-jugo
+n.z
+derivater
+c
+pt
+vm-kval
+kl
+hr
+cand
+jur
+sav
+h.c
+arab.-danm
+d.a.d
+fl
+o.a
+a.s
+cand.polit
+grundejerform
+j
+faglærte
+cr
+a.a
+mou
+f.r.i
+årh
+o.m.m
+sve
+c.a
+engl
+sikkerhedssystemerne
+m.f
+j.k
+phil
+f
+vet
+mio
+k.e
+m.k
+atla
+idrætsg
+n.n
+4-bakketur
+dvs
+sdr
+s.j
+hol
+s.h
+pei
+kbhvn
+aa
+m.g.i
+fvt
+i«
+b.c
+th
+lrs

nltk_data/tokenizers/punkt_tab/danish/collocations.tab ADDED

	@@ -0,0 +1,101 @@

+##number##	skak
+##number##	speedway
+##number##	rally
+##number##	april
+##number##	dm-fin
+##number##	viceformand
+m	jensen
+##number##	kano/kajak
+##number##	bowling
+##number##	dm-finale
+##number##	årh.
+##number##	januar
+##number##	august
+##number##	marathon
+##number##	kamp
+##number##	skihop
+##number##	etage
+##number##	tennis
+##number##	cykling
+e	andersen
+##number##	december
+g	h.
+##number##	neb
+##number##	sektion
+##number##	afd.
+##number##	klasse
+##number##	trampolin
+##number##	bordtennis
+##number##	formel
+##number##	århundredes
+##number##	dm-semifin
+##number##	heks
+##number##	taekwondo
+##number##	galop
+##number##	basketball
+##number##	dm
+m	skræl
+##number##	trav
+##number##	provins
+##number##	triathlon
+k	axel
+##number##	rugby
+s	h.
+##number##	klaverkoncert
+a	p.
+e	løgstrup
+k	telefax
+##number##	gyldendal
+##number##	fodbold
+e	rosenfeldt
+##number##	oktober
+k	o.
+##number##	september
+##number##	dec.
+##number##	juledag
+##number##	badminton
+##number##	sejlsport
+##number##	håndbold
+r	førsund
+e	jørgensen
+d	##number##
+k	e
+##number##	alp.ski
+##number##	judo
+##number##	roning
+##number##	november
+##number##	atletik
+##number##	århundrede
+##number##	ridning
+##number##	marts
+m	andersen
+d	roosevelt
+##number##	brydning
+s	kr.
+##number##	runde
+##number##	division
+##number##	sal
+##number##	boksning
+##number##	minut
+##number##	golf
+##number##	juni
+##number##	symfoni
+##number##	hurtigløb
+k	jørgensen
+##number##	jörgen
+##number##	klasses
+e	jacobsen
+k	jensen
+##number##	februar
+k	nielsen
+##number##	volleyball
+##number##	maj
+##number##	verdenskrig
+##number##	juli
+##number##	ishockey
+##number##	kunstskøjteløb
+b	jørgensen
+##number##	gymnastik
+##number##	svømning
+##number##	tw
+i	pedersens

nltk_data/tokenizers/punkt_tab/danish/ortho_context.tab ADDED

The diff for this file is too large to render. See raw diff