harveen commited on
Commit
9bbf386
·
1 Parent(s): 7314a94

Adding code

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. indicTrans/IndicTrans_training.ipynb → IndicTrans_training.ipynb +0 -0
  2. indicTrans/LICENSE → LICENSE +0 -0
  3. README.md +9 -296
  4. indicTrans/api.py → api.py +0 -0
  5. indicTrans/apply_bpe_traindevtest_notag.sh → apply_bpe_traindevtest_notag.sh +0 -0
  6. indicTrans/apply_single_bpe_traindevtest_notag.sh → apply_single_bpe_traindevtest_notag.sh +0 -0
  7. indicTrans/binarize_training_exp.sh → binarize_training_exp.sh +0 -0
  8. indicTrans/compute_bleu.sh → compute_bleu.sh +0 -0
  9. indicTrans/indicTrans_Finetuning.ipynb → indicTrans_Finetuning.ipynb +0 -0
  10. indicTrans/indicTrans_python_interface.ipynb → indicTrans_python_interface.ipynb +0 -0
  11. {indicTrans/indic_nlp_library → indic_nlp_library}/LICENSE +0 -0
  12. {indicTrans/indic_nlp_library → indic_nlp_library}/README.md +0 -0
  13. {indicTrans/indic_nlp_library → indic_nlp_library}/contrib/README.md +0 -0
  14. {indicTrans/indic_nlp_library → indic_nlp_library}/contrib/correct_moses_tokenizer.py +0 -0
  15. {indicTrans/indic_nlp_library → indic_nlp_library}/contrib/hindi_to_kannada_transliterator.py +0 -0
  16. {indicTrans/indic_nlp_library → indic_nlp_library}/contrib/indic_scraper_project_sample.ipynb +0 -0
  17. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/Makefile +0 -0
  18. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/cmd.rst +0 -0
  19. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/code.rst +0 -0
  20. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/conf.py +0 -0
  21. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/index.rst +0 -0
  22. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.MD +0 -0
  23. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.cli.rst +0 -0
  24. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.morph.rst +0 -0
  25. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.normalize.rst +0 -0
  26. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.pdf +0 -0
  27. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.rst +0 -0
  28. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.script.rst +0 -0
  29. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.syllable.rst +0 -0
  30. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.tokenize.rst +0 -0
  31. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.transliterate.rst +0 -0
  32. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/make.bat +0 -0
  33. {indicTrans/indic_nlp_library → indic_nlp_library}/docs/modules.rst +0 -0
  34. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/__init__.py +0 -0
  35. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/cli/__init__.py +0 -0
  36. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/cli/cliparser.py +0 -0
  37. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/common.py +0 -0
  38. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/langinfo.py +0 -0
  39. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/loader.py +0 -0
  40. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/morph/__init__.py +0 -0
  41. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/morph/unsupervised_morph.py +0 -0
  42. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/normalize/__init__.py +0 -0
  43. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/normalize/indic_normalize.py +0 -0
  44. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/__init__.py +0 -0
  45. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/english_script.py +0 -0
  46. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/indic_scripts.py +0 -0
  47. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/phonetic_sim.py +0 -0
  48. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/syllable/__init__.py +0 -0
  49. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/syllable/syllabifier.py +0 -0
  50. {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/test/__init__.py +0 -0
indicTrans/IndicTrans_training.ipynb → IndicTrans_training.ipynb RENAMED
File without changes
indicTrans/LICENSE → LICENSE RENAMED
File without changes
README.md CHANGED
@@ -1,296 +1,9 @@
1
- <div align="center">
2
- <h1><b><i>IndicTrans</i></b></h1>
3
- <a href="http://indicnlp.ai4bharat.org/samanantar">Website</a> |
4
- <a href="https://arxiv.org/abs/2104.05596">Paper</a> |
5
- <a href="https://youtu.be/QwYPOd1eBtQ?t=383">Video</a><br><br>
6
- </div>
7
-
8
- **IndicTrans** is a Transformer-4x ( ~434M ) multilingual NMT model trained on [Samanantar](https://indicnlp.ai4bharat.org/samanantar) dataset which is the largest publicly available parallel corpora collection for Indic languages at the time of writing ( 14 April 2021 ). It is a single script model i.e we convert all the Indic data to the Devanagari script which allows for ***better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary between Indic languages and allows using a smaller subword vocabulary***. We currently release two models - Indic to English and English to Indic and support the following 11 indic languages:
9
-
10
- | <!-- --> | <!-- --> | <!-- --> | <!-- --> |
11
- | ------------- | -------------- | ------------ | ----------- |
12
- | Assamese (as) | Hindi (hi) | Marathi (mr) | Tamil (ta) |
13
- | Bengali (bn) | Kannada (kn) | Oriya (or) | Telugu (te) |
14
- | Gujarati (gu) | Malayalam (ml) | Punjabi (pa) |
15
-
16
-
17
-
18
-
19
- - [Updates](#updates)
20
- - [Download IndicTrans models:](#download-indictrans-models)
21
- - [Using the model for translating any input](#using-the-model-for-translating-any-input)
22
- - [Finetuning the model on your input dataset](#finetuning-the-model-on-your-input-dataset)
23
- - [Mining Indic to Indic pairs from english centric corpus](#mining-indic-to-indic-pairs-from-english-centric-corpus)
24
- - [Installation](#installation)
25
- - [How to train the indictrans model on your training data?](#how-to-train-the-indictrans-model-on-your-training-data)
26
- - [Network & Training Details](#network--training-details)
27
- - [Folder Structure](#folder-structure)
28
- - [Citing](#citing)
29
- - [License](#license)
30
- - [Contributors](#contributors)
31
- - [Contact](#contact)
32
-
33
-
34
- ## Updates
35
- <details><summary>Click to expand </summary>
36
- 18 December 2021
37
-
38
- ```
39
- Tutorials updated with latest model links
40
- ```
41
-
42
-
43
- 26 November 2021
44
- ```
45
- - v0.3 models are now available for download
46
- ```
47
-
48
- 27 June 2021
49
- ```
50
- - Updated links for indic to indic model
51
- - Add more comments to training scripts
52
- - Add link to [Samanantar Video](https://youtu.be/QwYPOd1eBtQ?t=383)
53
- - Add folder structure in readme
54
- - Add python wrapper for model inference
55
- ```
56
-
57
- 09 June 2021
58
- ```
59
- - Updated links for models
60
- - Added Indic to Indic model
61
- ```
62
-
63
- 09 May 2021
64
- ```
65
- - Added fix for finetuning on datasets where some lang pairs are not present. Previously the script assumed the finetuning dataset will have data for all 11 indic lang pairs
66
- - Added colab notebook for finetuning instructions
67
- ```
68
- </details>
69
-
70
- ## Download IndicTrans models:
71
-
72
- Indic to English: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/indic-en.zip)
73
-
74
- English to Indic: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/en-indic.zip)
75
-
76
- Indic to Indic: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/m2m.zip)
77
-
78
-
79
-
80
- ## Using the model for translating any input
81
-
82
- The model is trained on single sentences and hence, users need to split parapgraphs to sentences before running the translation when using our command line interface (The python interface has `translate_paragraph` method to handle multi sentence translations).
83
-
84
- Note: IndicTrans is trained with a max sequence length of **200** tokens (subwords). If your sentence is too long (> 200 tokens), the sentence will be truncated to 200 tokens before translation.
85
-
86
- Here is an example snippet to split paragraphs into sentences for English and Indic languages supported by our model:
87
- ```python
88
- # install these libraries
89
- # pip install mosestokenizer
90
- # pip install indic-nlp-library
91
-
92
- from mosestokenizer import *
93
- from indicnlp.tokenize import sentence_tokenize
94
-
95
- INDIC = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]
96
-
97
- def split_sentences(paragraph, language):
98
- if language == "en":
99
- with MosesSentenceSplitter(language) as splitter:
100
- return splitter([paragraph])
101
- elif language in INDIC:
102
- return sentence_tokenize.sentence_split(paragraph, lang=language)
103
-
104
- split_sentences("""COVID-19 is caused by infection with the severe acute respiratory
105
- syndrome coronavirus 2 (SARS-CoV-2) virus strain. The disease is mainly transmitted via the respiratory
106
- route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing. """, language='en')
107
-
108
- >> ['COVID-19 is caused by infection with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus strain.',
109
- 'The disease is mainly transmitted via the respiratory route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing.']
110
-
111
- split_sentences("""இத்தொற்றுநோய் உலகளாவிய சமூக மற்���ும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது. இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.
112
- அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.""",
113
- language='ta')
114
-
115
- >> ['இத்தொற்றுநோய் உலகளாவிய சமூக மற்றும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.',
116
- 'இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது.',
117
- 'இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.',
118
- 'அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.']
119
-
120
-
121
- ```
122
-
123
- Follow the colab notebook to setup the environment, download the trained _IndicTrans_ models and translating your own text.
124
-
125
- Command line interface --> [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indictrans_fairseq_inference.ipynb)
126
-
127
-
128
- Python interface --> [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_python_interface.ipynb)
129
-
130
- The python interface is useful in case you want to reuse the model for multiple translations and do not want to reinitialize the model each time
131
-
132
-
133
- ## Finetuning the model on your input dataset
134
-
135
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_Finetuning.ipynb)
136
-
137
- The colab notebook can be used to setup the environment, download the trained _IndicTrans_ models and prepare your custom dataset for funetuning the indictrans model. There is also a section on mining indic to indic data from english centric corpus for finetuning indic to indic model.
138
-
139
- **Note**: Since this is a big model (400M params), you might not be able to train with reasonable batch sizes in the free google Colab account. We are planning to release smaller models (after pruning / distallation) soon.
140
-
141
- ## Mining Indic to Indic pairs from english centric corpus
142
-
143
- The `extract_non_english_pairs` in `scripts/extract_non_english_pairs.py` can be used to mine indic to indic pairs from english centric corpus.
144
-
145
- As described in the [paper](https://arxiv.org/pdf/2104.05596.pdf) (section 2.5) , we use a very strict deduplication criterion to avoid the creation of very similar parallel sentences. For example, if an en sentence is aligned to *M* hi sentences and *N* ta sentences, then we would get *MN* hi-ta pairs. However, these pairs would be very similar and not contribute much to the training process. Hence, we retain only 1 randomly chosen pair out of these *MN* pairs.
146
-
147
- ```bash
148
- extract_non_english_pairs(indir, outdir, LANGS):
149
- """
150
- Extracts non-english pair parallel corpora
151
- indir: contains english centric data in the following form:
152
- - directory named en-xx for language xx
153
- - each directory contains a train.en and train.xx
154
- outdir: output directory to store mined data for each pair.
155
- One directory is created for each pair.
156
- LANGS: list of languages in the corpus (other than English).
157
- The language codes must correspond to the ones used in the
158
- files and directories in indir. Prefarably, sort the languages
159
- in this list in alphabetic order. outdir will contain data for xx-yy,
160
- but not for yy-xx, so it will be convenient to have this list in sorted order.
161
- """
162
- ```
163
-
164
- ## Installation
165
- <details><summary>Click to expand </summary>
166
-
167
- ```bash
168
- cd indicTrans
169
- git clone https://github.com/anoopkunchukuttan/indic_nlp_library.git
170
- git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
171
- git clone https://github.com/rsennrich/subword-nmt.git
172
- # install required libraries
173
- pip install sacremoses pandas mock sacrebleu tensorboardX pyarrow indic-nlp-library
174
-
175
- # Install fairseq from source
176
- git clone https://github.com/pytorch/fairseq.git
177
- cd fairseq
178
- pip install --editable ./
179
-
180
- ```
181
- </details>
182
-
183
- ## How to train the indictrans model on your training data?
184
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/IndicTrans_training.ipynb)
185
-
186
-
187
- Follow the colab notebook to setup the environment, download the dataset and train the indicTrans model
188
-
189
- ## Network & Training Details
190
-
191
- - Architechture: IndicTrans uses 6 encoder and decoder layers, input embeddings of size 1536 with 16 attention heads and
192
- feedforward dimension of 4096 with total number of parameters of 434M
193
- - Loss: Cross entropy loss
194
- - Optimizer: Adam
195
- - Label Smoothing: 0.1
196
- - Gradient clipping: 1.0
197
- - Learning rate: 5e-4
198
- - Warmup_steps: 4000
199
-
200
- Please refer to section 4, 5 of our [paper](https://arxiv.org/ftp/arxiv/papers/2104/2104.05596.pdf) for more details on training/experimental setup.
201
-
202
- ## Folder Structure
203
- ```
204
-
205
- IndicTrans
206
- │ .gitignore
207
- │ apply_bpe_traindevtest_notag.sh # apply bpe for joint vocab (Train, dev and test)
208
- │ apply_single_bpe_traindevtest_notag.sh # apply bpe for seperate vocab (Train, dev and test)
209
- │ binarize_training_exp.sh # binarize the training data after preprocessing for fairseq-training
210
- │ compute_bleu.sh # Compute blue scores with postprocessing after translating with `joint_translate.sh`
211
- │ indictrans_fairseq_inference.ipynb # colab example to show how to use model for inference
212
- │ indicTrans_Finetuning.ipynb # colab example to show how to use model for finetuning on custom domain data
213
- │ joint_translate.sh # used for inference (see colab inference notebook for more details on usage)
214
- │ learn_bpe.sh # learning joint bpe on preprocessed text
215
- │ learn_single_bpe.sh # learning seperate bpe on preprocessed text
216
- │ LICENSE
217
- │ prepare_data.sh # prepare data given an experiment dir (this does preprocessing,
218
- │ # building vocab, binarization ) for bilingual training
219
- │ prepare_data_joint_training.sh # prepare data given an experiment dir (this does preprocessing,
220
- │ # building vocab, binarization ) for joint training
221
- │ README.md
222
-
223
- ├───legacy # old unused scripts
224
- ├───model_configs # custom model configrations are stored here
225
- │ custom_transformer.py # contains custom 4x transformer models
226
- │ __init__.py
227
- ├───inference
228
- │ custom_interactive.py # for python wrapper around fairseq-interactive
229
- │ engine.py # python interface for model inference
230
- └───scripts # stores python scripts that are used by other bash scripts
231
- │ add_joint_tags_translate.py # add lang tags to the processed training data for bilingual training
232
- │ add_tags_translate.py # add lang tags to the processed training data for joint training
233
- │ clean_vocab.py # clean vocabulary after building with subword_nmt
234
- │ concat_joint_data.py # concatenates lang pair data and creates text files to keep track
235
- │ # of number of lines in each lang pair.
236
- │ extract_non_english_pairs.py # Mining Indic to Indic pairs from english centric corpus
237
- │ postprocess_translate.py # Postprocesses translations
238
- │ preprocess_translate.py # Preprocess translations and for script conversion (from indic to devnagiri)
239
- │ remove_large_sentences.py # to remove large sentences from training data
240
- └───remove_train_devtest_overlaps.py # Finds and removes overlaped data of train with dev and test sets
241
- ```
242
-
243
-
244
- ## Citing
245
-
246
- If you are using any of the resources, please cite the following article:
247
- ```
248
- @misc{ramesh2021samanantar,
249
- title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
250
- author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
251
- year={2021},
252
- eprint={2104.05596},
253
- archivePrefix={arXiv},
254
- primaryClass={cs.CL}
255
- }
256
- ```
257
-
258
- We would like to hear from you if:
259
-
260
- - You are using our resources. Please let us know how you are putting these resources to use.
261
- - You have any feedback on these resources.
262
-
263
-
264
-
265
- ### License
266
-
267
- The IndicTrans code (and models) are released under the MIT License.
268
-
269
-
270
- ### Contributors
271
-
272
- - Gowtham Ramesh, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [IITM](https://www.iitm.ac.in))</sub>
273
- - Sumanth Doddapaneni, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [IITM](https://www.iitm.ac.in))</sub>
274
- - Aravinth Bheemaraj, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
275
- - Mayank Jobanputra, <sub>([IITM](https://www.iitm.ac.in))</sub>
276
- - Raghavan AK, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
277
- - Ajitesh Sharma, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
278
- - Sujit Sahoo, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
279
- - Harshita Diddee, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
280
- - Mahalakshmi J, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
281
- - Divyanshu Kakwani, <sub>([IITM](https://www.iitm.ac.in), [AI4Bharat](https://ai4bharat.org))</sub>
282
- - Navneet Kumar, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
283
- - Aswin Pradeep, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
284
- - Kumar Deepak, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
285
- - Vivek Raghavan, <sub>([EkStep](https://ekstep.in))</sub>
286
- - Anoop Kunchukuttan, <sub>([Microsoft](https://www.microsoft.com/en-in/), [AI4Bharat](https://ai4bharat.org))</sub>
287
- - Pratyush Kumar, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))</sub>
288
- - Mitesh Shantadevi Khapra, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))</sub>
289
-
290
-
291
-
292
- ### Contact
293
-
294
- - Anoop Kunchukuttan ([[email protected]](mailto:[email protected]))
295
- - Mitesh Khapra ([[email protected]](mailto:[email protected]))
296
- - Pratyush Kumar ([[email protected]](mailto:[email protected]))
 
1
+ ---
2
+ title: Indic Translation
3
+ emoji: 🐢
4
+ colorFrom: blue
5
+ colorTo: gray
6
+ sdk: gradio
7
+ app_file: app.py
8
+ pinned: false
9
+ ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
indicTrans/api.py → api.py RENAMED
File without changes
indicTrans/apply_bpe_traindevtest_notag.sh → apply_bpe_traindevtest_notag.sh RENAMED
File without changes
indicTrans/apply_single_bpe_traindevtest_notag.sh → apply_single_bpe_traindevtest_notag.sh RENAMED
File without changes
indicTrans/binarize_training_exp.sh → binarize_training_exp.sh RENAMED
File without changes
indicTrans/compute_bleu.sh → compute_bleu.sh RENAMED
File without changes
indicTrans/indicTrans_Finetuning.ipynb → indicTrans_Finetuning.ipynb RENAMED
File without changes
indicTrans/indicTrans_python_interface.ipynb → indicTrans_python_interface.ipynb RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/LICENSE RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/README.md RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/contrib/README.md RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/contrib/correct_moses_tokenizer.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/contrib/hindi_to_kannada_transliterator.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/contrib/indic_scraper_project_sample.ipynb RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/Makefile RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/cmd.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/code.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/conf.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/index.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.MD RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.cli.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.morph.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.normalize.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.pdf RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.script.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.syllable.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.tokenize.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.transliterate.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/make.bat RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/modules.rst RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/__init__.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/cli/__init__.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/cli/cliparser.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/common.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/langinfo.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/loader.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/morph/__init__.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/morph/unsupervised_morph.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/normalize/__init__.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/normalize/indic_normalize.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/__init__.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/english_script.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/indic_scripts.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/phonetic_sim.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/syllable/__init__.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/syllable/syllabifier.py RENAMED
File without changes
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/test/__init__.py RENAMED
File without changes