rooftopcoder committed on
Commit ce4167f · 1 Parent(s): 9793cb6

Add requirements

Files changed (3)
  1. README.md +122 -14
  2. app.py +210 -0
  3. requirements.txt +9 -0
README.md CHANGED
@@ -1,14 +1,122 @@
- ---
- title: NMT Demo
- emoji: 🐢
- colorFrom: yellow
- colorTo: indigo
- sdk: gradio
- sdk_version: 5.21.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- short_description: NMT demo for BITS Assignment Group 54
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Neural Machine Translation for English-Hindi
+
+ This project implements a Neural Machine Translation system for English-to-Hindi translation using a MarianMT model fine-tuned on a 100k split of the Samanantar dataset, with a user-friendly Gradio interface.
+
+ ![NMT UI Screenshot](assets/nmt_ui_screenshot.png)
+
+ ## Features
+
+ - Unidirectional translation from English to Hindi
+ - User-friendly web interface built with Gradio
+ - Example translations included
+ - Built on Helsinki-NLP's MarianMT model
+
+ ## Installation
+
+ ### Local Setup with Virtual Environment
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/yourusername/NLPA_Assignment_2_Group_54.git
+ cd NLPA_Assignment_2_Group_54
+ ```
+
+ 2. Create and activate a virtual environment:
+ ```bash
+ python -m venv venv
+ source venv/bin/activate # On Windows, use: venv\Scripts\activate
+ ```
+
+ 3. Install the required packages:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Usage
+
+ 1. Make sure your virtual environment is activated
+ 2. Run the UI:
+ ```bash
+ python nmt_ui.py
+ ```
+ 3. Open your browser and navigate to `http://localhost:7860`
+
+ ## Supported Language Pairs
+
+ - English -> Hindi (using rooftopcoder/opus-mt-en-hi-samanantar-100k model)
+
+ ## Training the Model
+
+ The `train.py` script fine-tunes the MarianMT model on the Samanantar dataset. It performs the following steps:
+ - Loads the Samanantar dataset (English-Hindi subset).
+ - Splits the dataset into training and validation sets.
+ - Tokenizes the dataset.
+ - Sets up training arguments optimized for GPU.
+ - Trains the model using the Hugging Face `Trainer` class.
+ - Saves the trained model to the specified directory.
+ - Uploads the trained model to the Hugging Face Hub.
+
+ To train the model, run:
+ ```bash
+ python train.py
+ ```
+
+ ## Testing the Model
+
+ The `model_test.py` script tests the trained MarianMT model. It performs the following steps:
+ - Loads the trained model and tokenizer from the Hugging Face Hub.
+ - Translates a sample input text from English to Hindi.
+ - Prints the translated text.
+
+ To test the model, run:
+ ```bash
+ python model_test.py
+ ```
+
+ ## User Interface
+
+ The `nmt_ui.py` script provides a Gradio-based user interface for translating text from English to Hindi. The interface also offers transliteration of Romanized Hindi text into Devanagari script.
+
+ To launch the interface, run:
+ ```bash
+ python nmt_ui.py
+ ```
+
+ ## Model Information
+
+ This project uses the MarianMT model from Hugging Face Transformers.
+
+ ### Notes:
+ - The model supports English-to-Hindi translation.
+ - Based on the Helsinki-NLP/opus-mt-en-hi model.
+ - Optimized for English -> Hindi translation pairs.
+ - Includes transliteration support for Romanized Hindi text.
+
+ ### Supported Features:
+ - English -> Hindi translation.
+ - Romanized Hindi -> Devanagari Hindi transliteration.
+
+ ### Examples of Transliteration:
+ - "namaste" → "नमस्ते"
+ - "aap kaise ho" → "आप कैसे हो"
+ - "mera naam" → "मेरा नाम"
+
+ ## Project Structure
+
+ ```
+ NLPA_Assignment_2_Group_54/
+ ├── nmt_ui.py           # Main application file with Gradio interface
+ ├── requirements.txt    # Python dependencies
+ └── README.md           # Project documentation
+ ```
+
+ ## License
+
+ MIT
+
+ ## Group Members
+
+ - Shubhra J Gadhwala: 2023aa05750
+ - Sandeep Kumar Yadav: 2023ab05047
+ - Ravi Krishna Mayura: 2023ab05157
+ - Satheesh Kumar G: 2023ab05041
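Review note on the README's "Testing the Model" steps: `model_test.py` is not part of this commit, so the following is only a minimal sketch of what those three steps (load, translate, print) might look like, not the authors' script. The function names `load_marian` and `translate_batch` are hypothetical, and the default model ID is the base checkpoint named in the README; the fine-tuned checkpoint ID would be swapped in as needed.

```python
from typing import Any, List


def load_marian(model_id: str = "Helsinki-NLP/opus-mt-en-hi"):
    """Load a MarianMT tokenizer/model pair from the Hugging Face Hub."""
    # Imported lazily so translate_batch can be exercised without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_id)
    model = MarianMTModel.from_pretrained(model_id)
    return tokenizer, model


def translate_batch(tokenizer: Any, model: Any, texts: List[str]) -> List[str]:
    """Tokenize, generate, and decode a batch of source sentences."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

A test script would then call `tokenizer, model = load_marian()` followed by `print(translate_batch(tokenizer, model, ["Hello, how are you?"])[0])`. Passing the tokenizer and model in as arguments keeps the translation step easy to check with stand-in objects.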
app.py ADDED
@@ -0,0 +1,210 @@
+ import gradio as gr
+ from huggingface_hub import HfFolder
+ from transformers import MarianMTModel, MarianTokenizer
+ from indic_transliteration import sanscript
+ from indic_transliteration.sanscript import transliterate
+ import torch
+
+ # Global variables to store models and tokenizers
+ models = {}
+ tokenizers = {}
+ token = HfFolder.get_token()
+
+ # Model configurations
+ MODEL_CONFIGS = {
+     "en-hi": {
+         "model_path": "rooftopcoder/opus-mt-en-hi-samanantar-finetuned",
+         "name": "English to Hindi"
+     },
+     "hi-en": {
+         "model_path": "rooftopcoder/opus-mt-hi-en-samanantar-finetuned",
+         "name": "Hindi to English"
+     },
+     "en-mr": {
+         "model_path": "rooftopcoder/opus-mt-en-mr-samanantar-finetuned",
+         "name": "English to Marathi"
+     },
+     "mr-en": {
+         "model_path": "rooftopcoder/opus-mt-mr-en-samanantar-finetuned",
+         "name": "Marathi to English"
+     }
+ }
+
+ # Language codes dictionary (display name -> code)
+ language_codes = {
+     "English": "en",
+     "Hindi": "hi",
+     "Marathi": "mr"
+ }
+
+ # Reverse dictionary for display purposes
+ language_names = {v: k for k, v in language_codes.items()}
+
+ def load_models():
+     try:
+         print("Loading models from local storage...")
+         device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         print(f"Using device: {device}")
+
+         for direction, config in MODEL_CONFIGS.items():
+             print(f"Loading {config['name']} model...")
+             tokenizers[direction] = MarianTokenizer.from_pretrained(config["model_path"], token=token)
+             models[direction] = MarianMTModel.from_pretrained(config["model_path"], token=token).to(device)
+
+         print("All models loaded successfully!")
+         return True
+     except Exception as e:
+         print(f"Error loading models: {e}")
+         return False
+
+ # Function to transliterate Romanized (ITRANS) text to Devanagari
+ def transliterate_text(text, from_scheme=sanscript.ITRANS, to_scheme=sanscript.DEVANAGARI):
+     """
+     Transliterates text from one script to another
+     Default is from ITRANS (Roman) to Devanagari (Hindi)
+     """
+     try:
+         return transliterate(text, from_scheme, to_scheme)
+     except Exception as e:
+         print(f"Transliteration error: {e}")
+         return text
+
+ # Function to perform translation with MarianMT
+ def translate(input_text, source_lang, target_lang):
+     """
+     Translates text using MarianMT models
+     """
+     direction = f"{source_lang}-{target_lang}"
+     if direction not in models or direction not in tokenizers:
+         return "Error: Unsupported language pair"
+
+     if not input_text.strip():
+         return "Error: Please enter some text to translate."
+
+     try:
+         device = next(models[direction].parameters()).device
+         tokens = tokenizers[direction](input_text, return_tensors="pt", padding=True, truncation=True)
+         tokens = {k: v.to(device) for k, v in tokens.items()}
+
+         translated = models[direction].generate(**tokens)
+         translated = translated.cpu()
+         output = tokenizers[direction].batch_decode(translated, skip_special_tokens=True)
+         return output[0]
+     except Exception as e:
+         print(f"Translation error: {e}")
+         return f"Error during translation: {str(e)}"
+
+ # Helper function for handling the UI translation process
+ def perform_translation(input_text, source_lang, target_lang):
+     """Wrapper function for the Gradio interface"""
+     source_code = language_codes[source_lang]
+     target_code = language_codes[target_lang]
+
+     # Handle transliteration for Hindi and Marathi
+     if source_code == "en" and target_code in ["hi", "mr"]:
+         common_indic_words = {
+             "hi": ["namaste", "dhanyavad", "kaise", "hai", "aap", "tum", "main"],
+             "mr": ["namaskar", "dhanyawad", "kase", "ahe", "tumhi", "mi"]
+         }
+
+         words = input_text.lower().split()
+         if any(word in common_indic_words.get(target_code, []) for word in words):
+             transliterated = transliterate_text(input_text)
+             if transliterated != input_text:
+                 translation = translate(input_text, source_code, target_code)
+                 return f"Transliterated: {transliterated}\n\nTranslated: {translation}"
+
+     return translate(input_text, source_code, target_code)
+
+ # Create Gradio interface
+ def create_interface():
+     with gr.Blocks(title="Neural Machine Translation - Indian Languages") as demo:
+         gr.Markdown("# Neural Machine Translation for Indian Languages")
+         gr.Markdown("Translate between English, Hindi, and Marathi using MarianMT models")
+
+         with gr.Row():
+             with gr.Column():
+                 source_lang = gr.Dropdown(
+                     choices=list(language_codes.keys()),
+                     label="Source Language",
+                     value="English"
+                 )
+                 input_text = gr.Textbox(
+                     lines=5,
+                     placeholder="Enter text to translate...",
+                     label="Input Text"
+                 )
+
+             with gr.Column():
+                 target_lang = gr.Dropdown(
+                     choices=list(language_codes.keys()),
+                     label="Target Language",
+                     value="Hindi"
+                 )
+                 output_text = gr.Textbox(
+                     lines=5,
+                     label="Translated Text",
+                     placeholder="Translation will appear here..."
+                 )
+
+         translate_btn = gr.Button("Translate", variant="primary")
+         transliterate_btn = gr.Button("Transliterate Only", variant="secondary")
+
+         # Event handlers
+         translate_btn.click(
+             fn=perform_translation,
+             inputs=[input_text, source_lang, target_lang],
+             outputs=[output_text],
+             api_name="translate"
+         )
+
+         # Direct transliteration handler (new)
+         def direct_transliterate(text):
+             if not text.strip():
+                 return "Please enter text to transliterate"
+             return transliterate_text(text)
+
+         transliterate_btn.click(
+             fn=direct_transliterate,
+             inputs=[input_text],
+             outputs=[output_text],
+             api_name="transliterate"
+         )
+
+         # Examples for all language pairs
+         gr.Examples(
+             examples=[
+                 ["Hello, how are you?", "English", "Hindi"],
+                 ["नमस्ते, आप कैसे हैं?", "Hindi", "English"],
+                 ["Hello, how are you?", "English", "Marathi"],
+                 ["नमस्कार, तुम्ही कसे आहात?", "Marathi", "English"],
+             ],
+             inputs=[input_text, source_lang, target_lang],
+             fn=perform_translation,
+             outputs=output_text,
+             cache_examples=True
+         )
+
+         gr.Markdown("""
+         ## Model Information
+
+         This demo uses fine-tuned MarianMT models for translation between:
+         - English ↔️ Hindi
+         - English ↔️ Marathi
+
+         ### Features:
+         - Bidirectional translation support
+         - Transliteration support for romanized Indic text
+         - Optimized models for each language pair
+         """)
+
+     return demo
+
+ # Launch the interface
+ if __name__ == "__main__":
+     # Load all models before launching the interface
+     if load_models():
+         demo = create_interface()
+         demo.launch(share=False)
+     else:
+         print("Failed to load models. Please check the model paths and try again.")
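Review note on language-pair handling in `app.py`: both dropdowns list all three languages, so a selection such as Hindi to Marathi (or English to English) only fails later inside `translate` with "Error: Unsupported language pair". One possible approach, sketched here under the assumption that the direction keys stay in the `"src-tgt"` form used by `MODEL_CONFIGS`, is to derive the valid targets for a given source up front. The names `SUPPORTED_DIRECTIONS`, `LANGUAGE_CODES`, `valid_targets`, and `is_supported` are hypothetical and not part of the commit.

```python
from typing import Dict, List

# Mirrors the keys of MODEL_CONFIGS and the language_codes mapping in app.py.
SUPPORTED_DIRECTIONS = ("en-hi", "hi-en", "en-mr", "mr-en")
LANGUAGE_CODES: Dict[str, str] = {"English": "en", "Hindi": "hi", "Marathi": "mr"}


def valid_targets(source_name: str) -> List[str]:
    """Return display names of the languages reachable from source_name."""
    source_code = LANGUAGE_CODES[source_name]
    names_by_code = {code: name for name, code in LANGUAGE_CODES.items()}
    targets = []
    for direction in SUPPORTED_DIRECTIONS:
        src, tgt = direction.split("-")
        if src == source_code:
            targets.append(names_by_code[tgt])
    return targets


def is_supported(source_name: str, target_name: str) -> bool:
    """True if a model exists for translating source_name into target_name."""
    return target_name in valid_targets(source_name)
```

The list returned by `valid_targets` could be used to restrict the target dropdown whenever the source changes, or `is_supported` could be checked at the top of `perform_translation` to return a clearer message before any model is invoked.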
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ gradio
+ transformers[sentencepiece]
+ torch
+ sacremoses
+ indic-transliteration
+ datasets
+ accelerate>=0.26.0
+ evaluate
+ sacrebleu
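Review note on `requirements.txt`: only `accelerate` carries a version constraint, so it can help to record which versions were actually installed when the demo last worked. The sketch below is one way to do that with the standard library alone; it assumes the simple one-requirement-per-line format used in this file, and the helper names `requirement_names` and `installed_versions` are hypothetical.

```python
import re
from importlib import metadata
from typing import Dict, List, Optional

# Contents of requirements.txt as added in this commit.
REQUIREMENTS = """\
gradio
transformers[sentencepiece]
torch
sacremoses
indic-transliteration
datasets
accelerate>=0.26.0
evaluate
sacrebleu
"""


def requirement_names(text: str) -> List[str]:
    """Extract distribution names, dropping extras, version pins, and comments."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def installed_versions(names: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Calling `installed_versions(requirement_names(REQUIREMENTS))` inside the activated environment reports what is present. Note that distribution names can differ from import names: `indic-transliteration` is imported as `indic_transliteration` in `app.py`.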