diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..1e5a6d97a33b19bb0395b0ec43c829631dd57f61
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,6 @@
+bin/__pycache__
+layers/__pycache__
+models/__pycache__
+modules/__pycache__
+utils/__pycache__
+data/raw/preprocess.ipynb
\ No newline at end of file
diff --git a/README.md b/README.md
index be03da256c23857125a61f03f478f64751f78580..9fe77346d8016872b9faeb0890732ebce9355692 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,156 @@
---
-title: NMT LaVi
-emoji: 🐢
-colorFrom: red
-colorTo: pink
-sdk: streamlit
-sdk_version: 1.28.2
-app_file: app.py
-pinned: false
-license: unknown
----
-
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+MultilingualMT-UET-KC4.0 is an open-source project developed by the UETNLPLab group.
+
+# Setup
+## Installing the Multilingual-NMT toolkit
+
+**Note**:
+The current version is only compatible with python>=3.6.
+
+```bash
+git clone https://github.com/KCDichDaNgu/KC4.0_MultilingualNMT.git
+cd KC4.0_MultilingualNMT
+pip install -r requirements.txt
+```
+
+# Quickstart
+
+## Step 1: Prepare the data
+
+The example experiment uses the English-Vietnamese IWSLT corpus with 133k sentence pairs:
+
+```bash
+cd data/iwslt_en_vi
+```
+
+The data consists of source (`src`) and target (`tgt`) sentences that have already been tokenized:
+
+* `train.en`
+* `train.vi`
+* `tst2012.en`
+* `tst2012.vi`
+
+| Data set    | Sentences |                    Download                   |
+| :---------: | :-------: | :-------------------------------------------: |
+| Training    | 133,317   | via GitHub or located in data/train-en-vi.tgz |
+| Development | 1,553     | via GitHub or located in data/train-en-vi.tgz |
+| Test        | 1,268     | via GitHub or located in data/train-en-vi.tgz |
+
+**Note**:
+- The data must be tokenized before it is fed into training.
+- $CONFIG is the path to the directory containing the config file.
+
+Hold out a dev set to measure convergence during training; it should normally contain no more than 5k sentences.
+
+```text
+$ head -n 5 data/iwslt_en_vi/train.en
+Rachel Pike : The science behind a climate headline
+In 4 minutes , atmospheric chemist Rachel Pike provides a glimpse of the massive scientific effort behind the bold headlines on climate change , with her team -- one of thousands who contributed -- taking a risky flight over the rainforest in pursuit of data on a key molecule .
+I 'd like to talk to you today about the scale of the scientific effort that goes into making the headlines you see in the paper .
+Headlines that look like this when they have to do with climate change , and headlines that look like this when they have to do with air quality or smog .
+They are both two branches of the same field of atmospheric science .
+```
+
+## Step 2: Train the model
+
+To train a new model, **edit the YAML config file**:
+adjust the hyperparameters and the paths to the training data in en_vi.yml:
+
+```yaml
+# data location and config section
+data:
+  train_data_location: data/iwslt_en_vi/train
+  eval_data_location: data/iwslt_en_vi/tst2013
+  src_lang: .en
+  trg_lang: .vi
+log_file_models: 'model.log'
+lowercase: false
+build_vocab_kwargs: # additional arguments for build_vocab. See torchtext.vocab.Vocab for more details
+# max_size: 50000
+  min_freq: 5
+# model parameters section
+device: cuda
+d_model: 512
+n_layers: 6
+heads: 8
+# inference section
+eval_batch_size: 8
+decode_strategy: BeamSearch
+decode_strategy_kwargs:
+  beam_size: 5 # beam search size
+  length_normalize: 0.6 # recalculate beam position by length. Currently only works in the default BeamSearch
+  replace_unk: # tuple of layer/head attention to replace unknown words
+    - 0 # layer
+    - 0 # head
+input_max_length: 200 # input longer than this value will be trimmed in inference. Note that this value is used for the cached PE, hence a validation set with more tokens than this will trigger a trimming warning.
+max_length: 160 # only perform up to this many timesteps during inference
+train_max_length: 50 # training samples longer than this in src/trg will be discarded
+# optimizer and learning arguments section
+lr: 0.2
+optimizer: AdaBelief
+optimizer_params:
+  betas:
+    - 0.9 # beta1
+    - 0.98 # beta2
+  eps: !!float 1e-9
+n_warmup_steps: 4000
+label_smoothing: 0.1
+dropout: 0.1
+# training config, evaluation, save & load section
+batch_size: 64
+epochs: 20
+printevery: 200
+save_checkpoint_epochs: 1
+maximum_saved_model_eval: 5
+maximum_saved_model_train: 5
+```
+
+Then run the training command:
+
+```bash
+python -m bin.main train --model Transformer --model_dir $MODEL/en-vi.model --config $CONFIG/en_vi.yml
+```
+
+**Note**:
+- $MODEL is the path to the directory where the model is saved. After training, this directory contains the trained model, the config file, the log file, and the vocab.
+- $CONFIG is the path to the directory containing the config file.
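+
+For reference, the CLI call above roughly corresponds to the following Python snippet (a minimal sketch of what `bin/main.py` does under the hood; the paths are placeholders matching the example above):
+
+```python
+# Sketch: programmatic equivalent of `python -m bin.main train ...`
+import models
+from modules.config import find_all_config
+
+model_dir = "experiments/en-vi.model"   # placeholder directory for checkpoints, logs and vocab
+config_paths = ["config/en_vi.yml"]     # or find_all_config(model_dir) to reuse configs already copied there
+
+model = models.AvailableModels["Transformer"](config=config_paths, model_dir=model_dir, mode="train")
+model.load_checkpoint(model_dir)        # resumes from the latest checkpoint in model_dir, if any
+model.run_train(model_dir=model_dir, config=config_paths)
+```
+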
+## Step 3: Translate
+
+The model translates with beam search and writes the output to `$your_data_path/translate.en2vi.vi`:
+
+```bash
+python -m bin.main infer --model Transformer --model_dir $MODEL/en-vi.model --features_file $your_data_path/tst2012.en --predictions_file $your_data_path/translate.en2vi.vi
+```
+
+## Step 4: Evaluate the quality with BLEU
+
+Compute the BLEU score with multi-bleu (the reference file is the argument; the translation is read from stdin):
+
+```bash
+perl thrid-party/multi-bleu.perl $your_data_path/tst2012.vi < $your_data_path/translate.en2vi.vi
+```
+
+| MODEL              | BLEU (Beam Search) |
+| :-----------------:| :----------------: |
+| Transformer (Base) | 25.64              |
+
+## Further details are available at
+[nmtuet.ddns.net](http://nmtuet.ddns.net:1190/)
+
+## If you have feedback or suggestions, please send an email to kcdichdangu@gmail.com
+
+## Please cite the following paper:
+```bibtex
+@inproceedings{ViNMT2022,
+  title     = {ViNMT: Neural Machine Translation Toolkit},
+  author    = {Nguyen Hoang Quan, Nguyen Thanh Dat, Nguyen Hoang Minh Cong, Nguyen Van Vinh, Ngo Thi Vinh, Nguyen Phuong Thai, Tran Hong Viet},
+  booktitle = {https://arxiv.org/abs/2112.15272},
+  year      = {2022},
+}
+```
\ No newline at end of file
diff --git a/bin/__init__.py b/bin/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..de9efebaeaa882cf7affa2014593c7c332e99e7a
--- /dev/null
+++ b/bin/__init__.py
@@ -0,0 +1 @@
+import bin.main as main
diff --git a/bin/main.py b/bin/main.py
new file mode 100644
index 0000000000000000000000000000000000000000..81d99b236eba6d9c71cb80a8b48ee27ebf181ed1
--- /dev/null
+++ b/bin/main.py
@@ -0,0 +1,73 @@
+import models
+import argparse, os
+from shutil import copy2 as copy
+from modules.config import find_all_config
+
+OVERRIDE_RUN_MODE = {"serve": "infer", "debug": "eval"}
+
+def check_valid_file(path):
+    if(os.path.isfile(path)):
+        return path
+    else:
+        # ArgumentTypeError is the exception argparse expects from validators; ArgumentError requires an argument object
+        raise argparse.ArgumentTypeError("This path {:s} is not a valid file, check again.".format(path))
+
+def create_torchscript_model(model, model_dir, model_name):
+    """Create a TorchScript model using junk data. NOTE: as with TensorFlow, this is a limited model with no native Python code."""
+    import torch
+    junk_input = torch.rand(2, 10)
+    junk_output = torch.rand(2, 7)
+    # torch.jit.trace takes the example inputs bundled as a single tuple
+    traced_model = torch.jit.trace(model, (junk_input, junk_output))
+    save_location = os.path.join(model_dir, model_name)
+    traced_model.save(save_location)
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Main argument parser")
+    parser.add_argument("run_mode", choices=("train", "eval", "infer", "debug", "serve"), help="Main running mode of the program")
+    parser.add_argument("--model", type=str, choices=models.AvailableModels.keys(), help="The type of model to be run")
+    parser.add_argument("--model_dir", type=str, required=True, help="Location of the model directory")
+    parser.add_argument("--config", type=str, nargs="+", default=None, help="Location of the config file(s)")
+    parser.add_argument("--no_keeping_config", action="store_false", help="If set, do not copy the config file to the model directory")
+    # arguments for inference
+    parser.add_argument("--features_file", type=str, help="Inference mode: location of the features (source) file")
+    parser.add_argument("--predictions_file", type=str, help="Inference mode: location of the output file predicted from the features file")
+    parser.add_argument("--src_lang", type=str, help="Inference mode: language used by the source file")
+    parser.add_argument("--trg_lang", type=str, default=None, help="Inference mode: Choose language that is translated from source file. 
NOTE: only specify for multilingual model") + parser.add_argument("--infer_batch_size", type=int, default=None, help="Specify the batch_size to run the model with. Default use the config value.") + parser.add_argument("--checkpoint", type=str, default=None, help="All mode: specify to load the checkpoint into model.") + parser.add_argument("--checkpoint_idx", type=int, default=0, help="All mode: specify the epoch of the checkpoint loaded. Only useful for training.") + parser.add_argument("--serve_path", type=str, default=None, help="File to save TorchScript model into.") + + args = parser.parse_args() + # create directory if not exist + os.makedirs(args.model_dir, exist_ok=True) + config_path = args.config + if(config_path is None): + config_path = find_all_config(args.model_dir) + print("Config path not specified, load the configs in model directory which is {}".format(config_path)) + elif(args.no_keeping_config): + # store false variable, mean true is default + print("Config specified, copying all to model dir") + for subpath in config_path: + copy(subpath, args.model_dir) + + # load model. Specific run mode required converting + run_mode = OVERRIDE_RUN_MODE.get(args.run_mode, args.run_mode) + model = models.AvailableModels[args.model](config=config_path, model_dir=args.model_dir, mode=run_mode) + model.load_checkpoint(args.model_dir, checkpoint=args.checkpoint, checkpoint_idx=args.checkpoint_idx) + # run model + run_mode = args.run_mode + if(run_mode == "train"): + model.run_train(model_dir=args.model_dir, config=config_path) + elif(run_mode == "eval"): + model.run_eval(model_dir=args.model_dir, config=config_path) + elif(run_mode == "infer"): + model.run_infer(args.features_file, args.predictions_file, src_lang=args.src_lang, trg_lang=args.trg_lang, config=config_path, batch_size=args.infer_batch_size) + elif(run_mode == "debug"): + raise NotImplementedError + model.run_debug(model_dir=args.model_dir, config=config_path) + elif(run_mode == "serve"): + if(args.serve_path is None): + raise parser.ArgumentError("In serving, --serve_path cannot be empty") + model.prepare_serve(args.serve_path, model_dir=args.model_dir, config=config_path) + else: + raise ValueError("Run mode {:s} not implemented.".format(run_mode)) diff --git a/bin/serve.py b/bin/serve.py new file mode 100644 index 0000000000000000000000000000000000000000..035b239630ab5927573daa9a0adb3569e8a4f37f --- /dev/null +++ b/bin/serve.py @@ -0,0 +1,108 @@ +import os + +#import utils.save as saver +#import models +#from models.transformer import Transformer +#from modules.config import find_all_config + +class TransformerHandlerClass: + def __init__(self): + self.model = None + self.device = None + self.initialized = False + + def _find_checkpoint(self, model_dir, best_model_prefix="best_model", model_prefix="model", validate=True): + """Attempt to retrieve the best model checkpoint from model_dir. Failing that, the model of the latest iteration. + Args: + model_dir: location to search for checkpoint. 
str + Returns: + single str denoting the checkpoint path """ + score_file_path = os.path.join(model_dir, saver.BEST_MODEL_FILE) + if(os.path.isfile(score_file_path)): # score exist -> best model + best_model_path = os.path.join(model_dir, saver.MODEL_FILE_FORMAT.format(best_model_prefix, 0, saver.MODEL_EXTENSION)) + if(validate): + assert os.path.isfile(best_model_path), "Score file is available, but file {:s} is missing.".format(best_model_path) + return best_model_path + else: # score not exist -> latest model + last_checkpoint_idx = saver.check_model_in_path(name_prefix=model_prefix) + if(last_checkpoint_idx == 0): + raise ValueError("No checkpoint found in folder {:s} with prefix {:s}.".format(model_dir, model_prefix)) + else: + return os.path.join(model_dir, saver.MODEL_FILE_FORMAT.format(model_prefix, last_checkpoint_idx, saver.MODEL_EXTENSION)) + + + def initialize(self, ctx): + manifest = ctx.manifest + properties = ctx.system_properties + + self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu") + self.model_dir = model_dir = properties.get("model_dir") + + # extract checkpoint location, config & model name + model_serve_file = os.path.join(model_dir, saver.MODEL_SERVE_FILE) + with io.open(model_serve_file, "r") as serve_config: + model_name = serve_config.read().strip() +# model_cls = models.AvailableModels[model_name] + model_cls = Transformer # can't select due to nature of model file + checkpoint_path = manifest['model'].get('serializedFile', self._find_checkpoint(model_dir)) # attempt to use the checkpoint fed from archiver; else use the best checkpoint found + config_path = find_all_config(model_dir) + + # load model with inbuilt config + vocab & without pretraining data + self.model = model = model_cls(config=config_path, model_dir=model_dir, mode="infer") + model.load_checkpoint(args.model_dir, checkpoint=checkpoint_path) # TODO find_checkpoint might do some redundant thing here since load_checkpoint had already done searching for latest + + print("Model {:s} loaded successfully at location {:s}.".format(model_name, model_dir)) + self.initialized = True + + def handle(self, data): + """The main bulk of handling. Process a batch of data received from client. + Args: + data: the object received from client. 
Should contain something in [batch_size] of str + Returns: + the expected translation, [batch_size] of str + """ + batch_sentences = data[0].get("data") +# assert batch_sentences is not None, "data is {}".format(data) + + # make sure that sentences are detokenized before returning + translated_sentences = self.model.translate_batch(batch_sentences, output_tokens=False) + + return translated_sentences + +class BeamSearchHandlerClass: + def __init__(self): + self.model = None + self.inferrer = None + self.initialized = False + + def initialize(self, ctx): + manifest = ctx.manifest + properties = ctx.system_properties + + model_dir = properties['model_dir'] + ts_modelpath = manifest['model']['serializedFile'] + self.model = ts_model = torch.jit.load(os.path.join(model_dir, ts_modelpath)) + + from modules.inference.beam_search import BeamSearch + device = "cuda" if torch.cuda.is_available() else "cpu" + self.inferrer = BeamSearch(model, 160, device, beam_size=5) + + self.initialized = True + + def handle(self, data): + batch_sentences = data[0].get("data") +# assert batch_sentences is not None, "data is {}".format(data) + + translated_sentences = self.inferrer.translate_batch_sentence(data, output_tokens=False) + return translated_sentences + +RUNNING_MODEL = BeamSearchHandlerClass() + +def handle(data, context): + if(not RUNNING_MODEL.initialized): # Lazy init + RUNNING_MODEL.initialize(context) + + if(data is None): + return None + + return RUNNING_MODEL.handle(data) diff --git a/config/bilingual_prototype.yml b/config/bilingual_prototype.yml new file mode 100644 index 0000000000000000000000000000000000000000..c814d2233b01c5cce940e70b857dab7a639c35cc --- /dev/null +++ b/config/bilingual_prototype.yml @@ -0,0 +1,52 @@ +# data location and config section +data: + train_data_location: data/test/train2023 + eval_data_location: data/test/dev2023 + src_lang: .lo + trg_lang: .vi +log_file_models: 'model.log' +lowercase: false +build_vocab_kwargs: # additional arguments for build_vocab. See torchtext.vocab.Vocab for mode details +# max_size: 50000 + min_freq: 4 + specials: + - + - + - + - + # data augmentation section +# model parameters section +device: cuda +d_model: 512 +n_layers: 6 +heads: 8 +# inference section +eval_batch_size: 8 +decode_strategy: BeamSearch +decode_strategy_kwargs: + beam_size: 5 # beam search size + length_normalize: 0.6 # recalculate beam position by length. Currently only work in default BeamSearch + replace_unk: # tuple of layer/head attention to replace unknown words + - 0 # layer + - 0 # head +input_max_length: 250 # input longer than this value will be trimmed in inference. Note that this values are to be used during cached PE, hence, validation set with more than this much tokens will call a warning for the trimming. 
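+# Note: the encoder's cached positional-encoding table is sized to max(input_max_length, train_max_length)
+# and the decoder's to max(max_length, train_max_length) (see models/transformer.py), so raising
+# train_max_length beyond those values also enlarges the cached tables.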
+max_length: 160 # only perform up to this much timestep during inference +train_max_length: 140 # training samples with this much length in src/trg will be discarded +# optimizer and learning arguments section +lr: 0.2 +optimizer: AdaBelief +optimizer_params: + betas: + - 0.9 # beta1 + - 0.98 # beta2 + eps: !!float 1e-9 +n_warmup_steps: 4000 +label_smoothing: 0.1 +dropout: 0.05 +# training config, evaluation, save & load section +batch_size: 32 +epochs: 40 +printevery: 200 +save_checkpoint_epochs: 1 +maximum_saved_model_eval: 5 +maximum_saved_model_train: 5 diff --git a/config/prototype.json b/config/prototype.json new file mode 100644 index 0000000000000000000000000000000000000000..eb9c820b09094f67aa711714e0672ff98daae9d7 --- /dev/null +++ b/config/prototype.json @@ -0,0 +1,25 @@ +{ + "train_src_data": "/workspace/khoai23/opennmt/data/iwslt_en_vi/train.en", + "train_trg_data": "/workspace/khoai23/opennmt/data/iwslt_en_vi/train.vi", + "valid_src_data": "/workspace/khoai23/opennmt/data/iwslt_en_vi/tst2013.en", + "valid_trg_data": "/workspace/khoai23/opennmt/data/iwslt_en_vi/tst2013.vi", + "src_lang": "en", + "trg_lang": "en", + "max_strlen": 160, + "batchsize": 1500, + "device": "cpu", + "d_model": 512, + "n_layers": 6, + "heads": 8, + "dropout": 0.1, + "lr": 0.0001, + "epochs": 30, + "printevery": 200, + "k": 5, + "n_warmup_steps": 4000, + "beta1": 0.9, + "beta2": 0.98, + "eps": 1e-09, + "label_smoothing": 0.1, + "save_checkpoint_epochs": 5 +} diff --git a/layers/__init__.py b/layers/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..9a7b242ecd548671d7dfaf8ae08ab050f5862b8e --- /dev/null +++ b/layers/__init__.py @@ -0,0 +1 @@ +from layers.prototypes import * diff --git a/layers/prototypes.py b/layers/prototypes.py new file mode 100644 index 0000000000000000000000000000000000000000..121ffb3453ec5f024208777c61cf53cd34cb3f7d --- /dev/null +++ b/layers/prototypes.py @@ -0,0 +1,148 @@ +import torch +import torch.nn as nn +from torch.autograd import Variable +import torch.nn.functional as functional +import math +import logging + +class PositionalEncoder(nn.Module): + def __init__(self, d_model, max_seq_length=200, dropout=0.1): + super().__init__() + + self.d_model = d_model + self.dropout = nn.Dropout(dropout) + self._max_seq_length = max_seq_length + + pe = torch.zeros(max_seq_length, d_model) + + for pos in range(max_seq_length): + for i in range(0, d_model, 2): + pe[pos, i] = math.sin(pos/(10000**(2*i/d_model))) + pe[pos, i+1] = math.cos(pos/(10000**((2*i+1)/d_model))) + pe = pe.unsqueeze(0) + self.register_buffer('pe', pe) + + @torch.jit.script + def splice_by_size(source, target): + """Custom function to splice the source by target's second dimension. Required due to torch.Size not a torchTensor. Why? hell if I know.""" + length = target.size(1); + return source[:, :length] + + self.splice_by_size = splice_by_size + + def forward(self, x): + if(x.shape[1] > self._max_seq_length): + logging.warn("Input longer than maximum supported length for PE detected. 
Build a model with a larger input_max_length limit if you want to keep the input; or ignore if you want the input trimmed") + x = x[:, :self._max_seq_length] + + x = x * math.sqrt(self.d_model) + + spliced_pe = self.splice_by_size(self.pe, x) # self.pe[:, :x.shape[1]] +# pe = Variable(spliced_pe, requires_grad=False) + pe = spliced_pe.requires_grad_(False) + +# if x.is_cuda: # remove since it is a sub nn.Module +# pe.cuda() +# assert all([xs == ys for xs, ys in zip(x.shape[1:], pe.shape[1:])]), "{} - {}".format(x.shape, pe.shape) + + x = x + pe + x = self.dropout(x) + + return x + +class MultiHeadAttention(nn.Module): + def __init__(self, heads, d_model, dropout=0.1): + super().__init__() + assert d_model % heads == 0 + + self.d_model = d_model + self.d_k = d_model // heads + self.h = heads + + # three casting linear layer for query/key.value + self.q_linear = nn.Linear(d_model, d_model) + self.k_linear = nn.Linear(d_model, d_model) + self.v_linear = nn.Linear(d_model, d_model) + + self.dropout = nn.Dropout(dropout) + self.out = nn.Linear(d_model, d_model) + + def forward(self, q, k, v, mask=None): + """ + Args: + q / k / v: query/key/value, should all be [batch_size, sequence_length, d_model]. Only differ in decode attention, where q is tgt_len and k/v is src_len + mask: either [batch_size, 1, src_len] or [batch_size, tgt_len, tgt_len]. The last two dimensions must match or are broadcastable. + Returns: + the value of the attention process, [batch_size, sequence_length, d_model]. + The used attention, [batch_size, q_length, k_v_length] + """ + bs = q.shape[0] + q = self.q_linear(q).view(bs, -1, self.h, self.d_k) + k = self.k_linear(k).view(bs, -1, self.h, self.d_k) + v = self.v_linear(v).view(bs, -1, self.h, self.d_k) + + q = q.transpose(1, 2) + k = k.transpose(1, 2) + v = v.transpose(1, 2) + + value, attn = self.attention(q, k, v, mask, self.dropout) + concat = value.transpose(1, 2).contiguous().view(bs, -1, self.d_model) + output = self.out(concat) + return output, attn + + def attention(self, q, k, v, mask=None, dropout=None): + """Calculate the attention and output the attention & value + Args: + q / k / v: query/key/value already transformed, should all be [batch_size, heads, sequence_length, d_k]. Only differ in decode attention, where q is tgt_len and k/v is src_len + mask: either [batch_size, 1, src_len] or [batch_size, tgt_len, tgt_len]. The last two dimensions must match or are broadcastable. 
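+        Note: this is scaled dot-product attention, output = softmax(q @ k^T / sqrt(d_k)) @ v, where masked
+        positions are filled with -1e9 before the softmax and dropout (if given) is applied to the attention weights.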
+ Returns: + the attentionized but raw values [batch_size, head, seq_length, d_k] + the attention calculated [batch_size, heads, sequence_length, sequence_length] + """ + +# d_k = q.shape[-1] + scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k) + + if mask is not None: + mask = mask.unsqueeze(1) # add a dimension to account for head + scores = scores.masked_fill(mask==0, -1e9) + # softmax the padding/peeking masked attention + scores = functional.softmax(scores, dim=-1) + + if dropout is not None: + scores = dropout(scores) + + output = torch.matmul(scores, v) + return output, scores + +class Norm(nn.Module): + def __init__(self, d_model, eps = 1e-6): + super().__init__() + + self.size = d_model + + # create two learnable parameters to calibrate normalisation + self.alpha = nn.Parameter(torch.ones(self.size)) + self.bias = nn.Parameter(torch.zeros(self.size)) + + self.eps = eps + + def forward(self, x): + norm = self.alpha * (x - x.mean(dim=-1, keepdim=True)) \ + / (x.std(dim=-1, keepdim=True) + self.eps) + self.bias + return norm + +class FeedForward(nn.Module): + """A two-hidden-linear feedforward layer that can activate and dropout its transition state""" + def __init__(self, d_model, d_ff=2048, internal_activation=functional.relu, dropout=0.1): + super().__init__() + self.linear_1 = nn.Linear(d_model, d_ff) + self.dropout = nn.Dropout(dropout) + self.linear_2 = nn.Linear(d_ff, d_model) + + self.internal_activation = internal_activation + + def forward(self, x): + x = self.dropout(self.internal_activation(self.linear_1(x))) + x = self.linear_2(x) + return x diff --git a/models/__init__.py b/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..34317622b8b41554de2d4d84ca8818446e71abe8 --- /dev/null +++ b/models/__init__.py @@ -0,0 +1,7 @@ +from models.default import MockModel +from models.transformer import Transformer + +AvailableModels = { + "MockModel": MockModel, + "Transformer" : Transformer +} diff --git a/models/default.py b/models/default.py new file mode 100644 index 0000000000000000000000000000000000000000..40d0fb26f0930882fc199d9681cf4111749f37cc --- /dev/null +++ b/models/default.py @@ -0,0 +1,13 @@ +class MockModel: + """A model that only output string to show flow""" + def __init__(self, *args, **kwargs): + print("Mock model initialization, with args/kwargs: {} {}".format(args, kwargs)) + + def run_train(self, **kwargs): + print("Model in training, with args: {}".format(kwargs)) + + def run_eval(self, **kwargs): + print("Model in evaluation, with args: {}".format(kwargs)) + + def run_debug(self, **kwargs): + print("Model in debuging, with args: {}".format(kwargs)) diff --git a/models/transformer.py b/models/transformer.py new file mode 100644 index 0000000000000000000000000000000000000000..bf51032ce61406a7b466964069a868cb9342dc3b --- /dev/null +++ b/models/transformer.py @@ -0,0 +1,404 @@ +import torch +import torch.nn as nn +import torchtext.data as data +import copy, time, io +import numpy as np + +from modules.prototypes import Encoder, Decoder, Config as DefaultConfig +from modules.loader import DefaultLoader, MultiLoader +from modules.config import MultiplePathConfig as Config +from modules.inference import strategies +from modules import constants as const +from modules.optim import optimizers, ScheduledOptim + +import utils.save as saver +from utils.decode_old import create_masks, translate_sentence +#from utils.data import create_fields, create_dataset, read_data, read_file, write_file +from utils.loss import 
LabelSmoothingLoss +from utils.metric import bleu, bleu_batch_iter, bleu_single, bleu_batch +#from utils.save import load_model_from_path, check_model_in_path, save_and_clear_model, write_model_score, load_model_score, save_model_best_to_path, load_model + +class Transformer(nn.Module): + """ + Implementation of Transformer architecture based on the paper `Attention is all you need`. + Source: https://arxiv.org/abs/1706.03762 + """ + def __init__(self, mode=None, model_dir=None, config=None): + super().__init__() + + # Use specific config file if provided otherwise use the default config instead + self.config = DefaultConfig() if(config is None) else Config(config) + opt = self.config + self.device = opt.get('device', const.DEFAULT_DEVICE) + + if('train_data_location' in opt or 'train_data_location' in opt.get("data", {})): + # monolingual data detected + data_opt = opt if 'train_data_location' in opt else opt["data"] + self.loader = DefaultLoader(data_opt['train_data_location'], eval_path=data_opt.get('eval_data_location', None), language_tuple=(data_opt["src_lang"], data_opt["trg_lang"]), option=opt) + elif('data' in opt): + # multilingual data with multiple corpus in [data][train] namespace + self.loader = MultiLoader(opt["data"]["train"], valid=opt["data"].get("valid", None), option=opt) + # input fields + self.SRC, self.TRG = self.loader.build_field(lower=opt.get("lowercase", const.DEFAULT_LOWERCASE)) +# self.SRC = data.Field(lower=opt.get("lowercase", const.DEFAULT_LOWERCASE)) +# self.TRG = data.Field(lower=opt.get("lowercase", const.DEFAULT_LOWERCASE), eos_token='') + + # initialize dataset and by proxy the vocabulary + if(mode == "train"): + # training flow, necessitate the DataLoader and iterations. This will attempt to load vocab file from the dir instead of rebuilding, but can build a new vocab if no data is found + self.train_iter, self.valid_iter = self.loader.create_iterator(self.fields, model_path=model_dir) + elif(mode == "eval"): + # evaluation flow, which only require valid_iter + # TODO fix accordingly + self.train_iter, self.valid_iter = self.loader.create_iterator(self.fields, model_path=model_dir) + elif(mode == "infer"): + # inference, require pickled model and vocab in the path + self.loader.build_vocab(self.fields, model_path=model_dir) + else: + raise ValueError("Unknown model's mode: {}".format(mode)) + + + # define the model + src_vocab_size, trg_vocab_size = len(self.SRC.vocab), len(self.TRG.vocab) + d_model, N, heads, dropout = opt['d_model'], opt['n_layers'], opt['heads'], opt['dropout'] + # get the maximum amount of tokens per sample in encoder. This is useful due to PositionalEncoder requiring this value + train_ignore_length = self.config.get("train_max_length", const.DEFAULT_TRAIN_MAX_LENGTH) + input_max_length = self.config.get("input_max_length", const.DEFAULT_INPUT_MAX_LENGTH) + infer_max_length = self.config.get('max_length', const.DEFAULT_MAX_LENGTH) + encoder_max_length = max(input_max_length, train_ignore_length) + decoder_max_length = max(infer_max_length, train_ignore_length) + self.encoder = Encoder(src_vocab_size, d_model, N, heads, dropout, max_seq_length=encoder_max_length) + self.decoder = Decoder(trg_vocab_size, d_model, N, heads, dropout, max_seq_length=decoder_max_length) + self.out = nn.Linear(d_model, trg_vocab_size) + + # load the beamsearch obj with preset values read from config. 
ALWAYS require the current model, max_length, and device used as per DecodeStrategy base + decode_strategy_class = strategies[opt.get('decode_strategy', const.DEFAULT_DECODE_STRATEGY)] + decode_strategy_kwargs = opt.get('decode_strategy_kwargs', const.DEFAULT_STRATEGY_KWARGS) + self.decode_strategy = decode_strategy_class(self, infer_max_length, self.device, **decode_strategy_kwargs) + + self.to(self.device) + + def load_checkpoint(self, model_dir, checkpoint=None, checkpoint_idx=0): + """Attempt to load past checkpoint into the model. If a specified checkpoint is set, load it; otherwise load the latest checkpoint in model_dir. + Args: + model_dir: location of the current model. Not used if checkpoint is specified + checkpoint: location of the specific checkpoint to load + checkpoint_idx: the epoch of the checkpoint + NOTE: checkpoint_idx return -1 in the event of not found; while 0 is when checkpoint is forced + """ + if(checkpoint is not None): + saver.load_model(self, checkpoint) + self._checkpoint_idx = checkpoint_idx + else: + if model_dir is not None: + # load the latest available checkpoint, overriding the checkpoint value + checkpoint_idx = saver.check_model_in_path(model_dir) + if(checkpoint_idx > 0): + print("Found model with index {:d} already saved.".format(checkpoint_idx)) + saver.load_model_from_path(self, model_dir, checkpoint_idx=checkpoint_idx) + else: + print("No checkpoint found, start from beginning.") + checkpoint_idx = -1 + else: + print("No model_dir available, start from beginning.") + # train the model from begin + checkpoint_idx = -1 + self._checkpoint_idx = checkpoint_idx + + + def forward(self, src, trg, src_mask, trg_mask, output_attention=False): + """Run a full model with specified source-target batched set of data + Args: + src: the source input [batch_size, src_len] + trg: the target input (& expected output) [batch_size, trg len] + src_mask: the padding mask for src [batch_size, 1, src_len] + trg_mask: the triangle mask for trg [batch_size, trg_len, trg_len] + output_attention: if specified, output the attention as needed + Returns: + the logits (unsoftmaxed outputs), same shape as trg + """ + e_outputs = self.encoder(src, src_mask) + d_output, attn = self.decoder(trg, e_outputs, src_mask, trg_mask, output_attention=True) + output = self.out(d_output) + if(output_attention): + return output, attn + else: + return output + + def train_step(self, optimizer, batch, criterion): + """ + Perform one training step. + """ + self.train() + opt = self.config + + # move data to specific device's memory + src = batch.src.transpose(0, 1).to(opt.get('device', const.DEFAULT_DEVICE)) + trg = batch.trg.transpose(0, 1).to(opt.get('device', const.DEFAULT_DEVICE)) + + trg_input = trg[:, :-1] + src_pad = self.SRC.vocab.stoi[''] + trg_pad = self.TRG.vocab.stoi[''] + ys = trg[:, 1:].contiguous().view(-1) + + # create mask and perform network forward + src_mask, trg_mask = create_masks(src, trg_input, src_pad, trg_pad, opt.get('device', const.DEFAULT_DEVICE)) + preds = self(src, trg_input, src_mask, trg_mask) + + # perform backprogation + optimizer.zero_grad() + loss = criterion(preds.view(-1, preds.size(-1)), ys) + loss.backward() + optimizer.step_and_update_lr() + loss = loss.item() + + return loss + + def validate(self, valid_iter, criterion, maximum_length=None): + """Compute loss in validation dataset. 
As we can't perform trimming the input in the valid_iter yet, using a crutch in maximum_input_length variable + Args: + valid_iter: the Iteration containing batches of data, accessed by .src and .trg + criterion: the loss function to use to evaluate + maximum_length: if fed, a tuple of max_input_len, max_output_len to trim the src/trg + Returns: + the avg loss of the criterion + """ + self.eval() + opt = self.config + src_pad = self.SRC.vocab.stoi[''] + trg_pad = self.TRG.vocab.stoi[''] + + with torch.no_grad(): + total_loss = [] + for batch in valid_iter: + # load model into specific device (GPU/CPU) memory + src = batch.src.transpose(0, 1).to(opt.get('device', const.DEFAULT_DEVICE)) + trg = batch.trg.transpose(0, 1).to(opt.get('device', const.DEFAULT_DEVICE)) + if(maximum_length is not None): + src = src[:, :maximum_length[0]] + trg = trg[:, :maximum_length[1]-1] # using partials + trg_input = trg[:, :-1] + ys = trg[:, 1:].contiguous().view(-1) + + # create mask and perform network forward + src_mask, trg_mask = create_masks(src, trg_input, src_pad, trg_pad, opt.get('device', const.DEFAULT_DEVICE)) + preds = self(src, trg_input, src_mask, trg_mask) + + # compute loss on current batch + loss = criterion(preds.view(-1, preds.size(-1)), ys) + loss = loss.item() + total_loss.append(loss) + + avg_loss = np.mean(total_loss) + return avg_loss + + def translate_sentence(self, sentence, device=None, k=None, max_len=None, debug=False): + """ + Receive a sentence string and output the prediction generated from the model. + NOTE: sentence input is a list of tokens instead of string due to change in loader. See the current DefaultLoader for further details + """ + self.eval() + if(device is None): device = self.config.get('device', const.DEFAULT_DEVICE) + if(k is None): k = self.config.get('k', const.DEFAULT_K) + if(max_len is None): max_len = self.config.get('max_length', const.DEFAULT_MAX_LENGTH) + + # Get output from decode + translated_tokens = translate_sentence(sentence, self, self.SRC, self.TRG, device, k, max_len, debug=debug, output_list_of_tokens=True) + + return translated_tokens + + def translate_batch_sentence(self, sentences, src_lang=None, trg_lang=None, output_tokens=False, batch_size=None): + """Translate sentences by splitting them to batches and process them simultaneously + Args: + sentences: the sentences in a list. Must NOT have been tokenized (due to SRC preprocess) + output_tokens: if set, do not detokenize the output + batch_size: if specified, use the value; else use config ones + Returns: + a matching translated sentences list in [detokenized format using loader.detokenize | list of tokens] + """ + if(batch_size is None): + batch_size = self.config.get("eval_batch_size", const.DEFAULT_EVAL_BATCH_SIZE) + input_max_length = self.config.get("input_max_length", const.DEFAULT_INPUT_MAX_LENGTH) + self.eval() + + translated = [] + for b_idx in range(0, len(sentences), batch_size): + batch = sentences[b_idx: b_idx+batch_size] +# raise Exception(batch) + trans_batch = self.translate_batch(batch, trg_lang=trg_lang, output_tokens=output_tokens, input_max_length=input_max_length) +# raise Exception(detokenized_batch) + translated.extend(trans_batch) + # for line in trans_batch: + # print(line) + return translated + + def translate_batch(self, batch_sentences, src_lang=None, trg_lang=None, output_tokens=False, input_max_length=None): + """Translate a single batch of sentences. Split to aid serving + Args: + sentences: the sentences in a list. 
Must NOT have been tokenized (due to SRC preprocess) + src_lang/trg_lang: the language from src->trg. Used for multilingual models only. + output_tokens: if set, do not detokenize the output + Returns: + a matching translated sentences list in [detokenized format using loader.detokenize | list of tokens] + """ + if(input_max_length is None): + input_max_length = self.config.get("input_max_length", const.DEFAULT_INPUT_MAX_LENGTH) + translated_batch = self.decode_strategy.translate_batch(batch_sentences, trg_lang=trg_lang, src_size_limit=input_max_length, output_tokens=True, debug=False) + return self.loader.detokenize(translated_batch) if not output_tokens else translated_batch + + def run_train(self, model_dir=None, config=None): + opt = self.config + from utils.logging import init_logger + logging = init_logger(model_dir, opt.get('log_file_models')) + + trg_pad = self.TRG.vocab.stoi[''] + # load model into specific device (GPU/CPU) memory + logging.info("%s * src vocab size = %s"%(self.loader._language_tuple[0] ,len(self.SRC.vocab))) + logging.info("%s * tgt vocab size = %s"%(self.loader._language_tuple[1] ,len(self.TRG.vocab))) + logging.info("Building model...") + model = self.to(opt.get('device', const.DEFAULT_DEVICE)) + + checkpoint_idx = self._checkpoint_idx + if(checkpoint_idx < 0): + # initialize weights + print("Zero checkpoint detected, reinitialize the model") + for p in model.parameters(): + if p.dim() > 1: + nn.init.xavier_uniform_(p) + checkpoint_idx = 0 + + # also, load the scores of the best model + best_model_score = saver.load_model_score(model_dir) + + # set up optimizer + optim_algo = opt["optimizer"] + lr = opt["lr"] + d_model = opt["d_model"] + n_warmup_steps = opt["n_warmup_steps"] + optimizer_params = opt.get("optimizer_params", dict({})) + + if optim_algo not in optimizers: + raise ValueError("Unknown optimizer: {}".format(optim_algo)) + + optimizer = ScheduledOptim( + optimizer=optimizers.get(optim_algo)(model.parameters(), **optimizer_params), + init_lr=lr, + d_model=d_model, + n_warmup_steps=n_warmup_steps + ) + + # define loss function + criterion = LabelSmoothingLoss(len(self.TRG.vocab), padding_idx=trg_pad, smoothing=opt['label_smoothing']) + +# valid_src_data, valid_trg_data = self.loader._eval_data +# raise Exception("Initial bleu: %.3f" % bleu_batch_iter(self, self.valid_iter, debug=True)) + logging.info(self) + model_encoder_parameters = filter(lambda p: p.requires_grad, self.encoder.parameters()) + model_decoder_parameters = filter(lambda p: p.requires_grad, self.decoder.parameters()) + params_encode = sum([np.prod(p.size()) for p in model_encoder_parameters]) + params_decode = sum([np.prod(p.size()) for p in model_decoder_parameters]) + + logging.info("Encoder: %s"%(params_encode)) + logging.info("Decoder: %s"%(params_decode)) + logging.info("* Number of parameters: %s"%(params_encode+params_decode)) + logging.info("Starting training on %s"%(opt.get('device', const.DEFAULT_DEVICE))) + + for epoch in range(checkpoint_idx, opt['epochs']): + total_loss = 0.0 + + s = time.time() + for i, batch in enumerate(self.train_iter): + loss = self.train_step(optimizer, batch, criterion) + total_loss += loss + + # print training loss after every {printevery} steps + if (i + 1) % opt['printevery'] == 0: + avg_loss = total_loss / opt['printevery'] + et = time.time() - s + # print('epoch: {:03d} - iter: {:05d} - train loss: {:.4f} - time elapsed/per batch: {:.4f} {:.4f}'.format(epoch, i+1, avg_loss, et, et / opt['printevery'])) + logging.info('epoch: {:03d} - iter: 
{:05d} - train loss: {:.4f} - time elapsed/per batch: {:.4f} {:.4f}'.format(epoch, i+1, avg_loss, et, et / opt['printevery'])) + total_loss = 0 + s = time.time() + + # bleu calculation and evaluate, save checkpoint for every {save_checkpoint_epochs} epochs + s = time.time() + valid_loss = self.validate(self.valid_iter, criterion, maximum_length=(self.encoder._max_seq_length, self.decoder._max_seq_length)) + if (epoch+1) % opt['save_checkpoint_epochs'] == 0 and model_dir is not None: + + # evaluate loss and bleu score on validation dataset for each epoch +# bleuscore = bleu(valid_src_data, valid_trg_data, model, opt.get('device', const.DEFAULT_DEVICE), opt['k'], opt['max_strlen']) +# bleuscore = bleu_single(self, self.loader._eval_data) +# bleuscore = bleu_batch(self, self.loader._eval_data, batch_size=opt.get('eval_batch_size', const.DEFAULT_EVAL_BATCH_SIZE)) + valid_src_lang, valid_trg_lang = self.loader.language_tuple + bleuscore = bleu_batch_iter(self, self.valid_iter, src_lang=valid_src_lang, trg_lang=valid_trg_lang) + +# save_model_to_path(model, model_dir, checkpoint_idx=epoch+1) + saver.save_and_clear_model(model, model_dir, checkpoint_idx=epoch+1, maximum_saved_model=opt.get('maximum_saved_model_train', const.DEFAULT_NUM_KEEP_MODEL_TRAIN)) + # keep the best models per every bleu calculation + best_model_score = saver.save_model_best_to_path(model, model_dir, best_model_score, bleuscore, maximum_saved_model=opt.get('maximum_saved_model_eval', const.DEFAULT_NUM_KEEP_MODEL_TRAIN)) + # print('epoch: {:03d} - iter: {:05d} - valid loss: {:.4f} - bleu score: {:.4f} - full evaluation time: {:.4f}'.format(epoch, i, valid_loss, bleuscore, time.time() - s)) + logging.info('epoch: {:03d} - iter: {:05d} - valid loss: {:.4f} - bleu score: {:.4f} - full evaluation time: {:.4f}'.format(epoch, i, valid_loss, bleuscore, time.time() - s)) + else: + # print('epoch: {:03d} - iter: {:05d} - valid loss: {:.4f} - validation time: {:.4f}'.format(epoch, i, valid_loss, time.time() - s)) + logging.info('epoch: {:03d} - iter: {:05d} - valid loss: {:.4f} - validation time: {:.4f}'.format(epoch, i, valid_loss, time.time() - s)) + + def run_infer(self, features_file, predictions_file, src_lang=None, trg_lang=None, config=None, batch_size=None): + opt = self.config + # load model into specific device (GPU/CPU) memory + model = self.to(opt.get('device', const.DEFAULT_DEVICE)) + + # Read inference file + print("Reading features file from {}...".format(features_file)) + with io.open(features_file, "r", encoding="utf-8") as read_file: + inputs = [l.strip() for l in read_file.readlines()] + + print("Performing inference ...") + # Append each translated sentence line by line +# results = "\n".join([model.loader.detokenize(model.translate_sentence(sentence)) for sentence in inputs]) + # Translate by batched versions + start = time.time() + results = "\n".join( self.translate_batch_sentence(inputs, src_lang=src_lang, trg_lang=trg_lang, output_tokens=False, batch_size=batch_size)) + print("Inference done, cost {:.2f} secs.".format(time.time() - start)) + + # Write results to system file + print("Writing results to {} ...".format(predictions_file)) + with io.open(predictions_file, "w", encoding="utf-8") as write_file: + write_file.write(results) + + print("All done!") + + def encode(self, *args, **kwargs): + return self.encoder(*args, **kwargs) + + def decode(self, *args, **kwargs): + return self.decoder(*args, **kwargs) + + def to_logits(self, inputs): # function to include the logits. 
TODO use this in inference fns as well + return self.out(inputs) + + def prepare_serve(self, serve_path, model_dir=None, check_trace=True, **kwargs): + self.eval() + """Run to prepare for serving.""" + saver.save_model_name(type(self).__name__, model_dir) +# return +# raise NotImplementedError("trace_module currently not supported") + # jit to convert model to ScriptModule. + # create junk arguments for necessary modules + fake_batch, fake_srclen, fake_trglen, fake_range = 3, 7, 4, 1000 + sample_src, sample_trg = torch.randint(fake_range, (fake_batch, fake_srclen), dtype=torch.long), torch.randint(fake_range, (fake_batch, fake_trglen), dtype=torch.long) + sample_src_mask, sample_trg_mask = torch.rand(fake_batch, 1, fake_srclen) > 0.5, torch.rand(fake_batch, fake_trglen, fake_trglen) > 0.5 + sample_src, sample_trg, sample_src_mask, sample_trg_mask = [t.to(self.device) for t in [sample_src, sample_trg, sample_src_mask, sample_trg_mask]] + sample_encoded = self.encode(sample_src, sample_src_mask) + sample_before_logits = self.decode(sample_trg, sample_encoded, sample_src_mask, sample_trg_mask) + # bundle within dictionary + needed_fn = {'forward': (sample_src, sample_trg, sample_src_mask, sample_trg_mask), "encode": (sample_src, sample_src_mask), "decode": (sample_trg, sample_encoded, sample_src_mask, sample_trg_mask), "to_logits": sample_before_logits} + # create the ScriptModule. Currently disabling deterministic check + traced_model = torch.jit.trace_module(self, needed_fn, check_trace=check_trace) + # save it down + torch.jit.save(traced_model, serve_path) + return serve_path + + + @property + def fields(self): + return (self.SRC, self.TRG) diff --git a/modules/__init__.py b/modules/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..b585af14223e8bedac8384cc0eb66f879a5be37a --- /dev/null +++ b/modules/__init__.py @@ -0,0 +1,3 @@ +from modules.default import * +from modules.prototypes import Decoder, Encoder +from modules.config import Config diff --git a/modules/config.py b/modules/config.py new file mode 100644 index 0000000000000000000000000000000000000000..4489189dc7426c710ec3064b65cd9aa1c07b7262 --- /dev/null +++ b/modules/config.py @@ -0,0 +1,62 @@ +import yaml, json +import os, io + +def extension_check(pth): + ext = os.path.splitext(pth)[-1] + return any( ext == valid_ext for valid_ext in [".json", ".yaml", ".yml"]) + +def find_all_config(directory): + return [os.path.join(directory, f) for f in os.listdir(directory) if extension_check(f)] + +class Config(dict): + def __init__(self, path=None, **elements): + """Initiate a config object, where specified elements override the default config loaded""" + super(Config, self).__init__(self._try_load_path(path)) + self.update(**elements) + + def _load_json(self, json_path): + with io.open(json_path, "r", encoding="utf-8") as jf: + return json.load(jf) + + def _load_yaml(self, yaml_path): + with io.open(yaml_path, "r", encoding="utf-8") as yf: + return yaml.safe_load(yf.read()) + + def _try_load_path(self, path): + assert isinstance(path, str), "Basic Config class can only support a single file path (str), but instead is {}({})".format(path, type(path)) + assert os.path.isfile(path), "Config file {:s} does not exist".format(path) + extension = os.path.splitext(path)[-1] + if(extension == ".json"): + return self._load_json(path) + elif(extension == ".yml" or extension == ".yaml"): + return self._load_yaml(path) + else: + raise ValueError("Unrecognized extension ({:s}) from file {:s}".format(extension, path)) + + 
@property + def opt(self): + """Backward compatibility to original. Remove once finished.""" + return self + +class MultiplePathConfig(Config): + def _try_load_path(self, paths): + """Update to support multiple paths.""" + if(isinstance(paths, list)): + print("Loaded path is a list of locations. Load in the order received, overriding and merging as needed.") + result = {} + for pth in paths: + self._recursive_update(result, super(MultiplePathConfig, self)._try_load_path(pth)) + return result + else: + return super(MultiplePathConfig, self)._try_load_path(paths) + + def _recursive_update(self, orig, new): + """Instead of overriding dicts, merge them recursively.""" +# print(orig, new) + for k, v in new.items(): + if(k in orig and isinstance(orig[k], dict)): + assert isinstance(v, dict), "Mismatching config with key {}: {} - {}".format(k, orig[k], v) + orig[k] = self._recursive_update(orig[k], v) + else: + orig[k] = v; + return orig diff --git a/modules/constants.py b/modules/constants.py new file mode 100644 index 0000000000000000000000000000000000000000..9f78aaae174901073f63ae9b853d8c7fcc2a90aa --- /dev/null +++ b/modules/constants.py @@ -0,0 +1,18 @@ +# DESIGNATE constants values for config +DEFAULT_DECODE_STRATEGY = "BeamSearch" +DEFAULT_STRATEGY_KWARGS = {} +DEFAULT_SEED = 101 +DEFAULT_BATCH_SIZE = 64 +DEFAULT_EVAL_BATCH_SIZE = 8 +DEFAULT_TRAIN_TEST_SPLIT = 0.8 +DEFAULT_DEVICE = "cpu" +DEFAULT_K = 5 +DEFAULT_INPUT_MAX_LENGTH = 200 +DEFAULT_MAX_LENGTH = 150 +DEFAULT_TRAIN_MAX_LENGTH = 100 +DEFAULT_LOWERCASE = True +DEFAULT_NUM_KEEP_MODEL_TRAIN = 5 +DEFAULT_NUM_KEEP_MODEL_BEST = 5 +DEFAULT_SOS = "" +DEFAULT_EOS = "" +DEFAULT_PAD = "" diff --git a/modules/default.py b/modules/default.py new file mode 100644 index 0000000000000000000000000000000000000000..47c6df6ba46917adc2e86ef16382ed1a7d1045e9 --- /dev/null +++ b/modules/default.py @@ -0,0 +1,54 @@ +class MockLoader: + def __init__(self, *args, **kwargs): + """Only print stuff""" + print("MockLoader initialized, args/kwargs {} {}".format(args, kwargs)) + + def tokenize(self, inputs, **kwargs): + print("MockLoader tokenize called, inputs/kwargs {} {}".format(inputs, kwargs)) + return inputs + + def detokenize(self, inputs, **kwargs): + print("MockLoader detokenize called, inputs/kwargs {} {}".format(inputs, kwargs)) + return inputs + + def reverse_lookup(self, inputs, **kwargs): + print("MockLoader reverse_lookup called, inputs/kwargs {} {}".format(inputs, kwargs)) + return inputs + + def lookup(self, inputs, **kwargs): + print("MockLoader lookup called, inputs/kwargs {} {}".format(inputs, kwargs)) + return inputs + + def embed(self, inputs, **kwargs): + print("MockLoader embed called, inputs/kwargs {} {}".format(inputs, kwargs)) + return inputs + +class MockEncoder: + def __init__(self, *args, **kwargs): + """Only print stuff""" + print("MockEncoder initialized, args/kwargs {} {}".format(args, kwargs)) + + def encode(self, inputs, **kwargs): + print("MockEncoder encode called, inputs/kwargs {} {}".format(inputs, kwargs)) + return inputs + + def __call__(self, inputs, num_layers=3, **kwargs): + print("MockEncoder __call__ called, inputs/num_layers/kwargs {} {} {}".format(inputs, num_layers, kwargs)) + for i in range(num_layers): + inputs = encode(inputs, **kwargs) + return inputs + +class MockDecoder: + def __init__(self, *args, **kwargs): + """Only print stuff""" + print("MockDecoder initialized, args/kwargs {} {}".format(args, kwargs)) + + def decode(self, inputs, **kwargs): + print("MockDecoder decode called, inputs/kwargs {} 
{}".format(inputs, kwargs)) + return inputs + + def __call__(self, inputs, num_layers=3, **kwargs): + print("MockDecoder __call__ called, inputs/num_layers/kwargs {} {} {}".format(inputs, num_layers, kwargs)) + for i in range(num_layers): + inputs = decode(inputs, **kwargs) + return inputs diff --git a/modules/inference/__init__.py b/modules/inference/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..d6899cda01b0a4373d44512d5aee63e43caa9c7e --- /dev/null +++ b/modules/inference/__init__.py @@ -0,0 +1,10 @@ +from modules.inference.decode_strategy import DecodeStrategy +from modules.inference.beam_search import BeamSearch +from modules.inference.prototypes import BeamSearch2 +from modules.inference.sampling_temperature import GreedySearch + +strategies = { + "BeamSearch": BeamSearch, + "BeamSearch2": BeamSearch2, + "GreedySearch": GreedySearch +} diff --git a/modules/inference/__pycache__/__init__.cpython-36.pyc b/modules/inference/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..2f8917a8ade9fa5d8b144a394560a02ce4968ca6 Binary files /dev/null and b/modules/inference/__pycache__/__init__.cpython-36.pyc differ diff --git a/modules/inference/__pycache__/beam_search.cpython-36.pyc b/modules/inference/__pycache__/beam_search.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..2edb23d1cbfbe5743a1beb12dd1ef340c633063b Binary files /dev/null and b/modules/inference/__pycache__/beam_search.cpython-36.pyc differ diff --git a/modules/inference/__pycache__/decode_strategy.cpython-36.pyc b/modules/inference/__pycache__/decode_strategy.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..f53bbbc63717dadec0c0d77e0436ae0582068736 Binary files /dev/null and b/modules/inference/__pycache__/decode_strategy.cpython-36.pyc differ diff --git a/modules/inference/__pycache__/prototypes.cpython-36.pyc b/modules/inference/__pycache__/prototypes.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..7fed5693a0841e92183d70ab0a2dd92ccb595f4b Binary files /dev/null and b/modules/inference/__pycache__/prototypes.cpython-36.pyc differ diff --git a/modules/inference/__pycache__/sampling_temperature.cpython-36.pyc b/modules/inference/__pycache__/sampling_temperature.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..ed50b965c004a72e3d32772fba5e5313695b5b5c Binary files /dev/null and b/modules/inference/__pycache__/sampling_temperature.cpython-36.pyc differ diff --git a/modules/inference/beam_search.py b/modules/inference/beam_search.py new file mode 100644 index 0000000000000000000000000000000000000000..f9cd7986f5a484dd8adb934ebd77f56cc9142a1a --- /dev/null +++ b/modules/inference/beam_search.py @@ -0,0 +1,336 @@ +import numpy as np +import torch +import math, time, operator +import torch.nn.functional as functional +import torch.nn as nn +import logging +from torch.autograd import Variable +from torch.nn.utils.rnn import pad_sequence + +from modules.inference.decode_strategy import DecodeStrategy +import modules.constants as const +from utils.misc import no_peeking_mask +from utils.data import generate_language_token + +class BeamSearch(DecodeStrategy): + def __init__(self, model, max_len, device, beam_size=5, use_synonym_fn=False, replace_unk=None, length_normalize=None): + """ + Args: + model: the used model + max_len: the maximum timestep to be used + device: the device to perform calculation + beam_size: the 
size of the beam itself + use_synonym_fn: if set, use the get_synonym fn from wordnet to try replace + replace_unk: a tuple of [layer, head] designation, to replace the unknown word by chosen attention + """ + super(BeamSearch, self).__init__(model, max_len, device) + self.beam_size = beam_size + self._use_synonym = use_synonym_fn + self._replace_unk = replace_unk + self._length_norm = length_normalize + + def init_vars(self, src, start_token=const.DEFAULT_SOS): + """ + Calculate the required matrices during translation after the model is finished + Input: + :param src: The batch of sentences + + Output: Initialize the first character includes outputs, e_outputs, log_scores + """ + model = self.model + batch_size = len(src) + row_b = self.beam_size * batch_size + + init_tok = self.TRG.vocab.stoi[start_token] + src_mask = (src != self.SRC.vocab.stoi['']).unsqueeze(-2).to(self.device) + src = src.to(self.device) + + # Encoder +# raise Exception(src.shape, src_mask.shape) + e_output = model.encode(src, src_mask) + outputs = torch.LongTensor([[init_tok] for i in range(batch_size)]) + outputs = outputs.to(self.device) + trg_mask = no_peeking_mask(1, self.device) + + # Decoder + out = model.to_logits(model.decode(outputs, e_output, src_mask, trg_mask)) + out = functional.softmax(out, dim=-1) + probs, ix = out[:, -1].data.topk(self.beam_size) + + log_scores = torch.Tensor([math.log(p) for p in probs.data.view(-1)]).view(-1, 1) + + outputs = torch.zeros(row_b, self.max_len).long() + outputs = outputs.to(self.device) + outputs[:, 0] = init_tok + outputs[:, 1] = ix.view(-1) + + e_outputs = torch.repeat_interleave(e_output, self.beam_size, 0) + +# raise Exception(outputs[:, :2], e_outputs) + + return outputs, e_outputs, log_scores + + def compute_k_best(self, outputs, out, log_scores, i, debug=False): + """ + Compute k words with the highest conditional probability + Args: + outputs: Array has k previous candidate output sequences. [batch_size*beam_size, max_len] + i: the current timestep to execute. Int + out: current output of the model at timestep. 
[batch_size*beam_size, vocab_size] + log_scores: Conditional probability of past candidates (in outputs) [batch_size * beam_size] + + Returns: + new outputs has k best candidate output sequences + log_scores for each of those candidate + """ + row_b = len(out); batch_size = row_b // self.beam_size + eos_id = self.TRG.vocab.stoi[''] + + probs, ix = out[:, -1].data.topk(self.beam_size) + + probs_rep = torch.Tensor([[1] + [1e-100] * (self.beam_size-1)]*row_b).view(row_b, self.beam_size).to(self.device) + ix_rep = torch.LongTensor([[eos_id] + [-1]*(self.beam_size-1)]*row_b).view(row_b, self.beam_size).to(self.device) + + check_eos = torch.repeat_interleave((outputs[:, i-1] == eos_id).view(row_b, 1), self.beam_size, 1) + + probs = torch.where(check_eos, probs_rep, probs) + ix = torch.where(check_eos, ix_rep, ix) + +# if(debug): +# print("kprobs before debug: ", probs, probs_rep, ix, ix_rep, log_scores) + + log_probs = torch.log(probs).to(self.device) + log_scores.to(self.device) # CPU + + k_probs, k_ix = log_probs.view(batch_size, -1).topk(self.beam_size) + if(debug): + print("kprobs_after_select: ", log_probs, k_probs, k_ix) + + # Use cpu + k_probs, k_ix = torch.Tensor(k_probs.cpu().data.numpy()), torch.LongTensor(k_ix.cpu().data.numpy()) + row = k_ix // self.beam_size + torch.LongTensor([[v*self.beam_size] for v in range(batch_size)]) + col = k_ix % self.beam_size + if(debug): + print("kprobs row/col", row, col, ix[row.view(-1), col.view(-1)]) + assert False + + outputs[:, :i] = outputs[row.view(-1), :i] + outputs[:, i] = ix[row.view(-1), col.view(-1)] + log_scores = k_probs.view(-1, 1) + + return outputs, log_scores + + def replace_unknown(self, outputs, sentences, attn, selector_tuple, unknown_token=""): + """Replace the unknown words in the outputs with the highest valued attentionized words. + Args: + outputs: the output from decoding. [batch, beam] of list of str + sentences: the original wordings of the sentences. [batch_size, src_len] of str + attn: the attention received, in the form of list: [layers units of (self-attention, attention) with shapes of [batchbeam, heads, tgt_len, tgt_len] & [batchbeam, heads, tgt_len, src_len] respectively] + selector_tuple: (layer, head) used to select the attention + unknown_token: token used for checking. str + Returns: + the replaced version, in the same shape as outputs + """ + +# is_finished = torch.LongTensor([[self.TRG.vocab.stoi['']] for i in range(self.beam_offset)]).view(-1).to(self.device) +# unk_token = self.SRC.vocab.stoi[''] + layer_used, head_used = selector_tuple + used_attention = attn[layer_used][-1][:, head_used] # it should be [batchbeam, tgt_len, src_len], as we are using the attention to source + flattened_outputs = outputs.reshape((-1, )) # flatten the outputs back to batchbeam + + select_id_src = torch.argmax(used_attention, dim=-1).cpu().numpy() # [batchbeam, tgt_len] of best indices. Also convert to numpy version (remove sos not needed as it is attention of outputs) + beam_size = select_id_src.shape[0] // len(sentences) # used custom-calculated beam_size as we might not output the entirety of beams. See beam_search fn for details + # select per batchbeam. 
source batch id is found by dividing batchbeam id per beam; we are selecting [tgt_len] indices from [src_len] tokens; then concat at the first dimensions to retrieve [batch_beam, tgt_len] of replacement tokens + # need itemgetter / map to retrieve from list + replace_tokens = [ operator.itemgetter(*src_idx)(sentences[bidx // beam_size]) for bidx, src_idx in enumerate(select_id_src)] + + # zip together with sentences; then output { the token if not unk / the replacement if is }. Note that this will trim the orig version down to repl size. + zipped = zip(flattened_outputs, replace_tokens) + replaced = np.array([[tok if tok != unknown_token else rpl for rpl, tok in zip(repl, orig)] for orig, repl in zipped], dtype=object) + # reshape back to outputs shape [batch, beam] of list + return replaced.reshape(outputs.shape) + +# for i in range(1, self.max_len): +# ix = attn[0, 0, i-1, :].argmax().data +# outputs[:, i][outputs[:, i] == unk_token] = sentences[0][ix.data] +# if torch.equal(outputs[:, i], is_finished): +# break +# +# return outputs + + def beam_search(self, src, src_lang=None, trg_lang=None, src_tokens=None, n_best=1, length_norm=None, replace_unk=None, debug=False): + """ + Beam search select k words with the highest conditional probability + to be the first word of the k candidate output sequences. + Args: + src: The batch of sentences, already in [batch_size, tokens] of int + src_tokens: src in str version, same size as above. Used almost exclusively for replace unknown word + n_best: number of usable values per beam loaded + length_norm: if specified, normalize as per (Wu, 2016); note that if not inputted then it will still use __init__ value as default. float + replace_unk: if specified, do replace unknown word using attention of (layer, head); note that if not inputted, it will still use __init__ value as default. (int, int) + debug: if true, print some debug information during the search + Return: + An array of translated sentences, in list-of-tokens format. + Either [batch_size, n_best, tgt_len] when n_best > 1 + Or [batch_size, tgt_len] when n_best == 1 + """ + model = self.model + start_token = const.DEFAULT_SOS if trg_lang is None else generate_language_token(trg_lang) + outputs, e_outputs, log_scores = self.init_vars(src, start_token=start_token) + + eos_tok = self.TRG.vocab.stoi[const.DEFAULT_EOS] + src_mask = (src != self.SRC.vocab.stoi[const.DEFAULT_PAD]).unsqueeze(-2) + src_mask = torch.repeat_interleave(src_mask, self.beam_size, 0).to(self.device) + is_finished = torch.LongTensor([[eos_tok] for i in range(self.beam_size*len(src))]).view(-1).to(self.device) + ind = None + for i in range(2, self.max_len): + trg_mask = no_peeking_mask(i, self.device) + + decoder_output, attn = model.decoder(outputs[:, :i], e_outputs, src_mask, trg_mask, output_attention=True) + out = model.out(decoder_output) + out = functional.softmax(out, dim=-1) + outputs, log_scores = self.compute_k_best(outputs, out, log_scores, i) + + # Occurrences of end symbols for all input sentences. 
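As a self-contained illustration of the `replace_unknown` logic above (an argmax over the source axis of one attention head, then token swapping), here is a minimal sketch with invented toy tokens and attention weights, using `<unk>` as the placeholder token:

```python
import torch

# Toy data: one hypothesis of 4 target tokens attending over 5 source tokens (one head).
src_tokens = ["the", "Mekong", "delta", "is", "vast"]
hyp_tokens = ["<unk>", "đồng", "bằng", "<unk>"]
attn = torch.tensor([[0.70, 0.10, 0.10, 0.05, 0.05],
                     [0.10, 0.20, 0.60, 0.05, 0.05],
                     [0.10, 0.10, 0.70, 0.05, 0.05],
                     [0.05, 0.05, 0.10, 0.10, 0.70]])   # [tgt_len, src_len]

best_src = torch.argmax(attn, dim=-1).tolist()           # best source position per target token
replaced = [src_tokens[best_src[i]] if tok == "<unk>" else tok
            for i, tok in enumerate(hyp_tokens)]
print(replaced)   # ['the', 'đồng', 'bằng', 'vast']
```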
+ if torch.equal(outputs[:, i], is_finished): + break + + +# if(self._replace_unk): +# outputs = self.replace_unknown(attn, src, outputs) + + # reshape outputs and log_probs to [batch, beam] numpy array + batch_size = src.shape[0] + outputs = outputs.cpu().numpy().reshape((batch_size, self.beam_size, self.max_len)) + log_scores = log_scores.cpu().numpy().reshape((batch_size, self.beam_size)) + + # Get the best sentences for every beam: splice by length and itos the indices, result in an array of tokens + # also remove the first token in this timestep (as it is sos) + translated_sentences = np.empty(outputs.shape[:-1], dtype=object) + trim_and_itos = lambda sent: [self.TRG.vocab.itos[i] for i in sent[1:self._length(sent, eos_tok=eos_tok)]] + for ba in range(outputs.shape[0]): + for bm in range(outputs.shape[1]): + translated_sentences[ba, bm] = trim_and_itos(outputs[ba, bm]) +# raise ValueError(translated_sentences) + #translated_sentences = np.apply_along_axis(lambda sent: tuple(sent.tolist()[:self._length(sent, eos_tok=eos_tok)]), -1, outputs) + #translated_sentences = np.vectorize(lambda sent: [self.TRG.vocab.itos[i] for i in sent])(translated_sentences) + if(replace_unk is None): + replace_unk = self._replace_unk + if(replace_unk): + # replace unknown words per translated sentences. Do it before normalization (since that is independent on actual tokens) + if(src_tokens is None): + logging.warn("replace_unknown option enabled but no src_tokens supplied for the task. The method will not run.") + else: + translated_sentences = self.replace_unknown(translated_sentences, src_tokens, attn, replace_unk) + + if(length_norm is None): + length_norm = self._length_norm + if(length_norm is not None): +# raise ValueError(length_norm) + # perform length normalization calculation and reorganize the sentences accordingly + lengths = np.apply_along_axis(lambda x: self._length(x, eos_tok=eos_tok), -1, outputs) + log_scores, indices = self.length_normalize(lengths, log_scores, coff=length_norm) + translated_sentences = np.array([beams[ids] for beams, ids in zip(translated_sentences, indices)]) +# outputs = np.array([beams[ids] for beams, ids in zip(outputs, indices)]) + +# assert n_best == 1, "Currently unsupported n_best > 1. TODO write." + if(n_best == 1): + return translated_sentences[:, 0] + else: + return translated_sentences[:, :n_best] + + def translate_single_sentence(self, src, **kwargs): + """Translate a single sentence. Currently unused.""" + raise NotImplementedError + return self.translate_batch_sentence([src], **kwargs) + + def translate_batch_sentence(self, src, src_lang=None, trg_lang=None, field_processed=False, src_size_limit=None, output_tokens=False, replace_unk=None, debug=False): + """Translate a batch of sentences together. Currently disabling the synonym func. + Args: + src: the batch of sentences to be translated. list of str + src_lang: the language translated from. Only used with multilingual models, in preprocess. str + trg_lang: the language to be translated to. Only used with multilingual models, in beam_search. str + field_processed: bool, if the sentences had been already processed (i.e part of batched validation data) + src_size_limit: if set, trim the input if it cross this value. Added due to current positional encoding support only <=200 tokens + output_tokens: the output format. False will give a batch of sentences (str), while True will give batch of tokens (list of str) + replace_unk: see beam_search for usage. 
(int, int) or False to suppress __init__ value + debug: enable to print external values + Return: + the result of translation, with format dictated by output_tokens + """ + self.model.eval() + # create the indiced batch. + processed_batch = self.preprocess_batch(src, src_lang=src_lang, field_processed=field_processed, src_size_limit=src_size_limit, output_tokens=True, debug=debug) + sent_ids, sent_tokens = (processed_batch, None) if(field_processed) else processed_batch + assert isinstance(sent_ids, torch.Tensor), "sent_ids is instead {}".format(type(sent_ids)) + + batch_start = time.time() + translated_sentences = self.beam_search(sent_ids, trg_lang=trg_lang, src_tokens=sent_tokens, replace_unk=replace_unk, debug=debug) + if(debug): + print("Time performed for batch {}: {:.2f}s".format(sent_ids.shape, time.time() - batch_start)) + + if(not output_tokens): + translated_sentences = [' '.join(tokens) for tokens in translated_sentences] + + return translated_sentences + + def preprocess_batch(self, sentences, src_lang=None, field_processed=False, pad_token="", src_size_limit=None, output_tokens=False, debug=True): + """Adding + src_size_limit: int, option to limit the length of src. + src_lang: if specified (not None), append this token <{src_lang}> to the start of the batch + field_processed: bool: if the sentences had been already processed (i.e part of batched validation data) + output_tokens: if set, output a token version aside the id version, in [batch of [src_len]] str. Note that it won't work with field_processed + """ + if(field_processed): + # do nothing, as it had already performed tokenizing/stoi. + # Still cap the length of the batch due to possible infraction in valid + if(src_size_limit is not None): + sentences = sentences[:, :src_size_limit] + return sentences + processed_sent = map(self.SRC.preprocess, sentences) + if(src_lang is not None): + src_token = generate_language_token(src_lang) + processed_sent = map(lambda x: [src_token] + x, processed_sent) + if(src_size_limit): + processed_sent = map(lambda x: x[:src_size_limit], processed_sent) + processed_sent = list(processed_sent) + tokenized_sent = [torch.LongTensor([self._token_to_index(t) for t in s]) for s in processed_sent] # convert to tensors, in indices format + sentences = Variable(pad_sequence(tokenized_sent, True, padding_value=self.SRC.vocab.stoi[pad_token])) # padding sentences + if(debug): + print("Input batch after process: ", sentences.shape, sentences) + + if(output_tokens): + return sentences, processed_sent + else: + return sentences + + def translate_batch(self, sentences, **kwargs): + return self.translate_batch_sentence(sentences, **kwargs) + + def length_normalize(self, lengths, log_probs, coff=0.6): + """Normalize the probabilty score as in (Wu 2016). Use pure numpy values + Args: + lengths: the length of the hypothesis. [batch, beam] of int->float + log_probs: the unchanged log probability for the whole hypothesis. [batch, beam] of float + coff: the alpha coefficient. 
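The `length_normalize` method here applies the GNMT length penalty of Wu et al. (2016), lp(Y) = ((5 + |Y|) / 6) ** alpha, dividing each beam's accumulated log-probability by it before re-ranking. A small numeric sketch with invented beam scores:

```python
import numpy as np

def gnmt_length_penalty(length, alpha=0.6):
    # lp(Y) = ((5 + |Y|) / 6) ** alpha, from Wu et al. (2016)
    return ((5.0 + length) / 6.0) ** alpha

# (raw log-probability, hypothesis length) for three beams of one sentence
beams = [(-4.2, 6), (-5.1, 12), (-4.8, 10)]
scores = [lp / gnmt_length_penalty(n) for lp, n in beams]
print(scores, "-> best beam:", int(np.argmax(scores)))   # normalization keeps longer hypotheses competitive
```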
+ Returns: + Tuple of (penalized_values, indices) to reorganize outputs.""" + lengths = ((lengths + 5) / 6) ** coff + penalized_probs = log_probs / lengths + indices = np.argsort(penalized_probs, axis=-1)[::-1] + # basically take log_probs values for every batch + reorganized_probs = np.array([prb[ids] for prb, ids in zip(penalized_probs, indices)]) + return reorganized_probs, indices + + def _length(self, tokens, eos_tok=None): + """Retrieve the first location of eos_tok as length; else return the entire length""" + if(eos_tok is None): + eos_tok = self.TRG.vocab.stoi[const.DEFAULT_EOS] + eos, = np.nonzero(tokens==eos_tok) + return len(tokens) if len(eos) == 0 else eos[0] + + def _token_to_index(self, tok): + """Override to select, depending on the self._use_synonym param""" + if(self._use_synonym): + return super(BeamSearch, self)._token_to_index(tok) + else: + return self.SRC.vocab.stoi[tok] diff --git a/modules/inference/beam_search1.py b/modules/inference/beam_search1.py new file mode 100644 index 0000000000000000000000000000000000000000..a64a14adb47eaba8e3115645a73edf569582a709 --- /dev/null +++ b/modules/inference/beam_search1.py @@ -0,0 +1,346 @@ +import numpy as np +import torch +import math, time, operator +import torch.nn.functional as functional +import torch.nn as nn +from torch.autograd import Variable +from torch.nn.utils.rnn import pad_sequence + +from modules.inference.decode_strategy import DecodeStrategy +from utils.misc import no_peeking_mask + +class BeamSearch1(DecodeStrategy): + def __init__(self, model, max_len, device, beam_size=5, use_synonym_fn=False, replace_unk=None): + """ + Args: + model: the used model + max_len: the maximum timestep to be used + device: the device to perform calculation + beam_size: the size of the beam itself + use_synonym_fn: if set, use the get_synonym fn from wordnet to try replace + replace_unk: a tuple of [layer, head] designation, to replace the unknown word by chosen attention + """ + super(BeamSearch1, self).__init__(model, max_len, device) + self.beam_size = beam_size + self._use_synonym = use_synonym_fn + self._replace_unk = replace_unk + # print("Init BeamSearch ----------------") + + def trg_init_vars(self, src, batch_size, trg_init_token, trg_eos_token, single_src_mask): + """ + Calculate the required matrices during translation after the model is finished + Input: + :param src: The batch of sentences + + Output: Initialize the first character includes outputs, e_outputs, log_scores + """ + # Initialize target sequence (start with '' token) [batch_size x k x max_len] + trg = torch.zeros(batch_size, self.beam_size, self.max_len, device=self.device).long() + trg[:, :, 0] = trg_init_token + + # Precalc output from model's encoder + e_out = self.model.encoder(src, single_src_mask) # [batch_size x S x d_model] + # Output model prob + trg_mask = no_peeking_mask(1, device=self.device) + # [batch_size x 1] + inp_decoder = trg[:, 0, 0].view(batch_size, 1) + # [batch_size x 1 x vocab_size] + prob = self.model.out(self.model.decoder(inp_decoder, e_out, single_src_mask, trg_mask)) + prob = functional.softmax(prob, dim=-1) + + # [batch_size x 1 x k] + k_prob, k_index = torch.topk(prob, self.beam_size, dim=-1) + trg[:, :, 1] = k_index.view(batch_size, self.beam_size) + # Init log scores from k beams [batch_size x k x 1] + log_scores = torch.log(k_prob.view(batch_size, self.beam_size, 1)) + + # Repeat encoder's output k times for searching [(k * batch_size) x S x d_model] + e_outs = torch.repeat_interleave(e_out, self.beam_size, dim=0) + 
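+ # repeat_interleave duplicates each sentence's encoder states beam_size times back-to-back,
+ # turning [batch_size, src_len, d_model] into [batch_size * beam_size, src_len, d_model];
+ # e.g. torch.repeat_interleave(torch.tensor([[1], [2]]), 2, dim=0) -> [[1], [1], [2], [2]].
+ # The source mask below is expanded the same way so every beam reads identical source positions.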
src_mask = torch.repeat_interleave(single_src_mask, self.beam_size, dim=0) + + # Create mask for checking eos + sent_eos = torch.tensor([trg_eos_token for _ in range(self.beam_size)], device=self.device).view(1, self.beam_size) + + return sent_eos, log_scores, e_outs, e_out, src_mask, trg + + def compute_k_best(self, outputs, out, log_scores, i, debug=False): + """ + Compute k words with the highest conditional probability + Args: + outputs: Array has k previous candidate output sequences. [batch_size*beam_size, max_len] + i: the current timestep to execute. Int + out: current output of the model at timestep. [batch_size*beam_size, vocab_size] + log_scores: Conditional probability of past candidates (in outputs) [batch_size * beam_size] + + Returns: + new outputs has k best candidate output sequences + log_scores for each of those candidate + """ + row_b = len(out); + batch_size = row_b // self.beam_size + eos_id = self.TRG.vocab.stoi[''] + + probs, ix = out[:, -1].data.topk(self.beam_size) + + probs_rep = torch.Tensor([[1] + [1e-100] * (self.beam_size-1)]*row_b).view(row_b, self.beam_size).to(self.device) + ix_rep = torch.LongTensor([[eos_id] + [-1]*(self.beam_size-1)]*row_b).view(row_b, self.beam_size).to(self.device) + + check_eos = torch.repeat_interleave((outputs[:, i-1] == eos_id).view(row_b, 1), self.beam_size, 1) + + probs = torch.where(check_eos, probs_rep, probs) + ix = torch.where(check_eos, ix_rep, ix) + + log_probs = torch.log(probs).to(self.device) + log_scores.to(self.device) # CPU + + k_probs, k_ix = log_probs.view(batch_size, -1).topk(self.beam_size) + if(debug): + print("kprobs_after_select: ", log_probs, k_probs, k_ix) + + # Use cpu + k_probs, k_ix = torch.Tensor(k_probs.cpu().data.numpy()), torch.LongTensor(k_ix.cpu().data.numpy()) + row = k_ix // self.beam_size + torch.LongTensor([[v*self.beam_size] for v in range(batch_size)]) + col = k_ix % self.beam_size + if(debug): + print("kprobs row/col", row, col, ix[row.view(-1), col.view(-1)]) + assert False + + outputs[:, :i] = outputs[row.view(-1), :i] + outputs[:, i] = ix[row.view(-1), col.view(-1)] + log_scores = k_probs.view(-1, 1) + + return outputs, log_scores + + def replace_unknown(self, outputs, sentences, attn, selector_tuple, unknown_token=""): + """Replace the unknown words in the outputs with the highest valued attentionized words. + Args: + outputs: the output from decoding. [batchbeam] of list of str, with maximum values being + sentences: the original wordings of the sentences. [batch_size, src_len] of str + attn: the attention received, in the form of list: [layers units of (self-attention, attention) with shapes of [batchbeam, heads, tgt_len, tgt_len] & [batchbeam, heads, tgt_len, src_len] respectively] + selector_tuple: (layer, head) used to select the attention + unknown_token: token used for + Returns: + the replaced version, in the same shape as outputs + """ + layer_used, head_used = selector_tuple + # used_attention = attn[layer_used][-1][:, head_used] # it should be [batchbeam, tgt_len, src_len], as we are using the attention to source + inx = torch.arange(start=0,end=len(attn)-1, step=self.beam_size) + used_attention = attn[inx] + select_id_src = torch.argmax(used_attention, dim=-1).cpu().numpy() # [batchbeam, tgt_len] of best indices. Also convert to numpy version (remove sos not needed as it is attention of outputs) + # print(select_id_src, len(select_id_src)) + beam_size = select_id_src.shape[0] // len(sentences) # used custom-calculated beam_size as we might not output the entirety of beams. 
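Both `compute_k_best` implementations above keep beams that already emitted the end token from growing further: a degenerate distribution (all mass on a forced end-of-sentence continuation) is swapped in via `torch.where`. A compact standalone sketch of that masking pattern, with toy sizes and an assumed `eos_id`:

```python
import torch

beam_size, eos_id = 3, 2
probs = torch.rand(beam_size, beam_size)                    # top-k continuation probs per beam
ix = torch.randint(0, 7, (beam_size, beam_size))            # matching top-k token ids

finished = torch.tensor([False, True, False]).view(-1, 1)   # beam 1 already ended
check_eos = finished.expand(-1, beam_size)

probs_rep = torch.tensor([[1.0] + [1e-100] * (beam_size - 1)]).expand(beam_size, -1)
ix_rep = torch.tensor([[eos_id] + [-1] * (beam_size - 1)]).expand(beam_size, -1)

probs = torch.where(check_eos, probs_rep, probs)             # frozen beams keep prob ~1 on <eos>
ix = torch.where(check_eos, ix_rep, ix)
print(probs[1], ix[1])
```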
See beam_search fn for details + # print("beam: ", beam_size) + # select per batchbeam. source batch id is found by dividing batchbeam id per beam; we are selecting [tgt_len] indices from [src_len] tokens; then concat at the first dimensions to retrieve [batch_beam, tgt_len] of replacement tokens + # need itemgetter / map to retrieve from list + # print([ operator.itemgetter(*src_idx)(sentences[bidx // beam_size]) for bidx, src_idx in enumerate(select_id_src)]) + # print([print(sentences[bidx // beam_size], src_idx) for bidx, src_idx in enumerate(select_id_src)]) + # replace_tokens = [ operator.itemgetter(*src_idx)(sentences[bidx // beam_size]) for bidx, src_idx in enumerate(select_id_src)] + + for i in range(len(outputs)): + for j in range(len(outputs[i])): + if outputs[i][j] == unknown_token: + outputs[i][j] = sentences[i][select_id_src[i][j]] + + # print(sentences[0][0], outputs[0][0]) + + # print(i) + # zip together with sentences; then output { the token if not unk / the replacement if is }. Note that this will trim the orig version down to repl size. + # replaced = [ [tok if tok != unknown_token else rpl for rpl, tok in zip(repl, orig)] for orig, repl in zipped ] + + # return replaced + return outputs + + # def beam_search(self, src, max_len, device, k=4): + def beam_search(self, src, src_tokens=None, n_best=1, debug=False): + """ + Beam search for a single sentence + Args: + model : a Transformer instance + src : a batch (tokenized + numerized) sentence [batch_size x S] + Returns: + trg : a batch (tokenized + numerized) sentence [batch_size x T] + """ + src = src.to(self.device) + trg_init_token = self.TRG.vocab.stoi[""] + trg_eos_token = self.TRG.vocab.stoi[""] + single_src_mask = (src != self.SRC.vocab.stoi['']).unsqueeze(1).to(self.device) + batch_size = src.size(0) + + sent_eos, log_scores, e_outs, e_out, src_mask, trg = self.trg_init_vars(src, batch_size, trg_init_token, trg_eos_token, single_src_mask) + + # The batch indexes + batch_index = torch.arange(batch_size) + finished_batches = torch.zeros(batch_size, device=self.device).long() + + log_attn = torch.zeros([self.beam_size*batch_size, self.max_len, len(src[0])]) + + # Iteratively searching + for i in range(2, self.max_len): + trg_mask = no_peeking_mask(i, self.device) + + # Flatten trg tensor for feeding into model [(k * batch_size) x i] + inp_decoder = trg[batch_index, :, :i].view(self.beam_size * len(batch_index), i) + # Output model prob [(k * batch_size) x i x vocab_size] + current_decode, attn = self.model.decoder(inp_decoder, e_outs, src_mask, trg_mask, output_attention=True) + # print(len(attn[0])) + + prob = self.model.out(current_decode) + prob = functional.softmax(prob, dim=-1) + + # Only care the last prob i-th + # [(k * batch_size) x 1 x vocab_size] + prob = prob[:, i-1, :].view(self.beam_size * len(batch_index), 1, -1) + + # Truncate prob to top k [(k * batch_size) x 1 x k] + k_prob, k_index = prob.data.topk(self.beam_size, dim=-1) + + # Deflatten k_prob & k_index + k_prob = k_prob.view(len(batch_index), self.beam_size, 1, self.beam_size) + k_index = k_index.view(len(batch_index), self.beam_size, 1, self.beam_size) + + # Preserve eos beams + # [batch_size x k] -> view -> [batch_size x k x 1 x 1] (broadcastable) + eos_mask = (trg[batch_index, :, i-1] == trg_eos_token).view(len(batch_index), self.beam_size, 1, 1) + k_prob.masked_fill_(eos_mask, 1.0) + k_index.masked_fill_(eos_mask, trg_eos_token) + + # Find the best k cases + # Compute log score at i-th timestep + # [batch_size x k x 1 x 1] + [batch_size x k x 
1 x k] = [batch_size x k x 1 x k] + combine_probs = log_scores[batch_index].unsqueeze(-1) + torch.log(k_prob) + # [batch_size x k x 1] + log_scores[batch_index], positions = torch.topk(combine_probs.view(len(batch_index), self.beam_size * self.beam_size, 1), self.beam_size, dim=1) + + # The rows selected from top k + rows = positions.view(len(batch_index), self.beam_size) // self.beam_size + # The indexes in vocab respected to these rows + cols = positions.view(len(batch_index), self.beam_size) % self.beam_size + + batch_sim = torch.arange(len(batch_index)).view(-1, 1) + trg[batch_index, :, :] = trg[batch_index.view(-1, 1), rows, :] + trg[batch_index, :, i] = k_index[batch_sim, rows, :, cols].view(len(batch_index), self.beam_size) + + # Update attn + inx = torch.repeat_interleave(finished_batches, self.beam_size, dim=0) + batch_attn = torch.nonzero(inx == 0).view(-1) + # import copy + # x = copy.deepcopy(attn[0][-1][:, 0].to("cpu")) + # log_attn[batch_attn, :i, :] = x + + # if i == 7: + # print(log_attn[batch_attn, :i, :].shape, attn[0][-1][:, 0].shape) + # print(log_attn[batch_attn, :i, :]) + # Update which sentences finished all its beams + mask = (trg[:, :, i] == sent_eos).all(1).view(-1).to(self.device) + finished_batches.masked_fill_(mask, value=1) + cnt = torch.sum(finished_batches).item() + if cnt == batch_size: + break + + # # Continue with remaining batches (if any) + batch_index = torch.nonzero(finished_batches == 0).view(-1) + e_outs = torch.repeat_interleave(e_out[batch_index], self.beam_size, dim=0) + src_mask = torch.repeat_interleave(single_src_mask[batch_index], self.beam_size, dim=0) + # End loop + + # Get the best beam + log_scores = log_scores.view(batch_size, self.beam_size) + results = [] + for t, j in enumerate(torch.argmax(log_scores, dim=-1)): + sent = [] + for i in range(self.max_len): + token_id = trg[t, j.item(), i].item() + if token_id == trg_init_token: + continue + if token_id == trg_eos_token: + break + sent.append(self.TRG.vocab.itos[token_id]) + results.append(sent) + + # if(self._replace_unk and src_tokens is not None): + # # replace unknown words per translated sentences. + # # NOTE: lacking a src_tokens does not raise any warning. Add that in when logging module is available, to support error catching + # # print("Replace unk -----------------------") + # results = self.replace_unknown(results, src_tokens, log_attn, self._replace_unk) + + return results + + def translate_single_sentence(self, src, **kwargs): + """Translate a single sentence. Currently unused.""" + raise NotImplementedError + return self.translate_batch_sentence([src], **kwargs) + + def translate_batch_sentence(self, src, field_processed=False, src_size_limit=None, output_tokens=False, debug=False): + """Translate a batch of sentences together. Currently disabling the synonym func. + Args: + src: the batch of sentences to be translated + field_processed: bool, if the sentences had been already processed (i.e part of batched validation data) + src_size_limit: if set, trim the input if it cross this value. Added due to current positional encoding support only <=200 tokens + output_tokens: the output format. False will give a batch of sentences (str), while True will give batch of tokens (list of str) + debug: enable to print external values + Return: + the result of translation, with format dictated by output_tokens + """ + # start = time.time() + + self.model.eval() + # create the indiced batch. 
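+ # preprocess_batch with output_tokens=True returns a (padded id tensor, token lists) pair for raw
+ # string input, but only the already-numericalized tensor when field_processed=True, which is why
+ # the unpacking below falls back to (processed_batch, None) in the field_processed case.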
+ processed_batch = self.preprocess_batch(src, field_processed=field_processed, src_size_limit=src_size_limit, output_tokens=True, debug=debug) + # print("Time preprocess_batch: ", time.time()-start) + + sent_ids, sent_tokens = (processed_batch, None) if(field_processed) else processed_batch + assert isinstance(sent_ids, torch.Tensor), "sent_ids is instead {}".format(type(sent_ids)) + + translated_sentences = self.beam_search(sent_ids, src_tokens=sent_tokens, debug=debug) + + # print("Time for one batch: ", time.time()-batch_start) + + # if time.time()-batch_start > 2: + # [print("len src >2 : ++++++", len(i.split())) for i in src] + # [print("len translate >2: ++++++", len(i)) for i in translated_sentences] + # else: + # [print("len src : ====", len(i.split())) for i in src] + # [print("len translate : ====", len(i)) for i in translated_sentences] + # print("=====================================") + + # time.sleep(4) + if(debug): + print("Time performed for batch {}: {:.2f}s".format(sent_ids.shape)) + + if(not output_tokens): + translated_sentences = [' '.join(tokens) for tokens in translated_sentences] + + return translated_sentences + + def preprocess_batch(self, sentences, field_processed=False, pad_token="", src_size_limit=None, output_tokens=False, debug=True): + """Adding + src_size_limit: int, option to limit the length of src. + field_processed: bool: if the sentences had been already processed (i.e part of batched validation data) + output_tokens: if set, output a token version aside the id version, in [batch of [src_len]] str. Note that it won't work with field_processed + """ + + if(field_processed): + # do nothing, as it had already performed tokenizing/stoi + return sentences + processed_sent = map(self.SRC.preprocess, sentences) + if(src_size_limit): + processed_sent = map(lambda x: x[:src_size_limit], processed_sent) + processed_sent = list(processed_sent) + tokenized_sent = [torch.LongTensor([self._token_to_index(t) for t in s]) for s in processed_sent] # convert to tensors, in indices format + sentences = Variable(pad_sequence(tokenized_sent, True, padding_value=self.SRC.vocab.stoi[pad_token])) # padding sentences + if(debug): + print("Input batch after process: ", sentences.shape, sentences) + + if(output_tokens): + return sentences, processed_sent + else: + return sentences + + def translate_batch(self, sentences, **kwargs): + return self.translate_batch_sentence(sentences, **kwargs) + + def _token_to_index(self, tok): + """Override to select, depending on the self._use_synonym param""" + if(self._use_synonym): + return super(BeamSearch1, self)._token_to_index(tok) + else: + return self.SRC.vocab.stoi[tok] diff --git a/modules/inference/decode_strategy.py b/modules/inference/decode_strategy.py new file mode 100644 index 0000000000000000000000000000000000000000..e4e826d3dfc935e5d2275b6c38cce0923bccb14c --- /dev/null +++ b/modules/inference/decode_strategy.py @@ -0,0 +1,62 @@ +import torch +from torch.autograd import Variable +from utils.data import get_synonym +from torch.nn.utils.rnn import pad_sequence +import abc + +class DecodeStrategy(object): + """ + Base, abstract class for generation strategies. 
Contain specific call to base model that use it + + """ + def __init__(self, model, max_len, device): + self.model = model + self.max_len = max_len + self.device = device + + @property + def SRC(self): + return self.model.SRC + + @property + def TRG(self): + return self.model.TRG + + @abc.abstractmethod + def translate_single(self, src_lang, trg_lang, sentences): + """Translate a single sentence. Might be useful as backcompatibility""" + raise NotImplementedError + + @abc.abstractmethod + def translate_batch(self, src_lang, trg_lang, sentences): + """Translate a batch of sentences. + Args: + sentences: The sentences, formatted as [batch_size] Tensor of str + Returns: + The detokenized output, most commonly [batch_size] of str + """ + + raise NotImplementedError + + @abc.abstractmethod + def replace_unknown(self, *args): + """Replace unknown words from batched sentences""" + raise NotImplementedError + + def preprocess_batch(self, lang, sentences, pad_token=""): + """Feed a unprocessed batch into the torchtext.Field of source. + Args: + sentences: [batch_size] of str + pad_token: the pad token used to pad the sentences + Returns: + the sentences in Tensor format, padded with pad_value""" + processed_sent = list(map(self.SRC.preprocess, sentences)) # tokenizing + tokenized_sent = [Torch.LongTensor([self._token_to_index(t) for t in s]) for s in processed_sent] # convert to tensors and indices + sentences = Variable(pad_sequence(tokenized_sent, True, padding_value=self.SRC.vocab.stoi[pad_token])) # padding sentences + return sentences + + def _token_to_index(self, tok): + """Implementing get_synonym as default. Override if want to use default behavior ( for unknown words, independent of wordnet)""" + if self.SRC.vocab.stoi[tok] != self.SRC.vocab.stoi['']: + return self.SRC.vocab.stoi[tok] + return get_synonym(tok, self.SRC) diff --git a/modules/inference/greedy_search.py b/modules/inference/greedy_search.py new file mode 100644 index 0000000000000000000000000000000000000000..a06ebd59f18ea368ab286cf5af19f767acfc41de --- /dev/null +++ b/modules/inference/greedy_search.py @@ -0,0 +1,121 @@ +##@title Beam của mình +import numpy as np +import torch +import math +import torch.nn.functional as functional +import torch.nn as nn +from torch.autograd import Variable + +from modules.inference.decode_strategy import DecodeStrategy +from utils.misc import no_peeking_mask + +class GreedySearch(DecodeStrategy): + def __init__(self, model, max_len, device, replace_unk=None): + """ + :param beam_size + :param batch_size + :param beam_offset + """ + super(GreedySearch, self).__init__(model, max_len, device) + # self.replace_unk = replace_unk + # raise NotImplementedError("Replace unk was yeeted from base class DecodeStrategy. 
Fix first.") + + def initilize_value(self, sentences): + """ + Calculate the required matrices during translation after the model is finished + Input: + :param src: Sentences + + Output: Initialize the first character includes outputs, e_outputs, log_scores + """ + batch_size=len(sentences) + init_tok = self.TRG.vocab.stoi[''] + src_mask = (sentences != self.SRC.vocab.stoi['']).unsqueeze(-2) + eos_tok = self.TRG.vocab.stoi[''] + + # Encoder + e_output = self.model.encoder(sentences, src_mask) + + out = torch.LongTensor([[init_tok] for i in range(batch_size)]) + outputs = torch.zeros(batch_size, self.max_len).long() + outputs[:, :1] = out + + outputs = outputs.to(self.device) + is_finished = torch.LongTensor([[eos_tok] for i in range(batch_size)]).view(-1).to(self.device) + return eos_tok, src_mask, is_finished, e_output, outputs + + def create_trg_mask(self, i, device): + return no_peeking_mask(i, device) + + def current_predict(self, outputs, e_output, src_mask, trg_mask): + model = self.model + # out, attn = model.out(model.decoder(outputs, e_output, src_mask, trg_mask)) + decoder_output, attn = model.decoder(outputs, e_output, src_mask, trg_mask, output_attention=True) + # total_time_decode += time.time()-decode_time + out = model.out(decoder_output) + + out = functional.softmax(out, dim=-1) + return out, attn + + def greedy_search(self, sentences, sampling_temp=0.0, keep_topk=1): + batch_size=len(sentences) + + eos_tok, src_mask, is_finished, e_output, outputs = self.initilize_value(sentences) + + for i in range(1, self.max_len): + out, attn = self.current_predict(outputs[:, :i], e_output, src_mask, self.create_trg_mask(i, self.device)) + topk_ix, topk_prob = self.sample_with_temperature(out[:, -1], sampling_temp, keep_topk) + outputs[:, i] = topk_ix.view(-1) + if torch.equal(outputs[:, i], is_finished): + break + + # if self.replace_unk == True: + # outputs = self.replace_unknown(attn, sentences, outputs) + + # print("\n".join([' '.join([self.TRG.vocab.itos[tok] for tok in line[1:]]) for line in outputs])) + # Write to file or Print to the console + translated_sentences = [] + # Get the best sentences: idx = 0 + i*k + for i in range(0, len(outputs)): + is_eos = torch.nonzero(outputs[i]==eos_tok) + if len(is_eos) == 0: + # if there is no sequence end, remove + sent = outputs[i, 1:] + else: + length = is_eos[0] + sent = outputs[i, 1:length] + translated_sentences.append([self.TRG.vocab.itos[tok] for tok in sent]) + + return translated_sentences + + def sample_with_temperature(self, logits, sampling_temp, keep_topk): + if sampling_temp == 0.0 or keep_topk == 1: + # For temp=0.0, take the argmax to avoid divide-by-zero errors. + # keep_topk=1 is also equivalent to argmax. + topk_scores, topk_ids = logits.topk(1, dim=-1) + if sampling_temp > 0: + topk_scores /= sampling_temp + else: + logits = torch.div(logits, sampling_temp) + + if keep_topk > 0: + top_values, top_indices = torch.topk(logits, keep_topk, dim=1) + kth_best = top_values[:, -1].view([-1, 1]) + kth_best = kth_best.repeat([1, logits.shape[1]]).float() + + # Set all logits that are not in the top-k to -10000. + # This puts the probabilities close to 0. 
+ ignore = torch.lt(logits, kth_best) + logits = logits.masked_fill(ignore, -10000) + + dist = torch.distributions.Multinomial( + logits=logits, total_count=1) + topk_ids = torch.argmax(dist.sample(), dim=1, keepdim=True) + topk_scores = logits.gather(dim=1, index=topk_ids) + return topk_ids, topk_scores + + def translate_batch(self, sentences, src_size_limit, output_tokens=True, debug=False): + # super(BeamSearch, self).__init__() + sentences = self.preprocess_batch(sentences).to(self.device) + return self.greedy_search(sentences, 0.2, 2) + # print(self.initilize_value(sentences)) diff --git a/modules/inference/prototypes.py b/modules/inference/prototypes.py new file mode 100644 index 0000000000000000000000000000000000000000..7962f91d9c3ec240a2bfa607e8ef4c965145816c --- /dev/null +++ b/modules/inference/prototypes.py @@ -0,0 +1,144 @@ +import torch, time +import torch.nn.functional as functional +from torch.autograd import Variable +from torch.nn.utils.rnn import pad_sequence + +from modules.inference.beam_search import BeamSearch +from utils.data import generate_language_token +import modules.constants as const + +def generate_subsequent_mask(sz, device): + return torch.triu( + torch.ones(sz, sz, dtype=torch.int, device=device) + ).transpose(0, 1).unsqueeze(0) + +class BeamSearch2(BeamSearch): + """ + Same with BeamSearch2 class. + Difference: remove the sentence which its beams terminated (reached token) from the time step loop. + Update to reuse functions already coded in normal BeamSearch. Note that replacing unknown words & n_best is not available. + """ + def _convert_to_sent(self, sent_id, eos_token_id): + eos = torch.nonzero(sent_id == eos_token_id).view(-1) + t = eos[0] if len(eos) > 0 else len(sent_id) + return [self.TRG.vocab.itos[j] for j in sent_id[1 : t]] + + @torch.no_grad() + def beam_search(self, src, src_lang=None, trg_lang=None, src_tokens=None, n_best=1, debug=False): + """ + Beam search select k words with the highest conditional probability + to be the first word of the k candidate output sequences. + Args: + src: The batch of sentences, already in [batch_size, tokens] of int + src_tokens: src in str version, same size as above + n_best: number of usable values per beam loaded (Not implemented) + debug: if true, print some debug information during the search + Return: + An array of translated sentences, in list-of-tokens format. 
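`generate_subsequent_mask` above builds the causal (no-peeking) mask, presumably equivalent to `no_peeking_mask` in utils.misc: position i may only attend to positions 0..i. A quick sketch of what the triu-then-transpose construction produces:

```python
import torch

def subsequent_mask(sz, device="cpu"):
    # Lower-triangular mask: row i allows attention to columns 0..i only.
    return torch.triu(torch.ones(sz, sz, dtype=torch.int, device=device)).transpose(0, 1).unsqueeze(0)

print(subsequent_mask(4))
# tensor([[[1, 0, 0, 0],
#          [1, 1, 0, 0],
#          [1, 1, 1, 0],
#          [1, 1, 1, 1]]], dtype=torch.int32)
```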
TODO convert [batch_size, n_best, tgt_len] instead of [batch_size, tgt_len] + """ + # Create some local variable + src_field, trg_field = self.SRC, self.TRG + sos_token = generate_language_token(trg_lang) if trg_lang is not None else const.DEFAULT_SOS + init_token = trg_field.vocab.stoi[sos_token] + eos_token_id = trg_field.vocab.stoi[const.DEFAULT_EOS] + src = src.to(self.device) + + batch_size = src.size(0) + model = self.model + k = self.beam_size + max_len = self.max_len + device = self.device + + # Initialize target sequence (start with '' token) [batch_size x k x max_len] + trg = torch.zeros(batch_size, k, max_len, device=device).long() + trg[:, :, 0] = init_token + + # Precalc output from model's encoder + single_src_mask = (src != src_field.vocab.stoi['']).unsqueeze(1).to(device) + e_out = model.encoder(src, single_src_mask) # [batch_size x S x d_model] + + # Output model prob + trg_mask = generate_subsequent_mask(1, device=device) + # [batch_size x 1] + inp_decoder = trg[:, 0, 0].view(batch_size, 1) + # [batch_size x 1 x vocab_size] + prob = model.out(model.decoder(inp_decoder, e_out, single_src_mask, trg_mask)) + prob = functional.softmax(prob, dim=-1) + + # [batch_size x 1 x k] + k_prob, k_index = torch.topk(prob, k, dim=-1) + trg[:, :, 1] = k_index.view(batch_size, k) + # Init log scores from k beams [batch_size x k x 1] + log_scores = torch.log(k_prob.view(batch_size, k, 1)) + + # Repeat encoder's output k times for searching [(k * batch_size) x S x d_model] + e_outs = torch.repeat_interleave(e_out, k, dim=0) + src_mask = torch.repeat_interleave(single_src_mask, k, dim=0) + + # Create mask for checking eos + sent_eos = torch.tensor([eos_token_id for _ in range(k)], device=device).view(1, k) + + # The batch indexes + batch_index = torch.arange(batch_size) + finished_batches = torch.zeros(batch_size, device=device).long() + + # Iteratively searching + for i in range(2, max_len): + trg_mask = generate_subsequent_mask(i, device) + + # Flatten trg tensor for feeding into model [(k * batch_size) x i] + inp_decoder = trg[batch_index, :, :i].view(k * len(batch_index), i) + # Output model prob [(k * batch_size) x i x vocab_size] + prob = model.out(model.decoder(inp_decoder, e_outs, src_mask, trg_mask)) + prob = functional.softmax(prob, dim=-1) + + # Only care the last prob i-th + # [(k * batch_size) x 1 x vocab_size] + prob = prob[:, i-1, :].view(k * len(batch_index), 1, -1) + + # Truncate prob to top k [(k * batch_size) x 1 x k] + k_prob, k_index = prob.data.topk(k, dim=-1) + + # Deflatten k_prob & k_index + k_prob = k_prob.view(len(batch_index), k, 1, k) + k_index = k_index.view(len(batch_index), k, 1, k) + + # Preserve eos beams + # [batch_size x k] -> view -> [batch_size x k x 1 x 1] (broadcastable) + eos_mask = (trg[batch_index, :, i-1] == eos_token_id).view(len(batch_index), k, 1, 1) + k_prob.masked_fill_(eos_mask, 1.0) + k_index.masked_fill_(eos_mask, eos_token_id) + + # Find the best k cases + # Compute log score at i-th timestep + # [batch_size x k x 1 x 1] + [batch_size x k x 1 x k] = [batch_size x k x 1 x k] + combine_probs = log_scores[batch_index].unsqueeze(-1) + torch.log(k_prob) + # [batch_size x k x 1] + log_scores[batch_index], positions = torch.topk(combine_probs.view(len(batch_index), k * k, 1), k, dim=1) + + # The rows selected from top k + rows = positions.view(len(batch_index), k) // k + # The indexes in vocab respected to these rows + cols = positions.view(len(batch_index), k) % k + + batch_sim = torch.arange(len(batch_index)).view(-1, 1) + trg[batch_index, :, :] 
= trg[batch_index.view(-1, 1), rows, :] + trg[batch_index, :, i] = k_index[batch_sim, rows, :, cols].view(len(batch_index), k) + + # Update which sentences finished all its beams + mask = (trg[:, :, i] == sent_eos).all(1).view(-1).to(device) + finished_batches.masked_fill_(mask, value=1) + cnt = torch.sum(finished_batches).item() + if cnt == batch_size: + break + + # Continue with remaining batches (if any) + batch_index = torch.nonzero(finished_batches == 0).view(-1) + e_outs = torch.repeat_interleave(e_out[batch_index], k, dim=0) + src_mask = torch.repeat_interleave(single_src_mask[batch_index], k, dim=0) + # End loop + + # Get the best beam + log_scores = log_scores.view(batch_size, k) + results = [self._convert_to_sent(trg[t, j.item(), :], eos_token_id) for t, j in enumerate(torch.argmax(log_scores, dim=-1))] + return results diff --git a/modules/inference/sampling_temperature.py b/modules/inference/sampling_temperature.py new file mode 100644 index 0000000000000000000000000000000000000000..e45f53c3b18e2a37147cbb486a19db13b32c281d --- /dev/null +++ b/modules/inference/sampling_temperature.py @@ -0,0 +1,119 @@ +##@title Beam của mình +import numpy as np +import torch +import math +import torch.nn.functional as functional +import torch.nn as nn +from torch.autograd import Variable + +from modules.inference.decode_strategy import DecodeStrategy +from utils.misc import no_peeking_mask + +class GreedySearch(DecodeStrategy): + def __init__(self, model, max_len, device, replace_unk=True): + """ + :param beam_size + :param batch_size + :param beam_offset + """ + super(GreedySearch, self).__init__(model, max_len, device) + self.batch_size = batch_size + self.replace_unk = replace_unk + raise NotImplementedError("Replace unk was yeeted from base class DecodeStrategy. 
Fix first.") + + def initilize_value(self, sentences): + """ + Calculate the required matrices during translation after the model is finished + Input: + :param src: Sentences + + Output: Initialize the first character includes outputs, e_outputs, log_scores + """ + + init_tok = self.TRG.vocab.stoi[''] + src_mask = (sentences != self.SRC.vocab.stoi['']).unsqueeze(-2) + eos_tok = self.TRG.vocab.stoi[''] + + # Encoder + e_output = self.model.encoder(sentences, src_mask) + + out = torch.LongTensor([[init_tok] for i in range(self.batch_size)]) + outputs = torch.zeros(self.batch_size, self.max_len).long() + outputs[:, :1] = out + + outputs = outputs.to(self.device) + is_finished = torch.LongTensor([[eos_tok] for i in range(self.batch_size)]).view(-1).to(self.device) + return eos_tok, src_mask, is_finished, e_output, outputs + + def create_trg_mask(self, i, device): + return no_peeking_mask(i, device) + + def current_predict(self, outputs, e_output, src_mask, trg_mask): + out, attn = self.model.out(self.model.decoder(outputs, + e_output, src_mask, trg_mask)) + out = functional.softmax(out, dim=-1) + return out, attn + + def greedy_search(self, sentences, sampling_temp=0.0, keep_topk=1): + if len(sentences) < self.batch_size: + self.batch_size = len(sentences) + + eos_tok, src_mask, is_finished, e_output, outputs = self.initilize_value(sentences) + + for i in range(1, self.max_len): + out, attn = self.current_predict(outputs[:, :i], e_output, src_mask, self.create_trg_mask(i, self.device)) + topk_ix, topk_prob = self.sample_with_temperature(out[:, -1], sampling_temp, keep_topk) + outputs[:, i] = topk_ix.view(-1) + if torch.equal(outputs[:, i], is_finished): + break + + if self.replace_unk == True: + outputs = self.replace_unknown(attn, sentences, outputs) + + # print("\n".join([' '.join([self.TRG.vocab.itos[tok] for tok in line[1:]]) for line in outputs])) + # Write to file or Print to the console + translated_sentences = [] + # Get the best sentences: idx = 0 + i*k + for i in range(0, len(outputs)): + is_eos = torch.nonzero(outputs[i]==eos_tok) + if len(is_eos) == 0: + # if there is no sequence end, remove + sent = outputs[i, 1:] + else: + length = is_eos[0] + sent = outputs[i, 1:length] + translated_sentences.append([self.TRG.vocab.itos[tok] for tok in sent]) + + return translated_sentences + + def sample_with_temperature(self, logits, sampling_temp, keep_topk): + if sampling_temp == 0.0 or keep_topk == 1: + # For temp=0.0, take the argmax to avoid divide-by-zero errors. + # keep_topk=1 is also equivalent to argmax. + topk_scores, topk_ids = logits.topk(1, dim=-1) + if sampling_temp > 0: + topk_scores /= sampling_temp + else: + logits = torch.div(logits, sampling_temp) + + if keep_topk > 0: + top_values, top_indices = torch.topk(logits, keep_topk, dim=1) + kth_best = top_values[:, -1].view([-1, 1]) + kth_best = kth_best.repeat([1, logits.shape[1]]).float() + + # Set all logits that are not in the top-k to -10000. + # This puts the probabilities close to 0. 
+ ignore = torch.lt(logits, kth_best) + logits = logits.masked_fill(ignore, -10000) + + dist = torch.distributions.Multinomial( + logits=logits, total_count=1) + topk_ids = torch.argmax(dist.sample(), dim=1, keepdim=True) + topk_scores = logits.gather(dim=1, index=topk_ids) + return topk_ids, topk_scores + + def translate_batch(self, sentences): + # super(BeamSearch, self).__init__() + sentences = self.preprocess_batch(sentences).to(self.device) + return self.greedy_search(sentences, 0.2, 2) + # print(self.initilize_value(sentences)) diff --git a/modules/loader/__init__.py b/modules/loader/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..aae7ec181bc51babbfad6ebdcb584ae5f98e90c9 --- /dev/null +++ b/modules/loader/__init__.py @@ -0,0 +1,4 @@ +from .default_loader import DefaultLoader +from .multilingual_loader import MultiLoader + +loaders = {"monoloader": DefaultLoader, "multiloader": MultiLoader} diff --git a/modules/loader/__pycache__/__init__.cpython-36.pyc b/modules/loader/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..3a0d4d5a822110f0579aea6acf7ceb390f1b5ecb Binary files /dev/null and b/modules/loader/__pycache__/__init__.cpython-36.pyc differ diff --git a/modules/loader/__pycache__/default_loader.cpython-36.pyc b/modules/loader/__pycache__/default_loader.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..f8c88ec0b087ce50e5b1076880c010bc2f898aa4 Binary files /dev/null and b/modules/loader/__pycache__/default_loader.cpython-36.pyc differ diff --git a/modules/loader/__pycache__/multilingual_loader.cpython-36.pyc b/modules/loader/__pycache__/multilingual_loader.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..cf61d58e2e35947ae26a8307778c674893e34cad Binary files /dev/null and b/modules/loader/__pycache__/multilingual_loader.cpython-36.pyc differ diff --git a/modules/loader/default_loader.py b/modules/loader/default_loader.py new file mode 100644 index 0000000000000000000000000000000000000000..366141add6a775d2104a9a7c7f0f382a4256f15a --- /dev/null +++ b/modules/loader/default_loader.py @@ -0,0 +1,114 @@ +import io, os +import dill as pickle +import torch +from torch.utils.data import DataLoader +from torchtext.data import BucketIterator, Dataset, Example, Field +from torchtext.datasets import TranslationDataset, Multi30k, IWSLT, WMT14 +from collections import Counter + +import modules.constants as const +from utils.save import load_vocab_from_path +import laonlp + +class DefaultLoader: + def __init__(self, train_path_or_name, language_tuple=None, valid_path=None, eval_path=None, option=None): + """Load training/eval data file pairing, process and create data iterator for training """ + self._language_tuple = language_tuple + self._train_path = train_path_or_name + self._eval_path = eval_path + self._option = option + + @property + def language_tuple(self): + """DefaultLoader will use the default lang option @bleu_batch_iter , hence, None""" + return None, None + + def tokenize(self, sentence): + return sentence.strip().split() + + def detokenize(self, list_of_tokens): + """Differentiate between [batch, len] and [len]; joining tokens back to strings""" + if( len(list_of_tokens) == 0 or isinstance(list_of_tokens[0], str)): + # [len], single sentence version + return " ".join(list_of_tokens) + else: + # [batch, len], batch sentence version + return [" ".join(tokens) for tokens in list_of_tokens] + + def _train_path_is_name(self): + 
return os.path.isfile(self._train_path + self._language_tuple[0]) and os.path.isfile(self._train_path + self._language_tuple[1]) + + def create_length_constraint(self, token_limit): + """Filter an iterator if it pass a token limit""" + return lambda x: len(x.src) <= token_limit and len(x.trg) <= token_limit + + def build_field(self, **kwargs): + """Build fields that will handle the conversion from token->idx and vice versa. To be overriden by MultiLoader.""" + return Field(lower=False, tokenize=laonlp.tokenize.word_tokenize), Field(lower=False, tokenize=self.tokenize, init_token=const.DEFAULT_SOS, eos_token=const.DEFAULT_EOS, is_target=True) + + def build_vocab(self, fields, model_path=None, data=None, **kwargs): + """Build the vocabulary object for torchtext Field. There are three flows: + - if the model path is present, it will first try to load the pickled/dilled vocab object from path. This is accessed on continued training & standalone inference + - if that failed and data is available, try to build the vocab from that data. This is accessed on first time training + - if data is not available, search for set of two vocab files and read them into the fields. This is accessed on first time training + TODO: expand on the vocab file option (loading pretrained vectors as well) + """ + src_field, trg_field = fields + if(model_path is None or not load_vocab_from_path(model_path, self._language_tuple, fields)): + # the condition will try to load vocab pickled to model path. + if(data is not None): + print("Building vocab from received data.") + # build the vocab using formatted data. + src_field.build_vocab(data, **kwargs) + trg_field.build_vocab(data, **kwargs) + else: + print("Building vocab from preloaded text file.") + # load the vocab values from external location (a formatted text file). 
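The branch below reads one surface form per line from the external vocab file and feeds a mock Counter into the Field's vocab class. A standalone sketch of the same trick, assuming the legacy (pre-0.9 style) torchtext `Vocab` API this loader targets; the path and special tokens are illustrative only:

```python
import io
from collections import Counter
from torchtext.vocab import Vocab   # legacy torchtext API

def vocab_from_wordlist(path, specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    # One token per line; a uniform dummy count of 1 keeps every listed word with min_freq=1.
    with io.open(path, "r", encoding="utf-8") as f:
        counter = Counter({line.strip(): 1 for line in f if line.strip()})
    return Vocab(counter, specials=list(specials), min_freq=1)

# vocab = vocab_from_wordlist("data/vocab/words.vi")   # hypothetical word-list file
# print(len(vocab), vocab.stoi["<pad>"])
```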
Initialize values as random + external_vocab_location = self._option.get("external_vocab_location", None) + src_ext, trg_ext = self._language_tuple + # read the files and create a mock Counter object; which then is passed to vocab's class + # see Field.build_vocab for the options used with vocab_cls + vocab_src = external_vocab_location + src_ext + with io.open(vocab_src, "r", encoding="utf-8") as svf: + mock_counter = Counter({w.strip():1 for w in svf.readlines()}) + special_tokens = [src_field.unk_token, src_field.pad_token, src_field.init_token, src_field.eos_token] + src_field.vocab = src_field.vocab_cls(mock_counter, specials=special_tokens, min_freq=5, **kwargs) + vocab_trg = external_vocab_location + trg_ext + with io.open(vocab_trg, "r", encoding="utf-8") as tvf: + mock_counter = Counter({w.strip():1 for w in tvf.readlines()}) + special_tokens = [trg_field.unk_token, trg_field.pad_token, trg_field.init_token, trg_field.eos_token] + trg_field.vocab = trg_field.vocab_cls(mock_counter, specials=special_tokens, min_freq=5, **kwargs) + else: + print("Load vocab from path successful.") + + def create_iterator(self, fields, model_path=None): + """Create the iterator needed to load batches of data and bind them to existing fields + NOTE: unlike the previous loader, this one inputs list of tokens instead of a string, which necessitate redefinining of translate_sentence pipe""" + if(not self._train_path_is_name()): + # load the default torchtext dataset by name + # TODO load additional arguments in the config + dataset_cls = next( (s for s in [Multi30k, IWSLT, WMT14] if s.__name__ == self._train_path), None ) + if(dataset_cls is None): + raise ValueError("The specified train path {:s}(+{:s}/{:s}) does neither point to a valid files path nor is a name of torchtext dataset class.".format(self._train_path, *self._language_tuple)) + src_suffix, trg_suffix = ext = self._language_tuple +# print(ext, fields) + self._train_data, self._valid_data, self._eval_data = dataset_cls.splits(exts=ext, fields=fields) #, split_ratio=self._option.get("train_test_split", const.DEFAULT_TRAIN_TEST_SPLIT) + else: + # create dataset from path. Also add all necessary constraints (e.g lengths trimming/excluding) + src_suffix, trg_suffix = ext = self._language_tuple + filter_fn = self.create_length_constraint(self._option.get("train_max_length", const.DEFAULT_TRAIN_MAX_LENGTH)) + self._train_data = TranslationDataset(self._train_path, ext, fields, filter_pred=filter_fn) + self._valid_data = self._eval_data = TranslationDataset(self._eval_path, ext, fields) +# first_sample = self._train_data[0]; raise Exception("{} {}".format(first_sample.src, first_sample.trg)) + # whatever created, we now have the two set of data ready. add the necessary constraints/filtering/etc. + train_data = self._train_data + eval_data = self._eval_data + # now we can execute build_vocab. 
This function will try to load vocab from model_path, and if fail, build the vocab from train_data + build_vocab_kwargs = self._option.get("build_vocab_kwargs", {}) + self.build_vocab(fields, data=train_data, model_path=model_path, **build_vocab_kwargs) +# raise Exception("{}".format(len(src_field.vocab))) + # crafting iterators + train_iter = BucketIterator(train_data, batch_size=self._option.get("batch_size", const.DEFAULT_BATCH_SIZE), device=self._option.get("device", const.DEFAULT_DEVICE) ) + eval_iter = BucketIterator(eval_data, batch_size=self._option.get("eval_batch_size", const.DEFAULT_EVAL_BATCH_SIZE), device=self._option.get("device", const.DEFAULT_DEVICE), train=False ) + return train_iter, eval_iter + diff --git a/modules/loader/multilingual_loader.py b/modules/loader/multilingual_loader.py new file mode 100644 index 0000000000000000000000000000000000000000..191ea18a5cffe63077dd500f932c9cf952cf72da --- /dev/null +++ b/modules/loader/multilingual_loader.py @@ -0,0 +1,139 @@ +import io, os +import dill as pickle +from collections import Counter +import torch +from torchtext.data import BucketIterator, Dataset, Example, Field, interleave_keys +import modules.constants as const +from utils.save import load_vocab_from_path +from utils.data import generate_language_token +from modules.loader.default_loader import DefaultLoader +import laonlp + +class MultiDataset(Dataset): + """ + Ensemble one or more corpuses from different languages. + The corpuses use global source vocab and target vocab. + + Constructor Args: + data_info: list of datasets info + fields: A tuple containing src field and trg field. + """ + @staticmethod + def sort_key(ex): + return interleave_keys(len(ex.src), len(ex.trg)) + + def __init__(self, data_info, fields, **kwargs): + self.languages = set() + + if not isinstance(fields[0], (tuple, list)): + fields = [('src', fields[0]), ('trg', fields[1])] + + examples = [] + for corpus, info in data_info: + print("Loading corpus {} ...".format(corpus)) + + src_lang = info["src_lang"] + trg_lang = info["trg_lang"] + src_path = os.path.expanduser('.'.join([info["path"], src_lang])) + trg_path = os.path.expanduser('.'.join([info["path"], trg_lang])) + self.languages.add(src_lang) + self.languages.add(trg_lang) + + with io.open(src_path, mode='r', encoding='utf-8') as src_file, \ + io.open(trg_path, mode='r', encoding='utf-8') as trg_file: + for src_line, trg_line in zip(src_file, trg_file): + src_line, trg_line = src_line.strip(), trg_line.strip() + if src_line != '' and trg_line != '': + # Append language-specific prefix token + src_line = ' '.join([generate_language_token(src_lang), src_line]) + trg_line = ' '.join([generate_language_token(trg_lang), trg_line]) + examples.append(Example.fromlist([src_line, trg_line], fields)) + print("Done!") + + super(MultiDataset, self).__init__(examples, fields, **kwargs) + + +class MultiLoader(DefaultLoader): + def __init__(self, train, valid=None, option=None): + """ + Load multiple training/eval parallel data files, process and create data iterator + Constructor Args: + train: a dictionary contains training data information + valid (optional): a dictionary contains validation data information + option (optional): a dictionary contains configurable parameters + + For example: + train = { + "corpus_1": { + "path": path/to/training/data, + "src_lang": src, + "trg_lang": trg + }, + "corpus_2": { + ... 
+ } + } + """ + self._train_info = train + self._valid_info = valid + self._language_tuple = ('.src', '.trg') + self._option = option + + @property + def tokenize(self, sentence): + return sentence.strip().split() + + + def language_tuple(self): + """Currently output valid data's tuple for bleu_valid_iter, which would use <{trg_lang}> during inference. Since <{src_lang}> had already been added to the valid data, return None instead.""" + return None, self._valid_info["trg_lang"] + + def _is_path(self, path, lang): + """Check whether the path is a system path or a corpus name""" + return os.path.isfile(path + '.' + lang) + + def build_field(self, **kwargs): + return Field(lower=False, tokenize=laonlp.tokenize.word_tokenize), Field(lower=False, tokenize=self.tokenize, init_token=const.DEFAULT_SOS, eos_token=const.DEFAULT_EOS) + + def build_vocab(self, fields, model_path=None, data=None, **kwargs): + """Build the vocabulary object for torchtext Field. There are three flows: + - if the model path is present, it will first try to load the pickled/dilled vocab object from path. This is accessed on continued training & standalone inference + - if that failed and data is available, try to build the vocab from that data. This is accessed on first time training + - if data is not available, search for set of two vocab files and read them into the fields. This is accessed on first time training + TODO: expand on the vocab file option (loading pretrained vectors as well) + """ + src_field, trg_field = fields + if model_path is None or not load_vocab_from_path(model_path, self._language_tuple, fields): + # the condition will try to load vocab pickled to model path. + if data is not None: + print("Building vocab from received data.") + # build the vocab using formatted data. + src_field.build_vocab(data, **kwargs) + trg_field.build_vocab(data, **kwargs) + else: + # Not implemented mixing preloaded datasets and external datasets + raise ValueError("MultiLoader currently do not support preloaded text vocab") + else: + print("Load vocab from path successful.") + + def create_iterator(self, fields, model_path=None): + """Create the iterator needed to load batches of data and bind them to existing fields""" + # create dataset from path. Also add all necessary constraints (e.g lengths trimming/excluding) + filter_fn = self.create_length_constraint(self._option.get("train_max_length", const.DEFAULT_TRAIN_MAX_LENGTH)) + self._train_data = MultiDataset(data_info=self._train_info.items(), fields=fields, filter_pred=filter_fn) + + # now we can execute build_vocab. 
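`MultiDataset` above prepends a per-language token to both sides of every sentence pair so a single shared model can route between languages. A minimal sketch of that prefixing step; `lang_token` is a stand-in for the project's `generate_language_token`, whose exact token format is not assumed here:

```python
def lang_token(lang):
    # Stand-in for utils.data.generate_language_token; the real format may differ.
    return "<{}>".format(lang)

pairs = [("hello world", "xin chào thế giới")]
src_lang, trg_lang = "en", "vi"

prefixed = [(" ".join([lang_token(src_lang), s]), " ".join([lang_token(trg_lang), t]))
            for s, t in pairs]
print(prefixed[0][0])   # <en> hello world
print(prefixed[0][1])   # <vi> xin chào thế giới
```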
This function will try to load vocab from model_path, and if fail, build the vocab from train_data + build_vocab_kwargs = self._option.get("build_vocab_kwargs", {}) + build_vocab_kwargs["specials"] = build_vocab_kwargs.pop("specials", []) + list(self._train_data.languages) + self.build_vocab(fields, data=self._train_data, model_path=model_path, **build_vocab_kwargs) + + # Create train iterator + train_iter = BucketIterator(self._train_data, batch_size=self._option.get("batch_size", const.DEFAULT_BATCH_SIZE), device=self._option.get("device", const.DEFAULT_DEVICE)) + + if self._valid_info is not None: + self._valid_data = MultiDataset(data_info=[("valid", self._valid_info)], fields=fields) + valid_iter = BucketIterator(self._valid_data, batch_size=self._option.get("eval_batch_size", const.DEFAULT_EVAL_BATCH_SIZE), device=self._option.get("device", const.DEFAULT_DEVICE), train=False) + else: + valid_iter = None + + return train_iter, valid_iter diff --git a/modules/optim/__init__.py b/modules/optim/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..b354c93d2e9a13e7d60fc60375752cf55219a49c --- /dev/null +++ b/modules/optim/__init__.py @@ -0,0 +1,5 @@ +from modules.optim.adam import AdamOptim +from modules.optim.adabelief import AdaBeliefOptim +from modules.optim.scheduler import ScheduledOptim + +optimizers = {"Adam": AdamOptim, "AdaBelief": AdaBeliefOptim} diff --git a/modules/optim/__pycache__/__init__.cpython-36.pyc b/modules/optim/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..7c7f5d178f49f817c449239e33eca5cb71cf9d3a Binary files /dev/null and b/modules/optim/__pycache__/__init__.cpython-36.pyc differ diff --git a/modules/optim/__pycache__/adabelief.cpython-36.pyc b/modules/optim/__pycache__/adabelief.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..72b9ecfa9bb4b7336ea6295d150c850863773d1a Binary files /dev/null and b/modules/optim/__pycache__/adabelief.cpython-36.pyc differ diff --git a/modules/optim/__pycache__/adam.cpython-36.pyc b/modules/optim/__pycache__/adam.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..946af33015f24171d8fe08068c030d0024c55d65 Binary files /dev/null and b/modules/optim/__pycache__/adam.cpython-36.pyc differ diff --git a/modules/optim/__pycache__/scheduler.cpython-36.pyc b/modules/optim/__pycache__/scheduler.cpython-36.pyc new file mode 100644 index 0000000000000000000000000000000000000000..739160c965e803c9a8aca0d21f293726ca80a20e Binary files /dev/null and b/modules/optim/__pycache__/scheduler.cpython-36.pyc differ diff --git a/modules/optim/adabelief.py b/modules/optim/adabelief.py new file mode 100644 index 0000000000000000000000000000000000000000..0ea1bebad51922655be68fd1b52b8576062e9042 --- /dev/null +++ b/modules/optim/adabelief.py @@ -0,0 +1,42 @@ +import torch, math + +class AdaBeliefOptim(torch.optim.Optimizer): + + def __init__(self, params, lr=1e-3, betas=(0.9, 0.98), eps=1e-9, **kwargs): + defaults = dict(lr=lr, betas=betas, eps=eps) + super().__init__(params, defaults) + + @torch.no_grad() + def step(self, closure=None): + for group in self.param_groups: + for p in group['params']: + if p.grad is None: + # No backward + continue + grad = p.grad + state = self.state[p] + + if len(state) == 0: + # Initial step + state['step'] = 0 + state['exp_avg'] = torch.zeros_like(p) + state['exp_avg_sq'] = torch.zeros_like(p) + + exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] + beta1, beta2 = 
group['betas'] + + state['step'] += 1 + bias_correction1 = 1 - beta1 ** state['step'] + bias_correction2 = 1 - beta2 ** state['step'] + + exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) + + # This is the diff between Adam and Adabelief + centered_grad = grad.sub(exp_avg) + exp_avg_sq.mul_(beta2).addcmul_(centered_grad, centered_grad, value=1-beta2) + # ! + + denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps']) + step_size = group['lr'] / bias_correction1 + + p.addcdiv_(exp_avg, denom, value=-step_size) diff --git a/modules/optim/adam.py b/modules/optim/adam.py new file mode 100644 index 0000000000000000000000000000000000000000..82ac2c425a3914735f1122c290dd5ef33c75a497 --- /dev/null +++ b/modules/optim/adam.py @@ -0,0 +1,38 @@ +import torch, math + +class AdamOptim(torch.optim.Optimizer): + + def __init__(self, params, lr=1e-3, betas=(0.9, 0.98), eps=1e-9, **kwargs): + defaults = dict(lr=lr, betas=betas, eps=eps) + super().__init__(params, defaults) + + @torch.no_grad() + def step(self, closure=None): + for group in self.param_groups: + for p in group['params']: + if p.grad is None: + # No backward + continue + grad = p.grad + state = self.state[p] + + if len(state) == 0: + # Initial step + state['step'] = 0 + state['exp_avg'] = torch.zeros_like(p) + state['exp_avg_sq'] = torch.zeros_like(p) + + exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] + beta1, beta2 = group['betas'] + + state['step'] += 1 + bias_correction1 = 1 - beta1 ** state['step'] + bias_correction2 = 1 - beta2 ** state['step'] + + exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) + exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2) + denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps']) + + step_size = group['lr'] / bias_correction1 + + p.addcdiv_(exp_avg, denom, value=-step_size) diff --git a/modules/optim/scheduler.py b/modules/optim/scheduler.py new file mode 100644 index 0000000000000000000000000000000000000000..fcbf33f2ccf5b38f3cdd6647c8b812fc59c5c588 --- /dev/null +++ b/modules/optim/scheduler.py @@ -0,0 +1,56 @@ +import torch + +class ScheduledOptim(): + '''A simple wrapper class for learning rate scheduling''' + + def __init__(self, optimizer, init_lr, d_model, n_warmup_steps): + self._optimizer = optimizer + self.init_lr = init_lr + self.d_model = d_model + self.n_warmup_steps = n_warmup_steps + self.n_steps = 0 + + + def step_and_update_lr(self): + "Step with the inner optimizer" + self._update_learning_rate() + self._optimizer.step() + + + def zero_grad(self): + "Zero out the gradients with the inner optimizer" + self._optimizer.zero_grad() + + + def _get_lr_scale(self): + d_model = self.d_model + n_steps, n_warmup_steps = self.n_steps, self.n_warmup_steps + return (d_model ** -0.5) * min(n_steps ** (-0.5), n_steps * n_warmup_steps ** (-1.5)) + + def state_dict(self): + optimizer_state_dict = { + 'init_lr':self.init_lr, + 'd_model':self.d_model, + 'n_warmup_steps':self.n_warmup_steps, + 'n_steps':self.n_steps, + '_optimizer':self._optimizer.state_dict(), + } + + return optimizer_state_dict + + def load_state_dict(self, state_dict): + self.init_lr = state_dict['init_lr'] + self.d_model = state_dict['d_model'] + self.n_warmup_steps = state_dict['n_warmup_steps'] + self.n_steps = state_dict['n_steps'] + + self._optimizer.load_state_dict(state_dict['_optimizer']) + + def _update_learning_rate(self): + ''' Learning rate scheduling per step ''' + + self.n_steps += 1 + lr = self.init_lr * self._get_lr_scale() + + for param_group in 
self._optimizer.param_groups: + param_group['lr'] = lr diff --git a/modules/prototypes.py b/modules/prototypes.py new file mode 100644 index 0000000000000000000000000000000000000000..da0d719f68fd60d3c3a9df46298f988ca758402a --- /dev/null +++ b/modules/prototypes.py @@ -0,0 +1,209 @@ +import torch.nn as nn +from torchtext import data +import copy +import layers as layers + +class Embedder(nn.Module): + def __init__(self, vocab_size, d_model): + super().__init__() + self.vocab_size = vocab_size + self.d_model = d_model + + self.embed = nn.Embedding(vocab_size, d_model) + + def forward(self, x): + return self.embed(x) + +class EncoderLayer(nn.Module): + def __init__(self, d_model, heads, dropout=0.1): + """An layer of the encoder. Contain a self-attention accepting padding mask + Args: + d_model: the inner dimension size of the layer + heads: number of heads used in the attention + dropout: applied dropout value during training + """ + super().__init__() + self.norm_1 = layers.Norm(d_model) + self.norm_2 = layers.Norm(d_model) + self.attn = layers.MultiHeadAttention(heads, d_model, dropout=dropout) + self.ff = layers.FeedForward(d_model, dropout=dropout) + self.dropout_1 = nn.Dropout(dropout) + self.dropout_2 = nn.Dropout(dropout) + + def forward(self, x, src_mask): + """Run the encoding layer + Args: + x: the input (either embedding values or previous layer output), should be in shape [batch_size, src_len, d_model] + src_mask: the padding mask, should be [batch_size, 1, src_len] + Return: + an output that have the same shape as input, [batch_size, src_len, d_model] + the attention used [batch_size, src_len, src_len] + """ + x2 = self.norm_1(x) + # Self attention only + x_sa, sa = self.attn(x2, x2, x2, src_mask) + x = x + self.dropout_1(x_sa) + x2 = self.norm_2(x) + x = x + self.dropout_2(self.ff(x2)) + return x, sa + +class DecoderLayer(nn.Module): + def __init__(self, d_model, heads, dropout=0.1): + """An layer of the decoder. Contain a self-attention that accept no-peeking mask and a normal attention tha t accept padding mask + Args: + d_model: the inner dimension size of the layer + heads: number of heads used in the attention + dropout: applied dropout value during training + """ + super().__init__() + self.norm_1 = layers.Norm(d_model) + self.norm_2 = layers.Norm(d_model) + self.norm_3 = layers.Norm(d_model) + + self.dropout_1 = nn.Dropout(dropout) + self.dropout_2 = nn.Dropout(dropout) + self.dropout_3 = nn.Dropout(dropout) + + self.attn_1 = layers.MultiHeadAttention(heads, d_model, dropout=dropout) + self.attn_2 = layers.MultiHeadAttention(heads, d_model, dropout=dropout) + self.ff = layers.FeedForward(d_model, dropout=dropout) + + def forward(self, x, memory, src_mask, trg_mask): + """Run the decoding layer + Args: + x: the input (either embedding values or previous layer output), should be in shape [batch_size, tgt_len, d_model] + memory: the outputs of the encoding section, used for normal attention. 
[batch_size, src_len, d_model] + src_mask: the padding mask for the memory, [batch_size, 1, src_len] + tgt_mask: the no-peeking mask for the decoder, [batch_size, tgt_len, tgt_len] + Return: + an output that have the same shape as input, [batch_size, tgt_len, d_model] + the self-attention and normal attention received [batch_size, head, tgt_len, tgt_len] & [batch_size, head, tgt_len, src_len] + """ + x2 = self.norm_1(x) + # Self-attention + x_sa, sa = self.attn_1(x2, x2, x2, trg_mask) + x = x + self.dropout_1(x_sa) + x2 = self.norm_2(x) + # Normal multi-head attention + x_na, na = self.attn_2(x2, memory, memory, src_mask) + x = x + self.dropout_2(x_na) + x2 = self.norm_3(x) + x = x + self.dropout_3(self.ff(x2)) + return x, (sa, na) + +def get_clones(module, N, keep_module=True): + if(keep_module and N >= 1): + # create N-1 copies in addition to the original + return nn.ModuleList([module] + [copy.deepcopy(module) for i in range(N-1)]) + else: + # create N new copy + return nn.ModuleList([copy.deepcopy(module) for i in range(N)]) + +class Encoder(nn.Module): + """A wrapper that embed, positional encode, and self-attention encode the inputs. + Args: + vocab_size: the size of the vocab. Used for embedding + d_model: the inner dim of the module + N: number of layers used + heads: number of heads used in the attention + dropout: applied dropout value during training + max_seq_length: the maximum length value used for this encoder. Needed for PositionalEncoder, due to caching + """ + def __init__(self, vocab_size, d_model, N, heads, dropout, max_seq_length=200): + super().__init__() + self.N = N + self.embed = nn.Embedding(vocab_size, d_model) + self.pe = layers.PositionalEncoder(d_model, dropout=dropout, max_seq_length=max_seq_length) + self.layers = get_clones(EncoderLayer(d_model, heads, dropout), N) + self.norm = layers.Norm(d_model) + + self._max_seq_length = max_seq_length + + def forward(self, src, src_mask, output_attention=False, seq_length_check=False): + """Accepts a batch of indexed tokens, return the encoded values. + Args: + src: int Tensor of [batch_size, src_len] + src_mask: the padding mask, [batch_size, 1, src_len] + output_attention: if set, output a list containing used attention + seq_length_check: if set, automatically trim the input if it goes past the expected sequence length. + Returns: + the encoded values [batch_size, src_len, d_model] + if available, list of N (self-attention) calculated. They are in form of [batch_size, heads, src_len, src_len] + """ + if(seq_length_check and src.shape[1] > self._max_seq_length): + src = src[:, :self._max_seq_length] + src_mask = src_mask[:, :, :self._max_seq_length] + x = self.embed(src) + x = self.pe(x) + attentions = [None] * self.N + for i in range(self.N): + x, attn = self.layers[i](x, src_mask) + attentions[i] = attn + x = self.norm(x) + return x if(not output_attention) else (x, attentions) + +class Decoder(nn.Module): + """A wrapper that receive the encoder outputs, run through the decoder process for a determined input + Args: + vocab_size: the size of the vocab. Used for embedding + d_model: the inner dim of the module + N: number of layers used + heads: number of heads used in the attention + dropout: applied dropout value during training + max_seq_length: the maximum length value used for this encoder. 
Needed for PositionalEncoder, due to caching + """ + def __init__(self, vocab_size, d_model, N, heads, dropout, max_seq_length=200): + super().__init__() + self.N = N + self.embed = nn.Embedding(vocab_size, d_model) + self.pe = layers.PositionalEncoder(d_model, dropout=dropout, max_seq_length=max_seq_length) + self.layers = get_clones(DecoderLayer(d_model, heads, dropout), N) + self.norm = layers.Norm(d_model) + + self._max_seq_length = max_seq_length + + def forward(self, trg, memory, src_mask, trg_mask, output_attention=False): + """Accepts a batch of indexed tokens and the encoding outputs, return the decoded values. + Args: + trg: input Tensor of [batch_size, trg_len] + memory: output of Encoder [batch_size, src_len, d_model] + src_mask: the padding mask, [batch_size, 1, src_len] + trg_mask: the no-peeking mask, [batch_size, tgt_len, tgt_len] + output_attention: if set, output a list containing used attention + Returns: + the decoded values [batch_size, tgt_len, d_model] + if available, list of N (self-attention, attention) calculated. They are in form of [batch_size, heads, tgt_len, tgt/src_len] + """ + x = self.embed(trg) + x = self.pe(x) + + attentions = [None] * self.N + for i in range(self.N): + x, attn = self.layers[i](x, memory, src_mask, trg_mask) + attentions[i] = attn + x = self.norm(x) + return x if(not output_attention) else (x, attentions) + + +class Config: + """Deprecated""" + def __init__(self): + self.opt = { + 'train_src_data':'/workspace/khoai23/opennmt/data/iwslt_en_vi/train.en', + 'train_trg_data':'/workspace/khoai23/opennmt/data/iwslt_en_vi/train.vi', + 'valid_src_data':'/workspace/khoai23/opennmt/data/iwslt_en_vi/tst2013.en', + 'valid_trg_data':'/workspace/khoai23/opennmt/data/iwslt_en_vi/tst2013.vi', + 'src_lang':'en', # useless atm + 'trg_lang':'en',#'vi_spacy_model', # useless atm + 'max_strlen':160, + 'batchsize':1500, + 'device':'cuda', + 'd_model': 512, + 'n_layers': 6, + 'heads': 8, + 'dropout': 0.1, + 'lr':0.0001, + 'epochs':30, + 'printevery': 200, + 'k':5, + } diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..02ff51e5035cc39649821083620fcabdf2f00c44 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,10 @@ +torch +nltk +torchtext==0.6.0 +pyvi +spacy +PyYAML +dill +pandas +laonlp +perl \ No newline at end of file diff --git a/third-party/multi-bleu.perl b/third-party/multi-bleu.perl new file mode 100644 index 0000000000000000000000000000000000000000..8ebe3918986eda9796f05c6addc0ae0aafce0b35 --- /dev/null +++ b/third-party/multi-bleu.perl @@ -0,0 +1,177 @@ +#!/usr/bin/env perl +# +# This file is part of moses. Its use is licensed under the GNU Lesser General +# Public License version 2.1 or, at your option, any later version. 
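The `Encoder` and `Decoder` wrappers in `modules/prototypes.py` above only document their expected tensor shapes in docstrings. Below is a minimal sketch (not part of this patch) of how they could be instantiated and fed padding / no-peeking masks of those shapes; it assumes the repository's top-level `layers` package (Norm, MultiHeadAttention, FeedForward, PositionalEncoder) is importable, and the vocabulary sizes, pad index and toy batch are placeholders.

```python
# Sketch only: toy sizes, assumes the repo's `layers` package is importable.
import torch
from modules.prototypes import Encoder, Decoder

SRC_VOCAB, TRG_VOCAB, PAD = 1000, 1200, 1
enc = Encoder(SRC_VOCAB, d_model=512, N=6, heads=8, dropout=0.1, max_seq_length=200)
dec = Decoder(TRG_VOCAB, d_model=512, N=6, heads=8, dropout=0.1, max_seq_length=200)

src = torch.randint(0, SRC_VOCAB, (2, 7))         # [batch_size, src_len]
trg = torch.randint(0, TRG_VOCAB, (2, 5))         # [batch_size, trg_len]

src_mask = (src != PAD).unsqueeze(-2)             # [batch_size, 1, src_len] padding mask
no_peek = torch.tril(torch.ones(1, 5, 5)).bool()  # lower-triangular no-peeking mask
trg_mask = (trg != PAD).unsqueeze(-2) & no_peek   # [batch_size, trg_len, trg_len]

memory = enc(src, src_mask)                       # [batch_size, src_len, d_model]
out = dec(trg, memory, src_mask, trg_mask)        # [batch_size, trg_len, d_model]
print(memory.shape, out.shape)
```

The `trg_mask` construction mirrors the `no_peeking_mask`/`create_masks` helpers added later in `utils/misc.py`.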
+ +# $Id$ +use warnings; +use strict; + +my $lowercase = 0; +if ($ARGV[0] eq "-lc") { + $lowercase = 1; + shift; +} + +my $stem = $ARGV[0]; +if (!defined $stem) { + print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n"; + print STDERR "Reads the references from reference or reference0, reference1, ...\n"; + exit(1); +} + +$stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0"; + +my @REF; +my $ref=0; +while(-e "$stem$ref") { + &add_to_ref("$stem$ref",\@REF); + $ref++; +} +&add_to_ref($stem,\@REF) if -e $stem; +die("ERROR: could not find reference file $stem") unless scalar @REF; + +# add additional references explicitly specified on the command line +shift; +foreach my $stem (@ARGV) { + &add_to_ref($stem,\@REF) if -e $stem; +} + + + +sub add_to_ref { + my ($file,$REF) = @_; + my $s=0; + if ($file =~ /.gz$/) { + open(REF,"gzip -dc $file|") or die "Can't read $file"; + } else { + open(REF,$file) or die "Can't read $file"; + } + while() { + chomp; + push @{$$REF[$s++]}, $_; + } + close(REF); +} + +my(@CORRECT,@TOTAL,$length_translation,$length_reference); +my $s=0; +while() { + chomp; + $_ = lc if $lowercase; + my @WORD = split; + my %REF_NGRAM = (); + my $length_translation_this_sentence = scalar(@WORD); + my ($closest_diff,$closest_length) = (9999,9999); + foreach my $reference (@{$REF[$s]}) { +# print "$s $_ <=> $reference\n"; + $reference = lc($reference) if $lowercase; + my @WORD = split(' ',$reference); + my $length = scalar(@WORD); + my $diff = abs($length_translation_this_sentence-$length); + if ($diff < $closest_diff) { + $closest_diff = $diff; + $closest_length = $length; + # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n"; + } elsif ($diff == $closest_diff) { + $closest_length = $length if $length < $closest_length; + # from two references with the same closeness to me + # take the *shorter* into account, not the "first" one. + } + for(my $n=1;$n<=4;$n++) { + my %REF_NGRAM_N = (); + for(my $start=0;$start<=$#WORD-($n-1);$start++) { + my $ngram = "$n"; + for(my $w=0;$w<$n;$w++) { + $ngram .= " ".$WORD[$start+$w]; + } + $REF_NGRAM_N{$ngram}++; + } + foreach my $ngram (keys %REF_NGRAM_N) { + if (!defined($REF_NGRAM{$ngram}) || + $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) { + $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram}; +# print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}
\n"; + } + } + } + } + $length_translation += $length_translation_this_sentence; + $length_reference += $closest_length; + for(my $n=1;$n<=4;$n++) { + my %T_NGRAM = (); + for(my $start=0;$start<=$#WORD-($n-1);$start++) { + my $ngram = "$n"; + for(my $w=0;$w<$n;$w++) { + $ngram .= " ".$WORD[$start+$w]; + } + $T_NGRAM{$ngram}++; + } + foreach my $ngram (keys %T_NGRAM) { + $ngram =~ /^(\d+) /; + my $n = $1; + # my $corr = 0; +# print "$i e $ngram $T_NGRAM{$ngram}
\n"; + $TOTAL[$n] += $T_NGRAM{$ngram}; + if (defined($REF_NGRAM{$ngram})) { + if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) { + $CORRECT[$n] += $T_NGRAM{$ngram}; + # $corr = $T_NGRAM{$ngram}; +# print "$i e correct1 $T_NGRAM{$ngram}
\n"; + } + else { + $CORRECT[$n] += $REF_NGRAM{$ngram}; + # $corr = $REF_NGRAM{$ngram}; +# print "$i e correct2 $REF_NGRAM{$ngram}
\n"; + } + } + # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram}; + # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n" + } + } + $s++; +} +my $brevity_penalty = 1; +my $bleu = 0; + +my @bleu=(); + +for(my $n=1;$n<=4;$n++) { + if (defined ($TOTAL[$n])){ + $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0; + # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n"; + }else{ + $bleu[$n]=0; + } +} + +if ($length_reference==0){ + printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n"; + exit(1); +} + +if ($length_translation<$length_reference) { + $brevity_penalty = exp(1-$length_reference/$length_translation); +} +$bleu = $brevity_penalty * exp((my_log( $bleu[1] ) + + my_log( $bleu[2] ) + + my_log( $bleu[3] ) + + my_log( $bleu[4] ) ) / 4) ; +printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n", + 100*$bleu, + 100*$bleu[1], + 100*$bleu[2], + 100*$bleu[3], + 100*$bleu[4], + $brevity_penalty, + $length_translation / $length_reference, + $length_translation, + $length_reference; + + +print STDERR "It is in-advisable to publish scores from multi-bleu.perl. The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups. Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization. Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.\n"; + +sub my_log { + return -9999999999 unless $_[0]; + return log($_[0]); +} diff --git a/utils/data.py b/utils/data.py new file mode 100644 index 0000000000000000000000000000000000000000..8534e5a3848b7cb4fca94118697310998329caed --- /dev/null +++ b/utils/data.py @@ -0,0 +1,137 @@ +import re, os +import nltk +from nltk.corpus import wordnet +import dill as pickle +import pandas as pd +from torchtext import data +from laonlp import tokenize + +def multiple_replace(dict, text): + # Create a regular expression from the dictionary keys + regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys()))) + + # For each match, look-up corresponding value in dictionary + return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text) + +# get_synonym replace word with any synonym found among src +def get_synonym(word, SRC): + syns = wordnet.synsets(word) + for s in syns: + for l in s.lemmas(): + if SRC.vocab.stoi[l.name()] != 0: + return SRC.vocab.stoi[l.name()] + + return 0 + +class Tokenizer: + def __init__(self, lang=None): + if(lang is not None): + self.nlp = spacy.load(lang) + self.tokenizer_fn = self.nlp.tokenizer + else: + self.tokenizer_fn = lambda l: l.strip().split() + + # def tokenize(self, sentence): + # sentence = re.sub( + # r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’\!;]", " ", str(sentence)) + # sentence = re.sub(r"[ ]+", " ", sentence) + # sentence = re.sub(r"\!+", "!", sentence) + # sentence = re.sub(r"\,+", ",", sentence) + # sentence = re.sub(r"\?+", "?", sentence) + # sentence = sentence.lower() + # return [tok.text for tok in self.tokenizer_fn(sentence) if tok.text != " "] + + +def read_data(src_file, trg_file): + src_data = open(src_file).read().strip().split('\n') + + trg_data = open(trg_file).read().strip().split('\n') + + return src_data, trg_data + + +def read_file(file_dir): + f = open(file_dir, 'r') + data = f.read().strip().split('\n') + return data + +def write_file(file_dir, content): + f = open(file_dir, "w") + f.write(content) + f.close() + +def create_fields(src_lang, trg_lang): + + #print("loading spacy 
tokenizers...") + # + # t_src = tokenize(src_lang) + # t_trg = tokenize(trg_lang) + # t_src_tokenizer = t_trg_tokenizer = lambda x: x.strip().split() + target_tokenizer = lambda x: x.strip().split() + + TRG = data.Field(lower=True, tokenize=target_tokenizer, init_token='', eos_token='') + SRC = data.Field(lower=True, tokenize=tokenize.word_tokenize) + + return SRC, TRG + +def create_dataset(src_data, trg_data, max_strlen, batchsize, device, SRC, TRG, istrain=True): + + print("creating dataset and iterator... ") + + raw_data = {'src' : [line for line in src_data], 'trg': [line for line in trg_data]} + df = pd.DataFrame(raw_data, columns=["src", "trg"]) + + mask = (df['src'].str.count(' ') < max_strlen) & (df['trg'].str.count(' ') < max_strlen) + df = df.loc[mask] + + df.to_csv("translate_transformer_temp.csv", index=False) + + data_fields = [('src', SRC), ('trg', TRG)] + train = data.TabularDataset('./translate_transformer_temp.csv', format='csv', fields=data_fields) + + train_iter = MyIterator(train, batch_size=batchsize, device=device, + repeat=False, sort_key=lambda x: (len(x.src), len(x.trg)), + batch_size_fn=batch_size_fn, train=istrain, shuffle=True) + + os.remove('translate_transformer_temp.csv') + + if istrain: + SRC.build_vocab(train) + TRG.build_vocab(train) + + return train_iter + +class MyIterator(data.Iterator): + def create_batches(self): + if self.train: + def pool(d, random_shuffler): + for p in data.batch(d, self.batch_size * 100): + p_batch = data.batch( + sorted(p, key=self.sort_key), + self.batch_size, self.batch_size_fn) + for b in random_shuffler(list(p_batch)): + yield b + self.batches = pool(self.data(), self.random_shuffler) + + else: + self.batches = [] + for b in data.batch(self.data(), self.batch_size, + self.batch_size_fn): + self.batches.append(sorted(b, key=self.sort_key)) + +global max_src_in_batch, max_tgt_in_batch + +def batch_size_fn(new, count, sofar): + "Keep augmenting batch and calculate total number of tokens + padding." 
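+    # Used as batch_size_fn for MyIterator: instead of counting sentences, the
+    # batch "size" is the number of padded tokens it would occupy, i.e. the
+    # running example count times the longest source/target length seen so far
+    # (+2 on the target for the start/end tokens added by the target field).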
+ global max_src_in_batch, max_tgt_in_batch + if count == 1: + max_src_in_batch = 0 + max_tgt_in_batch = 0 + max_src_in_batch = max(max_src_in_batch, len(new.src)) + max_tgt_in_batch = max(max_tgt_in_batch, len(new.trg) + 2) + src_elements = count * max_src_in_batch + tgt_elements = count * max_tgt_in_batch + return max(src_elements, tgt_elements) + +def generate_language_token(lang: str): + return '<{}>'.format(lang.strip()) \ No newline at end of file diff --git a/utils/decode_old.py b/utils/decode_old.py new file mode 100644 index 0000000000000000000000000000000000000000..ddac381bf7abe390a234a1c199bf685bf352efe3 --- /dev/null +++ b/utils/decode_old.py @@ -0,0 +1,163 @@ +import numpy as np +import math +import torch +from torch.autograd import Variable +import torch.nn.functional as functional + +from utils.data import multiple_replace, get_synonym + +def no_peeking_mask(size, device): + """ + Tạo mask được sử dụng trong decoder để lúc dự đoán trong quá trình huấn luyện + mô hình không nhìn thấy được các từ ở tương lai + """ + np_mask = np.triu(np.ones((1, size, size)), +k=1).astype('uint8') + np_mask = Variable(torch.from_numpy(np_mask) == 0) + np_mask = np_mask.to(device) + + return np_mask + +def create_masks(src, trg, src_pad, trg_pad, device): + """ Tạo mask cho encoder, + để mô hình không bỏ qua thông tin của các kí tự PAD do chúng ta thêm vào + """ + src_mask = (src != src_pad).unsqueeze(-2) + + if trg is not None: + trg_mask = (trg != trg_pad).unsqueeze(-2) + size = trg.size(1) # get seq_len for matrix + np_mask = no_peeking_mask(size, device) + if trg.is_cuda: + np_mask.cuda() + trg_mask = trg_mask & np_mask + + else: + trg_mask = None + return src_mask, trg_mask + +def init_vars(src, model, SRC, TRG, device, k, max_len): + """ Tính toán các ma trận cần thiết trong quá trình translation sau khi mô hình học xong + """ + init_tok = TRG.vocab.stoi[''] + src_mask = (src != SRC.vocab.stoi['']).unsqueeze(-2) + + # tính sẵn output của encoder + e_output = model.encoder(src, src_mask) + + outputs = torch.LongTensor([[init_tok]]) + + outputs = outputs.to(device) + + trg_mask = no_peeking_mask(1, device) + # dự đoán kí tự đầu tiên + out = model.out(model.decoder(outputs, + e_output, src_mask, trg_mask)) + out = functional.softmax(out, dim=-1) + + probs, ix = out[:, -1].data.topk(k) + log_scores = torch.Tensor([math.log(prob) for prob in probs.data[0]]).unsqueeze(0) + + outputs = torch.zeros(k, max_len).long() + outputs = outputs.to(device) + outputs[:, 0] = init_tok + outputs[:, 1] = ix[0] + + e_outputs = torch.zeros(k, e_output.size(-2),e_output.size(-1)) + + e_outputs = e_outputs.to(device) + e_outputs[:, :] = e_output[0] + + return outputs, e_outputs, log_scores + +def k_best_outputs(outputs, out, log_scores, i, k): + # debug print + + probs, ix = out[:, -1].data.topk(k) + log_probs = torch.Tensor([math.log(p) for p in probs.data.view(-1)]).view(k, -1) + log_scores.transpose(0,1) + k_probs, k_ix = log_probs.view(-1).topk(k) + + row = k_ix // k + col = k_ix % k + + outputs[:, :i] = outputs[row, :i] + outputs[:, i] = ix[row, col] + + log_scores = k_probs.unsqueeze(0) + + return outputs, log_scores + +def beam_search(src, model, SRC, TRG, device, k, max_len, debug=False, output_list_of_tokens=False): + + outputs, e_outputs, log_scores = init_vars(src, model, SRC, TRG, device, k, max_len) + eos_tok = TRG.vocab.stoi[''] + src_mask = (src != SRC.vocab.stoi['']).unsqueeze(-2) + ind = None + for i in range(2, max_len): + if(debug): + print("Current iteration to maxlen: {:d}".format(i)) + + 
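+        # One beam-search step: rebuild the no-peeking mask for the current
+        # prefix length i, decode the k candidate prefixes against the cached
+        # encoder outputs, and let k_best_outputs keep the k prefixes with the
+        # highest accumulated log-probability. Once every beam has produced an
+        # end-of-sentence token, the block below picks the beam maximising
+        # score / length**0.7 (a simple length penalty) and stops early.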
trg_mask = no_peeking_mask(i, device) + + out = model.out(model.decoder(outputs[:,:i], e_outputs, src_mask, trg_mask)) + + out = functional.softmax(out, dim=-1) + + outputs, log_scores = k_best_outputs(outputs, out, log_scores, i, k) + + ones = (outputs==eos_tok).nonzero() # Occurrences of end symbols for all input sentences. + sentence_lengths = torch.zeros(len(outputs), dtype=torch.long).to(device) + for vec in ones: + i = vec[0] + if sentence_lengths[i]==0: # First end symbol has not been found yet + sentence_lengths[i] = vec[1] # Position of first end symbol + + num_finished_sentences = len([s for s in sentence_lengths if s > 0]) + + if num_finished_sentences == k: + alpha = 0.7 + div = 1/(sentence_lengths.type_as(log_scores)**alpha) + _, ind = torch.max(log_scores * div, 1) + ind = ind.data[0] + break + + # additional change to output list of tokens instead of string + join_fn = (lambda x: x) if(output_list_of_tokens) else (lambda x: " ".join(x)) + + if ind is None: + length = (outputs[0]==eos_tok).nonzero()[0] if len((outputs[0]==eos_tok).nonzero()) > 0 else -1 + return join_fn([TRG.vocab.itos[tok] for tok in outputs[0, 1:length]]) + else: + length = (outputs[ind]==eos_tok).nonzero()[0] + return join_fn([TRG.vocab.itos[tok] for tok in outputs[ind, 1:length]]) + +def translate_sentence(raw_sentence, model, SRC, TRG, device, k, max_len, debug=False, output_list_of_tokens=False): + """Dịch một câu sử dụng beamsearch + """ + model.eval() + indexed = [] + if(isinstance(raw_sentence, str)): + # single sentence, require preprocessing + sentence = SRC.preprocess(raw_sentence) + else: + # already tokenized (taken from iterators, etc.) + sentence = raw_sentence + + for tok in sentence: + if SRC.vocab.stoi[tok] != SRC.vocab.stoi['']: + indexed.append(SRC.vocab.stoi[tok]) + else: + indexed.append(get_synonym(tok, SRC)) + + output = Variable(torch.LongTensor([indexed])) + + output = output.to(device) + + output = beam_search(output, model, SRC, TRG, device, k, max_len, output_list_of_tokens=output_list_of_tokens) + + if(debug): + print("{} -> {}".format(raw_sentence, output)) + + return output + +# return multiple_replace({' ?' 
: '?',' !':'!',' .':'.','\' ':'\'',' ,':','}, sentence) diff --git a/utils/logging.py b/utils/logging.py new file mode 100644 index 0000000000000000000000000000000000000000..53852deb4357e6373c009d6255d9cae073ac4de1 --- /dev/null +++ b/utils/logging.py @@ -0,0 +1,22 @@ +from __future__ import absolute_import + +import os +import logging + +def init_logger(model_dir, log_file=None, rotate=False): + + logging.basicConfig(level=logging.DEBUG, + format='[%(asctime)s %(levelname)s] %(message)s', + datefmt='%a, %d %b %Y %H:%M:%S', + filename=os.path.join(model_dir, log_file), + filemode='w') + console = logging.StreamHandler() + console.setLevel(logging.INFO) + # set a format which is simpler for console use + formatter = logging.Formatter('[%(asctime)s %(levelname)s] %(message)s', '%a, %d %b %Y %H:%M:%S') + # tell the handler to use this format + console.setFormatter(formatter) + # add the handler to the root logger + logging.getLogger('').addHandler(console) + + return logging \ No newline at end of file diff --git a/utils/loss.py b/utils/loss.py new file mode 100644 index 0000000000000000000000000000000000000000..e89e9a3caa6b0cb995a697a7a3a9475d9c031d92 --- /dev/null +++ b/utils/loss.py @@ -0,0 +1,25 @@ +import torch +import torch.nn as nn + +class LabelSmoothingLoss(nn.Module): + def __init__(self, classes, padding_idx, smoothing=0.0, dim=-1): + super(LabelSmoothingLoss, self).__init__() + self.confidence = 1.0 - smoothing + self.smoothing = smoothing + self.cls = classes + self.dim = dim + self.padding_idx = padding_idx + + def forward(self, pred, target): + pred = pred.log_softmax(dim=self.dim) + with torch.no_grad(): + # true_dist = pred.data.clone() + true_dist = torch.zeros_like(pred) + true_dist.fill_(self.smoothing / (self.cls - 2)) + true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence) + true_dist[:, self.padding_idx] = 0 + mask = torch.nonzero(target.data == self.padding_idx) #, as_tuple=False is redundant and causing error + if mask.dim() > 0: + true_dist.index_fill_(0, mask.squeeze(), 0.0) + + return torch.mean(torch.sum(-true_dist * pred, dim=self.dim)) diff --git a/utils/metric.py b/utils/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..08a4fdf303344460e5886ba55d706e3d031670d7 --- /dev/null +++ b/utils/metric.py @@ -0,0 +1,60 @@ +from torchtext.data.metrics import bleu_score + +def bleu(valid_src_data, valid_trg_data, model, device, k, max_strlen): + pred_sents = [] + for sentence in valid_src_data: + pred_trg = model.translate_sentence(sentence, device, k, max_strlen) + pred_sents.append(pred_trg) + + pred_sents = [self.TRG.preprocess(sent) for sent in pred_sents] + trg_sents = [[sent.split()] for sent in valid_trg_data] + + return bleu_score(pred_sents, trg_sents) + +def bleu_single(model, valid_dataset, debug=False): + """Perform single sentence translation, then calculate bleu score. Update when batch beam search is online""" + # need to join the sentence back per sample (the iterator is the one that had already been split to tokens) + # THIS METRIC USE 2D vs 3D! AAAAAAHHHHHHH!!!! 
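+    # torchtext's bleu_score expects the candidate corpus as a list of token
+    # lists (2D) and the reference corpus as a list of lists of token lists
+    # (3D, one inner list per reference translation) -- hence [pair.trg] below.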
+ translate_pair = ( ([pair.trg], model.translate_sentence(pair.src, debug=debug)) for pair in valid_dataset) +# raise Exception(next(translate_pair)) + labels, predictions = [list(l) for l in zip(*translate_pair)] # zip( *((l, p.split()) for l, p in translate_pair) ) + return bleu_score(predictions, labels) + +def bleu_batch(model, valid_dataset, batch_size, debug=False): + """Perform batch sentence translation in the same vein.""" + predictions = model.translate_batch_sentence([s.src for s in valid_dataset], output_tokens=True, batch_size=batch_size) + labels = [[s.trg] for s in valid_dataset] + return bleu_score(predictions, labels) + + +def _revert_trg(sent, eos): # revert batching process on sentence + try: + endloc = sent.index(eos) + return sent[1:endloc] + except ValueError: + return sent[1:] + +def bleu_batch_iter(model, valid_iter, src_lang=None, trg_lang=None, eos_token="", debug=False): + """Perform batched translations; other metrics are the same. Note that the inputs/outputs had been preprocessed, but have [length, batch_size] shape as per BucketIterator""" +# raise NotImplementedError("Error during calculation, currently unusable.") + # raise Exception([[model.SRC.vocab.itos[t] for t in batch] for batch in next(iter(valid_iter)).src.transpose(0, 1)]) + + translated_batched_pair = ( + ( + pair.trg.transpose(0, 1), # transpose due to timestep-first batches + model.decode_strategy.translate_batch_sentence( + pair.src.transpose(0, 1), + src_lang=src_lang, + trg_lang=trg_lang, + output_tokens=True, + field_processed=True, + replace_unk=False, # do not replace in this version + debug=debug + ) + ) + for pair in valid_iter + ) + flattened_pair = ( ([model.TRG.vocab.itos[i] for i in trg], pred) for batch_trg, batch_pred in translated_batched_pair for trg, pred in zip(batch_trg, batch_pred) ) + flat_labels, predictions = [list(l) for l in zip(*flattened_pair)] + labels = [[_revert_trg(l, eos_token)] for l in flat_labels] # remove and also updim the trg for 3D requirements. + return bleu_score(predictions, labels) diff --git a/utils/misc.py b/utils/misc.py new file mode 100644 index 0000000000000000000000000000000000000000..2717a8bafd0acdee3d955aab64b6e3985e191e41 --- /dev/null +++ b/utils/misc.py @@ -0,0 +1,35 @@ +import numpy as np +import torch +from torch.autograd import Variable + + +def no_peeking_mask(size, device): + """ + Creating a mask for decoder + that future words cannot be seen at prediction during training. 
+ """ + np_mask = np.triu(np.ones((1, size, size)), + k=1).astype('uint8') + np_mask = Variable(torch.from_numpy(np_mask) == 0) + np_mask = np_mask.to(device) + + return np_mask + +def create_masks(src, trg, src_pad, trg_pad, device): + """ + Creating a mask for Encoder + That the model does not ignore the information of the PAD characters we added + """ + src_mask = (src != src_pad).unsqueeze(-2) + + if trg is not None: + trg_mask = (trg != trg_pad).unsqueeze(-2) + size = trg.size(1) # get seq_len for matrix + np_mask = no_peeking_mask(size, device) + if trg.is_cuda: + np_mask.cuda() + trg_mask = trg_mask & np_mask + + else: + trg_mask = None + return src_mask, trg_mask diff --git a/utils/save.py b/utils/save.py new file mode 100644 index 0000000000000000000000000000000000000000..21fdb173c8dcca52afdb08f84d063202b0684030 --- /dev/null +++ b/utils/save.py @@ -0,0 +1,116 @@ +import torch +import os, re, io +import json +import dill as pickle +from shutil import copy2 as copy +MODEL_EXTENSION = ".pkl" +MODEL_FILE_FORMAT = "{:s}_{:d}{:s}" # model_prefix, epoch and extension +BEST_MODEL_FILE = ".model_score.txt" +MODEL_SERVE_FILE = ".serve.txt" +VOCAB_FILE_FORMAT = "{:s}{:s}{:s}" + +def save_model_name(name, path, serve_config_path=MODEL_SERVE_FILE): + with io.open(os.path.join(path, serve_config_path), "w", encoding="utf-8") as serve_config_file: + serve_config_file.write(name) + +def save_vocab_to_path(path, language_tuple, fields, name_prefix="vocab", check_saved_vocab=True): + src_field, trg_field = fields + src_ext, trg_ext = language_tuple + src_vocab_path = os.path.join(path, VOCAB_FILE_FORMAT.format(name_prefix, src_ext, MODEL_EXTENSION)) + trg_vocab_path = os.path.join(path, VOCAB_FILE_FORMAT.format(name_prefix, trg_ext, MODEL_EXTENSION)) + if(check_saved_vocab and os.path.isfile(src_vocab_path) and os.path.isfile(trg_vocab_path)):# do nothing if already exist + return + with io.open(src_vocab_path , "wb") as src_vocab_file: + pickle.dump(src_field.vocab, src_vocab_file) + with io.open(trg_vocab_path , "wb") as trg_vocab_file: + pickle.dump(trg_field.vocab, trg_vocab_file) + +def load_vocab_from_path(path, language_tuple, fields, name_prefix="vocab"): + """Load the vocabulary from path into respective fields. 
If files doesn't exist, return False; if loaded properly, return True""" + src_field, trg_field = fields + src_ext, trg_ext = language_tuple + src_vocab_file_path = os.path.join(path, VOCAB_FILE_FORMAT.format(name_prefix, src_ext, MODEL_EXTENSION)) + trg_vocab_file_path = os.path.join(path, VOCAB_FILE_FORMAT.format(name_prefix, trg_ext, MODEL_EXTENSION)) + if(not os.path.isfile(src_vocab_file_path) or not os.path.isfile(trg_vocab_file_path)): + # the vocab file wasn't dumped, return False + return False + with io.open(src_vocab_file_path, "rb") as src_vocab_file, io.open(trg_vocab_file_path, "rb") as trg_vocab_file: + src_vocab = pickle.load(src_vocab_file) + src_field.vocab = src_vocab + trg_vocab = pickle.load(trg_vocab_file) + trg_field.vocab = trg_vocab + return True + +def save_model_to_path(model, path, name_prefix="model", checkpoint_idx=0, save_vocab=True): + save_path = os.path.join(path, MODEL_FILE_FORMAT.format(name_prefix, checkpoint_idx, MODEL_EXTENSION)) + torch.save(model.state_dict(), save_path) + if(save_vocab): + save_vocab_to_path(path, model.loader._language_tuple, model.fields) + +def load_model_from_path(model, path, name_prefix="model", checkpoint_idx=0): + # do not load vocab here, as the vocab structure will be decided in model.loader.build_vocab + save_path = os.path.join(path, MODEL_FILE_FORMAT.format(name_prefix, checkpoint_idx, MODEL_EXTENSION)) + model.load_state_dict(torch.load(save_path)) + + +def load_model(model, model_path): + model.load_state_dict(torch.load(model_path)) + +def check_model_in_path(path, name_prefix="model", return_all_checkpoint=False): + model_re = re.compile(r"{:s}_(\d+){:s}".format(name_prefix, MODEL_EXTENSION)) + if(not os.path.isdir(path)): + return 0 + matches = [re.match(model_re, f) for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))] +# print(matches) + indices = sorted([int(m.group(1)) for m in matches if m is not None]) + if(return_all_checkpoint): + return indices + elif(len(indices) == 0): + return 0 + else: + return indices[-1] + +def save_and_clear_model(model, path, name_prefix="model", checkpoint_idx=0, maximum_saved_model=5): + """Keep only last n models when saving. 
Explicitly save the model regardless of its checkpoint index, e.g if checkpoint_idx=0 & model 3 4 5 6 7 is in path, it will remove 3 and save 0 instead.""" + indices = check_model_in_path(path, name_prefix=name_prefix, return_all_checkpoint=True) + if(maximum_saved_model <= len(indices)): + # remove models until n-1 models are left + for i in indices[:-(maximum_saved_model-1)]: + os.remove(os.path.join(path, MODEL_FILE_FORMAT.format(name_prefix, i, MODEL_EXTENSION))) + # perform save as normal + save_model_to_path(model, path, name_prefix=name_prefix, checkpoint_idx=checkpoint_idx) + +def load_model_score(path, score_file=BEST_MODEL_FILE): + """Load the model score as a list from a json dump, organized from best to worst.""" + score_file_path = os.path.join(path, score_file) + if(not os.path.isfile(score_file_path)): + return [] + with io.open(score_file_path, "r") as jf: + return json.load(jf) + +def write_model_score(path, score_obj, score_file=BEST_MODEL_FILE): + with io.open(os.path.join(path, score_file), "w") as jf: + json.dump(score_obj, jf) + +def save_model_best_to_path(model, path, score_obj, model_metric, best_model_prefix="best_model", maximum_saved_model=5, score_file=BEST_MODEL_FILE, save_after_update=True): + worst_score = score_obj[-1] if len(score_obj) > 0 else -1.0 + if(model_metric > worst_score): + # perform update, overriding a slot or create new if needed + insert_loc = next((idx for idx, score in enumerate(score_obj) if model_metric > score), 0) + # every model below it, up to {maximum_saved_model}, will be moved down an index + for i in range(insert_loc, min(len(score_obj), maximum_saved_model)-1): # -1, due to the models are copied up to +1 + old_loc = save_path = os.path.join(path, MODEL_FILE_FORMAT.format(best_model_prefix, i, MODEL_EXTENSION)) + new_loc = save_path = os.path.join(path, MODEL_FILE_FORMAT.format(best_model_prefix, i+1, MODEL_EXTENSION)) + copy(old_loc, new_loc) + # save the model to the selected loc + save_model_to_path(model, path, name_prefix=best_model_prefix, checkpoint_idx=insert_loc) + # update the score obj + score_obj.insert(insert_loc, model_metric) + score_obj = score_obj[:maximum_saved_model] + # also update in disk, if enabled + if(save_after_update): + write_model_score(path, score_obj, score_file=score_file) + # after routine had been done, return the obj + return score_obj + +
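To make the relationship between the optimizers and the scheduler in `modules/optim` concrete, here is a minimal training-loop sketch (not part of this patch). It wires `AdaBeliefOptim` (via the `optimizers` registry) into `ScheduledOptim`, whose per-step learning rate is `init_lr * d_model**-0.5 * min(step**-0.5, step * n_warmup_steps**-1.5)`; the linear model, random data and hyperparameter values below are placeholders.

```python
# Sketch only: placeholder model/data, assumes the repo's modules are importable.
import torch
import torch.nn as nn
from modules.optim import ScheduledOptim, optimizers

model = nn.Linear(512, 512)                       # stand-in for the Transformer
base_optim = optimizers["AdaBelief"](model.parameters(),
                                     lr=1.0, betas=(0.9, 0.98), eps=1e-9)
optim = ScheduledOptim(base_optim, init_lr=0.2, d_model=512, n_warmup_steps=4000)

for step in range(1, 6):                          # a few dummy steps
    x = torch.randn(8, 512)
    loss = model(x).pow(2).mean()
    optim.zero_grad()
    loss.backward()
    optim.step_and_update_lr()                    # rescale lr, then AdaBelief step
    print(step, base_optim.param_groups[0]["lr"])
```

Because `_update_learning_rate` runs before the inner `step()`, the `lr` passed to `AdaBeliefOptim`'s constructor is only a placeholder that gets overwritten on the first call to `step_and_update_lr()`.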