# Image to Text

---------

**WARNING**: This example is based on the [legacy version of OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py/tree/legacy)!

---------

A deep learning-based approach to learning image-to-text conversion, built on top of the [OpenNMT](http://opennmt.net/) system. It is completely data-driven, and can therefore be used for a variety of image-to-text problems such as image captioning, optical character recognition, and LaTeX decompilation.

Take LaTeX decompilation as an example. Given a formula image:

<p align="center"><img src="http://lstm.seas.harvard.edu/latex/results/website/images/119b93a445-orig.png"></p>

The goal is to infer the LaTeX source that can be compiled to such an image:

```
 d s _ { 1 1 } ^ { 2 } = d x ^ { + } d x ^ { - } + l _ { p } ^ { 9 } \frac { p _ { - } } { r ^ { 7 } } \delta ( x ^ { - } ) d x ^ { - } d x ^ { - } + d x _ { 1 } ^ { 2 } + \; \cdots \; + d x _ { 9 } ^ { 2 } 
```

The paper [[What You Get Is What You See: A Visual Markup Decompiler]](https://arxiv.org/pdf/1609.04938.pdf) provides more technical details of this model.

### Dependencies

* `torchvision`: `conda install torchvision`
* `Pillow`: `pip install Pillow`

### Quick Start

To get started, we provide a toy Math-to-LaTeX example. We assume that the working directory is `OpenNMT-py` throughout this document.

Im2Text consists of four commands:

0) Download the data.

```bash
wget -O data/im2text.tgz http://lstm.seas.harvard.edu/latex/im2text_small.tgz
tar zxf data/im2text.tgz -C data/
```
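
Before preprocessing, it can help to confirm that the archive extracted where the commands in this guide expect it. The following optional sanity check is not part of Im2Text; it only uses the paths that appear in the commands below.

```python
# Optional sanity check: every image listed in src-train.txt should exist
# under data/im2text/images/ (the paths used by the commands in this guide).
import os

src_dir = "data/im2text/images"
with open("data/im2text/src-train.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

missing = [p for p in paths if not os.path.exists(os.path.join(src_dir, p))]
print(f"{len(paths)} training images listed, {len(missing)} missing")
```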

1) Preprocess the data.

```bash
onmt_preprocess -data_type img \
                -src_dir data/im2text/images/ \
                -train_src data/im2text/src-train.txt \
                -train_tgt data/im2text/tgt-train.txt \
                -valid_src data/im2text/src-val.txt \
                -valid_tgt data/im2text/tgt-val.txt \
                -save_data data/im2text/demo \
                -tgt_seq_length 150 \
                -tgt_words_min_frequency 2 \
                -shard_size 500 \
                -image_channel_size 1
```
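
The `-image_channel_size 1` flag tells the preprocessor to treat the formula images as single-channel (grayscale); colour inputs would typically use 3. If you are unsure which value fits your own data, a quick look at one image with Pillow (already a dependency) is enough. The snippet below is only an illustration:

```python
# Illustration only: inspect one image's mode to decide on -image_channel_size.
# Pillow reports e.g. 'L' for single-channel images, 'RGB'/'RGBA' for colour.
from PIL import Image

with open("data/im2text/src-train.txt") as f:
    first_path = f.readline().strip()

img = Image.open("data/im2text/images/" + first_path)
print(img.mode, img.size)
```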

2) Train the model.

```bash
onmt_train -model_type img \
           -data data/im2text/demo \
           -save_model demo-model \
           -gpu_ranks 0 \
           -batch_size 20 \
           -max_grad_norm 20 \
           -learning_rate 0.1 \
           -word_vec_size 80 \
           -encoder_type brnn \
           -image_channel_size 1
```
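
Training writes checkpoints named after `-save_model` together with the validation accuracy, perplexity, and epoch, which is why the next step refers to a placeholder such as `demo-model_acc_x_ppl_x_e13.pt`; your actual filename will differ. A checkpoint is an ordinary PyTorch file, and its exact layout depends on the OpenNMT-py version, so the sketch below only prints its top-level keys rather than assuming a particular structure:

```python
# Optional: peek at a saved checkpoint. Replace the file name with the
# checkpoint that training actually produced for you.
import torch

ckpt = torch.load("demo-model_acc_x_ppl_x_e13.pt", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```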

3) Translate the images.

```bash
onmt_translate -data_type img \
               -model demo-model_acc_x_ppl_x_e13.pt \
               -src_dir data/im2text/images \
               -src data/im2text/src-test.txt \
               -output pred.txt \
               -max_length 150 \
               -beam_size 5 \
               -gpu 0 \
               -verbose
```

The above dataset is sampled from the [im2latex-100k dataset](http://lstm.seas.harvard.edu/latex/im2text.tgz). We also provide a [trained model](http://lstm.seas.harvard.edu/latex/py-model.pt) for this dataset.
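
For a quick look at output quality, `pred.txt` can be compared line by line against the gold labels. The reference path in the sketch below (`data/im2text/tgt-test.txt`) is an assumption about the toy dataset; point it at whatever reference file matches your `-src` list.

```python
# Rough exact-match score between the decoder output and the references.
# NOTE: data/im2text/tgt-test.txt is an assumed file name; adjust as needed.
with open("pred.txt") as f:
    preds = [line.strip() for line in f]
with open("data/im2text/tgt-test.txt") as f:
    refs = [line.strip() for line in f]

matches = sum(p == r for p, r in zip(preds, refs))
print(f"exact match: {matches}/{min(len(preds), len(refs))}")
```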

### Options

* `-src_dir`: The directory containing the images.

* `-train_tgt`: The file storing the tokenized labels, one label per line. It should look like:
```
<label0_token0> <label0_token1> ... <label0_tokenN0>
<label1_token0> <label1_token1> ... <label1_tokenN1>
<label2_token0> <label2_token1> ... <label2_tokenN2>
...
```

* `-train_src`: The file storing the paths of the images (relative to `src_dir`), one path per line (a sketch for generating both files is given below):
```
<image0_path>
<image1_path>
<image2_path>
...
```