import streamlit as st

# Custom CSS for better styling
st.markdown("""
    <style>
        .main-title {
            font-size: 36px;
            color: #4A90E2;
            font-weight: bold;
            text-align: center;
        }
        .sub-title {
            font-size: 24px;
            color: #4A90E2;
            margin-top: 20px;
        }
        .section {
            background-color: #f9f9f9;
            padding: 15px;
            border-radius: 10px;
            margin-top: 20px;
        }
        .section h2 {
            font-size: 22px;
            color: #4A90E2;
        }
        .section p, .section ul {
            color: #666666;
        }
        .link {
            color: #4A90E2;
            text-decoration: none;
        }
        .benchmark-table {
            width: 100%;
            border-collapse: collapse;
            margin-top: 20px;
        }
        .benchmark-table th, .benchmark-table td {
            border: 1px solid #ddd;
            padding: 8px;
            text-align: left;
        }
        .benchmark-table th {
            background-color: #4A90E2;
            color: white;
        }
        .benchmark-table td {
            background-color: #f2f2f2;
        }
    </style>
""", unsafe_allow_html=True)

# Main Title
st.markdown('<div class="main-title">Image Captioning with VisionEncoderDecoderModel</div>', unsafe_allow_html=True)

# Description
st.markdown("""
<div class="section">
    <p><strong>VisionEncoderDecoderModel</strong> allows you to initialize an image-to-text model using any pretrained Transformer-based vision model (e.g., ViT, BEiT, DeiT, Swin) as the encoder and any pretrained language model (e.g., RoBERTa, GPT2, BERT, DistilBERT) as the decoder.</p>
    <p>This approach has been demonstrated to be effective in models like TrOCR: <a class="link" href="https://arxiv.org/abs/2109.10282" target="_blank">Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li et al.</a></p>
    <p>After training or fine-tuning a VisionEncoderDecoderModel, it can be saved and loaded just like any other model. A usage example is provided below.</p>
</div>
""", unsafe_allow_html=True)

# Image Captioning Overview
st.markdown('<div class="sub-title">What is Image Captioning?</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p><strong>Image Captioning</strong> is the task of generating a textual description of an image. It uses a model to encode the image into a feature representation, which is then decoded by a language model to produce a natural language description.</p>
    <h2>How It Works</h2>
    <p>Image captioning typically involves the following steps:</p>
    <ul>
        <li><strong>Image Encoding</strong>: The image is passed through a vision model (e.g., ViT) to produce a feature representation.</li>
        <li><strong>Caption Generation</strong>: The feature representation is fed into a language model (e.g., GPT2) to generate a caption for the image.</li>
    </ul>
    <h2>Why Use Image Captioning?</h2>
    <p>Image captioning is useful for:</p>
    <ul>
        <li>Automatically generating descriptions for images, enhancing accessibility.</li>
        <li>Improving search engine capabilities by allowing images to be indexed with textual content.</li>
        <li>Supporting content management systems with automated tagging and description generation.</li>
    </ul>
    <h2>Where to Use It</h2>
    <p>Applications of image captioning span various domains:</p>
    <ul>
        <li><strong>Social Media</strong>: Automatically generating captions for user-uploaded images.</li>
        <li><strong>Digital Libraries</strong>: Creating descriptive metadata for image collections.</li>
        <li><strong>Accessibility</strong>: Assisting visually impaired individuals by describing visual content.</li>
    </ul>
    <h2>Importance</h2>
    <p>Image captioning is essential for bridging the gap between visual and textual information, enabling better interaction between machines and users by providing context and meaning to images.</p>
</div>
""", unsafe_allow_html=True)

# How to Use
st.markdown('<div class="sub-title">How to Use the Model</div>', unsafe_allow_html=True)
st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Load image data
imageDF = spark.read \\
    .format("image") \\
    .option("dropInvalid", True) \\
    .load("src/test/resources/image/")

# Define Image Assembler
imageAssembler = ImageAssembler() \\
    .setInputCol("image") \\
    .setOutputCol("image_assembler")

# Define VisionEncoderDecoder for image captioning
imageCaptioning = VisionEncoderDecoderForImageCaptioning \\
    .pretrained() \\
    .setBeamSize(2) \\
    .setDoSample(False) \\
    .setInputCols(["image_assembler"]) \\
    .setOutputCol("caption")

# Create pipeline
pipeline = Pipeline().setStages([imageAssembler, imageCaptioning])

# Apply pipeline to image data
pipelineDF = pipeline.fit(imageDF).transform(imageDF)

# Show results: keep only the file name from each image's origin path
pipelineDF \\
    .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "caption.result") \\
    .show(truncate = False)
''', language='python')
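
# Illustrative sketch, not part of Spark NLP and not used by this app: the
# Spark SQL expression reverse(split(image.origin, '/'))[0] in the snippet
# above keeps only the last path segment of each image's origin URI. The
# hypothetical helper below (the name image_name_from_origin is an assumption)
# mirrors that step in plain Python, which may help when checking the expected
# image names outside Spark.
def image_name_from_origin(origin: str) -> str:
    """Return the last '/'-separated segment of an image origin URI."""
    # Split on '/', reverse the parts, take the first one: the same order of
    # operations as reverse(split(...))[0] in the Spark SQL expression.
    return origin.split("/")[::-1][0]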

# Results
st.markdown('<div class="sub-title">Results</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <table class="benchmark-table">
        <tr>
            <th>Image Name</th>
            <th>Result</th>
        </tr>
        <tr>
            <td>palace.JPEG</td>
            <td>[a large room filled with furniture and a large window]</td>
        </tr>
        <tr>
            <td>egyptian_cat.jpeg</td>
            <td>[a cat laying on a couch next to another cat]</td>
        </tr>
        <tr>
            <td>hippopotamus.JPEG</td>
            <td>[a brown bear in a body of water]</td>
        </tr>
        <tr>
            <td>hen.JPEG</td>
            <td>[a flock of chickens standing next to each other]</td>
        </tr>
        <tr>
            <td>ostrich.JPEG</td>
            <td>[a large bird standing on top of a lush green field]</td>
        </tr>
        <tr>
            <td>junco.JPEG</td>
            <td>[a small bird standing on a wet ground]</td>
        </tr>
        <tr>
            <td>bluetick.jpg</td>
            <td>[a small dog standing on a wooden floor]</td>
        </tr>
        <tr>
            <td>chihuahua.jpg</td>
            <td>[a small brown dog wearing a blue sweater]</td>
        </tr>
        <tr>
            <td>tractor.JPEG</td>
            <td>[a man is standing in a field with a tractor]</td>
        </tr>
        <tr>
            <td>ox.JPEG</td>
            <td>[a large brown cow standing on top of a lush green field]</td>
        </tr>
    </table>
</div>
""", unsafe_allow_html=True)

# Model Information
st.markdown('<div class="sub-title">Model Information</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <table class="benchmark-table">
        <tr>
            <th>Attribute</th>
            <th>Description</th>
        </tr>
        <tr>
            <td><strong>Model Name</strong></td>
            <td>image_captioning_vit_gpt2</td>
        </tr>
        <tr>
            <td><strong>Compatibility</strong></td>
            <td>Spark NLP 5.1.2+</td>
        </tr>
        <tr>
            <td><strong>License</strong></td>
            <td>Open Source</td>
        </tr>
        <tr>
            <td><strong>Edition</strong></td>
            <td>Official</td>
        </tr>
        <tr>
            <td><strong>Input Labels</strong></td>
            <td>[image_assembler]</td>
        </tr>
        <tr>
            <td><strong>Output Labels</strong></td>
            <td>[caption]</td>
        </tr>
        <tr>
            <td><strong>Language</strong></td>
            <td>en</td>
        </tr>
        <tr>
            <td><strong>Size</strong></td>
            <td>890.3 MB</td>
        </tr>
    </table>
</div>
""", unsafe_allow_html=True)

# Data Source Section
st.markdown('<div class="sub-title">Data Source</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The image captioning model is available on <a class="link" href="https://huggingface.co/nlpconnect/vit-gpt2-image-captioning" target="_blank">Hugging Face</a>. This model uses ViT for image encoding and GPT2 for generating captions.</p>
</div>
""", unsafe_allow_html=True)

# Conclusion
st.markdown('<div class="sub-title">Conclusion</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The <strong>VisionEncoderDecoderModel</strong> represents a powerful approach for bridging the gap between visual and textual information. By leveraging pretrained models for both image encoding and text generation, it effectively captures the nuances of both domains, resulting in high-quality outputs such as detailed image captions and accurate text-based interpretations of visual content.</p>
</div>
""", unsafe_allow_html=True)

# References
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/2023/09/20/image_captioning_vit_gpt2_en.html" target="_blank" rel="noopener">Image Captioning Model on Spark NLP</a></li>
        <li><a class="link" href="https://huggingface.co/nlpconnect/vit-gpt2-image-captioning" target="_blank">Image Captioning Model on Hugging Face</a></li>
        <li><a class="link" href="https://arxiv.org/abs/2109.10282" target="_blank">TrOCR Paper</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)

# Community & Support
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
    </ul>
</div>
""", unsafe_allow_html=True)