import streamlit as st
import pandas as pd

# Custom CSS for better styling
st.markdown("""

    <style>

        .main-title {

            font-size: 36px;

            color: #4A90E2;

            font-weight: bold;

            text-align: center;

        }

        .sub-title {

            font-size: 24px;

            color: #4A90E2;

            margin-top: 20px;

        }

        .section {

            background-color: #f9f9f9;

            padding: 15px;

            border-radius: 10px;

            margin-top: 20px;

        }

        .section p, .section ul {

            color: #666666;

        }

        .link {

            color: #4A90E2;

            text-decoration: none;

        }

        h2 {

            color: #4A90E2;

            font-size: 28px;

            font-weight: bold;

            margin-top: 30px;

        }

        h3 {

            color: #4A90E2;

            font-size: 22px;

            font-weight: bold;

            margin-top: 20px;

        }

        h4 {

            color: #4A90E2;

            font-size: 18px;

            font-weight: bold;

            margin-top: 15px;

        }

    </style>

""", unsafe_allow_html=True)

# Main Title
st.markdown('<div class="main-title">Question Answering Over Tables with TAPAS and Spark NLP</div>', unsafe_allow_html=True)

# Overview Section
st.markdown("""

<div class="section">

    <p>As data becomes increasingly complex, extracting meaningful insights from tabular data is more important than ever. TAPAS, a transformer-based model developed by Google, is designed specifically to handle question-answering over tables. By combining TAPAS with Spark NLP, we can leverage the power of distributed computing to process large datasets efficiently.</p>

    <p>This guide will walk you through the process of setting up TAPAS in Spark NLP, implementing two specific models (<code>table_qa_tapas_base_finetuned_wtq</code> and <code>table_qa_tapas_base_finetuned_sqa</code>), and understanding their best use cases.</p>

</div>

""", unsafe_allow_html=True)

# Introduction to TAPAS and Spark NLP
st.markdown('<div class="sub-title">Introduction to TAPAS and Spark NLP</div>', unsafe_allow_html=True)

# What is TAPAS?
st.markdown("""

<div class="section">

    <h3>What is TAPAS?</h3>

    <p>TAPAS (Table Parsing Supervised via Pre-trained Language Models) is a model that extends the BERT architecture to handle tabular data. Unlike traditional models that require flattening tables into text, TAPAS can directly interpret tables, making it a powerful tool for answering questions that involve tabular data.</p>

</div>

""", unsafe_allow_html=True)

# Why Use TAPAS with Spark NLP?
st.markdown("""

<div class="section">

    <h3>Why Use TAPAS with Spark NLP?</h3>

    <p>Spark NLP, developed by John Snow Labs, is an open-source library that provides state-of-the-art natural language processing capabilities within a distributed computing framework. Integrating TAPAS with Spark NLP allows you to scale your question-answering tasks across large datasets, making it ideal for big data environments.</p>

</div>

""", unsafe_allow_html=True)

# Pipeline and Results
st.markdown('<div class="sub-title">Pipeline and Results</div>', unsafe_allow_html=True)

st.markdown("""

<div class="section">

    <p>In this section, we’ll build a pipeline using Spark NLP to process a table and answer questions about the data it contains. We will utilize two different TAPAS models, each suited for different types of queries.</p>

</div>

""", unsafe_allow_html=True)

# Step 1: Creating the Data
st.markdown("""

<div class="section">

    <h4>Step 1: Creating the Data</h4>

    <p>We'll start by creating a Spark DataFrame that includes a table in JSON format and a set of questions.</p>

""", unsafe_allow_html=True)

st.code("""

json_data = '''

{

  "header": ["name", "money", "age"],

  "rows": [

    ["Donald Trump", "$100,000,000", "75"],

    ["Elon Musk", "$20,000,000,000,000", "55"]

  ]

}

'''



queries = [

    "Who earns less than 200,000,000?",

    "Who earns 100,000,000?",

    "How much money has Donald Trump?",

    "How old are they?",

    "How much money have they total?",

    "Who earns more than Donald Trump?"

]



data = spark.createDataFrame([[json_data, " ".join(queries)]])\\

    .toDF("table_json", "questions")

""", language="python")

# Step 2: Assembling the Pipeline
st.markdown("""

<div class="section">

    <h4>Step 2: Assembling the Pipeline</h4>

    <p>We will now set up a Spark NLP pipeline that includes the necessary annotators for processing the table and questions.</p>

""", unsafe_allow_html=True)

st.code("""

from sparknlp.annotator import TapasForQuestionAnswering, SentenceDetector

from sparknlp.base import MultiDocumentAssembler, TableAssembler

from pyspark.ml import Pipeline

from pyspark.sql import functions as F



# Step 1: Transforms raw texts to `document` annotation

document_assembler = MultiDocumentAssembler() \\

    .setInputCols("table_json", "questions") \\

    .setOutputCols("document_table", "document_questions")



# Step 2: Getting the sentences

sentence_detector = SentenceDetector() \\

    .setInputCols(["document_questions"]) \\

    .setOutputCol("questions")



# Step 3: Get the tables

table_assembler = TableAssembler()\\

    .setInputCols(["document_table"])\\

    .setOutputCol("table")



# WTQ TAPAS model

tapas_wtq = TapasForQuestionAnswering\\

    .pretrained("table_qa_tapas_base_finetuned_wtq", "en")\\

    .setInputCols(["questions", "table"])\\

    .setOutputCol("answers_wtq")



# SQA TAPAS model

tapas_sqa = TapasForQuestionAnswering\\

    .pretrained("table_qa_tapas_base_finetuned_sqa", "en")\\

    .setInputCols(["questions", "table"])\\

    .setOutputCol("answers_sqa")



# Define pipeline

pipeline = Pipeline(stages=[

    document_assembler,

    sentence_detector,

    table_assembler,

    tapas_wtq,

    tapas_sqa

])



# Fit and transform data

model = pipeline.fit(data)

result = model.transform(data)

""", language="python")

# Step 3: Viewing the Results
st.markdown("""

<div class="section">

    <h4>Step 3: Viewing the Results</h4>

    <p>After processing, we can explore the results generated by each model:</p>

""", unsafe_allow_html=True)

st.code("""

# WTQ Model Results:

result.select(F.explode(result.answers_wtq)).show(truncate=False)

""", language="python")

st.text("""

+--------------------------------------+

|col                                   |

+--------------------------------------+

|Donald Trump                          |

|Donald Trump                          |

|SUM($100,000,000)                     |

|AVERAGE(75, 55)                       |

|SUM($100,000,000, $20,000,000,000,000)|

|Elon Musk                             |

+--------------------------------------+

""")

st.code("""

# SQA Model Results:

result.select(F.explode(result.answers_sqa)).show(truncate=False)

""", language="python")

st.text("""

+---------------------------------+

|col                              |

+---------------------------------+

|Donald Trump                     |

|Donald Trump                     |

|$100,000,000                     |

|75, 55                           |

|$100,000,000, $20,000,000,000,000|

|Elon Musk                        |

+---------------------------------+

""")

# Comparing Results
st.markdown("""

<div class="section">

    <h4>Comparing Results</h4>

    <p>To better understand the differences, we can compare the results from both models side by side:</p>

""", unsafe_allow_html=True)

st.code("""

result.select(F.explode(F.arrays_zip(result.questions.result, 

                                     result.answers_sqa.result, 

                                     result.answers_wtq.result)).alias("cols"))\\

      .select(F.expr("cols['0']").alias("question"), 

              F.expr("cols['1']").alias("answer_sqa"),

              F.expr("cols['2']").alias("answer_wtq")).show(truncate=False)

""", language="python")

st.text("""

+---------------------------------+---------------------------------+--------------------------------------+

|question                         |answer_sqa                       |answer_wtq                            |

+---------------------------------+---------------------------------+--------------------------------------+

|Who earns less than 200,000,000? |Donald Trump                     |Donald Trump                          |

|Who earns 100,000,000?           |Donald Trump                     |Donald Trump                          |

|How much money has Donald Trump? |$100,000,000                     |SUM($100,000,000)                     |

|How old are they?                |75, 55                           |AVERAGE(75, 55)                       |

|How much money have they total?  |$100,000,000, $20,000,000,000,000|SUM($100,000,000, $20,000,000,000,000)|

|Who earns more than Donald Trump?|Elon Musk                        |Elon Musk                             |

+---------------------------------+---------------------------------+--------------------------------------+

""")

# One-Liner Alternative
st.markdown("""

<div class="section">

    <h4>One-Liner Alternative</h4>

    <p>For those who prefer a simpler approach, John Snow Labs offers a one-liner API to quickly get answers using TAPAS models.</p>

""", unsafe_allow_html=True)

st.code("""

#Downliad the johnsnowlabs library

pip install johnsnowlabs

""", language="bash")

st.code("""

import pandas as pd

from johnsnowlabs import nlp



# Create the context DataFrame

context_df = pd.DataFrame({

    'name': ['Donald Trump', 'Elon Musk'], 

    'money': ['$100,000,000', '$20,000,000,000,000'], 

    'age': ['75', '55']

})



# Define the questions

questions = [

    "Who earns less than 200,000,000?",

    "Who earns 100,000,000?",

    "How much money has Donald Trump?",

    "How old are they?",

    "How much money have they total?",

    "Who earns more than Donald Trump?"

]



# Combine context and questions into a tuple

tapas_data = (context_df, questions)



# Use the one-liner API with the WTQ model

answers_wtq = nlp.load('en.answer_question.tapas.wtq.large_finetuned').predict(tapas_data)

answers_wtq[['sentence', 'tapas_qa_UNIQUE_answer']]

""", language="python")

# One-liner results, defined as a dictionary of columns for display
one_liner_results = {
    "sentence": [
        "Who earns less than 200,000,000?",
        "Who earns 100,000,000?",
        "How much money has Donald Trump?",
        "How old are they?",
        "How much money have they total? Who earns more..."
    ],
    "tapas_qa_UNIQUE_answer": [
        "Donald Trump",
        "Donald Trump",
        "SUM($100,000,000)",
        "SUM(55)",
        "SUM($20,000,000,000,000)"
    ]
}
st.dataframe(pd.DataFrame(one_liner_results))

# Model Information and Use Cases
st.markdown("""

<div class="section">

    <h4>Model Information and Use Cases</h4>

    <p>Understanding the strengths of each TAPAS model can help you choose the right tool for your task.</p>

    <ul>

        <li><b>table_qa_tapas_base_finetuned_wtq</b></li>

        <ul>

            <li>Best for: answering questions involving table-wide aggregation (e.g., sums, averages).</li>

        </ul>

        <li><b>table_qa_tapas_base_finetuned_sqa</b></li>

        <ul>

            <li>Best for: answering questions in a sequential question-answering context, where the current question depends on previous answers.</li>

        </ul>

    </ul>

</div>

""", unsafe_allow_html=True)

# Conclusion
st.markdown("""

<div class="section">

    <h4>Conclusion</h4>

    <p>TAPAS, integrated with Spark NLP, provides a powerful solution for question-answering over tables, capable of handling both complex aggregation queries and straightforward Q&A tasks. Whether you're working with large datasets or simple tables, TAPAS offers flexibility and scalability. The <code>table_qa_tapas_base_finetuned_wtq</code> model excels in aggregation tasks, while <code>table_qa_tapas_base_finetuned_sqa</code> is best for direct, sequential question-answering.</p>

    <p>By following this guide, you can efficiently implement TAPAS in your own projects, leveraging Spark NLP's powerful processing capabilities to extract insights from your data.</p>

</div>

""", unsafe_allow_html=True)

# References
st.markdown("""

<div class="section">

    <h4>References</h4>

    <ul>

        <li>Documentation : <a class="link" href="https://nlp.johnsnowlabs.com/docs/en/annotators#multidocumentassembler" target="_blank" rel="noopener">MultiDocumentAssembler</a>, <a class="link" href="https://nlp.johnsnowlabs.com/docs/en/annotators#TapasForQuestionAnswering">TapasForQuestionAnswering</a></li>

        <li>Python Doc : <a class="link" href="https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/multi_document_assembler/index.html#sparknlp.base.multi_document_assembler.MultiDocumentAssembler.setIdCol" target="_blank" rel="noopener">MultiDocumentAssembler</a>, <a class="link" href="https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/classifier_dl/tapas_for_question_answering/index.html" target="_blank" rel="noopener">TapasForQuestionAnswering</a></li>

        <li>Scala Doc : <a class="link" href="https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/MultiDocumentAssembler.html" target="_blank" rel="noopener">MultiDocumentAssembler</a>, <a class="link" href="https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/classifier/dl/TapasForQuestionAnswering.html">TapasForQuestionAnswering</a></li>

        <li>Models Used : <a class="link" href="https://sparknlp.org/2022/09/30/table_qa_tapas_base_finetuned_wtq_en.html" target="_blank" rel="noopener">table_qa_tapas_base_finetuned_wtq</a>, <a class="link" href="https://sparknlp.org/2022/09/30/table_qa_tapas_base_finetuned_sqa_en.html">table_qa_tapas_base_finetuned_sqa</a></li>

        <li>For extended examples of usage, see the notebooks for <a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/document-assembler/Loading_Multiple_Documents.ipynb" target="_blank" rel="noopener">MultiDocumentAssembler</a>, <a class="link" href="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/15.1_Table_Question_Answering.ipynb" target="_blank" rel="noopener">TapasForQuestionAnswering</a>.</li>

        <li><a href="https://arxiv.org/abs/2004.02349" class="link" target="_blank">TAPAS: Weakly Supervised Table Parsing via Pre-trained Language Models</a></li>

        <li><a href="https://nlp.johnsnowlabs.com/" class="link" target="_blank">Spark NLP Documentation</a></li>

        <li><a href="https://nlp.johnsnowlabs.com/models" class="link" target="_blank">John Snow Labs Models Hub</a></li>

    </ul>

</div>

""", unsafe_allow_html=True)

# Community & Support
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""

<div class="section">

    <ul>

        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>

        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>

        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>

        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>

        <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>

    </ul>

</div>

""", unsafe_allow_html=True)