greco committed
Commit dd8fcb1 · 1 Parent(s): e021580

update explanations

Files changed (1):
  1. app.py +51 -27
app.py CHANGED
@@ -66,12 +66,14 @@ def read_climate_change_results():
 sentiment_results, zero_shot_results = read_climate_change_results()
 
 
+
 # intro to app
 st.title('Survey Analytic Techniques')
 st.write('''
-Organisations collect lots of data everyday through surveys, to get feedback, understand user behaviour, track trends across time etc.
+Organisations collect lots of data every day through surveys, to get feedback, understand user behaviour, track trends across time etc.
 It can be resource intensive to craft a good survey and to get responders to fill in their answers, so we should make full use of the data obtained.
-Processing and analysing the data is tedious and time consuming but it doesn't have to be!
+
+Processing and analysing the data is tedious and time-consuming but it doesn't have to be!
 We can employ the help of machines to comb through the data and provide actionable insights.
 ''')
 
@@ -85,7 +87,7 @@ st.write('''
 - Factor Analysis - Clustering responders based on their answers
 - Topic Modelling - Uncovering topics from text responses
 - Zero-shot Classification - Classifying text responses into user-defined labels
-- Sentiment Analysis - Quantifying sentiment of responders text responses
+- Sentiment Analysis - Quantifying sentiment of responders' text responses
 ''')
 st.write('\n')
 st.write('\n')
@@ -94,20 +96,22 @@ st.markdown('''---''')
 st.header('Clustering Survey Responders')
 st.write('''
 Having knowledge and understanding about different groups of responders can help us to customise our interactions with them.
-E.g. Within the Financial Institutions we have banks, insurers, and payment services, and they have different structures and behaviours.
-We want to be able to cluster survey reponders into various groups based on how their answers.
-This can be achieved though **Factor Analysis**.
+E.g. Within Financial Institutions, we have banks, insurers, and payment services, and they have different structures and behaviours from one another.
+We want to be able to cluster survey responders into various groups based on their answers.
+This can be achieved through **Factor Analysis**.
 ''')
 st.write('\n')
 st.write('\n')
 
+
+
 # copy data
 df_factor_analysis = data_survey.copy()
 
 st.subheader('Sample Survey Data')
 st.write('''
 Here we have a sample survey dataset where responders answer questions about their personality traits on a scale from 1 (Very Inaccurate) to 6 (Very Accurate).
-Factor Analysis gives us \'factors\' or clusters of responders which provide us insights about the different personalities of the responders.
+Factor Analysis gives us \'factors\' or clusters of responders which provide us insights into the different personalities of the responders.
 ''')
 
 # split page into two columns
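As an illustration of the clustering step this hunk describes, here is a minimal factor-analysis sketch. It assumes the `factor_analyzer` package and synthetic stand-in data; the diff does not show which library `app.py` actually imports:

```python
# pip install factor_analyzer
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

# stand-in for the app's survey data: 100 responders x 10 questions on a 1-6 scale
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.integers(1, 7, size=(100, 10)),
                  columns=[f'q{i}' for i in range(1, 11)])

# fit a factor model; the loadings show how strongly each question maps onto each factor
fa = FactorAnalyzer(n_factors=3, rotation='varimax')
fa.fit(df)
loadings = pd.DataFrame(fa.loadings_, index=df.columns)
print(loadings.round(2))
```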
@@ -125,7 +129,7 @@ st.write('\n')
 st.subheader('Factor Analysis Suitability')
 st.write('''
 Before performing Factor Analysis on the data, we need to evaluate if it is suitable to do so.
-We apply two statistical tests (Bartlett's and KMO test) the data.
+We apply two statistical tests (Bartlett's and KMO test) to the data.
 These two tests check if the variables in the data are correlated with each other.
 If there isn't any correlation between the variables, then the data is unsuitable for factor analysis as there are no natural clusters.
 ''')
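The two suitability tests named here are available in the `factor_analyzer` package; a sketch, assuming that package and reusing the synthetic `df` from the previous snippet:

```python
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Bartlett's test of sphericity: a p-value below 0.05 suggests the variables are
# correlated enough (i.e. the correlation matrix is not an identity matrix)
chi_square, p_value = calculate_bartlett_sphericity(df)
print(f"Bartlett's: chi2={chi_square:.1f}, p={p_value:.4f}")

# KMO measures sampling adequacy; an overall value above ~0.6 is usually
# considered acceptable for factor analysis
_, kmo_overall = calculate_kmo(df)
print(f"Overall KMO: {kmo_overall:.2f}")
```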
@@ -246,7 +250,7 @@ fa_z_scores = fa_z_scores.groupby('cluster').mean().reset_index()
 fa_z_scores = fa_z_scores.apply(lambda x: round(x, 2))
 
 st.write('''
-Aggregating the scores of the clusters gives us detail insights to the personality traits of the responders.
+Aggregating the scores of the clusters gives us detailed insights into the personality traits of the responders.
 The scores here have been normalised to Z-scores, which is a measure of how many standard deviations (SD) the score is away from the mean.
 E.g. A Z-score of 0 indicates the score is identical to the mean, while a Z-score of 1 indicates the score is 1 SD away from the mean.
 ''')
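The Z-score normalisation and per-cluster aggregation described above amount to a couple of pandas operations; a minimal sketch with hypothetical trait columns:

```python
import pandas as pd

# hypothetical raw scores with a cluster label per responder
scores = pd.DataFrame({
    'cluster':       [0, 0, 1, 1],
    'extraversion':  [2, 3, 5, 6],
    'agreeableness': [5, 4, 2, 1],
})

# Z-score: how many standard deviations a value sits from the column mean
traits = ['extraversion', 'agreeableness']
scores[traits] = (scores[traits] - scores[traits].mean()) / scores[traits].std()

# mean Z-score per cluster, rounded as in the app
print(scores.groupby('cluster').mean().round(2))
```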
@@ -285,7 +289,7 @@ st.write('\n')
 st.header('Uncovering Topics from Text Responses')
 st.write('''
 With feedback forms or open-ended survey questions, we want to know what the responders are generally talking about.
-One way would be to manually read all the collected response to get a sense of the topics within, however, this is very manual and subjective.
+One way would be to manually read all the collected responses to get a sense of the topics within; however, this is very manual and subjective.
 Using **Topic Modelling**, we can programmatically extract common topics with the help of machine learning.
 ''')
 st.write('\n')
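The diff does not show which topic-modelling library the app loads, but the workflow it describes (embedding-aware topics, top words per topic, interactive charts) matches BERTopic; a minimal sketch under that assumption, with a hypothetical `tweets.csv`:

```python
# pip install bertopic
import pandas as pd
from bertopic import BERTopic

# hypothetical input file; BERTopic needs a reasonably large corpus to work well
docs = pd.read_csv('tweets.csv')['text'].tolist()

# embed the documents, cluster them, and extract representative words per topic
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())   # one row per discovered topic
print(topic_model.get_topic(0))       # top words for 'Topic 0' with their weights
```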
@@ -303,7 +307,7 @@ st.write('\n')
 
 st.subheader('Visualising Topics')
 st.write('''
-Lets generate some topics without performing any cleaning to the data.
+Let's generate some topics without performing any cleaning of the data.
 ''')
 
 # load and plot topics using unclean data
@@ -314,7 +318,7 @@ st.plotly_chart(fig, use_container_width=True)
 st.write('''
 From the chart above, we can see that 'Topic 0' and 'Topic 5' have some words that are not as meaningful.
 For 'Topic 0', we already know that the tweets are about the Tokyo 2020 Olympics, so having a topic for that isn't helpful.
-'Tokyo', '2020', 'Olympics', etc., we refer to these as *stopwords*, and lets remove them and regenerate the topics.
+We refer to words like 'Tokyo', '2020', and 'Olympics' as *stopwords*; let's remove them and regenerate the topics.
 ''')
 st.write('\n')
 
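One common way to drop corpus-specific stopwords before regenerating topics is to hand BERTopic a `CountVectorizer` with an extended stopword list (again assuming BERTopic; the vectorizer only shapes the topic-word representation, not the clustering):

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# extend the built-in English stopwords with words that appear in
# almost every tweet and carry no signal
custom_stopwords = list(ENGLISH_STOP_WORDS) + ['tokyo', '2020', 'olympics', 'olympic']
vectorizer = CountVectorizer(stop_words=custom_stopwords)

# refit so the top words per topic no longer include the stopwords
topic_model = BERTopic(vectorizer_model=vectorizer)
topics, probs = topic_model.fit_transform(docs)
```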
@@ -337,7 +341,7 @@ st.plotly_chart(fig, use_container_width=True)
 
 st.write('''
 Now we can see that the topics have improved.
-We can make use of the top words in each topic to come up with a meaningful name, this has to be done manually and is subjective.
+We can use the top words in each topic to come up with a meaningful name; this has to be done manually and is subjective.
 ''')
 st.write('\n')
 st.write('\n')
@@ -361,7 +365,7 @@ st.write('''
 The model has an understanding of the relationship between words, e.g. 'Andy Murray' is related to 'tennis'.
 For example:
 *'Cilic vs Menezes, after more than 3 hours and millions of unconverted match points, is one of the worst quality ten…'*
-This tweet is in the Topic 9 - Tennis without the word 'tennis' in it.
+This tweet is in Topic 9 - Tennis even though the word 'tennis' does not appear in it.
 
 Here we can inspect the individual tweets within each topic.
 ''')
@@ -385,7 +389,6 @@ st.write(f'''
 st.dataframe(topic_results.loc[(topic_results['Topic'] == inspect_topic)])
 st.markdown('''---''')
 st.write('\n')
-st.write('\n')
 
 
 
@@ -393,8 +396,8 @@ st.write('\n')
 
 st.header('Classifying Text Responses and Sentiment Analysis')
 st.write(f'''
-With survey responses, sometimes as a business user, we already have an general idea of what responders are talking about and we want to categorise or classify the responses accordingly.
-An an example, within the topic of 'Climate Change', we are interested in finance, politics, technology, and wildlife.
+With survey responses, sometimes as a business user, we already have a general idea of what responders are talking about and we want to categorise or classify the responses accordingly.
+As an example, within the topic of 'Climate Change', we are interested in finance, politics, technology, and wildlife.
 Using **Zero-shot Classification**, we can classify responses into one of these four categories.
 As an added bonus, we can also find out how responders feel about the categories using **Sentiment Analysis**.
 ''')
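Zero-shot classification like this is commonly done with the Hugging Face `transformers` pipeline; a minimal sketch (the checkpoint named below is the usual default for the task, not confirmed by the diff):

```python
# pip install transformers torch
from transformers import pipeline

classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

result = classifier(
    'Rising sea levels are destroying nesting sites for sea turtles.',
    candidate_labels=['finance', 'politics', 'technology', 'wildlife'],
)
print(result['labels'][0], round(result['scores'][0], 2))  # most likely category
```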
@@ -495,7 +498,7 @@ st.write(f'''
 Main category score ranges from 0 to 1, with 1 being very likely.
 
 The full set of scores are: {dict(zip(zero_shot_sample['labels'], [round(score, 2) for score in zero_shot_sample['scores']]))}
-Full set of scores cores add up to 1.
+The full set of scores adds up to 1.
 
 The sentiment is: {emoji[sentiment_label]} **{sentiment_label}** with a score of {round(sentiment_sample, 2)}
 Sentiment score ranges from 0 to 1, with 1 being very positive.
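The label-to-score dictionary rendered above can be built straight from the pipeline output; with the pipeline's default `multi_label=False` the scores are normalised across the candidate labels, which is why they add up to 1 (a sketch continuing the earlier snippet):

```python
# map each candidate label to its rounded score, as the app displays it
scores = dict(zip(result['labels'], [round(s, 2) for s in result['scores']]))
print(scores)                  # e.g. {'wildlife': 0.83, 'politics': 0.09, ...}
print(sum(result['scores']))   # ~1.0 with the default multi_label=False
```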
@@ -509,14 +512,14 @@ zero_shot_results = zero_shot_results.rename(columns={'sequence':'tweet', 'label
 
 st.subheader('Zero-Shot Classification and Sentiment Analysis Results')
 st.write(f'''
-Lets review all the tweets and how they fall into the categories of finance, politics, technology, and wildlife.
+Let's review all the tweets and how they fall into the categories of finance, politics, technology, and wildlife.
 ''')
 
 st.dataframe(zero_shot_results.style.format(precision=2))
 
 st.write(f'''
 We can observe that the model does not have strong confidence in predicting the categories for some of the tweets.
-It is likely that the tweet does not natually fall into one of the defined categories.
+It is likely that these tweets do not naturally fall into any of the defined categories.
 Before performing further analysis on our results, we can set a score threshold to only keep predictions that we're confident in.
 ''')
 st.write('\n')
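Applying the confidence threshold is a one-line pandas filter; a sketch with a hypothetical results frame shaped like the app's:

```python
import pandas as pd

# hypothetical results mirroring the app's columns
zero_shot_results = pd.DataFrame({
    'tweet': ['tweet about carbon taxes', 'tweet about a nice sunset'],
    'category': ['politics', 'wildlife'],
    'score': [0.91, 0.41],
})

# keep only predictions the model is reasonably confident about
user_threshold = 0.7
zero_shot_results_clean = zero_shot_results.loc[zero_shot_results['score'] >= user_threshold]
```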
@@ -535,7 +538,7 @@ zero_shot_results_clean = zero_shot_results.loc[(zero_shot_results['score'] >= u
 sentiment_results.columns = ['tweet', 'sentiment']
 
 st.write(f'''
-The predictions get better with a higher threshold, but reduces the final number of tweets available for further analysis.
+The predictions get better with a higher threshold, but this reduces the final number of tweets available for further analysis.
 Out of the {len(sentiment_results):,} tweets, we are now left with {len(zero_shot_results_clean)}.
 We also add on the sentiment score for the tweets; the score here ranges from 0 (most negative) to 1 (most positive).
 ''')
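Sentiment scoring like this is typically another `transformers` pipeline; a sketch (the diff does not name the checkpoint the app loads, so the default model and the 0-to-1 mapping below are assumptions):

```python
from transformers import pipeline

sentiment = pipeline('sentiment-analysis')

out = sentiment('Renewable energy investment is finally paying off!')[0]
print(out)  # e.g. {'label': 'POSITIVE', 'score': 0.99}

# one way to fold label and confidence into a single 0 (negative) to 1 (positive) scale
signed = out['score'] if out['label'] == 'POSITIVE' else 1 - out['score']
print(round(signed, 2))
```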
@@ -548,11 +551,17 @@ classification_sentiment_df = classification_sentiment_df[['tweet', 'category',
 st.dataframe(classification_sentiment_df.style.format(precision=2))
 
 st.write(f'''
-The difficult part for zero-shot classification is defining the right set of categories for each business case.
-Some trial and error is required to find the appropriate words that can return the optimal results.
+The difficult part of zero-shot classification is defining the right set of categories for each business case.
+Some trial and error is required to find the appropriate words that can return the optimal results.
+E.g. Do we want to differentiate between 'plants' and 'animals', or is 'wildlife' better as an overall category?
 ''')
-st.write('\n')
 
+st.write(f'''
+With sentiment analysis, the model typically has pitfalls such as not being able to detect sarcasm well.
+However, sarcastic responses are typically outliers in survey data, and their effect would be smoothed out when we average the sentiment scores.
+''')
+
+st.write('\n')
 # group by category, count tweets and get mean of sentiment
 classification_sentiment_agg = classification_sentiment_df.groupby(['category']).agg({'tweet':'count', 'sentiment':'mean'}).reset_index()
 classification_sentiment_agg = classification_sentiment_agg.rename(columns={'tweet':'count'})
@@ -587,13 +596,28 @@ fig.update_yaxes(range=[0, 1])
 fig.add_hline(y=0.5, line_width=3, line_color='darkgreen')
 st.plotly_chart(fig, use_container_width=True)
 
+st.write('''
+To improve the performance of the models, further fine-tuning can be done.
+We would also need labelled data to test against, which is usually not readily available and can be difficult and expensive to obtain.
+
+If you're just thinking of exploring the feasibility of applying text analysis to your dataset, the pre-trained models used in this app will be perfect!
+We've leveraged state-of-the-art deep learning models to jumpstart our analytics capabilities.
+The base models used for sentiment analysis and zero-shot classification are called BERT (developed by Google) and BART (developed by Facebook) respectively.
+
+These language models require large amounts of data and resources to be trained.
+BERT was trained on the whole of Wikipedia (about 2.5 billion words) and 11 thousand books, while BART was trained on the same plus 63 million news articles and other text scraped from the internet.
+An example of a fine-tuned model is FinBERT, which builds on top of BERT and is further trained on financial news to analyse the sentiment of finance-related text.
+''')
+
 st.markdown('''---''')
 st.write('\n')
 st.write('\n')
 
 st.write('''
-That's the end of the this demo 😎, the source code can be found on [Github](https://github.com/Greco1899/survey_analytics).
+That's the end of this demo 😎, the source code can be found on [Github](https://github.com/Greco1899/survey_analytics).
 ''')
 st.write('\n')
-st.image('https://images.unsplash.com/photo-1620712943543-bcc4688e7485?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=2565&q=80')
-st.caption('Photo by [Andrea De Santis](https://unsplash.com/@santesson89) on [Unsplash](https://unsplash.com).')
+st.image('https://images.unsplash.com/photo-1620712943543-bcc4688e7485')
+st.caption('Photo by [Andrea De Santis](https://unsplash.com/@santesson89) on [Unsplash](https://unsplash.com).')
+
+
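Swapping in a fine-tuned checkpoint like the FinBERT mentioned in this hunk is a one-line change with `transformers`; `ProsusAI/finbert` is one public FinBERT checkpoint (an assumption for illustration, as the app itself does not load it):

```python
from transformers import pipeline

# ProsusAI/finbert is fine-tuned on financial text and returns
# positive / negative / neutral labels instead of the generic pair
finbert = pipeline('sentiment-analysis', model='ProsusAI/finbert')
print(finbert('Company profits fell 20% after the product recall.'))
```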
 
 
 