KrispyKarim committed on
Commit 893c91b · verified · 1 Parent(s): f8b6138

Update pages/02_Initial_Topic_Modeling.py

Files changed (1)
  1. pages/02_Initial_Topic_Modeling.py +123 -33
pages/02_Initial_Topic_Modeling.py CHANGED
@@ -1,33 +1,123 @@
- import streamlit as st
- import streamlit.components.v1 as components
-
- st.title("Topic Modeling")
-
- def introduction():
-     st.title("Research & Methodology")
-     st.markdown("LDA as Baseline: "
-                 "Describe the use of Latent Dirichlet Allocation as a baseline for comparison and understanding.")
-     st.markdown("Process Flow: Step-by-step breakdown of the analysis process, from data gathering to insights extraction.")
-
-     # Display the LDA visualization HTML file
-     components.html(open('lda_visualization.html', 'r').read(), height=800)
-
-
- def lda_page():
-     st.title("Insights & Findings of Latent Dirichlet Allocation (LDA) Model")
-     st.markdown("Priliminary Results: findings, notebooks, documentation")
-     st.markdown("Visualizations including pyLDAvis: ")
-     st.markdown("Key Trends: ")
-
- sidebar_pages = ["Introduction", "Latent Dirichlet Allocation"]
- def main():
-     st.sidebar.title("Navigation")
-     page = st.sidebar.selectbox("Select a page:", sidebar_pages)
-
-     if page == "Introduction":
-         introduction()
-     elif page == "Latent Dirichlet Allocation":
-         lda_page()
-
- if __name__ == "__main__":
-     main()
+ import streamlit as st
+ from PIL import Image
+
+ st.title("Latent Dirichlet Allocation (LDA)")
+
+ st.header("Research & Methodology", divider="blue")
+
+ st.markdown("""
+ ### Purpose
+
+ We used LDA as a baseline model to detect and analyze climate anxiety among youth on social media platforms like Reddit and Twitter (X).
+ As a well-established probabilistic topic modeling method, LDA is well suited to identifying high-level thematic structures
+ in large text corpora. The performance of our LDA model will serve as a benchmark for the more advanced BERTopic model.
+ This will allow us to evaluate the added value of BERTopic’s contextual embeddings and clustering capabilities compared
+ to LDA’s traditional bag-of-words approach.
+
+ ---
+
+ ### Process Flow
+
+ The analysis pipeline involved collecting and cleaning climate-related text data from Reddit and Twitter, followed by a preprocessing
+ phase that included normalization, lemmatization, and customized filtering to preserve topic-relevant terms. The cleaned data was
+ then transformed into a document-term matrix using bag-of-words techniques with frequency-based term filtering. Topic modeling was
+ performed using LDA, optimized through hyperparameter tuning, and evaluated via coherence and perplexity
+ metrics. Finally, insights were drawn by comparing topic coherence and granularity to better understand how climate anxiety is expressed
+ in different online communities.
+
+ """)
+
+ pdf_path = "documents/LDA Documentation.pdf"
+ with open(pdf_path, "rb") as file:
+     st.download_button("Download documentation", file, file_name="LDA Documentation.pdf")
+
+
+
+ st.header("Results", divider="green")
+
+ st.markdown("""
+ ### Coherence and Perplexity Scores Analysis
+
+ Coherence scores measure the semantic similarity between words within each topic. A higher coherence score (ranging from 0 to 1) suggests that the topics are more meaningful and interpretable.
+
+ - Reddit model: Coherence score of 0.38
+ → Moderate interpretability, likely due to overlap between topics or less distinct themes.
+
+ - Twitter model: Coherence score of 0.43
+ → Topics are more consistent, with clearer distinctions between them and easier interpretability.
+
+ Perplexity measures how well the model predicts unseen words in the corpus. Lower perplexity generally indicates better generalization of topics.
+
+ - Reddit model: Perplexity score of 2874.72
+ → The model struggles to generalize and is most likely overfitting.
+
+ - Twitter model: Perplexity score of 1865.36
+ → Generalizes and captures patterns in the data better than the Reddit model, but still overfits.
+ ---
+
+ ### Why the Twitter Model Performs Better
+
+ Several factors may contribute to the superior performance of the Twitter model:
+
+ 1. Twitter's character limit leads to more concise and direct language. This increases linguistic similarity within topics,
+ making them easier to cluster and interpret.
+
+ 2. Reddit posts and comments are longer and often multifaceted, blending several ideas into one post. This makes topic separation
+ more difficult and increases ambiguity.
+
+
+ Overall, the Twitter model outperforms the Reddit model due to the focused and streamlined nature of tweets, which allows for clearer topic
+ boundaries and improved coherence and perplexity scores. Reddit's complex and multifaceted discourse poses challenges for
+ traditional topic modeling, resulting in more blended and less distinct topics.
+ """)
+
+
+
+ st.header("Insights into Climate Focus", divider="red")
+
+ reddit_vis = Image.open("visualizations/lda_reddit_twd.png")
+
+
+ twitter_vis = Image.open("visualizations/lda_twitter_twd.png")
+
+ col1, col2 = st.columns(2)
+ with col1:
+     st.image(reddit_vis, caption="Reddit Topic-Word Distribution")
+
+ with col2:
+     st.image(twitter_vis, caption="Twitter Topic-Word Distribution")
+
+ st.markdown("""By analyzing key terms and themes within each platform's topic-word distribution, we
+ can better understand the unique aspects each platform emphasizes in the broader climate discourse.
+
+ ---
+
+ ### Reddit's Focus
+
+ Reddit’s discussions are generally analytical, scientific, and policy-driven, exhibiting the following trends:
+
+ - Frequent references to climate models, temperature trends, and global warming projections.
+ - Strong emphasis on government policies, sustainability initiatives, and legislative debates regarding climate regulations.
+ - Active discussion around solar, wind, and other alternative energy innovations.
+ - Coverage of grassroots environmental movements, activism tactics, and advocacy campaigns.
+ - While largely pro-science, some threads engage with climate skepticism, often aiming to debunk misinformation.
+
+ ---
+
+ ### Twitter's Focus
+
+ Twitter (X) tends to be more event-driven, highlighting immediate climate developments and their social impacts:
+
+ - High volume of posts about hurricanes, floods, wildfires, and other disasters linked to climate change.
+ - Trending hashtags, viral campaigns, and calls to action dominate the climate narrative.
+ - Tweets often focus on statements and actions by politicians, corporations, and influencers.
+ - Strong emphasis on how climate change disproportionately affects marginalized communities.
+ - Real-time responses to events take precedence over long-term scientific forecasts.
+
+ Reddit and Twitter offer complementary perspectives on climate discourse. Reddit tends to focus on research-oriented,
+ long-form content, emphasizing data-driven science, legislative solutions, and long-term climate projections.
+ In contrast, Twitter (X) centers on real-time events and social reactions, often highlighting individual actions,
+ corporate accountability, and justice-driven activism. While Reddit fosters in-depth, technical discussions,
+ Twitter amplifies public awareness through dynamic, media-rich narratives. Together, these platforms provide
+ a multifaceted view of how climate anxiety and engagement manifest across different online communities.
+ """)
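
Note on the "Process Flow" and "Results" text added in this commit: the page describes a bag-of-words document-term matrix with frequency-based term filtering, and a perplexity metric, without showing either. The sketch below is only an illustration of those two ideas, not the project's pipeline. The function names `build_doc_term_matrix` and `perplexity`, the thresholds `min_df` and `max_df_ratio`, and the whitespace tokenizer (standing in for the normalization and lemmatization the page mentions) are all assumptions made for this example.

```python
import math
from collections import Counter


def build_doc_term_matrix(docs, min_df=2, max_df_ratio=0.9):
    """Bag-of-words counts with document-frequency filtering.

    min_df and max_df_ratio are hypothetical thresholds chosen for
    illustration; they are not the values used in the project.
    """
    # Naive whitespace tokenization (assumption: real preprocessing is richer).
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Keep terms that are neither too rare nor too common.
    max_df = max_df_ratio * len(docs)
    vocab = sorted(t for t, n in doc_freq.items() if min_df <= n <= max_df)
    index = {term: i for i, term in enumerate(vocab)}

    # One row per document, one column per retained term.
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term in tokens:
            if term in index:
                row[index[term]] += 1
        matrix.append(row)
    return vocab, matrix


def perplexity(token_probs):
    """exp of the negative mean log-probability assigned to held-out tokens."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


docs = [
    "climate change is real",
    "climate anxiety is rising",
    "wildfire smoke again",
]
vocab, matrix = build_doc_term_matrix(docs)
print(vocab)   # terms that appear in at least two documents
print(matrix)  # per-document counts for those terms
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

For scale when reading the reported scores: a model that assigns a uniform probability of 1/V to every token has a perplexity of V, so lower values mean the model concentrates probability on the words that actually occur.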