Spaces: KrispyKarim/Climate_Anxiety_Detection

KrispyKarim committed on Apr 16

Commit 893c91b · verified · 1 Parent(s): f8b6138

Update pages/02_Initial_Topic_Modeling.py

Files changed (1)

pages/02_Initial_Topic_Modeling.py +123 -33

@@ -1,33 +1,123 @@
-import streamlit as st
-import streamlit.components.v1 as components
-st.title("Topic Modeling")
-def introduction():
-    st.title("Research & Methodology")
-    st.markdown("LDA as Baseline: "
-    "Describe the use of Latent Dirichlet Allocation as a baseline for comparison and understanding.")
-    st.markdown("Process Flow: Step-by-step breakdown of the analysis process, from data gathering to insights extraction.")
-    # Display the LDA visualization HTML file
-    components.html(open('lda_visualization.html', 'r').read(), height=800)
-def lda_page():
-    st.title("Insights & Findings of Latent Dirichlet Allocation (LDA) Model")
-    st.markdown("Priliminary Results: findings, notebooks, documentation")
-    st.markdown("Visualizations including pyLDAvis: ")
-    st.markdown("Key Trends: ")
-sidebar_pages = ["Introduction", "Latent Dirichlet Allocation"]
-def main():
-    st.sidebar.title("Navigation")
-    page = st.sidebar.selectbox("Select a page:", sidebar_pages)
-    if page == "Introduction":
-        introduction()
-    elif page == "Latent Dirichlet Allocation":
-        lda_page()
-if __name__ == "__main__":
-    main()

+import streamlit as st
+from PIL import Image
+st.title("Latent Dirichlet Allocation (LDA)")
+st.header("Research & Methodology", divider="blue")
+st.markdown("""
+### Purpose
+We used LDA as a baseline model to detect and analyze climate anxiety among youth on social media platforms like Reddit and Twitter (X).
+            As a well-established probabilistic topic modeling method, it is well suited to identifying high-level thematic structures
+            in large text corpora. The performance of our LDA model serves as a benchmark for the more advanced BERTopic model,
+            which lets us evaluate the added value of BERTopic’s contextual embeddings and clustering capabilities compared
+            with LDA’s traditional bag-of-words approach.
+
+---
+### Process Flow
+The analysis pipeline involved collecting and cleaning climate-related text data from Reddit and Twitter, followed by a preprocessing
+            phase that included normalization, lemmatization, and customized filtering to preserve topic-relevant terms. The cleaned data was
+            then transformed into a document-term matrix using bag-of-words techniques with frequency-based term filtering. Topic modeling was
+            performed using LDA, optimized through hyperparameter tuning, and evaluated via coherence and perplexity
+            metrics. Finally, insights were drawn by comparing topic coherence and granularity to better understand how climate anxiety is expressed
+            in different online communities.
+""")
+pdf_path = "documents/LDA Documentation.pdf"
+with open(pdf_path, "rb") as file:
+    st.download_button("Download documentation", file, file_name="LDA Documentation.pdf")
+st.header("Results", divider="green")
+st.markdown("""
+### Coherence and Perplexity Scores Analysis
+Coherence scores measure the semantic similarity between words within each topic. A higher coherence score (ranging from 0 to 1) suggests that the topics are more meaningful and interpretable.
+- Reddit model: Coherence score of 0.38
+  → Moderate interpretability, likely because of overlap between topics or less distinct themes.
+- Twitter model: Coherence score of 0.43
+  → Topics are more consistent, with clearer distinctions between them and easier interpretability.
+
+Perplexity measures how well the model predicts unseen words in the corpus. Lower perplexity generally indicates better generalization of topics.
+- Reddit model: Perplexity score of 2874.72
+  → The model struggles to generalize and is most likely overfitting.
+- Twitter model: Perplexity score of 1865.36
+  → Generalizes and captures patterns in the data better than the Reddit model, but still overfits.
+---
+### Why the Twitter Model Performs Better
+Several factors may contribute to the superior performance of the Twitter model:
+1. Twitter's character limit leads to more concise and direct language. This increases linguistic similarity within topics,
+            making them easier to cluster and interpret.
+2. Reddit posts and comments are longer and often multifaceted, blending several ideas into one post. This makes topic separation
+            more difficult and increases ambiguity.
+
+Overall, the Twitter model outperforms the Reddit model due to the focused and streamlined nature of tweets, which allows for clearer topic
+            boundaries and improved coherence and perplexity scores. Reddit's complex and multifaceted discourse poses challenges for
+            traditional topic modeling, resulting in more blended and less distinct topics.
+""")
+st.header("Insights into Climate Focus", divider="red")
+reddit_vis = Image.open("visualizations/lda_reddit_twd.png")
+twitter_vis = Image.open("visualizations/lda_twitter_twd.png")
+col1, col2 = st.columns(2)
+with col1:
+    st.image(reddit_vis, caption="Reddit Topic-Word Distribution")
+with col2:
+    st.image(twitter_vis, caption="Twitter Topic-Word Distribution")
+st.markdown("""By analyzing key terms and themes within each platform's topic-word distribution, we
+            can better understand the unique aspects each platform emphasizes in the broader climate discourse.
+
+---
+### Reddit's Focus
+Reddit’s discussions are generally analytical, scientific, and policy-driven, exhibiting the following trends:
+- Frequent references to climate models, temperature trends, and global warming projections.
+- Strong emphasis on government policies, sustainability initiatives, and legislative debates regarding climate regulations.
+- Active discussion around solar, wind, and other alternative energy innovations.
+- Coverage of grassroots environmental movements, activism tactics, and advocacy campaigns.
+- While largely pro-science, some threads engage with climate skepticism, often aiming to debunk misinformation.
+---
+### Twitter's Focus
+Twitter (or X) tends to be more event-driven, highlighting immediate climate developments and their social impacts:
+- High volume of posts about hurricanes, floods, wildfires, and other disasters linked to climate change.
+- Trending hashtags, viral campaigns, and calls to action dominate the climate narrative.
+- Tweets often focus on statements and actions by politicians, corporations, and influencers.
+- Strong emphasis on how climate change disproportionately affects marginalized communities.
+- Real-time responses to events take precedence over long-term scientific forecasts.
+
+Reddit and Twitter offer complementary perspectives on climate discourse. Reddit tends to focus on research-oriented,
+            long-form content, emphasizing data-driven science, legislative solutions, and long-term climate projections.
+            In contrast, Twitter (X) centers on real-time events and social reactions, often highlighting individual actions,
+            corporate accountability, and justice-driven activism. While Reddit fosters in-depth, technical discussions,
+            Twitter amplifies public awareness through dynamic, media-rich narratives. Together, these platforms provide
+            a multifaceted view of how climate anxiety and engagement manifest across different online communities.
+""")
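
Review note: the page text describes a bag-of-words document-term matrix with frequency-based term filtering and reports perplexity scores, but the commit does not show how either was computed. The sketch below is a minimal, library-free illustration of both ideas, not the project's pipeline. The function names, the `min_df` and `max_df_ratio` thresholds, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def build_doc_term_matrix(docs, min_df=2, max_df_ratio=0.9):
    """Bag-of-words counts with document-frequency filtering.

    docs: list of token lists (assumed already cleaned and lemmatized).
    min_df and max_df_ratio are illustrative thresholds, not the project's values.
    """
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    max_df = max_df_ratio * len(docs)
    # Drop terms that are too rare or too common to separate topics.
    vocab = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_df)
    index = {term: i for i, term in enumerate(vocab)}

    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in doc:
            if term in index:
                row[index[term]] += 1
        matrix.append(row)
    return vocab, matrix


def perplexity(token_log_probs):
    """exp of the negative mean per-token natural-log likelihood."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy corpus (hypothetical): "climate" occurs in 3 of 4 documents and is
# filtered out as too common; terms seen in only one document are too rare.
docs = [
    ["climate", "anxiety", "future"],
    ["climate", "policy", "energy"],
    ["climate", "anxiety", "wildfire"],
    ["energy", "policy", "solar"],
]
vocab, matrix = build_doc_term_matrix(docs, min_df=2, max_df_ratio=0.7)
```

For intuition on the perplexity numbers: a model that assigns probability 0.25 to every held-out token has a perplexity of 4, the same as guessing uniformly among four words. Perplexity also depends on vocabulary size and corpus, so scores from the Reddit and Twitter models are easiest to compare when both are computed with the same preprocessing and held-out protocol.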