raymondEDS committed
Commit: 31653a7
1 Parent(s): 46e47b6

week 8 writing
Files changed:
- Reference files/Copy_Lab_5_hands_on_peer_review.ipynb +0 -0
- Reference files/Data_cleaning_lab.ipynb +0 -0
- Reference files/W8 - Curriculum Content.md +0 -0
- Reference files/W8 - Learning Objectives on Writing Paper.md +45 -0
- Reference files/Week_4_content.txt +0 -630
- Reference files/w6_logistic_regression_lab.py +0 -400
- Reference files/week 7/W7_Lab_KNN_clustering.ipynb +0 -481
- Reference files/week 7/Week7_Clustering Curriculum.docx +0 -0
- Reference files/week 7/Week7_Clustering Learning Objectives.docx +0 -0
- Reference files/week 7/w7_curriculum +0 -178
- app/__pycache__/main.cpython-311.pyc +0 -0
- app/main.py +4 -1
- app/pages/__pycache__/week_8.cpython-311.pyc +0 -0
- app/pages/week_8.py +564 -0
Reference files/Copy_Lab_5_hands_on_peer_review.ipynb
DELETED (diff too large to render)

Reference files/Data_cleaning_lab.ipynb
DELETED (diff too large to render)

Reference files/W8 - Curriculum Content.md
ADDED (diff too large to render)
Reference files/W8 - Learning Objectives on Writing Paper.md
ADDED
@@ -0,0 +1,45 @@
## **Remember (Knowledge)**

Students will be able to:

* Recall LaTeX syntax for document structure, figures, citations, and spacing
* Identify components of ML research papers (introduction, methods, results, conclusion, limitations)
* Recognize standard formatting requirements for academic conferences and journals

## **Understand (Comprehension)**

Students will be able to:

* Describe the purpose and audience for each section of a research paper

## **Apply (Application)**

Students will be able to:

* Format complete research papers in LaTeX with proper figures, tables, and citations
* Write clear methodology sections with sufficient detail for reproducibility
* Present experimental results using appropriate visualizations and statistical analysis

## **Analyze (Analysis)**

Students will be able to:

* Diagnose LaTeX formatting issues and resolve compilation errors (if applicable)
* Examine related work to identify research gaps and position their contributions
* Compare their methodology approaches with existing methods

## **Evaluate (Evaluation)**

Students will be able to:

* Critically assess the validity and reliability of their experimental design
* Evaluate the clarity and persuasiveness of their written arguments

## **Create (Synthesis)**

Students will be able to:

* Produce research papers
* Develop compelling visualizations that effectively communicate complex ML concepts
* Synthesize technical knowledge into coherent research narratives
Reference files/Week_4_content.txt
DELETED
@@ -1,630 +0,0 @@
In this course, you'll learn the complete NLP workflow by exploring a fascinating real-world question: Does review length and language relate to reviewer ratings and decisions in academic peer review? If so, how?
Using data from the International Conference on Learning Representations (ICLR), you'll develop practical NLP skills while investigating how reviewers express their opinions. Each module builds upon the previous one, creating a coherent analytical pipeline from raw data to insight.

Learning Path
Data Loading and Initial Exploration: Setting up your environment and understanding your dataset
Text Preprocessing and Normalization: Cleaning and standardizing text data
Feature Extraction and Measurement: Calculating metrics from text
Visualization and Pattern Recognition: Creating insightful visualizations
Drawing Conclusions from Text Analysis: Synthesizing findings into actionable insights
Let's begin our exploration of how NLP can provide insights into academic peer review!

Module 1: Initial Exploration
The Challenge
Before we can analyze how review length relates to paper evaluations, we need to understand our dataset. In this module, we'll set up our Python environment and explore the ICLR conference data.

1.1: Set up and get to your data
The first step in any NLP project is loading and understanding your data. Let's set up our environment and examine what we're working with:
python
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from wordcloud import WordCloud

# Load the datasets
df_reviews = pd.read_csv('../data/reviews.csv')
df_submissions = pd.read_csv('../data/Submissions.csv')
df_dec = pd.read_csv('../data/decision.csv')
df_keyword = pd.read_csv('../data/submission_keyword.csv')

Let's look at the first few rows of each dataset to understand what information we have:
python
# View the first few rows of the submissions dataset
df_submissions.head()
# View the first few rows of the reviews dataset
df_reviews.head()
# View all columns and rows in the reviews dataset
df_reviews
# View the first few rows of the keywords dataset
df_keyword.head()

1.2: Looking at Review Content
Let's examine an actual review to understand the text we'll be analyzing:
python
# Display a sample review
df_reviews['review'][1]

Think about: What kinds of information do you see in this review? What language patterns do you notice?

1.3: Calculating Basic Metrics
Let's calculate our first simple metric: the average review score for each paper.
python
# Get the average review score for each paper
df_average_review_score = df_reviews.groupby('forum')['rating_int'].mean().reset_index()
df_average_review_score

Key Insight: Each paper (identified by 'forum') receives multiple reviews with different scores. The average score gives us an overall assessment of each paper.

Module 2: Data Integration
In this module, we'll merge datasets for later analysis.

2.1 Understanding the Need for Data Integration
In many NLP projects, the data we need is spread across multiple files or tables. In our case:
The df_reviews dataset contains the review text and ratings
The df_dec dataset contains the final decisions for each paper
To analyze how review text relates to paper decisions, we need to merge these datasets.

2.2 Performing a Dataset Merge
Let's combine our review data with the decision data:
python
# Step 1 - Merge the reviews dataframe with the decisions dataframe
df_rev_dec = pd.merge(
    df_reviews,        # First dataframe (reviews)
    df_dec,            # Second dataframe (decisions)
    left_on='forum',   # Join key in the first dataframe
    right_on='forum',  # Join key in the second dataframe
    how='inner'        # Keep only matching rows
)[['review','decision','conf_name_y','rating_int','forum']]  # Select only these columns
# Display the first few rows of the merged dataframe
df_rev_dec.head()

2.3 Understanding Merge Concepts
Join Key: The 'forum' column identifies the paper and connects our datasets
Inner Join: Only keeps papers that appear in both datasets
Column Selection: We keep only relevant columns for our analysis
How to Verify: Always check the shape of your merged dataset to ensure you haven't lost data unexpectedly (see the sketch below)
Try it yourself: How many rows does the merged dataframe have compared to the original review dataframe? What might explain any differences?
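One way to act on the verification step above is to compare row counts before and after the merge. This is a minimal sketch that assumes the dataframes created in the code above are already loaded:
python
# Compare dataset sizes before and after the merge
print("Reviews:", df_reviews.shape)
print("Decisions:", df_dec.shape)
print("Merged reviews + decisions:", df_rev_dec.shape)

# Fewer rows than df_reviews usually means some reviews had no matching
# decision (they were dropped by the inner join); more rows would indicate
# duplicate 'forum' entries in df_dec
print("Difference in row count:", len(df_reviews) - len(df_rev_dec))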
Module 3: Basic Text Preprocessing
In this module, you'll learn essential data preprocessing techniques for NLP projects. We'll standardize text through case folding, clean up categorical variables, and prepare our review text for analysis.

3.1 Case Folding (Lowercase Conversion)
A fundamental text preprocessing step is converting all text to lowercase to ensure consistency:
python
# Convert all review text to lowercase (case folding)
df_rev_dec['review'] = df_rev_dec['review'].str.lower()
# Display the updated dataframe
df_rev_dec

Why Case Folding Matters
Consistency: "Novel" and "novel" will be treated as the same word
Reduced Dimensionality: Fewer unique tokens to process
Improved Pattern Recognition: Easier to identify word frequencies and patterns
Note: While case folding is generally helpful, it can sometimes remove meaningful distinctions (e.g., "US" vs. "us"). For our academic review analysis, lowercase conversion is appropriate.

3.2 Examining Categorical Values
Let's first check what unique decision categories exist in our dataset:
python
# Display the unique decision categories
df_rev_dec['decision'].unique()

3.3 Standardizing Decision Categories
We can see that there are multiple "Accept" categories with different presentation formats. Let's standardize these:
python
# Define a function to clean up and standardize decision categories
def clean_up_decision(text):
    if text in ['Accept (Poster)','Accept (Spotlight)', 'Accept (Oral)','Accept (Talk)']:
        return 'Accept'
    else:
        return text
# Apply the function to create a new standardized decision column
df_rev_dec['decision_clean'] = df_rev_dec['decision'].apply(clean_up_decision)
# Check our new standardized decision categories
df_rev_dec['decision_clean'].unique()

Why Standardization Matters
Simplified Analysis: Reduces the number of categories to analyze
Clearer Patterns: Makes it easier to identify trends by decision outcome
Better Visualization: Creates more meaningful and readable plots
Consistent Terminology: Aligns with how conferences typically report accept/reject decisions
Try it yourself: What other ways could you group or standardize these decision categories? What information might be lost in our current approach?

Module 4: Text Tokenization
4.1 Introduction to Tokenization
Tokenization is the process of breaking text into smaller units like sentences or words. Let's examine a review:
python
# Display a sample review
df_reviews['review'][1]

4.2 Sentence Tokenization
Let's break this review into sentences using NLTK's sentence tokenizer:
python
# Import the necessary library if not already imported
from nltk.tokenize import sent_tokenize
# Tokenize the review into sentences
sent_tokenize(df_reviews['review'][1])

4.3 Counting Sentences
Now let's count the number of sentences in the review:
python
# Count the number of sentences
len(sent_tokenize(df_reviews['review'][1]))

4.4 Creating a Reusable Function
Let's create a function to count sentences in any text:
python
# Define a function to count sentences in a text
def sentence_count(text):
    return len(sent_tokenize(text))

4.5 Applying Our Function to All Reviews
Now we'll apply our function to all reviews to get sentence counts:
python
# Add a new column with the sentence count for each review
df_rev_dec['sent_count'] = df_rev_dec['review'].apply(sentence_count)
# Display the updated dataframe
df_rev_dec.head()

Key Insight: Sentence count is a simple yet effective way to quantify review length. The number of sentences can indicate how thoroughly a reviewer has evaluated a paper.

Module 5: Visualization of Text Metrics
5.1 Creating a 2D Histogram
Let's visualize the relationship between review length (in sentences), rating, and decision outcome:
python
# Create a 2D histogram with sentence count, rating, and decision
ax = sns.histplot(data=df_rev_dec, x='sent_count',
                  y='rating_int',
                  hue='decision_clean',
                  kde=True,
                  log_scale=(True,False),
                  legend=True)

5.2 Enhancing Our Visualization
Let's improve our visualization with better labels and formatting:
python
# Set axis labels
ax.set(xlabel='Review Length (# Sentences)', ylabel='Review Rating')
# Move the legend outside the plot for better visibility
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
# Ensure the layout is properly configured
plt.tight_layout()
# Display the plot
plt.show()

5.3 Interpreting the Visualization
This visualization reveals several interesting patterns:
Length-Rating Relationship: Is review length correlated with rating, and if so, how?
Decision Patterns: Are there visible clusters for accepted vs. rejected papers?
Density Distribution: Where are most reviews concentrated in terms of length and rating?
Outliers: Are there unusually long or short reviews at certain rating levels?
Discussion Question: Based on this visualization, do reviewers tend to write longer reviews when they're more positive or more critical? What might explain this pattern?
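To put a rough number on the length-rating question above, a minimal sketch (assuming the df_rev_dec columns created in the previous modules) is to compute the correlation and compare average lengths by decision:
python
# Correlation between review length (in sentences) and rating
print(df_rev_dec['sent_count'].corr(df_rev_dec['rating_int']))

# Average review length for accepted vs. rejected papers
print(df_rev_dec.groupby('decision_clean')['sent_count'].mean())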
Module 6: Additional Text Processing - Tokenization
Tokenization is the process of breaking text into smaller units (tokens) that serve as the building blocks for natural language processing. In this lesson, we'll explore how to tokenize text, remove stopwords and punctuation, and analyze the results.

6.1 Text Cleaning
Before tokenization, we often clean the text to remove unwanted characters. Let's start by removing punctuation:
python
# Removing punctuation
df_rev_dec['clean_review_word'] = df_rev_dec['review'].str.translate(str.maketrans('', '', string.punctuation))

What's happening here?
string.punctuation contains all punctuation characters (.,!?;:'"()[]{}-_)
str.maketrans('', '', string.punctuation) creates a translation table to remove these characters
df_rev_dec['review'].str.translate() applies this translation to all review texts

6.2 Word Tokenization
After cleaning, we can tokenize the text into individual words:
python
# Tokenizing the text
df_rev_dec['tokens'] = df_rev_dec['clean_review_word'].apply(word_tokenize)

# Example: Look at tokens for the 6th review
df_rev_dec['tokens'][5]

What's happening here?
word_tokenize() is an NLTK function that splits text into a list of words
We apply this function to each review using pandas' apply() method
The result is a new column containing lists of words for each review

6.3 Removing Stopwords
Stopwords are common words like "the," "and," "is" that often don't add meaningful information for analysis:
python
# Getting the list of English stopwords
stop_words = set(stopwords.words('english'))

# Removing stopwords from our tokens
df_rev_dec['tokens'] = df_rev_dec['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

What's happening here?
stopwords.words('english') returns a list of common English stopwords
We convert it to a set for faster lookup
The lambda function filters each token list, keeping only words that aren't stopwords
This creates more meaningful token lists focused on content words

6.4 Counting Tokens
Now that we have our cleaned and filtered tokens, let's count them to measure review length:
python
# Count tokens for each review
df_rev_dec['tokens_counts'] = df_rev_dec['tokens'].apply(len)

# View the token counts
df_rev_dec['tokens_counts']

What's happening here?
We use apply(len) to count the number of tokens in each review
This gives us a quantitative measure of review length after removing stopwords
The difference between this and raw word count shows the prevalence of stopwords (see the sketch after this module)

6.5 Visualizing Token Counts vs. Ratings
Let's visualize the relationship between token count, rating, and decision:
python
# Create a 2D histogram with token count, rating, and decision
ax = sns.histplot(data=df_rev_dec, x='tokens_counts',
                  y='rating_int',
                  hue='decision_clean',
                  kde=True,
                  log_scale=(True,False),
                  legend=True)

# Set axis labels
ax.set(xlabel='Review Length (# Tokens)', ylabel='Review Rating')

# Move the legend outside the plot
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.tight_layout()
plt.show()

What's happening here?
We create a 2D histogram showing the distribution of token counts and ratings
Colors distinguish between accepted and rejected papers
Log scale on the x-axis helps visualize the wide range of token counts
Kernel density estimation (KDE) shows the concentration of reviews
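To see how much of each review was stopwords, a small sketch that assumes the clean_review_word and tokens_counts columns created in this module:
python
# Raw word count (before stopword removal) vs. filtered token count
raw_counts = df_rev_dec['clean_review_word'].str.split().apply(len)
print("Average words per review:", raw_counts.mean())
print("Average tokens after stopword removal:", df_rev_dec['tokens_counts'].mean())
print("Share of words that were stopwords:",
      1 - df_rev_dec['tokens_counts'].sum() / raw_counts.sum())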
Module 7: Aggregating Data by Paper
7.1 Understanding Data Aggregation
So far, we've been analyzing individual reviews. However, each paper (identified by 'forum') may have multiple reviews. To understand paper-level patterns, we need to aggregate our data.

7.2 Calculating Paper-Level Metrics
Let's aggregate our review metrics to the paper level by calculating means:
python
# Aggregate reviews to paper level (mean of metrics for each paper)
df_rev_dec_ave = df_rev_dec.groupby(['forum','decision_clean'])[['rating_int','tokens_counts','sent_count']].mean().reset_index()

What's happening here?
We're grouping reviews by both 'forum' (paper ID) and 'decision_clean' (accept/reject)
For each group, we calculate the mean of 'rating_int', 'tokens_counts', and 'sent_count'
The reset_index() turns the result back into a regular DataFrame
The result is a paper-level dataset with average metrics for each paper
Try it yourself: How many papers do we have in our dataset compared to reviews? What does this tell us about the review process?
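One way to answer the question above, sketched under the assumption that df_rev_dec and df_rev_dec_ave exist as built earlier:
python
# Number of individual reviews vs. number of distinct papers
print("Reviews:", len(df_rev_dec))
print("Papers:", df_rev_dec['forum'].nunique())

# Average number of reviews per paper
print("Reviews per paper:", len(df_rev_dec) / df_rev_dec['forum'].nunique())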
Module 8: Visualizing Token Count vs. Rating
8.1 Creating an Advanced Visualization
Now let's visualize the relationship between token count and rating at the paper level:
python
# Create a 2D histogram with token count, rating, and decision
ax = sns.histplot(data=df_rev_dec_ave, x='tokens_counts',
                  y='rating_int',
                  hue='decision_clean',
                  kde=True,
                  log_scale=(True,False),
                  legend=True)

# Set axis labels
ax.set(xlabel='Review Length (# Tokens)', ylabel='Review Rating')

# Move the legend outside the plot
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.tight_layout()
plt.show()

8.2 Interpreting the Visualization
This visualization reveals important patterns in our data:
Decision Boundaries: Notice where the color changes from one decision to another
Length-Rating Relationship: Is there a correlation between review length and rating?
Clustering: Are there natural clusters in the data?
Outliers: What papers received unusually long or short reviews?
Key Insight: At the paper level, we can see if the average review length for a paper relates to its likelihood of acceptance.

Module 9: Comparing Token Count and Sentence Count
9.1 Visualizing Sentence Count vs. Rating
Let's create a similar visualization using sentence count instead of token count:
python
# Create a 2D histogram with sentence count, rating, and decision
ax = sns.histplot(data=df_rev_dec_ave, x='sent_count',
                  y='rating_int',
                  hue='decision_clean',
                  kde=True,
                  log_scale=(True,False),
                  legend=True)

# Set axis labels
ax.set(xlabel='Review Length (# Sentences)', ylabel='Review Rating')

# Move the legend outside the plot
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

plt.tight_layout()
plt.show()

9.2 Comparing Token vs. Sentence Metrics
By comparing these two visualizations, we can understand:
Which Metric is More Informative: Do token counts or sentence counts better differentiate accepted vs. rejected papers?
Different Patterns: Do some papers have many short sentences while others have fewer long ones?
Consistency: Are the patterns consistent across both metrics?
Discussion Question: Which metric (tokens or sentences) seems to be a better predictor of paper acceptance? Why might that be?
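A quick numeric companion to the comparison above, sketched under the assumption that df_rev_dec_ave from Module 7 is available:
python
# How strongly does each length metric track the average rating?
print(df_rev_dec_ave[['tokens_counts', 'sent_count', 'rating_int']].corr())

# Average of each length metric by decision outcome
print(df_rev_dec_ave.groupby('decision_clean')[['tokens_counts', 'sent_count']].mean())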
Module 10: Word Cloud Visualizations
10.1 Creating a Word Cloud from Review Text
Word clouds are a powerful way to visualize the most frequent words in a text corpus:
python
# Concatenate all review text
text = ' '.join(df_rev_dec['clean_review_word'])

# Generate word cloud
wordcloud = WordCloud().generate(text)

# Display word cloud
plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

10.2 Visualizing Paper Keywords
Now let's visualize the primary keywords associated with the papers:
python
# Concatenate all primary keywords
text = ' '.join(df_keyword['primary_keyword'])

# Generate word cloud
wordcloud = WordCloud().generate(text)

# Display word cloud
plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

10.3 Visualizing Paper Abstracts
Finally, let's create a word cloud from paper abstracts:
python
# Concatenate all abstracts
text = ' '.join(df_submissions['abstract'])

# Generate word cloud
wordcloud = WordCloud().generate(text)

# Display word cloud
plt.figure(figsize=(8, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Interpreting Word Clouds
Word clouds provide insights about:
Dominant Themes: The most frequent words appear largest
Vocabulary Differences: Compare terms across different sources (reviews vs. abstracts)
Field-Specific Terminology: Technical terms reveal the focus of the conference
Sentiment Indicators: Evaluative words in reviews reveal assessment patterns
Try it yourself: What differences do you notice between the word clouds from reviews, keywords, and abstracts? What do these differences tell you about academic communication?
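Word clouds show frequency visually; to compare sources more precisely you can also list the most common terms directly. A minimal sketch, assuming the tokens column and df_keyword from earlier modules:
python
from collections import Counter

# Most common content words across all reviews
review_counts = Counter(token for tokens in df_rev_dec['tokens'] for token in tokens)
print(review_counts.most_common(10))

# Most common primary keywords across submissions
keyword_counts = Counter(df_keyword['primary_keyword'])
print(keyword_counts.most_common(10))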
V1.1 Week 4 - Intro to NLP
Course Overview
In this course, you'll learn fundamental Natural Language Processing (NLP) concepts by exploring a fascinating real-world question: What is the effect of releasing a preprint of a paper before it is submitted for peer review?
Using the ICLR (International Conference on Learning Representations) database - which contains submissions, reviews, and author profiles from 2017-2022 - you'll develop practical NLP skills while investigating potential biases and patterns in academic publishing.

Learning Path
Understanding Text as Data: How computers represent and work with text
Text Processing Fundamentals: Basic cleaning and normalization
Quantitative Text Analysis: Measuring and comparing text features
Tokenization Approaches: Breaking text into meaningful units
Text Visualization Techniques: Creating insightful visual representations
From Analysis to Insights: Drawing evidence-based conclusions
Let's dive in!
…

Step 4: Text Cleaning and Normalization for Academic Content
Academic papers contain specialized vocabulary, citations, equations, and other elements that require careful normalization.
Key Concept: Scientific text normalization preserves meaningful technical content while standardizing format.

Stop Words Removal
Definition: Stop words are extremely common words that appear frequently in text but typically carry little meaningful information for analysis purposes. In English, these include articles (the, a, an), conjunctions (and, but, or), prepositions (in, on, at), and certain pronouns (I, you, it).
Stop words removal is the process of filtering these words out before analysis to:
Reduce noise in the data
Decrease the dimensionality of the text representation
Focus analysis on the content-bearing words
In academic text, we often extend standard stop word lists to include domain-specific terms that are ubiquitous but not analytically useful (e.g., "paper," "method," "result").
python
# Load standard English stop words
from nltk.corpus import stopwords
standard_stop_words = set(stopwords.words('english'))

# Add academic-specific stop words
academic_stop_words = ['et', 'al', 'fig', 'table', 'paper', 'using', 'used',
                       'method', 'result', 'show', 'propose', 'use']
all_stop_words = standard_stop_words.union(academic_stop_words)

# Apply stop word removal
def remove_stop_words(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in all_stop_words]
    return ' '.join(filtered_words)

# Compare before and after
example = "We propose a novel method that shows impressive results on the benchmark dataset."
filtered = remove_stop_words(example)

print("Original:", example)
print("After stop word removal:", filtered)
# Output: "novel shows impressive results benchmark dataset."

Stemming and Lemmatization
Definition: Stemming and lemmatization are text normalization techniques that reduce words to their root or base forms, allowing different inflections or derivations of the same word to be treated as equivalent.
Stemming is a simpler, rule-based approach that works by truncating words to their stems, often by removing suffixes. For example:
"running," "runs," and "runner" might all be reduced to "run"
"connection," "connected," and "connecting" might all become "connect"
Stemming is faster but can sometimes produce non-words or incorrect reductions.
Lemmatization is a more sophisticated approach that uses vocabulary and morphological analysis to return the dictionary base form (lemma) of a word. For example:
"better" becomes "good"
"was" and "were" become "be"
"studying" becomes "study"
Lemmatization generally produces more accurate results but requires more computational resources.
python
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example words
academic_terms = ["algorithms", "computing", "learning", "trained",
                  "networks", "better", "studies", "analyzed"]

# Compare stemming and lemmatization
for term in academic_terms:
    print(f"Original: {term}")
    print(f"Stemmed: {stemmer.stem(term)}")
    print(f"Lemmatized: {lemmatizer.lemmatize(term)}")
    print()

# Demonstration in context
academic_sentence = "The training algorithms performed better than expected when analyzing multiple neural networks."

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in academic_sentence.lower().split()]
stemmed_sentence = ' '.join(stemmed_words)

# Apply lemmatization
lemmatized_words = [lemmatizer.lemmatize(word) for word in academic_sentence.lower().split()]
lemmatized_sentence = ' '.join(lemmatized_words)

print("Original:", academic_sentence)
print("Stemmed:", stemmed_sentence)
print("Lemmatized:", lemmatized_sentence)

When to use which approach:
For academic text analysis:
Stemming is useful when processing speed is important and approximate matching is sufficient
Lemmatization is preferred when precision is crucial, especially for technical terms where preserving meaning is essential
In our ICLR paper analysis, lemmatization would likely be more appropriate since technical terminology often carries specific meanings that should be preserved accurately.
Challenge Question: How might stemming versus lemmatization affect our analysis of technical innovation in ICLR papers? Can you think of specific machine learning terms where these approaches would yield different results?
V1.0 Week 4 - Intro to NLP
The Real-World Problem
Imagine you're part of a small business team that has just launched a new product. You've received hundreds of customer reviews across various platforms, and your manager has asked you to make sense of this feedback. Looking at the mountain of text data, you realize you need a systematic way to understand what customers are saying without reading each review individually.
Your challenge: How can you efficiently analyze customer feedback to identify common themes, sentiments, and specific product issues?

Our Approach
In this module, we'll learn how to transform unstructured text feedback into structured insights using Natural Language Processing. Here's our journey:
Understanding text as data
Basic processing of text information
Measuring text properties
Cleaning and normalizing customer feedback
Visualizing patterns in the feedback
Analyzing words vs. tokens
Let's begin!

Step 1: Text as Data - A New Perspective
When we look at customer reviews like:
"Love this product! So easy to use and the battery lasts forever."
"Terrible design. Buttons stopped working after two weeks."
We naturally understand the meaning and sentiment. But how can a computer understand this?
Key Concept: Text can be treated as data that we can analyze quantitatively.
Unlike numerical data (age, price, temperature) that has inherent mathematical properties, text data needs to be transformed before we can analyze it.
Interactive Exercise: Look at these two reviews. As a human, what information can you extract? Now think about how a computer might "see" this text without any processing.
Challenge Question: What types of information might we want to extract from customer reviews? List at least three analytical goals.

Step 2: Basic Text Processing - Breaking Down Language
Before we can analyze text, we need to break it down into meaningful units.
Key Concept: Tokenization is the process of splitting text into smaller pieces (tokens) such as words, phrases, or characters.
For example, the review "Love this product!" can be tokenized into ["Love", "this", "product", "!"] or ["Love", "this", "product!"] depending on our approach.
Interactive Example: Let's tokenize these customer reviews:
python
# Simple word tokenization
review = "Battery life is amazing but the app crashes frequently."
tokens = review.split()  # Results in ["Battery", "life", "is", "amazing", "but", "the", "app", "crashes", "frequently."]
Notice how "frequently." includes the period. Basic tokenization has limitations!
Challenge Question: How might we handle contractions like "doesn't" or hyphenated words like "user-friendly" when tokenizing?
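For comparison, a library tokenizer treats punctuation, contractions, and hyphens differently from str.split(). A small sketch using NLTK (assumes nltk and its 'punkt' tokenizer data are installed):
python
from nltk.tokenize import word_tokenize

review = "This app doesn't feel user-friendly."
print(review.split())
# ['This', 'app', "doesn't", 'feel', 'user-friendly.']
print(word_tokenize(review))
# ['This', 'app', 'does', "n't", 'feel', 'user-friendly', '.']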
Step 3: Measuring Text - Quantifying Feedback
Now that we've broken text into pieces, we can start measuring properties of our customer feedback.
Key Concept: Text metrics help us quantify and compare text data.
Common metrics include:
Length (words, characters)
Complexity (average word length, unique words ratio)
Sentiment scores (positive/negative)
Interactive Example: Let's calculate basic metrics for customer reviews:
python
# Word count
review = "The interface is intuitive and responsive."
word_count = len(review.split())  # 6 words

# Character count (including spaces)
char_count = len(review)  # 42 characters

# Unique words ratio
unique_words = len(set(review.lower().split()))
unique_ratio = unique_words / word_count  # 1.0 (all words are unique)
Challenge Question: Why might longer reviews not necessarily contain more information than shorter ones? What other metrics beyond length might better capture information content?

Step 4: Text Cleaning and Normalization
Customer feedback often contains inconsistencies: spelling variations, punctuation, capitalization, etc.
Key Concept: Text normalization creates a standardized format for analysis.
Common normalization steps:
Converting to lowercase
Removing punctuation
Correcting spelling
Removing stop words (common words like "the", "is")
Stemming or lemmatizing (reducing words to their base form)
Interactive Example: Let's normalize a review:
python
# Original review
review = "The battery LIFE is amazing!!! Works for days."

# Lowercase
review = review.lower()  # "the battery life is amazing!!! works for days."

# Remove punctuation and extra spaces
import re
review = re.sub(r'[^\w\s]', '', review)  # "the battery life is amazing works for days"

# Remove stop words
stop_words = ["the", "is", "for"]
words = review.split()
filtered_words = [word for word in words if word not in stop_words]
# Result: ["battery", "life", "amazing", "works", "days"]
Challenge Question: How might normalization affect sentiment analysis? Could removing punctuation or stop words change the perceived sentiment of a review?

Step 5: Text Visualization - Seeing Patterns
Visual representations help us identify patterns across many reviews.
Key Concept: Text visualization techniques reveal insights that are difficult to see in raw text.
Common visualization methods:
Word clouds
Frequency distributions
Sentiment over time
Topic clusters
Interactive Example: Creating a simple word frequency chart:
python
from collections import Counter

# Combined reviews
reviews = ["Battery life is amazing", "Battery drains too quickly",
           "Great battery performance", "Screen is too small"]

# Count word frequencies
all_words = " ".join(reviews).lower().split()
word_counts = Counter(all_words)
# Result: {'battery': 3, 'life': 1, 'is': 2, 'amazing': 1, 'drains': 1, 'too': 2, 'quickly': 1, 'great': 1, 'performance': 1, 'screen': 1, 'small': 1}

# We could visualize this as a bar chart (see the sketch below)
# Most frequent: 'battery' (3), 'is' (2), 'too' (2)
Challenge Question: Why might a word cloud be misleading for understanding customer sentiment? What additional information would make the visualization more informative?
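Following up on the bar-chart idea in the comment above, a minimal matplotlib sketch that reuses the word_counts computed there:
python
import matplotlib.pyplot as plt

# Plot the most common words as a bar chart
top_words = word_counts.most_common(5)
labels = [word for word, count in top_words]
counts = [count for word, count in top_words]

plt.bar(labels, counts)
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Most frequent words in the reviews')
plt.show()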
Step 6: Words vs. Tokens - Making Choices
As we advance in NLP, we face an important decision: should we analyze whole words or more sophisticated tokens?
Key Concept: Different tokenization approaches have distinct advantages and limitations.
Word-based analysis:
Intuitive and interpretable
Misses connections between related words (run/running/ran)
Struggles with compound words and new terms
Token-based analysis:
Can capture subword information
Handles unknown words better
May lose some human interpretability
Interactive Example: Comparing approaches:
python
# Word-based
review = "The touchscreen is unresponsive"
words = review.lower().split()  # ['the', 'touchscreen', 'is', 'unresponsive']

# Subword tokenization (simplified example)
subwords = ['the', 'touch', 'screen', 'is', 'un', 'responsive']
Challenge Question: For our customer feedback analysis, which approach would be better: analyzing whole words or subword tokens? What factors would influence this decision?

Putting It All Together: Solving Our Problem
Now that we've learned these fundamental NLP concepts, let's return to our original challenge: analyzing customer feedback at scale.
Here's how we'd approach it:
Collect and tokenize all customer reviews
Clean and normalize the text
Calculate key metrics (length, sentiment scores)
Visualize common terms and topics
Identify positive and negative feedback themes
Generate an automated summary for the product team
By applying these NLP fundamentals, we've transformed an overwhelming mass of text into actionable insights that can drive product improvements!
Final Challenge: How could we extend this analysis to track customer sentiment over time as we release product updates? What additional NLP techniques might be helpful?
Reference files/w6_logistic_regression_lab.py
DELETED
@@ -1,400 +0,0 @@
# -*- coding: utf-8 -*-
"""W6_Logistic_regression_lab

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1MG7N2HN-Nxow9fzvc0fzxvp3WyKqtgs8

# 🚀 Logistic Regression Lab: Stock Market Prediction

## Lab Overview
In this lab, we'll use logistic regression to try predicting whether the stock market goes up or down. Spoiler alert: This is intentionally a challenging prediction problem that will teach us important lessons about when logistic regression works well and when it doesn't.

## Learning Goals:

- Apply logistic regression to real data
- Interpret probabilities and coefficients
- Understand why some prediction problems are inherently difficult
- Learn proper model evaluation techniques

## The Stock Market Data

In this lab we will examine the `Smarket` data, which is part of the `ISLP` library. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, `Lag1` through `Lag5`. We have also recorded `Volume` (the number of shares traded on the previous day, in billions), `Today` (the percentage return on the date in question) and `Direction` (whether the market was `Up` or `Down` on this date).

### Your Challenge
**Question**: Can we predict if the S&P 500 will go up or down based on recent trading patterns?

**Why This Matters:** If predictable, this would be incredibly valuable. If not predictable, we learn about market efficiency and realistic expectations for prediction models.

To answer the question, **we start by importing our libraries at this top level; these are all imports we have seen in previous labs.**
"""

import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize)

"""We also collect together the new imports needed for this lab."""

from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.discriminant_analysis import \
    (LinearDiscriminantAnalysis as LDA,
     QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

"""Now we are ready to load the `Smarket` data."""

Smarket = load_data('Smarket')
Smarket

"""This gives a truncated listing of the data.
We can see what the variable names are.
"""

Smarket.columns

"""We compute the correlation matrix using the `corr()` method
for data frames, which produces a matrix that contains all of
the pairwise correlations among the variables.

By instructing `pandas` to use only numeric variables, the `corr()` method does not report a correlation for the `Direction` variable because it is
qualitative.


"""

Smarket.corr(numeric_only=True)

"""As one would expect, the correlations between the lagged return variables and
today's return are close to zero. The only substantial correlation is between `Year` and
`Volume`. By plotting the data we see that `Volume`
is increasing over time. In other words, the average number of shares traded
daily increased from 2001 to 2005.
"""

Smarket.plot(y='Volume');

"""## Logistic Regression
Next, we will fit a logistic regression model in order to predict
`Direction` using `Lag1` through `Lag5` and
`Volume`. The `sm.GLM()` function fits *generalized linear models*, a class of
models that includes logistic regression. Alternatively,
the function `sm.Logit()` fits a logistic regression
model directly. The syntax of
`sm.GLM()` is similar to that of `sm.OLS()`, except
that we must pass in the argument `family=sm.families.Binomial()`
in order to tell `statsmodels` to run a logistic regression rather than some other
type of generalized linear model.
"""

allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'
glm = sm.GLM(y,
             X,
             family=sm.families.Binomial())
results = glm.fit()
summarize(results)

"""The smallest *p*-value here is associated with `Lag1`. The
negative coefficient for this predictor suggests that if the market
had a positive return yesterday, then it is less likely to go up
today. However, at a value of 0.15, the *p*-value is still
relatively large, and so there is no clear evidence of a real
association between `Lag1` and `Direction`.

We use the `params` attribute of `results`
in order to access just the
coefficients for this fitted model.
"""

results.params

"""Likewise we can use the
`pvalues` attribute to access the *p*-values for the coefficients.
"""

results.pvalues

"""The `predict()` method of `results` can be used to predict the
probability that the market will go up, given values of the
predictors. This method returns predictions
on the probability scale. If no data set is supplied to the `predict()`
function, then the probabilities are computed for the training data
that was used to fit the logistic regression model.
As with linear regression, one can pass an optional `exog` argument consistent
with a design matrix if desired. Here we have
printed only the first ten probabilities.
"""

probs = results.predict()
probs[:10]

"""In order to make a prediction as to whether the market will go up or
down on a particular day, we must convert these predicted
probabilities into class labels, `Up` or `Down`. The
following two commands create a vector of class predictions based on
whether the predicted probability of a market increase is greater than
or less than 0.5.
"""

labels = np.array(['Down']*1250)
labels[probs>0.5] = "Up"

"""The `confusion_table()`
function from the `ISLP` package summarizes these predictions, showing how
many observations were correctly or incorrectly classified. Our function, which is adapted from a similar function
in the module `sklearn.metrics`, transposes the resulting
matrix and includes row and column labels.
The `confusion_table()` function takes as first argument the
predicted labels, and second argument the true labels.
"""

confusion_table(labels, Smarket.Direction)

"""The diagonal elements of the confusion matrix indicate correct
predictions, while the off-diagonals represent incorrect
predictions. Hence our model correctly predicted that the market would
go up on 507 days and that it would go down on 145 days, for a
total of 507 + 145 = 652 correct predictions. The `np.mean()`
function can be used to compute the fraction of days for which the
prediction was correct. In this case, logistic regression correctly
predicted the movement of the market 52.2% of the time.
"""

(507+145)/1250, np.mean(labels == Smarket.Direction)

"""At first glance, it appears that the logistic regression model is
working a little better than random guessing. However, this result is
misleading because we trained and tested the model on the same set of
1,250 observations. In other words, 100 - 52.2 = 47.8% is the
*training* error rate. As we have seen
previously, the training error rate is often overly optimistic --- it
tends to underestimate the test error rate. In
order to better assess the accuracy of the logistic regression model
in this setting, we can fit the model using part of the data, and
then examine how well it predicts the *held out* data. This
will yield a more realistic error rate, in the sense that in practice
we will be interested in our model's performance not on the data that
we used to fit the model, but rather on days in the future for which
the market's movements are unknown.

To implement this strategy, we first create a Boolean vector
corresponding to the observations from 2001 through 2004. We then
use this vector to create a held out data set of observations from
2005.
"""

train = (Smarket.Year < 2005)
Smarket_train = Smarket.loc[train]
Smarket_test = Smarket.loc[~train]
Smarket_test.shape

"""The object `train` is a vector of 1,250 elements, corresponding
to the observations in our data set. The elements of the vector that
correspond to observations that occurred before 2005 are set to
`True`, whereas those that correspond to observations in 2005 are
set to `False`. Hence `train` is a
*boolean* array, since its
elements are `True` and `False`. Boolean arrays can be used
to obtain a subset of the rows or columns of a data frame
using the `loc` method. For instance,
the command `Smarket.loc[train]` would pick out a submatrix of the
stock market data set, corresponding only to the dates before 2005,
since those are the ones for which the elements of `train` are
`True`. The `~` symbol can be used to negate all of the
elements of a Boolean vector. That is, `~train` is a vector
similar to `train`, except that the elements that are `True`
in `train` get swapped to `False` in `~train`, and vice versa.
Therefore, `Smarket.loc[~train]` yields a
subset of the rows of the data frame
of the stock market data containing only the observations for which
`train` is `False`.
The output above indicates that there are 252 such
observations.

We now fit a logistic regression model using only the subset of the
observations that correspond to dates before 2005. We then obtain predicted probabilities of the
stock market going up for each of the days in our test set --- that is,
for the days in 2005.
"""

X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]
glm_train = sm.GLM(y_train,
                   X_train,
                   family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)

"""Notice that we have trained and tested our model on two completely
separate data sets: training was performed using only the dates before
2005, and testing was performed using only the dates in 2005.

Finally, we compare the predictions for 2005 to the
actual movements of the market over that time period.
We will first store the test and training labels (recall `y_test` is binary).
"""

D = Smarket.Direction
L_train, L_test = D.loc[train], D.loc[~train]

"""Now we threshold the
fitted probability at 50% to form
our predicted labels.
"""

labels = np.array(['Down']*252)
labels[probs>0.5] = 'Up'
confusion_table(labels, L_test)

"""The test accuracy is about 48% while the error rate is about 52%."""

np.mean(labels == L_test), np.mean(labels != L_test)

"""The `!=` notation means *not equal to*, and so the last command
computes the test set error rate. The results are rather
disappointing: the test error rate is 52%, which is worse than
random guessing! Of course this result is not all that surprising,
given that one would not generally expect to be able to use previous
days' returns to predict future market performance. (After all, if it
were possible to do so, then the authors of this book would be out
striking it rich rather than writing a statistics textbook.)

We recall that the logistic regression model had very underwhelming
*p*-values associated with all of the predictors, and that the
smallest *p*-value, though not very small, corresponded to
`Lag1`. Perhaps by removing the variables that appear not to be
helpful in predicting `Direction`, we can obtain a more
effective model. After all, using predictors that have no relationship
with the response tends to cause a deterioration in the test error
rate (since such predictors cause an increase in variance without a
corresponding decrease in bias), and so removing such predictors may
in turn yield an improvement. Below we refit the logistic
regression using just `Lag1` and `Lag2`, which seemed to
have the highest predictive power in the original logistic regression
model.
"""

model = MS(['Lag1', 'Lag2']).fit(Smarket)
X = model.transform(Smarket)
X_train, X_test = X.loc[train], X.loc[~train]
glm_train = sm.GLM(y_train,
                   X_train,
                   family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)
labels = np.array(['Down']*252)
labels[probs>0.5] = 'Up'
confusion_table(labels, L_test)

"""Let's evaluate the overall accuracy as well as the accuracy within the days when
logistic regression predicts an increase.
"""

(35+106)/252, 106/(106+76)

"""Now the results appear to be a little better: 56% of the daily
movements have been correctly predicted. It is worth noting that in
this case, a much simpler strategy of predicting that the market will
increase every day will also be correct 56% of the time! Hence, in
terms of overall error rate, the logistic regression method is no
better than the naive approach. However, the confusion matrix
shows that on days when logistic regression predicts an increase in
the market, it has a 58% accuracy rate. This suggests a possible
trading strategy of buying on days when the model predicts an
increasing market, and avoiding trades on days when a decrease is
predicted. Of course one would need to investigate more carefully
whether this small improvement was real or just due to random chance.

Suppose that we want to predict the returns associated with particular
values of `Lag1` and `Lag2`. In particular, we want to
predict `Direction` on a day when `Lag1` and
`Lag2` equal 1.2 and 1.1, respectively, and on a day when they
equal 1.5 and -0.8. We do this using the `predict()`
function.
"""

newdata = pd.DataFrame({'Lag1':[1.2, 1.5],
                        'Lag2':[1.1, -0.8]});
newX = model.transform(newdata)
results.predict(newX)

Smarket

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import statsmodels.api as sm


# Load the dataset
data = load_data('Smarket')

# Display the first few rows of the dataset
print(data.head())

# Prepare the data for logistic regression
# Using 'Lag1' and 'Lag2' as predictors and 'Direction' as the response
data['Direction'] = data['Direction'].map({'Up': 1, 'Down': 0})
X = data[['Lag1', 'Lag2']]
y = data['Direction']
|
366 |
-
|
367 |
-
# Split the data into training and testing sets
|
368 |
-
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
|
369 |
-
|
370 |
-
# Fit the logistic regression model
|
371 |
-
log_reg = LogisticRegression()
|
372 |
-
log_reg.fit(X_train, y_train)
|
373 |
-
|
374 |
-
# Make predictions on the test set
|
375 |
-
y_pred = log_reg.predict(X_test)
|
376 |
-
|
377 |
-
# Print classification report and confusion matrix
|
378 |
-
print(classification_report(y_test, y_pred))
|
379 |
-
print(confusion_matrix(y_test, y_pred))
|
380 |
-
|
381 |
-
# Visualize the decision boundary
|
382 |
-
plt.figure(figsize=(10, 6))
|
383 |
-
|
384 |
-
# Create a mesh grid for plotting decision boundary
|
385 |
-
x_min, x_max = X['Lag1'].min() - 1, X['Lag1'].max() + 1
|
386 |
-
y_min, y_max = X['Lag2'].min() - 1, X['Lag2'].max() + 1
|
387 |
-
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
|
388 |
-
np.arange(y_min, y_max, 0.01))
|
389 |
-
|
390 |
-
# Predict the function value for the whole grid
|
391 |
-
Z = log_reg.predict(np.c_[xx.ravel(), yy.ravel()])
|
392 |
-
Z = Z.reshape(xx.shape)
|
393 |
-
|
394 |
-
# Plot the decision boundary
|
395 |
-
plt.contourf(xx, yy, Z, alpha=0.8)
|
396 |
-
plt.scatter(X_test['Lag1'], X_test['Lag2'], c=y_test, edgecolor='k', s=20)
|
397 |
-
plt.xlabel('Lag1')
|
398 |
-
plt.ylabel('Lag2')
|
399 |
-
plt.title('Logistic Regression Decision Boundary')
|
400 |
-
plt.show()
|
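A note on the hard-coded evaluation line `(35+106)/252, 106/(106+76)` in the deleted lab above: those two numbers are simply the overall test accuracy and the accuracy restricted to days the model predicted "Up", read off the confusion table. A minimal sketch of the same computation done programmatically; the helper name is illustrative and not part of the original file, and it assumes `labels` and `L_test` as defined in the lab.

```python
import numpy as np

def up_precision_and_accuracy(labels, truth):
    """Overall accuracy and the accuracy restricted to days predicted 'Up'."""
    labels, truth = np.asarray(labels), np.asarray(truth)
    accuracy = np.mean(labels == truth)             # fraction of all test days classified correctly
    up_days = labels == 'Up'                        # days the model predicts a rising market
    precision_up = np.mean(truth[up_days] == 'Up')  # accuracy on just those days
    return accuracy, precision_up
```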
Reference files/week 7/W7_Lab_KNN_clustering.ipynb
DELETED
@@ -1,481 +0,0 @@
# Clustering Lab: State Crime Pattern Analysis

## Lab Overview

Welcome to your hands-on clustering lab! You'll be working as a policy analyst for the Department of Justice, analyzing crime patterns across US states. Your mission: discover hidden safety profiles that could inform federal resource allocation and crime prevention strategies.

**Your Deliverable**: A policy brief with visualizations and recommendations based on your clustering analysis.

---

## Exercise 1: Data Detective Work
**Time: 15 minutes | Product: Data Summary Report**

### Your Task
Before any analysis, you need to understand what you're working with. Create a brief data summary that a non-technical policy maker could understand.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.datasets import get_rdataset
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

# Load the data
USArrests = get_rdataset('USArrests').data
print("Dataset shape:", USArrests.shape)
print("\nVariables:", USArrests.columns.tolist())
print("\nFirst 5 states:")
print(USArrests.head())
```

Recorded cell output: SyntaxError: invalid syntax (<ipython-input-1-2035427107>, line 1)

## Your Investigation
Complete this data summary table:

| Variable | What it measures | Average Value | Highest State | Lowest State |
|----------|------------------|---------------|---------------|--------------|
| Murder | Rate per 100,000 people | ??? | ??? | ??? |
| Assault | Rate per 100,000 people | ??? | ??? | ??? |
| UrbanPop | Percentage living in cities | ??? | ??? | ??? |
| Rape | Rate per 100,000 people | ??? | ??? | ??? |

**Deliverable**: Write 2-3 sentences describing the biggest surprises in this data. Which states are not what you expected?

---

## Exercise 2: The Scaling Challenge
**Time: 10 minutes | Product: Before/After Comparison**

### Your Task
Demonstrate why scaling is critical for clustering crime data.

```python
# Check the scale differences
print("Original data ranges:")
print(USArrests.describe())

print("\nVariances (how spread out the data is):")
print(USArrests.var())

# Scale the data
scaler = StandardScaler()
USArrests_scaled = scaler.fit_transform(USArrests)
scaled_df = pd.DataFrame(USArrests_scaled,
                         columns=USArrests.columns,
                         index=USArrests.index)

print("\nAfter scaling - all variables now have similar ranges:")
print(scaled_df.describe())
```

### Your Analysis
1. **Before scaling**: Which variable would dominate the clustering? Why?
2. **After scaling**: Explain in simple terms what StandardScaler did to the data.

**Deliverable**: One paragraph explaining why a policy analyst should care about data scaling.

---

## Exercise 3: Finding the Right Number of Groups
**Time: 20 minutes | Product: Recommendation with Visual Evidence**

### Your Task
Use the elbow method to determine how many distinct crime profiles exist among US states.

```python
# Test different numbers of clusters
inertias = []
K_values = range(1, 11)

for k in K_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
    kmeans.fit(USArrests_scaled)
    inertias.append(kmeans.inertia_)

# Create the elbow plot
plt.figure(figsize=(10, 6))
plt.plot(K_values, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares')
plt.title('Finding the Optimal Number of State Crime Profiles')
plt.grid(True, alpha=0.3)
plt.show()

# Print the inertia values
for k, inertia in zip(K_values, inertias):
    print(f"K={k}: Inertia = {inertia:.1f}")
```

### Your Decision
Based on your elbow plot:
1. **What value of K do you recommend?** (Look for the "elbow" where the line starts to flatten)
2. **What does this mean in policy terms?** (How many distinct types of state crime profiles exist?)

**Deliverable**: A one-paragraph recommendation with your chosen K value and reasoning.

---

## Exercise 4: K-Means State Profiling
**Time: 25 minutes | Product: State Crime Profile Report**

### Your Task
Create distinct crime profiles and identify which states belong to each category.

```python
# Use your chosen K value from Exercise 3
optimal_k = 4  # Replace with your chosen value

# Perform K-means clustering
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=20)
cluster_labels = kmeans.fit_predict(USArrests_scaled)

# Add cluster labels to original data
USArrests_clustered = USArrests.copy()
USArrests_clustered['Cluster'] = cluster_labels

# Analyze each cluster
print("State Crime Profiles Analysis")
print("=" * 50)

for cluster_num in range(optimal_k):
    cluster_states = USArrests_clustered[USArrests_clustered['Cluster'] == cluster_num]
    print(f"\nCLUSTER {cluster_num}: {len(cluster_states)} states")
    print("States:", ", ".join(cluster_states.index.tolist()))
    print("Average characteristics:")
    avg_profile = cluster_states[['Murder', 'Assault', 'UrbanPop', 'Rape']].mean()
    for var, value in avg_profile.items():
        print(f"  {var}: {value:.1f}")
```

### Your Analysis
For each cluster, create a profile:

**Cluster 0: "[Your Creative Name]"**
- **States**: [List them]
- **Characteristics**: [Describe the pattern]
- **Policy Insight**: [What should federal agencies know about these states?]

**Deliverable**: A table summarizing each cluster with creative names and policy recommendations.

---

## Exercise 5: Hierarchical Clustering Exploration
**Time: 25 minutes | Product: Family Tree Interpretation**

### Your Task
Create a dendrogram to understand how states naturally group together.

```python
from scipy.cluster.hierarchy import dendrogram, linkage

# Create hierarchical clustering
linkage_matrix = linkage(USArrests_scaled, method='complete')

# Plot the dendrogram
plt.figure(figsize=(15, 8))
dendrogram(linkage_matrix,
           labels=USArrests.index.tolist(),
           leaf_rotation=90,
           leaf_font_size=10)
plt.title('State Crime Pattern Family Tree')
plt.xlabel('States')
plt.ylabel('Distance Between Groups')
plt.tight_layout()
plt.show()
```

### Your Interpretation
1. **Closest Pairs**: Which two states are most similar in crime patterns?
2. **Biggest Divide**: Where is the largest split in the tree? What does this represent?
3. **Surprising Neighbors**: Which states cluster together that surprised you geographically?

### Code to Compare Methods

```python
# Compare your K-means results with hierarchical clustering
from scipy.cluster.hierarchy import fcluster

# Cut the tree to get the same number of clusters as K-means
hierarchical_labels = fcluster(linkage_matrix, optimal_k, criterion='maxclust') - 1

# Create comparison
comparison_df = pd.DataFrame({
    'State': USArrests.index,
    'K_Means_Cluster': cluster_labels,
    'Hierarchical_Cluster': hierarchical_labels
})

print("Comparison of K-Means vs Hierarchical Clustering:")
print(comparison_df.sort_values('State'))

# Count agreements
agreements = sum(comparison_df['K_Means_Cluster'] == comparison_df['Hierarchical_Cluster'])
print(f"\nMethods agreed on {agreements} out of {len(comparison_df)} states ({agreements/len(comparison_df)*100:.1f}%)")
```

**Deliverable**: A paragraph explaining the key differences between what K-means and hierarchical clustering revealed.

---

## Exercise 6: Policy Brief Creation
**Time: 20 minutes | Product: Executive Summary**

### Your Task
Synthesize your findings into a policy brief for Department of Justice leadership.

### Code Framework for Final Visualization

```python
# Create a comprehensive visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Murder vs Assault by cluster
colors = ['red', 'blue', 'green', 'orange', 'purple']
for i in range(optimal_k):
    cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]
    ax1.scatter(cluster_data['Murder'], cluster_data['Assault'],
                c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)
ax1.set_xlabel('Murder Rate')
ax1.set_ylabel('Assault Rate')
ax1.set_title('Murder vs Assault by Crime Profile')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Urban Population vs Rape by cluster
for i in range(optimal_k):
    cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]
    ax2.scatter(cluster_data['UrbanPop'], cluster_data['Rape'],
                c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)
ax2.set_xlabel('Urban Population %')
ax2.set_ylabel('Rape Rate')
ax2.set_title('Urban Population vs Rape Rate by Crime Profile')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Cluster size comparison
cluster_sizes = USArrests_clustered['Cluster'].value_counts().sort_index()
ax3.bar(range(len(cluster_sizes)), cluster_sizes.values, color=colors[:len(cluster_sizes)])
ax3.set_xlabel('Cluster Number')
ax3.set_ylabel('Number of States')
ax3.set_title('Number of States in Each Crime Profile')
ax3.set_xticks(range(len(cluster_sizes)))

# Plot 4: Average crime rates by cluster
cluster_means = USArrests_clustered.groupby('Cluster')[['Murder', 'Assault', 'Rape']].mean()
cluster_means.plot(kind='bar', ax=ax4)
ax4.set_xlabel('Cluster Number')
ax4.set_ylabel('Average Rate')
ax4.set_title('Average Crime Rates by Profile')
ax4.legend()
ax4.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()
```

### Your Policy Brief Template

**EXECUTIVE SUMMARY: US State Crime Profile Analysis**

**Key Findings:**
- We identified [X] distinct crime profiles among US states
- [State examples] represent the highest-risk profile
- [State examples] represent the lowest-risk profile
- Urban population [does/does not] strongly correlate with violent crime

**Policy Recommendations:**
1. **High-Priority States**: [List and explain why]
2. **Resource Allocation**: [Suggest how to distribute federal crime prevention funds]
3. **Best Practice Sharing**: [Which states should learn from which others?]

**Methodology Note**: Analysis used unsupervised clustering on 4 crime variables across 50 states, with data standardization to ensure fair comparison.

**Deliverable**: A complete 1-page policy brief with your clustering insights and specific recommendations.
Reference files/week 7/Week7_Clustering Curriculum.docx
DELETED
Binary file (18.4 kB)
Reference files/week 7/Week7_Clustering Learning Objectives.docx
DELETED
Binary file (11.4 kB)
Reference files/week 7/w7_curriculum
DELETED
@@ -1,178 +0,0 @@
Unsupervised Learning: K-means and Hierarchical Clustering
1. Course Overview
The State Safety Profile Challenge
In this week, we'll explore unsupervised machine learning through a compelling real-world challenge: Understanding crime patterns across US states without any predetermined categories.
Unsupervised Learning: A type of machine learning where we find hidden patterns in data without being told what to look for. Think of it like being a detective who examines evidence without knowing what crime was committed - you're looking for patterns and connections that emerge naturally from the data.
Example: Instead of being told "find violent states vs. peaceful states," unsupervised learning lets the data reveal its own natural groupings, like "states with high murder but low assault" or "urban states with moderate crime."
Imagine you're a policy researcher working with the FBI's crime statistics. You have data on violent crime rates across all 50 US states - murder rates, assault rates, urban population percentages, and rape statistics. But here's the key challenge: you don't know how states naturally group together in terms of crime profiles.
Your Mission: Discover hidden patterns in state crime profiles without any predefined classifications!
The Challenge: Without any predetermined safety categories, you need to:
● Uncover natural groupings of states based on their crime characteristics
● Identify which crime factors tend to cluster together
● Understand regional patterns that might not follow obvious geographic boundaries
● Find states with surprisingly similar or different crime profiles
Cluster: A group of similar things. In our case, states that have similar crime patterns naturally group together in a cluster.
Example: You might discover that Alaska, Nevada, and Florida cluster together because they all have high crime rates despite being in different regions of the country.
Why This Matters: Traditional approaches might group states by region (South, Northeast, etc.) or population size. But what if crime patterns reveal different natural groupings? What if some Southern states cluster more closely with Western states based on crime profiles? What if urban percentage affects crime differently than expected?
Urban Percentage: The proportion of a state's population that lives in cities rather than rural areas.
Example: New York has a high urban percentage (87%) while Wyoming has a low urban percentage (29%).
What You'll Discover Through This Challenge
● Hidden State Safety Types: Use clustering to identify groups of states with similar crime profiles
● Crime Pattern Relationships: Find unexpected connections between different types of violent crime
● Urban vs. Rural Effects: Discover how urbanization relates to different crime patterns
● Policy Insights: Understand which states face similar challenges and might benefit from shared approaches
Clustering: The process of grouping similar data points together. It's like organizing your music library - songs naturally group by genre, but clustering might reveal unexpected groups like "workout songs" or "rainy day music" that cross traditional genre boundaries.
Core Techniques We'll Master
K-Means Clustering: A method that divides data into exactly K groups (where you choose the number K). It's like being asked to organize 50 students into exactly 4 study groups based on their academic interests.
Hierarchical Clustering: A method that creates a tree-like structure showing how data points relate to each other at different levels. It's like a family tree, but for data - showing which states are "cousins" and which are "distant relatives" in terms of crime patterns.
Both K-Means and Hierarchical Clustering are examples of unsupervised learning.

2. K-Means Clustering

What it does: Divides data into exactly K groups by finding central points (centroids).
Central Points (Centroids): The "center" or average point of each group. Think of it like the center of a basketball team huddle - it's the point that best represents where all the players are standing.
Example: If you have a cluster of high-crime states, the centroid might represent "average murder rate of 8.5, average assault rate of 250, average urban population of 70%."
USArrests Example: Analyzing crime data across 50 states, you might discover 4 distinct state safety profiles:
● High Crime States (above average in murder, assault, and rape rates)
● Urban Safe States (high urban population but lower violent crime rates)
● Rural Traditional States (low urban population, moderate crime rates)
● Mixed Profile States (high in some crime types but not others)
How to Read K-Means Results:
● Scatter Plot: Points (states) colored by cluster membership
○ Well-separated colors indicate distinct state profiles
○ Mixed colors suggest overlapping crime patterns
● Cluster Centers: Average crime characteristics of each state group
● Elbow Plot: Helps choose optimal number of state groupings
Cluster Membership: Which group each data point belongs to. Like being assigned to a team - each state gets assigned to exactly one crime profile group.
Example: Texas might be assigned to "High Crime States" while Vermont is assigned to "Rural Traditional States."
Scatter Plot: A graph where each point represents one observation (in our case, one state). Points that are close together have similar characteristics.
Elbow Plot: A graph that helps you choose the right number of clusters. It's called "elbow" because you look for a bend in the line that looks like an elbow joint.
Key Parameters:
python
# Essential parameters from the lab
KMeans(
    n_clusters=4,     # Number of state safety profiles to discover
    random_state=42,  # For reproducible results
    n_init=20         # Run algorithm 20 times, keep best result
)
Parameters: Settings that control how the algorithm works. Like settings on your phone - you can adjust them to get different results.
n_clusters: How many groups you want to create. You have to decide this ahead of time.
random_state: A number that ensures you get the same results every time you run the analysis. Like setting a specific starting point so everyone gets the same answer.
n_init: How many times to run the algorithm. The computer tries multiple starting points and picks the best result. More tries = better results.

3. Hierarchical Clustering
What it does: Creates a tree structure (dendrogram) showing how data points group together at different levels.
Dendrogram: A tree-like diagram that shows how groups form at different levels. Think of it like a family tree, but for data. At the bottom are individuals (states), and as you go up, you see how they group into families, then extended families, then larger clans.
Example: At the bottom level, you might see Vermont and New Hampshire grouped together. Moving up, they might join with Maine to form a "New England Low Crime" group. Moving up further, this group might combine with other regional groups.
USArrests Example: Analyzing state crime patterns might reveal:
● Level 1: High Crime vs. Low Crime states
● Level 2: Within high crime: Urban-driven vs. Rural-driven crime patterns
● Level 3: Within urban-driven: Assault-heavy vs. Murder-heavy profiles
How to Read Dendrograms:
● Height: Distance between groups when they merge
○ Higher merges = very different crime profiles
○ Lower merges = similar crime patterns
● Branches: Each split shows a potential state grouping
● Cutting the Tree: Draw a horizontal line to create clusters
Height: In a dendrogram, height represents how different two groups are. Think of it like difficulty level - it takes more "effort" (higher height) to combine very different groups.
Example: Combining two very similar states (like Vermont and New Hampshire) happens at low height. Combining very different groups (like "High Crime States" and "Low Crime States") happens at high height.
Cutting the Tree: Drawing a horizontal line across the dendrogram to create a specific number of groups. Like slicing a layer cake - where you cut determines how many pieces you get.
Three Linkage Methods:
● Complete Linkage: Measures distance between most different states (good for distinct profiles)
● Average Linkage: Uses average distance between all states (balanced approach)
● Single Linkage: Uses closest states (tends to create chains, often less useful)
Linkage Methods: Different ways to measure how close or far apart groups are. It's like different ways to measure the distance between two cities - you could use the distance between the farthest suburbs (complete), the average distance between all neighborhoods (average), or the distance between the closest points (single).
Example: When deciding if "High Crime Group" and "Medium Crime Group" should merge, complete linkage looks at the most different states between the groups, while average linkage looks at the typical difference.
Choosing Between K-Means and Hierarchical:
● Use K-Means when: You want to segment states into specific number of safety categories for policy targeting
● Use Hierarchical when: You want to explore the natural structure of crime patterns without assumptions
Segmentation: Dividing your data into groups for specific purposes. Like organizing students into study groups - you might want exactly 4 groups so each has a teaching assistant.
Exploratory Analysis: Looking at data to discover patterns without knowing what you'll find. Like being an explorer in uncharted territory - you're not looking for a specific destination, just seeing what interesting things you can discover.

4. Data Exploration
Step 1: Understanding Your Data
Essential Checks (from the USArrests example):
python
# Check the basic structure
print(data.shape)    # How many observations and variables?
print(data.columns)  # What variables do you have?
print(data.head())   # What do the first few rows look like?

# Examine the distribution
print(data.mean())      # Average values
print(data.var())       # Variability
print(data.describe())  # Full statistical summary
Observations: Individual data points we're studying. In our case, each of the 50 US states is one observation.
Variables: The characteristics we're measuring for each observation. In USArrests, we have 4 variables: Murder rate, Assault rate, Urban Population percentage, and Rape rate.
Example: For California (one observation), we might have Murder=9.0, Assault=276, UrbanPop=91, Rape=40.6 (four variables).
Distribution: How values are spread out. Like looking at test scores in a class - are most scores clustered around the average, or spread out widely?
Variability (Variance): How much the values differ from each other. High variance means values are spread out; low variance means they're clustered together.
Why This Matters: The USArrests data showed vastly different scales:
● Murder: Average 7.8, Variance 19
● Assault: Average 170.8, Variance 6,945
● This scale difference would dominate any analysis without preprocessing
Scales: The range and units of measurement for different variables. Like comparing dollars ($50,000 salary) to percentages (75% approval rating) - they're measured very differently.
Example: Assault rates are in the hundreds (like 276 per 100,000) while murder rates are single digits (like 7.8 per 100,000). Without adjustment, assault would seem much more important just because the numbers are bigger.
Step 2: Data Preprocessing
Standardization (Critical for clustering):
python
from sklearn.preprocessing import StandardScaler

# Always scale when variables have different units
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
Standardization: Converting all variables to the same scale so they can be fairly compared. Like converting all measurements to the same units - instead of comparing feet to meters, you convert everything to inches.
StandardScaler: A tool that transforms data so each variable has an average of 0 and standard deviation of 1. Think of it like grading on a curve - it makes all variables equally important.
Example: After standardization, a murder rate of 7.8 might become 0.2, and an assault rate of 276 might become 1.5. Now they're on comparable scales.
When to Scale:
● ✅ Always scale when variables have different units (dollars vs. percentages)
● ✅ Scale when variances differ by orders of magnitude
● ❓ Consider not scaling when all variables are in the same meaningful units
Orders of Magnitude: When one number is 10 times, 100 times, or 1000 times bigger than another. In USArrests, assault variance (6,945) is about 365 times bigger than murder variance (19) - that's two orders of magnitude difference.
Step 3: Exploratory Analysis
For K-Means Clustering:
python
# Try different numbers of clusters to find optimal K
inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
    kmeans.fit(data_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares')
plt.title('Elbow Method for Optimal K')
Inertias: A measure of how tightly grouped each cluster is. Lower inertia means points in each cluster are closer together (better clustering). It's like measuring how close teammates stand to each other - closer teammates indicate better team cohesion.
Within-Cluster Sum of Squares: The total distance from each point to its cluster center. Think of it as measuring how far each student sits from their group's center - smaller distances mean tighter, more cohesive groups.
Elbow Method: A technique for choosing the best number of clusters. You plot the results and look for the "elbow" - the point where adding more clusters doesn't help much anymore.
For Hierarchical Clustering:
python
# Create dendrogram to explore natural groupings
from sklearn.cluster import AgglomerativeClustering
from ISLP.cluster import compute_linkage
from scipy.cluster.hierarchy import dendrogram

hc = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='complete')
hc.fit(data_scaled)
linkage_matrix = compute_linkage(hc)

plt.figure(figsize=(12, 8))
dendrogram(linkage_matrix, color_threshold=-np.inf, above_threshold_color='black')
plt.title('Hierarchical Clustering Dendrogram')
AgglomerativeClustering: A type of hierarchical clustering that starts with individual points and gradually combines them into larger groups. Like building a pyramid from the bottom up.
distance_threshold=0: A setting that tells the algorithm to build the complete tree structure without stopping early.
Linkage Matrix: A mathematical representation of how the tree structure was built. Think of it as the blueprint showing how the dendrogram was constructed.
Step 4: Validation Questions
Before proceeding with analysis, ask:
1. Do the variables make sense together? (e.g., don't cluster height with income)
2. Are there obvious outliers that need attention?
3. Do you have enough data points? (Rule of thumb: at least 10x more observations than variables)
4. Are there missing values that need handling?
Outliers: Data points that are very different from all the others. Like a 7-foot-tall person in a group of average-height people - they're so different they might skew your analysis.
Example: If most states have murder rates between 1-15, but one state has a rate of 50, that's probably an outlier that needs special attention.
Missing Values: Data points where we don't have complete information. Like a student who didn't take one of the tests - you need to decide how to handle that gap in the data.
Rule of Thumb: A general guideline that works in most situations. For clustering, having at least 10 times more observations than variables helps ensure reliable results.
app/__pycache__/main.cpython-311.pyc
CHANGED
Binary files a/app/__pycache__/main.cpython-311.pyc and b/app/__pycache__/main.cpython-311.pyc differ
app/main.py
CHANGED
@@ -24,6 +24,7 @@ from app.pages import week_4
 from app.pages import week_5
 from app.pages import week_6
 from app.pages import week_7
+from app.pages import week_8
 # Page configuration
 st.set_page_config(
     page_title="Data Science Course App",
@@ -165,6 +166,8 @@ def show_week_content():
         week_6.show()
     elif st.session_state.current_week == 7:
         week_7.show()
+    elif st.session_state.current_week == 8:
+        week_8.show()
     else:
         st.warning("Content for this week is not yet available.")

@@ -177,7 +180,7 @@ def main():
         return

     # User is logged in, show course content
-    if st.session_state.current_week in [1, 2, 3, 4, 5, 6, 7]:
+    if st.session_state.current_week in [1, 2, 3, 4, 5, 6, 7, 8]:
         show_week_content()
     else:
         st.title("Data Science Research Paper Course")
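The change above wires the new week-8 page into both the `elif` chain and the allowed-weeks list. A dict-based router is one way to avoid editing two places every week; the sketch below is illustrative only, not the code in this repo, and it assumes the earlier week modules follow the same `week_N` naming.

```python
import streamlit as st
from app.pages import week_1, week_2, week_3, week_4, week_5, week_6, week_7, week_8

# One registry instead of a growing elif chain plus a separate allowed-weeks list.
WEEK_PAGES = {
    1: week_1, 2: week_2, 3: week_3, 4: week_4,
    5: week_5, 6: week_6, 7: week_7, 8: week_8,
}

def show_week_content():
    page = WEEK_PAGES.get(st.session_state.current_week)
    if page is None:
        st.warning("Content for this week is not yet available.")
    else:
        page.show()
```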
app/pages/__pycache__/week_8.cpython-311.pyc
ADDED
Binary file (27.8 kB)
app/pages/week_8.py
ADDED
@@ -0,0 +1,564 @@
1 |
+
import streamlit as st
|
2 |
+
import pandas as pd
|
3 |
+
import numpy as np
|
4 |
+
import matplotlib.pyplot as plt
|
5 |
+
import seaborn as sns
|
6 |
+
import plotly.express as px
|
7 |
+
import plotly.graph_objects as go
|
8 |
+
from plotly.subplots import make_subplots
|
9 |
+
|
10 |
+
def show():
|
11 |
+
st.title("Week 8: Research Paper Writing and LaTeX")
|
12 |
+
|
13 |
+
# Introduction
|
14 |
+
st.header("Learning Objectives")
|
15 |
+
st.markdown("""
|
16 |
+
By the end of this week, you will be able to:
|
17 |
+
|
18 |
+
**Remember (Knowledge):**
|
19 |
+
- Recall LaTeX syntax for document structure, figures, citations, and spacing
|
20 |
+
- Identify components of ML research papers (introduction, methods, results, conclusion, limitations)
|
21 |
+
- Recognize standard formatting requirements for academic conferences and journals
|
22 |
+
|
23 |
+
**Understand (Comprehension):**
|
24 |
+
- Describe the purpose and audience for each section of a research paper
|
25 |
+
|
26 |
+
**Apply (Application):**
|
27 |
+
- Format complete research papers in LaTeX with proper figures, tables, and citations
|
28 |
+
- Write clear methodology sections with sufficient detail for reproducibility
|
29 |
+
- Present experimental results using appropriate visualizations and statistical analysis
|
30 |
+
|
31 |
+
**Analyze (Analysis):**
|
32 |
+
- Diagnose LaTeX formatting issues and resolve compilation errors
|
33 |
+
- Examine related work to identify research gaps and position their contributions
|
34 |
+
- Compare methodology approaches with existing methods
|
35 |
+
|
36 |
+
**Evaluate (Evaluation):**
|
37 |
+
- Critically assess the validity and reliability of experimental design
|
38 |
+
- Evaluate the clarity and persuasiveness of written arguments
|
39 |
+
|
40 |
+
**Create (Synthesis):**
|
41 |
+
- Produce research papers
|
42 |
+
- Develop compelling visualizations that effectively communicate complex ML concepts
|
43 |
+
- Synthesize technical knowledge into coherent research narratives
|
44 |
+
""")
|
45 |
+
|
46 |
+
# Module 1: Research Paper Architecture
|
47 |
+
st.header("Module 1: Research Paper Architecture")
|
48 |
+
|
49 |
+
st.markdown("""
|
50 |
+
Every section of your paper must answer specific questions that reviewers ask. Think of your paper as a conversation
|
51 |
+
with skeptical experts who need convincing.
|
52 |
+
""")
|
53 |
+
|
54 |
+
# Paper Structure Table
|
55 |
+
st.subheader("Research Paper Structure")
|
56 |
+
|
57 |
+
paper_structure = {
|
58 |
+
"Section": ["🔥 Introduction", "🔬 Methods", "📊 Results", "🎯 Conclusion", "⚠️ Limitations"],
|
59 |
+
"Key Problems/Focus": [
|
60 |
+
"What problem are you solving? Why does it matter? How is your approach different?",
|
61 |
+
"How did you collect data? What analysis techniques? Can others replicate this?",
|
62 |
+
"What concrete findings emerged? How do they address your research questions?",
|
63 |
+
"What's the key takeaway? How does this advance the field? What are practical implications?",
|
64 |
+
"What are honest constraints? What biases might exist? What couldn't you address?"
|
65 |
+
],
|
66 |
+
"Aim For": [
|
67 |
+
"Compelling motivation",
|
68 |
+
"Rigorous reproducibility",
|
69 |
+
"Clear evidence",
|
70 |
+
"Lasting impact",
|
71 |
+
"Honest transparency"
|
72 |
+
]
|
73 |
+
}
|
74 |
+
|
75 |
+
st.dataframe(pd.DataFrame(paper_structure))
|
76 |
+
|
77 |
+
# Detailed Section Guidelines
|
78 |
+
st.subheader("Detailed Section Guidelines")
|
79 |
+
|
80 |
+
# Introduction Section
|
81 |
+
with st.expander("🔥 Introduction: Building Compelling Motivation"):
|
82 |
+
st.markdown("""
|
83 |
+
**What is it:** The introduction is your paper's first impression and often determines whether reviewers continue reading.
|
84 |
+
|
85 |
+
**Why this matters:** A weak introduction leads to immediate rejection, regardless of how brilliant your technical contribution might be.
|
86 |
+
|
87 |
+
**What to do:**
|
88 |
+
1. Use the "inverted pyramid" approach
|
89 |
+
2. Start with broad context, then narrow to specific problem
|
90 |
+
3. Clearly articulate the gap in existing solutions
|
91 |
+
4. Present your approach as a logical response
|
92 |
+
5. Conclude with explicit contributions (3-4 bullet points)
|
93 |
+
|
94 |
+
**Example Structure:**
|
95 |
+
```
|
96 |
+
1. Broad context about the field
|
97 |
+
2. Specific problem you're addressing
|
98 |
+
3. Gap in existing solutions
|
99 |
+
4. Your approach as response to gap
|
100 |
+
5. Explicit contributions
|
101 |
+
```
|
102 |
+
""")
|
103 |
+
|
104 |
+
# Methods Section
|
105 |
+
with st.expander("🔬 Methods: Ensuring Rigorous Reproducibility"):
|
106 |
+
st.markdown("""
|
107 |
+
**What is it:** The methods section has evolved from simple description to detailed documentation that enables complete replication.
|
108 |
+
|
109 |
+
**Why this matters:** Irreproducible research wastes community resources and undermines scientific credibility.
|
110 |
+
|
111 |
+
**What to document:**
|
112 |
+
- Dataset specifics (exact version, preprocessing steps, train/validation/test splits)
|
113 |
+
- Model architecture details (layer sizes, activation functions, initialization schemes)
|
114 |
+
- Training procedures (optimization algorithm, learning rate schedules, batch sizes)
|
115 |
+
- Computational environment (hardware specifications, software versions, random seeds)
|
116 |
+
|
117 |
+
**Write as if creating a recipe** that a competent colleague could follow to recreate your exact results.
|
118 |
+
""")
|
119 |
+
|
120 |
+
# Results Section
|
121 |
+
with st.expander("📊 Results: Presenting Clear Evidence"):
|
122 |
+
st.markdown("""
|
123 |
+
**What is it:** The results section synthesizes your raw findings into compelling evidence for your claims.
|
124 |
+
|
125 |
+
**Why this matters:** This section proves whether your methodology actually works and answers your research questions.
|
126 |
+
|
127 |
+
**What to do:**
|
128 |
+
1. Organize results logically (general performance to specific analyses)
|
129 |
+
2. Start with overall model performance using standard metrics
|
130 |
+
3. Include detailed comparisons, ablation studies, and error analysis
|
131 |
+
4. Use clear visualizations with appropriate error bars
|
132 |
+
5. Report negative results honestly
|
133 |
+
6. Connect each finding back to your original research questions
|
134 |
+
""")

# Conclusion Section
with st.expander("🎯 Conclusion: Creating Lasting Impact"):
    st.markdown("""
**What is it:** The conclusion shapes how the research community understands and remembers your contribution.

**Why this matters:** Your technical contribution only matters if others can understand its significance and apply it.

**What to do:**
1. Begin with a concise summary of key findings (2-3 sentences)
2. State how the findings advance theoretical understanding or practical applications
3. Discuss broader implications beyond your specific problem domain
4. Suggest concrete directions for future research
5. Balance confidence with humility about scope
""")

# Limitations Section
with st.expander("⚠️ Limitations: Demonstrating Honest Transparency"):
    st.markdown("""
**What is it:** Acknowledging limitations shows scientific maturity and helps readers appropriately interpret your findings.

**Why this matters:** Every study has constraints, and attempting to hide them makes reviewers suspicious.

**Three types of limitations to address:**
1. **Scope limitations:** What populations, contexts, or problem types might your results not apply to?
2. **Methodological constraints:** Sample size issues, measurement limitations, or experimental design trade-offs
3. **Potential biases:** Dataset bias, researcher bias, or systematic errors in your approach

**For each limitation:** Explain its potential impact and suggest how future work could address it.
""")

# Quick Reference Framework
st.subheader("Quick Reference Framework")
st.markdown("""
**Title → Problem → Gap → Method → Findings → Impact → Limitations**

This progression ensures logical flow and helps readers follow your research narrative from motivation through contribution to appropriate interpretation.
""")

# Module 2: LaTeX Introduction
st.header("Module 2: Introduction to LaTeX")

st.markdown("""
**What is LaTeX?**

Think of LaTeX as a sophisticated word processor that works differently from Microsoft Word or Google Docs.
Instead of clicking buttons to format text, you write commands that tell the computer how to format your document.
""")

# Why LaTeX
st.subheader("Why Learn LaTeX for Academic Writing?")

latex_benefits = {
    "Benefit": [
        "Professional appearance",
        "Mathematical notation",
        "Reference management",
        "Industry standard"
    ],
    "Description": [
        "LaTeX automatically handles spacing, fonts, and layout to meet academic standards",
        "Essential for ML papers with equations and formulas",
        "Automatically formats citations and bibliographies",
        "Most computer science conferences and journals expect LaTeX submissions"
    ]
}

st.dataframe(pd.DataFrame(latex_benefits))

# LaTeX Code Examples
st.subheader("LaTeX Code Examples")

# Basic Structure
with st.expander("Basic Document Structure"):
    st.markdown("**LaTeX Code:**")
    st.code("""
\\documentclass{article}
\\usepackage[utf8]{inputenc}
\\usepackage{graphicx}
\\title{Your Research Paper Title}
\\author{Your Name}
\\date{\\today}
\\begin{document}
\\maketitle
\\section{Introduction}
Your introduction text goes here.
\\section{Methods}
Your methods section goes here.
\\section{Results}
Your results section goes here.
\\section{Conclusion}
Your conclusion goes here.
\\end{document}
""", language="latex")

    st.markdown("**Rendered Output:**")
    st.markdown("""
<div style="border: 1px solid #ccc; padding: 40px; margin: 20px auto; background-color: white; font-family: 'Times New Roman', Times, serif; color: black; box-shadow: 0 0 10px rgba(0,0,0,0.1); max-width: 800px;">
<h1 style="text-align: center; font-size: 22px; font-weight: bold; margin-bottom: 10px; color: black;">Your Research Paper Title</h1>
<p style="text-align: center; font-size: 16px; margin-bottom: 30px; color: black;"><em>Your Name</em><br><em>Today's Date</em></p>
<h2 style="font-size: 18px; font-weight: bold; margin-top: 20px; margin-bottom: 10px; color: black;">1. Introduction</h2>
<p style="font-size: 16px; line-height: 1.6; color: black;">Your introduction text goes here.</p>
<h2 style="font-size: 18px; font-weight: bold; margin-top: 20px; margin-bottom: 10px; color: black;">2. Methods</h2>
<p style="font-size: 16px; line-height: 1.6; color: black;">Your methods section goes here.</p>
<h2 style="font-size: 18px; font-weight: bold; margin-top: 20px; margin-bottom: 10px; color: black;">3. Results</h2>
<p style="font-size: 16px; line-height: 1.6; color: black;">Your results section goes here.</p>
<h2 style="font-size: 18px; font-weight: bold; margin-top: 20px; margin-bottom: 10px; color: black;">4. Conclusion</h2>
<p style="font-size: 16px; line-height: 1.6; color: black;">Your conclusion goes here.</p>
</div>
""", unsafe_allow_html=True)

# Sections and Subsections
with st.expander("Creating Sections and Subsections"):
    st.markdown("**LaTeX Code:**")
    st.code("""
\\section{Introduction} % Creates: 1. Introduction
\\subsection{Background} % Creates: 1.1 Background
\\subsubsection{Deep Learning} % Creates: 1.1.1 Deep Learning

% Tip: Overleaf shows section structure in the left panel for easy navigation
""", language="latex")

    st.markdown("**Rendered Output:**")
    st.markdown("""
<div style="border: 1px solid #ddd; padding: 20px; background-color: white; font-family: 'Times New Roman', serif; color: black;">
<h2 style="color: black; font-size: 18px; font-weight: bold;">1. Introduction</h2>
<h3 style="color: black; font-size: 16px; font-weight: bold; padding-left: 20px;">1.1 Background</h3>
<h4 style="color: black; font-size: 16px; font-style: italic; padding-left: 40px;">1.1.1 Deep Learning</h4>
<p style="color: black; padding-left: 40px; margin-top: 10px;"><em>Tip: Overleaf shows section structure in the left panel for easy navigation</em></p>
</div>
""", unsafe_allow_html=True)

# Figures
with st.expander("Adding Figures"):
    st.markdown("**LaTeX Code:**")
    st.code("""
\\begin{figure}[h]
\\centering
\\includegraphics[width=0.8\\textwidth]{research_question.jpg}
\\caption{The cycle of research from practical problem to research answer.}
\\label{fig:research_cycle}
\\end{figure}

% Reference it in your text
Figure~\\ref{fig:research_cycle} shows the relationship between problems, questions, and answers.
""", language="latex")

    st.markdown("**Rendered Output:**")

    # Center the image using columns
    col1, col2, col3 = st.columns([1, 2, 1])
    with col2:
        st.image("assets/Pictures/research_question.jpg", width=384, caption="Figure 1: The cycle of research from practical problem to research answer.")

    st.markdown("""
<div style="background-color: white; font-family: 'Times New Roman', serif; color: black; padding: 0 20px 20px 20px;">
<p style="color: black; text-align: left;">Figure 1 shows the relationship between problems, questions, and answers.</p>
</div>
""", unsafe_allow_html=True)

# Citations and Bibliography
with st.expander("Citations and Bibliography"):
    st.markdown("**LaTeX Code:**")
    st.code("""
% In your main document
\\usepackage{biblatex}
\\addbibresource{sample.bib}

% Cite a reference
Our approach builds on recent work \\cite{einstein} and extends it by...

% Print bibliography
\\printbibliography

% In sample.bib file:
@article{einstein,
title={On the electrodynamics of moving bodies},
author={Einstein, Albert},
journal={Annalen der Physik},
volume={322},
number={10},
pages={891--921},
year={1905}
}
""", language="latex")

    st.markdown("**Rendered Output:**")
    st.markdown("""
<div style="border: 1px solid #ddd; padding: 20px; background-color: white; font-family: Times, 'Times New Roman', serif; color: black;">
<p style="color: black; margin-bottom: 1.5em;">Our approach builds on recent work [1] and extends it by...</p>
<h3 style="color: black; margin-bottom: 0.5em; font-weight: bold;">References</h3>
<p style="line-height: 1.6; padding-left: 2em; text-indent: -2em;">
[1] A. Einstein, "On the electrodynamics of moving bodies," <i>Annalen der Physik</i>, vol. 322, no. 10, pp. 891–921, 1905.
</p>
</div>
""", unsafe_allow_html=True)

# Mathematical Equations
with st.expander("Mathematical Equations"):
    st.markdown("**LaTeX Code:**")
    st.code("""
% Inline math
The loss function $L(\\theta)$ is defined as...

% Display math
\\begin{equation}
L(\\theta) = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - f(x_i, \\theta))^2
\\label{eq:loss}
\\end{equation}

% Reference the equation
As shown in Equation~\\ref{eq:loss}, the loss function...
""", language="latex")

    st.markdown("**Rendered Output:**")
    st.markdown("""
<div style="border: 1px solid #ddd; padding: 20px; background-color: white; font-family: 'Times New Roman', serif; color: black;">
<p style="color: black;">The loss function <em>L(θ)</em> is defined as...</p>
<div style="display: flex; justify-content: space-between; align-items: center; margin: 20px 0;">
<div style="flex-grow: 1; text-align: center;">
<img src="https://latex.codecogs.com/svg.latex?L(\\theta)%20=%20\\frac{1}{n}%20\\sum_{i=1}^{n}%20(y_i%20-%20f(x_i,%20\\theta))^2" />
</div>
<div style="font-style: italic; color: black;">(1)</div>
</div>
<p style="color: black;">As shown in Equation (1), the loss function...</p>
</div>
""", unsafe_allow_html=True)

# Tables
with st.expander("Creating Tables"):
    st.markdown("**LaTeX Code:**")
    st.code("""
\\begin{table}[h]
\\centering
\\begin{tabular}{|l|c|r|}
\\hline
\\textbf{Method} & \\textbf{Accuracy} & \\textbf{Time (s)} \\\\
\\hline
Baseline & 85.2\\% & 120 \\\\
Our Method & 89.7\\% & 95 \\\\
\\hline
\\end{tabular}
\\caption{Performance comparison of different methods}
\\label{tab:results}
\\end{table}

% Reference the table
Table~\\ref{tab:results} shows the performance comparison...
""", language="latex")

    st.markdown("**Rendered Output:**")
    st.markdown("""
<div style="border: 1px solid #ddd; padding: 20px; background-color: white; font-family: 'Times New Roman', serif; color: black;">
<div style="text-align: center; margin: 20px 0;">
<table style="border-collapse: collapse; width: 100%; max-width: 500px; margin: 0 auto;">
<tr style="border: 1px solid #000;">
<th style="border: 1px solid #000; padding: 8px; text-align: left; font-weight: bold;">Method</th>
<th style="border: 1px solid #000; padding: 8px; text-align: center; font-weight: bold;">Accuracy</th>
<th style="border: 1px solid #000; padding: 8px; text-align: right; font-weight: bold;">Time (s)</th>
</tr>
<tr style="border: 1px solid #000;">
<td style="border: 1px solid #000; padding: 8px; text-align: left;">Baseline</td>
<td style="border: 1px solid #000; padding: 8px; text-align: center;">85.2%</td>
<td style="border: 1px solid #000; padding: 8px; text-align: right;">120</td>
</tr>
<tr style="border: 1px solid #000;">
<td style="border: 1px solid #000; padding: 8px; text-align: left;">Our Method</td>
<td style="border: 1px solid #000; padding: 8px; text-align: center;">89.7%</td>
<td style="border: 1px solid #000; padding: 8px; text-align: right;">95</td>
</tr>
</table>
<p style="margin-top: 10px; font-style: italic; color: black;">Table 1: Performance comparison of different methods</p>
</div>
<p style="color: black;">Table 1 shows the performance comparison...</p>
</div>
""", unsafe_allow_html=True)

# Interactive LaTeX Practice
st.header("Interactive LaTeX Practice")

st.markdown("""
Let's practice some common LaTeX commands. Try these exercises:
""")

# Exercise 1: Basic Document
with st.expander("Exercise 1: Create a Basic Document"):
    st.markdown("""
**Task:** Create a basic LaTeX document with a title, author, and three sections.

**Steps:**
1. Open Overleaf and create a new project
2. Replace the default content with your own
3. Add a title and your name
4. Create three sections: Introduction, Methods, Results
5. Add some placeholder text to each section
6. Compile to see your PDF
""")

    st.code("""
\\documentclass{article}
\\title{My First LaTeX Document}
\\author{Your Name}
\\date{\\today}

\\begin{document}
\\maketitle

\\section{Introduction}
This is the introduction section.

\\section{Methods}
This is the methods section.

\\section{Results}
This is the results section.

\\end{document}
""", language="latex")

# Exercise 2: Adding Figures
with st.expander("Exercise 2: Adding a Figure"):
    st.markdown("""
**Task:** Add a figure to your document.

**Steps:**
1. Upload an image to your Overleaf project
2. Add the figure code to your document
3. Add a caption and label
4. Reference the figure in your text
""")

    st.code("""
\\begin{figure}[h]
\\centering
\\includegraphics[width=0.7\\textwidth]{your-image.png}
\\caption{Description of your figure}
\\label{fig:example}
\\end{figure}

As shown in Figure~\\ref{fig:example}, our results demonstrate...
""", language="latex")

# Exercise 3: Citations
with st.expander("Exercise 3: Adding Citations"):
    st.markdown("""
**Task:** Add citations to your document.

**Steps:**
1. Create a .bib file with your references
2. Add the bibliography package to your document
3. Add citations in your text
4. Include the bibliography at the end
""")

    st.code("""
% In your main document
\\usepackage{biblatex}
\\addbibresource{references.bib}

% Add citations
Recent work \\cite{smith2023} has shown that...

\\printbibliography

% In references.bib:
@article{smith2023,
title={Recent advances in machine learning},
author={Smith, John and Johnson, Jane},
journal={Journal of ML Research},
year={2023}
}
""", language="latex")

# Common LaTeX Issues and Solutions
st.header("Common LaTeX Issues and Solutions")

issues_solutions = {
    "Issue": [
        "Document won't compile",
        "Figure not appearing",
        "Citations not showing",
        "Math equations not rendering",
        "Bibliography not generating"
    ],
    "Common Cause": [
        "Missing closing brace or bracket",
        "Wrong filename or path",
        "Missing \\printbibliography command",
        "Missing math mode delimiters",
        "Missing \\addbibresource command"
    ],
    "Solution": [
        "Check for matching braces and brackets",
        "Verify the filename and upload the image to Overleaf",
        "Add \\printbibliography at the end of the document",
        "Use $...$ for inline math, \\begin{equation} for display math",
        "Add \\addbibresource{filename.bib} to the preamble"
    ]
}

st.dataframe(pd.DataFrame(issues_solutions))
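
# Added illustrative example (not part of the original course text): what two of the fixes
# in the table above look like in practice. The snippets are placeholders, not from a real paper.
with st.expander("Example: before and after for two common errors"):
    st.code("""
% Won't compile: missing closing brace
\\textbf{Key result    % <-- brace never closed
% Fixed:
\\textbf{Key result}

% Math not rendering: missing math-mode delimiters
The loss L(\\theta) decreases over training.
% Fixed (inline math):
The loss $L(\\theta)$ decreases over training.
""", language="latex")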

# Best Practices
st.header("Best Practices for Research Paper Writing")

st.markdown("""
**Writing Tips:**
1. **Start with an outline** - Plan your paper structure before writing
2. **Write the methods first** - It's usually the easiest section
3. **Use clear, concise language** - Avoid jargon when possible
4. **Be specific** - Use concrete numbers and examples
5. **Revise multiple times** - Good writing is rewriting

**LaTeX Tips:**
1. **Compile frequently** - Catch errors early
2. **Use meaningful labels** - fig:results is better than fig:1
3. **Keep backups** - Version control your LaTeX files
4. **Use templates** - Start with conference/journal templates
5. **Learn keyboard shortcuts** - Speed up your workflow
""")

# Additional Resources
st.header("Additional Resources")
st.markdown("""
**LaTeX Resources:**
- [Overleaf Documentation](https://www.overleaf.com/learn)
- [LaTeX Wikibook](https://en.wikibooks.org/wiki/LaTeX)
- [CTAN (Comprehensive TeX Archive Network)](https://ctan.org/)

""")