raymondEDS committed on
Commit
31653a7
·
1 Parent(s): 46e47b6

week 8 writing

Reference files/Copy_Lab_5_hands_on_peer_review.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
Reference files/Data_cleaning_lab.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
Reference files/W8 - Curriculum Content.md ADDED
The diff for this file is too large to render. See raw diff
 
Reference files/W8 - Learning Objectives on Writing Paper.md ADDED
@@ -0,0 +1,45 @@
+ ## **Remember (Knowledge)**
+
+ Students will be able to:
+
+ * Recall LaTeX syntax for document structure, figures, citations, and spacing
+ * Identify components of ML research papers (introduction, methods, results, conclusion, limitations)
+ * Recognize standard formatting requirements for academic conferences and journals
+
+ ## **Understand (Comprehension)**
+
+ Students will be able to:
+
+ * Describe the purpose and audience for each section of a research paper
+
+ ## **Apply (Application)**
+
+ Students will be able to:
+
+ * Format complete research papers in LaTeX with proper figures, tables, and citations
+ * Write clear methodology sections with sufficient detail for reproducibility
+ * Present experimental results using appropriate visualizations and statistical analysis
+
+ ## **Analyze (Analysis)**
+
+ Students will be able to:
+
+ * Diagnose LaTeX formatting issues and resolve compilation errors (if applicable)
+ * Examine related work to identify research gaps and position their contributions
+ * Compare their methodology approaches with existing methods
+
+ ## **Evaluate (Evaluation)**
+
+ Students will be able to:
+
+ * Critically assess the validity and reliability of their experimental design
+ * Evaluate the clarity and persuasiveness of their written arguments
+
+ ## **Create (Synthesis)**
+
+ Students will be able to:
+
+ * Produce research papers
+ * Develop compelling visualizations that effectively communicate complex ML concepts
+ * Synthesize technical knowledge into coherent research narratives
+
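As a point of reference for the LaTeX-related objectives above, the syntax for document structure, figures, and citations that students are expected to recall looks roughly like the following minimal sketch; the file names (`results_plot.pdf`, `references.bib`) and the citation key are illustrative placeholders, not course materials:

```latex
\documentclass{article}
\usepackage{graphicx}   % figure inclusion
\usepackage{booktabs}   % nicer tables

\begin{document}

\section{Methods}
We describe the experimental setup here.

% A figure with a caption and a label for cross-referencing
\begin{figure}[t]
  \centering
  \includegraphics[width=0.8\linewidth]{results_plot.pdf}
  \caption{Validation accuracy over training epochs.}
  \label{fig:results}
\end{figure}

As shown in Figure~\ref{fig:results}, accuracy improves with training,
consistent with prior work \cite{examplekey2024}.

\bibliographystyle{plain}
\bibliography{references}   % entries live in references.bib

\end{document}
```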
Reference files/Week_4_content.txt DELETED
@@ -1,630 +0,0 @@
1
-
2
- In this course, you'll learn the complete NLP workflow by exploring a fascinating real-world question: Do review length and language relate to reviewer ratings and decisions in academic peer review? If so, how?
3
- Using data from the International Conference on Learning Representations (ICLR), you'll develop practical NLP skills while investigating how reviewers express their opinions. Each module builds upon the previous one, creating a coherent analytical pipeline from raw data to insight.
4
- Learning Path
5
- Data Loading and Initial Exploration: Setting up your environment and understanding your dataset
6
- Text Preprocessing and Normalization: Cleaning and standardizing text data
7
- Feature Extraction and Measurement: Calculating metrics from text
8
- Visualization and Pattern Recognition: Creating insightful visualizations
9
- Drawing Conclusions from Text Analysis: Synthesizing findings into actionable insights
10
- Let's begin our exploration of how NLP can provide insights into academic peer review!
11
-
12
- Module 1: Initial Exploration
13
- The Challenge
14
- Before we can analyze how review length relates to paper evaluations, we need to understand our dataset. In this module, we'll set up our Python environment and explore the ICLR conference data.
15
- 1.1: Set up and get to your data
16
- The first step in any NLP project is loading and understanding your data. Let's set up our environment and examine what we're working with:
17
- python
18
- # Import necessary libraries
19
- import pandas as pd
20
- import numpy as np
21
- import matplotlib.pyplot as plt
22
- import seaborn as sns
23
- import string
24
- from nltk.corpus import stopwords
25
- from nltk.tokenize import word_tokenize, sent_tokenize
26
- from wordcloud import WordCloud
27
-
28
- # Load the datasets
29
- df_reviews = pd.read_csv('../data/reviews.csv')
30
- df_submissions = pd.read_csv('../data/Submissions.csv')
31
- df_dec = pd.read_csv('../data/decision.csv')
32
- df_keyword = pd.read_csv('../data/submission_keyword.csv')
33
- Let's look at the first few rows of each dataset to understand what information we have:
34
- python
35
- # View the first few rows of the submissions dataset
36
- df_submissions.head()
37
- # View the first few rows of the reviews dataset
38
- df_reviews.head()
39
- # View all columns and rows in the reviews dataset
40
- df_reviews
41
- # View the first few rows of the keywords dataset
42
- df_keyword.head()
43
- 1.2: Looking at Review Content
44
- Let's examine an actual review to understand the text we'll be analyzing:
45
- python
46
- # Display a sample review
47
- df_reviews['review'][1]
48
- Think about: What kinds of information do you see in this review? What language patterns do you notice?
49
- 1.3: Calculating Basic Metrics
50
- Let's calculate our first simple metric - the average review score for each paper:
51
- python
52
- # Get the average review score for each paper
53
- df_average_review_score = df_reviews.groupby('forum')['rating_int'].mean().reset_index()
54
- df_average_review_score
55
- Key Insight: Each paper (identified by 'forum') receives multiple reviews with different scores. The average score gives us an overall assessment of each paper.
56
- Module 2: Data Integration
57
- In this module, we'll merge datasets for later analysis.
58
- 2.1 Understanding the Need for Data Integration
59
- In many NLP projects, the data we need is spread across multiple files or tables. In our case:
60
- The df_reviews dataset contains the review text and ratings
61
- The df_dec dataset contains the final decisions for each paper
62
- To analyze how review text relates to paper decisions, we need to merge these datasets.
63
- 2.2 Performing a Dataset Merge
64
- Let's combine our review data with the decision data:
65
- python
66
- # Step 1 - Merge the reviews dataframe with the decisions dataframe
67
- df_rev_dec = pd.merge(
68
- df_reviews, # First dataframe (reviews)
69
- df_dec, # Second dataframe (decisions)
70
- left_on='forum', # Join key in the first dataframe
71
- right_on='forum', # Join key in the second dataframe
72
- how='inner' # Keep only matching rows
73
- )[['review','decision','conf_name_y','rating_int','forum']] # Select only these columns
74
- # Display the first few rows of the merged dataframe
75
- df_rev_dec.head()
76
- 2.3 Understanding Merge Concepts
77
- Join Key: The 'forum' column identifies the paper and connects our datasets
78
- Inner Join: Only keeps papers that appear in both datasets
79
- Column Selection: We keep only relevant columns for our analysis
80
- How to Verify: Always check the shape of your merged dataset to ensure you haven't lost data unexpectedly
81
- Try it yourself: How many rows does the merged dataframe have compared to the original review dataframe? What might explain any differences?
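One quick way to answer this (a minimal sketch reusing the dataframes defined above) is to compare row counts directly:

```python
# Rows before and after the merge; a difference means some reviews had no
# matching decision (or vice versa) and were dropped by the inner join
print("Reviews:", df_reviews.shape[0])
print("Merged reviews + decisions:", df_rev_dec.shape[0])
```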
82
-
83
- Module 3: Basic Text Preprocessing
84
- In this module, you'll learn essential data preprocessing techniques for NLP projects. We'll standardize text through case folding, clean up categorical variables, and prepare our review text for analysis.
85
- 3.1 Case Folding (Lowercase Conversion)
86
- A fundamental text preprocessing step is converting all text to lowercase to ensure consistency:
87
- python
88
- # Convert all review text to lowercase (case folding)
89
- df_rev_dec['review'] = df_rev_dec['review'].str.lower()
90
- # Display the updated dataframe
91
- df_rev_dec
92
- Why Case Folding Matters
93
- Consistency: "Novel" and "novel" will be treated as the same word
94
- Reduced Dimensionality: Fewer unique tokens to process
95
- Improved Pattern Recognition: Easier to identify word frequencies and patterns
96
- Note: While case folding is generally helpful, it can sometimes remove meaningful distinctions (e.g., "US" vs. "us"). For our academic review analysis, lowercase conversion is appropriate.
97
-
98
- 3.2 Examining Categorical Values
99
- Let's first check what unique decision categories exist in our dataset:
100
- python
101
- # Display the unique decision categories
102
- df_rev_dec['decision'].unique()
103
- 3.3 Standardizing Decision Categories
104
- We can see that there are multiple "Accept" categories with different presentation formats. Let's standardize these:
105
- python
106
- # Define a function to clean up and standardize decision categories
107
- def clean_up_decision(text):
108
- if text in ['Accept (Poster)','Accept (Spotlight)', 'Accept (Oral)','Accept (Talk)']:
109
- return 'Accept'
110
- else:
111
- return text
112
- # Apply the function to create a new standardized decision column
113
- df_rev_dec['decision_clean'] = df_rev_dec['decision'].apply(clean_up_decision)
114
- # Check our new standardized decision categories
115
- df_rev_dec['decision_clean'].unique()
116
- Why Standardization Matters
117
- Simplified Analysis: Reduces the number of categories to analyze
118
- Clearer Patterns: Makes it easier to identify trends by decision outcome
119
- Better Visualization: Creates more meaningful and readable plots
120
- Consistent Terminology: Aligns with how conferences typically report accept/reject decisions
121
- Try it yourself: What other ways could you group or standardize these decision categories? What information might be lost in our current approach?
122
-
123
- Module 4: Text Tokenization
124
- 4.1 Introduction to Tokenization
125
- Tokenization is the process of breaking text into smaller units like sentences or words. Let's examine a review:
126
- python
127
- # Display a sample review
128
- df_reviews['review'][1]
129
- 4.2 Sentence Tokenization
130
- Let's break this review into sentences using NLTK's sentence tokenizer:
131
- python
132
- # Import the necessary library if not already imported
133
- from nltk.tokenize import sent_tokenize
134
- # Tokenize the review into sentences
135
- sent_tokenize(df_reviews['review'][1])
136
- 4.3 Counting Sentences
137
- Now let's count the number of sentences in the review:
138
- python
139
- # Count the number of sentences
140
- len(sent_tokenize(df_reviews['review'][1]))
141
- 4.4 Creating a Reusable Function
142
- Let's create a function to count sentences in any text:
143
- python
144
- # Define a function to count sentences in a text
145
- def sentence_count(text):
146
- return len(sent_tokenize(text))
147
- 4.5 Applying Our Function to All Reviews
148
- Now we'll apply our function to all reviews to get sentence counts:
149
- python
150
- # Add a new column with the sentence count for each review
151
- df_rev_dec['sent_count'] = df_rev_dec['review'].apply(sentence_count)
152
- # Display the updated dataframe
153
- df_rev_dec.head()
154
- Key Insight: Sentence count is a simple yet effective way to quantify review length. The number of sentences can indicate how thoroughly a reviewer has evaluated a paper.
155
-
156
- Module 5: Visualization of Text Metrics
157
- 5.1 Creating a 2D Histogram
158
- Let's visualize the relationship between review length (in sentences), rating, and decision outcome:
159
- python
160
- # Create a 2D histogram with sentence count, rating, and decision
161
- ax = sns.histplot(data=df_rev_dec, x='sent_count',
162
- y='rating_int',
163
- hue='decision_clean',
164
- kde=True,
165
- log_scale=(True,False),
166
- legend=True)
167
- 5.2 Enhancing Our Visualization
168
- Let's improve our visualization with better labels and formatting:
169
- python
170
- # Set axis labels
171
- ax.set(xlabel='Review Length (# Sentences)', ylabel='Review Rating')
172
- # Move the legend outside the plot for better visibility
173
- sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
174
- # Ensure the layout is properly configured
175
- plt.tight_layout()
176
- # Display the plot
177
- plt.show()
178
- 5.3 Interpreting the Visualization
179
- This visualization reveals several interesting patterns:
180
- Length-Rating Relationship: Is there a pattern in how review length relates to the rating given?
181
- Decision Patterns: Are there visible clusters for accepted vs. rejected papers?
182
- Density Distribution: Where are most reviews concentrated in terms of length and rating?
183
- Outliers: Are there unusually long or short reviews at certain rating levels?
184
- Discussion Question: Based on this visualization, do reviewers tend to write longer reviews when they're more positive or more critical? What might explain this pattern?
185
-
186
- Module 6: Additional Text Processing - Tokenization
187
- Tokenization is the process of breaking text into smaller units (tokens) that serve as the building blocks for natural language processing. In this lesson, we'll explore how to tokenize text, remove stopwords and punctuation, and analyze the results.
188
- 6.1 Text Cleaning
189
- Before tokenization, we often clean the text to remove unwanted characters. Let's start by removing punctuation:
190
- python
191
- # Removing punctuation
192
- df_rev_dec['clean_review_word'] = df_rev_dec['review'].str.translate(str.maketrans('', '', string.punctuation))
193
- What's happening here?
194
- string.punctuation contains all punctuation characters (.,!?;:'"()[]{}-_)
195
- str.maketrans('', '', string.punctuation) creates a translation table to remove these characters
196
- df_rev_dec['review'].str.translate() applies this translation to all review texts
197
- 6.2 Word Tokenization
198
- After cleaning, we can tokenize the text into individual words:
199
- python
200
- # Tokenizing the text
201
- df_rev_dec['tokens'] = df_rev_dec['clean_review_word'].apply(word_tokenize)
202
-
203
- # Example: Look at tokens for the 6th review
204
- df_rev_dec['tokens'][5]
205
- What's happening here?
206
- word_tokenize() is an NLTK function that splits text into a list of words
207
- We apply this function to each review using pandas' apply() method
208
- The result is a new column containing lists of words for each review
209
- 6.3 Removing Stopwords
210
- Stopwords are common words like "the," "and," "is" that often don't add meaningful information for analysis:
211
- python
212
- # Getting the list of English stopwords
213
- stop_words = set(stopwords.words('english'))
214
-
215
- # Removing stopwords from our tokens
216
- df_rev_dec['tokens'] = df_rev_dec['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
217
- What's happening here?
218
- stopwords.words('english') returns a list of common English stopwords
219
- We convert it to a set for faster lookup
220
- The lambda function filters each token list, keeping only words that aren't stopwords
221
- This creates more meaningful token lists focused on content words
222
- 6.4 Counting Tokens
223
- Now that we have our cleaned and filtered tokens, let's count them to measure review length:
224
- python
225
- # Count tokens for each review
226
- df_rev_dec['tokens_counts'] = df_rev_dec['tokens'].apply(len)
227
-
228
- # View the token counts
229
- df_rev_dec['tokens_counts']
230
- What's happening here?
231
- We use apply(len) to count the number of tokens in each review
232
- This gives us a quantitative measure of review length after removing stopwords
233
- The difference between this and raw word count shows the prevalence of stopwords
234
- 6.5 Visualizing Token Counts vs. Ratings
235
- Let's visualize the relationship between token count, rating, and decision:
236
- python
237
- # Create a 2D histogram with token count, rating, and decision
238
- ax = sns.histplot(data=df_rev_dec, x='tokens_counts',
239
- y='rating_int',
240
- hue='decision_clean',
241
- kde=True,
242
- log_scale=(True,False),
243
- legend=True)
244
-
245
- # Set axis labels
246
- ax.set(xlabel='Review Length (# Tokens)', ylabel='Review Rating')
247
-
248
- # Move the legend outside the plot
249
- sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
250
-
251
- plt.tight_layout()
252
- plt.show()
253
- What's happening here?
254
- We create a 2D histogram showing the distribution of token counts and ratings
255
- Colors distinguish between accepted and rejected papers
256
- Log scale on the x-axis helps visualize the wide range of token counts
257
- Kernel density estimation (KDE) shows the concentration of reviews
258
- Module 7: Aggregating Data by Paper
259
- 7.1 Understanding Data Aggregation
260
- So far, we've been analyzing individual reviews. However, each paper (identified by 'forum') may have multiple reviews. To understand paper-level patterns, we need to aggregate our data.
261
- 7.2 Calculating Paper-Level Metrics
262
- Let's aggregate our review metrics to the paper level by calculating means:
263
- python
264
- # Aggregate reviews to paper level (mean of metrics for each paper)
265
- df_rev_dec_ave = df_rev_dec.groupby(['forum','decision_clean'])[['rating_int','tokens_counts','sent_count']].mean().reset_index()
266
- What's happening here?
267
- We're grouping reviews by both 'forum' (paper ID) and 'decision_clean' (accept/reject)
268
- For each group, we calculate the mean of 'rating_int', 'tokens_counts', and 'sent_count'
269
- The reset_index() turns the result back into a regular DataFrame
270
- The result is a paper-level dataset with average metrics for each paper
271
- Try it yourself: How many papers do we have in our dataset compared to reviews? What does this tell us about the review process?
272
- Module 8: Visualizing Token Count vs. Rating
273
- 8.1 Creating an Advanced Visualization
274
- Now let's visualize the relationship between token count and rating at the paper level:
275
- python
276
- # Create a 2D histogram with token count, rating, and decision
277
- ax = sns.histplot(data=df_rev_dec_ave, x='tokens_counts',
278
- y='rating_int',
279
- hue='decision_clean',
280
- kde=True,
281
- log_scale=(True,False),
282
- legend=True)
283
-
284
- # Set axis labels
285
- ax.set(xlabel='Review Length (# Tokens)', ylabel='Review Rating')
286
-
287
- # Move the legend outside the plot
288
- sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
289
-
290
- plt.tight_layout()
291
- plt.show()
292
- 8.2 Interpreting the Visualization
293
- This visualization reveals important patterns in our data:
294
- Decision Boundaries: Notice where the color changes from one decision to another
295
- Length-Rating Relationship: Is there a correlation between review length and rating?
296
- Clustering: Are there natural clusters in the data?
297
- Outliers: What papers received unusually long or short reviews?
298
- Key Insight: At the paper level, we can see if the average review length for a paper relates to its likelihood of acceptance.
299
- Module 9: Comparing Token Count and Sentence Count
300
- 9.1 Visualizing Sentence Count vs. Rating
301
- Let's create a similar visualization using sentence count instead of token count:
302
- python
303
- # Create a 2D histogram with sentence count, rating, and decision
304
- ax = sns.histplot(data=df_rev_dec_ave, x='sent_count',
305
- y='rating_int',
306
- hue='decision_clean',
307
- kde=True,
308
- log_scale=(True,False),
309
- legend=True)
310
-
311
- # Set axis labels
312
- ax.set(xlabel='Review Length (# Sentences)', ylabel='Review Rating')
313
-
314
- # Move the legend outside the plot
315
- sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
316
-
317
- plt.tight_layout()
318
- plt.show()
319
- 9.2 Comparing Token vs. Sentence Metrics
320
- By comparing these two visualizations, we can understand:
321
- Which Metric is More Informative: Do token counts or sentence counts better differentiate accepted vs. rejected papers?
322
- Different Patterns: Do some papers have many short sentences while others have fewer long ones?
323
- Consistency: Are the patterns consistent across both metrics?
324
- Discussion Question: Which metric—tokens or sentences—seems to be a better predictor of paper acceptance? Why might that be?
325
- Module 10: Word Cloud Visualizations
326
- 10.1 Creating a Word Cloud from Review Text
327
- Word clouds are a powerful way to visualize the most frequent words in a text corpus:
328
- python
329
- # Concatenate all review text
330
- text = ' '.join(df_rev_dec['clean_review_word'])
331
-
332
- # Generate word cloud
333
- wordcloud = WordCloud().generate(text)
334
-
335
- # Display word cloud
336
- plt.figure(figsize=(8, 6))
337
- plt.imshow(wordcloud, interpolation='bilinear')
338
- plt.axis('off')
339
- plt.show()
340
- 10.2 Visualizing Paper Keywords
341
- Now let's visualize the primary keywords associated with the papers:
342
- python
343
- # Concatenate all primary keywords
344
- text = ' '.join(df_keyword['primary_keyword'])
345
-
346
- # Generate word cloud
347
- wordcloud = WordCloud().generate(text)
348
-
349
- # Display word cloud
350
- plt.figure(figsize=(8, 6))
351
- plt.imshow(wordcloud, interpolation='bilinear')
352
- plt.axis('off')
353
- plt.show()
354
- 10.3 Visualizing Paper Abstracts
355
- Finally, let's create a word cloud from paper abstracts:
356
- python
357
- # Concatenate all abstracts
358
- text = ' '.join(df_submissions['abstract'])
359
-
360
- # Generate word cloud
361
- wordcloud = WordCloud().generate(text)
362
-
363
- # Display word cloud
364
- plt.figure(figsize=(8, 6))
365
- plt.imshow(wordcloud, interpolation='bilinear')
366
- plt.axis('off')
367
- plt.show()
368
- Interpreting Word Clouds
369
- Word clouds provide insights about:
370
- Dominant Themes: The most frequent words appear largest
371
- Vocabulary Differences: Compare terms across different sources (reviews vs. abstracts)
372
- Field-Specific Terminology: Technical terms reveal the focus of the conference
373
- Sentiment Indicators: Evaluative words in reviews reveal assessment patterns
374
- Try it yourself: What differences do you notice between the word clouds from reviews, keywords, and abstracts? What do these differences tell you about academic communication?
375
-
376
-
377
-
378
-
379
-
380
-
381
-
382
-
383
-
384
-
385
-
386
-
387
-
388
-
389
-
390
-
391
-
392
-
393
-
394
-
395
-
396
- V1.1 Week 4 - Intro to NLP
397
- Course Overview
398
- In this course, you'll learn fundamental Natural Language Processing (NLP) concepts by exploring a fascinating real-world question: What is the effect of releasing a preprint of a paper before it is submitted for peer review?
399
- Using the ICLR (International Conference on Learning Representations) database - which contains submissions, reviews, and author profiles from 2017-2022 - you'll develop practical NLP skills while investigating potential biases and patterns in academic publishing.
400
- Learning Path
401
- Understanding Text as Data: How computers represent and work with text
402
- Text Processing Fundamentals: Basic cleaning and normalization
403
- Quantitative Text Analysis: Measuring and comparing text features
404
- Tokenization Approaches: Breaking text into meaningful units
405
- Text Visualization Techniques: Creating insightful visual representations
406
- From Analysis to Insights: Drawing evidence-based conclusions
407
- Let's dive in!
408
-
409
- Step 4: Text Cleaning and Normalization for Academic Content
410
- Academic papers contain specialized vocabulary, citations, equations, and other elements that require careful normalization.
411
- Key Concept: Scientific text normalization preserves meaningful technical content while standardizing format.
412
- Stop Words Removal
413
- Definition: Stop words are extremely common words that appear frequently in text but typically carry little meaningful information for analysis purposes. In English, these include articles (the, a, an), conjunctions (and, but, or), prepositions (in, on, at), and certain pronouns (I, you, it).
414
- Stop words removal is the process of filtering these words out before analysis to:
415
- Reduce noise in the data
416
- Decrease the dimensionality of the text representation
417
- Focus analysis on the content-bearing words
418
- In academic text, we often extend standard stop word lists to include domain-specific terms that are ubiquitous but not analytically useful (e.g., "paper," "method," "result").
419
- python
420
- # Load standard English stop words
421
- from nltk.corpus import stopwords
422
- standard_stop_words = set(stopwords.words('english'))
423
-
424
- # Add academic-specific stop words
425
- academic_stop_words = ['et', 'al', 'fig', 'table', 'paper', 'using', 'used',
426
- 'method', 'result', 'show', 'propose', 'use']
427
- all_stop_words = standard_stop_words.union(academic_stop_words)
428
-
429
- # Apply stop word removal
430
- def remove_stop_words(text):
431
- words = text.split()
432
- filtered_words = [word for word in words if word.lower() not in all_stop_words]
433
- return ' '.join(filtered_words)
434
-
435
- # Compare before and after
436
- example = "We propose a novel method that shows impressive results on the benchmark dataset."
437
- filtered = remove_stop_words(example)
438
-
439
- print("Original:", example)
440
- print("After stop word removal:", filtered)
441
- # Output: "novel shows impressive results benchmark dataset."
442
- Stemming and Lemmatization
443
- Definition: Stemming and lemmatization are text normalization techniques that reduce words to their root or base forms, allowing different inflections or derivations of the same word to be treated as equivalent.
444
- Stemming is a simpler, rule-based approach that works by truncating words to their stems, often by removing suffixes. For example:
445
- "running," "runs," and "runner" might all be reduced to "run"
446
- "connection," "connected," and "connecting" might all become "connect"
447
- Stemming is faster but can sometimes produce non-words or incorrect reductions.
448
- Lemmatization is a more sophisticated approach that uses vocabulary and morphological analysis to return the dictionary base form (lemma) of a word. For example:
449
- "better" becomes "good" (when the word's part of speech is supplied)
450
- "was" and "were" become "be"
451
- "studying" becomes "study"
452
- Lemmatization generally produces more accurate results but requires more computational resources.
453
- python
454
- from nltk.stem import PorterStemmer, WordNetLemmatizer
455
- import nltk
456
- nltk.download('wordnet')
457
-
458
- # Initialize stemmer and lemmatizer
459
- stemmer = PorterStemmer()
460
- lemmatizer = WordNetLemmatizer()
461
-
462
- # Example words
463
- academic_terms = ["algorithms", "computing", "learning", "trained",
464
- "networks", "better", "studies", "analyzed"]
465
-
466
- # Compare stemming and lemmatization
467
- for term in academic_terms:
468
- print(f"Original: {term}")
469
- print(f"Stemmed: {stemmer.stem(term)}")
470
- print(f"Lemmatized: {lemmatizer.lemmatize(term)}")
471
- print()
472
-
473
- # Demonstration in context
474
- academic_sentence = "The training algorithms performed better than expected when analyzing multiple neural networks."
475
-
476
- # Apply stemming
477
- stemmed_words = [stemmer.stem(word) for word in academic_sentence.lower().split()]
478
- stemmed_sentence = ' '.join(stemmed_words)
479
-
480
- # Apply lemmatization
481
- lemmatized_words = [lemmatizer.lemmatize(word) for word in academic_sentence.lower().split()]
482
- lemmatized_sentence = ' '.join(lemmatized_words)
483
-
484
- print("Original:", academic_sentence)
485
- print("Stemmed:", stemmed_sentence)
486
- print("Lemmatized:", lemmatized_sentence)
487
- When to use which approach:
488
- For academic text analysis:
489
- Stemming is useful when processing speed is important and approximate matching is sufficient
490
- Lemmatization is preferred when precision is crucial, especially for technical terms where preserving meaning is essential
491
- In our ICLR paper analysis, lemmatization would likely be more appropriate since technical terminology often carries specific meanings that should be preserved accurately.
492
- Challenge Question: How might stemming versus lemmatization affect our analysis of technical innovation in ICLR papers? Can you think of specific machine learning terms where these approaches would yield different results?
493
-
494
-
495
- V1.0 Week 4 - Intro to NLP
496
- The Real-World Problem
497
- Imagine you're part of a small business team that has just launched a new product. You've received hundreds of customer reviews across various platforms, and your manager has asked you to make sense of this feedback. Looking at the mountain of text data, you realize you need a systematic way to understand what customers are saying without reading each review individually.
498
- Your challenge: How can you efficiently analyze customer feedback to identify common themes, sentiments, and specific product issues?
499
- Our Approach
500
- In this module, we'll learn how to transform unstructured text feedback into structured insights using Natural Language Processing. Here's our journey:
501
- Understanding text as data
502
- Basic processing of text information
503
- Measuring text properties
504
- Cleaning and normalizing customer feedback
505
- Visualizing patterns in the feedback
506
- Analyzing words vs. tokens
507
- Let's begin!
508
- Step 1: Text as Data - A New Perspective
509
- When we look at customer reviews like:
510
- "Love this product! So easy to use and the battery lasts forever."
511
- "Terrible design. Buttons stopped working after two weeks."
512
- We naturally understand the meaning and sentiment. But how can a computer understand this?
513
- Key Concept: Text can be treated as data that we can analyze quantitatively.
514
- Unlike numerical data (age, price, temperature) that has inherent mathematical properties, text data needs to be transformed before we can analyze it.
515
- Interactive Exercise: Look at these two reviews. As a human, what information can you extract? Now think about how a computer might "see" this text without any processing.
516
- Challenge Question: What types of information might we want to extract from customer reviews? List at least three analytical goals.
517
- Step 2: Basic Text Processing - Breaking Down Language
518
- Before we can analyze text, we need to break it down into meaningful units.
519
- Key Concept: Tokenization is the process of splitting text into smaller pieces (tokens) such as words, phrases, or characters.
520
- For example, the review "Love this product!" can be tokenized into ["Love", "this", "product", "!"] or ["Love", "this", "product!"] depending on our approach.
521
- Interactive Example: Let's tokenize these customer reviews:
522
- python
523
- # Simple word tokenization
524
- review = "Battery life is amazing but the app crashes frequently."
525
- tokens = review.split() # Results in ["Battery", "life", "is", "amazing", "but", "the", "app", "crashes", "frequently."]
526
- Notice how "frequently." includes the period. Basic tokenization has limitations!
527
- Challenge Question: How might we handle contractions like "doesn't" or hyphenated words like "user-friendly" when tokenizing?
528
- Step 3: Measuring Text - Quantifying Feedback
529
- Now that we've broken text into pieces, we can start measuring properties of our customer feedback.
530
- Key Concept: Text metrics help us quantify and compare text data.
531
- Common metrics include:
532
- Length (words, characters)
533
- Complexity (average word length, unique words ratio)
534
- Sentiment scores (positive/negative)
535
- Interactive Example: Let's calculate basic metrics for customer reviews:
536
- python
537
- # Word count
538
- review = "The interface is intuitive and responsive."
539
- word_count = len(review.split()) # 6 words
540
-
541
- # Character count (including spaces)
542
- char_count = len(review) # 42 characters
543
-
544
- # Unique words ratio
545
- unique_words = len(set(review.lower().split()))
546
- unique_ratio = unique_words / word_count # 1.0 (all words are unique)
547
- Challenge Question: Why might longer reviews not necessarily contain more information than shorter ones? What other metrics beyond length might better capture information content?
548
- Step 4: Text Cleaning and Normalization
549
- Customer feedback often contains inconsistencies: spelling variations, punctuation, capitalization, etc.
550
- Key Concept: Text normalization creates a standardized format for analysis.
551
- Common normalization steps:
552
- Converting to lowercase
553
- Removing punctuation
554
- Correcting spelling
555
- Removing stop words (common words like "the", "is")
556
- Stemming or lemmatizing (reducing words to their base form)
557
- Interactive Example: Let's normalize a review:
558
- python
559
- # Original review
560
- review = "The battery LIFE is amazing!!! Works for days."
561
-
562
- # Lowercase
563
- review = review.lower() # "the battery life is amazing!!! works for days."
564
-
565
- # Remove punctuation and extra spaces
566
- import re
567
- review = re.sub(r'[^\w\s]', '', review) # "the battery life is amazing works for days"
568
-
569
- # Remove stop words
570
- stop_words = ["the", "is", "for"]
571
- words = review.split()
572
- filtered_words = [word for word in words if word not in stop_words]
573
- # Result: ["battery", "life", "amazing", "works", "days"]
574
- Challenge Question: How might normalization affect sentiment analysis? Could removing punctuation or stop words change the perceived sentiment of a review?
575
- Step 5: Text Visualization - Seeing Patterns
576
- Visual representations help us identify patterns across many reviews.
577
- Key Concept: Text visualization techniques reveal insights that are difficult to see in raw text.
578
- Common visualization methods:
579
- Word clouds
580
- Frequency distributions
581
- Sentiment over time
582
- Topic clusters
583
- Interactive Example: Creating a simple word frequency chart:
584
- python
585
- from collections import Counter
586
-
587
- # Combined reviews
588
- reviews = ["Battery life is amazing", "Battery drains too quickly",
589
- "Great battery performance", "Screen is too small"]
590
-
591
- # Count word frequencies
592
- all_words = " ".join(reviews).lower().split()
593
- word_counts = Counter(all_words)
594
- # Result: {'battery': 3, 'life': 1, 'is': 2, 'amazing': 1, 'drains': 1, 'too': 2, 'quickly': 1, 'great': 1, 'performance': 1, 'screen': 1, 'small': 1}
595
-
596
- # We could visualize this as a bar chart
597
- # Most frequent: 'battery' (3), 'is' (2), 'too' (2)
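The bar chart mentioned in the comment could be drawn with matplotlib, for example (a minimal sketch reusing `word_counts` from above):

```python
import matplotlib.pyplot as plt

# Plot the five most frequent words as a bar chart
top_words = word_counts.most_common(5)
labels, counts = zip(*top_words)

plt.bar(labels, counts)
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.title("Most frequent words across reviews")
plt.show()
```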
598
- Challenge Question: Why might a word cloud be misleading for understanding customer sentiment? What additional information would make the visualization more informative?
599
- Step 6: Words vs. Tokens - Making Choices
600
- As we advance in NLP, we face an important decision: should we analyze whole words or more sophisticated tokens?
601
- Key Concept: Different tokenization approaches have distinct advantages and limitations.
602
- Word-based analysis:
603
- Intuitive and interpretable
604
- Misses connections between related words (run/running/ran)
605
- Struggles with compound words and new terms
606
- Token-based analysis:
607
- Can capture subword information
608
- Handles unknown words better
609
- May lose some human interpretability
610
- Interactive Example: Comparing approaches:
611
- python
612
- # Word-based
613
- review = "The touchscreen is unresponsive"
614
- words = review.lower().split() # ['the', 'touchscreen', 'is', 'unresponsive']
615
-
616
- # Subword tokenization (simplified example)
617
- subwords = ['the', 'touch', 'screen', 'is', 'un', 'responsive']
618
- Challenge Question: For our customer feedback analysis, which approach would be better: analyzing whole words or subword tokens? What factors would influence this decision?
619
- Putting It All Together: Solving Our Problem
620
- Now that we've learned these fundamental NLP concepts, let's return to our original challenge: analyzing customer feedback at scale.
621
- Here's how we'd approach it:
622
- Collect and tokenize all customer reviews
623
- Clean and normalize the text
624
- Calculate key metrics (length, sentiment scores)
625
- Visualize common terms and topics
626
- Identify positive and negative feedback themes
627
- Generate an automated summary for the product team
628
- By applying these NLP fundamentals, we've transformed an overwhelming mass of text into actionable insights that can drive product improvements!
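As a rough sketch of how the first few steps of this pipeline fit together, using only the techniques shown in this module (the two example reviews and the small stop-word list are placeholders):

```python
import re
from collections import Counter

reviews = ["Love this product! So easy to use and the battery lasts forever.",
           "Terrible design. Buttons stopped working after two weeks."]
stop_words = {"the", "is", "and", "this", "to", "so", "after", "of"}

cleaned_reviews = []
for review in reviews:
    text = re.sub(r'[^\w\s]', '', review.lower())              # normalize
    tokens = [w for w in text.split() if w not in stop_words]  # tokenize + filter
    cleaned_reviews.append(tokens)

review_lengths = [len(tokens) for tokens in cleaned_reviews]   # simple length metric
word_counts = Counter(w for tokens in cleaned_reviews for w in tokens)

print(review_lengths)
print(word_counts.most_common(5))
```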
629
- Final Challenge: How could we extend this analysis to track customer sentiment over time as we release product updates? What additional NLP techniques might be helpful?
630
-
 
Reference files/w6_logistic_regression_lab.py DELETED
@@ -1,400 +0,0 @@
1
- # -*- coding: utf-8 -*-
2
- """W6_Logistic_regression_lab
3
-
4
- Automatically generated by Colab.
5
-
6
- Original file is located at
7
- https://colab.research.google.com/drive/1MG7N2HN-Nxow9fzvc0fzxvp3WyKqtgs8
8
-
9
- # 🚀 Logistic Regression Lab: Stock Market Prediction
10
-
11
- ## Lab Overview
12
- In this lab, we'll use logistic regression to try predicting whether the stock market goes up or down. Spoiler alert: This is intentionally a challenging prediction problem that will teach us important lessons about when logistic regression works well and when it doesn't.
13
- ## Learning Goals:
14
-
15
- - Apply logistic regression to real data
16
- - Interpret probabilities and coefficients
17
- - Understand why some prediction problems are inherently difficult
18
- - Learn proper model evaluation techniques
19
-
20
- ## The Stock Market Data
21
-
22
- In this lab we will examine the `Smarket`
23
- data, which is part of the `ISLP`
24
- library. This data set consists of percentage returns for the S&P 500
25
- stock index over 1,250 days, from the beginning of 2001 until the end
26
- of 2005. For each date, we have recorded the percentage returns for
27
- each of the five previous trading days, `Lag1` through
28
- `Lag5`. We have also recorded `Volume` (the number of
29
- shares traded on the previous day, in billions), `Today` (the
30
- percentage return on the date in question) and `Direction`
31
- (whether the market was `Up` or `Down` on this date).
32
-
33
- ### Your Challenge
34
- **Question**: Can we predict if the S&P 500 will go up or down based on recent trading patterns?
35
-
36
- **Why This Matters:** If predictable, this would be incredibly valuable. If not predictable, we learn about market efficiency and realistic expectations for prediction models.
37
-
38
-
39
- To answer the question, **we start by importing our libraries at this top level; these are all imports we have seen in previous labs.**
40
- """
41
-
42
- import numpy as np
43
- import pandas as pd
44
- from matplotlib.pyplot import subplots
45
- import statsmodels.api as sm
46
- from ISLP import load_data
47
- from ISLP.models import (ModelSpec as MS,
48
- summarize)
49
-
50
- """We also collect together the new imports needed for this lab."""
51
-
52
- from ISLP import confusion_table
53
- from ISLP.models import contrast
54
- from sklearn.discriminant_analysis import \
55
- (LinearDiscriminantAnalysis as LDA,
56
- QuadraticDiscriminantAnalysis as QDA)
57
- from sklearn.naive_bayes import GaussianNB
58
- from sklearn.neighbors import KNeighborsClassifier
59
- from sklearn.preprocessing import StandardScaler
60
- from sklearn.model_selection import train_test_split
61
- from sklearn.linear_model import LogisticRegression
62
-
63
- """Now we are ready to load the `Smarket` data."""
64
-
65
- Smarket = load_data('Smarket')
66
- Smarket
67
-
68
- """This gives a truncated listing of the data.
69
- We can see what the variable names are.
70
- """
71
-
72
- Smarket.columns
73
-
74
- """We compute the correlation matrix using the `corr()` method
75
- for data frames, which produces a matrix that contains all of
76
- the pairwise correlations among the variables.
77
-
78
- By instructing `pandas` to use only numeric variables, the `corr()` method does not report a correlation for the `Direction` variable because it is
79
- qualitative.
80
-
81
- ![image.png](attachment:image.png)
82
- """
83
-
84
- Smarket.corr(numeric_only=True)
85
-
86
- """As one would expect, the correlations between the lagged return variables and
87
- today’s return are close to zero. The only substantial correlation is between `Year` and
88
- `Volume`. By plotting the data we see that `Volume`
89
- is increasing over time. In other words, the average number of shares traded
90
- daily increased from 2001 to 2005.
91
-
92
- """
93
-
94
- Smarket.plot(y='Volume');
95
-
96
- """## Logistic Regression
97
- Next, we will fit a logistic regression model in order to predict
98
- `Direction` using `Lag1` through `Lag5` and
99
- `Volume`. The `sm.GLM()` function fits *generalized linear models*, a class of
100
- models that includes logistic regression. Alternatively,
101
- the function `sm.Logit()` fits a logistic regression
102
- model directly. The syntax of
103
- `sm.GLM()` is similar to that of `sm.OLS()`, except
104
- that we must pass in the argument `family=sm.families.Binomial()`
105
- in order to tell `statsmodels` to run a logistic regression rather than some other
106
- type of generalized linear model.
107
- """
108
-
109
- allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
110
- design = MS(allvars)
111
- X = design.fit_transform(Smarket)
112
- y = Smarket.Direction == 'Up'
113
- glm = sm.GLM(y,
114
- X,
115
- family=sm.families.Binomial())
116
- results = glm.fit()
117
- summarize(results)
118
-
119
- """The smallest *p*-value here is associated with `Lag1`. The
120
- negative coefficient for this predictor suggests that if the market
121
- had a positive return yesterday, then it is less likely to go up
122
- today. However, at a value of 0.15, the *p*-value is still
123
- relatively large, and so there is no clear evidence of a real
124
- association between `Lag1` and `Direction`.
125
-
126
- We use the `params` attribute of `results`
127
- in order to access just the
128
- coefficients for this fitted model.
129
- """
130
-
131
- results.params
132
-
133
- """Likewise we can use the
134
- `pvalues` attribute to access the *p*-values for the coefficients.
135
- """
136
-
137
- results.pvalues
138
-
139
- """The `predict()` method of `results` can be used to predict the
140
- probability that the market will go up, given values of the
141
- predictors. This method returns predictions
142
- on the probability scale. If no data set is supplied to the `predict()`
143
- function, then the probabilities are computed for the training data
144
- that was used to fit the logistic regression model.
145
- As with linear regression, one can pass an optional `exog` argument consistent
146
- with a design matrix if desired. Here we have
147
- printed only the first ten probabilities.
148
- """
149
-
150
- probs = results.predict()
151
- probs[:10]
152
-
153
- """In order to make a prediction as to whether the market will go up or
154
- down on a particular day, we must convert these predicted
155
- probabilities into class labels, `Up` or `Down`. The
156
- following two commands create a vector of class predictions based on
157
- whether the predicted probability of a market increase is greater than
158
- or less than 0.5.
159
- """
160
-
161
- labels = np.array(['Down']*1250)
162
- labels[probs>0.5] = "Up"
163
-
164
- """The `confusion_table()`
165
- function from the `ISLP` package summarizes these predictions, showing how
166
- many observations were correctly or incorrectly classified. Our function, which is adapted from a similar function
167
- in the module `sklearn.metrics`, transposes the resulting
168
- matrix and includes row and column labels.
169
- The `confusion_table()` function takes as first argument the
170
- predicted labels, and second argument the true labels.
171
- """
172
-
173
- confusion_table(labels, Smarket.Direction)
174
-
175
- """The diagonal elements of the confusion matrix indicate correct
176
- predictions, while the off-diagonals represent incorrect
177
- predictions. Hence our model correctly predicted that the market would
178
- go up on 507 days and that it would go down on 145 days, for a
179
- total of 507 + 145 = 652 correct predictions. The `np.mean()`
180
- function can be used to compute the fraction of days for which the
181
- prediction was correct. In this case, logistic regression correctly
182
- predicted the movement of the market 52.2% of the time.
183
-
184
- """
185
-
186
- (507+145)/1250, np.mean(labels == Smarket.Direction)
187
-
188
- """At first glance, it appears that the logistic regression model is
189
- working a little better than random guessing. However, this result is
190
- misleading because we trained and tested the model on the same set of
191
- 1,250 observations. In other words, $100 - 52.2 = 47.8$% is the
192
- *training* error rate. As we have seen
193
- previously, the training error rate is often overly optimistic --- it
194
- tends to underestimate the test error rate. In
195
- order to better assess the accuracy of the logistic regression model
196
- in this setting, we can fit the model using part of the data, and
197
- then examine how well it predicts the *held out* data. This
198
- will yield a more realistic error rate, in the sense that in practice
199
- we will be interested in our model’s performance not on the data that
200
- we used to fit the model, but rather on days in the future for which
201
- the market’s movements are unknown.
202
-
203
- To implement this strategy, we first create a Boolean vector
204
- corresponding to the observations from 2001 through 2004. We then
205
- use this vector to create a held out data set of observations from
206
- 2005.
207
- """
208
-
209
- train = (Smarket.Year < 2005)
210
- Smarket_train = Smarket.loc[train]
211
- Smarket_test = Smarket.loc[~train]
212
- Smarket_test.shape
213
-
214
- """The object `train` is a vector of 1,250 elements, corresponding
215
- to the observations in our data set. The elements of the vector that
216
- correspond to observations that occurred before 2005 are set to
217
- `True`, whereas those that correspond to observations in 2005 are
218
- set to `False`. Hence `train` is a
219
- *boolean* array, since its
220
- elements are `True` and `False`. Boolean arrays can be used
221
- to obtain a subset of the rows or columns of a data frame
222
- using the `loc` method. For instance,
223
- the command `Smarket.loc[train]` would pick out a submatrix of the
224
- stock market data set, corresponding only to the dates before 2005,
225
- since those are the ones for which the elements of `train` are
226
- `True`. The `~` symbol can be used to negate all of the
227
- elements of a Boolean vector. That is, `~train` is a vector
228
- similar to `train`, except that the elements that are `True`
229
- in `train` get swapped to `False` in `~train`, and vice versa.
230
- Therefore, `Smarket.loc[~train]` yields a
231
- subset of the rows of the data frame
232
- of the stock market data containing only the observations for which
233
- `train` is `False`.
234
- The output above indicates that there are 252 such
235
- observations.
236
-
237
- We now fit a logistic regression model using only the subset of the
238
- observations that correspond to dates before 2005. We then obtain predicted probabilities of the
239
- stock market going up for each of the days in our test set --- that is,
240
- for the days in 2005.
241
- """
242
-
243
- X_train, X_test = X.loc[train], X.loc[~train]
244
- y_train, y_test = y.loc[train], y.loc[~train]
245
- glm_train = sm.GLM(y_train,
246
- X_train,
247
- family=sm.families.Binomial())
248
- results = glm_train.fit()
249
- probs = results.predict(exog=X_test)
250
-
251
- """Notice that we have trained and tested our model on two completely
252
- separate data sets: training was performed using only the dates before
253
- 2005, and testing was performed using only the dates in 2005.
254
-
255
- Finally, we compare the predictions for 2005 to the
256
- actual movements of the market over that time period.
257
- We will first store the test and training labels (recall `y_test` is binary).
258
- """
259
-
260
- D = Smarket.Direction
261
- L_train, L_test = D.loc[train], D.loc[~train]
262
-
263
- """Now we threshold the
264
- fitted probability at 50% to form
265
- our predicted labels.
266
- """
267
-
268
- labels = np.array(['Down']*252)
269
- labels[probs>0.5] = 'Up'
270
- confusion_table(labels, L_test)
271
-
272
- """The test accuracy is about 48% while the error rate is about 52%"""
273
-
274
- np.mean(labels == L_test), np.mean(labels != L_test)
275
-
276
- """The `!=` notation means *not equal to*, and so the last command
277
- computes the test set error rate. The results are rather
278
- disappointing: the test error rate is 52%, which is worse than
279
- random guessing! Of course this result is not all that surprising,
280
- given that one would not generally expect to be able to use previous
281
- days’ returns to predict future market performance. (After all, if it
282
- were possible to do so, then the authors of this book would be out
283
- striking it rich rather than writing a statistics textbook.)
284
-
285
- We recall that the logistic regression model had very underwhelming
286
- *p*-values associated with all of the predictors, and that the
287
- smallest *p*-value, though not very small, corresponded to
288
- `Lag1`. Perhaps by removing the variables that appear not to be
289
- helpful in predicting `Direction`, we can obtain a more
290
- effective model. After all, using predictors that have no relationship
291
- with the response tends to cause a deterioration in the test error
292
- rate (since such predictors cause an increase in variance without a
293
- corresponding decrease in bias), and so removing such predictors may
294
- in turn yield an improvement. Below we refit the logistic
295
- regression using just `Lag1` and `Lag2`, which seemed to
296
- have the highest predictive power in the original logistic regression
297
- model.
298
- """
299
-
300
- model = MS(['Lag1', 'Lag2']).fit(Smarket)
301
- X = model.transform(Smarket)
302
- X_train, X_test = X.loc[train], X.loc[~train]
303
- glm_train = sm.GLM(y_train,
304
- X_train,
305
- family=sm.families.Binomial())
306
- results = glm_train.fit()
307
- probs = results.predict(exog=X_test)
308
- labels = np.array(['Down']*252)
309
- labels[probs>0.5] = 'Up'
310
- confusion_table(labels, L_test)
311
-
312
- """Let’s evaluate the overall accuracy as well as the accuracy within the days when
313
- logistic regression predicts an increase.
314
- """
315
-
316
- (35+106)/252,106/(106+76)
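For comparison with the naive strategy discussed just below, the accuracy of always predicting `Up` on the 2005 test days can be computed directly (an illustrative sketch reusing `L_test` from the code above, not part of the original lab file):

```python
import numpy as np

# Fraction of 2005 test days on which the market actually went up;
# this equals the accuracy of a strategy that always predicts 'Up'
naive_accuracy = np.mean(L_test == 'Up')
print(naive_accuracy)
```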
317
-
318
- """Now the results appear to be a little better: 56% of the daily
319
- movements have been correctly predicted. It is worth noting that in
320
- this case, a much simpler strategy of predicting that the market will
321
- increase every day will also be correct 56% of the time! Hence, in
322
- terms of overall error rate, the logistic regression method is no
323
- better than the naive approach. However, the confusion matrix
324
- shows that on days when logistic regression predicts an increase in
325
- the market, it has a 58% accuracy rate. This suggests a possible
326
- trading strategy of buying on days when the model predicts an
327
- increasing market, and avoiding trades on days when a decrease is
328
- predicted. Of course one would need to investigate more carefully
329
- whether this small improvement was real or just due to random chance.
330
-
331
- Suppose that we want to predict the returns associated with particular
332
- values of `Lag1` and `Lag2`. In particular, we want to
333
- predict `Direction` on a day when `Lag1` and
334
- `Lag2` equal $1.2$ and $1.1$, respectively, and on a day when they
335
- equal $1.5$ and $-0.8$. We do this using the `predict()`
336
- function.
337
- """
338
-
339
- newdata = pd.DataFrame({'Lag1':[1.2, 1.5],
340
- 'Lag2':[1.1, -0.8]});
341
- newX = model.transform(newdata)
342
- results.predict(newX)
343
-
344
- Smarket
345
-
346
- import pandas as pd
347
- import numpy as np
348
- import matplotlib.pyplot as plt
349
- from sklearn.model_selection import train_test_split
350
- from sklearn.linear_model import LogisticRegression
351
- from sklearn.metrics import classification_report, confusion_matrix
352
- import statsmodels.api as sm
353
-
354
-
355
- # Load the dataset
356
- data = load_data('Smarket')
357
-
358
- # Display the first few rows of the dataset
359
- print(data.head())
360
-
361
- # Prepare the data for logistic regression
362
- # Using 'Lag1' and 'Lag2' as predictors and 'Direction' as the response
363
- data['Direction'] = data['Direction'].map({'Up': 1, 'Down': 0})
364
- X = data[['Lag1', 'Lag2']]
365
- y = data['Direction']
366
-
367
- # Split the data into training and testing sets
368
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
369
-
370
- # Fit the logistic regression model
371
- log_reg = LogisticRegression()
372
- log_reg.fit(X_train, y_train)
373
-
374
- # Make predictions on the test set
375
- y_pred = log_reg.predict(X_test)
376
-
377
- # Print classification report and confusion matrix
378
- print(classification_report(y_test, y_pred))
379
- print(confusion_matrix(y_test, y_pred))
380
-
381
- # Visualize the decision boundary
382
- plt.figure(figsize=(10, 6))
383
-
384
- # Create a mesh grid for plotting decision boundary
385
- x_min, x_max = X['Lag1'].min() - 1, X['Lag1'].max() + 1
386
- y_min, y_max = X['Lag2'].min() - 1, X['Lag2'].max() + 1
387
- xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
388
- np.arange(y_min, y_max, 0.01))
389
-
390
- # Predict the function value for the whole grid
391
- Z = log_reg.predict(np.c_[xx.ravel(), yy.ravel()])
392
- Z = Z.reshape(xx.shape)
393
-
394
- # Plot the decision boundary
395
- plt.contourf(xx, yy, Z, alpha=0.8)
396
- plt.scatter(X_test['Lag1'], X_test['Lag2'], c=y_test, edgecolor='k', s=20)
397
- plt.xlabel('Lag1')
398
- plt.ylabel('Lag2')
399
- plt.title('Logistic Regression Decision Boundary')
400
- plt.show()
 
Reference files/week 7/W7_Lab_KNN_clustering.ipynb DELETED
@@ -1,481 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "id": "b1c6a137",
6
- "metadata": {
7
- "id": "b1c6a137"
8
- },
9
- "source": [
10
- "# Clustering Lab: State Crime Pattern Analysis\n",
11
- "\n",
12
- "## Lab Overview\n",
13
- "\n",
14
- "Welcome to your hands-on clustering lab! You'll be working as a policy analyst for the Department of Justice, analyzing crime patterns across US states. Your mission: discover hidden safety profiles that could inform federal resource allocation and crime prevention strategies.\n",
15
- "\n",
16
- "**Your Deliverable**: A policy brief with visualizations and recommendations based on your clustering analysis.\n",
17
- "\n",
18
- "---\n",
19
- "\n",
20
- "## Exercise 1: Data Detective Work\n",
21
- "**Time: 15 minutes | Product: Data Summary Report**\n",
22
- "\n",
23
- "### Your Task\n",
24
- "Before any analysis, you need to understand what you're working with. Create a brief data summary that a non-technical policy maker could understand.\n"
25
- ]
26
- },
27
- {
28
- "cell_type": "code",
29
- "source": [
30
- "```python\n",
31
- "import numpy as np\n",
32
- "import pandas as pd\n",
33
- "import matplotlib.pyplot as plt\n",
34
- "from statsmodels.datasets import get_rdataset\n",
35
- "from sklearn.preprocessing import StandardScaler\n",
36
- "from sklearn.cluster import KMeans, AgglomerativeClustering\n",
37
- "\n",
38
- "# Load the data\n",
39
- "USArrests = get_rdataset('USArrests').data\n",
40
- "print(\"Dataset shape:\", USArrests.shape)\n",
41
- "print(\"\\nVariables:\", USArrests.columns.tolist())\n",
42
- "print(\"\\nFirst 5 states:\")\n",
43
- "print(USArrests.head())\n",
44
- "```"
45
- ],
46
- "metadata": {
47
- "colab": {
48
- "base_uri": "https://localhost:8080/",
49
- "height": 106
50
- },
51
- "id": "mqRVE1hlXK9x",
52
- "outputId": "5a1bbd64-15cd-4e1c-9344-64a901d8a396"
53
- },
54
- "id": "mqRVE1hlXK9x",
55
- "execution_count": null,
56
- "outputs": [
57
- {
58
- "output_type": "error",
59
- "ename": "SyntaxError",
60
- "evalue": "invalid syntax (<ipython-input-1-2035427107>, line 1)",
61
- "traceback": [
62
- "\u001b[0;36m File \u001b[0;32m\"<ipython-input-1-2035427107>\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m ```python\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
63
- ]
64
- }
65
- ]
66
- },
67
- {
68
- "cell_type": "markdown",
69
- "source": [
70
- "## Your Investigation\n",
71
- "Complete this data summary table:\n",
72
- "\n",
73
- "| Variable | What it measures | Average Value | Highest State | Lowest State |\n",
74
- "|----------|------------------|---------------|---------------|--------------|\n",
75
- "| Murder | Rate per 100,000 people | ??? | ??? | ??? |\n",
76
- "| Assault | Rate per 100,000 people | ??? | ??? | ??? |\n",
77
- "| UrbanPop | Percentage living in cities | ??? | ??? | ??? |\n",
78
- "| Rape | Rate per 100,000 people | ??? | ??? | ??? |\n",
79
- "\n",
80
- "**Deliverable**: Write 2-3 sentences describing the biggest surprises in this data. Which states are not what you expected?\n",
81
- "\n",
82
- "---\n",
83
- "\n",
84
- "## Exercise 2: The Scaling Challenge\n",
85
- "**Time: 10 minutes | Product: Before/After Comparison**\n",
86
- "\n",
87
- "### Your Task\n",
88
- "Demonstrate why scaling is critical for clustering crime data.\n",
89
- "\n"
90
- ],
91
- "metadata": {
92
- "id": "7qkDKTe4XLtG"
93
- },
94
- "id": "7qkDKTe4XLtG"
95
- },
96
- {
97
- "cell_type": "code",
98
- "source": [
99
- "```python\n",
100
- "# Check the scale differences\n",
101
- "print(\"Original data ranges:\")\n",
102
- "print(USArrests.describe())\n",
103
- "\n",
104
- "print(\"\\nVariances (how spread out the data is):\")\n",
105
- "print(USArrests.var())\n",
106
- "\n",
107
- "# Scale the data\n",
108
- "scaler = StandardScaler()\n",
109
- "USArrests_scaled = scaler.fit_transform(USArrests)\n",
110
- "scaled_df = pd.DataFrame(USArrests_scaled,\n",
111
- " columns=USArrests.columns,\n",
112
- " index=USArrests.index)\n",
113
- "\n",
114
- "print(\"\\nAfter scaling - all variables now have similar ranges:\")\n",
115
- "print(scaled_df.describe())\n",
116
- "```"
117
- ],
118
- "metadata": {
119
- "id": "zQ3VowYNXLeQ"
120
- },
121
- "id": "zQ3VowYNXLeQ",
122
- "execution_count": null,
123
- "outputs": []
124
- },
125
- {
126
- "cell_type": "markdown",
127
- "source": [
128
- "### Your Analysis\n",
129
- "1. **Before scaling**: Which variable would dominate the clustering? Why?\n",
130
- "2. **After scaling**: Explain in simple terms what StandardScaler did to the data.\n",
131
- "\n",
132
- "**Deliverable**: One paragraph explaining why a policy analyst should care about data scaling.\n",
133
- "\n",
134
- "---\n",
135
- "\n",
136
- "## Exercise 3: Finding the Right Number of Groups\n",
137
- "**Time: 20 minutes | Product: Recommendation with Visual Evidence**\n",
138
- "\n",
139
- "### Your Task\n",
140
- "Use the elbow method to determine how many distinct crime profiles exist among US states.\n"
141
- ],
142
- "metadata": {
143
- "id": "FnOT700SXLPh"
144
- },
145
- "id": "FnOT700SXLPh"
146
- },
147
- {
148
- "cell_type": "code",
149
- "source": [
150
- "```python\n",
151
- "# Test different numbers of clusters\n",
152
- "inertias = []\n",
153
- "K_values = range(1, 11)\n",
154
- "\n",
155
- "for k in K_values:\n",
156
- " kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)\n",
157
- " kmeans.fit(USArrests_scaled)\n",
158
- " inertias.append(kmeans.inertia_)\n",
159
- "\n",
160
- "# Create the elbow plot\n",
161
- "plt.figure(figsize=(10, 6))\n",
162
- "plt.plot(K_values, inertias, 'bo-', linewidth=2, markersize=8)\n",
163
- "plt.xlabel('Number of Clusters (K)')\n",
164
- "plt.ylabel('Within-Cluster Sum of Squares')\n",
165
- "plt.title('Finding the Optimal Number of State Crime Profiles')\n",
166
- "plt.grid(True, alpha=0.3)\n",
167
- "plt.show()\n",
168
- "\n",
169
- "# Print the inertia values\n",
170
- "for k, inertia in zip(K_values, inertias):\n",
171
- " print(f\"K={k}: Inertia = {inertia:.1f}\")\n",
172
- "```"
173
- ],
174
- "metadata": {
175
- "id": "zOQrS9lmXpTF"
176
- },
177
- "id": "zOQrS9lmXpTF",
178
- "execution_count": null,
179
- "outputs": []
180
- },
181
- {
182
- "cell_type": "markdown",
183
- "id": "2e388ef2",
184
- "metadata": {
185
- "id": "2e388ef2"
186
- },
187
- "source": [
188
- "### Your Decision\n",
189
- "Based on your elbow plot:\n",
190
- "1. **What value of K do you recommend?** (Look for the \"elbow\" where the line starts to flatten)\n",
191
- "2. **What does this mean in policy terms?** (How many distinct types of state crime profiles exist?)\n",
192
- "\n",
193
- "**Deliverable**: A one-paragraph recommendation with your chosen K value and reasoning.\n",
194
- "\n",
195
- "---\n",
196
- "\n",
197
- "## Exercise 4: K-Means State Profiling\n",
198
- "**Time: 25 minutes | Product: State Crime Profile Report**\n",
199
- "\n",
200
- "### Your Task\n",
201
- "Create distinct crime profiles and identify which states belong to each category.\n",
202
- "\n",
203
- "\n",
204
- "\n",
205
- "\n"
206
- ]
207
- },
208
- {
209
- "cell_type": "code",
210
- "source": [
211
- "```python\n",
212
- "# Use your chosen K value from Exercise 3\n",
213
- "optimal_k = 4 # Replace with your chosen value\n",
214
- "\n",
215
- "# Perform K-means clustering\n",
216
- "kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=20)\n",
217
- "cluster_labels = kmeans.fit_predict(USArrests_scaled)\n",
218
- "\n",
219
- "# Add cluster labels to original data\n",
220
- "USArrests_clustered = USArrests.copy()\n",
221
- "USArrests_clustered['Cluster'] = cluster_labels\n",
222
- "\n",
223
- "# Analyze each cluster\n",
224
- "print(\"State Crime Profiles Analysis\")\n",
225
- "print(\"=\" * 50)\n",
226
- "\n",
227
- "for cluster_num in range(optimal_k):\n",
228
- " cluster_states = USArrests_clustered[USArrests_clustered['Cluster'] == cluster_num]\n",
229
- " print(f\"\\nCLUSTER {cluster_num}: {len(cluster_states)} states\")\n",
230
- " print(\"States:\", \", \".join(cluster_states.index.tolist()))\n",
231
- " print(\"Average characteristics:\")\n",
232
- " avg_profile = cluster_states[['Murder', 'Assault', 'UrbanPop', 'Rape']].mean()\n",
233
- " for var, value in avg_profile.items():\n",
234
- " print(f\" {var}: {value:.1f}\")\n",
235
- "```"
236
- ],
237
- "metadata": {
238
- "id": "_5b0nE6KXv1P"
239
- },
240
- "id": "_5b0nE6KXv1P",
241
- "execution_count": null,
242
- "outputs": []
243
- },
244
- {
245
- "cell_type": "markdown",
246
- "source": [
247
- "### Your Analysis\n",
248
- "For each cluster, create a profile:\n",
249
- "\n",
250
- "**Cluster 0: \"[Your Creative Name]\"**\n",
251
- "- **States**: [List them]\n",
252
- "- **Characteristics**: [Describe the pattern]\n",
253
- "- **Policy Insight**: [What should federal agencies know about these states?]\n",
254
- "\n",
255
- "**Deliverable**: A table summarizing each cluster with creative names and policy recommendations.\n",
256
- "\n",
257
- "---\n",
258
- "\n",
259
- "## Exercise 5: Hierarchical Clustering Exploration\n",
260
- "**Time: 25 minutes | Product: Family Tree Interpretation**\n",
261
- "\n",
262
- "### Your Task\n",
263
- "Create a dendrogram to understand how states naturally group together.\n"
264
- ],
265
- "metadata": {
266
- "id": "J1WVGb_nX4ye"
267
- },
268
- "id": "J1WVGb_nX4ye"
269
- },
270
- {
271
- "cell_type": "code",
272
- "source": [
273
- "```python\n",
274
- "from scipy.cluster.hierarchy import dendrogram, linkage\n",
275
- "\n",
276
- "# Create hierarchical clustering\n",
277
- "linkage_matrix = linkage(USArrests_scaled, method='complete')\n",
278
- "\n",
279
- "# Plot the dendrogram\n",
280
- "plt.figure(figsize=(15, 8))\n",
281
- "dendrogram(linkage_matrix,\n",
282
- " labels=USArrests.index.tolist(),\n",
283
- " leaf_rotation=90,\n",
284
- " leaf_font_size=10)\n",
285
- "plt.title('State Crime Pattern Family Tree')\n",
286
- "plt.xlabel('States')\n",
287
- "plt.ylabel('Distance Between Groups')\n",
288
- "plt.tight_layout()\n",
289
- "plt.show()\n",
290
- "```"
291
- ],
292
- "metadata": {
293
- "id": "Y9a_cbZKX7QX"
294
- },
295
- "id": "Y9a_cbZKX7QX",
296
- "execution_count": null,
297
- "outputs": []
298
- },
299
- {
300
- "cell_type": "markdown",
301
- "source": [
302
- "### Your Interpretation\n",
303
- "1. **Closest Pairs**: Which two states are most similar in crime patterns?\n",
304
- "2. **Biggest Divide**: Where is the largest split in the tree? What does this represent?\n",
305
- "3. **Surprising Neighbors**: Which states cluster together that surprised you geographically?\n",
306
- "\n",
307
- "### Code to Compare Methods"
308
- ],
309
- "metadata": {
310
- "id": "0PaImqZtX6f3"
311
- },
312
- "id": "0PaImqZtX6f3"
313
- },
314
- {
315
- "cell_type": "code",
316
- "source": [
317
- "```python\n",
318
- "# Compare your K-means results with hierarchical clustering\n",
319
- "from scipy.cluster.hierarchy import fcluster\n",
320
- "\n",
321
- "# Cut the tree to get the same number of clusters as K-means\n",
322
- "hierarchical_labels = fcluster(linkage_matrix, optimal_k, criterion='maxclust') - 1\n",
323
- "\n",
324
- "# Create comparison\n",
325
- "comparison_df = pd.DataFrame({\n",
326
- " 'State': USArrests.index,\n",
327
- " 'K_Means_Cluster': cluster_labels,\n",
328
- " 'Hierarchical_Cluster': hierarchical_labels\n",
329
- "})\n",
330
- "\n",
331
- "print(\"Comparison of K-Means vs Hierarchical Clustering:\")\n",
332
- "print(comparison_df.sort_values('State'))\n",
333
- "\n",
334
- "# Count agreements\n",
335
- "agreements = sum(comparison_df['K_Means_Cluster'] == comparison_df['Hierarchical_Cluster'])\n",
336
- "print(f\"\\nMethods agreed on {agreements} out of {len(comparison_df)} states ({agreements/len(comparison_df)*100:.1f}%)\")\n",
337
- "```"
338
- ],
339
- "metadata": {
340
- "id": "tJQ-C5GFYBRT"
341
- },
342
- "id": "tJQ-C5GFYBRT",
343
- "execution_count": null,
344
- "outputs": []
345
- },
346
- {
347
- "cell_type": "markdown",
348
- "source": [
349
- "**Deliverable**: A paragraph explaining the key differences between what K-means and hierarchical clustering revealed.\n",
350
- "\n",
351
- "---\n",
352
- "\n",
353
- "## Exercise 6: Policy Brief Creation\n",
354
- "**Time: 20 minutes | Product: Executive Summary**\n",
355
- "\n",
356
- "### Your Task\n",
357
- "Synthesize your findings into a policy brief for Department of Justice leadership.\n",
358
- "\n",
359
- "### Code Framework for Final Visualization"
360
- ],
361
- "metadata": {
362
- "id": "dx1fNhu4YD7-"
363
- },
364
- "id": "dx1fNhu4YD7-"
365
- },
366
- {
367
- "cell_type": "code",
368
- "source": [
369
- "```python\n",
370
- "# Create a comprehensive visualization\n",
371
- "fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))\n",
372
- "\n",
373
- "# Plot 1: Murder vs Assault by cluster\n",
374
- "colors = ['red', 'blue', 'green', 'orange', 'purple']\n",
375
- "for i in range(optimal_k):\n",
376
- " cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]\n",
377
- " ax1.scatter(cluster_data['Murder'], cluster_data['Assault'],\n",
378
- " c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)\n",
379
- "ax1.set_xlabel('Murder Rate')\n",
380
- "ax1.set_ylabel('Assault Rate')\n",
381
- "ax1.set_title('Murder vs Assault by Crime Profile')\n",
382
- "ax1.legend()\n",
383
- "ax1.grid(True, alpha=0.3)\n",
384
- "\n",
385
- "# Plot 2: Urban Population vs Rape by cluster\n",
386
- "for i in range(optimal_k):\n",
387
- " cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]\n",
388
- " ax2.scatter(cluster_data['UrbanPop'], cluster_data['Rape'],\n",
389
- " c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)\n",
390
- "ax2.set_xlabel('Urban Population %')\n",
391
- "ax2.set_ylabel('Rape Rate')\n",
392
- "ax2.set_title('Urban Population vs Rape Rate by Crime Profile')\n",
393
- "ax2.legend()\n",
394
- "ax2.grid(True, alpha=0.3)\n",
395
- "\n",
396
- "# Plot 3: Cluster size comparison\n",
397
- "cluster_sizes = USArrests_clustered['Cluster'].value_counts().sort_index()\n",
398
- "ax3.bar(range(len(cluster_sizes)), cluster_sizes.values, color=colors[:len(cluster_sizes)])\n",
399
- "ax3.set_xlabel('Cluster Number')\n",
400
- "ax3.set_ylabel('Number of States')\n",
401
- "ax3.set_title('Number of States in Each Crime Profile')\n",
402
- "ax3.set_xticks(range(len(cluster_sizes)))\n",
403
- "\n",
404
- "# Plot 4: Average crime rates by cluster\n",
405
- "cluster_means = USArrests_clustered.groupby('Cluster')[['Murder', 'Assault', 'Rape']].mean()\n",
406
- "cluster_means.plot(kind='bar', ax=ax4)\n",
407
- "ax4.set_xlabel('Cluster Number')\n",
408
- "ax4.set_ylabel('Average Rate')\n",
409
- "ax4.set_title('Average Crime Rates by Profile')\n",
410
- "ax4.legend()\n",
411
- "ax4.tick_params(axis='x', rotation=0)\n",
412
- "\n",
413
- "plt.tight_layout()\n",
414
- "plt.show()\n",
415
- "```"
416
- ],
417
- "metadata": {
418
- "id": "N8bkxURpYHJF"
419
- },
420
- "id": "N8bkxURpYHJF",
421
- "execution_count": null,
422
- "outputs": []
423
- },
424
- {
425
- "cell_type": "markdown",
426
- "source": [
427
- "### Your Policy Brief Template\n",
428
- "\n",
429
- "**EXECUTIVE SUMMARY: US State Crime Profile Analysis**\n",
430
- "\n",
431
- "**Key Findings:**\n",
432
- "- We identified [X] distinct crime profiles among US states\n",
433
- "- [State examples] represent the highest-risk profile\n",
434
- "- [State examples] represent the lowest-risk profile\n",
435
- "- Urban population [does/does not] strongly correlate with violent crime\n",
436
- "\n",
437
- "**Policy Recommendations:**\n",
438
- "1. **High-Priority States**: [List and explain why]\n",
439
- "2. **Resource Allocation**: [Suggest how to distribute federal crime prevention funds]\n",
440
- "3. **Best Practice Sharing**: [Which states should learn from which others?]\n",
441
- "\n",
442
- "**Methodology Note**: Analysis used unsupervised clustering on 4 crime variables across 50 states, with data standardization to ensure fair comparison.\n",
443
- "\n",
444
- "**Deliverable**: A complete 1-page policy brief with your clustering insights and specific recommendations.\n"
445
- ],
446
- "metadata": {
447
- "id": "rAy_Ye0WYLK0"
448
- },
449
- "id": "rAy_Ye0WYLK0"
450
- }
451
- ],
452
- "metadata": {
453
- "jupytext": {
454
- "cell_metadata_filter": "-all",
455
- "formats": "Rmd,ipynb",
456
- "main_language": "python"
457
- },
458
- "kernelspec": {
459
- "display_name": "Python 3 (ipykernel)",
460
- "language": "python",
461
- "name": "python3"
462
- },
463
- "language_info": {
464
- "codemirror_mode": {
465
- "name": "ipython",
466
- "version": 3
467
- },
468
- "file_extension": ".py",
469
- "mimetype": "text/x-python",
470
- "name": "python",
471
- "nbconvert_exporter": "python",
472
- "pygments_lexer": "ipython3",
473
- "version": "3.10.4"
474
- },
475
- "colab": {
476
- "provenance": []
477
- }
478
- },
479
- "nbformat": 4,
480
- "nbformat_minor": 5
481
- }
 
Reference files/week 7/Week7_Clustering Curriculum.docx DELETED
Binary file (18.4 kB)
 
Reference files/week 7/Week7_Clustering Learning Objectives.docx DELETED
Binary file (11.4 kB)
 
Reference files/week 7/w7_curriculum DELETED
@@ -1,178 +0,0 @@
1
- Unsupervised Learning: K-means and Hierarchical Clustering
2
- 1. Course Overview
3
- The State Safety Profile Challenge
4
- In this week, we'll explore unsupervised machine learning through a compelling real-world challenge: Understanding crime patterns across US states without any predetermined categories.
5
- Unsupervised Learning: A type of machine learning where we find hidden patterns in data without being told what to look for. Think of it like being a detective who examines evidence without knowing what crime was committed - you're looking for patterns and connections that emerge naturally from the data.
6
- Example: Instead of being told "find violent states vs. peaceful states," unsupervised learning lets the data reveal its own natural groupings, like "states with high murder but low assault" or "urban states with moderate crime."
7
- Imagine you're a policy researcher working with the FBI's crime statistics. You have data on violent crime rates across all 50 US states - murder rates, assault rates, urban population percentages, and rape statistics. But here's the key challenge: you don't know how states naturally group together in terms of crime profiles.
8
- Your Mission: Discover hidden patterns in state crime profiles without any predefined classifications!
9
- The Challenge: Without any predetermined safety categories, you need to:
10
- ● Uncover natural groupings of states based on their crime characteristics
11
- ● Identify which crime factors tend to cluster together
12
- ● Understand regional patterns that might not follow obvious geographic boundaries
13
- ● Find states with surprisingly similar or different crime profiles
14
- Cluster: A group of similar things. In our case, states that have similar crime patterns naturally group together in a cluster.
15
- Example: You might discover that Alaska, Nevada, and Florida cluster together because they all have high crime rates despite being in different regions of the country.
16
- Why This Matters: Traditional approaches might group states by region (South, Northeast, etc.) or population size. But what if crime patterns reveal different natural groupings? What if some Southern states cluster more closely with Western states based on crime profiles? What if urban percentage affects crime differently than expected?
17
- Urban Percentage: The proportion of a state's population that lives in cities rather than rural areas.
18
- Example: New York has a high urban percentage (87%) while Wyoming has a low urban percentage (29%).
19
- What You'll Discover Through This Challenge
20
- ● Hidden State Safety Types: Use clustering to identify groups of states with similar crime profiles
21
- ● Crime Pattern Relationships: Find unexpected connections between different types of violent crime
22
- ● Urban vs. Rural Effects: Discover how urbanization relates to different crime patterns
23
- ● Policy Insights: Understand which states face similar challenges and might benefit from shared approaches
24
- Clustering: The process of grouping similar data points together. It's like organizing your music library - songs naturally group by genre, but clustering might reveal unexpected groups like "workout songs" or "rainy day music" that cross traditional genre boundaries.
25
- Core Techniques We'll Master
26
- K-Means Clustering: A method that divides data into exactly K groups (where you choose the number K). It's like being asked to organize 50 students into exactly 4 study groups based on their academic interests.
27
- Hierarchical Clustering: A method that creates a tree-like structure showing how data points relate to each other at different levels. It's like a family tree, but for data - showing which states are "cousins" and which are "distant relatives" in terms of crime patterns.
28
- Both K-Means and Hierarchical Clustering are examples of unsupervised learning.
29
-
30
- 2. K-Means Clustering
31
-
32
- What it does: Divides data into exactly K groups by finding central points (centroids).
33
- Central Points (Centroids): The "center" or average point of each group. Think of it like the center of a basketball team huddle - it's the point that best represents where all the players are standing.
34
- Example: If you have a cluster of high-crime states, the centroid might represent "average murder rate of 8.5, average assault rate of 250, average urban population of 70%."
35
- USArrests Example: Analyzing crime data across 50 states, you might discover 4 distinct state safety profiles:
36
- ● High Crime States (above average in murder, assault, and rape rates)
37
- ● Urban Safe States (high urban population but lower violent crime rates)
38
- ● Rural Traditional States (low urban population, moderate crime rates)
39
- ● Mixed Profile States (high in some crime types but not others)
40
- How to Read K-Means Results:
41
- ● Scatter Plot: Points (states) colored by cluster membership
42
- ○ Well-separated colors indicate distinct state profiles
43
- ○ Mixed colors suggest overlapping crime patterns
44
- ● Cluster Centers: Average crime characteristics of each state group
45
- ● Elbow Plot: Helps choose optimal number of state groupings
46
- Cluster Membership: Which group each data point belongs to. Like being assigned to a team - each state gets assigned to exactly one crime profile group.
47
- Example: Texas might be assigned to "High Crime States" while Vermont is assigned to "Rural Traditional States."
48
- Scatter Plot: A graph where each point represents one observation (in our case, one state). Points that are close together have similar characteristics.
49
- Elbow Plot: A graph that helps you choose the right number of clusters. It's called "elbow" because you look for a bend in the line that looks like an elbow joint.
50
- Key Parameters:
51
- python
52
- # Essential parameters from the lab
53
- KMeans(
54
- n_clusters=4, # Number of state safety profiles to discover
55
- random_state=42, # For reproducible results
56
- n_init=20 # Run algorithm 20 times, keep best result
57
- )
58
- Parameters: Settings that control how the algorithm works. Like settings on your phone - you can adjust them to get different results.
59
- n_clusters: How many groups you want to create. You have to decide this ahead of time.
60
- random_state: A number that ensures you get the same results every time you run the analysis. Like setting a specific starting point so everyone gets the same answer.
61
- n_init: How many times to run the algorithm. The computer tries multiple starting points and picks the best result. More tries = better results.
62
-
63
- 3. Hierarchical Clustering
64
- What it does: Creates a tree structure (dendrogram) showing how data points group together at different levels.
65
- Dendrogram: A tree-like diagram that shows how groups form at different levels. Think of it like a family tree, but for data. At the bottom are individuals (states), and as you go up, you see how they group into families, then extended families, then larger clans.
66
- Example: At the bottom level, you might see Vermont and New Hampshire grouped together. Moving up, they might join with Maine to form a "New England Low Crime" group. Moving up further, this group might combine with other regional groups.
67
- USArrests Example: Analyzing state crime patterns might reveal:
68
- ● Level 1: High Crime vs. Low Crime states
69
- ● Level 2: Within high crime: Urban-driven vs. Rural-driven crime patterns
70
- ● Level 3: Within urban-driven: Assault-heavy vs. Murder-heavy profiles
71
- How to Read Dendrograms:
72
- ● Height: Distance between groups when they merge
73
- ○ Higher merges = very different crime profiles
74
- ○ Lower merges = similar crime patterns
75
- ● Branches: Each split shows a potential state grouping
76
- ● Cutting the Tree: Draw a horizontal line to create clusters
77
- Height: In a dendrogram, height represents how different two groups are. Think of it like difficulty level - it takes more "effort" (higher height) to combine very different groups.
78
- Example: Combining two very similar states (like Vermont and New Hampshire) happens at low height. Combining very different groups (like "High Crime States" and "Low Crime States") happens at high height.
79
- Cutting the Tree: Drawing a horizontal line across the dendrogram to create a specific number of groups. Like slicing a layer cake - where you cut determines how many pieces you get.
80
- Three Linkage Methods:
81
- ● Complete Linkage: Measures distance between most different states (good for distinct profiles)
82
- ● Average Linkage: Uses average distance between all states (balanced approach)
83
- ● Single Linkage: Uses closest states (tends to create chains, often less useful)
84
- Linkage Methods: Different ways to measure how close or far apart groups are. It's like different ways to measure the distance between two cities - you could use the distance between the farthest suburbs (complete), the average distance between all neighborhoods (average), or the distance between the closest points (single).
85
- Example: When deciding if "High Crime Group" and "Medium Crime Group" should merge, complete linkage looks at the most different states between the groups, while average linkage looks at the typical difference.
86
- Choosing Between K-Means and Hierarchical:
87
- ● Use K-Means when: You want to segment states into specific number of safety categories for policy targeting
88
- ● Use Hierarchical when: You want to explore the natural structure of crime patterns without assumptions
89
- Segmentation: Dividing your data into groups for specific purposes. Like organizing students into study groups - you might want exactly 4 groups so each has a teaching assistant.
90
- Exploratory Analysis: Looking at data to discover patterns without knowing what you'll find. Like being an explorer in uncharted territory - you're not looking for a specific destination, just seeing what interesting things you can discover.
91
-
92
- 4. Data Exploration
93
- Step 1: Understanding Your Data
94
- Essential Checks (from the USArrests example):
95
- python
96
- # Check the basic structure
97
- print(data.shape) # How many observations and variables?
98
- print(data.columns) # What variables do you have?
99
- print(data.head()) # What do the first few rows look like?
100
-
101
- # Examine the distribution
102
- print(data.mean()) # Average values
103
- print(data.var()) # Variability
104
- print(data.describe()) # Full statistical summary
105
- Observations: Individual data points we're studying. In our case, each of the 50 US states is one observation.
106
- Variables: The characteristics we're measuring for each observation. In USArrests, we have 4 variables: Murder rate, Assault rate, Urban Population percentage, and Rape rate.
107
- Example: For California (one observation), we might have Murder=9.0, Assault=276, UrbanPop=91, Rape=40.6 (four variables).
108
- Distribution: How values are spread out. Like looking at test scores in a class - are most scores clustered around the average, or spread out widely?
109
- Variability (Variance): How much the values differ from each other. High variance means values are spread out; low variance means they're clustered together.
110
- Why This Matters: The USArrests data showed vastly different scales:
111
- ● Murder: Average 7.8, Variance 19
112
- ● Assault: Average 170.8, Variance 6,945
113
- ● This scale difference would dominate any analysis without preprocessing
114
- Scales: The range and units of measurement for different variables. Like comparing dollars ($50,000 salary) to percentages (75% approval rating) - they're measured very differently.
115
- Example: Assault rates are in the hundreds (like 276 per 100,000) while murder rates are single digits (like 7.8 per 100,000). Without adjustment, assault would seem much more important just because the numbers are bigger.
116
- Step 2: Data Preprocessing
117
- Standardization (Critical for clustering):
118
- python
119
- from sklearn.preprocessing import StandardScaler
120
-
121
- # Always scale when variables have different units
122
- scaler = StandardScaler()
123
- data_scaled = scaler.fit_transform(data)
124
- Standardization: Converting all variables to the same scale so they can be fairly compared. Like converting all measurements to the same units - instead of comparing feet to meters, you convert everything to inches.
125
- StandardScaler: A tool that transforms data so each variable has an average of 0 and standard deviation of 1. Think of it like grading on a curve - it makes all variables equally important.
126
- Example: After standardization, a murder rate of 7.8 might become 0.2, and an assault rate of 276 might become 1.5. Now they're on comparable scales.
127
- When to Scale:
128
- ● ✅ Always scale when variables have different units (dollars vs. percentages)
129
- ● ✅ Scale when variances differ by orders of magnitude
130
- ● ❓ Consider not scaling when all variables are in the same meaningful units
131
- Orders of Magnitude: When one number is 10 times, 100 times, or 1000 times bigger than another. In USArrests, assault variance (6,945) is about 365 times bigger than murder variance (19) - that's two orders of magnitude difference.
132
- Step 3: Exploratory Analysis
133
- For K-Means Clustering:
134
- python
135
- # Try different numbers of clusters to find optimal K
136
- inertias = []
137
- K_range = range(1, 11)
138
- for k in K_range:
139
- kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
140
- kmeans.fit(data_scaled)
141
- inertias.append(kmeans.inertia_)
142
-
143
- # Plot elbow curve
144
- plt.plot(K_range, inertias, 'bo-')
145
- plt.xlabel('Number of Clusters (K)')
146
- plt.ylabel('Within-Cluster Sum of Squares')
147
- plt.title('Elbow Method for Optimal K')
148
- Inertias: A measure of how tightly grouped each cluster is. Lower inertia means points in each cluster are closer together (better clustering). It's like measuring how close teammates stand to each other - closer teammates indicate better team cohesion.
149
- Within-Cluster Sum of Squares: The total distance from each point to its cluster center. Think of it as measuring how far each student sits from their group's center - smaller distances mean tighter, more cohesive groups.
150
- Elbow Method: A technique for choosing the best number of clusters. You plot the results and look for the "elbow" - the point where adding more clusters doesn't help much anymore.
151
- For Hierarchical Clustering:
152
- python
153
- # Create dendrogram to explore natural groupings
154
- from sklearn.cluster import AgglomerativeClustering
155
- from ISLP.cluster import compute_linkage
156
- from scipy.cluster.hierarchy import dendrogram
157
-
158
- hc = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='complete')
159
- hc.fit(data_scaled)
160
- linkage_matrix = compute_linkage(hc)
161
-
162
- plt.figure(figsize=(12, 8))
163
- dendrogram(linkage_matrix, color_threshold=-np.inf, above_threshold_color='black')
164
- plt.title('Hierarchical Clustering Dendrogram')
165
- AgglomerativeClustering: A type of hierarchical clustering that starts with individual points and gradually combines them into larger groups. Like building a pyramid from the bottom up.
166
- distance_threshold=0: A setting that tells the algorithm to build the complete tree structure without stopping early.
167
- Linkage Matrix: A mathematical representation of how the tree structure was built. Think of it as the blueprint showing how the dendrogram was constructed.
168
- Step 4: Validation Questions
169
- Before proceeding with analysis, ask:
170
- 1. Do the variables make sense together? (e.g., don't cluster height with income)
171
- 2. Are there obvious outliers that need attention?
172
- 3. Do you have enough data points? (Rule of thumb: at least 10x more observations than variables)
173
- 4. Are there missing values that need handling?
174
- Outliers: Data points that are very different from all the others. Like a 7-foot-tall person in a group of average-height people - they're so different they might skew your analysis.
175
- Example: If most states have murder rates between 1-15, but one state has a rate of 50, that's probably an outlier that needs special attention.
176
- Missing Values: Data points where we don't have complete information. Like a student who didn't take one of the tests - you need to decide how to handle that gap in the data.
177
- Rule of Thumb: A general guideline that works in most situations. For clustering, having at least 10 times more observations than variables helps ensure reliable results.
178
-
 
app/__pycache__/main.cpython-311.pyc CHANGED
Binary files a/app/__pycache__/main.cpython-311.pyc and b/app/__pycache__/main.cpython-311.pyc differ
 
app/main.py CHANGED
@@ -24,6 +24,7 @@ from app.pages import week_4
 from app.pages import week_5
 from app.pages import week_6
 from app.pages import week_7
+from app.pages import week_8
 # Page configuration
 st.set_page_config(
     page_title="Data Science Course App",
@@ -165,6 +166,8 @@ def show_week_content():
         week_6.show()
     elif st.session_state.current_week == 7:
         week_7.show()
+    elif st.session_state.current_week == 8:
+        week_8.show()
     else:
         st.warning("Content for this week is not yet available.")
 
@@ -177,7 +180,7 @@ def main():
         return
 
     # User is logged in, show course content
-    if st.session_state.current_week in [1, 2, 3, 4, 5, 6, 7]:
+    if st.session_state.current_week in [1, 2, 3, 4, 5, 6, 7, 8]:
         show_week_content()
     else:
         st.title("Data Science Research Paper Course")
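The routing change above relies on a simple convention: every week module exposes a `show()` function that renders its page. A minimal sketch of that contract, assuming only the Streamlit calls already used in the app (the module and title names here are placeholders):

```python
# app/pages/week_N.py - placeholder sketch of the interface main.py expects
import streamlit as st

def show():
    # main.py calls week_N.show() once the user selects that week,
    # so each module only needs to render its own content here.
    st.title("Week N: Placeholder Title")
    st.markdown("Week content goes here.")
```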
app/pages/__pycache__/week_8.cpython-311.pyc ADDED
Binary file (27.8 kB)
 
app/pages/week_8.py ADDED
@@ -0,0 +1,564 @@
 
1
+ import streamlit as st
2
+ import pandas as pd
3
+ import numpy as np
4
+ import matplotlib.pyplot as plt
5
+ import seaborn as sns
6
+ import plotly.express as px
7
+ import plotly.graph_objects as go
8
+ from plotly.subplots import make_subplots
9
+
10
+ def show():
11
+ st.title("Week 8: Research Paper Writing and LaTeX")
12
+
13
+ # Introduction
14
+ st.header("Learning Objectives")
15
+ st.markdown("""
16
+ By the end of this week, you will be able to:
17
+
18
+ **Remember (Knowledge):**
19
+ - Recall LaTeX syntax for document structure, figures, citations, and spacing
20
+ - Identify components of ML research papers (introduction, methods, results, conclusion, limitations)
21
+ - Recognize standard formatting requirements for academic conferences and journals
22
+
23
+ **Understand (Comprehension):**
24
+ - Describe the purpose and audience for each section of a research paper
25
+
26
+ **Apply (Application):**
27
+ - Format complete research papers in LaTeX with proper figures, tables, and citations
28
+ - Write clear methodology sections with sufficient detail for reproducibility
29
+ - Present experimental results using appropriate visualizations and statistical analysis
30
+
31
+ **Analyze (Analysis):**
32
+ - Diagnose LaTeX formatting issues and resolve compilation errors
33
+ - Examine related work to identify research gaps and position your contributions
34
+ - Compare your methodology with existing approaches
35
+
36
+ **Evaluate (Evaluation):**
37
+ - Critically assess the validity and reliability of your experimental design
38
+ - Evaluate the clarity and persuasiveness of your written arguments
39
+
40
+ **Create (Synthesis):**
41
+ - Produce research papers
42
+ - Develop compelling visualizations that effectively communicate complex ML concepts
43
+ - Synthesize technical knowledge into coherent research narratives
44
+ """)
45
+
46
+ # Module 1: Research Paper Architecture
47
+ st.header("Module 1: Research Paper Architecture")
48
+
49
+ st.markdown("""
50
+ Every section of your paper must answer specific questions that reviewers ask. Think of your paper as a conversation
51
+ with skeptical experts who need convincing.
52
+ """)
53
+
54
+ # Paper Structure Table
55
+ st.subheader("Research Paper Structure")
56
+
57
+ paper_structure = {
58
+ "Section": ["🔥 Introduction", "🔬 Methods", "📊 Results", "🎯 Conclusion", "⚠️ Limitations"],
59
+ "Key Problems/Focus": [
60
+ "What problem are you solving? Why does it matter? How is your approach different?",
61
+ "How did you collect data? What analysis techniques? Can others replicate this?",
62
+ "What concrete findings emerged? How do they address your research questions?",
63
+ "What's the key takeaway? How does this advance the field? What are practical implications?",
64
+ "What are honest constraints? What biases might exist? What couldn't you address?"
65
+ ],
66
+ "Aim For": [
67
+ "Compelling motivation",
68
+ "Rigorous reproducibility",
69
+ "Clear evidence",
70
+ "Lasting impact",
71
+ "Honest transparency"
72
+ ]
73
+ }
74
+
75
+ st.dataframe(pd.DataFrame(paper_structure))
76
+
77
+ # Detailed Section Guidelines
78
+ st.subheader("Detailed Section Guidelines")
79
+
80
+ # Introduction Section
81
+ with st.expander("🔥 Introduction: Building Compelling Motivation"):
82
+ st.markdown("""
83
+ **What is it:** The introduction is your paper's first impression and often determines whether reviewers continue reading.
84
+
85
+ **Why this matters:** A weak introduction leads to immediate rejection, regardless of how brilliant your technical contribution might be.
86
+
87
+ **What to do:**
88
+ 1. Use the "inverted pyramid" approach
89
+ 2. Start with broad context, then narrow to specific problem
90
+ 3. Clearly articulate the gap in existing solutions
91
+ 4. Present your approach as a logical response
92
+ 5. Conclude with explicit contributions (3-4 bullet points)
93
+
94
+ **Example Structure:**
95
+ ```
96
+ 1. Broad context about the field
97
+ 2. Specific problem you're addressing
98
+ 3. Gap in existing solutions
99
+ 4. Your approach as response to gap
100
+ 5. Explicit contributions
101
+ ```
102
+ """)
103
+
104
+ # Methods Section
105
+ with st.expander("🔬 Methods: Ensuring Rigorous Reproducibility"):
106
+ st.markdown("""
107
+ **What is it:** The methods section has evolved from simple description to detailed documentation that enables complete replication.
108
+
109
+ **Why this matters:** Irreproducible research wastes community resources and undermines scientific credibility.
110
+
111
+ **What to document:**
112
+ - Dataset specifics (exact version, preprocessing steps, train/validation/test splits)
113
+ - Model architecture details (layer sizes, activation functions, initialization schemes)
114
+ - Training procedures (optimization algorithm, learning rate schedules, batch sizes)
115
+ - Computational environment (hardware specifications, software versions, random seeds) - see the sketch below
116
+
117
+ **Write as if creating a recipe** that a competent colleague could follow to recreate your exact results.
118
+ """)
119
+
120
+ # Results Section
121
+ with st.expander("📊 Results: Presenting Clear Evidence"):
122
+ st.markdown("""
123
+ **What is it:** The results section synthesizes your raw findings into compelling evidence for your claims.
124
+
125
+ **Why this matters:** This section proves whether your methodology actually works and answers your research questions.
126
+
127
+ **What to do:**
128
+ 1. Organize results logically (general performance to specific analyses)
129
+ 2. Start with overall model performance using standard metrics
130
+ 3. Include detailed comparisons, ablation studies, and error analysis
131
+ 4. Use clear visualizations with appropriate error bars
132
+ 5. Report negative results honestly
133
+ 6. Connect each finding back to your original research questions
134
+ """)
135
+
136
+ # Conclusion Section
137
+ with st.expander("🎯 Conclusion: Creating Lasting Impact"):
138
+ st.markdown("""
139
+ **What is it:** The conclusion shapes how the research community understands and remembers your contribution.
140
+
141
+ **Why this matters:** Your technical contribution only matters if others can understand its significance and apply it.
142
+
143
+ **What to do:**
144
+ 1. Begin with concise summary of key findings (2-3 sentences)
145
+ 2. State how findings advance theoretical understanding or practical applications
146
+ 3. Discuss broader implications beyond your specific problem domain
147
+ 4. Suggest concrete directions for future research
148
+ 5. Balance confidence with humility about scope
149
+ """)
150
+
151
+ # Limitations Section
152
+ with st.expander("⚠️ Limitations: Demonstrating Honest Transparency"):
153
+ st.markdown("""
154
+ **What is it:** Acknowledging limitations shows scientific maturity and helps readers appropriately interpret your findings.
155
+
156
+ **Why this matters:** Every study has constraints, and attempting to hide them makes reviewers suspicious.
157
+
158
+ **Three types of limitations to address:**
159
+ 1. **Scope limitations:** What populations, contexts, or problem types might your results not apply to?
160
+ 2. **Methodological constraints:** Sample size issues, measurement limitations, or experimental design trade-offs
161
+ 3. **Potential biases:** Dataset bias, researcher bias, or systematic errors in your approach
162
+
163
+ **For each limitation:** Explain potential impact and suggest how future work could address it.
164
+ """)
165
+
166
+ # Quick Reference Framework
167
+ st.subheader("Quick Reference Framework")
168
+ st.markdown("""
169
+ **Title → Problem → Gap → Method → Findings → Impact → Limitations**
170
+
171
+ This progression ensures logical flow and helps readers follow your research narrative from motivation through contribution to appropriate interpretation.
172
+ """)
173
+
174
+ # Module 2: LaTeX Introduction
175
+ st.header("Module 2: Introduction to LaTeX")
176
+
177
+ st.markdown("""
178
+ **What is LaTeX?**
179
+
180
+ Think of LaTeX as a sophisticated word processor that works differently from Microsoft Word or Google Docs.
181
+ Instead of clicking buttons to format text, you write commands that tell the computer how to format your document.
182
+ """)
183
+
184
+ # Why LaTeX
185
+ st.subheader("Why Learn LaTeX for Academic Writing?")
186
+
187
+ latex_benefits = {
188
+ "Benefit": [
189
+ "Professional appearance",
190
+ "Mathematical notation",
191
+ "Reference management",
192
+ "Industry standard"
193
+ ],
194
+ "Description": [
195
+ "LaTeX automatically handles spacing, fonts, and layout to meet academic standards",
196
+ "Essential for ML papers with equations and formulas",
197
+ "Automatically formats citations and bibliographies",
198
+ "Most computer science conferences and journals expect LaTeX submissions"
199
+ ]
200
+ }
201
+
202
+ st.dataframe(pd.DataFrame(latex_benefits))
203
+
204
+ # LaTeX Code Examples
205
+ st.subheader("LaTeX Code Examples")
206
+
207
+ # Basic Structure
208
+ with st.expander("Basic Document Structure"):
209
+ st.markdown("**LaTeX Code:**")
210
+ st.code("""
211
+ \\documentclass{article}
212
+ \\usepackage[utf8]{inputenc}
213
+ \\usepackage{graphicx}
214
+ \\title{Your Research Paper Title}
215
+ \\author{Your Name}
216
+ \\date{\\today}
217
+ \\begin{document}
218
+ \\maketitle
219
+ \\section{Introduction}
220
+ Your introduction text goes here.
221
+ \\section{Methods}
222
+ Your methods section goes here.
223
+ \\section{Results}
224
+ Your results section goes here.
225
+ \\section{Conclusion}
226
+ Your conclusion goes here.
227
+ \\end{document}
228
+ """, language="latex")
229
+
230
+ st.markdown("**Rendered Output:**")
231
+ st.markdown("""
232
+ <div style="border: 1px solid #ccc; padding: 40px; margin: 20px auto; background-color: white; font-family: 'Times New Roman', Times, serif; color: black; box-shadow: 0 0 10px rgba(0,0,0,0.1); max-width: 800px;">
233
+ <h1 style="text-align: center; font-size: 22px; font-weight: bold; margin-bottom: 10px; color: black;">Your Research Paper Title</h1>
234
+ <p style="text-align: center; font-size: 16px; margin-bottom: 30px; color: black;"><em>Your Name</em><br><em>Today's Date</em></p>
235
+ <h2 style="font-size: 18px; font-weight: bold; margin-top: 20px; margin-bottom: 10px; color: black;">1. Introduction</h2>
236
+ <p style="font-size: 16px; line-height: 1.6; color: black;">Your introduction text goes here.</p>
237
+ <h2 style="font-size: 18px; font-weight: bold; margin-top: 20px; margin-bottom: 10px; color: black;">2. Methods</h2>
238
+ <p style="font-size: 16px; line-height: 1.6; color: black;">Your methods section goes here.</p>
239
+ <h2 style="font-size: 18px; font-weight: bold; margin-top: 20px; margin-bottom: 10px; color: black;">3. Results</h2>
240
+ <p style="font-size: 16px; line-height: 1.6; color: black;">Your results section goes here.</p>
241
+ <h2 style="font-size: 18px; font-weight: bold; margin-top: 20px; margin-bottom: 10px; color: black;">4. Conclusion</h2>
242
+ <p style="font-size: 16px; line-height: 1.6; color: black;">Your conclusion goes here.</p>
243
+ </div>
244
+ """, unsafe_allow_html=True)
245
+
246
+ # Sections and Subsections
247
+ with st.expander("Creating Sections and Subsections"):
248
+ st.markdown("**LaTeX Code:**")
249
+ st.code("""
250
+ \\section{Introduction} % Creates: 1. Introduction
251
+ \\subsection{Background} % Creates: 1.1 Background
252
+ \\subsubsection{Deep Learning} % Creates: 1.1.1 Deep Learning
253
+
254
+ % Tip: Overleaf shows section structure in the left panel for easy navigation
255
+ """, language="latex")
256
+
257
+ st.markdown("**Rendered Output:**")
258
+ st.markdown("""
259
+ <div style="border: 1px solid #ddd; padding: 20px; background-color: white; font-family: 'Times New Roman', serif; color: black;">
260
+ <h2 style="color: black; font-size: 18px; font-weight: bold;">1. Introduction</h2>
261
+ <h3 style="color: black; font-size: 16px; font-weight: bold; padding-left: 20px;">1.1 Background</h3>
262
+ <h4 style="color: black; font-size: 16px; font-style: italic; padding-left: 40px;">1.1.1 Deep Learning</h4>
263
+ <p style="color: black; padding-left: 40px; margin-top: 10px;"><em>Tip: Overleaf shows section structure in the left panel for easy navigation</em></p>
264
+ </div>
265
+ """, unsafe_allow_html=True)
266
+
267
+ # Figures
268
+ with st.expander("Adding Figures"):
269
+ st.markdown("**LaTeX Code:**")
270
+ st.code("""
271
+ \\begin{figure}[h]
272
+ \\centering
273
+ \\includegraphics[width=0.8\\textwidth]{research_question.jpg}
274
+ \\caption{The cycle of research from practical problem to research answer.}
275
+ \\label{fig:research_cycle}
276
+ \\end{figure}
277
+
278
+ % Reference it in your text
279
+ Figure~\\ref{fig:research_cycle} shows the relationship between problems, questions, and answers.
280
+ """, language="latex")
281
+
282
+ st.markdown("**Rendered Output:**")
283
+
284
+ # Center the image using columns
285
+ col1, col2, col3 = st.columns([1, 2, 1])
286
+ with col2:
287
+ st.image("assets/Pictures/research_question.jpg", width=384, caption="Figure 1: The cycle of research from practical problem to research answer.")
288
+
289
+ st.markdown("""
290
+ <div style="background-color: white; font-family: 'Times New Roman', serif; color: black; padding: 0 20px 20px 20px;">
291
+ <p style="color: black; text-align: left;">Figure 1 shows the relationship between problems, questions, and answers.</p>
292
+ </div>
293
+ """, unsafe_allow_html=True)
294
+
295
+ # Citations and Bibliography
296
+ with st.expander("Citations and Bibliography"):
297
+ st.markdown("**LaTeX Code:**")
298
+ st.code("""
299
+ % In your main document
300
+ \\usepackage{biblatex}
301
+ \\addbibresource{sample.bib}
302
+
303
+ % Cite a reference
304
+ Our approach builds on recent work \\cite{einstein} and extends it by...
305
+
306
+ % Print bibliography
307
+ \\printbibliography
308
+
309
+ % In sample.bib file:
310
+ @article{einstein,
311
+ title={On the electrodynamics of moving bodies},
312
+ author={Einstein, Albert},
313
+ journal={Annalen der Physik},
314
+ volume={322},
315
+ number={10},
316
+ pages={891--921},
317
+ year={1905}
318
+ }
319
+ """, language="latex")
320
+
321
+ st.markdown("**Rendered Output:**")
322
+ st.markdown("""
323
+ <div style="border: 1px solid #ddd; padding: 20px; background-color: white; font-family: Times, 'Times New Roman', serif; color: black;">
324
+ <p style="color: black; margin-bottom: 1.5em;">Our approach builds on recent work [1] and extends it by...</p>
325
+ <h3 style="color: black; margin-bottom: 0.5em; font-weight: bold;">References</h3>
326
+ <p style="line-height: 1.6; padding-left: 2em; text-indent: -2em;">
327
+ [1] A. Einstein, "On the electrodynamics of moving bodies," <i>Annalen der Physik</i>, vol. 322, no. 10, pp. 891–921, 1905.
328
+ </p>
329
+ </div>
330
+ """, unsafe_allow_html=True)
331
+
332
+ # Mathematical Equations
333
+ with st.expander("Mathematical Equations"):
334
+ st.markdown("**LaTeX Code:**")
335
+ st.code("""
336
+ % Inline math
337
+ The loss function $L(\\theta)$ is defined as...
338
+
339
+ % Display math
340
+ \\begin{equation}
341
+ L(\\theta) = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - f(x_i, \\theta))^2
342
+ \\label{eq:loss}
343
+ \\end{equation}
344
+
345
+ % Reference the equation
346
+ As shown in Equation~\\ref{eq:loss}, the loss function...
347
+ """, language="latex")
348
+
349
+ st.markdown("**Rendered Output:**")
350
+ st.markdown("""
351
+ <div style="border: 1px solid #ddd; padding: 20px; background-color: white; font-family: 'Times New Roman', serif; color: black;">
352
+ <p style="color: black;">The loss function <em>L(&theta;)</em> is defined as...</p>
353
+ <div style="display: flex; justify-content: space-between; align-items: center; margin: 20px 0;">
354
+ <div style="flex-grow: 1; text-align: center;">
355
+ <img src="https://latex.codecogs.com/svg.latex?L(\\theta)%20=%20\\frac{1}{n}%20\\sum_{i=1}^{n}%20(y_i%20-%20f(x_i,%20\\theta))^2" />
356
+ </div>
357
+ <div style="font-style: italic; color: black;">(1)</div>
358
+ </div>
359
+ <p style="color: black;">As shown in Equation (1), the loss function...</p>
360
+ </div>
361
+ """, unsafe_allow_html=True)
362
+
363
+ # Tables
364
+ with st.expander("Creating Tables"):
365
+ st.markdown("**LaTeX Code:**")
366
+ st.code("""
367
+ \\begin{table}[h]
368
+ \\centering
369
+ \\begin{tabular}{|l|c|r|}
370
+ \\hline
371
+ \\textbf{Method} & \\textbf{Accuracy} & \\textbf{Time (s)} \\\\
372
+ \\hline
373
+ Baseline & 85.2\\% & 120 \\\\
374
+ Our Method & 89.7\\% & 95 \\\\
375
+ \\hline
376
+ \\end{tabular}
377
+ \\caption{Performance comparison of different methods}
378
+ \\label{tab:results}
379
+ \\end{table}
380
+
381
+ % Reference the table
382
+ Table~\\ref{tab:results} shows the performance comparison...
383
+ """, language="latex")
384
+
385
+ st.markdown("**Rendered Output:**")
386
+ st.markdown("""
387
+ <div style="border: 1px solid #ddd; padding: 20px; background-color: white; font-family: 'Times New Roman', serif; color: black;">
388
+ <div style="text-align: center; margin: 20px 0;">
389
+ <table style="border-collapse: collapse; width: 100%; max-width: 500px; margin: 0 auto;">
390
+ <tr style="border: 1px solid #000;">
391
+ <th style="border: 1px solid #000; padding: 8px; text-align: left; font-weight: bold;">Method</th>
392
+ <th style="border: 1px solid #000; padding: 8px; text-align: center; font-weight: bold;">Accuracy</th>
393
+ <th style="border: 1px solid #000; padding: 8px; text-align: right; font-weight: bold;">Time (s)</th>
394
+ </tr>
395
+ <tr style="border: 1px solid #000;">
396
+ <td style="border: 1px solid #000; padding: 8px; text-align: left;">Baseline</td>
397
+ <td style="border: 1px solid #000; padding: 8px; text-align: center;">85.2%</td>
398
+ <td style="border: 1px solid #000; padding: 8px; text-align: right;">120</td>
399
+ </tr>
400
+ <tr style="border: 1px solid #000;">
401
+ <td style="border: 1px solid #000; padding: 8px; text-align: left;">Our Method</td>
402
+ <td style="border: 1px solid #000; padding: 8px; text-align: center;">89.7%</td>
403
+ <td style="border: 1px solid #000; padding: 8px; text-align: right;">95</td>
404
+ </tr>
405
+ </table>
406
+ <p style="margin-top: 10px; font-style: italic; color: black;">Table 1: Performance comparison of different methods</p>
407
+ </div>
408
+ <p style="color: black;">Table 1 shows the performance comparison...</p>
409
+ </div>
410
+ """, unsafe_allow_html=True)
+
+ # Interactive LaTeX Practice
+ st.header("Interactive LaTeX Practice")
+
+ st.markdown("""
+ Let's practice some common LaTeX commands. Try these exercises:
+ """)
+
+ # Exercise 1: Basic Document
+ with st.expander("Exercise 1: Create a Basic Document"):
+     st.markdown("""
+ **Task:** Create a basic LaTeX document with title, author, and three sections.
+
+ **Steps:**
+ 1. Open Overleaf and create a new project
+ 2. Replace the default content with your own
+ 3. Add a title and your name
+ 4. Create three sections: Introduction, Methods, Results
+ 5. Add some placeholder text to each section
+ 6. Compile to see your PDF
+ """)
+
+     st.code("""
+ \\documentclass{article}
+ \\title{My First LaTeX Document}
+ \\author{Your Name}
+ \\date{\\today}
+
+ \\begin{document}
+ \\maketitle
+
+ \\section{Introduction}
+ This is the introduction section.
+
+ \\section{Methods}
+ This is the methods section.
+
+ \\section{Results}
+ This is the results section.
+
+ \\end{document}
+ """, language="latex")
+
+ # Exercise 2: Adding Figures
+ with st.expander("Exercise 2: Adding a Figure"):
+     st.markdown("""
+ **Task:** Add a figure to your document.
+
+ **Steps:**
+ 1. Upload an image to your Overleaf project
+ 2. Add the figure code to your document
+ 3. Add a caption and label
+ 4. Reference the figure in your text
+ """)
+
+     st.code("""
+ % Requires \\usepackage{graphicx} in the preamble
+ \\begin{figure}[h]
+ \\centering
+ \\includegraphics[width=0.7\\textwidth]{your-image.png}
+ \\caption{Description of your figure}
+ \\label{fig:example}
+ \\end{figure}
+
+ As shown in Figure~\\ref{fig:example}, our results demonstrate...
+ """, language="latex")
+
+ # Exercise 3: Citations
+ with st.expander("Exercise 3: Adding Citations"):
+     st.markdown("""
+ **Task:** Add citations to your document.
+
+ **Steps:**
+ 1. Create a .bib file with your references
+ 2. Add the bibliography package to your document
+ 3. Add citations in your text
+ 4. Include the bibliography at the end
+ """)
+
+     st.code("""
+ % In your main document
+ \\usepackage{biblatex}
+ \\addbibresource{references.bib}
+
+ % Add citations
+ Recent work \\cite{smith2023} has shown that...
+
+ \\printbibliography
+
+ % In references.bib:
+ @article{smith2023,
+   title={Recent advances in machine learning},
+   author={Smith, John and Johnson, Jane},
+   journal={Journal of ML Research},
+   year={2023}
+ }
+ """, language="latex")
+
+ # Common LaTeX Issues and Solutions
+ st.header("Common LaTeX Issues and Solutions")
+
+ issues_solutions = {
+     "Issue": [
+         "Document won't compile",
+         "Figure not appearing",
+         "Citations not showing",
+         "Math equations not rendering",
+         "Bibliography not generating"
+     ],
+     "Common Cause": [
+         "Missing closing brace or bracket",
+         "Wrong filename or path",
+         "Missing \\addbibresource command or misspelled citation key",
+         "Missing math mode delimiters",
+         "Missing \\printbibliography command"
+     ],
+     "Solution": [
+         "Check for matching braces and brackets",
+         "Verify filename and upload to Overleaf",
+         "Add \\addbibresource{filename.bib} and check keys match the .bib entries",
+         "Use $ for inline, \\begin{equation} for display",
+         "Add \\printbibliography at end of document"
+     ]
+ }
+
+ st.dataframe(pd.DataFrame(issues_solutions))
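To make the math-mode row above concrete, a minimal LaTeX sketch of inline versus display math:

% Inline math: dollar signs inside a sentence
The model has $n$ parameters and loss $L(\theta)$.

% Display math: a numbered equation
\begin{equation}
\hat{\theta} = \arg\min_{\theta} L(\theta)
\end{equation}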
+
+ # Best Practices
+ st.header("Best Practices for Research Paper Writing")
+
+ st.markdown("""
+ **Writing Tips:**
+ 1. **Start with an outline** - Plan your paper structure before writing
+ 2. **Write the methods first** - It's usually the easiest section
+ 3. **Use clear, concise language** - Avoid jargon when possible
+ 4. **Be specific** - Use concrete numbers and examples
+ 5. **Revise multiple times** - Good writing is rewriting
+
+ **LaTeX Tips:**
+ 1. **Compile frequently** - Catch errors early
+ 2. **Use meaningful labels** - fig:results is better than fig:1
+ 3. **Keep backups** - Version control your LaTeX files
+ 4. **Use templates** - Start with conference/journal templates
+ 5. **Learn keyboard shortcuts** - Speed up your workflow
+ """)
+
+ # Additional Resources
+ st.header("Additional Resources")
+ st.markdown("""
+ **LaTeX Resources:**
+ - [Overleaf Documentation](https://www.overleaf.com/learn)
+ - [LaTeX Wikibook](https://en.wikibooks.org/wiki/LaTeX)
+ - [CTAN (Comprehensive TeX Archive Network)](https://ctan.org/)
+
+ """)