Susanna Anil committed
Commit d87e477 · 2 Parent(s): 6a77b20 0d7e8fa

Merge branch 'main' of https://github.com/YZhu0225/reddit_text_classification into main

.github/workflows/sync_to_hugging_face_hub.yml ADDED
@@ -0,0 +1,20 @@
+ name: Sync to Hugging Face hub
+
+ on:
+   push:
+     branches: [main]
+
+   # to run this workflow manually from the Actions tab
+   workflow_dispatch:
+
+ jobs:
+   sync-to-hub:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v2
+         with:
+           fetch-depth: 0
+       - name: Push to hub
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: git push --force https://yjzhu0225:$HF_TOKEN@huggingface.co/spaces/yjzhu0225/reddit_text_classification_app main
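The "Push to hub" step above comes down to one forced `git push` to the Space's git remote, with the `HF_TOKEN` secret embedded in the remote URL. As a rough illustration only, the same command could be assembled in Python as sketched below. The names `build_push_command`, `sync_to_hub`, and the `dry_run` flag are hypothetical and are not part of this commit.

```python
from __future__ import annotations

import os
import subprocess


def build_push_command(user: str, space: str, token: str, branch: str = "main") -> list[str]:
    """Return the argv for a forced push to a Hugging Face Space remote.

    Hypothetical helper: it mirrors the URL layout used in the workflow's `run:` line.
    """
    remote = f"https://{user}:{token}@huggingface.co/spaces/{user}/{space}"
    return ["git", "push", "--force", remote, branch]


def sync_to_hub(user: str, space: str, dry_run: bool = True) -> list[str]:
    """Build the push command from HF_TOKEN and run it unless dry_run is set."""
    # The token comes from the environment, as in the workflow's `env:` block.
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError("HF_TOKEN is not set")
    command = build_push_command(user, space, token)
    if not dry_run:
        # check=True raises CalledProcessError if git exits with a non-zero status.
        subprocess.run(command, check=True)
    return command
```

With `dry_run=True` (the default in this sketch) nothing is pushed, which makes it easy to inspect the command before letting it overwrite the Space's history with `--force`.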
README.md CHANGED
@@ -1,5 +1,19 @@
+ ---
+ title: reddit_text_classification_app
+ emoji: 🐠
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: 3.13.0
+ app_file: app.py
+ pinned: false
+ ---
+
+
  # reddit_text_classification
 
+ [![Sync to Hugging Face hub](https://github.com/YZhu0225/reddit_text_classification/actions/workflows/sync_to_hugging_face_hub.yml/badge.svg)](https://github.com/YZhu0225/reddit_text_classification/actions/workflows/sync_to_hugging_face_hub.yml)
+
  Ideas - text classification
 
  API that does a microservice
@@ -45,3 +59,23 @@ Things to do:
  Due date: December 16, 2022
 
  Demo - split up the workload so that it uses everybody’s best talents, not everyone has to present, break problem up so that final outcome is the best, one person really good at editing, can be editor, if one person is good at voiceover then do the voiceover, if one person is good at documentation, then one person does documentation, if one person is doing coding, then one person is doing coding,
+
+
+
+
+
+ ### Get Reddit data
+ * Data pulled in notebook `reddit_data/reddit_new.ipynb`
+ ### Verify GPU works
+ * Run pytorch training test: `python utils/quickstart_pytorch.py`
+ * Run pytorch CUDA test: `python utils/verify_cuda_pytorch.py`
+ * Run tensorflow training test: `python utils/quickstart_tf2.py`
+ * Run nvidia monitoring test: `nvidia-smi -l 1`
+
+ ### Finetune text classifier model and upload to Hugging Face
+ * In terminal, run `huggingface-cli login`
+ * Run `python fine_tune_berft.py` to finetune the model on Reddit data
+ * Run `rename_labels.py` to change the output labels of the classifier
+ * Check out the fine-tuned model [here](https://huggingface.co/michellejieli/inappropriate_text_classifier)
+ * [Spaces APP](https://huggingface.co/spaces/yjzhu0225/reddit_text_classification_app)
+
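The README's "Verify GPU works" steps point at scripts under `utils/` that are not shown in this diff. As an assumption about what a CUDA check of that kind typically does, a minimal sketch might look like the following. `cuda_report` is a hypothetical name, and this is not the contents of `utils/verify_cuda_pytorch.py`.

```python
def cuda_report() -> dict:
    """Summarise whether PyTorch is importable and whether it can see a CUDA GPU."""
    report = {"torch_installed": False, "cuda_available": False, "devices": []}
    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    if report["cuda_available"]:
        # One entry per visible GPU, e.g. the marketing name of the card.
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    print(cuda_report())
```

If `cuda_available` comes back `False` on a machine with a GPU, `nvidia-smi` (the last step in the README list) is the usual next place to look.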
__init__.py ADDED
@@ -0,0 +1 @@
+
reddit_data/__init__.py ADDED
@@ -0,0 +1 @@
+
reddit_data/reddit_annotated.csv ADDED
The diff for this file is too large to render.
reddit_dataset.csv → reddit_data/reddit_dataset.csv RENAMED
File without changes
reddit_data/reddit_new.ipynb ADDED
@@ -0,0 +1,369 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
9
+ "import pandas as pd "
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "code",
14
+ "execution_count": 2,
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "# read json file, change row to column\n",
19
+ "df = pd.read_json('/Users/liuxiaoquan/Documents/706/Final_project/Reddit_new.json', orient='index')"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "code",
24
+ "execution_count": 3,
25
+ "metadata": {},
26
+ "outputs": [
27
+ {
28
+ "data": {
29
+ "text/plain": [
30
+ "24506"
31
+ ]
32
+ },
33
+ "execution_count": 3,
34
+ "metadata": {},
35
+ "output_type": "execute_result"
36
+ }
37
+ ],
38
+ "source": [
39
+ "len(df)"
40
+ ]
41
+ },
42
+ {
43
+ "cell_type": "code",
44
+ "execution_count": 4,
45
+ "metadata": {},
46
+ "outputs": [
47
+ {
48
+ "data": {
49
+ "text/html": [
50
+ "<div>\n",
51
+ "<style scoped>\n",
52
+ " .dataframe tbody tr th:only-of-type {\n",
53
+ " vertical-align: middle;\n",
54
+ " }\n",
55
+ "\n",
56
+ " .dataframe tbody tr th {\n",
57
+ " vertical-align: top;\n",
58
+ " }\n",
59
+ "\n",
60
+ " .dataframe thead th {\n",
61
+ " text-align: right;\n",
62
+ " }\n",
63
+ "</style>\n",
64
+ "<table border=\"1\" class=\"dataframe\">\n",
65
+ " <thead>\n",
66
+ " <tr style=\"text-align: right;\">\n",
67
+ " <th></th>\n",
68
+ " <th>title</th>\n",
69
+ " <th>body</th>\n",
70
+ " <th>comment</th>\n",
71
+ " <th>L1</th>\n",
72
+ " <th>L2</th>\n",
73
+ " <th>L3</th>\n",
74
+ " <th>L4</th>\n",
75
+ " <th>L5</th>\n",
76
+ " <th>L6</th>\n",
77
+ " </tr>\n",
78
+ " </thead>\n",
79
+ " <tbody>\n",
80
+ " <tr>\n",
81
+ " <th>2k7915</th>\n",
82
+ " <td>Why is the NW map so small?</td>\n",
83
+ " <td>When did this change? It's not been fun spawni...</td>\n",
84
+ " <td>Maybe they need to keep the map large to start...</td>\n",
85
+ " <td>0</td>\n",
86
+ " <td>0</td>\n",
87
+ " <td>0</td>\n",
88
+ " <td>0</td>\n",
89
+ " <td>0</td>\n",
90
+ " <td>0</td>\n",
91
+ " </tr>\n",
92
+ " <tr>\n",
93
+ " <th>1k3845</th>\n",
94
+ " <td>Any updates in regards to the Flame War?</td>\n",
95
+ " <td>Just out of curiosity. I'm only wondering what...</td>\n",
96
+ " <td>Shut the fuck up freeloading asshat</td>\n",
97
+ " <td>3</td>\n",
98
+ " <td>1</td>\n",
99
+ " <td>0</td>\n",
100
+ " <td>0</td>\n",
101
+ " <td>0</td>\n",
102
+ " <td>1</td>\n",
103
+ " </tr>\n",
104
+ " <tr>\n",
105
+ " <th>1k8446</th>\n",
106
+ " <td>Hey</td>\n",
107
+ " <td>Im not phased by anything, love you all and I'...</td>\n",
108
+ " <td>MORE WIGGER SHIT TO DECODE</td>\n",
109
+ " <td>3</td>\n",
110
+ " <td>1</td>\n",
111
+ " <td>0</td>\n",
112
+ " <td>0</td>\n",
113
+ " <td>0</td>\n",
114
+ " <td>1</td>\n",
115
+ " </tr>\n",
116
+ " <tr>\n",
117
+ " <th>14k940</th>\n",
118
+ " <td>Any tips for final exams?</td>\n",
119
+ " <td>I am a first year student in Bachelor of Scien...</td>\n",
120
+ " <td>For Calc2, do past exams \\* 6, remember to exp...</td>\n",
121
+ " <td>0</td>\n",
122
+ " <td>0</td>\n",
123
+ " <td>0</td>\n",
124
+ " <td>0</td>\n",
125
+ " <td>0</td>\n",
126
+ " <td>0</td>\n",
127
+ " </tr>\n",
128
+ " <tr>\n",
129
+ " <th>12k646</th>\n",
130
+ " <td>My orthodontist just said I can't have nuts be...</td>\n",
131
+ " <td>What do I do I want to keep my nuts</td>\n",
132
+ " <td>just eat em and be careful it's fine</td>\n",
133
+ " <td>0</td>\n",
134
+ " <td>0</td>\n",
135
+ " <td>0</td>\n",
136
+ " <td>0</td>\n",
137
+ " <td>0</td>\n",
138
+ " <td>0</td>\n",
139
+ " </tr>\n",
140
+ " </tbody>\n",
141
+ "</table>\n",
142
+ "</div>"
143
+ ],
144
+ "text/plain": [
145
+ " title \\\n",
146
+ "2k7915 Why is the NW map so small? \n",
147
+ "1k3845 Any updates in regards to the Flame War? \n",
148
+ "1k8446 Hey \n",
149
+ "14k940 Any tips for final exams? \n",
150
+ "12k646 My orthodontist just said I can't have nuts be... \n",
151
+ "\n",
152
+ " body \\\n",
153
+ "2k7915 When did this change? It's not been fun spawni... \n",
154
+ "1k3845 Just out of curiosity. I'm only wondering what... \n",
155
+ "1k8446 Im not phased by anything, love you all and I'... \n",
156
+ "14k940 I am a first year student in Bachelor of Scien... \n",
157
+ "12k646 What do I do I want to keep my nuts \n",
158
+ "\n",
159
+ " comment L1 L2 L3 L4 L5 \\\n",
160
+ "2k7915 Maybe they need to keep the map large to start... 0 0 0 0 0 \n",
161
+ "1k3845 Shut the fuck up freeloading asshat 3 1 0 0 0 \n",
162
+ "1k8446 MORE WIGGER SHIT TO DECODE 3 1 0 0 0 \n",
163
+ "14k940 For Calc2, do past exams \\* 6, remember to exp... 0 0 0 0 0 \n",
164
+ "12k646 just eat em and be careful it's fine 0 0 0 0 0 \n",
165
+ "\n",
166
+ " L6 \n",
167
+ "2k7915 0 \n",
168
+ "1k3845 1 \n",
169
+ "1k8446 1 \n",
170
+ "14k940 0 \n",
171
+ "12k646 0 "
172
+ ]
173
+ },
174
+ "execution_count": 4,
175
+ "metadata": {},
176
+ "output_type": "execute_result"
177
+ }
178
+ ],
179
+ "source": [
180
+ "df.head()"
181
+ ]
182
+ },
183
+ {
184
+ "cell_type": "code",
185
+ "execution_count": 5,
186
+ "metadata": {},
187
+ "outputs": [],
188
+ "source": [
189
+ "#select body and L2, change L2 to Class\n",
190
+ "df_select = df[['body', 'L2']].copy()\n",
191
+ "df_select.rename(columns={'L2':'Class'}, inplace=True) \n"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 6,
197
+ "metadata": {},
198
+ "outputs": [
199
+ {
200
+ "data": {
201
+ "text/html": [
202
+ "<div>\n",
203
+ "<style scoped>\n",
204
+ " .dataframe tbody tr th:only-of-type {\n",
205
+ " vertical-align: middle;\n",
206
+ " }\n",
207
+ "\n",
208
+ " .dataframe tbody tr th {\n",
209
+ " vertical-align: top;\n",
210
+ " }\n",
211
+ "\n",
212
+ " .dataframe thead th {\n",
213
+ " text-align: right;\n",
214
+ " }\n",
215
+ "</style>\n",
216
+ "<table border=\"1\" class=\"dataframe\">\n",
217
+ " <thead>\n",
218
+ " <tr style=\"text-align: right;\">\n",
219
+ " <th></th>\n",
220
+ " <th>body</th>\n",
221
+ " <th>Class</th>\n",
222
+ " </tr>\n",
223
+ " </thead>\n",
224
+ " <tbody>\n",
225
+ " <tr>\n",
226
+ " <th>2k7915</th>\n",
227
+ " <td>When did this change? It's not been fun spawni...</td>\n",
228
+ " <td>0</td>\n",
229
+ " </tr>\n",
230
+ " <tr>\n",
231
+ " <th>1k3845</th>\n",
232
+ " <td>Just out of curiosity. I'm only wondering what...</td>\n",
233
+ " <td>1</td>\n",
234
+ " </tr>\n",
235
+ " <tr>\n",
236
+ " <th>1k8446</th>\n",
237
+ " <td>Im not phased by anything, love you all and I'...</td>\n",
238
+ " <td>1</td>\n",
239
+ " </tr>\n",
240
+ " <tr>\n",
241
+ " <th>14k940</th>\n",
242
+ " <td>I am a first year student in Bachelor of Scien...</td>\n",
243
+ " <td>0</td>\n",
244
+ " </tr>\n",
245
+ " <tr>\n",
246
+ " <th>12k646</th>\n",
247
+ " <td>What do I do I want to keep my nuts</td>\n",
248
+ " <td>0</td>\n",
249
+ " </tr>\n",
250
+ " </tbody>\n",
251
+ "</table>\n",
252
+ "</div>"
253
+ ],
254
+ "text/plain": [
255
+ " body Class\n",
256
+ "2k7915 When did this change? It's not been fun spawni... 0\n",
257
+ "1k3845 Just out of curiosity. I'm only wondering what... 1\n",
258
+ "1k8446 Im not phased by anything, love you all and I'... 1\n",
259
+ "14k940 I am a first year student in Bachelor of Scien... 0\n",
260
+ "12k646 What do I do I want to keep my nuts 0"
261
+ ]
262
+ },
263
+ "execution_count": 6,
264
+ "metadata": {},
265
+ "output_type": "execute_result"
266
+ }
267
+ ],
268
+ "source": [
269
+ "df_select.head()"
270
+ ]
271
+ },
272
+ {
273
+ "cell_type": "code",
274
+ "execution_count": 7,
275
+ "metadata": {},
276
+ "outputs": [
277
+ {
278
+ "data": {
279
+ "text/plain": [
280
+ "1 12577\n",
281
+ "0 11929\n",
282
+ "Name: Class, dtype: int64"
283
+ ]
284
+ },
285
+ "execution_count": 7,
286
+ "metadata": {},
287
+ "output_type": "execute_result"
288
+ }
289
+ ],
290
+ "source": [
291
+ "#check the number of each class\n",
292
+ "df_select['Class'].value_counts()\n"
293
+ ]
294
+ },
295
+ {
296
+ "cell_type": "code",
297
+ "execution_count": 8,
298
+ "metadata": {},
299
+ "outputs": [],
300
+ "source": [
301
+ "# save to csv\n",
302
+ "df_select.to_csv('reddit_annotated.csv', index=False)"
303
+ ]
304
+ },
305
+ {
306
+ "cell_type": "code",
307
+ "execution_count": 16,
308
+ "metadata": {},
309
+ "outputs": [],
310
+ "source": [
311
+ "ht = pd.read_table('/Users/liuxiaoquan/Documents/706/Final_project/RAL-E/retrain_reddit_abuse_test.txt', header=None)"
312
+ ]
313
+ },
314
+ {
315
+ "cell_type": "code",
316
+ "execution_count": 20,
317
+ "metadata": {},
318
+ "outputs": [
319
+ {
320
+ "data": {
321
+ "text/plain": [
322
+ "14932"
323
+ ]
324
+ },
325
+ "execution_count": 20,
326
+ "metadata": {},
327
+ "output_type": "execute_result"
328
+ }
329
+ ],
330
+ "source": [
331
+ "len(ht)"
332
+ ]
333
+ },
334
+ {
335
+ "cell_type": "code",
336
+ "execution_count": null,
337
+ "metadata": {},
338
+ "outputs": [],
339
+ "source": []
340
+ }
341
+ ],
342
+ "metadata": {
343
+ "kernelspec": {
344
+ "display_name": "Python 3.10.6 ('base')",
345
+ "language": "python",
346
+ "name": "python3"
347
+ },
348
+ "language_info": {
349
+ "codemirror_mode": {
350
+ "name": "ipython",
351
+ "version": 3
352
+ },
353
+ "file_extension": ".py",
354
+ "mimetype": "text/x-python",
355
+ "name": "python",
356
+ "nbconvert_exporter": "python",
357
+ "pygments_lexer": "ipython3",
358
+ "version": "3.10.6"
359
+ },
360
+ "orig_nbformat": 4,
361
+ "vscode": {
362
+ "interpreter": {
363
+ "hash": "3d597f4c481aa0f25dceb95d2a0067e73c0966dcbd003d741d821a7208527ecf"
364
+ }
365
+ }
366
+ },
367
+ "nbformat": 4,
368
+ "nbformat_minor": 2
369
+ }
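The notebook above reads its input from an absolute path on one contributor's machine, so it will not run elsewhere as committed. The sketch below replays the same steps (load JSON keyed by post ID, keep `body` and `L2`, rename `L2` to `Class`, count classes, export CSV) on a tiny in-memory sample. The sample records and the helper name `select_body_and_class` are made up for illustration. Only the column names come from the notebook output.

```python
import io

import pandas as pd


def select_body_and_class(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the post body and the L2 label, renaming L2 to Class."""
    return df[["body", "L2"]].rename(columns={"L2": "Class"})


# Made-up sample in the same shape as the notebook's input: one record per post ID.
sample_json = io.StringIO(
    '{"a1": {"title": "t1", "body": "first post", "comment": "c1", "L1": 0, "L2": 0},'
    ' "b2": {"title": "t2", "body": "second post", "comment": "c2", "L1": 3, "L2": 1},'
    ' "c3": {"title": "t3", "body": "third post", "comment": "c3", "L1": 0, "L2": 0}}'
)

# orient="index" turns the top-level keys (post IDs) into the row index.
df = pd.read_json(sample_json, orient="index")
df_select = select_body_and_class(df)

# Number of rows per class, as in the notebook's value_counts() cell.
class_counts = df_select["Class"].value_counts()

# index=False drops the post IDs, matching the notebook's to_csv call;
# with no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df_select.to_csv(index=False)
```

Passing a relative path such as `reddit_data/Reddit_new.json` to `pd.read_json` instead of the in-memory buffer would be one way to make the notebook portable, assuming the JSON file is checked in at that (hypothetical) location.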
reddit_scraping.ipynb → reddit_data/reddit_scraping.ipynb RENAMED
File without changes
requirements.txt CHANGED
@@ -5,7 +5,8 @@ uvicorn[standard]
  pandas
  black
  transformers
+ torch
  praw
  numpy
  gradio
- altair
+ altair