Spaces: yjzhu0225/reddit_text_classification_app

Susanna Anil committed on Dec 13, 2022

Commit d87e477

2 Parent(s): 6a77b20 0d7e8fa

Merge branch 'main' of https://github.com/YZhu0225/reddit_text_classification into main

Files changed (9)

.github/workflows/sync_to_hugging_face_hub.yml +20 -0
README.md +34 -0
__init__.py +1 -0
reddit_data/__init__.py +1 -0
reddit_data/reddit_annotated.csv +0 -0
reddit_dataset.csv → reddit_data/reddit_dataset.csv +0 -0
reddit_data/reddit_new.ipynb +369 -0
reddit_scraping.ipynb → reddit_data/reddit_scraping.ipynb +0 -0
requirements.txt +2 -1

.github/workflows/sync_to_hugging_face_hub.yml ADDED

	@@ -0,0 +1,20 @@

+name: Sync to Hugging Face hub
+on:
+  push:
+    branches: [main]
+  # to run this workflow manually from the Actions tab
+  workflow_dispatch:
+jobs:
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push --force https://yjzhu0225:$HF_TOKEN@huggingface.co/spaces/yjzhu0225/reddit_text_classification_app main
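
The push step can be hard to read because the token is embedded in the remote URL. The sketch below rebuilds the same `git push --force` command in Python so the pieces are visible. It assumes the usual token-in-URL form for Hugging Face git remotes (`https://USER:TOKEN@huggingface.co/spaces/USER/SPACE`); the function names `build_push_url` and `push_command` are hypothetical and are not part of this commit.

```python
def build_push_url(username: str, token: str, space: str) -> str:
    """Return the authenticated remote URL for a Space (assumed URL form)."""
    return f"https://{username}:{token}@huggingface.co/spaces/{username}/{space}"


def push_command(username: str, token: str, space: str, branch: str = "main") -> list[str]:
    """Return the argument list for the force push performed by the workflow step."""
    # --force overwrites the Space's history with the GitHub branch.
    return ["git", "push", "--force", build_push_url(username, token, space), branch]


# Illustration only: a placeholder stands in for the real secret.
command = push_command("yjzhu0225", "<HF_TOKEN>", "reddit_text_classification_app")
print(" ".join(command))
```

In the workflow itself the token comes from `secrets.HF_TOKEN` through the `env` block, so it never appears in the repository.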

README.md CHANGED

@@ -1,5 +1,19 @@
+---
+title: reddit_text_classification_app
+emoji: 🐠
+colorFrom: blue
+colorTo: green
+sdk: gradio
+sdk_version: 3.13.0
+app_file: app.py
+pinned: false
+---
 # reddit_text_classification
+[![Sync to Hugging Face hub](https://github.com/YZhu0225/reddit_text_classification/actions/workflows/sync_to_hugging_face_hub.yml/badge.svg)](https://github.com/YZhu0225/reddit_text_classification/actions/workflows/sync_to_hugging_face_hub.yml)
 Ideas - text classification
 API that does a microservice
@@ -45,3 +59,23 @@ Things to do:
 Due date: December 16, 2022
 Demo - split up the workload so that it uses everybody’s best talents, not everyone has to present, break problem up so that final outcome is the best, one person really good at editing, can be editor, if one person is good at voiceocer then do the voiceover, if one person is good at documentation, then one person does documentation, if one person is doing coding, then one person is doing coding,
+### Get Reddit data
+* Data pulled in notebook `reddit_data/reddit_new.ipynb`
+### Verify GPU works
+* Run pytorch training test: `python utils/quickstart_pytorch.py`
+* Run pytorch CUDA test: `python utils/verify_cuda_pytorch.py`
+* Run tensorflow training test: `python utils/quickstart_tf2.py`
+* Run nvidia monitoring test: `nvidia-smi -l 1`
+### Finetune text classifier model and upload to Hugging Face
+* In terminal, run `huggingface-cli login`
+* Run `python fine_tune_berft.py` to finetune the model on Reddit data
+* Run `rename_labels.py` to change the output labels of the classifier
+* Check out the fine-tuned model [here](https://huggingface.co/michellejieli/inappropriate_text_classifier)
+* [Spaces APP](https://huggingface.co/spaces/yjzhu0225/reddit_text_classification_app)
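
The README points to `utils/verify_cuda_pytorch.py`, which is not part of this commit. As a rough idea of what such a check might do, here is a minimal sketch that reports whether PyTorch is installed and whether it sees a CUDA device. The function name `cuda_report` and the shape of the returned dictionary are assumptions for illustration, not the contents of the actual script.

```python
import importlib.util


def cuda_report() -> dict:
    """Summarize PyTorch/CUDA availability without failing when torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    import torch  # imported lazily so the check also runs on machines without torch

    available = bool(torch.cuda.is_available())
    devices = []
    if available:
        # List the name of every GPU that PyTorch can see.
        devices = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return {"torch_installed": True, "cuda_available": available, "devices": devices}


print(cuda_report())
```

An empty `devices` list with `torch_installed` set to true usually means a CPU-only build or a driver problem, which `nvidia-smi` can help confirm.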

__init__.py ADDED

@@ -0,0 +1 @@
+

reddit_data/__init__.py ADDED

@@ -0,0 +1 @@
+

reddit_data/reddit_annotated.csv ADDED

The diff for this file is too large to render.

reddit_dataset.csv → reddit_data/reddit_dataset.csv RENAMED

File without changes

reddit_data/reddit_new.ipynb ADDED

	@@ -0,0 +1,369 @@

+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# read json file, change row to column\n",
+    "df = pd.read_json('/Users/liuxiaoquan/Documents/706/Final_project/Reddit_new.json', orient='index')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "24506"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(df)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>title</th>\n",
+       "      <th>body</th>\n",
+       "      <th>comment</th>\n",
+       "      <th>L1</th>\n",
+       "      <th>L2</th>\n",
+       "      <th>L3</th>\n",
+       "      <th>L4</th>\n",
+       "      <th>L5</th>\n",
+       "      <th>L6</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>2k7915</th>\n",
+       "      <td>Why is the NW map so small?</td>\n",
+       "      <td>When did this change? It's not been fun spawni...</td>\n",
+       "      <td>Maybe they need to keep the map large to start...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1k3845</th>\n",
+       "      <td>Any updates in regards to the Flame War?</td>\n",
+       "      <td>Just out of curiosity. I'm only wondering what...</td>\n",
+       "      <td>Shut the fuck up freeloading asshat</td>\n",
+       "      <td>3</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1k8446</th>\n",
+       "      <td>Hey</td>\n",
+       "      <td>Im not phased by anything, love you all and I'...</td>\n",
+       "      <td>MORE WIGGER SHIT TO DECODE</td>\n",
+       "      <td>3</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>14k940</th>\n",
+       "      <td>Any tips for final exams?</td>\n",
+       "      <td>I am a first year student in Bachelor of Scien...</td>\n",
+       "      <td>For Calc2, do past exams \\* 6, remember to exp...</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12k646</th>\n",
+       "      <td>My orthodontist just said I can't have nuts be...</td>\n",
+       "      <td>What do I do I want to keep my nuts</td>\n",
+       "      <td>just eat em and be careful it's fine</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                    title  \\\n",
+       "2k7915                        Why is the NW map so small?   \n",
+       "1k3845           Any updates in regards to the Flame War?   \n",
+       "1k8446                                                Hey   \n",
+       "14k940                          Any tips for final exams?   \n",
+       "12k646  My orthodontist just said I can't have nuts be...   \n",
+       "\n",
+       "                                                     body  \\\n",
+       "2k7915  When did this change? It's not been fun spawni...   \n",
+       "1k3845  Just out of curiosity. I'm only wondering what...   \n",
+       "1k8446  Im not phased by anything, love you all and I'...   \n",
+       "14k940  I am a first year student in Bachelor of Scien...   \n",
+       "12k646                What do I do I want to keep my nuts   \n",
+       "\n",
+       "                                                  comment  L1  L2  L3  L4  L5  \\\n",
+       "2k7915  Maybe they need to keep the map large to start...   0   0   0   0   0   \n",
+       "1k3845                Shut the fuck up freeloading asshat   3   1   0   0   0   \n",
+       "1k8446                         MORE WIGGER SHIT TO DECODE   3   1   0   0   0   \n",
+       "14k940  For Calc2, do past exams \\* 6, remember to exp...   0   0   0   0   0   \n",
+       "12k646               just eat em and be careful it's fine   0   0   0   0   0   \n",
+       "\n",
+       "        L6  \n",
+       "2k7915   0  \n",
+       "1k3845   1  \n",
+       "1k8446   1  \n",
+       "14k940   0  \n",
+       "12k646   0  "
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#select body and L2, change L2 to Class\n",
+    "df_select = df[['body', 'L2']].copy()\n",
+    "df_select.rename(columns={'L2':'Class'}, inplace=True) \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>body</th>\n",
+       "      <th>Class</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>2k7915</th>\n",
+       "      <td>When did this change? It's not been fun spawni...</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1k3845</th>\n",
+       "      <td>Just out of curiosity. I'm only wondering what...</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1k8446</th>\n",
+       "      <td>Im not phased by anything, love you all and I'...</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>14k940</th>\n",
+       "      <td>I am a first year student in Bachelor of Scien...</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12k646</th>\n",
+       "      <td>What do I do I want to keep my nuts</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                     body  Class\n",
+       "2k7915  When did this change? It's not been fun spawni...      0\n",
+       "1k3845  Just out of curiosity. I'm only wondering what...      1\n",
+       "1k8446  Im not phased by anything, love you all and I'...      1\n",
+       "14k940  I am a first year student in Bachelor of Scien...      0\n",
+       "12k646                What do I do I want to keep my nuts      0"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df_select.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "1    12577\n",
+       "0    11929\n",
+       "Name: Class, dtype: int64"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#check the number of each class\n",
+    "df_select['Class'].value_counts()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# save to csv\n",
+    "df_select.to_csv('reddit_annotated.csv', index=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ht = pd.read_table('/Users/liuxiaoquan/Documents/706/Final_project/RAL-E/retrain_reddit_abuse_test.txt', header=None)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "14932"
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(ht)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.10.6 ('base')",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  },
+  "orig_nbformat": 4,
+  "vscode": {
+   "interpreter": {
+    "hash": "3d597f4c481aa0f25dceb95d2a0067e73c0966dcbd003d741d821a7208527ecf"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
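
The notebook's steps (read the JSON with `orient='index'`, keep one text column plus `L2`, rename `L2` to `Class`, count classes, write a CSV) can be sketched as a small script. The sample records below are hypothetical stand-ins for `Reddit_new.json`, and `build_annotated` is an illustrative name, not something from the repository. The text column defaults to `body` because that is what the notebook selects; the sample rows in the notebook output suggest the labels may describe `comment` instead, so the column is left as a parameter.

```python
from io import StringIO

import pandas as pd

# Hypothetical records in the same shape as the notebook's input:
# one object per post ID, with text fields and an integer label L2.
SAMPLE_JSON = """
{
  "a1": {"title": "t1", "body": "first post body", "comment": "c1", "L2": 0},
  "a2": {"title": "t2", "body": "second post body", "comment": "c2", "L2": 1},
  "a3": {"title": "t3", "body": "third post body", "comment": "c3", "L2": 1}
}
"""


def build_annotated(json_source, text_column: str = "body") -> pd.DataFrame:
    """Load index-oriented JSON and keep one text column plus the L2 label as Class."""
    df = pd.read_json(json_source, orient="index")
    # Select the two columns and rename the label, as the notebook does.
    return df[[text_column, "L2"]].rename(columns={"L2": "Class"})


annotated = build_annotated(StringIO(SAMPLE_JSON))
class_counts = annotated["Class"].value_counts()  # rows per class
csv_text = annotated.to_csv(index=False)          # same layout as reddit_annotated.csv
print(class_counts)
print(csv_text)
```

Passing a file path instead of `StringIO(...)` reads from disk, and a relative path would avoid the machine-specific `/Users/...` locations that appear in the notebook.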

reddit_scraping.ipynb → reddit_data/reddit_scraping.ipynb RENAMED

File without changes

requirements.txt CHANGED

@@ -5,7 +5,8 @@ uvicorn[standard]
 pandas
 black
 transformers
+torch
 praw
 numpy
 gradio
-altair
+altair