---
title: reddit_text_classification_app
emoji: 🐠
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 3.13.0
app_file: app.py
pinned: false
---
# Reddit Explicit Text Classifier

[CI](https://github.com/YZhu0225/reddit_text_classification/actions/workflows/main.yml)
[Sync to Hugging Face Hub](https://github.com/YZhu0225/reddit_text_classification/actions/workflows/sync_to_hugging_face_hub.yml)
## Demo
Link to the YouTube demo:
[<img width="700" src="https://user-images.githubusercontent.com/112578003/207716480-a5ac9596-8095-46d5-9df9-d6973af38e3e.png">](https://youtu.be/0OY0CCK3lI4 "Reddit")
## Introduction
Reddit is a place where people come together to have a wide variety of conversations. However, abusive language can seriously harm users in online communities. As students passionate about data science, we are interested in detecting inappropriate and unprofessional Reddit posts and warning users about explicit content in these posts.
In this project, we created a Hugging Face Spaces app with a Gradio interface that classifies not safe for work (NSFW) content, i.e., text that is considered inappropriate or unprofessional. We used a pre-trained DistilBERT transformer model for sentiment analysis and fine-tuned it on Reddit posts; the resulting classifier predicts two classes: NSFW and safe for work (SFW).
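Once the fine-tuned model is on the Hugging Face Hub (see the model link below), it can be used for inference in a few lines. This is a minimal sketch assuming the `transformers` library is installed; the model id is taken from this repo's model page.

```python
# Minimal inference sketch for the fine-tuned classifier.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="michellejieli/inappropriate_text_classifier",
)

result = classifier("This is a perfectly ordinary sentence.")
print(result[0]["label"], result[0]["score"])
```

The pipeline returns a list with one dict per input, each holding the predicted label (NSFW or SFW) and a confidence score.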
## Workflow
<p align="center">
<img width="750" height="450" src="https://user-images.githubusercontent.com/112578003/207698683-233c228e-c2d0-441f-bbba-139dd24a98d3.png" />
</p>
### Get Reddit data
* Data is pulled in the notebook `reddit_data/reddit_new.ipynb` and used to fine-tune the Hugging Face model.
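Before classification, each post's title and body need to be joined into a single input string. The helper below is a hypothetical sketch of that step (the function name and 512-character budget are assumptions, not taken from the notebook); in practice it would be applied to each submission pulled from Reddit.

```python
# Hypothetical helper: combine a post's title and body into one
# classifier input, trimmed to a fixed character budget.
def submission_to_text(title: str, selftext: str, max_chars: int = 512) -> str:
    """Join title and body, strip whitespace, and truncate."""
    text = f"{title}\n{selftext}".strip()
    return text[:max_chars]

print(submission_to_text("Post title", "Post body text"))
```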
### Verify GPU works in this [repo](https://github.com/nogibjj/Reddit_Classifier_Final_Project)
* Run pytorch training test: `python utils/quickstart_pytorch.py`
* Run pytorch CUDA test: `python utils/verify_cuda_pytorch.py`
* Run tensorflow training test: `python utils/quickstart_tf2.py`
* Run nvidia monitoring test: `nvidia-smi -l 1`
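The utility scripts above can be spot-checked with a short CUDA sanity snippet like the following (a sketch assuming PyTorch is installed; it only reads device state and trains nothing):

```python
# Quick CUDA sanity check, similar in spirit to the utils scripts.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # Report the first visible GPU, if any.
    print(f"Device name:     {torch.cuda.get_device_name(0)}")
```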
### DistilBERT transformer model
<p align="center">
<img width="700" height="350" src="https://user-images.githubusercontent.com/112578003/207486477-a40d62be-8d06-4a35-ae4c-7077569bef44.png" />
</p>
### Fine-tune text classifier model and upload to Hugging Face
* In terminal, run `huggingface-cli login`
* Run `python fine_tune_berft.py` to fine-tune the model on Reddit data
* Run `rename_labels.py` to change the output labels of the classifier
* Check out the fine-tuned model [here](https://huggingface.co/michellejieli/inappropriate_text_classifier)
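A sketch of what `rename_labels.py` might do: overwrite the model config's `id2label`/`label2id` mappings so the classifier reports NSFW/SFW instead of the default `LABEL_0`/`LABEL_1`. The exact index-to-label assignment below is an assumption.

```python
# Hedged sketch: rename classifier output labels via the model config.
from transformers import DistilBertConfig

config = DistilBertConfig(num_labels=2)
# Assumed mapping -- the real script may assign indices the other way round.
config.id2label = {0: "SFW", 1: "NSFW"}
config.label2id = {"SFW": 0, "NSFW": 1}
print(config.id2label)
```

Saving the updated config alongside the model weights (e.g. with `save_pretrained`) makes the pipeline emit the human-readable labels.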
### Gradio interface
* In terminal, run `python3 app.py`
* Open the local URL in a browser
* Paste a Reddit URL into *input_url* to get the classification output
* Or directly check out the spaces app [here](https://huggingface.co/spaces/yjzhu0225/reddit_text_classification_app)
**SAFE Reddit URL**
<p align="center">
<img width="700" height="250" src="https://user-images.githubusercontent.com/112578003/207698979-f3751140-fc91-4613-9892-c22f2e5b7dfa.png">
</p>
**WARNING Reddit URL**
<p align="center">
<img width="700" height="250" src="https://user-images.githubusercontent.com/112578003/207699308-8847e2f3-be76-47e4-8a0b-ba4406f5a693.png">
</p>
### References
[1] "CADD_dataset," GitHub, Sep. 26, 2022. https://github.com/nlpcl-lab/cadd_dataset

[2] H. Song, S. H. Ryu, H. Lee, and J. Park, "A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit," CoNLL 2021, Nov. 2021. https://aclanthology.org/2021.conll-1.43/