Michelle Li committed:
add project description in readme
README.md
CHANGED
@@ -10,59 +10,9 @@ pinned: false
 ---
 
 
-#
-
-[](https://github.com/YZhu0225/reddit_text_classification/actions/workflows/sync_to_hugging_face_hub.yml)
-
-Ideas - text classification
-
-API that does a microservice
-
-Use swagger documentation
-
-Filter words
-Query on it
-Text classification - sentiment analysis
-
-Filter, spam, sensitive content, cuss words
-
-https://new.pythonforengineers.com/blog/build-a-reddit-bot-part-1/
-
-Look for explicit things, further research on which subreddit to use
-
-Find data contains labels with data that is similar to how people type on reddit
-
-Figure out which model
-
-If it does
-
-Next steps:
-EOD Wednesday
-
-Meet next Monday 12:30 PM
-
-https://new.pythonforengineers.com/blog/build-a-reddit-bot-part-1/
-
-
-
-Trying to break the project into as small as pieces as possible, where we are able to get in static copy of some known matches and some where they dont match, some examples of posts that are good and some examples of posts are bad, get everything locally first, and then once we get that working, then will try to hook it up to API, get reddit api (if we can get it), real time stuff - cherry on sundae (nice to have) but without the system working, better to not do it at all, download posts first, take half an hour to see if API gave me ability to grab a post, use API to grab a few posts, put in some fake bad words, inject some bad words into it inject into original post, get everything working, hugging face model to detect toxic content (that itself is good enough), have some data, get spaces app, command line tool app, 99% on that, real time is only if we have time
-
-Things to do:
-- (Yuanjing and Xiaoquan) Find examples of reddit posts for both classes we are trying to classify, Convert to CSV (2 columns - text, class), 2 classes, Thursday December 8, 2022
-- (Michelle) Find Hugging Face model to use to classify posts
-- (Michelle) Finetune model on reddit posts and upload to Hugging Face (create API)
-- (Susanna) Create CLI or spaces app on Hugging Face
-- Connect to real time (optional)
-- (Xiaoquan and Yuanjing) Make demo video
-
-
-Due date: December 16, 2022
-
-Demo - split up the workload so that it uses everybody’s best talents, not everyone has to present, break problem up so that final outcome is the best, one person really good at editing, can be editor, if one person is good at voiceocer then do the voiceover, if one person is good at documentation, then one person does documentation, if one person is doing coding, then one person is doing coding,
-
-
-
+# Reddit Explicit Text Classifier
 
+In this project, we created a text classifier Hugging Face Spaces app and Gradio interface that classifies not safe for work (NSFW) content, specifically text that is considered inappropriate and unprofessional. We used a pre-trained DistilBERT transformer model for the sentiment analysis. The model was fine-tuned on Reddit posts and predicts 2 classes - which are NSFW and safe for work (SFW).
 
 ### Get Reddit data
 * Data pulled in notebook `reddit_data/reddit_new.ipynb`
@@ -77,5 +27,5 @@ Demo - split up the workload so that it uses everybody’s best talents, not eve
 * Run `python fine_tune_berft.py` to finetune the model on Reddit data
 * Run `rename_labels.py` to change the output labels of the classifier
 * Check out the fine-tuned model [here](https://huggingface.co/michellejieli/inappropriate_text_classifier)
-* [Spaces APP](https://huggingface.co/spaces/yjzhu0225/reddit_text_classification_app)
+* Check out the spaces app [Spaces APP](https://huggingface.co/spaces/yjzhu0225/reddit_text_classification_app)
 
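The README points to a fine-tuned two-class model and a label-renaming step, but the diff does not show how either is used. The following is a minimal sketch, not the project's actual code: the helper names `rename_labels` and `classify` and the `LABEL_NAMES` mapping are hypothetical, and which generic label corresponds to NSFW is an assumption. It assumes the model can be loaded through the `transformers` text-classification pipeline, whose predictions are dictionaries with `label` and `score` keys.

```python
from typing import Dict, List

# Hypothetical mapping: the README does not say which index is NSFW,
# so this assignment is an assumption made for illustration only.
LABEL_NAMES: Dict[str, str] = {"LABEL_0": "SFW", "LABEL_1": "NSFW"}


def rename_labels(predictions: List[dict]) -> List[dict]:
    """Replace generic LABEL_n names with readable ones; scores are untouched.

    Labels that are not in LABEL_NAMES (for example, already-renamed ones)
    pass through unchanged.
    """
    return [
        {**p, "label": LABEL_NAMES.get(p["label"], p["label"])}
        for p in predictions
    ]


def classify(texts: List[str]) -> List[dict]:
    """Classify texts with the fine-tuned model linked in the README.

    Downloads the model on first use, so this needs network access.
    """
    # Imported lazily so rename_labels works without transformers installed.
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="michellejieli/inappropriate_text_classifier",
    )
    return rename_labels(classifier(texts))
```

Calling `classify(["some Reddit post text"])` would return one prediction per input text. If the published model already emits readable labels, `rename_labels` leaves them as they are.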