Michelle Li committed on
Commit 39dd8c7 · unverified · 1 Parent(s): 6baf964

add project description in readme

Files changed (1)
  1. README.md +3 -53
README.md CHANGED
@@ -10,59 +10,9 @@ pinned: false
  ---


- # reddit_text_classification
-
- [![Sync to Hugging Face hub](https://github.com/YZhu0225/reddit_text_classification/actions/workflows/sync_to_hugging_face_hub.yml/badge.svg)](https://github.com/YZhu0225/reddit_text_classification/actions/workflows/sync_to_hugging_face_hub.yml)
-
- Ideas - text classification
-
- API that does a microservice
-
- Use swagger documentation
-
- Filter words
- Query on it
- Text classification - sentiment analysis
-
- Filter, spam, sensitive content, cuss words
-
- https://new.pythonforengineers.com/blog/build-a-reddit-bot-part-1/
-
- Look for explicit things, further research on which subreddit to use
-
- Find data that contains labels and is similar to how people type on Reddit
-
- Figure out which model
-
- If it does
-
- Next steps:
- EOD Wednesday
-
- Meet next Monday 12:30 PM
-
- https://new.pythonforengineers.com/blog/build-a-reddit-bot-part-1/
-
-
-
- Trying to break the project into as small pieces as possible: get a static copy of some posts that are known matches and some that are not (examples of good posts and bad posts), and get everything working locally first. Once that works, hook it up to the Reddit API (if we can get it): take half an hour to see if the API lets us grab a post, use it to grab a few posts, and inject some fake bad words into an original post. Real-time processing is the cherry on the sundae (nice to have), but without the core system working it is better not to do it at all. Use a Hugging Face model to detect toxic content (that by itself is good enough), have some data, and build a Spaces app and a command-line tool (99% on that); real time only if we have time.
-
- Things to do:
- - (Yuanjing and Xiaoquan) Find examples of reddit posts for both classes we are trying to classify, Convert to CSV (2 columns - text, class), 2 classes, Thursday December 8, 2022
- - (Michelle) Find Hugging Face model to use to classify posts
- - (Michelle) Finetune model on reddit posts and upload to Hugging Face (create API)
- - (Susanna) Create CLI or spaces app on Hugging Face
- - Connect to real time (optional)
- - (Xiaoquan and Yuanjing) Make demo video
-
-
- Due date: December 16, 2022
-
- Demo: split up the workload so that it uses everybody's best talents; not everyone has to present. Break the problem up so that the final outcome is the best: whoever is really good at editing does the editing, whoever is good at voiceover does the voiceover, whoever is good at documentation does the documentation, and whoever is good at coding does the coding.
-
-
-


  ### Get Reddit data
  * Data pulled in notebook `reddit_data/reddit_new.ipynb`
@@ -77,5 +27,5 @@ Demo - split up the workload so that it uses everybody’s best talents, not eve
  * Run `python fine_tune_berft.py` to finetune the model on Reddit data
  * Run `rename_labels.py` to change the output labels of the classifier
  * Check out the fine-tuned model [here](https://huggingface.co/michellejieli/inappropriate_text_classifier)
- * [Spaces APP](https://huggingface.co/spaces/yjzhu0225/reddit_text_classification_app)
 
 
  ---


+ # Reddit Explicit Text Classifier
 
+ In this project, we created a Hugging Face Spaces app with a Gradio interface that classifies not safe for work (NSFW) content, specifically text that is considered inappropriate and unprofessional. We used a pre-trained DistilBERT transformer model for sentiment analysis. The model was fine-tuned on Reddit posts and predicts two classes: NSFW and safe for work (SFW).

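As a rough sketch of how such a classifier can be exposed through a Gradio interface (this is illustrative only, not the app code in this repository; the "NSFW"/"SFW" label names returned by the pipeline are an assumption based on the class description above):

```python
# Illustrative sketch: a minimal Gradio app around the fine-tuned classifier
# published on the Hub. Label names are assumed to be "NSFW" / "SFW".
import gradio as gr
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="michellejieli/inappropriate_text_classifier",
)

def classify(text: str) -> dict:
    # top_k=None returns a score for every class, which gr.Label can display.
    scores = classifier(text, top_k=None)
    return {item["label"]: float(item["score"]) for item in scores}

demo = gr.Interface(
    fn=classify,
    inputs=gr.Textbox(lines=4, label="Reddit post"),
    outputs=gr.Label(num_top_classes=2),
    title="Reddit Explicit Text Classifier",
)

if __name__ == "__main__":
    demo.launch()
```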
  ### Get Reddit data
  * Data pulled in notebook `reddit_data/reddit_new.ipynb`

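The notebook itself is not shown in this diff; purely as a hedged illustration of one way posts could be pulled from the Reddit API and written to the two-column CSV (text, class) described in the project notes, a PRAW-based pull might look like the following. The client credentials, subreddit names, and label assignment are placeholders, not necessarily what `reddit_new.ipynb` does:

```python
# Hypothetical sketch: collect Reddit posts into a (text, class) CSV.
# Subreddits and labels below are placeholders for illustration only.
import csv
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit_text_classification data pull",
)

def collect(subreddit_name, label, limit=100):
    rows = []
    for submission in reddit.subreddit(subreddit_name).hot(limit=limit):
        text = f"{submission.title} {submission.selftext}".strip()
        if text:
            rows.append({"text": text, "class": label})
    return rows

rows = collect("AskReddit", "SFW") + collect("confession", "NSFW")

with open("reddit_posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "class"])
    writer.writeheader()
    writer.writerows(rows)
```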
  * Run `python fine_tune_berft.py` to finetune the model on Reddit data
  * Run `rename_labels.py` to change the output labels of the classifier (see the sketch below)
  * Check out the fine-tuned model [here](https://huggingface.co/michellejieli/inappropriate_text_classifier)
+ * Check out the Spaces app [here](https://huggingface.co/spaces/yjzhu0225/reddit_text_classification_app)
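The label-renaming step above typically amounts to rewriting the label mapping stored in the model config. A minimal sketch of what a script like `rename_labels.py` could do, assuming the fine-tuned checkpoint still uses the generic `LABEL_0`/`LABEL_1` names (the checkpoint path and the id-to-class assignment here are assumptions, not taken from the repository):

```python
# Hypothetical sketch: replace generic LABEL_0 / LABEL_1 names in the
# fine-tuned DistilBERT config with the project's class names.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "fine_tuned_distilbert"  # assumed local output of the fine-tuning step

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Rewrite the label mapping stored in config.json.
model.config.id2label = {0: "SFW", 1: "NSFW"}
model.config.label2id = {"SFW": 0, "NSFW": 1}

# Save locally (or push_to_hub) so the text-classification pipeline
# reports "SFW"/"NSFW" instead of LABEL_0/LABEL_1.
model.save_pretrained("inappropriate_text_classifier")
tokenizer.save_pretrained("inappropriate_text_classifier")
```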