dmgcsilva committed
Commit 7dd1f6f
1 Parent(s): d448586

Update README.md

Files changed (1)
  1. README.md +37 -5
README.md CHANGED
@@ -1,10 +1,8 @@
 ---
 license: apache-2.0
+inference: False
 language:
 - en
-library_name: transformers
-
-
 widget:
 - text: "I am good. Just got back from vacation"
   example_title: "Example 1"
@@ -16,10 +14,44 @@ widget:
   example_title: "Example 4"
 ---
 
-Simple Model trained to classify 4 levels of politeness:
+# Polite Bert
+
+<img src="https://imgur.com/y8vqhCs" alt="drawing" style="width:300px;"/>
+
+### Model Details
+
+Polite Bert is, as the name implies, a BERT model trained to classify a given sentence on a scale of politeness:
+
 - Not Polite (aka Rude or Impolite)
 - Neutral
 - Somewhat Polite
 - Polite
 
-Made by Diogo M. Silva. PhD Student at NOVA FCT and Affiliated PhD Student at CMU
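For reference, a minimal inference sketch for the four-label classifier described above. The repo id `dmgcsilva/polite-bert` and the label-to-index order are assumptions for illustration, not taken from the card:

```python
# Minimal inference sketch; the repo id and label order are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "dmgcsilva/polite-bert"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

labels = ["Not Polite", "Neutral", "Somewhat Polite", "Polite"]  # assumed order

# One of the card's widget examples.
inputs = tokenizer("I am good. Just got back from vacation",
                   truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(dim=-1).item()])
```

Since the new metadata sets `inference: False`, the hosted widget does not serve predictions, so loading the model locally like this is the way to try the widget examples.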
 
+#### Training
+
+Polite Bert was trained by fine-tuning a [BERT](https://huggingface.co/google-bert/bert-base-uncased) model on sentences annotated for politeness level.
+The model was trained with supervised fine-tuning (SFT) for 4 epochs, with a batch size of 16 and a maximum sequence length of 128 tokens.
+
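Those hyperparameters map directly onto a standard `Trainer` run. A minimal sketch, assuming the annotated sentences are available as a `datasets.Dataset` with `text` and `label` columns (the two rows below are toy placeholders):

```python
# Fine-tuning sketch for the stated setup: bert-base-uncased, 4 epochs,
# batch size 16, max sequence length 128. The dataset rows are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

train_dataset = Dataset.from_dict({
    "text": ["Thank you so much for your help!", "Get lost."],
    "label": [3, 0],  # toy rows; the real training set has 2000 sentences
})

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased", num_labels=4)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="polite-bert",
                           num_train_epochs=4,
                           per_device_train_batch_size=16),
    train_dataset=train_dataset.map(tokenize, batched=True),
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```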
+### Data Details
+
+The training data consisted of 2000 annotated sentences, composed as follows:
+
+Manually annotated data:
+- 250 sentences sampled from the [EUROPARL](https://www.statmt.org/europarl/) dataset, specifically the English side of the PT-EN data.
+- 250 sentences sampled from the [SIMMC2.0](https://aclanthology.org/2021.emnlp-main.401.pdf) dataset, from any domain (Fashion or Furniture) and speaker (System or User).
+- 250 sentences sampled from the Philosophy and Politics data of the [StackExchange](https://stackoverflow.blog/2014/01/23/stack-exchange-cc-data-now-hosted-by-the-internet-archive/) dataset.
+- 250 sentences sampled from a collection of hotel review replies from Trip Advisor.
+
+Automatically annotated data:
+- 1000 sentences from the [4Chan Pol](https://arxiv.org/abs/2001.07487) dataset. Specifically, we only considered sentences annotated with TOXICITY > 0.85, SEVERE_TOXICITY > 0.85, and INSULT > 0.5.
+
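A small sketch of how that threshold rule could be applied, assuming each 4Chan Pol sentence carries a dict of Perspective-style attribute scores under the names used in the bullet above:

```python
# Select automatically-labelled "Not Polite" sentences using the thresholds
# above: TOXICITY > 0.85, SEVERE_TOXICITY > 0.85, and INSULT > 0.5.
def is_not_polite(scores: dict) -> bool:
    return (scores.get("TOXICITY", 0.0) > 0.85
            and scores.get("SEVERE_TOXICITY", 0.0) > 0.85
            and scores.get("INSULT", 0.0) > 0.5)

sample = {"TOXICITY": 0.91, "SEVERE_TOXICITY": 0.88, "INSULT": 0.62}
print(is_not_polite(sample))  # True -> labelled "Not Polite"
```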
+While we manually labelled the first 1000 sentences, the 1000 sentences from 4Chan Pol were automatically labelled **Not Polite**.
+
+These source datasets were chosen for their likelihood of containing distinct and pronounced politeness levels (hate speech from 4chan, formal and polite speech from hotel staff and parliament members, etc.).
+
+#### License
+
+Apache 2.0 license.
+
+
+Made by Diogo Glória-Silva, PhD Student at NOVA FCT and Affiliated PhD Student at CMU.