avichr committed on
Commit c6a76dc • 1 Parent(s): 8f8c26a

Update README.md

Files changed (1):
  1. README.md +46 -46

README.md CHANGED
@@ -1,16 +1,16 @@
  ## HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition
- HeBERT is a Hebrew pretrained language model. It is based on Google's BERT architecture and it is BERT-Base config [(Devlin et al. 2018)](https://arxiv.org/abs/1810.04805). <br>

- HeBert was trained on three dataset:
- 1. A Hebrew version of OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/): ~9.8 GB of data, including 1 billion words and over 20.8 millions sentences.
- 2. A Hebrew dump of Wikipedia: ~650 MB of data, including over 63 millions words and 3.8 millions sentences
- 3. Emotion UGC data that was collected for the purpose of this study. (described below)
- We evaluated the model on emotion recognition and sentiment analysis, for a downstream tasks.

  ### Emotion UGC Data Description
- Our User Genrated Content (UGC) is comments written on articles collected from 3 major news sites, between January 2020 to August 2020,. Total data size ~150 MB of data, including over 7 millions words and 350K sentences.
- 4000 sentences annotated by crowd members (3-10 annotators per sentence) for 8 emotions (anger, disgust, expectation , fear, happy, sadness, surprise and trust) and overall sentiment / polarity<br>
- In order to valid the annotation, we search an agreement between raters to emotion in each sentence using krippendorff's alpha [(krippendorff, 1970)](https://journals.sagepub.com/doi/pdf/10.1177/001316447003000105). We left sentences that got alpha > 0.7. Note that while we found a general agreement between raters about emotion like happy, trust and disgust, there are few emotion with general disagreement about them, apparently given the complexity of finding them in the text (e.g. expectation and surprise).

  ### Performance
  #### sentiment analysis
@@ -26,47 +26,47 @@ In order to valid the annotation, we search an agreement between raters to emoti

  ## How to use
  ### For masked-LM model (can be fine-tuned to any downstream task)
- from transformers import AutoTokenizer, AutoModel
- tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
- model = AutoModel.from_pretrained("avichr/heBERT")
-
- from transformers import pipeline
- fill_mask = pipeline(
- "fill-mask",
- model="avichr/heBERT",
- tokenizer="avichr/heBERT"
- )
- fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")

  ### For sentiment classification model (polarity ONLY):
- from transformers import AutoTokenizer, AutoModel, pipeline
- tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis") #same as 'avichr/heBERT' tokenizer
- model = AutoModel.from_pretrained("avichr/heBERT_sentiment_analysis")
-
- # how to use?
- sentiment_analysis = pipeline(
- "sentiment-analysis",
- model="avichr/heBERT_sentiment_analysis",
- tokenizer="avichr/heBERT_sentiment_analysis",
- return_all_scores = True
- )
-
- sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים')
- >>> [[{'label': 'natural', 'score': 0.9978172183036804},
- >>> {'label': 'positive', 'score': 0.0014792329166084528},
- >>> {'label': 'negative', 'score': 0.0007035882445052266}]]

- sentiment_analysis('קפה זה טעים')
- >>> [[{'label': 'natural', 'score': 0.00047328314394690096},
- >>> {'label': 'possitive', 'score': 0.9994067549705505},
- >>> {'label': 'negetive', 'score': 0.00011996887042187154}]]

- sentiment_analysis('אני לא אוהב את העולם')
- >>> [[{'label': 'natural', 'score': 9.214012970915064e-05},
- >>> {'label': 'possitive', 'score': 8.876807987689972e-05},
- >>> {'label': 'negetive', 'score': 0.9998190999031067}]]

-
  Our model is also available on AWS! For more information, visit [AWS' git](https://github.com/aws-samples/aws-lambda-docker-serverless-inference/tree/main/hebert-sentiment-analysis-inference-docker-lambda)

@@ -80,7 +80,7 @@ our git: https://github.com/avichaychriqui/HeBERT
  Chriqui, A., & Yahav, I. (2021). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. arXiv preprint arXiv:2102.01909.
  ```
  @article{chriqui2021hebert,
- title={HeBERT \\& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
  author={Chriqui, Avihay and Yahav, Inbal},
  journal={arXiv preprint arXiv:2102.01909},
  year={2021}
 
  ## HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition
+ HeBERT is a Hebrew pre-trained language model, based on Google's BERT architecture with the BERT-Base configuration [(Devlin et al. 2018)](https://arxiv.org/abs/1810.04805). <br>
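+ To double-check the architecture details, the hosted config can be inspected directly (a minimal sketch using standard `transformers` APIs; the attribute names are standard BERT config fields, and the printed values are whatever the hosted checkpoint contains):
+     # Minimal sketch: load the hosted config and print a few BERT-Base fields
+     from transformers import AutoConfig
+     config = AutoConfig.from_pretrained("avichr/heBERT")
+     print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)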
 
+ HeBERT was trained on three datasets:
+ 1. A Hebrew version of OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/): ~9.8 GB of data, including 1 billion words and over 20.8 million sentences.
+ 2. A Hebrew dump of Wikipedia: ~650 MB of data, including over 63 million words and 3.8 million sentences.
+ 3. Emotion UGC data that was collected for the purpose of this study (described below).
+ We evaluated the model on emotion recognition and sentiment analysis as downstream tasks.
 
  ### Emotion UGC Data Description
+ Our User-Generated Content (UGC) consists of comments written on articles collected from 3 major news sites between January 2020 and August 2020. The total data size is ~150 MB, including over 7 million words and 350K sentences.
+ 4,000 sentences were annotated by crowd members (3-10 annotators per sentence) for 8 emotions (anger, disgust, expectation, fear, happiness, sadness, surprise, and trust) and overall sentiment/polarity. <br>
+ To validate the annotation, we measured agreement between raters on the emotion in each sentence using Krippendorff's alpha [(Krippendorff, 1970)](https://journals.sagepub.com/doi/pdf/10.1177/001316447003000105), and kept only sentences with alpha > 0.7. Note that while we found general agreement between raters on emotions such as happiness, trust, and disgust, a few emotions show general disagreement, apparently due to the difficulty of identifying them in text (e.g. expectation and surprise); the sketch below illustrates the agreement computation.
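+ As a rough illustration of this filtering step (a hedged sketch, not the authors' code; the ratings matrix below is hypothetical), Krippendorff's alpha can be computed with the open-source `krippendorff` Python package:
+     # Hypothetical ratings for one emotion: rows = annotators, columns = sentences;
+     # np.nan marks a sentence that an annotator did not label.
+     import numpy as np
+     import krippendorff  # pip install krippendorff
+     ratings = np.array([[1, 0, 1, np.nan],
+                         [1, 0, 1, 0],
+                         [1, 1, 1, 0]], dtype=float)
+     alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
+     print(alpha)  # keep only material whose agreement clears the 0.7 threshold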
 
  ### Performance
  #### sentiment analysis
 
  ## How to use
  ### For masked-LM model (can be fine-tuned to any downstream task)
+     from transformers import AutoTokenizer, AutoModel
+     tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
+     model = AutoModel.from_pretrained("avichr/heBERT")
+
+     from transformers import pipeline
+     fill_mask = pipeline(
+         "fill-mask",
+         model="avichr/heBERT",
+         tokenizer="avichr/heBERT"
+     )
+     fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
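+ No sample output is shown for this snippet; as standard `transformers` fill-mask behavior (not output copied from HeBERT), the call returns a list of candidate fills, each a dict with 'sequence', 'score', 'token', and 'token_str' keys, so the predictions can be printed like this (continuing the snippet above):
+     # Illustrative only: list each candidate fill for the masked token with its score
+     for pred in fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר."):
+         print(pred["token_str"], round(pred["score"], 3))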
 
  ### For sentiment classification model (polarity ONLY):
+     from transformers import AutoTokenizer, AutoModel, pipeline
+     tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis")  # same as the 'avichr/heBERT' tokenizer
+     model = AutoModel.from_pretrained("avichr/heBERT_sentiment_analysis")
+
+     # how to use?
+     sentiment_analysis = pipeline(
+         "sentiment-analysis",
+         model="avichr/heBERT_sentiment_analysis",
+         tokenizer="avichr/heBERT_sentiment_analysis",
+         return_all_scores=True
+     )
+
+     sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים')
+     >>> [[{'label': 'natural', 'score': 0.9978172183036804},
+     >>>   {'label': 'positive', 'score': 0.0014792329166084528},
+     >>>   {'label': 'negative', 'score': 0.0007035882445052266}]]
 
+     sentiment_analysis('קפה זה טעים')
+     >>> [[{'label': 'natural', 'score': 0.00047328314394690096},
+     >>>   {'label': 'positive', 'score': 0.9994067549705505},
+     >>>   {'label': 'negative', 'score': 0.00011996887042187154}]]
 
+     sentiment_analysis('אני לא אוהב את העולם')
+     >>> [[{'label': 'natural', 'score': 9.214012970915064e-05},
+     >>>   {'label': 'positive', 'score': 8.876807987689972e-05},
+     >>>   {'label': 'negative', 'score': 0.9998190999031067}]]
 
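+ Because return_all_scores=True returns a score for every label, a small helper (our naming, not part of the model card) can reduce each result to its top label:
+     # Illustrative helper: pick the highest-scoring label for a single input
+     def top_label(result):
+         best = max(result[0], key=lambda d: d["score"])  # result[0] holds the scores for the first input
+         return best["label"], best["score"]
+
+     print(top_label(sentiment_analysis('קפה זה טעים')))  # e.g. ('positive', 0.999...)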
  Our model is also available on AWS! For more information, visit [AWS' git](https://github.com/aws-samples/aws-lambda-docker-serverless-inference/tree/main/hebert-sentiment-analysis-inference-docker-lambda)
 
 
 
  Chriqui, A., & Yahav, I. (2021). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. arXiv preprint arXiv:2102.01909.
  ```
  @article{chriqui2021hebert,
+ title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
  author={Chriqui, Avihay and Yahav, Inbal},
  journal={arXiv preprint arXiv:2102.01909},
  year={2021}