---
language: fr
license: mit
datasets:
- amazon_reviews_multi
- allocine
widget:
- text: "Je pensais lire un livre nul, mais finalement je l'ai trouvé super !"
- text: "Cette banque est très bien, mais elle n'offre pas les services de paiements sans contact."
- text: "Cette banque est très bien et elle offre en plus les services de paiements sans contact."
---

DistilCamemBERT-Sentiment
=========================

We present DistilCamemBERT-Sentiment, which is [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) fine-tuned for sentiment analysis in French. The model is trained on two datasets, [Amazon Reviews](https://huggingface.co/datasets/amazon_reviews_multi) and [Allociné.fr](https://huggingface.co/datasets/allocine), in order to minimize bias. Amazon reviews tend to be short and similar to one another, whereas Allociné reviews are long, rich texts.

This model is close to [tblard/tf-allocine](https://huggingface.co/tblard/tf-allocine), which is based on the [CamemBERT](https://huggingface.co/camembert-base) model. The problem with CamemBERT-based models arises at scale, for example in production, where inference cost can become a technological issue. To counteract this, we propose this model, which **divides the inference time by 2** for the same power consumption thanks to [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base).

Dataset
-------

The dataset comprises XXX,XXX reviews for training and X,XXX reviews for testing from Amazon, and XXX,XXX and X,XXX reviews, respectively, from the Allociné website. Each review is labeled with one of 5 categories:
* 1 star: very bad appreciation,
* 2 stars: bad appreciation,
* 3 stars: neutral appreciation,
* 4 stars: good appreciation,
* 5 stars: very good appreciation.
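
The five categories above can be turned into a small lookup, for instance to display a readable verdict next to a prediction. This is only a sketch: the label strings are taken from the example output in this card, while `STAR_MEANINGS` and `describe_label` are hypothetical names introduced for illustration.

```python
# Hypothetical lookup table: keys are the label strings the model emits,
# values are the appreciation levels listed above.
STAR_MEANINGS = {
    "1 star": "very bad",
    "2 stars": "bad",
    "3 stars": "neutral",
    "4 stars": "good",
    "5 stars": "very good",
}


def describe_label(label: str) -> str:
    """Return the appreciation level for a label such as '4 stars'."""
    try:
        return STAR_MEANINGS[label]
    except KeyError:
        # Fail loudly on anything outside the five known categories.
        raise ValueError(f"unknown label: {label!r}") from None
```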
 
Evaluation results
------------------

Benchmark
---------

This model is compared to 3 reference models (see below). Because the models do not share the same target definitions, we detail the performance measure used for each of them. Mean inference time was measured on an **AMD Ryzen 5 4500U @ 2.3GHz with 6 cores**.
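
The exact timing protocol is not described in this card. As a rough sketch of how a mean inference time could be measured, the helper below times any callable over a list of texts; `mean_inference_time`, `predict`, and the `warmup` parameter are assumptions made for illustration, not the authors' benchmark code.

```python
import statistics
import time
from typing import Callable, Sequence


def mean_inference_time(
    predict: Callable[[str], object],
    texts: Sequence[str],
    warmup: int = 1,
) -> float:
    """Return the mean wall-clock time, in seconds, of one `predict` call."""
    if not texts:
        raise ValueError("texts must not be empty")
    # Untimed warm-up calls, so one-off setup costs do not skew the mean.
    for text in texts[:warmup]:
        predict(text)
    durations = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)
```

Passing a pipeline object such as the `analyzer` from the usage section as `predict` would give a per-text latency on the local machine; the figures depend heavily on hardware and library versions.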

### [bert-base-multilingual-uncased-sentiment](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)

### [tf-allociné](https://huggingface.co/tblard/tf-allocine) and [barthez-sentiment-classification](https://huggingface.co/moussaKam/barthez-sentiment-classification)

How to use DistilCamemBERT-Sentiment
------------------------------------

```python
from transformers import pipeline

analyzer = pipeline(
    task='text-classification',
    model="cmarkea/distilcamembert-base-sentiment",
    tokenizer="cmarkea/distilcamembert-base-sentiment"
)
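# top_k=None requests the score of every class instead of only the best one
# (the order of the entries may differ across transformers versions).
result = analyzer(
    "J'aime marché dans la nature même si ça me donne mal au pied.",
    top_k=None
)

result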
[{'label': '1 star',
  'score': 0.07675889134407043},
 {'label': '2 stars',
  'score': 0.19822990894317627},
 {'label': '3 stars',
  'score': 0.38655608892440796},
 {'label': '4 stars',
  'score': 0.24029818177223206},
 {'label': '5 stars',
  'score': 0.09815695881843567}]
```
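
A list of per-class scores like the one above can be reduced to a single verdict. The sketch below, which reuses the scores from the example output, picks the most likely label and also computes a score-weighted average star rating; `summarize` is a hypothetical helper, and the weighted average is just one possible summary, not something the model itself returns.

```python
# Scores copied from the example output above.
result = [
    {"label": "1 star", "score": 0.07675889134407043},
    {"label": "2 stars", "score": 0.19822990894317627},
    {"label": "3 stars", "score": 0.38655608892440796},
    {"label": "4 stars", "score": 0.24029818177223206},
    {"label": "5 stars", "score": 0.09815695881843567},
]


def summarize(scores):
    """Return (most likely label, score-weighted average number of stars)."""
    best = max(scores, key=lambda item: item["score"])
    total = sum(item["score"] for item in scores)
    # Labels look like "1 star" or "4 stars": the first token is the star count.
    expected = sum(
        int(item["label"].split()[0]) * item["score"] for item in scores
    ) / total
    return best["label"], expected


top_label, expected_stars = summarize(result)
print(top_label, round(expected_stars, 2))
```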