weiqipedia
commited on
Update metrics in README.md
Browse files
README.md
CHANGED
@@ -43,13 +43,14 @@ The evaluation was done zero-shot with Indonesian prompts and only a sample of 1
|
|
43 |
| Model | QA (F1) | Sentiment (F1) | Toxicity (F1) | Eng>Indo (ChrF++) | Indo>Eng (ChrF++) | Summary (ROUGE-L) | NLI (Acc) | Causal (Acc) |
|
44 |
|--------------------------------|---------|----------------|---------------|-------------------|-------------------|-------------------|-----------|--------------|
|
45 |
| SEA-LION-7B-Instruct-Research | 24.86 | 76.13 | 24.45 | 52.50 | 46.82 | 15.44 | 33.20 | 23.80 |
|
46 |
-
| SEA-LION-7B-Instruct | **68.41
|
47 |
| SeaLLM 7B v1 | 30.96 | 56.29 | 22.60 | 62.23 | 41.55 | 14.03 | 26.50 | 56.60 |
|
48 |
-
| SeaLLM 7B v2 | 44.40 | 80.13 | **55.24**
|
49 |
-
| Sailor-7B (Base) | 65.43 | 59.48 | 20.48 | **64.27**
|
|
|
50 |
| Llama 2 7B Chat | 11.12 | 52.32 | 0.00 | 44.09 | 57.58 | 9.24 | 0.00 | 0.00 |
|
51 |
| Mistral 7B Instruct v0.1 | 38.85 | 74.38 | 20.83 | 30.60 | 51.43 | 15.63 | 28.60 | 50.80 |
|
52 |
-
| GPT-4
|
53 |
|
54 |
- For Natural Language Understanding (NLU) tasks, we tested the model on Sentiment Analysis (`Sentiment`) using the NusaX dataset, Question Answering (`QA`) using the TyDiQA dataset, and Toxicity Detection (`Toxicity`) using the Indonesian Multi-Label Hate Speech Detection dataset. The metrics used are F1 scores for all three tasks.
|
55 |
- For Natural Language Generation (NLG) tasks, we tested the model on Machine Translation from English to Indonesian (`Eng>Indo`) and from Indonesian to English (`Indo>Eng`) using the FLORES-200 dataset, and Abstractive Summarization (`Summary`) using the XLSum dataset. The metrics used for Machine Translation and Abstractive Summarization are ChrF++ and ROUGE-L respectively.
|
|
|
43 |
| Model | QA (F1) | Sentiment (F1) | Toxicity (F1) | Eng>Indo (ChrF++) | Indo>Eng (ChrF++) | Summary (ROUGE-L) | NLI (Acc) | Causal (Acc) |
|
44 |
|--------------------------------|---------|----------------|---------------|-------------------|-------------------|-------------------|-----------|--------------|
|
45 |
| SEA-LION-7B-Instruct-Research | 24.86 | 76.13 | 24.45 | 52.50 | 46.82 | 15.44 | 33.20 | 23.80 |
|
46 |
+
| SEA-LION-7B-Instruct | **68.41**| **91.45** | 17.98 | 57.48 | 58.04 | **17.54** | 53.10 | 60.80 |
|
47 |
| SeaLLM 7B v1 | 30.96 | 56.29 | 22.60 | 62.23 | 41.55 | 14.03 | 26.50 | 56.60 |
|
48 |
+
| SeaLLM 7B v2 | 44.40 | 80.13 | **55.24** | 64.01 | **63.28** | 17.31 | 43.60 | 82.00 |
|
49 |
+
| Sailor-7B (Base) | 65.43 | 59.48 | 20.48 | **64.27** | 60.68 | 8.69 | 15.10 | 38.40 |
|
50 |
+
| Sailor-7B-Chat | 38.02 | 87.64 | 52.07 | 64.25 | 61.87 | 15.28 | **68.30** |**85.60** |
|
51 |
| Llama 2 7B Chat | 11.12 | 52.32 | 0.00 | 44.09 | 57.58 | 9.24 | 0.00 | 0.00 |
|
52 |
| Mistral 7B Instruct v0.1 | 38.85 | 74.38 | 20.83 | 30.60 | 51.43 | 15.63 | 28.60 | 50.80 |
|
53 |
+
| GPT-4 (gpt-4-0314) | 73.60 | 74.14 | 63.96 | 69.38 | 67.53 | 18.71 | 83.20 | 96.00 |
|
54 |
|
55 |
- For Natural Language Understanding (NLU) tasks, we tested the model on Sentiment Analysis (`Sentiment`) using the NusaX dataset, Question Answering (`QA`) using the TyDiQA dataset, and Toxicity Detection (`Toxicity`) using the Indonesian Multi-Label Hate Speech Detection dataset. The metrics used are F1 scores for all three tasks.
|
56 |
- For Natural Language Generation (NLG) tasks, we tested the model on Machine Translation from English to Indonesian (`Eng>Indo`) and from Indonesian to English (`Indo>Eng`) using the FLORES-200 dataset, and Abstractive Summarization (`Summary`) using the XLSum dataset. The metrics used for Machine Translation and Abstractive Summarization are ChrF++ and ROUGE-L respectively.
|