Updated readme to add more details on bhasa
README.md
CHANGED
@@ -26,19 +26,23 @@ We evaluated SEA-LION-7B-Instruct on the BHASA benchmark ([arXiv](https://arxiv.
BHASA stands out amongst other evaluations for SEA languages for its holistic approach to evaluation, including not just traditional Natural Language Processing (NLP) benchmarking tasks (such as sentiment analysis and question answering), but also linguistic and cultural diagnostic tests which are meticulously handcrafted.

-The scores shown in the table below have been adjusted to only consider answers provided in the appropriate language.
+The evaluation was done zero-shot with Indonesian prompts and only a sample of 100-1000 instances for each dataset was used as per the setting described in the BHASA paper. The scores shown in the table below have been adjusted to only consider answers provided in the appropriate language.

| Model | QA (F1) | Sentiment (F1) | Toxicity (F1) | Eng>Indo (ChrF++) | Indo>Eng (ChrF++) | Summary (ROUGE-L) | NLI (Acc) | Causal (Acc) |
|--------------------------------|---------|----------------|---------------|-------------------|-------------------|-------------------|-----------|--------------|
| SEA-LION-7B-Instruct-Research | 24.86 | 76.13 | 24.45 | 52.50 | 46.82 | 15.44 | 33.20 | 23.80 |
-| SEA-LION-7B-Instruct | **68.41** | **91.45** | 17.98 | 57.48 | 58.04 | **17.54** | 53.10 | 60.80 |
+| SEA-LION-7B-Instruct | **68.41** | **91.45** | 17.98 | 57.48 | 58.04 | **17.54** | **53.10** | 60.80 |
| SeaLLM 7B v1 | 30.96 | 56.29 | 22.60 | 62.23 | 41.55 | 14.03 | 26.50 | 56.60 |
| SeaLLM 7B v2 | 44.40 | 80.13 | **55.24** | 64.01 | **63.28** | 17.31 | 43.60 | **82.00** |
| Sailor-7B | 65.43 | 59.48 | 20.48 | **64.27** | 60.68 | 8.69 | 15.10 | 38.40 |
| Llama 2 7B Chat | 11.12 | 52.32 | 0.00 | 44.09 | 57.58 | 9.24 | 0.00 | 0.00 |
-| Mistral 7B Instruct v0.1 | 38.85 | 74.38 | 20.83 | 30.60 | 51.43 | 15.63 |
+| Mistral 7B Instruct v0.1 | 38.85 | 74.38 | 20.83 | 30.60 | 51.43 | 15.63 | 28.60 | 50.80 |
| GPT-4 | 73.60 | 74.14 | 63.96 | 69.38 | 67.53 | 18.71 | 83.20 | 96.00 |

+- For Natural Language Understanding (NLU) tasks, we tested the model on Sentiment Analysis (`Sentiment`) using the NusaX dataset, Question Answering (`QA`) using the TyDiQA dataset, and Toxicity Detection (`Toxicity`) using the Indonesian Multi-Label Hate Speech Detection dataset. The metrics used are F1 scores for all three tasks.
+- For Natural Language Generation (NLG) tasks, we tested the model on Machine Translation from English to Indonesian (`Eng>Indo`) and from Indonesian to English (`Indo>Eng`) using the FLORES-200 dataset, and Abstractive Summarization (`Summary`) using the XLSum dataset. The metrics used for Machine Translation and Abstractive Summarization are ChrF++ and ROUGE-L respectively.
+- For Natural Language Reasoning (NLR) tasks, we tested the model on Natural Language Inference (`NLI`) using the IndoNLI lay dataset and on Causal Reasoning (`Causal`) using the XCOPA dataset. The metrics are based on accuracy for both tasks.
+
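The README states that scores were adjusted to only consider answers given in the appropriate language, but it does not spell out the procedure. The sketch below shows one plausible reading for the accuracy-based tasks (`NLI`, `Causal`): a response counts as correct only if its label matches the reference and its raw text passes a language check. Everything here is an assumption for illustration, not the BHASA implementation: the `Response` class, the `accuracy` helper, and the toy `looks_indonesian` check (a real setup would likely use a language-identification model) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Response:
    text: str   # raw model output
    label: str  # answer label parsed from the output (parsing not shown)


def accuracy(
    responses: Sequence[Response],
    references: Sequence[str],
    language_ok: Optional[Callable[[str], bool]] = None,
) -> float:
    """Return accuracy in percent.

    If language_ok is given, a response only counts as correct when its raw
    text also passes the language check (an assumed adjustment rule).
    """
    if len(responses) != len(references):
        raise ValueError("responses and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for response, reference in zip(responses, references):
        in_language = language_ok is None or language_ok(response.text)
        if in_language and response.label == reference:
            correct += 1
    return 100.0 * correct / len(references)


def looks_indonesian(text: str) -> bool:
    # Toy placeholder (hypothetical): stands in for real language identification.
    return text.lower().startswith("jawaban")


responses = [
    Response("Jawaban: A", "A"),
    Response("The answer is B", "B"),  # right label, wrong language
    Response("Jawaban: B", "B"),
    Response("Jawaban: A", "A"),
]
references = ["A", "B", "B", "B"]

plain = accuracy(responses, references)
adjusted = accuracy(responses, references, looks_indonesian)
print(f"plain={plain:.2f} adjusted={adjusted:.2f}")  # plain=75.00 adjusted=50.00
```

Under this assumed rule, a model that picks the right label but answers in English loses credit for that item, so the adjusted score can fall well below the plain one.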
### Usage
SEA-LION can be run using the 🤗 Transformers library
```python