Sami92 committed
Commit b26f325
1 Parent(s): 7820915

Update README.md

Files changed (1): README.md +4 -3
README.md CHANGED
@@ -46,7 +46,7 @@ Use the code below to get started with the model.
 
 The model was trained on three datasets, each based on the data from partypress/partypress-multilingual. The first dataset was weakly labeled using GPT-4o. The [prompt](https://huggingface.co/Sami92/XLM-R-Large-PartyPress/blob/main/FinalPromptPartyPress.txt) contained the label description taken from [Erfort et al. (2023)](https://journals.sagepub.com/doi/10.1177/20531680231183512). The weakly labeled dataset contains 32,060 press releases.
 The second dataset was drawn from Telegram channels, more specifically a sample from about 200 channels that have been subject to a fact-check by either Correctiv, dpa, Faktenfuchs, or AFP. 7,741 posts were sampled and weakly annotated by GPT-4o with the same prompt as before.
-The third dataset is the human-annotated dataset that is used for training partypress/partypress-multilingual. For training, only the single-coded examples were used (24,117). Evaluation was performed on the data that is annotated by two human coders per example (3,121).
+The third dataset is the human-annotated dataset that is used for training partypress/partypress-multilingual. For training, only the single-coded examples were used (24,117).
 
 
 
@@ -68,10 +68,11 @@ The third dataset is the human-annotated dataset that is used for training party
 
 
 ### Testing Data
-
-The testing was performed on the same data as for [Sami92/XLM-R-Large-PartyPress](https://huggingface.co/Sami92/XLM-R-Large-PartyPress/edit/main/README.md). Due to the extra training step on the Telegram data, the F1-score on press releases dropped from 0.72 to 0.62.
-
-However, for the second test, there is an improvement. For testing on Telegram data, a sample of 84 posts was taken and labeled by the model. Three annotators were then asked whether the model's prediction was a main topic of the post, a subtopic, or incorrect. The majority vote was used as the final label. The detailed results can be found below. For 93% of the Telegram posts, the model's prediction was either a main topic or a subtopic. For [Sami92/XLM-R-Large-PartyPress](https://huggingface.co/Sami92/XLM-R-Large-PartyPress/edit/main/README.md), the prediction was a main topic or subtopic in only 88% of cases. The improvement is even more visible when focusing on main topics only: the Telegram-fine-tuned model predicts a main topic in 82% of cases, compared to 75% for the model without training on Telegram data.
+The model was evaluated on two datasets. The first consists of the press releases that are annotated by two human coders per example (3,121); this is the same test data as for [Sami92/XLM-R-Large-PartyPress](https://huggingface.co/Sami92/XLM-R-Large-PartyPress/edit/main/README.md). For testing on Telegram data, a sample of 84 posts was taken and labeled by the model. Three annotators were then asked whether the model's prediction was a main topic of the post, a subtopic, or incorrect. The majority vote was used as the final label.
+
+On the first dataset, consisting of press releases, the F1-score dropped from 0.72 to 0.62 compared to [Sami92/XLM-R-Large-PartyPress](https://huggingface.co/Sami92/XLM-R-Large-PartyPress/edit/main/README.md).
+
+For the second test, there is an improvement. The detailed results can be found below. For 93% of the Telegram posts, the model's prediction was either a main topic or a subtopic; for [Sami92/XLM-R-Large-PartyPress](https://huggingface.co/Sami92/XLM-R-Large-PartyPress/edit/main/README.md), the prediction was a main topic or subtopic in only 88% of cases. The improvement is even more visible when focusing on main topics only: the Telegram-fine-tuned model predicts a main topic in 82% of cases, compared to 75% for the model without training on Telegram data.
 
 
 ### Results
 
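As context for the weak-labeling step described in the diff above, here is a minimal sketch of how a text could be annotated with GPT-4o using the linked prompt, assuming FinalPromptPartyPress.txt has been downloaded locally. The `weak_label` helper, decoding settings, and example text are hypothetical, not the authors' actual script:

```python
# Minimal sketch of GPT-4o weak labeling; not the authors' actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Prompt with the label descriptions from Erfort et al. (2023), as linked in the README.
with open("FinalPromptPartyPress.txt", encoding="utf-8") as f:
    SYSTEM_PROMPT = f.read()

def weak_label(text: str) -> str:
    """Ask GPT-4o for a PartyPress topic label for one text (hypothetical helper)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0,  # keep labels as deterministic as possible
    )
    return response.choices[0].message.content.strip()

# Example: weakly label one press release (illustrative text).
print(weak_label("Die Bundesregierung kündigt neue Investitionen in den Klimaschutz an."))
```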
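The majority vote over the three annotator judgements amounts to taking the most frequent answer per post; a toy sketch, with the judgement strings assumed:

```python
# Toy sketch of the majority-vote aggregation over three annotator judgements.
from collections import Counter

def majority_vote(judgements: list[str]) -> str:
    """Return the most frequent judgement for one post."""
    return Counter(judgements).most_common(1)[0][0]

print(majority_vote(["main topic", "subtopic", "main topic"]))  # -> main topic
```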