
Model Card for gllama-alarm-implicit-hate

GLlama Alarm is a suite of knowledge-Guided versions of Llama 2, instruction fine-tuned for non-binary abusive language detection and explanation generation.

Model Details

This version has been instruction fine-tuned on the Implicit Hate Corpus for multi-class expressiveness detection (i.e., implicit hate speech, explicit hate speech, not hate) and explanation generation, using prompts enriched with encyclopedic, commonsense, and temporal linguistic knowledge.

Model Description

  • Developed by: Chiara Di Bonaventura, Lucia Siciliani, Pierpaolo Basile
  • Funded by: The Alan Turing Institute, Fondazione FAIR
  • Language: English
  • Finetuned from model: meta-llama/Llama-2-7b-hf

Uses

GLlama Alarm is intended for research use in English, especially for NLP tasks in the social media domain, which may contain offensive content. Our suite can be used to detect different levels of offensiveness and expressiveness of abusive language (e.g., offensive comments and implicit hate speech, the latter of which has proven hard for many LLMs) and to generate structured textual explanations of why a text contains abusive language.

In any case, language models, including ours, can potentially be used for language generation in a harmful way. GLlama Alarm should not be used directly in any application without a prior assessment of the safety and fairness concerns specific to that application.
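As a rough usage guide, here is a minimal inference sketch assuming the standard transformers/peft loading path for this adapter on top of meta-llama/Llama-2-7b-hf. The prompt shown is only illustrative: the actual knowledge-guided template is given in Table 9 of the paper and is not reproduced in this card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-2-7b-hf"
ADAPTER_ID = "dibo/gllama-alarm-implicit-hate"

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.float16, device_map="auto"
)
# Attach the PEFT adapter weights from this repository to the frozen base model.
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
model.eval()

# Illustrative instruction-style prompt; replace with the knowledge-guided
# template from Table 9 of the paper.
prompt = (
    "### Instruction:\n"
    "Classify the following post as implicit hate speech, explicit hate speech, "
    "or not hate, and explain your decision.\n\n"
    "### Input:\n"
    "<post text>\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and decode only the generated continuation.
generated = output_ids[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```

Loading in float16 with device_map="auto" is one reasonable choice for a 7B model on a single GPU, not a requirement.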

Training Details

GLlama Alarm builds on top of the foundation model Llama 2, an auto-regressive language model with an optimized transformer architecture that was trained between January 2023 and July 2023 on a mix of publicly available online data. We selected the base version of Llama 2, which has 7B parameters, and instruction fine-tuned it on two datasets separately: HateXplain and the Implicit Hate Corpus. This version is the one instruction fine-tuned on the Implicit Hate Corpus. Both datasets contain publicly available data designed for hate speech detection, thus ensuring data privacy and protection. To instruction fine-tune Llama 2, we created knowledge-guided prompts following our paradigm; the template is shown in Table 9 of the paper. We instruction fine-tuned Llama 2 with 17k knowledge-guided prompts for HateXplain and Implicit Hate for 5 epochs, setting the other hyperparameters as suggested by Taori et al. (2023).
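For orientation only, here is a hedged sketch of the kind of PEFT/LoRA fine-tuning loop this setup implies. The LoRA configuration, learning rate, and batch size below are assumptions (the card fixes only the epoch count and defers the remaining hyperparameters to Taori et al., 2023), and the single placeholder example stands in for the actual knowledge-guided prompts.

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

BASE_ID = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_ID)

# Assumed LoRA settings targeting the attention projections; the card does not
# state the authors' exact adapter configuration.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Placeholder standing in for the 17k knowledge-guided prompts: each example
# is a full prompt followed by its target detection label and explanation.
train_dataset = Dataset.from_dict(
    {"text": ["<knowledge-guided prompt + target output>"]}
)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

train_dataset = train_dataset.map(tokenize, remove_columns=["text"])

args = TrainingArguments(
    output_dir="gllama-alarm-implicit-hate",
    num_train_epochs=5,               # as stated in the card
    learning_rate=2e-5,               # Taori et al. (2023) default (assumed)
    per_device_train_batch_size=4,    # illustrative
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    # mlm=False gives standard causal-LM labels (input_ids shifted internally).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```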

Citation

BibTeX:

@inproceedings{dibonaventura2025gllama_alarm,
  title={From Detection to Explanation: Effective Learning Strategies for LLMs in Online Abusive Language Research},
  author={Di Bonaventura, Chiara and Siciliani, Lucia and Basile, Pierpaolo and Meroño-Peñuela, Albert and McGillivray, Barbara},
  booktitle={Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025)},
  year={2025}
}

APA:

Di Bonaventura, C., Siciliani, L., Basile, P., Meroño-Peñuela, A., & McGillivray, B. (2025). From Detection to Explanation: Effective Learning Strategies for LLMs in Online Abusive Language Research. In Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025).

Model Card Contact

[email protected]

Framework versions

  • PEFT 0.10.0
