|
--- |
|
license: cc-by-nc-sa-4.0 |
|
datasets: |
|
- QCRI/LlamaLens-English |
|
- QCRI/LlamaLens-Arabic |
|
- QCRI/LlamaLens-Hindi |
|
language: |
|
- ar |
|
- en |
|
- hi |
|
base_model: |
|
- meta-llama/Llama-3.1-8B-Instruct |
|
pipeline_tag: text-generation |
|
tags: |
|
- Social-Media |
|
- Hate-Speech |
|
- Summarization |
|
- offensive-language |
|
- News-Genre |
|
--- |
|
# LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content
|
|
|
## Overview |
|
LlamaLens is a specialized multilingual LLM for analyzing news and social media content. It covers 19 NLP tasks across 52 datasets in Arabic, English, and Hindi.
|
|
|
<p align="center"> |
|
<picture> |
|
  <img width="352" alt="capabilities_tasks_datasets" src="./llamalens-avatar.png">
|
</picture> |
|
</p> |
|
|
|
## Dataset |
|
The model was trained on the [LlamaLens dataset](https://huggingface.co/collections/QCRI/llamalens-672f7e0604a0498c6a2f0fe9). |
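
For a quick look at the training data, each language subset can be loaded with the `datasets` library. A minimal sketch (the exact split and configuration layout is an assumption; check the individual dataset cards):

```python
from datasets import load_dataset

# Load the English subset; the Arabic and Hindi subsets follow the same
# naming pattern (QCRI/LlamaLens-Arabic, QCRI/LlamaLens-Hindi).
dataset = load_dataset("QCRI/LlamaLens-English")
print(dataset)
```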
|
|
|
## Replicating the Experiments
|
The code to replicate the experiments is available on [GitHub](https://github.com/firojalam/LlamaLens). |
|
|
|
|
|
## Model Inference |
|
|
|
To run inference with LlamaLens, follow these steps:
|
|
|
1. **Install the Required Libraries:**
|
|
|
Ensure you have the necessary libraries installed. You can do this using pip: |
|
|
|
```bash
pip install transformers torch
```
|
2. **Load the Model and Tokenizer:**

   Use the `transformers` library to load the LlamaLens model; the `pipeline` API loads the matching tokenizer automatically:
|
|
|
```python
from transformers import pipeline

model_name = "QCRI/LlamaLens"
pipe = pipeline("text-generation", model=model_name)
```
|
3. **Prepare the Input:**

   Format your input as a chat-style message list; the pipeline applies the chat template and tokenizes it internally:
|
```python
input_text = "Your input text here"
system_message = "Your system message text here"

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": input_text},
]
```
|
4. **Generate the Output:**

   Generate a response using the model (a complete end-to-end example follows these steps):
|
```python
# Cap the length of the generated response
generated_text = pipe(messages, max_new_tokens=256, num_return_sequences=1)
print(generated_text)
```
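
Putting the steps together, here is a minimal end-to-end sketch. The system prompt below is illustrative only; the exact task instructions used during fine-tuning are described in the paper and dataset cards. The `torch_dtype` and `device_map` settings are optional memory savers (the latter requires the `accelerate` package):

```python
import torch
from transformers import pipeline

# bfloat16 roughly halves memory relative to float32; device_map="auto"
# places the model across available devices (requires `accelerate`).
pipe = pipeline(
    "text-generation",
    model="QCRI/LlamaLens",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Illustrative system prompt for a sentiment task -- an assumption,
# not necessarily the prompt used during training.
messages = [
    {
        "role": "system",
        "content": "You are an expert in sentiment analysis. Classify the "
        "following text as positive, negative, or neutral.",
    },
    {"role": "user", "content": "The match was thrilling from start to finish!"},
]

outputs = pipe(messages, max_new_tokens=64, num_return_sequences=1)
# With chat-style input, recent transformers versions return the full
# conversation; the last message is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])
```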
|
|
|
## Results |
|
|
|
Below, we present the performance of **LlamaLens** compared to the existing SOTA (where available) and the Llama-3.1-Instruct baseline. The **Δ** (Delta) column is calculated as **(LlamaLens − SOTA)**.
|
|
|
--- |
|
|
|
## Arabic |
|
|
|
| **Task** | **Dataset** | **Metric** | **SOTA** | **Llama-Instruct** | **LlamaLens** | **Δ** (LlamaLens − SOTA) |
|
|------------------------|---------------------------|-----------:|--------:|--------------------:|--------------:|------------------------------:| |
|
| News Summarization | xlsum | R-2 | 0.137 | 0.034 | 0.075 | -0.062 | |
|
| News Genre | ASND | Ma-F1 | 0.770 | 0.587 | 0.938 | 0.168 | |
|
| News Genre | SANADAkhbarona | Acc | 0.940 | 0.784 | 0.922 | -0.018 | |
|
| News Genre | SANADAlArabiya | Acc | 0.974 | 0.893 | 0.986 | 0.012 | |
|
| News Genre | SANADAlkhaleej | Acc | 0.986 | 0.865 | 0.967 | -0.019 | |
|
| News Genre | UltimateDataset | Ma-F1 | 0.970 | 0.376 | 0.883 | -0.087 | |
|
| News Credibility | NewsCredibility | Acc | 0.899 | 0.455 | 0.494 | -0.405 | |
|
| Emotion | Emotional-Tone | W-F1 | 0.658 | 0.358 | 0.748 | 0.090 | |
|
| Emotion | NewsHeadline | Acc | 1.000 | 0.406 | 0.551 | -0.449 | |
|
| Sarcasm | ArSarcasm-v2 | F1_Pos | 0.584 | 0.477 | 0.307 | -0.277 | |
|
| Sentiment | ar_reviews_100k | F1_Pos | – | 0.343 | 0.665 | – |
|
| Sentiment | ArSAS | Acc | 0.920 | 0.603 | 0.795 | -0.125 | |
|
| Stance | stance | Ma-F1 | 0.767 | 0.608 | 0.936 | 0.169 | |
|
| Stance | Mawqif-Arabic-Stance | Ma-F1 | 0.789 | 0.764 | 0.867 | 0.078 | |
|
| Att.worthiness | CT22Attentionworthy | W-F1 | 0.412 | 0.158 | 0.544 | 0.132 | |
|
| Checkworthiness | CT24_T1 | F1_Pos | 0.569 | 0.404 | 0.877 | 0.308 | |
|
| Claim | CT22Claim | Acc | 0.703 | 0.581 | 0.778 | 0.075 | |
|
| Factuality | Arafacts | Mi-F1 | 0.850 | 0.210 | 0.534 | -0.316 | |
|
| Factuality | COVID19Factuality | W-F1 | 0.831 | 0.492 | 0.781 | -0.050 | |
|
| Propaganda | ArPro | Mi-F1 | 0.767 | 0.597 | 0.762 | -0.005 | |
|
| Cyberbullying | ArCyc_CB | Acc | 0.863 | 0.766 | 0.753 | -0.110 | |
|
| Harmfulness | CT22Harmful | F1_Pos | 0.557 | 0.507 | 0.508 | -0.049 | |
|
| Hate Speech | annotated-hatetweets-4 | W-F1 | 0.630 | 0.257 | 0.549 | -0.081 | |
|
| Hate Speech | OSACT4SubtaskB | Mi-F1 | 0.950 | 0.819 | 0.802 | -0.148 | |
|
| Offensive | ArCyc_OFF | Ma-F1 | 0.878 | 0.489 | 0.652 | -0.226 | |
|
| Offensive | OSACT4SubtaskA | Ma-F1 | 0.905 | 0.782 | 0.899 | -0.006 | |
|
|
|
--- |
|
|
|
## English |
|
|
|
| **Task** | **Dataset** | **Metric** | **SOTA** | **Llama-Instruct** | **LlamaLens** | **Δ** (LlamaLens − SOTA) |
|
|----------------------|---------------------------|-----------:|--------:|--------------------:|--------------:|------------------------------:| |
|
| News Summarization | xlsum | R-2 | 0.152 | 0.074 | 0.141 | -0.011 | |
|
| News Genre | CNN_News_Articles | Acc | 0.940 | 0.644 | 0.915 | -0.025 | |
|
| News Genre | News_Category | Ma-F1 | 0.769 | 0.970 | 0.505 | -0.264 | |
|
| News Genre | SemEval23T3-ST1 | Mi-F1 | 0.815 | 0.687 | 0.241 | -0.574 | |
|
| Subjectivity | CT24_T2 | Ma-F1 | 0.744 | 0.535 | 0.508 | -0.236 | |
|
| Emotion | emotion | Ma-F1 | 0.790 | 0.353 | 0.878 | 0.088 | |
|
| Sarcasm | News-Headlines | Acc | 0.897 | 0.668 | 0.956 | 0.059 | |
|
| Sentiment | NewsMTSC | Ma-F1 | 0.817 | 0.628 | 0.627 | -0.190 | |
|
| Checkworthiness | CT24_T1 | F1_Pos | 0.753 | 0.404 | 0.877 | 0.124 | |
|
| Claim | claim-detection | Mi-F1 | – | 0.545 | 0.915 | – |
|
| Factuality | News_dataset | Acc | 0.920 | 0.654 | 0.946 | 0.026 | |
|
| Factuality | Politifact | W-F1 | 0.490 | 0.121 | 0.290 | -0.200 | |
|
| Propaganda | QProp | Ma-F1 | 0.667 | 0.759 | 0.851 | 0.184 | |
|
| Cyberbullying | Cyberbullying | Acc | 0.907 | 0.175 | 0.847 | -0.060 | |
|
| Offensive | Offensive_Hateful | Mi-F1 | – | 0.692 | 0.805 | – |
|
| Offensive | offensive_language | Mi-F1 | 0.994 | 0.646 | 0.884 | -0.110 | |
|
| Offensive & Hate | hate-offensive-speech | Acc | 0.945 | 0.602 | 0.924 | -0.021 | |
|
|
|
--- |
|
|
|
## Hindi |
|
|
|
| **Task** | **Dataset** | **Metric** | **SOTA** | **Llama-Instruct** | **LlamaLens** | **Δ** (LlamaLens − SOTA) |
|
|------------------------|------------------------|-----------:|--------:|--------------------:|--------------:|------------------------------:| |
|
| NLI | NLI_dataset | W-F1 | 0.646 | 0.633 | 0.655 | 0.009 | |
|
| News Summarization | xlsum | R-2 | 0.136 | 0.078 | 0.117 | -0.019 | |
|
| Sentiment | Sentiment Analysis | Acc | 0.697 | 0.552 | 0.669 | -0.028 | |
|
| Factuality | fake-news | Mi-F1 | – | 0.759 | 0.713 | – |
|
| Hate Speech | hate-speech-detection | Mi-F1 | 0.639 | 0.750 | 0.994 | 0.355 | |
|
| Hate Speech | Hindi-Hostility | W-F1 | 0.841 | 0.469 | 0.720 | -0.121 | |
|
| Offensive | Offensive Speech | Mi-F1 | 0.723 | 0.621 | 0.847 | 0.124 | |
|
| Cyberbullying | MC_Hinglish1 | Acc | 0.609 | 0.233 | 0.587 | -0.022 | |
|
|
|
## Paper |
|
For an in-depth understanding, refer to our paper: [**LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content**](https://arxiv.org/pdf/2410.15308). |
|
|
|
|
|
|
|
|
|
## License
|
This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). |
|
|
|
|
|
## Citation
|
Please cite [our paper](https://arxiv.org/pdf/2410.15308) when using this model: |
|
|
|
``` |
|
@article{kmainasi2024llamalensspecializedmultilingualllm, |
|
title={LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content}, |
|
author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Maram Hasanain and Sahinur Rahman Laskar and Naeemul Hassan and Firoj Alam}, |
|
year={2024}, |
|
journal={arXiv preprint arXiv:2410.15308}, |
|
|
url={https://arxiv.org/abs/2410.15308}, |
|
eprint={2410.15308}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|