Disaster-Twitter-XLM-RoBERTa-AL

This is a multilingual Twitter-XLM-RoBERTa-base model fine-tuned to identify disaster-related tweets. It was trained in two steps. First, we fine-tuned the model with 179,391 labelled tweets from CrisisLex in English, Spanish, German, French and Italian. Subsequently, we fine-tuned it further with data from the 2021 Ahr Valley flood in Germany and the 2023 Chile forest fires, selected with a greedy coreset active learning approach (sketched below).
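For intuition, greedy coreset selection repeatedly picks the unlabelled tweet whose embedding lies farthest from everything selected so far, so that the labelled set covers the embedding space. The following is a minimal, illustrative sketch of that k-center greedy step; the function, its inputs and the embedding source are assumptions for illustration, not the actual training code:

import numpy as np

def greedy_coreset(embeddings: np.ndarray, labelled: list[int], budget: int) -> list[int]:
    """Illustrative k-center greedy selection: pick `budget` points that
    best cover the embedding space, one farthest point at a time.
    Assumes a non-empty seed set `labelled`."""
    # distance of every point to its nearest already-labelled point
    dists = np.min(
        np.linalg.norm(embeddings[:, None, :] - embeddings[labelled][None, :, :], axis=-1),
        axis=1,
    )
    picks = []
    for _ in range(budget):
        i = int(np.argmax(dists))  # farthest point from the current set
        picks.append(i)
        # picking i can shrink other points' distance to their nearest centre
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[i], axis=-1))
    return picks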

Labels

The model classifies short texts with one of the following two labels:

  • LABEL_0: NOT disaster-related
  • LABEL_1: Disaster-related
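The ids are generic, so downstream code often maps them to readable names first. A one-line mapping is enough (the names below are a descriptive choice, not part of the model config):

ID2NAME = {'LABEL_0': 'not disaster-related', 'LABEL_1': 'disaster-related'}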

Example Pipeline

from transformers import pipeline
MODEL_NAME = 'hannybal/disaster-twitter-xlm-roberta-al'
classifier = pipeline('text-classification', model=MODEL_NAME, tokenizer='cardiffnlp/twitter-xlm-roberta-base')
classifier('I can see fire and smoke from the nearby fire!')

Output:

[{'label': 'LABEL_0', 'score': 0.9967854022979736}]
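The pipeline also accepts a list of tweets, which is handy for scoring many messages at once (a brief sketch reusing the classifier defined above; the example tweets are invented):

tweets = [
    'Severe flooding on our street, cars are under water #flood',
    'Looking forward to the weekend!',
]
for tweet, result in zip(tweets, classifier(tweets)):
    # each result is a dict such as {'label': 'LABEL_1', 'score': 0.99}
    print(f"{result['label']} ({result['score']:.3f}): {tweet}")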

Full Classification Example

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

def preprocess(text: str) -> str:
    """Pre-process texts by replacing usernames and links with placeholders."""
    new_text: list[str] = []
    for t in text.split(" "):
        if t.startswith('@') and len(t) > 1:
            t = '@user'
        elif t.startswith('http'):
            t = 'http'
        new_text.append(t)
    return " ".join(new_text)

MODEL_NAME = 'hannybal/disaster-twitter-xlm-roberta-al'

tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-xlm-roberta-base')
config = AutoConfig.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# example classification: a German tweet from the 2021 Ahr Valley flood
# ("This is all that's left of my basement... #flood #ahr")
text = "Das ist alles, was von meinem Keller noch übrig ist... #flood #ahr @ Bad Neuenahr-Ahrweiler https://t.co/C68fBaKZWR"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output.logits[0].detach().numpy()  # raw logits for the single input
scores = softmax(scores)                    # convert logits to probabilities

# print labels and their respective scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

Output:

1) LABEL_1 0.9999
2) LABEL_0 0.0001
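If only the top prediction is needed, the argmax of the probabilities combined with config.id2label is sufficient (reusing the objects defined above):

predicted = config.id2label[int(np.argmax(scores))]
print(predicted)  # LABEL_1 for the flood tweet above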

Reference

@inproceedings{Hanny.2024a,
  title = {Active Learning for Identifying Disaster-Related Tweets: A Comparison with Keyword Filtering and Generic Fine-Tuning},
  shorttitle = {Active Learning for Identifying Disaster-Related Tweets},
  booktitle = {Intelligent Systems and Applications},
  author = {Hanny, David and Schmidt, Sebastian and Resch, Bernd},
  editor = {Arai, Kohei},
  year = {2024},
  pages = {126--142},
  publisher = {Springer Nature Switzerland},
  address = {Cham},
  doi = {10.1007/978-3-031-66428-1_8},
  isbn = {978-3-031-66428-1},
  langid = {english}
}

Acknowledgements

This work has received funding from the European Commission - European Union under HORIZON EUROPE (HORIZON Research and Innovation Actions) as part of the TEMA project (grant agreement 101093003; HORIZON-CL4-2022-DATA-01-01). It has also received funding from the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK) through the project GeoSHARING (grant number 878652).
