Model Card
SNIFFER is a multimodal large language model specifically engineered for out-of-context (OOC) misinformation detection and explanation. It is built via two-stage instruction tuning on InstructBLIP, consisting of news-domain alignment followed by task-specific tuning.
The model is composed of three parts: 1) internal checking, which analyzes the consistency between the image and the text content; 2) external checking, which analyzes the relevance between the context of the retrieved image and the provided text; and 3) composed reasoning, which combines the two-pronged analysis to reach a final judgment and explanation.
The checkpoint provided here is for the internal checking part.
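The snippet below is a minimal sketch of how this internal-checking checkpoint could be queried through the Hugging Face InstructBLIP classes. The backbone identifier, checkpoint filename, and prompt wording are illustrative assumptions, not the official setup; refer to the repository linked under Model Sources for the exact inference code and prompt templates.

```python
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load an InstructBLIP backbone from the Hugging Face hub. SNIFFER is tuned from
# InstructBLIP, but the exact backbone variant assumed here is a placeholder.
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")
model.to(device)
model.eval()

# Hypothetical: overlay this repository's internal-checking weights on the backbone.
# The filename is a placeholder; see the official repository for the actual layout.
state_dict = torch.load("sniffer_internal_checking.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=False)

# Internal checking: ask whether the image is consistent with the news text.
# The instruction wording below is illustrative, not SNIFFER's official prompt.
image = Image.open("news_image.jpg").convert("RGB")
caption = "Flood waters surround homes on the Florida coast, September 2022."
prompt = (
    f"News text: {caption}\n"
    "Question: Is the image consistent with the news text? "
    "Answer yes or no and explain your reasoning."
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```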
Model Sources
- Paper: https://arxiv.org/abs/2403.03170 (to appear in CVPR 2024)
- Project: https://pengqi.site/Sniffer/
- Repository: https://github.com/MischaQI/Sniffer
Results
Dataset: NewsCLIPpings (accuracy, %)

| Model | All | Fake | Real |
|---|---|---|---|
| SAFE | 52.8 | 54.8 | 52.0 |
| EANN | 58.1 | 61.8 | 56.2 |
| VisualBERT | 58.6 | 38.9 | 78.4 |
| CLIP | 66.0 | 64.3 | 67.7 |
| DT-Transformer | 77.1 | 78.6 | 75.6 |
| CCN | 84.7 | 84.8 | 84.5 |
| Neu-Sym detector | 68.2 | - | - |
| SNIFFER (ours) | 88.4 | 86.9 | 91.8 |
Citation
@inproceedings{qi2024sniffer,
  author    = {Qi, Peng and Yan, Zehong and Hsu, Wynne and Lee, Mong Li},
  title     = {SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2024}
}