Model Details

Arabic CLIP is an adaptation of Contrastive Language-Image Pre-training (CLIP) for the Arabic language. CLIP is a model developed by OpenAI that learns visual concepts from images and relates them to textual descriptions. This work aims to improve the model's understanding and interpretation of visual information in the context of the Arabic language.

Model Use


from transformers import AutoTokenizer, FlaxVisionTextDualEncoderModel

# Load the PyTorch checkpoint, convert it to Flax (from_pt=True), and save it locally.
model = FlaxVisionTextDualEncoderModel.from_pretrained(
    "LinaAlhuri/Arabic-clip-vit-base-patch32", logit_scale_init_value=1, from_pt=True
)
model.save_pretrained("arabic_clip")

# Load the Arabic BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic", use_fast=True)
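Once image and text embeddings are available, a CLIP-style model matches them by cosine similarity. The sketch below illustrates only that matching step in plain Python. The toy vectors and the helper names (`l2_normalize`, `cosine_similarity`, `rank_captions`) are hypothetical and are not part of this repository; in practice the vectors would come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_captions(image_embedding, caption_embeddings):
    """Return (caption_index, score) pairs sorted from best to worst match."""
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
image_vec = [0.9, 0.1, 0.0]
caption_vecs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
ranking = rank_captions(image_vec, caption_vecs)
best_index, best_score = ranking[0]
```

Here `ranking` lists caption indices from the most to the least similar; the same ranking step underlies both the zero-shot and the retrieval evaluations reported later.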

Data

Because Arabic image-text resources are scarce, the aim was to build a comprehensive Arabic image-text dataset by combining several data sources. The main challenges were the limited amount of native Arabic data and the quality of translated datasets. The approach merged genuine Arabic datasets, which carry rich information, with translated datasets, which cover diverse domains, scenarios, and objects, to balance their respective strengths and weaknesses.

| Dataset name               | Images    |
|----------------------------|-----------|
| Arabic Conceptual Captions | 1,427,210 |
| Arabic COCO 2014           | 414,113   |
| Arabic WIT                 | 109,366   |
| Arabic Flickr8K            | 24,272    |
| Proposed (WAP) dataset     | 151,252   |
| Total                      | 2,126,213 |
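To see how the sources are balanced, the counts in the table can be turned into shares of the total. The snippet below is a minimal sketch that uses only the figures reported above; the variable names are illustrative.

```python
# Sample counts as reported in the dataset table.
dataset_sizes = {
    "Arabic Conceptual Captions": 1_427_210,
    "Arabic COCO 2014": 414_113,
    "Arabic WIT": 109_366,
    "Arabic Flickr8K": 24_272,
    "Proposed (WAP) dataset": 151_252,
}

total = sum(dataset_sizes.values())

# Percentage of the combined dataset contributed by each source.
shares = {name: 100.0 * size / total for name, size in dataset_sizes.items()}

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.1f}%")
```

The computed total matches the reported 2,126,213, and Arabic Conceptual Captions is by far the largest source.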

Performance and Limitations

We evaluated Arabic CLIP on benchmarks covering tasks such as zero-shot learning, image retrieval, localization, and image search. The benchmarks are:

  • Conceptual Captions
  • COCO
  • ImageNet
  • Unsplash

Zero-shot Learning

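| Model                    | Prompt            | Top 1 | Top 5  | Top 10 | Top 100 |
|--------------------------|-------------------|-------|--------|--------|---------|
| Multilingual CLIP        | Short translation | 10.10 | 21.99  | 26.70  | 47.57   |
| Multilingual CLIP        | Long translation  | 9.518 | 20.942 | 25.54  | 45.59   |
| Arabic Baseline Patch 32 | Short translation | 17.58 | 37.15  | 45.60  | 73.02   |
| Arabic Baseline Patch 32 | Long translation  | 16.94 | 37.12  | 45.44  | 72.94   |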
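The Top 1 to Top 100 columns read as top-k accuracies: a prediction counts as correct when the true class is among the k highest-scoring classes. The sketch below shows one common way to compute this metric, assuming per-class similarity scores are already available for each image. The function name and the toy scores are hypothetical; this is not the authors' evaluation script.

```python
def top_k_accuracy(scores, true_labels, k):
    """Percentage of samples whose true label is among the k best-scoring classes.

    scores: one list of per-class scores for each sample.
    true_labels: the correct class index for each sample.
    """
    if not true_labels or len(scores) != len(true_labels):
        raise ValueError("scores and true_labels must be non-empty and equally long")
    hits = 0
    for class_scores, label in zip(scores, true_labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(
            range(len(class_scores)), key=lambda c: class_scores[c], reverse=True
        )[:k]
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(true_labels)


# Toy example: 4 images, 3 classes (hypothetical scores).
toy_scores = [
    [0.1, 0.7, 0.2],  # true class 1 is ranked first
    [0.5, 0.3, 0.2],  # true class 1 is ranked second
    [0.2, 0.3, 0.5],  # true class 0 is ranked third
    [0.6, 0.3, 0.1],  # true class 0 is ranked first
]
toy_labels = [1, 1, 0, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
```

Accuracy can only grow as k increases, which matches the pattern in every row of the table.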

Image Retrieval

Conceptual Captions Evaluation

| Metric | MCLIP | Baseline Patch 32 |
|--------|-------|-------------------|
| MRR@1  | 0.064 | 0.165             |
| MRR@5  | 0.093 | 0.231             |
| MRR@10 | 0.100 | 0.244             |

COCO Evaluation

| Metric | MCLIP | Baseline Patch 32 |
|--------|-------|-------------------|
| MRR@1  | 0.043 | 0.082             |
| MRR@5  | 0.068 | 0.127             |
| MRR@10 | 0.074 | 0.138             |
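MRR@k (mean reciprocal rank at cutoff k) averages 1/rank of the correct item over all queries, counting a query as 0 when the correct item is not in the top k. The sketch below uses this standard definition and assumes one relevant item per query. The function name and toy data are hypothetical, and the exact protocol behind the reported numbers may differ.

```python
def mrr_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Mean reciprocal rank with cutoff k, assuming one relevant item per query."""
    if not relevant_id_per_query or len(ranked_ids_per_query) != len(relevant_id_per_query):
        raise ValueError("inputs must be non-empty and equally long")
    total = 0.0
    for ranked_ids, relevant_id in zip(ranked_ids_per_query, relevant_id_per_query):
        # Only the first k retrieved items can contribute to the score.
        for position, candidate in enumerate(ranked_ids[:k], start=1):
            if candidate == relevant_id:
                total += 1.0 / position
                break
    return total / len(relevant_id_per_query)


# Toy example: 3 queries whose relevant item "a" is retrieved at rank 1, 2, and 4.
toy_rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["b", "c", "d", "a"],
]
toy_relevant = ["a", "a", "a"]

mrr1 = mrr_at_k(toy_rankings, toy_relevant, k=1)
mrr3 = mrr_at_k(toy_rankings, toy_relevant, k=3)
mrr5 = mrr_at_k(toy_rankings, toy_relevant, k=5)
```

Under this definition MRR@1 equals the fraction of queries whose correct item is ranked first, and the score can only increase with k.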

Limitations

The main limitations are:

  • Arabic CLIP struggles to count beyond 3.
  • Genuine (non-translated) Arabic samples are limited.
  • Arabic CLIP may inherit noise and biases from its training data, because no studies have yet addressed these issues in the published Arabic datasets or Arabic language models.

Bias

Regarding gender bias, Arabic uses a two-gender system in which every noun is classified as masculine or feminine, whereas English does not. Translating text from English to Arabic may therefore lose information or introduce gender bias.
