---
license: gemma
datasets:
- damerajee/clean_hin_vqa
language:
- en
- hi
inference: false
library_name: transformers
pipeline_tag: visual-question-answering
tags:
- visual-question-answering
- Bilingual
---

# ViLaH

ViLaH (Vision Language Hindi) is a 3-billion-parameter model fine-tuned from the base model google/paligemma-3b-pt-224 to handle input images and bilingual (Hindi and English) text sequences for both input and output.

## Training Details
- Model Configuration: Fine-tuned for a single epoch on 2 T4 GPUs with a Distributed Data Parallel (DDP) setup (an illustrative launch sketch follows this list).
- Training Duration: Approximately one day.
- Evaluation Loss: Achieved an eval loss of 1.6384 at the end of the epoch.
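
The exact training script is not published in this card, but the sketch below shows how a comparable run could be set up with the Hugging Face `Trainer`; when the script is launched with `torchrun --nproc_per_node=2`, `Trainer` handles the DDP setup automatically. All values other than the single epoch are assumptions for illustration.

```python
# Illustrative sketch only; the actual ViLaH training script and
# hyperparameters are not published in this card.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vilah-finetune",     # hypothetical output directory
    num_train_epochs=1,              # single epoch, as reported above
    per_device_train_batch_size=1,   # assumed: small batches to fit T4 memory
    gradient_accumulation_steps=8,   # assumed
    fp16=True,                       # mixed precision on T4
    logging_steps=100,
    # Launching with `torchrun --nproc_per_node=2 train.py` makes Trainer
    # wrap the model in DistributedDataParallel across the two GPUs.
)
```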

## Dataset
The model was fine-tuned on a single dataset:
- [damerajee/clean_hin_vqa](https://huggingface.co/datasets/damerajee/clean_hin_vqa): This dataset was derived from Lin-Chen/ShareGPT4V and filtered to include only images from the COCO dataset. The original dataset was translated and cleaned to ensure high-quality Hindi visual question answering content.
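
As a quick way to inspect the data before running the model, the snippet below loads the dataset and prints one example's question; the field names (`question`, `image`) follow the usage code further down.

```python
from datasets import load_dataset

# Load the VQA dataset and peek at its size, columns, and one question
ds = load_dataset("damerajee/clean_hin_vqa", split="train")
print(ds)                  # row count and column names
print(ds[0]["question"])   # Hindi question paired with the first image
```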

## How to Use

```bash
pip install peft trl datasets accelerate bitsandbytes
pip install transformers --upgrade
```

### Run the model on a single T4 GPU in float16

```python
import torch
from datasets import load_dataset
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor

# Load a sample image and question from the fine-tuning dataset
dataset = load_dataset("damerajee/clean_hin_vqa", split="train")
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example["question"]

device_index = torch.cuda.current_device()
print("device_index:", device_index)

# Load the model in float16 on the current GPU
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/ViLaH",
    device_map={"": device_index},
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200

# Autoregressively generate with greedy decoding; for other decoding
# strategies see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH)

# Turn the predicted token IDs back into a string, chopping off the prompt,
# which consists of the image tokens and our text prompt
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
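
If you prefer not to count image and text tokens manually, an equivalent way to strip the prompt is to slice off the processed input length, since `generate` returns the prompt tokens followed by the newly generated ones:

```python
# Alternative prompt stripping: drop exactly as many tokens as were fed in
prompt_len = inputs["input_ids"].shape[1]
answer = processor.batch_decode(
    generated_ids[:, prompt_len:], skip_special_tokens=True
)[0]
print(answer)
```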

### Run the model on a single T4 GPU in 4-bit

```python
import torch
from datasets import load_dataset
from transformers import (
    PaliGemmaForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)

# Load a sample image and question from the fine-tuning dataset
dataset = load_dataset("damerajee/clean_hin_vqa", split="train")
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example["question"]

device_index = torch.cuda.current_device()
print("device_index:", device_index)

# Load the model with 4-bit quantization on the current GPU
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/ViLaH",
    device_map={"": device_index},
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

inputs = processor(text=text, images=test_image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200

# Autoregressively generate with greedy decoding; for other decoding
# strategies see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH)

# Turn the predicted token IDs back into a string, chopping off the prompt,
# which consists of the image tokens and our text prompt
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
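
As a quick sanity check (not an official benchmark), you can compare how much memory the 4-bit load occupies against the float16 load using the standard `transformers` utility:

```python
# Approximate memory taken by the model weights, in GiB
print(f"{base_model.get_memory_footprint() / 1024**3:.2f} GiB")
```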

## Usage and limitations

### Intended usage
Open Vision Language Models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
- The model can be further fine-tuned on larger or higher-quality datasets, or on your own custom dataset (a minimal fine-tuning sketch follows this list).
- The model can be used in apps to provide real-time visual and text-based assistance in Hindi and English.
- The model can be a tool for researchers to develop new vision-language technologies and applications.
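
For the fine-tuning use case above, the sketch below shows one possible LoRA setup with `peft`; the rank, scaling, and target modules are assumptions for illustration, not the configuration actually used to train ViLaH.

```python
# Minimal LoRA fine-tuning sketch (illustrative; not ViLaH's actual recipe)
import torch
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/ViLaH", torch_dtype=torch.float16
)
lora_config = LoraConfig(
    r=8,            # assumed rank
    lora_alpha=16,  # assumed scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```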

### Ethical considerations and risks
The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:
- Bias and Fairness
  - VLMs trained on large-scale, real-world image-text data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with input data pre-processing and subsequent evaluations reported in this card.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use of the model; see the Responsible Generative AI Toolkit.
- Transparency and Accountability
  - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.
Risks identified and mitigations:
- Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the Gemma Prohibited Use Policy.
- Privacy violations: Models were trained on data filtered to remove certain personal information and sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

### Limitations
Most limitations inherited from the underlying Gemma model still apply:
- VLMs are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
- Natural language is inherently complex. VLMs might struggle to grasp subtle nuances, sarcasm, or figurative language.
- VLMs generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- VLMs rely on statistical patterns in language and images. They might lack the ability to apply common sense reasoning in certain situations.