---
language:
- th
- en
metrics:
- sacrebleu
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering
---

# Pathumma-llm-vision-1.0.0

## Model Overview
Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning. It has 8 billion parameters and processes both images and text to understand and generate multi-modal content.

- **Model Name**: Pathumma-llm-vision-1.0.0
- **Base Model**: HuggingFaceM4/Idefics3-8B-Llama3
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 8 billion
- **Organization**: NECTEC
- **License**: [Specify License]

## Intended Use
- **Primary Use Cases**:
  - Visual Question Answering (VQA)
  - Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.

## Model Description
Pathumma-llm-vision-1.0.0 performs multi-modal tasks by combining visual and textual information. It was fine-tuned on diverse datasets to improve its ability to understand image and text inputs and to generate content consistent with both.

## Training Data
The model was fine-tuned on several datasets:
- **Thai Image Caption**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: Data translated from GPT-4o-mini outputs into Thai.
- **Small-Thai-Wikipedia-location**: Articles in Thai from Wikipedia about geographic locations.
- **Synthetic Data**: Additional synthetic data generated to increase dataset diversity.

### Dataset Size
- **Training Dataset Size**: 112,768 examples
- **Validation Dataset Size**: 9,036 examples
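
As a quick sanity check on the split, the two counts above can be combined to get the total number of examples and the validation share. The snippet below is only arithmetic on the listed figures; the variable names are illustrative and not taken from the training code.

```python
# Counts taken from the Dataset Size list above.
train_examples = 112_768
val_examples = 9_036

total_examples = train_examples + val_examples
val_fraction = val_examples / total_examples

print(f"total examples: {total_examples:,}")    # total examples: 121,804
print(f"validation share: {val_fraction:.2%}")  # validation share: 7.42%
```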

## Training Details
- **Hardware Used**:
  - **HPC Cluster**: Lanta
  - **Number of Nodes**: 16 Nodes
  - **GPUs per Node**: 4 GPUs
  - **Total GPUs Used**: 64 GPUs
- **Fine-tuning Duration**: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)
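
The hardware figures above are consistent with each other (16 nodes × 4 GPUs = 64 GPUs), and they allow a rough estimate of the compute budget. The sketch below derives GPU-hours from the listed numbers. This is a derived figure, not one reported by the authors, and it assumes all 64 GPUs were in use for the full wall-clock duration.

```python
from datetime import timedelta

# Figures taken from the Training Details list above.
nodes = 16
gpus_per_node = 4
duration = timedelta(hours=3, minutes=18, seconds=11)

total_gpus = nodes * gpus_per_node
# Assumption: every GPU is busy for the whole wall-clock duration.
gpu_hours = total_gpus * duration.total_seconds() / 3600

print(total_gpus)                    # 64
print(f"{gpu_hours:.1f} GPU-hours")  # 211.4 GPU-hours
```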

## Evaluation Results

| Type                                  | Encoder                            | Decoder                        | Sentence SacreBLEU <br>(test) | Unique Tokens |
|---------------------------------------|------------------------------------|--------------------------------|-------------------------------|---------------|
| Idefics3-8B-Llama3                    | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct     | 0.02657                       | 12990         |
| Pathumma-llm-vision-beta-0.0.0        | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct     | 13.45412                      | 1148          |
| Pathumma-llm-vision-1.0.0             | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct     | 17.66370                      | 1312          |

- **Accuracy on Manual-VQA Tasks**: 30.34%
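
To put the table in perspective, the improvement of the 1.0.0 release over the beta can be computed directly from the two SacreBLEU scores. The snippet below is plain arithmetic on the reported values. It does not recompute BLEU, and the dictionary layout is only illustrative.

```python
# Sentence SacreBLEU (test) scores copied from the table above.
scores = {
    "Pathumma-llm-vision-beta-0.0.0": 13.45412,
    "Pathumma-llm-vision-1.0.0": 17.66370,
}

beta = scores["Pathumma-llm-vision-beta-0.0.0"]
release = scores["Pathumma-llm-vision-1.0.0"]

absolute_gain = release - beta
relative_gain = absolute_gain / beta

print(f"absolute gain: {absolute_gain:.2f} BLEU points")  # absolute gain: 4.21 BLEU points
print(f"relative gain: {relative_gain:.1%}")              # relative gain: 31.3%
```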

## Required Libraries

Before you start, install the `transformers` fork that includes Idefics3 support:

```
pip install git+https://github.com/andimarafioti/transformers.git@idefics3
```
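
To confirm that the installed `transformers` build actually exposes the Idefics3 model class before loading any weights, a small check like the following may help. This is a minimal sketch: `module_has_attr` is a hypothetical helper (not part of `transformers`) and relies only on the standard-library `importlib`.

```python
import importlib


def module_has_attr(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
        return hasattr(module, attr)
    except Exception:
        # Broad on purpose: a missing or a broken install both count as "not available".
        return False


if __name__ == "__main__":
    ok = module_has_attr("transformers", "Idefics3ForConditionalGeneration")
    print("Idefics3 support available:", ok)
```

If the check prints `False`, reinstalling with the `pip` command above is the first thing to try.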

## Usage
We provide an [inference tutorial](https://colab.research.google.com/drive/1TakNg4v6hHFXLih-SFcibxzYBTs2-EFn?usp=sharing).
To use the model with the Hugging Face `transformers` library:

```python
import io
import os
import time
import random
import requests
import shutil
from IPython.display import display, Markdown
from IPython.display import clear_output as cls

import numpy as np
import pandas as pd
from PIL import Image

import torch

import transformers
from transformers import (
    Idefics3ForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
```

```python
# Pick the best available device: CUDA GPU first, then Apple MPS, then CPU.
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
else:
    DEVICE = "cpu"
print(DEVICE)
if DEVICE == "cuda":
    display(torch.cuda.device_count())

N = 5  # Multiplier used by the optional image-size overrides below

revision = "quantized8bit"
processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    do_image_splitting=False,
    # size={"longest_edge": N*364},  # Optional
    # size={"height": N*364, "width": N*364},  # Optional
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    torch_dtype=torch.float16,
    device_map=DEVICE,
)

print(processor.image_processor.size)

# Load the image from a URL if one is given, otherwise from a local file.
url_path = None
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

question = "รายละเอียดของรูปภาพนี้"  # "Details of this image"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }
]

# Build the prompt text from the chat messages.
text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# Tokenize the prompt and preprocess the image.
encoding = processor(
    images=image,
    text=text.strip(),
    # padding='max_length',
    # truncation=True,
    # max_length=,
    return_tensors="pt",
)

# Move all input tensors to the same device as the model.
encoding = {k: v.to(DEVICE) for k, v in encoding.items()}

# Run inference on the image and question, timing the generation step.
start_time = time.time()
model.eval()
with torch.inference_mode():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=128,
        # temperature=.5,
        # repetition_penalty=1.,
        # top_k=1.,
        # top_p=1,
    )
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

# Compute the latency.
latency_time = end_time - start_time

# Keep only the model's reply, which follows the "Assistant:" marker.
answer_prompt = generated_text.split('Assistant:')[1].strip()

# Output processing (depends on task requirements)
print(answer_prompt)
print(f"latency_time: {latency_time:.3f} sec.")

# >>> output:
# >>> ลูกฮิปโปแคระกำลังยืนอยู่ข้างแม่ฮิปโปแคระที่กำลังอาบน้ำ
# >>> (English: "A baby pygmy hippo is standing next to its mother, who is bathing.")
# >>> latency_time: 7.642 sec.
```
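
The example above extracts the reply with `split('Assistant:')[1]`, which raises `IndexError` if the decoded text does not contain the marker. A more defensive variant could look like the sketch below. `extract_answer` is a hypothetical helper, not part of `transformers`, and it assumes the decoded text uses the same `Assistant:` marker as in the example.

```python
def extract_answer(generated_text: str, marker: str = "Assistant:") -> str:
    """Return everything after the first `marker`, stripped of whitespace.

    Falls back to the whole stripped text when the marker is missing,
    instead of raising IndexError like `split(marker)[1]` would.
    """
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


print(extract_answer("User: What is in the image?\nAssistant: A small hippo. "))
# A small hippo.
```

Unlike `split(marker)[1]`, this keeps the full reply even if the reply itself happens to contain the marker text again.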

## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.

## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.

## Citation
If you use this model, please cite it as follows:

```bibtex
@misc{PathummaVision,
      author       = {Thirawarit Pitiphiphat and NECTEC Team},
      title        = {nectec/Pathumma-llm-vision-1.0.0},
      year         = {2024},
      url          = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```

```bibtex
@misc{laurençon2024building,
      title={Building and better understanding vision-language models: insights and future directions.},
      author={Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon},
      year={2024},
      eprint={2408.12637},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## **Contributor Contact**

**LLM Team**
Pakawat Phasook ([email protected])<br>
Jessada Pranee ([email protected])<br>
Arnon Saeoung ([email protected])<br>
Kun Kerdthaisong ([email protected])<br>
Kittisak Sukhantharat ([email protected])<br>
Chaianun Damrongrat ([email protected])<br>
Sarawoot Kongyoung ([email protected])

**Audio Team**
Pattara Tipaksorn ([email protected])<br>
Wayupuk Sommuang ([email protected])<br>
Oatsada Chatthong ([email protected])<br>
Kwanchiva Thangthai ([email protected])

**Vision Team**
Thirawarit Pitiphiphat ([email protected])<br>
Peerapas Ngokpon ([email protected])<br>
Theerasit Issaranon ([email protected])

## Contact
For questions or support, reach us on Discord: **https://discord.gg/3WJwJjZt7r**.