---
language:
- th
- en
metrics:
- sacrebleu
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering
---
# Pathumma-llm-vision-1.0.0
## Model Overview
Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. It contains 8 billion parameters and leverages both image and text processing to understand and generate multi-modal content.
- **Model Name**: Pathumma-llm-vision-1.0.0
- **Base Model**: HuggingFaceM4/Idefics3-8B-Llama3
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 8 Billion
- **Organization**: NECTEC
- **License**: [Specify License]
## Intended Use
- **Primary Use Cases**:
- Visual Question Answering (VQA)
- Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.
## Model Description
Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integrating both visual and textual information. The model is fine-tuned with diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.
## Training Data
The model was fine-tuned on several datasets:
- **Thai Image Caption**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: GPT-4o-mini outputs translated into Thai.
- **Small-Thai-Wikipedia-location**: Articles in Thai from Wikipedia about geographic locations.
- **Synthetic Data**: Additional synthetic data generated to increase dataset diversity.
### Dataset Size
- **Training Dataset Size**: 112,768 examples
- **Validation Dataset Size**: 9,036 examples
## Training Details
- **Hardware Used**:
- **HPC Cluster**: Lanta
- **Number of Nodes**: 16 Nodes
- **GPUs per Node**: 4 GPUs
- **Total GPUs Used**: 64 GPUs
- **Fine-tuning Duration**: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)
## Evaluation Results
| Type | Encoder | Decoder | IPU24-dataset <br>(test) <br>(Sentence SacreBLEU) |
|----------------------------------------|------------------------------------|-------------------------------------|-------------------------------|
| Idefics3-8B-Llama3                      | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct          | 0.02657                       |
| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 13.45412 |
| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | **17.66370** |
| llama-3-typhoon-v1.5-8b-vision-preview | siglip-so400m-patch14-384 | Llama-3-Typhoon-1.5-8B-instruct | 8.288626 |
**Note**: Models that were not fine-tuned on the IPU24 dataset may be at a disadvantage on this benchmark, so their scores may understate their general performance.
- **Accuracy on VQA tasks (private test set)**: 30.34%
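The evaluation script itself is not released here; the sketch below shows one way such sentence-level SacreBLEU numbers could be computed with the `sacrebleu` package, using made-up hypothesis/reference captions rather than the IPU24 data (the tokenizer settings used for Thai are not specified on this card).
```python
# Minimal sketch: average sentence-level SacreBLEU over a set of captions.
# `hypotheses` and `references` are hypothetical placeholders, not the IPU24 test set.
import sacrebleu

hypotheses = ["a pygmy hippo calf stands next to its mother"]             # model outputs
references = [["a baby pygmy hippo standing beside its bathing mother"]]  # >=1 reference per output

scores = [
    sacrebleu.sentence_bleu(hyp, refs).score  # 0-100 scale
    for hyp, refs in zip(hypotheses, references)
]
print(sum(scores) / len(scores))
```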
## Required Libraries
Before you start, ensure you have the following libraries installed:
```bash
pip install git+https://github.com/andimarafioti/transformers.git@idefics3
```
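If the install succeeded, the Idefics3 classes should be importable. A quick, optional sanity check (the printed version will depend on which branch or release you installed):
```python
# Optional sanity check: the installed transformers build must expose the Idefics3 classes.
import transformers
from transformers import Idefics3ForConditionalGeneration, AutoProcessor  # ImportError here means the build lacks Idefics3 support

print(transformers.__version__)
```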
## Usage
We provide an [inference tutorial](https://colab.research.google.com/drive/1TakNg4v6hHFXLih-SFcibxzYBTs2-EFn?usp=sharing).
To use the model with the Hugging Face `transformers` library:
```python
import io
import time

import requests
from PIL import Image

import torch
from transformers import (
    Idefics3ForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
```
```python
# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(DEVICE)
if DEVICE == "cuda":
    print(torch.cuda.device_count())

N = 5  # scaling factor for the optional image-size settings below
revision = "quantized8bit"

processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    do_image_splitting=False,
    # size={"longest_edge": N*364},            # Optional
    # size={"height": N*364, "width": N*364},  # Optional
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    torch_dtype=torch.float16,
    device_map=DEVICE,
)
print(processor.image_processor.size)
url_path = None  # set to an image URL to read from the web instead of a local file
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

question = "รายละเอียดของรูปภาพนี้"  # "Describe the details of this image."
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "You are a helpful assistant."},
{"type": "image"},
{"type": "text", "text": question}
]
}
]
text = processor.apply_chat_template(
messages,
add_generation_prompt=True,
)
encoding = processor(
images=image,
text=text.strip(),
# padding='max_length',
# truncation=True,
# max_length=,
return_tensors="pt"
)
encoding = {k: v.to(DEVICE) for k, v in encoding.items()}

# Run inference on the image + question prompt.
start_time = time.time()
model.eval()
with torch.inference_mode():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=128,
        # temperature=.5,
        # repetition_penalty=1.,
        # top_k=1,
        # top_p=1.,
    )
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

# Latency of the single generation call.
latency_time = end_time - start_time

# Keep only the assistant's answer from the decoded conversation.
answer_prompt = generated_text.split('Assistant:')[1].strip()
print(answer_prompt)
print(f"latency_time: {latency_time:.3f} sec.")
# >>> output:
# >>> ลูกฮิปโปแคระกำลังยืนอยู่ข้างแม่ฮิปโปแคระที่กำลังอาบน้ำ
# >>> ("A pygmy hippo calf is standing next to its mother, who is bathing.")
# >>> latency_time: 7.642 sec.
```
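The example above imports `BitsAndBytesConfig` but never uses it. If you would rather quantize on the fly than rely on the `quantized8bit` revision, here is a minimal sketch, assuming `bitsandbytes` is installed and a CUDA GPU is available (this is not the card's official recipe):
```python
# Sketch: load the checkpoint with 4-bit NF4 quantization via bitsandbytes.
import torch
from transformers import Idefics3ForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained("nectec/Pathumma-llm-vision-1.0.0", do_image_splitting=False)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    quantization_config=bnb_config,
    device_map="auto",
)
```
The rest of the inference code works unchanged; `device_map="auto"` places the quantized weights on the available GPU.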
## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.
## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.
## Citation
If you use this model, please cite it as follows:
```bibtex
@misc{PathummaVision,
author = {Thirawarit Pitiphiphat and NECTEC Team},
title = {nectec/Pathumma-llm-vision-1.0.0},
year = {2024},
url = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```
```bibtex
@misc{laurençon2024building,
title={Building and better understanding vision-language models: insights and future directions.},
author={Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon},
year={2024},
eprint={2408.12637},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
## **Contributors**
**LLM Team**
Pakawat Phasook ([email protected])<br>
Jessada Pranee ([email protected])<br>
Arnon Saeoung ([email protected])<br>
Kun Kerdthaisong ([email protected])<br>
Kittisak Sukhantharat ([email protected])<br>
Chaianun Damrongrat ([email protected])<br>
Sarawoot Kongyoung ([email protected])
**Audio Team**
Pattara Tipaksorn ([email protected])<br>
Wayupuk Sommuang ([email protected])<br>
Oatsada Chatthong ([email protected])<br>
Kwanchiva Thangthai ([email protected])
**Vision Team**
Thirawarit Pitiphiphat ([email protected])<br>
Peerapas Ngokpon ([email protected])<br>
Theerasit Issaranon ([email protected])
## Contact
For questions or support, please join our Discord: **https://discord.gg/3WJwJjZt7r**.