---
language:
- th
- en
metrics:
- sacrebleu
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering
---
# Pathumma-llm-vision-1.0.0
## Model Overview
Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. It contains 8 billion parameters and leverages both image and text processing to understand and generate multi-modal content.
- **Model Name**: Pathumma-llm-vision-1.0.0
- **Base Model**: HuggingFaceM4/Idefics3-8B-Llama3
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 8 Billion
- **Organization**: NECTEC
- **License**: [Specify License]
## Intended Use
- **Primary Use Cases**:
- Visual Question Answering (VQA)
- Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.
## Model Description
Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integrating both visual and textual information. The model is fine-tuned with diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.
## Training Data
The model was fine-tuned on several datasets:
- **Thai Image Caption**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: GPT-4o-mini outputs translated into Thai.
- **Small-Thai-Wikipedia-location**: Articles in Thai from Wikipedia about geographic locations.
- **Synthetic Data**: Additional synthetic data generated to increase dataset diversity.
### Dataset Size
- **Training Dataset Size**: 112,768 examples
- **Validation Dataset Size**: 9,036 examples
## Training Details
- **Hardware Used**:
- **HPC Cluster**: Lanta
- **Number of Nodes**: 16 Nodes
- **GPUs per Node**: 4 GPUs
- **Total GPUs Used**: 64 GPUs
- **Fine-tuning Duration**: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)
## Evaluation Results
| Type | Encoder | Decoder | IPU24-dataset <br>(test) <br>(Sentence SacreBLEU) |
|----------------------------------------|------------------------------------|-------------------------------------|-------------------------------|
| Idefics3-8B-Llama3                      | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct          | 0.02657                       |
| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 13.45412 |
| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | **17.66370** |
| llama-3-typhoon-v1.5-8b-vision-preview | siglip-so400m-patch14-384 | Llama-3-Typhoon-1.5-8B-instruct | 8.288626 |
**Note**: Models that were not fine-tuned on the IPU24 dataset may be at a disadvantage on this benchmark, so their scores may understate their general performance.
- **Accuracy on VQA tasks (private test set)**: 30.34%
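The evaluation script itself is not released here; the sketch below shows one way such sentence-level SacreBLEU numbers could be computed with the `sacrebleu` package, using made-up hypothesis/reference captions rather than the IPU24 data (the tokenizer settings used for Thai are not specified on this card).
```python
# Minimal sketch: average sentence-level SacreBLEU over a set of captions.
# `hypotheses` and `references` are hypothetical placeholders, not the IPU24 test set.
import sacrebleu

hypotheses = ["a pygmy hippo calf stands next to its mother"]             # model outputs
references = [["a baby pygmy hippo standing beside its bathing mother"]]  # >=1 reference per output

scores = [
    sacrebleu.sentence_bleu(hyp, refs).score  # 0-100 scale
    for hyp, refs in zip(hypotheses, references)
]
print(sum(scores) / len(scores))
```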
## Required Libraries
Before you start, ensure you have the following libraries installed:
```bash
pip install git+https://github.com/andimarafioti/transformers.git@idefics3
```
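If the install succeeded, the Idefics3 classes should be importable. A quick, optional sanity check (the printed version will depend on which branch or release you installed):
```python
# Optional sanity check: the installed transformers build must expose the Idefics3 classes.
import transformers
from transformers import Idefics3ForConditionalGeneration, AutoProcessor  # ImportError here means the build lacks Idefics3 support

print(transformers.__version__)
```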
## Usage
We provide an [inference tutorial](https://colab.research.google.com/drive/1TakNg4v6hHFXLih-SFcibxzYBTs2-EFn?usp=sharing).
To use the model with the Hugging Face `transformers` library:
```python
import io
import time

import requests
from PIL import Image

import torch
from transformers import (
    Idefics3ForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
```
```python
# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(DEVICE)
if DEVICE == "cuda":
    print(torch.cuda.device_count())

N = 5  # scaling factor for the optional image-size settings below
revision = "quantized8bit"

processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    do_image_splitting=False,
    # size={"longest_edge": N*364},            # Optional
    # size={"height": N*364, "width": N*364},  # Optional
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    torch_dtype=torch.float16,
    device_map=DEVICE,
)
print(processor.image_processor.size)
url_path = None  # set to an image URL to read from the web instead of a local file
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

question = "รายละเอียดของรูปภาพนี้"  # "Describe the details of this image."
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "You are a helpful assistant."},
{"type": "image"},
{"type": "text", "text": question}
]
}
]
text = processor.apply_chat_template(
messages,
add_generation_prompt=True,
)
encoding = processor(
images=image,
text=text.strip(),
# padding='max_length',
# truncation=True,
# max_length=,
return_tensors="pt"
)
encoding = {k: v.to(DEVICE) for k, v in encoding.items()}

# Run inference on the image + question prompt.
start_time = time.time()
model.eval()
with torch.inference_mode():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=128,
        # temperature=.5,
        # repetition_penalty=1.,
        # top_k=1,
        # top_p=1.,
    )
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

# Latency of the single generation call.
latency_time = end_time - start_time

# Keep only the assistant's answer from the decoded conversation.
answer_prompt = generated_text.split('Assistant:')[1].strip()
print(answer_prompt)
print(f"latency_time: {latency_time:.3f} sec.")
# >>> output:
# >>> ลูกฮิปโปแคระกำลังยืนอยู่ข้างแม่ฮิปโปแคระที่กำลังอาบน้ำ
# >>> ("A pygmy hippo calf is standing next to its mother, who is bathing.")
# >>> latency_time: 7.642 sec.
```
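The example above imports `BitsAndBytesConfig` but never uses it. If you would rather quantize on the fly than rely on the `quantized8bit` revision, here is a minimal sketch, assuming `bitsandbytes` is installed and a CUDA GPU is available (this is not the card's official recipe):
```python
# Sketch: load the checkpoint with 4-bit NF4 quantization via bitsandbytes.
import torch
from transformers import Idefics3ForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained("nectec/Pathumma-llm-vision-1.0.0", do_image_splitting=False)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    quantization_config=bnb_config,
    device_map="auto",
)
```
The rest of the inference code works unchanged; `device_map="auto"` places the quantized weights on the available GPU.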
## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.
## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.
## Citation
If you use this model, please cite it as follows:
```bibtex
@misc{PathummaVision,
author = {Thirawarit Pitiphiphat and NECTEC Team},
title = {nectec/Pathumma-llm-vision-1.0.0},
year = {2024},
url = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```
```bibtex
@misc{laurençon2024building,
title={Building and better understanding vision-language models: insights and future directions.},
author={Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon},
year={2024},
eprint={2408.12637},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
## **Contributors**
**LLM Team**
Pakawat Phasook ([email protected])<br>
Jessada Pranee ([email protected])<br>
Arnon Saeoung ([email protected])<br>
Kun Kerdthaisong ([email protected])<br>
Kittisak Sukhantharat ([email protected])<br>
Chaianun Damrongrat ([email protected])<br>
Sarawoot Kongyoung ([email protected])
**Audio Team**
Pattara Tipaksorn ([email protected])<br>
Wayupuk Sommuang ([email protected])<br>
Oatsada Chatthong ([email protected])<br>
Kwanchiva Thangthai ([email protected])
**Vision Team**
Thirawarit Pitiphiphat ([email protected])<br>
Peerapas Ngokpon ([email protected])<br>
Theerasit Issaranon ([email protected])
## Contact
For questions or support, please join our Discord: **https://discord.gg/3WJwJjZt7r**.