---
language:
- th
- en
metrics:
- sacrebleu
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering
---

# Pathumma-llm-vision-1.0.0

## Model Overview

Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. It contains 8 billion parameters and processes both images and text to understand and generate multi-modal content.

- **Model Name**: Pathumma-llm-vision-1.0.0
- **Base Model**: HuggingFaceM4/Idefics3-8B-Llama3
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 8 Billion
- **Organization**: NECTEC
- **License**: [Specify License]

## Intended Use

- **Primary Use Cases**:
  - Visual Question Answering (VQA)
  - Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.

## Model Description

Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integrating visual and textual information. The model is fine-tuned on diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.

## Training Data

The model was fine-tuned on several datasets:

- **Thai Image Caption**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: Data translated from GPT-4o-mini outputs into Thai.
- **Small-Thai-Wikipedia-location**: Articles in Thai from Wikipedia about geographic locations.
- **Synthetic Data**: Additional synthetic data generated to increase dataset diversity.

### Dataset Size

- **Training Dataset Size**: 112,768 examples
- **Validation Dataset Size**: 9,036 examples

## Training Details

- **Hardware Used**:
  - **HPC Cluster**: Lanta
  - **Number of Nodes**: 16 Nodes
  - **GPUs per Node**: 4 GPUs
  - **Total GPUs Used**: 64 GPUs
- **Fine-tuning Duration**: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)

## Evaluation Results

| Type | Encoder | Decoder | Sentence SacreBLEU (test) | Unique Tokens |
|--------------------------------|---------------------------|----------------------------|---------------------------|---------------|
| Idefics3-8B-Llama3 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 0.02657 | 12990 |
| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 13.45412 | 1148 |
| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 17.66370 | 1312 |

- **Accuracy on Manual-VQA Tasks**: 30.34%

## Required Libraries

Before you start, ensure you have the following libraries installed:

```
pip install git+https://github.com/andimarafioti/transformers.git@idefics3
```

## Usage

We provide an [inference tutorial](https://colab.research.google.com/drive/1TakNg4v6hHFXLih-SFcibxzYBTs2-EFn?usp=sharing).

To use the model with the Hugging Face `transformers` library:

```python
import io
import os
import time
import random
import requests
import shutil
from IPython.display import display, Markdown
from IPython.display import clear_output as cls

import numpy as np
import pandas as pd
from PIL import Image

import torch
import transformers
from transformers import (
    Idefics3ForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
```

```python
# Pick the best available device: CUDA first, then Apple MPS, then CPU.
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
else:
    DEVICE = "cpu"
print(DEVICE)
if DEVICE == "cuda":
    display(torch.cuda.device_count())

N = 5
revision = "quantized8bit"
processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    do_image_splitting=False,
    # size={"longest_edge": N*364},  # Optional
    # size={"height": N*364, "width": N*364},  # Optional
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    torch_dtype=torch.float16,
    device_map=DEVICE
)

print(processor.image_processor.size)

# Load the image from a URL when url_path is set, otherwise from a local file.
url_path = None
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

# Thai for "Details of this image"
question = "รายละเอียดของรูปภาพนี้"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "image"},
            {"type": "text", "text": question}
        ]
    }
]

text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

encoding = processor(
    images=image,
    text=text.strip(),
    # padding='max_length',
    # truncation=True,
    # max_length=,
    return_tensors="pt"
)

# Move every input tensor to the same device as the model.
encoding = {k: v.to(DEVICE) for k, v in encoding.items()}

# Example: run inference on the image and question
start_time = time.time()
model.eval()
with torch.inference_mode():
    # Generate
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=128,
        # temperature=.5,
        # repetition_penalty=1.,
        # top_k=1.,
        # top_p=1,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

# Measure latency
latency_time = end_time - start_time

# Keep only the text that follows the "Assistant:" marker
answer_prompt = generated_text.split('Assistant:')[1].strip()

# Output processing (depends on task requirements)
print(answer_prompt)
print(f"latency_time: {latency_time:.3f} sec.")

# >>> output:
# >>> ลูกฮิปโปแคระกำลังยืนอยู่ข้างแม่ฮิปโปแคระที่กำลังอาบน้ำ
# >>> latency_time: 7.642 sec.
```

## Limitations and Biases

- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.

## Ethical Considerations

- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.

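## Post-processing the Generated Text

The Usage example extracts the answer with `generated_text.split('Assistant:')[1]`, which raises `IndexError` when the decoded text does not contain that marker. The block below is a minimal sketch of a more defensive variant. The helper name `extract_answer` is hypothetical and is not part of this model or of `transformers`; it assumes the decoded text places the answer after the first `Assistant:` marker, as the Usage example implies.

```python
def extract_answer(generated_text: str, marker: str = "Assistant:") -> str:
    """Return the text that follows the first `marker`, stripped.

    Hypothetical helper (not part of the model or transformers API).
    Unlike `split(marker)[1]`, it keeps everything after the first marker
    and falls back to the whole string when the marker is missing.
    """
    _, found, tail = generated_text.partition(marker)
    if not found:
        # Marker absent: return the full decoded text instead of raising.
        return generated_text.strip()
    return tail.strip()


if __name__ == "__main__":
    # Illustrative decoded string, not real model output.
    decoded = "User: What is in the picture?\nAssistant: A small hippo. "
    print(extract_answer(decoded))
```

In the Usage example, this would replace the `split` line with `answer_prompt = extract_answer(generated_text)`.
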
## Citation

If you use this model, please cite it as follows:

```bibtex
@misc{PathummaVision,
  author = {Thirawarit Pitiphiphat and NECTEC Team},
  title = {nectec/Pathumma-llm-vision-1.0.0},
  year = {2024},
  url = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```

```bibtex
@misc{laurençon2024building,
  title={Building and better understanding vision-language models: insights and future directions.},
  author={Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon},
  year={2024},
  eprint={2408.12637},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

## **Contributor Contact**

**LLM Team**

Pakawat Phasook (pakawat.phas@kmutt.ac.th)
Jessada Pranee (jessada.pran@kmutt.ac.th)
Arnon Saeoung (anon.saeoueng@gmail.com)
Kun Kerdthaisong (kun.ker@dome.tu.ac.th)
Kittisak Sukhantharat (kittisak.suk@stu.nida.ac.th)
Chaianun Damrongrat (chaianun.damrongrat@nectec.or.th)
Sarawoot Kongyoung (sarawoot.kongyoung@nectec.or.th)

**Audio Team**

Pattara Tipaksorn (pattara.tip@ncr.nstda.or.th)
Wayupuk Sommuang (wayupuk.som@dome.tu.ac.th)
Oatsada Chatthong (atsada.cha@dome.tu.ac.th)
Kwanchiva Thangthai (kwanchiva.thangthai@nectec.or.th)

**Vision Team**

Thirawarit Pitiphiphat (thirawarit.p@gmail.com)
Peerapas Ngokpon (jamesselmon78169@gmail.com)
Theerasit Issaranon (theerasit.issaranon@nectec.or.th)

## Contact

For questions or support, please contact **https://discord.gg/3WJwJjZt7r**.