---
language:
- th
- en
metrics:
- sacrebleu
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering
---
# Pathumma-llm-vision-1.0.0
## Model Overview
Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. It contains 8 billion parameters and leverages both image and text processing to understand and generate multi-modal content.
- **Model Name**: Pathumma-llm-vision-1.0.0
- **Base Model**: HuggingFaceM4/Idefics3-8B-Llama3
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 8 Billion
- **Organization**: NECTEC
- **License**: [Specify License]
## Intended Use
- **Primary Use Cases**:
  - Visual Question Answering (VQA)
  - Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.
## Model Description
Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integrating both visual and textual information. The model is fine-tuned with diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.
## Training Data
The model was fine-tuned on several datasets:
- **Image Caption Competition (Kaggle)**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: Data translated from GPT-4o-mini outputs into Thai.
- **Small-Thai-Wikipedia-location**: Articles in Thai from Wikipedia about geographic locations.
- **Synthetic Data**: Additional synthetic data generated to increase dataset diversity.
### Dataset Size
- **Training Dataset Size**: 112,768 examples
- **Validation Dataset Size**: 9,036 examples
## Training Details
- **Hardware Used**:
  - **HPC Cluster**: Lanta
  - **Number of Nodes**: 16
  - **GPUs per Node**: 4
  - **Total GPUs Used**: 64
- **Fine-tuning Duration**: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)
## Evaluation Results
| Model                           | Encoder                   | Decoder                    | Learning Rate | Sentence SacreBLEU | Unique Tokens |
|---------------------------------|---------------------------|----------------------------|---------------|--------------------|---------------|
| Idefics3-8B-Llama3 (base)       | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | -             | 0.02657            | 12990         |
| Pathumma-llm-vision-beta-0.0.0  | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4          | 13.45412           | 1148          |
| Pathumma-llm-vision-1.0.0       | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4          | 17.66370           | 1312          |
- **Accuracy on Manual-VQA Tasks**: 30.34%
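The sentence-level SacreBLEU figures above are computed with the `sacrebleu` metric. The snippet below is a minimal sketch of how such a score can be obtained; the hypothesis and reference strings are made up for illustration and the character-level tokenizer is only one reasonable choice for unsegmented Thai text, not necessarily the exact evaluation setup used for this model.
```python
import sacrebleu  # pip install sacrebleu

# Illustrative strings only; not taken from the model's evaluation data.
hypothesis = "แมวสีขาวกำลังนอนอยู่บนโซฟา"          # model output: "a white cat is lying on a sofa"
reference = "แมวขาวนอนหลับอยู่บนโซฟาในห้องนั่งเล่น"   # ground-truth caption

# sentence_bleu takes one hypothesis and a list of reference strings and
# returns a BLEUScore object whose .score is on the 0-100 scale.
# tokenize="char" segments Thai text character by character.
result = sacrebleu.sentence_bleu(hypothesis, [reference], tokenize="char")
print(result.score)
```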
## Required Libraries
Before you start, ensure the following library is installed. The usage example below also relies on `torch`, `Pillow`, and `requests`:
```bash
pip install git+https://github.com/andimarafioti/transformers.git@idefics3
```
## Usage
To use the model with the Hugging Face `transformers` library:
```python
import io
import time

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics3ForConditionalGeneration

# Select the best available device.
DEVICE = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
print(DEVICE)
if DEVICE == "cuda":
    print(torch.cuda.device_count())

N = 5  # Scaling factor for the optional image-resolution settings below.
processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    do_image_splitting=False,
    # size={"longest_edge": N*364},            # Optional
    # size={"height": N*364, "width": N*364},  # Optional
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    torch_dtype=torch.float16,
    device_map=DEVICE,
)
print(processor.image_processor.size)

# Load the image from a local path, or from a URL if url_path is set.
url_path = None
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

question = "รายละเอียดของรูปภาพนี้"  # Thai: "Describe the details of this image."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
)
encoding = processor(
    images=image,
    text=text.strip(),
    # padding='max_length',
    # truncation=True,
    # max_length=,
    return_tensors="pt",
)
encoding = {k: v.to(DEVICE) for k, v in encoding.items()}

# Run inference and measure latency.
start_time = time.time()
model.eval()
with torch.inference_mode():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=128,
        # temperature=.5,
        # repetition_penalty=1.,
        # top_k=1,
        # top_p=1,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

latency_time = end_time - start_time
# The decoded text contains the full chat transcript; keep only the assistant's answer.
answer_prompt = generated_text.split("Assistant:")[1].strip()
print(answer_prompt)
print(latency_time)
```
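As a rough sizing note, the 8 billion parameters in `torch.float16` occupy about 16 GB of GPU memory before activations, so plan for a GPU with headroom beyond that or accept slower CPU inference; the exact footprint also depends on image resolution and generation length.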
## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.
## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.
## Citation
If you use this model, please cite it as follows:
```bibtex
@misc{PathummaVision,
author = {NECTEC Team},
title = {nectec/Pathumma-llm-vision-1.0.0},
year = {2024},
url = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```
## Contact
For questions or support, please reach out on our Discord server: **https://discord.gg/3WJwJjZt7r**.