---
language:
- th
- en
metrics:
- sacrebleu
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering
---

# Pathumma-llm-vision-Idefic3-8b-llama3-1.0.0

## Model Overview
Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. It contains 8 billion parameters and leverages both image and text processing to understand and generate multi-modal content.

- **Model Name**: Pathumma-llm-vision-1.0.0
- **Base Model**: HuggingFaceM4/Idefics3-8B-Llama3
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 8 Billion
- **Organization**: NECTEC
- **License**: [Specify License]
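The architecture details above can be checked programmatically before downloading the full weights. The following is a minimal sketch, not part of the original card; it assumes the standard `Idefics3Config` layout exposed by `transformers`.

```python
from transformers import AutoConfig

# Fetches only the configuration from the Hub, not the 8B-parameter weights.
config = AutoConfig.from_pretrained("nectec/Pathumma-llm-vision-1.0.0")
print(type(config).__name__)   # expected: Idefics3Config for this architecture
print(config.text_config)      # decoder settings (Llama-3.1-based)
print(config.vision_config)    # vision-encoder settings (SigLIP-based)
```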

## Intended Use
- **Primary Use Cases**:
  - Visual Question Answering (VQA)
  - Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.

## Model Description
Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integrating both visual and textual information. The model is fine-tuned on diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.

## Training Data
The model was fine-tuned on several datasets:
- **Image Caption Competition (Kaggle)**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: Data translated from GPT-4o-mini outputs into Thai.
- **Small-Thai-Wikipedia-location**: Thai Wikipedia articles about geographic locations.
- **Synthetic Data**: Additional synthetic data generated to increase dataset diversity.

### Dataset Size
- **Training Dataset Size**: 112,768 examples
- **Validation Dataset Size**: 9,036 examples

## Training Details
- **Hardware Used**:
  - **HPC Cluster**: Lanta
  - **Number of Nodes**: 16
  - **GPUs per Node**: 4
  - **Total GPUs Used**: 64
- **Fine-tuning Duration**: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)

## Evaluation Results

| Model | Encoder | Decoder | Learning Rate | Sentence SacreBLEU | Unique Tokens |
|-------|---------|---------|---------------|--------------------|---------------|
| Idefics3-8B-Llama3 (base) | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | - | 0.02657 | 12990 |
| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4 | 13.45412 | 1148 |
| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4 | 17.66370 | 1312 |

- **Accuracy on Manual-VQA Tasks**: 30.34%
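The "Sentence SacreBLEU" column reports sentence-level SacreBLEU. As a rough illustration only, the sketch below computes per-sentence scores with the `sacrebleu` package and averages them; the example pairs are placeholders, and averaging per-sentence scores is an assumption about how the column was produced rather than a description of the actual evaluation pipeline.

```python
import sacrebleu

# Placeholder (hypothesis, reference) pairs; the real evaluation set is not included in this card.
pairs = [
    ("สุนัขกำลังวิ่งเล่นอยู่ในสวน", "สุนัขกำลังวิ่งอยู่ในสวนสาธารณะ"),
    ("a cat sleeping on a sofa", "a cat is sleeping on the sofa"),
]

# Sentence-level SacreBLEU for each pair, then averaged over the set.
scores = [sacrebleu.sentence_bleu(hyp, [ref]).score for hyp, ref in pairs]
print(sum(scores) / len(scores))
```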

## Usage
To use the model with the Hugging Face `transformers` library:

```python
import io
import time

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics3ForConditionalGeneration

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor (image processor + tokenizer) and the model
N = 5  # scaling factor for the optional image-resolution settings below

processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    do_image_splitting=False,
    # size={"longest_edge": N*364},            # Optional: cap the longest image edge
    # size={"height": N*364, "width": N*364},  # Optional: force a fixed resolution
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    torch_dtype=torch.float16,
    device_map=DEVICE,
)

print(processor.image_processor.size)

# Load an image from a local path, or from a URL if one is given
url_path = None
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

question = "รายละเอียดของรูปภาพนี้"  # "Describe the details of this image."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }
]

# Build the prompt from the chat template and encode it together with the image
text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

encoding = processor(
    images=image,
    text=text.strip(),
    # padding='max_length',
    # truncation=True,
    # max_length=,
    return_tensors="pt",
)

encoding = {k: v.to(DEVICE) for k, v in encoding.items()}

# Run inference
start_time = time.time()
model.eval()
with torch.inference_mode():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=128,
        # temperature=.5,
        # repetition_penalty=1.,
        # top_k=1,
        # top_p=1,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

# Measure latency and keep only the assistant's part of the decoded output
latency_time = end_time - start_time
answer_prompt = generated_text.split('Assistant:')[1].strip()

print(answer_prompt)
print(latency_time)
```
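The snippet above loads the weights in float16, which takes roughly 16 GB of GPU memory for an 8B-parameter model. If memory is limited, the checkpoint can also be loaded in 4-bit. This is a minimal sketch, assuming the `bitsandbytes` package and a CUDA GPU are available; it is not part of the official instructions above.

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration

# Hypothetical 4-bit loading setup; requires `bitsandbytes` and a CUDA GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    do_image_splitting=False,
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    quantization_config=bnb_config,
    device_map="auto",
)
# The rest of the inference code above works unchanged.
```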

## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.

## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.

## Citation
If you use this model, please cite it as follows:

```bibtex
@misc{PathummaVision,
  author = {NECTEC Team},
  title  = {nectec/Pathumma-llm-vision-1.0.0},
  year   = {2024},
  url    = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```

## Contact
For questions or support, please contact [[email protected]].