---
language:
- th
- en
metrics:
- sacrebleu
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering
---

# Pathumma-llm-vision-Idefic3-8b-llama3-1.0.0

## Model Overview
Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. It contains 8 billion parameters and leverages both image and text processing to understand and generate multi-modal content.

- **Model Name**: Pathumma-llm-vision-1.0.0
- **Base Model**: HuggingFaceM4/Idefics3-8B-Llama3
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 8 Billion
- **Organization**: NECTEC
- **License**: [Specify License]

## Intended Use
- **Primary Use Cases**:
  - Visual Question Answering (VQA)
  - Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.

## Model Description
Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integrating visual and textual information. The model is fine-tuned on diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.

## Training Data
The model was fine-tuned on several datasets:
- **Image Caption Competition (Kaggle)**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: Data translated from GPT-4o-mini outputs into Thai.
- **Small-Thai-Wikipedia-location**: Thai Wikipedia articles about geographic locations.
- **Synthetic Data**: Additional synthetic data generated to increase dataset diversity.

### Dataset Size
- **Training Dataset Size**: 112,768 examples
- **Validation Dataset Size**: 9,036 examples

## Training Details
- **Hardware Used**:
  - **HPC Cluster**: Lanta
  - **Number of Nodes**: 16
  - **GPUs per Node**: 4
  - **Total GPUs Used**: 64
- **Fine-tuning Duration**: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)

## Evaluation Results

| Model                          | Encoder                   | Decoder                    | Learning Rate | Sentence SacreBLEU | Unique Tokens |
|--------------------------------|---------------------------|----------------------------|---------------|--------------------|---------------|
| Idefic3-8B-Llama3              | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | -             | 0.02657            | 12990         |
| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4          | 13.45412           | 1148          |
| Pathumma-llm-vision-1.0.0      | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4          | 17.66370           | 1312          |

- **Accuracy on Manual-VQA Tasks**: 30.34%
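
The table reports sentence-level SacreBLEU. The exact scoring setup is not described in this card; the sketch below is only a hypothetical illustration of how such scores can be computed with the `sacrebleu` package (the caption strings and the averaging scheme are assumptions, not the official evaluation script):

```python
# Hypothetical SacreBLEU scoring sketch; strings and averaging are illustrative only.
import sacrebleu

# Made-up model outputs and gold captions (one reference per example)
predictions = ["แมวนั่งอยู่บนโซฟา", "ผู้ชายกำลังขี่จักรยานบนถนน"]
references = ["แมวกำลังนั่งอยู่บนโซฟาสีเทา", "ผู้ชายคนหนึ่งขี่จักรยานไปตามถนน"]

# Corpus-level SacreBLEU; for unsegmented Thai text a character-level tokenizer
# may be more appropriate (the card does not say which tokenizer was used).
corpus = sacrebleu.corpus_bleu(predictions, [references], tokenize="char")
print(f"corpus BLEU: {corpus.score:.5f}")

# Averaged sentence-level SacreBLEU, one plausible reading of "Sentence SacreBLEU"
sentence_scores = [
    sacrebleu.sentence_bleu(pred, [ref], tokenize="char").score
    for pred, ref in zip(predictions, references)
]
print(f"mean sentence BLEU: {sum(sentence_scores) / len(sentence_scores):.5f}")
```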

## Usage
To use the model with the Hugging Face `transformers` library:

```python
import io
import time

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics3ForConditionalGeneration

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
N = 5

# Load the processor (tokenizer + image processor)
processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    do_image_splitting=False,
    # size={"longest_edge": N*364},            # Optional
    # size={"height": N*364, "width": N*364},  # Optional
)

# Load the model
model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    torch_dtype=torch.float16,
    device_map=DEVICE,
)

print(processor.image_processor.size)

# Load the image from a local path, or from a URL if url_path is set
url_path = None
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

question = "รายละเอียดของรูปภาพนี้"  # "Describe the details of this image."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

encoding = processor(
    images=image,
    text=text.strip(),
    # padding='max_length',
    # truncation=True,
    # max_length=,
    return_tensors="pt",
)
encoding = {k: v.to(DEVICE) for k, v in encoding.items()}

# Run inference
start_time = time.time()
model.eval()
with torch.inference_mode():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=128,
        # temperature=.5,
        # repetition_penalty=1.,
        # top_k=1,
        # top_p=1,
    )
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

# Generation latency in seconds
latency_time = end_time - start_time

# Keep only the assistant's answer
answer_prompt = generated_text.split("Assistant:")[1].strip()

print(answer_prompt)
print(latency_time)
```
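
If GPU memory is limited, the 8B model can typically also be loaded with 4-bit quantization. This is an optional sketch, not part of the official setup above; it assumes the `bitsandbytes` package is installed and a CUDA GPU is available:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration

# Optional 4-bit (NF4) quantization to reduce GPU memory use; requires `bitsandbytes`
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    do_image_splitting=False,
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    quantization_config=bnb_config,
    device_map="auto",
)
# From here on, the inference code above can be used unchanged.
```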

## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.

## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.

## Citation
If you use this model, please cite it as follows:

```bibtex
@misc{PathummaVision,
  author = {NECTEC Team},
  title = {nectec/Pathumma-llm-vision-1.0.0},
  year = {2024},
  url = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```

## Contact
For questions or support, please contact [[email protected]].