---
language:
- th
- en
metrics:
- sacrebleu
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering
---

# Pathumma-llm-vision-Idefic3-8b-llama3-1.0.0

## Model Overview
Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. It contains 8 billion parameters and leverages both image and text processing to understand and generate multi-modal content.

- **Model Name**: Pathumma-llm-vision-1.0.0
- **Base Model**: HuggingFaceM4/Idefics3-8B-Llama3
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 8 Billion
- **Organization**: NECTEC
- **License**: [Specify License]
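The architecture details above can be checked programmatically before downloading the full weights. The following is a minimal sketch, not part of the original card; it assumes the standard `Idefics3Config` layout exposed by `transformers`.

```python
from transformers import AutoConfig

# Fetches only the configuration from the Hub, not the 8B-parameter weights.
config = AutoConfig.from_pretrained("nectec/Pathumma-llm-vision-1.0.0")
print(type(config).__name__)   # expected: Idefics3Config for this architecture
print(config.text_config)      # decoder settings (Llama-3.1-based)
print(config.vision_config)    # vision-encoder settings (SigLIP-based)
```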

## Intended Use
- **Primary Use Cases**:
  - Visual Question Answering (VQA)
  - Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.

## Model Description
Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integrating both visual and textual information. The model is fine-tuned on diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.

## Training Data
The model was fine-tuned on several datasets:
- **Image Caption Competition (Kaggle)**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: Data translated from GPT-4o-mini outputs into Thai.
- **Small-Thai-Wikipedia-location**: Thai Wikipedia articles about geographic locations.
- **Synthetic Data**: Additional synthetic data generated to increase dataset diversity.

### Dataset Size
- **Training Dataset Size**: 112,768 examples
- **Validation Dataset Size**: 9,036 examples

## Training Details
- **Hardware Used**:
  - **HPC Cluster**: Lanta
  - **Number of Nodes**: 16
  - **GPUs per Node**: 4
  - **Total GPUs Used**: 64
- **Fine-tuning Duration**: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)

## Evaluation Results

| Model | Encoder | Decoder | Learning Rate | Sentence SacreBLEU | Unique Tokens |
|-------|---------|---------|---------------|--------------------|---------------|
| Idefics3-8B-Llama3 (base) | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | - | 0.02657 | 12990 |
| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4 | 13.45412 | 1148 |
| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4 | 17.66370 | 1312 |

- **Accuracy on Manual-VQA Tasks**: 30.34%
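The "Sentence SacreBLEU" column reports sentence-level SacreBLEU. As a rough illustration only, the sketch below computes per-sentence scores with the `sacrebleu` package and averages them; the example pairs are placeholders, and averaging per-sentence scores is an assumption about how the column was produced rather than a description of the actual evaluation pipeline.

```python
import sacrebleu

# Placeholder (hypothesis, reference) pairs; the real evaluation set is not included in this card.
pairs = [
    ("สุนัขกำลังวิ่งเล่นอยู่ในสวน", "สุนัขกำลังวิ่งอยู่ในสวนสาธารณะ"),
    ("a cat sleeping on a sofa", "a cat is sleeping on the sofa"),
]

# Sentence-level SacreBLEU for each pair, then averaged over the set.
scores = [sacrebleu.sentence_bleu(hyp, [ref]).score for hyp, ref in pairs]
print(sum(scores) / len(scores))
```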

## Usage
To use the model with the Hugging Face `transformers` library:

```python
import io
import time

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics3ForConditionalGeneration

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor (image processor + tokenizer) and the model
N = 5  # scaling factor for the optional image-resolution settings below

processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    do_image_splitting=False,
    # size={"longest_edge": N*364},            # Optional: cap the longest image edge
    # size={"height": N*364, "width": N*364},  # Optional: force a fixed resolution
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    torch_dtype=torch.float16,
    device_map=DEVICE,
)

print(processor.image_processor.size)

# Load an image from a local path, or from a URL if one is given
url_path = None
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

question = "รายละเอียดของรูปภาพนี้"  # "Describe the details of this image."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }
]

# Build the prompt from the chat template and encode it together with the image
text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

encoding = processor(
    images=image,
    text=text.strip(),
    # padding='max_length',
    # truncation=True,
    # max_length=,
    return_tensors="pt",
)

encoding = {k: v.to(DEVICE) for k, v in encoding.items()}

# Run inference
start_time = time.time()
model.eval()
with torch.inference_mode():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=128,
        # temperature=.5,
        # repetition_penalty=1.,
        # top_k=1,
        # top_p=1,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

# Measure latency and keep only the assistant's part of the decoded output
latency_time = end_time - start_time
answer_prompt = generated_text.split('Assistant:')[1].strip()

print(answer_prompt)
print(latency_time)
```
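The snippet above loads the weights in float16, which takes roughly 16 GB of GPU memory for an 8B-parameter model. If memory is limited, the checkpoint can also be loaded in 4-bit. This is a minimal sketch, assuming the `bitsandbytes` package and a CUDA GPU are available; it is not part of the official instructions above.

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration

# Hypothetical 4-bit loading setup; requires `bitsandbytes` and a CUDA GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    do_image_splitting=False,
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    quantization_config=bnb_config,
    device_map="auto",
)
# The rest of the inference code above works unchanged.
```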

## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.

## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.

## Citation
If you use this model, please cite it as follows:

```bibtex
@misc{PathummaVision,
  author = {NECTEC Team},
  title  = {nectec/Pathumma-llm-vision-1.0.0},
  year   = {2024},
  url    = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```

## Contact
For questions or support, please contact [[email protected]].