---
language:
- th
- en
metrics:
- sacrebleu
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering
---

# Pathumma-llm-vision-1.0.0

## Model Overview
Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. It has 8 billion parameters and takes both images and text as input to generate text responses.

- **Model Name**: Pathumma-llm-vision-1.0.0
- **Base Model**: HuggingFaceM4/Idefics3-8B-Llama3
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 8 Billion
- **Organization**: NECTEC
- **License**: [Specify License]

## Intended Use
- **Primary Use Cases**: 
  - Visual Question Answering (VQA)
  - Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.

## Model Description
Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integrating both visual and textual information. The model is fine-tuned with diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.

## Training Data
The model was fine-tuned on several datasets:
- **Image Caption Competition (Kaggle)**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: Data translated from GPT-4o-mini outputs into Thai.
- **Small-Thai-Wikipedia-location**: Articles in Thai from Wikipedia about geographic locations.
- **Synthetic Data**: Additional synthetic data generated to increase dataset diversity.

### Dataset Size
- **Training Dataset Size**: 112,768 examples
- **Validation Dataset Size**: 9,036 examples

## Training Details
- **Hardware Used**: 
  - **HPC Cluster**: Lanta
  - **Number of Nodes**: 16 Nodes
  - **GPUs per Node**: 4 GPUs
  - **Total GPUs Used**: 64 GPUs
- **Fine-tuning Duration**: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)

## Evaluation Results

| Type                                  | Encoder                            | Decoder                        | Learning Rate | Sentence SacreBLEU | Unique Tokens |
|---------------------------------------|------------------------------------|--------------------------------|---------------|--------------------|---------------|
| Idefics3-8B-Llama3                    | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct     | -             | 0.02657            | 12990         |
| Pathumma-llm-vision-beta-0.0.0        | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct     | 1e-4          | 13.45412           | 1148          |
| Pathumma-llm-vision-1.0.0             | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct     | 1e-4          | 17.66370           | 1312          |


- **Accuracy on Manual-VQA Tasks**: 30.34%

## Required Libraries

Before you start, install the Idefics3 branch of `transformers`. The usage example below also needs `torch`, `Pillow`, and `requests`.

```bash
pip install git+https://github.com/andimarafioti/transformers.git@idefics3
```

## Usage
To use the model with the Hugging Face `transformers` library:

```python
import io
import time

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics3ForConditionalGeneration

# Pick the best available device: CUDA first, then Apple MPS, then CPU.
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
else:
    DEVICE = "cpu"
print(DEVICE)
if DEVICE == "cuda":
    print(torch.cuda.device_count())

N = 5  # Scale factor for the optional image-size settings below

processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    do_image_splitting=False,
    # size={"longest_edge": N*364},            # Optional
    # size={"height": N*364, "width": N*364},  # Optional
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    torch_dtype=torch.float16,
    device_map=DEVICE,
)

print(processor.image_processor.size)

# Load the image from a URL if one is set, otherwise from a local file.
url_path = None
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

question = "รายละเอียดของรูปภาพนี้"  # Thai for "Details of this image"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "image"},
            {"type": "text", "text": question}
        ]
    }
]

# Build the prompt string from the chat messages.
text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

encoding = processor(
    images=image,
    text=text.strip(),
    # padding='max_length',
    # truncation=True,
    # max_length=,
    return_tensors="pt"
)

# Move all input tensors to the same device as the model.
encoding = {k: v.to(DEVICE) for k, v in encoding.items()}

# Run inference on the image and question, timing the generation step.
model.eval()
start_time = time.time()
with torch.inference_mode():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=128,
        # temperature=.5,
        # repetition_penalty=1.,
        # top_k=1,
        # top_p=1,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

# Latency in seconds
latency_time = end_time - start_time

# The decoded text includes the prompt; keep only the part after "Assistant:".
answer_prompt = generated_text.split('Assistant:')[1].strip()

# Output processing (depends on task requirements)
print(answer_prompt)
print(latency_time)
```
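
The usage example takes the answer by splitting the decoded text on `Assistant:`, which raises an `IndexError` if that marker is missing. The helper below is a minimal sketch of a more defensive variant. The function name `extract_answer` and the sample string are hypothetical and only illustrate the idea; the `Assistant:` marker is the one used in the example above.

```python
def extract_answer(generated_text: str, marker: str = "Assistant:") -> str:
    """Return the text after the first `marker`, or the whole text if it is absent.

    Hypothetical helper: the marker is assumed to match the chat template output.
    """
    _, found, after = generated_text.partition(marker)
    # `found` is an empty string when the marker does not occur in the text.
    return after.strip() if found else generated_text.strip()


# Illustrative decoded output (assumed format, not real model output).
sample = "User: What is in this image?\nAssistant: A cat sitting on a sofa."
print(extract_answer(sample))
```

Falling back to the full text keeps a pipeline running when the template changes, at the cost of returning the prompt along with the answer in that case.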

## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.

## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.

## Citation
If you use this model, please cite it as follows:

```bibtex
@misc{PathummaVision,
  author = {NECTEC Team},
  title = {nectec/Pathumma-llm-vision-1.0.0},
  year = {2024},
  url = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```

## Contact
For questions or support, reach the team on Discord: **https://discord.gg/3WJwJjZt7r**.
