---
license: apache-2.0
datasets:
- HuggingFaceM4/DocumentVQA
language:
- en
library_name: transformers
---

![Florence-2-FT-DocVQA banner](Banner.webp)

# Model Card for Florence-2-FT-DocVQA

This model card provides details about the Florence-2-FT-DocVQA model, which is fine-tuned for Document Visual Question Answering (VQA) tasks.

## Model Details

### Model Description

- **Developed by:** Mayank Chaudhary
- **Model type:** AutoModelForCausalLM
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** Florence-2-base-ft

The Florence-2-FT-DocVQA model is designed for Document VQA tasks, enabling automated question answering over document images.

### Model Sources

- **Repository:** [GitHub - FineTuning-VLMs](https://github.com/mynkchaudhry/FineTuning-VLMs)
- **Paper:** [arXiv:2311.06242](https://arxiv.org/abs/2311.06242)

## Uses

The model can be further fine-tuned for specific Document VQA tasks or integrated into applications that require automated question answering over documents. A minimal fine-tuning sketch is included at the end of this card.

## Requirements

- **datasets**
- **transformers**
- **torch**
- **Pillow**

## How to Get Started with the Model

To get started with the Florence-2-FT-DocVQA model, you can use the following code:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Florence-2 ships custom modeling code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
)

data = load_dataset("HuggingFaceM4/DocumentVQA")


def run_example(task_prompt, text_input, image):
    prompt = task_prompt + text_input

    # The model expects RGB images.
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )
    return parsed_answer


# Run the example on the first three training images.
for idx in range(3):
    print(run_example("DocVQA", "What do you see in this image?", data["train"][idx]["image"]))
```
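The same helper works on your own document images. A minimal sketch, assuming a hypothetical local file `invoice.png` and a question relevant to that document:

```python
from PIL import Image

# Hypothetical local document image; replace the path and question with your own.
image = Image.open("invoice.png")
print(run_example("DocVQA", "What is the invoice number?", image))
```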
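## Fine-tuning Sketch

As noted under Uses, the model can be further fine-tuned on domain-specific documents. The code below is not the training script used for this model (see the linked FineTuning-VLMs repository for that); it is a minimal sketch of one possible approach, assuming the `question`, `answers`, and `image` fields of HuggingFaceM4/DocumentVQA, reusing the `model` and `processor` loaded above, and using illustrative hyperparameters.

```python
import torch
from torch.utils.data import DataLoader


def collate_fn(batch):
    # Assumed dataset fields: "question" (str), "answers" (list of str), "image" (PIL image).
    prompts = ["DocVQA" + example["question"] for example in batch]
    answers = [example["answers"][0] for example in batch]
    images = [example["image"].convert("RGB") for example in batch]
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    labels = processor.tokenizer(
        answers, return_tensors="pt", padding=True, return_token_type_ids=False
    ).input_ids
    return inputs, labels


# Small subset and batch size purely for illustration.
train_loader = DataLoader(
    data["train"].select(range(1000)),
    batch_size=2,
    shuffle=True,
    collate_fn=collate_fn,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
model.train()

for inputs, labels in train_loader:
    inputs = inputs.to(device)
    labels = labels.to(device)

    # The model computes a generation loss when labels are provided.
    outputs = model(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        labels=labels,
    )
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```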