---
license: apache-2.0
datasets:
- HuggingFaceM4/DocumentVQA
language:
- en
library_name: transformers
---
![Alt text](Banner.webp)
# Model Card for Florence-2-FT-DocVQA
This model card provides details about the Florence-2-FT-DocVQA model, which is fine-tuned for Document Visual Question Answering (VQA) tasks.
## Model Details
### Model Description
- **Developed by:** Mayank Chaudhary
- **Model type:** AutoModelForCausalLM
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** Florence-2-base-ft

The Florence-2-FT-DocVQA model is designed to handle Document VQA tasks, enabling automated question answering based on document images.
### Model Sources
- **Repository:** [GitHub - FineTuning-VLMs](https://github.com/mynkchaudhry/FineTuning-VLMs)
- **Paper:** [arXiv:2311.06242](https://arxiv.org/abs/2311.06242)
## Uses
The model can be further fine-tuned for specific Document VQA tasks or integrated into applications requiring automated document question answering.
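
For further fine-tuning, each dataset record has to be turned into a text prompt and a target answer. The following is a minimal sketch of that step, not the training code used for this model. It assumes each record carries a `question` string and an `answers` list (adjust the keys if your dataset differs), and it reuses the `task_prompt + question` concatenation from the getting-started code in this card. The helper name `build_training_pair` and the sample record are hypothetical.

```python
def build_training_pair(example, task_prompt="DocVQA"):
    """Turn one DocVQA-style record into a (prompt, target) text pair.

    Assumption: `example` has a "question" string and an "answers" list of
    acceptable answer strings. These keys are illustrative.
    """
    # The task prompt is prepended directly to the question, with no separator.
    prompt = task_prompt + example["question"]
    answers = example.get("answers") or []
    # Use the first reference answer as the training target, if there is one.
    target = answers[0] if answers else ""
    return prompt, target


# Hypothetical record, made up for illustration only.
example = {
    "question": "What is the invoice date?",
    "answers": ["12 March 1998", "March 12, 1998"],
}
print(build_training_pair(example))
```

Whatever task prompt is chosen during fine-tuning should also be used at inference time, since the model learns to associate that prefix with the task.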
## Requirements
- **datasets**
- **transformers**
- **torch**
- **Pillow**
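
Before running the example, it can help to confirm that the packages above are importable. The snippet below is a small convenience sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the only non-obvious mapping is that Pillow is imported as `PIL`.

```python
from importlib.util import find_spec

# Map each pip package name to the module name it is imported under.
REQUIRED_MODULES = {
    "datasets": "datasets",
    "transformers": "transformers",
    "torch": "torch",
    "Pillow": "PIL",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose import module cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```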
## How to Get Started with the Model
To get started with the Florence-2-FT-DocVQA model, you can use the following code:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Florence-2 checkpoints typically ship custom modeling code, which
# transformers only loads when trust_remote_code=True is passed.
model = AutoModelForCausalLM.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
)

data = load_dataset("HuggingFaceM4/DocumentVQA")


def run_example(task_prompt, text_input, image):
    # The task prompt is prepended directly to the question.
    prompt = task_prompt + text_input

    # The processor expects RGB input; convert grayscale or CMYK scans.
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )
    return parsed_answer


# Ask the same question about the first three training images.
for idx in range(3):
    print(run_example("DocVQA", 'What do you see in this image?', data['train'][idx]['image']))
```