---
license: apache-2.0
datasets:
- HuggingFaceM4/DocumentVQA
language:
- en
library_name: transformers
---

![Florence-2-FT-DocVQA banner](Banner.webp)

# Model Card for Florence-2-FT-DocVQA

This model card provides details about the Florence-2-FT-DocVQA model, which is fine-tuned for Document Visual Question Answering (VQA) tasks.

## Model Details

### Model Description

- **Developed by:** Mayank Chaudhary
- **Model type:** AutoModelForCausalLM
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** Florence-2-base-ft

The Florence-2-FT-DocVQA model is designed for Document VQA tasks, enabling automated question answering over document images.

### Model Sources

- **Repository:** [GitHub - FineTuning-VLMs](https://github.com/mynkchaudhry/FineTuning-VLMs)
- **Paper:** [arXiv:2311.06242](https://arxiv.org/abs/2311.06242)

## Uses

The model can be further fine-tuned for specific Document VQA tasks or integrated into applications that require automated question answering over documents. A minimal fine-tuning sketch is included at the end of this card.

## Requirements

- **datasets**
- **transformers**
- **torch**
- **Pillow**

## How to Get Started with the Model

To get started with the Florence-2-FT-DocVQA model, you can use the following code:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Florence-2 ships custom modeling code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
)

data = load_dataset("HuggingFaceM4/DocumentVQA")


def run_example(task_prompt, text_input, image):
    prompt = task_prompt + text_input

    # The model expects RGB images.
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )
    return parsed_answer


# Run the example on the first three training images.
for idx in range(3):
    print(run_example("DocVQA", "What do you see in this image?", data["train"][idx]["image"]))
```
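The same helper works on your own document images. A minimal sketch, assuming a hypothetical local file `invoice.png` and a question relevant to that document:

```python
from PIL import Image

# Hypothetical local document image; replace the path and question with your own.
image = Image.open("invoice.png")
print(run_example("DocVQA", "What is the invoice number?", image))
```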
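## Fine-tuning Sketch

As noted under Uses, the model can be further fine-tuned on domain-specific documents. The code below is not the training script used for this model (see the linked FineTuning-VLMs repository for that); it is a minimal sketch of one possible approach, assuming the `question`, `answers`, and `image` fields of HuggingFaceM4/DocumentVQA, reusing the `model` and `processor` loaded above, and using illustrative hyperparameters.

```python
import torch
from torch.utils.data import DataLoader


def collate_fn(batch):
    # Assumed dataset fields: "question" (str), "answers" (list of str), "image" (PIL image).
    prompts = ["DocVQA" + example["question"] for example in batch]
    answers = [example["answers"][0] for example in batch]
    images = [example["image"].convert("RGB") for example in batch]
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    labels = processor.tokenizer(
        answers, return_tensors="pt", padding=True, return_token_type_ids=False
    ).input_ids
    return inputs, labels


# Small subset and batch size purely for illustration.
train_loader = DataLoader(
    data["train"].select(range(1000)),
    batch_size=2,
    shuffle=True,
    collate_fn=collate_fn,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
model.train()

for inputs, labels in train_loader:
    inputs = inputs.to(device)
    labels = labels.to(device)

    # The model computes a generation loss when labels are provided.
    outputs = model(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        labels=labels,
    )
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```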