---
license: apache-2.0
datasets:
- HuggingFaceM4/DocumentVQA
language:
- en
library_name: transformers
---

![Florence-2-FT-DocVQA banner](Banner.webp)



# Model Card for Florence-2-FT-DocVQA

This model card describes Florence-2-FT-DocVQA, a Florence-2 model fine-tuned for Document Visual Question Answering (DocVQA).

## Model Details

### Model Description

- Developed by: Mayank Chaudhary
- Model type: Vision-language model, loaded with `AutoModelForCausalLM`
- Language(s) (NLP): English
- License: apache-2.0
- Fine-tuned from model: Florence-2-base-ft

The model takes a document image and a natural-language question as input and generates the answer as text.

### Model Sources

- Repository: [GitHub - FineTuning-VLMs](https://github.com/mynkchaudhry/FineTuning-VLMs)
- Paper: [arXiv:2311.06242](https://arxiv.org/abs/2311.06242)

## Uses

The model can be used as is to answer questions about document images, fine-tuned further for a specific Document VQA task, or integrated into applications that need automated document question answering.
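
As a rough sketch of the data preparation that further fine-tuning might involve, the helper below turns one dataset record into a (prompt, answer, image) triple. The function name `to_training_example` is hypothetical, and the record keys `question`, `answers`, and `image` are assumptions about the dataset layout; adjust them for your own data. Using the first reference answer as the target is just one simple choice.

```python
def to_training_example(record, task_prompt):
    """Build a (prompt, answer, image) triple from one dataset record.

    Assumes `record` has the keys "question" (str), "answers" (list of str)
    and "image" (a PIL-style image); these names are assumptions.
    """
    # Same prompt construction as at inference time: task prompt + question.
    prompt = task_prompt + record["question"]
    # Use the first reference answer as the training target.
    answer = record["answers"][0]
    image = record["image"]
    # Mirror the RGB conversion applied before inference.
    if image.mode != "RGB":
        image = image.convert("RGB")
    return prompt, answer, image
```

Triples built this way could then be batched through the processor in a collate function of your training loop.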

## Requirements

- datasets
- transformers
- torch
- Pillow
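
To confirm that the packages listed above are available before loading the model, a small check like the following may help. It is a minimal sketch: the helper name `missing_packages` and the `REQUIRED` mapping are hypothetical, and the only non-obvious detail is that Pillow is imported under the module name `PIL`.

```python
import importlib.util

# Maps the pip package name to the module name used in import statements.
REQUIRED = {
    "datasets": "datasets",
    "transformers": "transformers",
    "torch": "torch",
    "Pillow": "PIL",  # Pillow is imported as PIL
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```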

## How to Get Started with the Model

The following example loads the model and processor, then asks a question about each of the first three training images of the DocumentVQA dataset:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Florence-2 checkpoints typically rely on custom modeling code,
# which transformers only loads when trust_remote_code=True is set.
model = AutoModelForCausalLM.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
)

data = load_dataset("HuggingFaceM4/DocumentVQA")


def run_example(task_prompt, text_input, image):
    """Ask a question about a document image and return the parsed answer."""
    prompt = task_prompt + text_input

    # Convert grayscale or RGBA scans to RGB before preprocessing.
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

    # Keep special tokens so the processor can parse the raw output itself.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )
    return parsed_answer


# Query the first three training images.
for idx in range(3):
    print(run_example("DocVQA", "What do you see in this image?", data["train"][idx]["image"]))
```