---
library_name: transformers
tags:
- document-question-answering
- layoutlmv3
- ocr
- document-understanding
- paddleocr
- multilingual
- layout-aware
- lakshya-singh
license: apache-2.0
language:
- en
- es
- zh
base_model:
- microsoft/layoutlmv3-base
datasets:
- nielsr/docvqa_1200_examples
---

# Document QA Model

This is a fine-tuned **document question-answering model** based on [`microsoft/layoutlmv3-base`](https://huggingface.co/microsoft/layoutlmv3-base). It is trained to answer questions about scanned documents by combining OCR tokens and bounding boxes (extracted with PaddleOCR) with layout-aware understanding of where information appears on the page.

---

## Model Details

### Model Description

- **Model Name:** `document-qa-model`
- **Base Model:** [`microsoft/layoutlmv3-base`](https://huggingface.co/microsoft/layoutlmv3-base)
- **Fine-tuned by:** Lakshya Singh (solo contributor)
- **Languages:** English, Spanish, Chinese
- **License:** Apache-2.0 (inherited from the base model)
- **Intended Use:** extracting answers to structured queries from scanned documents
- **Funding:** none; this project was completed independently

---

## Model Sources

- **Repository:** [https://github.com/Lakshyasinghrawat12](https://github.com/Lakshyasinghrawat12)
- **Trained on:** an adapted version of [`nielsr/docvqa_1200_examples`](https://huggingface.co/datasets/nielsr/docvqa_1200_examples)
- **Model metrics:** see the Training Metrics section below

---

## Uses

### Direct Use

This model can be used for:

- Question answering on document images (PDFs, invoices, utility bills)
- Information extraction tasks that combine OCR text with layout-aware understanding

### Out-of-Scope Use

- Not suitable for conversational QA
- Not suitable for images from which no OCR text can be extracted

---

## Training Details

### Dataset

The dataset consisted of:

- **Images** of utility bills and other documents
- **OCR data** with bounding boxes (from PaddleOCR)
- **Queries** in English, Spanish, and Chinese
- **Answer spans** with match scores and positions (an illustrative record is sketched below)

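The exact schema is not reproduced here, but a single training example can be pictured as OCR output paired with a query and its answer span. The field names below are hypothetical, for illustration only:

```python
# Hypothetical shape of one training record (field names are illustrative,
# not the dataset's actual schema).
record = {
    "image": "utility_bill_0042.png",
    "words": ["ACME", "Utilities", "Total", "Due:", "$128.50"],
    "boxes": [  # one [x0, y0, x1, y1] pixel box per word
        [82, 40, 210, 68], [220, 40, 390, 68],
        [60, 610, 160, 640], [170, 610, 250, 640], [260, 610, 380, 640],
    ],
    "query": {
        "en": "What is the total amount due?",
        "es": "¿Cuál es el importe total a pagar?",
        "zh": "应付总额是多少？",
    },
    "answer": {"text": "$128.50", "start": 4, "end": 4, "match_score": 1.0},
}
```
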
### Training Procedure

- Preprocessing: PaddleOCR was used to extract tokens, positions, and structure (see the sketch after this list)
- Model: LayoutLMv3-base
- Epochs: 4
- Learning rate schedule: shown in the chart under Training Metrics

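As a rough sketch of that preprocessing step (the exact pipeline is not published here; the classic PaddleOCR 2.x API is assumed), OCR quadrilaterals can be flattened into the `[x0, y0, x1, y1]` boxes normalized to a 0-1000 grid that LayoutLMv3 expects:

```python
from paddleocr import PaddleOCR
from PIL import Image

ocr = PaddleOCR(lang="en")  # separate runs for "es", "ch", etc.
image = Image.open("your_document.png")
width, height = image.size

# In the classic API, each detected line is [quad_points, (text, confidence)];
# ocr() returns one such list per page.
lines = ocr.ocr("your_document.png")[0]

words, boxes = [], []
for quad, (text, confidence) in lines:
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    words.append(text)
    # LayoutLMv3 expects axis-aligned boxes normalized to 0-1000.
    boxes.append([
        int(1000 * min(xs) / width),
        int(1000 * min(ys) / height),
        int(1000 * max(xs) / width),
        int(1000 * max(ys) / height),
    ])
```
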
### Training Metrics

- **F1 Score** (validation):
- **Loss & Learning Rate Chart**:

---

## Evaluation

### Metrics Used

- F1 score
- Match score of predicted spans
- Token overlap vs. ground truth (a sketch of this computation follows the list)

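Token-overlap F1 follows the usual extractive-QA definition: precision is the fraction of predicted tokens that appear in the ground truth, recall is the fraction of ground-truth tokens that are predicted, and F1 is their harmonic mean. The helper below is a generic illustration, not the exact evaluation script:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted span and the ground truth."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("total due $128.50", "$128.50"))  # 0.5: precision 1/3, recall 1
```
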
### Summary

The model performs well on document-style QA tasks, especially with:

- Clearly structured OCR results
- Document types similar to utility bills, invoices, and forms

---

## How to Use

```python
from transformers import LayoutLMv3Processor, LayoutLMv3ForQuestionAnswering
from PIL import Image
import torch

processor = LayoutLMv3Processor.from_pretrained("lakshya-singh/document-qa-model")
model = LayoutLMv3ForQuestionAnswering.from_pretrained("lakshya-singh/document-qa-model")

image = Image.open("your_document.png").convert("RGB")
question = "What is the total amount due?"

# By default the processor runs its built-in OCR on the image and encodes
# the question together with the extracted words and boxes.
inputs = processor(image, question, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Most likely start/end token positions of the answer span.
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)

answer = processor.tokenizer.decode(
    inputs["input_ids"][0][start_idx : end_idx + 1], skip_special_tokens=True
)
print("Answer:", answer)
```