---
library_name: transformers
tags:
- document-question-answering
- layoutlmv3
- ocr
- document-understanding
- paddleocr
- multilingual
- layout-aware
- lakshya-singh
license: apache-2.0
language:
- en
base_model:
- microsoft/layoutlmv3-base
datasets:
- nielsr/docvqa_1200_examples
---
# Document QA Model
This is a fine-tuned **document question-answering model** based on `layoutlmv3-base`. It combines OCR output (via PaddleOCR) with layout information to answer questions about structured content in document images.
---
## Model Details
### Model Description
- **Model Name:** `document-qa-model`
- **Base Model:** [`microsoft/layoutlmv3-base`](https://huggingface.co/microsoft/layoutlmv3-base)
- **Fine-tuned by:** Lakshya Singh (solo contributor)
- **Languages:** English, Spanish, Chinese
- **License:** Apache-2.0 (inherited from base model)
- **Intended Use:** Extract answers to structured queries from scanned documents
- **Not funded** — this project was completed independently.
---
## Model Sources
- **Repository:** [https://github.com/Lakshyasinghrawat12](https://github.com/Lakshyasinghrawat12)
- **Trained on:** Adapted version of [`nielsr/docvqa_1200_examples`](https://huggingface.co/datasets/nielsr/docvqa_1200_examples)
- **Model metrics:** See [training_history.png](https://cdn-uploads.huggingface.co/production/uploads/66a7331438fbd9075584523f/MtMe5CZy3wb2nEG1wTRMc.png)
---
## Uses
### Direct Use
This model can be used for:
- Question Answering on document images (PDFs, invoices, utility bills)
- Information extraction tasks using OCR and layout-aware understanding
### Out-of-Scope Use
- Not suitable for conversational QA
- Not suitable for images without OCR-extractable text
---
## Training Details
### Dataset
The dataset consisted of:
- **Images** of utility bills and documents
- **OCR data** with bounding boxes (from PaddleOCR)
- **Queries** in English, Spanish, and Chinese
- **Answer spans** with match scores and positions
### Training Procedure
- Preprocessing: PaddleOCR was used to extract tokens, positions, and structure (see the sketch after this list)
- Model: LayoutLMv3-base
- Epochs: 4
- Learning rate schedule: Shown in image below
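The exact preprocessing script is not included in this card. The following is a minimal sketch of the general approach, assuming the `paddleocr` package (its API varies slightly between versions) and normalization of word boxes to the 0-1000 coordinate range that LayoutLMv3 expects:

```python
# Minimal preprocessing sketch (assumed, not the original training script):
# run PaddleOCR on a document image and normalize word boxes to 0-1000.
from paddleocr import PaddleOCR
from PIL import Image

ocr = PaddleOCR(use_angle_cls=True, lang="en")

def extract_words_and_boxes(image_path):
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    result = ocr.ocr(image_path, cls=True)

    words, boxes = [], []
    for polygon, (text, confidence) in result[0]:
        xs = [point[0] for point in polygon]
        ys = [point[1] for point in polygon]
        # Convert the OCR polygon to an axis-aligned box in 0-1000 coordinates.
        boxes.append([
            int(1000 * min(xs) / width),
            int(1000 * min(ys) / height),
            int(1000 * max(xs) / width),
            int(1000 * max(ys) / height),
        ])
        words.append(text)
    return words, boxes
```

The resulting words and boxes can then be paired with each query and its answer span for fine-tuning.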
### Training Metrics
- **Validation F1, loss, and learning rate:** ![training_history.png](https://cdn-uploads.huggingface.co/production/uploads/66a7331438fbd9075584523f/MtMe5CZy3wb2nEG1wTRMc.png)
---
## Evaluation
### Metrics Used
- F1 score
- Match score of predicted spans
- Token overlap vs. ground truth (a minimal F1 sketch is shown after this list)
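The evaluation code is not included in this card. As an illustration, here is a minimal sketch of a token-overlap F1 between a predicted answer and the ground truth, assuming simple whitespace tokenization; the `token_f1` helper below is hypothetical, not the exact metric implementation used for this model:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted answer and the ground truth."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```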
### Summary
The model performs well on document-style QA tasks, especially with:
- Clearly structured OCR results
- Document types similar to utility bills, invoices, and forms
---
## How to Use
```python
from transformers import LayoutLMv3Processor, LayoutLMv3ForQuestionAnswering
from PIL import Image
import torch

processor = LayoutLMv3Processor.from_pretrained("lakshya-singh/document-qa-model")
model = LayoutLMv3ForQuestionAnswering.from_pretrained("lakshya-singh/document-qa-model")

# Load the document image; by default the processor runs OCR on it.
image = Image.open("your_document.png").convert("RGB")
question = "What is the total amount due?"

# Encode the image and the question together.
inputs = processor(image, question, return_tensors="pt")

# Predict the start and end of the answer span.
with torch.no_grad():
    outputs = model(**inputs)
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)

# Decode the predicted span back into text.
answer = processor.tokenizer.decode(inputs["input_ids"][0][start_idx : end_idx + 1])
print("Answer:", answer)
```