---
library_name: transformers
tags:
- document-question-answering
- layoutlmv3
- ocr
- document-understanding
- paddleocr
- multilingual
- layout-aware
- lakshya-singh
license: apache-2.0
language:
- en
- es
- zh
base_model:
- microsoft/layoutlmv3-base
datasets:
- nielsr/docvqa_1200_examples
---

# Document QA Model

This is a fine-tuned **document question-answering model** based on [`microsoft/layoutlmv3-base`](https://huggingface.co/microsoft/layoutlmv3-base). It is trained to answer questions about scanned documents by combining OCR tokens and bounding boxes (extracted with PaddleOCR) with layout-aware understanding of where information appears on the page.

---

## Model Details

### Model Description

- **Model Name:** `document-qa-model`
- **Base Model:** [`microsoft/layoutlmv3-base`](https://huggingface.co/microsoft/layoutlmv3-base)
- **Fine-tuned by:** Lakshya Singh (solo contributor)
- **Languages:** English, Spanish, Chinese
- **License:** Apache-2.0 (inherited from the base model)
- **Intended Use:** extracting answers to structured queries from scanned documents
- **Funding:** none; this project was completed independently

---

## Model Sources

- **Repository:** [https://github.com/Lakshyasinghrawat12](https://github.com/Lakshyasinghrawat12)
- **Trained on:** an adapted version of [`nielsr/docvqa_1200_examples`](https://huggingface.co/datasets/nielsr/docvqa_1200_examples)
- **Model metrics:** see the Training Metrics section below

---

## Uses

### Direct Use

This model can be used for:

- Question answering on document images (PDFs, invoices, utility bills)
- Information extraction tasks that combine OCR text with layout-aware understanding

### Out-of-Scope Use

- Not suitable for conversational QA
- Not suitable for images from which no OCR text can be extracted

---

## Training Details

### Dataset

The dataset consisted of:

- **Images** of utility bills and other documents
- **OCR data** with bounding boxes (from PaddleOCR)
- **Queries** in English, Spanish, and Chinese
- **Answer spans** with match scores and positions (an illustrative record is sketched below)

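The exact schema is not reproduced here, but a single training example can be pictured as OCR output paired with a query and its answer span. The field names below are hypothetical, for illustration only:

```python
# Hypothetical shape of one training record (field names are illustrative,
# not the dataset's actual schema).
record = {
    "image": "utility_bill_0042.png",
    "words": ["ACME", "Utilities", "Total", "Due:", "$128.50"],
    "boxes": [  # one [x0, y0, x1, y1] pixel box per word
        [82, 40, 210, 68], [220, 40, 390, 68],
        [60, 610, 160, 640], [170, 610, 250, 640], [260, 610, 380, 640],
    ],
    "query": {
        "en": "What is the total amount due?",
        "es": "¿Cuál es el importe total a pagar?",
        "zh": "应付总额是多少？",
    },
    "answer": {"text": "$128.50", "start": 4, "end": 4, "match_score": 1.0},
}
```
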
### Training Procedure

- Preprocessing: PaddleOCR was used to extract tokens, positions, and structure (see the sketch after this list)
- Model: LayoutLMv3-base
- Epochs: 4
- Learning rate schedule: shown in the chart under Training Metrics

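As a rough sketch of that preprocessing step (the exact pipeline is not published here; the classic PaddleOCR 2.x API is assumed), OCR quadrilaterals can be flattened into the `[x0, y0, x1, y1]` boxes normalized to a 0-1000 grid that LayoutLMv3 expects:

```python
from paddleocr import PaddleOCR
from PIL import Image

ocr = PaddleOCR(lang="en")  # separate runs for "es", "ch", etc.
image = Image.open("your_document.png")
width, height = image.size

# In the classic API, each detected line is [quad_points, (text, confidence)];
# ocr() returns one such list per page.
lines = ocr.ocr("your_document.png")[0]

words, boxes = [], []
for quad, (text, confidence) in lines:
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    words.append(text)
    # LayoutLMv3 expects axis-aligned boxes normalized to 0-1000.
    boxes.append([
        int(1000 * min(xs) / width),
        int(1000 * min(ys) / height),
        int(1000 * max(xs) / width),
        int(1000 * max(ys) / height),
    ])
```
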
### Training Metrics

- **F1 Score** (validation):
- **Loss & Learning Rate Chart**:

---

## Evaluation

### Metrics Used

- F1 score
- Match score of predicted spans
- Token overlap vs. ground truth (a sketch of this computation follows the list)

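Token-overlap F1 follows the usual extractive-QA definition: precision is the fraction of predicted tokens that appear in the ground truth, recall is the fraction of ground-truth tokens that are predicted, and F1 is their harmonic mean. The helper below is a generic illustration, not the exact evaluation script:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted span and the ground truth."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("total due $128.50", "$128.50"))  # 0.5: precision 1/3, recall 1
```
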
### Summary

The model performs well on document-style QA tasks, especially with:

- Clearly structured OCR results
- Document types similar to utility bills, invoices, and forms

---

## How to Use

```python
from transformers import LayoutLMv3Processor, LayoutLMv3ForQuestionAnswering
from PIL import Image
import torch

processor = LayoutLMv3Processor.from_pretrained("lakshya-singh/document-qa-model")
model = LayoutLMv3ForQuestionAnswering.from_pretrained("lakshya-singh/document-qa-model")

image = Image.open("your_document.png").convert("RGB")
question = "What is the total amount due?"

# By default the processor runs its built-in OCR on the image and encodes
# the question together with the extracted words and boxes.
inputs = processor(image, question, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Most likely start/end token positions of the answer span.
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)

answer = processor.tokenizer.decode(
    inputs["input_ids"][0][start_idx : end_idx + 1], skip_special_tokens=True
)
print("Answer:", answer)
```