---
license: mit
tags:
- vision
inference: false
pipeline_tag: image-text-to-text
---

# UDOP model

The UDOP model was proposed in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal.

## Model description

UDOP adopts an encoder-decoder Transformer architecture based on T5 for document AI tasks such as document image classification, document parsing, and document visual question answering.

## Intended uses & limitations

You can use the model for document image classification, document parsing, and document visual question answering (DocVQA).

### How to use

Here's how to use the model for document visual question answering (DocVQA):

```python
from transformers import AutoProcessor, UdopForConditionalGeneration
from datasets import load_dataset

# load model and processor
# apply_ocr=False because the words and bounding boxes are supplied by the dataset
processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

# load one annotated form: the image, its words, and one bounding box per word
dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]

# the task prefix ("Question answering.") precedes the question itself
question = "Question answering. What is the date on the form?"
encoding = processor(image, question, text_pair=words, boxes=boxes, return_tensors="pt")

# autoregressive generation
predicted_ids = model.generate(**encoding)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
# 9/30/92
```

### BibTeX entry and citation info

```bibtex
@misc{tang2023unifying,
      title={Unifying Vision, Text, and Layout for Universal Document Processing},
      author={Zineng Tang and Ziyi Yang and Guoxin Wang and Yuwei Fang and Yang Liu and Chenguang Zhu and Michael Zeng and Cha Zhang and Mohit Bansal},
      year={2023},
      eprint={2212.02623},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```