metadata
license: mit
inference: true
language:
  - en
metrics:
  - cer
  - wer
base_model:
  - facebook/deit-base-patch16-224
  - ai4bharat/IndicBART
pipeline_tag: image-to-text
tags:
  - text-generation
  - scene-text-recognition
  - text-recognition
  - computer-vision
  - language-model

trocr-indic

This model uses the TrOCR approach to recognize Indic text in cropped images.

Model Details

The model follows the TrOCR approach to training OCR for scene text. Since generalized models are scarce for the majority of Indian languages, this model is meant to serve as one.

TrOCR_Architecture.jpg Courtesy: TrOCR, original paper

The model is trained for the following languages:

  • Assamese
  • Bengali
  • Gujarati
  • Hindi
  • Kannada
  • Malayalam
  • Marathi
  • Odia
  • Punjabi
  • Telugu
  • Tamil

Model Description

IMPORTANT: Although the model is trained on these languages, it is trained with the Devanagari script only, due to limitations of IndicBART.
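If predictions do come back in Devanagari and you need them in the native script of the language, one possible post-processing step is a Unicode block offset. This is only a sketch and not part of this model: it assumes the output is Devanagari, relies on the fact that the major Indic Unicode blocks share a common layout, and uses two-letter language codes chosen here for illustration. Characters with no counterpart in the target script may map to unassigned code points.

```python
# Assumption: two-letter codes are illustrative; values are the start of each
# script's Unicode block. Assamese is written with the Bengali block.
SCRIPT_BLOCK_START = {
    "as": 0x0980, "bn": 0x0980, "gu": 0x0A80, "hi": 0x0900,
    "kn": 0x0C80, "ml": 0x0D00, "mr": 0x0900, "or": 0x0B00,
    "pa": 0x0A00, "ta": 0x0B80, "te": 0x0C00,
}
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80
# Danda and double danda are shared punctuation and stay in the Devanagari block.
SHARED_PUNCTUATION = {"\u0964", "\u0965"}


def devanagari_to_script(text, lang):
    """Shift Devanagari code points into the Unicode block used by `lang`."""
    target_start = SCRIPT_BLOCK_START[lang]
    converted = []
    for ch in text:
        offset = ord(ch) - DEVANAGARI_START
        if 0 <= offset < BLOCK_SIZE and ch not in SHARED_PUNCTUATION:
            converted.append(chr(target_start + offset))
        else:
            # Leave spaces, digits, Latin text and shared punctuation untouched.
            converted.append(ch)
    return "".join(converted)
```

For example, `devanagari_to_script("क", "bn")` shifts the Devanagari letter KA (U+0915) to the Bengali letter KA (U+0995).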

The output is in the following format:

<LANGUAGE TOKEN> <TEXT TOKENS> <EOS TOKEN>
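A minimal sketch of splitting a decoded prediction in this format into its language and text parts. It assumes the language token looks like IndicBART's `<2hi>` style tags and that the EOS token is `</s>`; check both against the tokenizer you actually load. The function name `parse_prediction` is made up for this example.

```python
import re

# Assumption: language tokens follow the "<2xx>" pattern, e.g. "<2hi>".
LANG_TOKEN = re.compile(r"\s*<2([a-z]{2})>\s*")


def parse_prediction(decoded, eos_token="</s>"):
    """Return (language_code, text) from '<LANGUAGE TOKEN> <TEXT TOKENS> <EOS TOKEN>'."""
    # Drop the EOS token and anything generated after it (padding, for instance).
    body = decoded.split(eos_token, 1)[0]
    match = LANG_TOKEN.match(body)
    if match is None:
        # No language token found: return the text as-is.
        return None, body.strip()
    return match.group(1), body[match.end():].strip()
```

Decode with special tokens kept (not skipped) before calling this, otherwise the language token is already gone and the function returns `None` for the language.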

The following flowchart gives a better picture of the training and inference approach for this model.

Reworked_Implementation

  • Datasets used: IndicSTR12
  • Developed by: Aarya Devarla
  • Model type: Visio-Lingual Model / Vision-Language Model
  • License: mit
  • Finetuned from model: facebook/deit-base-patch16-224, ai4bharat/IndicBART

Results

Metric  Assamese  Bengali  Gujarati  Hindi  Kannada  Malayalam  Marathi  Odia   Punjabi  Tamil  Telugu
CER     0.069     0.133    0.058     0.075  0.212    0.154      0.082    0.120  0.097    0.122  0.220
WER     0.205     0.395    0.192     0.283  0.576    0.519      0.312    0.375  0.304    0.409  0.612

Well, the model isn't perfect. But it's a start.
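For reference, CER and WER are usually defined as the edit distance between prediction and ground truth, divided by the ground-truth length in characters or words. The sketch below shows that common definition; it is not necessarily the exact script used to produce the numbers above, and it counts Unicode code points, so combining vowel signs count as separate characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

Lower is better for both metrics. WER is the stricter of the two on scene text, since a single wrong character makes the whole word count as an error.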

Limitations

The main limitation comes from IndicBART, which is trained primarily on Indic text.

Recommendations

Since TrOCR is modular in approach, one can swap out the IndicBART model and train with a different decoder. Keep in mind that the preprocessing and the output handling must be adapted to the new model.
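A rough sketch of what such a swap could look like with the Hugging Face `transformers` library, using `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`. This is not the training code of this model. The helper names are made up for this example, the decoder checkpoint is whatever you choose (it must be loadable as a causal language model), and the choice of start token is an assumption you should verify for your decoder.

```python
def configure_generation(config, tokenizer):
    """Copy the special token ids the decoder needs onto the model config."""
    # Assumption: use BOS as the decoder start token, falling back to CLS.
    start_id = tokenizer.bos_token_id
    if start_id is None:
        start_id = tokenizer.cls_token_id
    config.decoder_start_token_id = start_id
    config.pad_token_id = tokenizer.pad_token_id
    config.eos_token_id = tokenizer.eos_token_id
    return config


def build_model(encoder_name, decoder_name):
    """Pair a vision encoder with a new text decoder; weights are downloaded."""
    # Imported here so the helper above can be used without transformers installed.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_name, decoder_name
    )
    tokenizer = AutoTokenizer.from_pretrained(decoder_name)
    configure_generation(model.config, tokenizer)
    return model, tokenizer
```

Calling `build_model("facebook/deit-base-patch16-224", decoder_name)` with a decoder checkpoint of your choice gives an untrained pairing: the cross-attention layers are newly initialized, so the model still has to be fine-tuned, and the label format (language token, text, EOS) has to be rebuilt with the new tokenizer.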