ABINet

Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Abstract

Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet for scene text recognition. Firstly, the autonomous principle blocks gradient flow between the vision and language models to enforce explicit language modeling. Secondly, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Thirdly, we propose an iterative correction scheme for the language model, which can effectively alleviate the impact of noisy input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method which can learn from unlabeled images effectively. Extensive experiments indicate that ABINet is superior on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. In addition, ABINet trained with ensemble self-training shows promising improvement toward human-level recognition.
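The iterative correction idea from the abstract can be illustrated with a toy example. This is a minimal sketch, not ABINet's implementation: the lexicon lookup stands in for the learned bidirectional cloze network, the max-confidence `fuse` stands in for the learned fusion module, and gradient blocking is not shown because nothing here is trained. All names (`LEXICON`, `cloze_language_model`, `fuse`, `iterative_correction`) are hypothetical.

```python
from typing import List, Tuple

# One (character, confidence) pair per text position.
Prediction = List[Tuple[str, float]]

# Hypothetical toy vocabulary standing in for a learned language model.
LEXICON = ["house", "mouse", "hours"]


def cloze_language_model(text: str) -> Prediction:
    """Mask each position in turn and guess it from the context on both sides."""
    guesses: Prediction = []
    for i, current in enumerate(text):
        best_char, best_score = current, 0.0
        for word in LEXICON:
            if len(word) != len(text):
                continue
            # Count agreeing characters, ignoring the masked position i.
            agree = sum(word[j] == text[j] for j in range(len(text)) if j != i)
            score = agree / max(len(text) - 1, 1)
            if score > best_score:
                best_char, best_score = word[i], score
        guesses.append((best_char, best_score))
    return guesses


def fuse(vision: Prediction, language: Prediction) -> Prediction:
    """Keep, per position, whichever prediction is more confident."""
    return [vis if vis[1] >= lang[1] else lang for vis, lang in zip(vision, language)]


def iterative_correction(vision: Prediction, num_iters: int = 3) -> str:
    """Feed the fused prediction back into the language model num_iters times."""
    text = "".join(ch for ch, _ in vision)
    for _ in range(num_iters):
        language = cloze_language_model(text)
        text = "".join(ch for ch, _ in fuse(vision, language))
    return text


# The vision model is unsure about the third character ("v" instead of "u").
noisy_vision = [("h", 0.9), ("o", 0.9), ("v", 0.3), ("s", 0.9), ("e", 0.9)]
corrected = iterative_correction(noisy_vision)
print(corrected)  # house
```

The language model only ever sees text, never image features, and its input on each pass is the fused output of the previous pass, which is what lets a noisy first guess be cleaned up over several iterations.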

Dataset

Train Dataset

trainset	instance_num	repeat_num	note
Syn90k	8919273	1	synth
SynthText	7239272	1	alphanumeric

Test Dataset

testset	instance_num	note
IIIT5K	3000	regular
SVT	647	regular
IC13	1015	regular
IC15	2077	irregular
SVTP	645	irregular
CT80	288	irregular

Results and models

methods	pretrained	IIIT5K (regular)	SVT (regular)	IC13-1015 (regular)	IC15-2077 (irregular)	SVTP (irregular)	CT80 (irregular)	download
ABINet-Vision	-	0.9523	0.9196	0.9369	0.7896	0.8403	0.8437	model | log
ABINet-Vision-TTA	-	0.9523	0.9196	0.9360	0.8175	0.8450	0.8542	-
ABINet	Pretrained	0.9603	0.9397	0.9557	0.8146	0.8868	0.8785	model | log
ABINet-TTA	Pretrained	0.9597	0.9397	0.9527	0.8426	0.8930	0.8854	-
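To compare methods with a single number, one option is a sample-weighted average over the six test sets. The sketch below assumes each reported accuracy is the fraction of correctly recognized words in the corresponding test set, and uses the instance counts from the Test Dataset table as weights. The weighted average is not a figure reported in this document; the helper name `weighted_average` is hypothetical.

```python
# Test-set sizes from the Test Dataset table.
TEST_SET_SIZES = {
    "IIIT5K": 3000, "SVT": 647, "IC13": 1015,
    "IC15": 2077, "SVTP": 645, "CT80": 288,
}

# Per-benchmark accuracies copied from the results table.
ACCURACIES = {
    "ABINet-Vision": {"IIIT5K": 0.9523, "SVT": 0.9196, "IC13": 0.9369,
                      "IC15": 0.7896, "SVTP": 0.8403, "CT80": 0.8437},
    "ABINet-Vision-TTA": {"IIIT5K": 0.9523, "SVT": 0.9196, "IC13": 0.9360,
                          "IC15": 0.8175, "SVTP": 0.8450, "CT80": 0.8542},
    "ABINet": {"IIIT5K": 0.9603, "SVT": 0.9397, "IC13": 0.9557,
               "IC15": 0.8146, "SVTP": 0.8868, "CT80": 0.8785},
    "ABINet-TTA": {"IIIT5K": 0.9597, "SVT": 0.9397, "IC13": 0.9527,
                   "IC15": 0.8426, "SVTP": 0.8930, "CT80": 0.8854},
}


def weighted_average(acc_by_set, sizes):
    """Average the accuracies, weighting each test set by its instance count."""
    total = sum(sizes[name] for name in acc_by_set)
    return sum(acc_by_set[name] * sizes[name] for name in acc_by_set) / total


averages = {
    method: weighted_average(acc, TEST_SET_SIZES)
    for method, acc in ACCURACIES.items()
}
for method, avg in averages.items():
    print(f"{method}: {avg:.4f}")
```

Because IIIT5K and IC15 together account for about two thirds of all test instances, they dominate this average; an unweighted mean over the six benchmarks would rank small sets such as CT80 equally.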

1. ABINet's encoder can be run and trained without the decoder and fuser. It is designed to recognize text as a stand-alone model, so it can serve as an independent text recognizer. We release it as ABINet-Vision.
2. About the pretrained model: MMOCR does not yet have a systematic pipeline for pretraining the language model (LM), so the LM weights are converted from [the official pretrained model](https://github.com/FangShancheng/ABINet). The weights of ABINet-Vision are used directly as the vision model of ABINet.
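The second note describes assembling the full model's initial weights from two sources. A minimal sketch of that idea with plain dictionaries follows. The function name `assemble_state_dict`, the `encoder.` / `decoder.` prefixes and the example keys are all assumptions made for illustration; MMOCR's actual conversion script and parameter names are not shown in this document.

```python
from typing import Any, Dict


def assemble_state_dict(
    vision_state: Dict[str, Any],
    language_state: Dict[str, Any],
    vision_prefix: str = "encoder.",    # hypothetical prefix for the vision model
    language_prefix: str = "decoder.",  # hypothetical prefix for the language model
) -> Dict[str, Any]:
    """Merge two sets of weights into one mapping under distinct prefixes."""
    merged: Dict[str, Any] = {}
    for key, value in vision_state.items():
        merged[vision_prefix + key] = value
    for key, value in language_state.items():
        new_key = language_prefix + key
        if new_key in merged:
            # Guard against silently overwriting a vision weight.
            raise KeyError(f"duplicate parameter name: {new_key}")
        merged[new_key] = value
    return merged


# Placeholder values stand in for real weight tensors.
merged = assemble_state_dict({"conv.weight": 1}, {"proj.weight": 2})
print(sorted(merged))
```

Any parameters that appear in neither source (for example the fuser, under this sketch's assumptions) would be left at their initial values and learned during training.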

Citation

@inproceedings{fang2021read,
  title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
  author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}