TY - JOUR
T1 - Two-dimensional multi-scale perceptive context for scene text recognition
AU - Li, Haojie
AU - Yang, Daihui
AU - Huang, Shuangping
AU - Lam, Kin-Man
AU - Jin, Lianwen
AU - Zhuang, Zhenzhou
N1 - Author Biography:
Lianwen Jin received the B.S. degree from the University of Science and Technology of China, Anhui, China, and the Ph.D. degree from the South China University of Technology, Guangzhou, China, in 1991 and 1996, respectively. He is currently a Professor with the School of Electronic and Information Engineering, South China University of Technology. He is the author of more than 200 scientific papers. Dr. Jin was a recipient of the New Century Excellent Talent Program award of the MOE in 2006 and the Guangdong Pearl River Distinguished Professor Award in 2011. His research interests include image processing, handwriting analysis and recognition, and intelligent systems.
Funding Information:
This work was supported in part by the Natural Science Foundation of China under Grant 61673182, Grant 61936003, Grant U1801262, and Grant U1636218, in part by the Guangdong Natural Science Foundation under Grant 2017A030312006, in part by the Science and Technology Program of Guangzhou, China, under Grant 201707010160 and Grant 201902010069, and in part by the Guangzhou Key Laboratory of Body Data Science under Grant 201605030011.
Publisher Copyright:
© 2020 Elsevier B.V.
PY - 2020/11/6
Y1 - 2020/11/6
N2 - Inspired by speech recognition, most recent state-of-the-art works cast scene text recognition as sequence prediction. As in most speech recognition problems, context modeling is considered a critical component of these methods for achieving better performance. However, they usually consider only a holistic or single-scale local sequence context, in a single dimension. In practice, scene texts, and hence their sequence contexts, may span arbitrarily across a two-dimensional (2-D) space and in any style, not limited to the horizontal direction. Moreover, contexts of various scales may jointly contribute to text recognition, in particular for irregular text. In our method, we consider the context in a 2-D manner and simultaneously perform context reasoning at various scales, from local to global. Based on this, we propose a new Two-Dimensional Multi-Scale Perceptive Context (TDMSPC) module, which performs multi-scale context learning along both the horizontal and vertical directions and then merges them, generating shape- and layout-dependent feature maps for scene text recognition. The proposed module can be readily inserted into existing sequence-based frameworks to replace their context learning mechanism. Furthermore, a new scene text recognition network, called TDMSPC-Net, is built by using the TDMSPC module as the building block of the encoder and adopting an attention-based LSTM as the decoder. Experiments on benchmark datasets show that the TDMSPC module substantially boosts the performance of existing sequence-based scene text recognizers, irrespective of the decoder or backbone network being used. The proposed TDMSPC-Net achieves state-of-the-art accuracy on all the benchmark datasets.
AB - Inspired by speech recognition, most recent state-of-the-art works cast scene text recognition as sequence prediction. As in most speech recognition problems, context modeling is considered a critical component of these methods for achieving better performance. However, they usually consider only a holistic or single-scale local sequence context, in a single dimension. In practice, scene texts, and hence their sequence contexts, may span arbitrarily across a two-dimensional (2-D) space and in any style, not limited to the horizontal direction. Moreover, contexts of various scales may jointly contribute to text recognition, in particular for irregular text. In our method, we consider the context in a 2-D manner and simultaneously perform context reasoning at various scales, from local to global. Based on this, we propose a new Two-Dimensional Multi-Scale Perceptive Context (TDMSPC) module, which performs multi-scale context learning along both the horizontal and vertical directions and then merges them, generating shape- and layout-dependent feature maps for scene text recognition. The proposed module can be readily inserted into existing sequence-based frameworks to replace their context learning mechanism. Furthermore, a new scene text recognition network, called TDMSPC-Net, is built by using the TDMSPC module as the building block of the encoder and adopting an attention-based LSTM as the decoder. Experiments on benchmark datasets show that the TDMSPC module substantially boosts the performance of existing sequence-based scene text recognizers, irrespective of the decoder or backbone network being used. The proposed TDMSPC-Net achieves state-of-the-art accuracy on all the benchmark datasets.
KW - Multi-scale perceptive context
KW - Scene text recognition
KW - Two-dimensional context
UR - http://www.scopus.com/inward/record.url?scp=85089400670&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2020.06.071
DO - 10.1016/j.neucom.2020.06.071
M3 - Journal article
AN - SCOPUS:85089400670
SN - 0925-2312
VL - 413
SP - 410
EP - 421
JO - Neurocomputing
JF - Neurocomputing
ER -