Text detection method based on text enhancement and multi-branch convolution
Abstract:
Text detection in natural scenes is a prerequisite for many industrial applications, yet the accuracy of common detection methods is unsatisfactory. This paper therefore proposes a neural network method based on text enhancement and multi-branch convolution for detecting text in natural-scene images. First, a text-region enhancement structure is added in front of the backbone network; it raises the feature values of text regions in the shallow layers, strengthening the network's ability to learn text features while suppressing the expression of background features. Second, to handle the large variation in the aspect ratio of scene text, a multi-branch convolution module is designed whose kernels approximate the shape of text to express differentiated receptive fields; a lightweight attention mechanism, whose parameter count is only six times the number of channels, supplements the network's learning of channel importance. Finally, the loss function is improved in both its classification and bounding-box terms: text pixels are weighted, and the smallest rectangle covering the predicted box and the label box is introduced to express their degree of overlap, improving the effectiveness of training on text datasets. Ablation and comparison experiments show that each improvement is effective: the method achieves F-measures of 83.3% on the ICDAR2015 dataset and 82.4% on the MSRA-TD500 dataset, and performs well on difficult samples such as blurred text, text under specular highlights, and dense text.
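The multi-branch convolution with text-shaped kernels and the lightweight channel attention described above can be sketched as follows. This is a hypothetical PyTorch reconstruction from the abstract alone: the branch kernel sizes (3×3, 1×5, 5×1), the module names, and the internal design of the attention are assumptions. The paper states only that the attention uses about six times as many parameters as there are channels, which the six per-channel weight vectors below reproduce.

```python
import torch
import torch.nn as nn

class LightweightChannelAttention(nn.Module):
    """Hypothetical channel attention with ~6*C parameters: six per-channel
    vectors and no fully connected layers, matching the reported budget."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_avg = nn.Parameter(torch.ones(channels))   # weight for avg-pooled stats
        self.w_max = nn.Parameter(torch.zeros(channels))  # weight for max-pooled stats
        self.b1 = nn.Parameter(torch.zeros(channels))     # first-stage bias
        self.w_gate = nn.Parameter(torch.ones(channels))  # gate weight
        self.b2 = nn.Parameter(torch.zeros(channels))     # gate bias
        self.scale = nn.Parameter(torch.ones(channels))   # output rescaling

    def forward(self, x):
        # Global average- and max-pooled channel descriptors, shape (N, C).
        avg = x.mean(dim=(2, 3))
        mx = x.amax(dim=(2, 3))
        z = torch.relu(self.w_avg * avg + self.w_max * mx + self.b1)
        gate = torch.sigmoid(self.w_gate * z + self.b2)
        # Reweight channels; broadcast gate back to (N, C, 1, 1).
        return x * (self.scale * gate).unsqueeze(-1).unsqueeze(-1)

class MultiBranchConv(nn.Module):
    """Parallel branches with kernels elongated like text lines (1xk, kx1)
    to cover the wide range of aspect ratios found in scene text."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.square = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.wide = nn.Conv2d(in_ch, out_ch, (1, 5), padding=(0, 2))
        self.tall = nn.Conv2d(in_ch, out_ch, (5, 1), padding=(2, 0))
        self.attn = LightweightChannelAttention(out_ch)

    def forward(self, x):
        y = self.square(x) + self.wide(x) + self.tall(x)
        return self.attn(torch.relu(y))
```

All three branches preserve spatial resolution, so their outputs can be summed before the attention gate is applied.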
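The improved loss described in the abstract can be illustrated with a small self-contained sketch. The GIoU-style use of the smallest rectangle enclosing both boxes, and the fixed text-pixel weight `w_text`, are assumptions read off the abstract; the paper's exact formulas may differ.

```python
import math

def enclosing_iou_loss(pred, gt):
    """Box loss penalising IoU by the smallest axis-aligned rectangle
    enclosing both boxes (GIoU-style). Boxes are (x1, y1, x2, y2)."""
    # Intersection and union of the two boxes.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    iou = inter / union
    # Smallest rectangle covering both the prediction and the label box.
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou

def weighted_bce(p, y, w_text=2.0):
    """Per-pixel cross-entropy that up-weights text pixels; w_text is an
    illustrative value, as the paper's weighting scheme is not given here."""
    eps = 1e-7
    w = w_text if y == 1 else 1.0
    return -w * (y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
```

For identical boxes the box loss is 0, and it grows past 1 as the boxes separate, so disjoint predictions still receive a useful gradient signal, which plain IoU loss cannot provide.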
Key words:
- neural networks
- text detection
- digital image processing
- attention mechanism
- loss function
Table 1. Experimental results on ICDAR2015 dataset
Table 2. Experimental results of each component
ResNet   Text enhancement   Multi-branch conv + LCEM   Improved loss   P/%    R/%    F/%
√        ×                  ×                          ×               85.8   82.8   84.3
√        √                  ×                          ×               86.2   83.6   84.9
√        ×                  √                          ×               88.3   82.4   85.2
√        √                  √                          ×               87.7   83.6   85.6
√        √                  √                          √               88.4   83.9   86.1
Table 3. Experimental results on MSRA-TD500 dataset