Natural scene text detection based on LFN
-
Abstract:
In the field of natural scene text detection, existing deep learning networks still suffer from false detections, missed detections, and inaccurate localization of text. To address this problem, a text detection algorithm based on a Large Receptive Field Feature Network (LFN) is designed. First, the lightweight ShuffleNet V2, which offers a better balance of speed and accuracy, is selected as the backbone, and a fine-grained feature fusion module is added to capture more hidden text feature information. Second, by analyzing the receptive fields of feature maps at different scales and comparing how the feature-map size obtained after normalizing these maps affects the results, a double fusion feature extraction module is constructed to extract multi-scale features from the input image, reducing the loss of text features and enlarging the receptive field. Finally, to handle the imbalance between positive and negative samples, Dice Loss is introduced into the differentiable binarization module, improving the accuracy of text localization. Experiments on the ICDAR2015 and CTW1500 datasets show that the network significantly improves text detection in both accuracy and speed. On ICDAR2015 the F1 score is 86.1%, 0.4% higher than the best-performing PSENet, at a speed of 50 fps, about 1.92 times that of DBNet, the fastest competing method; on CTW1500 the F1 score is 83.2%, 1% higher than PSENet, at 35 fps, about 1.65 times the speed of EAST.
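As a rough illustration of the loss design described above, the sketch below computes Dice Loss on the approximate binary map produced by a differentiable binarization head in the style of DBNet. It is a minimal sketch, not the paper's implementation: the amplification factor k = 50, the smoothing constant eps, and the tensor shapes are illustrative assumptions.

```python
import torch

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    # Approximate binary map B = 1 / (1 + exp(-k * (P - T))); k is an assumed
    # amplification factor, following the DBNet formulation.
    return torch.sigmoid(k * (prob_map - thresh_map))

def dice_loss(pred, target, eps=1e-6):
    # Dice Loss = 1 - 2|X ∩ Y| / (|X| + |Y|), which is largely insensitive to
    # the imbalance between (few) text pixels and (many) background pixels.
    pred = pred.contiguous().view(pred.size(0), -1)
    target = target.contiguous().view(target.size(0), -1)
    inter = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

# Hypothetical usage on a batch of single-channel maps (shapes are assumptions).
P = torch.rand(2, 1, 160, 160)                    # probability map
T = torch.rand(2, 1, 160, 160)                    # threshold map
gt = (torch.rand(2, 1, 160, 160) > 0.9).float()   # ground-truth text mask
loss = dice_loss(differentiable_binarization(P, T), gt)
```

Because the loss is driven by the overlap ratio rather than per-pixel counts, the large background area does not dominate the gradient, which is the motivation for using it against the positive/negative sample imbalance.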
-
Table 1. Overall structure of the ShuffleNet V2 network
Layer | Output size | KSize | S (stride) | R (repeat) | Output channels (0.5×) | Output channels (1×)
Image | 224×224 | | | | 3 | 3
Conv1 | 112×112 | 3×3 | 2 | 1 | 24 | 24
Max-Pool | 56×56 | 3×3 | 2 | 1 | 24 | 24
Stage 2 | 28×28 | | 2 | 1 | 48 | 116
  | 28×28 | | 1 | 3 | |
Stage 3 | 14×14 | | 2 | 1 | 96 | 232
  | 14×14 | | 1 | 7 | |
Stage 4 | 7×7 | | 2 | 1 | 192 | 464
  | 7×7 | | 1 | 3 | |
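The backbone in Table 1 follows the standard ShuffleNet V2 design, whose building block splits the channels into two branches and re-mixes them with a channel shuffle after concatenation. The snippet below is a minimal sketch of that shuffle operation only (the input shape and groups = 2 are illustrative assumptions), not the full backbone used in LFN.

```python
import torch

def channel_shuffle(x, groups=2):
    # Reshape (N, C, H, W) -> (N, g, C/g, H, W), swap the group and channel
    # axes, then flatten back so information mixes across the two branches.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# Example: a Stage-2 feature map of the 1x model in Table 1 (116 channels, 28x28).
feat = torch.randn(1, 116, 28, 28)
shuffled = channel_shuffle(feat, groups=2)   # same shape, channels interleaved
```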
Table 2. Ablation experiment on ICDAR2015
Method | Precision (%) | Recall (%) | F1 (%) | FPS
LFN-FGFF | 83.6 | 84.7 | 84.2 | 36
LFN-DFFE | 89.1 | 81.4 | 85.1 | 28
LFN | 89.7 | 82.8 | 86.1 | 50
PSENet | 86.9 | 84.5 | 85.7 | 1.6
Table 3. Normalization analysis on ICDAR2015
Size | Precision (%) | Recall (%) | F1 (%) | FPS
P2 | 90.1 | 78.4 | 83.8 | 30
P3 | 89.7 | 82.8 | 86.1 | 50
P4 | 87.2 | 65.6 | 74.9 | 53
P5 | 89.7 | 26.7 | 41.2 | 45
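Table 3 compares which pyramid scale the multi-scale feature maps are normalized to before prediction; P3 gives the best balance of F1 and speed, while the fine P2 scale is slow and the coarse P5 scale loses most small text. A minimal sketch of such a normalize-and-fuse step is given below, assuming FPN-style maps P2-P5 with 256 channels at strides 4, 8, 16 and 32 for a 640×640 input; the shapes and the concatenation-based fusion are illustrative assumptions, not the exact DFFE module.

```python
import torch
import torch.nn.functional as F

def normalize_and_fuse(feats, target_idx=1):
    # Resize every pyramid map to the spatial size of feats[target_idx]
    # (P3 by default, per Table 3) and concatenate along the channel axis.
    h, w = feats[target_idx].shape[-2:]
    resized = [F.interpolate(f, size=(h, w), mode='bilinear',
                             align_corners=False) for f in feats]
    return torch.cat(resized, dim=1)

# Hypothetical P2-P5 maps for a 640x640 input (strides 4, 8, 16, 32).
P2 = torch.randn(1, 256, 160, 160)
P3 = torch.randn(1, 256, 80, 80)
P4 = torch.randn(1, 256, 40, 40)
P5 = torch.randn(1, 256, 20, 20)
fused = normalize_and_fuse([P2, P3, P4, P5])   # -> (1, 1024, 80, 80)
```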
Table 4. Detection results on the ICDAR2015 dataset
-
[1] WANG R M, SANG N, DING D, et al. Text detection in natural scene image: a survey[J]. Acta Automatica Sinica, 2018, 44(12): 2113-2141. DOI: 10.16383/j.aas.2018.c170572.
[2] EPSHTEIN B, OFEK E, WEXLER Y. Detecting text in natural scenes with stroke width transform[C]//2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco: IEEE, 2010: 2963-2970.
[3] MATAS J, CHUM O, URBAN M, et al. Robust wide-baseline stereo from maximally stable extremal regions[J]. Image and Vision Computing, 2004, 22(10): 761-767. DOI: 10.1016/j.imavis.2004.02.006.
[4] TIAN S X, PAN Y F, HUANG C, et al. Text flow: a unified text detection system in natural scene images[C]//2015 IEEE International Conference on Computer Vision (ICCV). Santiago: IEEE, 2015: 4651-4659.
[5] XIE B H, QIN Y L, ZHANG Y J. Scene text detection based on learning active center contour model[J]. Computer Engineering, 2022, 48(3): 244-252. DOI: 10.19678/j.issn.1000-3428.0060828.
[6] YI R H, YANG S Q, WANG X Y, et al. Key technology and application of natural scene text detection[J]. Digital Printing, 2020(4): 1-11. DOI: 10.19370/j.cnki.cn10-1304/ts.2020.04.001.
[7] LI Y H, YAN J H, HU L. Natural scene text detection based on local and global dual-feature fusion[J]. Journal of Data Acquisition and Processing, 2022, 37(2): 415-425. DOI: 10.16337/j.1004-9037.2022.02.014.
[8] MA J Q, SHAO W Y, YE H, et al. Arbitrary-oriented scene text detection via rotation proposals[J]. IEEE Transactions on Multimedia, 2018, 20(11): 3111-3122. DOI: 10.1109/TMM.2018.2818020.
[9] LIAO M, SHI B, BAI X. TextBoxes++: a single-shot oriented scene text detector[J]. IEEE Transactions on Image Processing, 2018, 27(8): 3676-3690. DOI: 10.1109/TIP.2018.2825107.
[10] TIAN Z, HUANG W L, HE T, et al. Detecting text in natural image with connectionist text proposal network[C]//14th European Conference on Computer Vision. Amsterdam: Springer, 2016: 56-72.
[11] WANG W H, XIE E Z, LI X, et al. Shape robust text detection with progressive scale expansion network[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach: IEEE, 2019: 9328-9337.
[12] DENG D, LIU H F, LI X L, et al. PixelLink: detecting scene text via instance segmentation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2018.
[13] LONG S B, RUAN J Q, ZHANG W J, et al. TextSnake: a flexible representation for detecting text of arbitrary shapes[C]//15th European Conference on Computer Vision. Munich: Springer, 2018: 19-35.
[14] MA N N, ZHANG X Y, ZHENG H T, et al. ShuffleNet V2: practical guidelines for efficient CNN architecture design[C]//15th European Conference on Computer Vision. Munich: Springer, 2018: 122-138.
[15] ZHANG X Y, ZHOU X Y, LIN M X, et al. ShuffleNet: an extremely efficient convolutional neural network for mobile devices[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City: IEEE, 2018: 6848-6856.
[16] LIAO M H, WAN Z Y, YAO C, et al. Real-time scene text detection with differentiable binarization[C]//Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2020: 11474-11481.
[17] VATTI B R. A generic solution to polygon clipping[J]. Communications of the ACM, 1992, 35(7): 56-63. DOI: 10.1145/129902.129906.
[18] MILLETARI F, NAVAB N, AHMADI S A. V-Net: fully convolutional neural networks for volumetric medical image segmentation[C]//4th International Conference on 3D Vision. Stanford: IEEE, 2016: 565-571.
[19] SHRIVASTAVA A, GUPTA A, GIRSHICK R. Training region-based object detectors with online hard example mining[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 761-769.
[20] KARATZAS D, GOMEZ-BIGORDA L, NICOLAOU A, et al. ICDAR 2015 competition on robust reading[C]//2015 13th International Conference on Document Analysis and Recognition. Tunis: IEEE, 2015: 1156-1160.
[21] LIU Y L, JIN L W, ZHANG S T, et al. Detecting curve text in the wild: new dataset and new solution[J]. arXiv: 1712.02170, 2017.
[22] LI H, WANG X L, XIANG X G. Scene text detection based on triple segmentation[J]. Computer Science, 2020, 47(11): 142-147. DOI: 10.11896/jsjkx.200800157.
[23] ZHOU X Y, YAO C, WEN H, et al. EAST: an efficient and accurate scene text detector[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu: IEEE, 2017: 2642-2651.
-