Visual gaze target tracking method based on spatiotemporal attention mechanism and joint attention
-
Abstract: Current visual gaze tracking methods tend to ignore the connection between people and the scene, and lack analysis and detection of joint attention, which leads to unsatisfactory detection performance. To address these problems, this paper proposes a visual gaze target tracking method based on a spatiotemporal attention mechanism and joint attention. For any given image, the method first extracts a person's head features with a deep neural network; interaction between the scene and the head is then added to enhance image saliency, and an enhanced attention module filters out interference from depth and field of view. In addition, the attention of the other people in the scene is pushed into the attended region, augmenting the standard saliency model. With the spatiotemporal attention mechanism, candidate targets, gaze direction, and temporal frame constraints are combined to identify shared attention locations, and the saliency information is used to better detect and localize joint attention. Finally, the attention in each image is visualized as a heat map. Experiments show that the model effectively infers dynamic attention and joint attention in video, with good results.
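As a rough illustration of the pipeline summarized above, the following is a minimal PyTorch-style sketch, not the authors' released code: the module names, layer sizes, and the simple convolution standing in for the temporal term are assumptions, intended only to show how head features, head-conditioned scene saliency, the attention gate, and the cross-frame constraint could be wired into a per-frame gaze heat map.

```python
# Minimal sketch of a two-pathway gaze-target model (assumed structure, not the paper's code).
# Inputs: scene frame, cropped head image, and a binary head-position mask.
import torch
import torch.nn as nn

class GazeTargetSketch(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Scene pathway: encodes the full frame together with the head-position mask (4 channels).
        self.scene_enc = nn.Sequential(
            nn.Conv2d(4, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        # Head pathway: encodes the cropped head image into gaze-direction features.
        self.head_enc = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Attention gate: head features reweight the scene features, filtering irrelevant regions.
        self.attn = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        # Temporal term: a single conv over consecutive-frame features stands in for a recurrent unit.
        self.temporal = nn.Conv2d(2 * feat_dim, feat_dim, 3, padding=1)
        # Decoder produces a single-channel gaze heat map.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1))

    def forward(self, frame, head_crop, head_mask, prev_feat=None):
        scene = self.scene_enc(torch.cat([frame, head_mask], dim=1))  # B x C x h x w
        head = self.head_enc(head_crop).flatten(1)                    # B x C
        gate = self.attn(head)[:, :, None, None]                      # channel-wise gate
        fused = scene * gate                                          # head-conditioned scene features
        if prev_feat is not None:                                     # temporal constraint across frames
            fused = self.temporal(torch.cat([fused, prev_feat], dim=1))
        heatmap = self.decoder(fused)                                 # B x 1 x h x w gaze heat map
        return heatmap, fused

# Example forward pass on dummy tensors (224x224 frame, 64x64 head crop).
model = GazeTargetSketch()
hm, feat = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 64, 64), torch.zeros(1, 1, 224, 224))
```

A full implementation would use pretrained backbones for both pathways and a proper recurrent unit (e.g., a convolutional LSTM [17]) in place of the stand-in temporal convolution; the sketch only fixes the data flow.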
-
Key words:
- convolutional neural network
- deep learning
- visual tracking
- target detection
- joint attention
-
Table 1. Performance comparison on the GazeFollow dataset
Method               AUC    Dist   Min Dist  Ang
Center               0.633  0.313  0.230     49.0°
Judd et al. [18]     0.711  0.337  0.250     54.0°
Random               0.504  0.484  0.391     69.0°
Fixed bias           0.674  0.306  0.219     48.0°
SVM + one grid       0.758  0.276  0.193     43.0°
SVM + shift grid     0.788  0.268  0.186     40.0°
Chong et al. [6]     0.896  0.187  0.112     n/a
SalGAN [16]          0.848  0.238  0.192     36.7°
Recasens et al. [3]  0.878  0.190  0.113     24.0°
Recasens et al.*     0.881  0.175  0.101     22.5°
Lian et al. [5]      0.906  0.145  0.081     17.6°
One human            0.924  0.096  0.040     11.0°
Ours                 0.889  0.179  0.106     19.2°
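For reference, the metrics in Tables 1 and 2 are the ones commonly reported on GazeFollow: AUC scores the predicted heat map against the annotated fixation locations, Dist and Min Dist are L2 distances from the predicted gaze point to the mean and the closest annotated gaze point, and Ang is the angle between the predicted and annotated gaze directions drawn from the eye position. A minimal sketch of the distance and angle metrics follows; the function name and the normalized-coordinate convention are assumptions.

```python
import numpy as np

def gaze_metrics(pred_point, eye_point, gt_points):
    """Dist / Min Dist / Ang for one sample, in normalized [0, 1] image coordinates.

    pred_point : (2,) predicted gaze target
    eye_point  : (2,) eye (head) location of the person
    gt_points  : (N, 2) gaze targets marked by N annotators
    """
    pred_point = np.asarray(pred_point, dtype=float)
    eye_point = np.asarray(eye_point, dtype=float)
    gt_points = np.atleast_2d(np.asarray(gt_points, dtype=float))

    dists = np.linalg.norm(gt_points - pred_point, axis=1)
    dist = np.linalg.norm(gt_points.mean(axis=0) - pred_point)  # Dist: to the mean annotation
    min_dist = dists.min()                                      # Min Dist: to the closest annotation

    # Ang: angle between predicted and annotated gaze directions from the eye position.
    pred_dir = pred_point - eye_point
    gt_dir = gt_points.mean(axis=0) - eye_point
    cos = np.dot(pred_dir, gt_dir) / (np.linalg.norm(pred_dir) * np.linalg.norm(gt_dir) + 1e-8)
    ang = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return dist, min_dist, ang
```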
Table 2. Performance comparison with individual components disabled
Method            AUC    Dist   Min Dist  Ang
No image          0.803  0.252  0.174     25.1°
No head position  0.816  0.249  0.161     35.3°
No head feature   0.732  0.273  0.203     32.4°
No eltwise        0.869  0.196  0.138     14.6°
No attention map  0.659  0.291  0.242     42.3°
No fusion         0.834  0.225  0.156     19.6°
No temporal       0.836  0.219  0.151     15.1°
Ours (full)       0.889  0.179  0.106     19.2°
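The "No temporal" row removes the cross-frame component of the model. This excerpt does not spell out the exact recurrent unit; a convolutional LSTM, as in the cited work [17] and in Chong et al. [21], is one standard choice for carrying gaze attention across frames, and a minimal sketch of such a cell is given below (the channel sizes and single-cell formulation are assumptions).

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One-step convolutional LSTM cell for propagating gaze features across frames."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # A single convolution produces the input/forget/output/candidate gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):
        if state is None:
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
            c = torch.zeros_like(h)
        else:
            h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, (h, c)

# Example: run per-frame fused features through the cell in temporal order.
cell = ConvLSTMCell(in_ch=256, hid_ch=256)
state = None
for frame_feat in torch.randn(5, 1, 256, 56, 56):  # 5 dummy frames
    h, state = cell(frame_feat, state)              # h would be decoded into that frame's heat map
```

Disabling this recurrence and decoding each frame independently corresponds to the "No temporal" setting.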
Table 3. Effect of the attention mechanism on joint attention localization on the VideoCoAtt dataset
Method                         L2 Distance
No Encoder                     72.15
Encoder and Generator jointly  70.36
Reduce learning rate           68.59
Channel-wise attention         61.87
Spatial attention              64.58

Note: (1) "No Encoder" freezes the encoder and trains only the generator; (2) "Encoder and Generator jointly" trains the encoder and the generator together; (3) "Reduce learning rate" uses transfer learning for the encoder and trains the generator from scratch.
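The three training regimes in the note correspond to how the encoder and generator parameters are handed to the optimizer. A minimal PyTorch sketch follows; the stand-in modules and the learning rates are assumptions used only to illustrate the three configurations.

```python
import torch
import torch.nn as nn

# Stand-in modules; in the real model these would be the pretrained encoder
# and the heat-map generator.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
generator = nn.Sequential(nn.Conv2d(64, 1, 3, padding=1))

# (1) "No Encoder": freeze the encoder and train only the generator.
for p in encoder.parameters():
    p.requires_grad = False
opt_no_encoder = torch.optim.Adam(generator.parameters(), lr=1e-4)

# (2) "Encoder and Generator jointly": a single optimizer over both modules.
for p in encoder.parameters():
    p.requires_grad = True
opt_joint = torch.optim.Adam(
    list(encoder.parameters()) + list(generator.parameters()), lr=1e-4)

# (3) "Reduce learning rate": fine-tune the pretrained encoder with a smaller
#     learning rate while the generator is trained from scratch at the base rate.
opt_transfer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": generator.parameters(), "lr": 1e-4},
])
```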
Table 4. Effect of the attention mechanism on joint attention localization on the VideoCoAtt dataset
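The channel-wise and spatial attention variants compared above follow the general pattern of the convolutional block attention module [12]: channel attention reweights feature channels from globally pooled statistics, while spatial attention reweights locations from channel-pooled maps. A minimal sketch is given below; the reduction ratio and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Reweights channels using globally pooled statistics (CBAM-style)."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(),
            nn.Linear(ch // reduction, ch))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        w = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * w

class SpatialAttention(nn.Module):
    """Reweights spatial locations using channel-pooled maps (CBAM-style)."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # B x 1 x H x W
        mx = x.amax(dim=1, keepdim=True)     # B x 1 x H x W
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

# Example: apply both to a feature map from the scene pathway.
feats = torch.randn(1, 256, 28, 28)
out = SpatialAttention()(ChannelAttention(256)(feats))
```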
-
[1] IYER S J, SARANYA P, SIVARAM M. Human pose-estimation and low-cost interpolation for text to Indian sign language[C]//Proceedings of the 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence). Noida: IEEE, 2021: 130-135. DOI: 10.1109/Confluence51648.2021.9377047.
[2] HUANG Y F, CAI M J, SATO Y. An ego-vision system for discovering human joint attention[J]. IEEE Transactions on Human-Machine Systems, 2020, 50(4): 306-316. DOI: 10.1109/THMS.2020.2965429.
[3] RECASENS A, VONDRICK C, KHOSLA A, et al. Following gaze in video[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 1444-1452. DOI: 10.1109/ICCV.2017.160.
[4] GORJI S, CLARK J J. Attentional push: a deep convolutional network for augmenting image salience with shared attention modeling in social scenes[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 3472-3481. DOI: 10.1109/CVPR.2017.370.
[5] LIAN D Z, YU Z H, GAO S H. Believe it or not, we know what you are looking at![C]//Proceedings of the 14th Asian Conference on Computer Vision. Perth: Springer, 2018: 35-50. DOI: 10.1007/978-3-030-20893-6_3.
[6] CHONG E, RUIZ N, WANG Y X, et al. Connecting gaze, scene, and attention: generalized attention estimation via joint modeling of gaze and scene saliency[C]//Proceedings of the 15th European Conference on Computer Vision. Munich: Springer, 2018: 397-412. DOI: 10.1007/978-3-030-01228-1_24.
[7] HE M, XU D W. Fault-tolerant neural network training framework based on client-server[J]. Microelectronics & Computer, 2021, 38(10): 73-78. DOI: 10.19304/J.ISSN1000-7180.2021.0035.
[8] RANFTL R, LASINGER K, HAFNER D, et al. Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(3): 1623-1637. DOI: 10.1109/TPAMI.2020.3019967.
[9] YANG M, JIA X, YIN H D, et al. Object tracking algorithm based on Siamese network with combined attention[J]. Chinese Journal of Scientific Instrument, 2021, 42(1): 127-136. DOI: 10.19650/j.cnki.cjsi.J2006906.
[10] XIA J Y, TIAN J Q, XING J K, et al. Social data assisted multi-modal video analysis for saliency detection[C]//Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona: IEEE, 2020: 2278-2282. DOI: 10.1109/ICASSP40776.2020.9053705.
[11] WANG F, JIANG M Q, QIAN C, et al. Residual attention network for image classification[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 6450-6458. DOI: 10.1109/CVPR.2017.683.
[12] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]//Proceedings of the 15th European Conference on Computer Vision. Munich: Springer, 2018: 3-19. DOI: 10.1007/978-3-030-01234-2_1.
[13] LAI Y Z, CHEN X Y, LI X D, et al. Research on saliency prediction of GAN network based on residual structure[J]. Microelectronics & Computer, 2021, 38(8): 95-100. DOI: 10.19304/j.cnki.issn1000_7180.2021.08.015.
[14] CHEN Y Y, ZHANG W G, WANG S H, et al. Saliency-based spatiotemporal attention for video captioning[C]//Proceedings of the 2018 IEEE Fourth International Conference on Multimedia Big Data. Xi'an: IEEE, 2018: 1-8. DOI: 10.1109/BigMM.2018.8499257.
[15] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 936-944. DOI: 10.1109/CVPR.2017.106.
[16] PAN J T, FERRER C C, MCGUINNESS K, et al. SalGAN: visual saliency prediction with generative adversarial networks[J]. arXiv: 1701.01081, 2018.
[17] MATEO-GARCÍA G, ADSUARA J E, PÉREZ-SUAY A, et al. Convolutional long short-term memory network for multitemporal cloud detection over landmarks[C]//Proceedings of 2019 IEEE International Geoscience and Remote Sensing Symposium. Yokohama: IEEE, 2019: 210-213. DOI: 10.1109/IGARSS.2019.8897832.
[18] BYLINSKII Z, JUDD T, OLIVA A, et al. What do different evaluation metrics tell us about saliency models?[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(3): 740-757. DOI: 10.1109/TPAMI.2018.2815601.
[19] FAN L F, CHEN Y X, WEI P, et al. Inferring shared attention in social scene videos[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6460-6468. DOI: 10.1109/CVPR.2018.00676.
[20] SÜMER Ö, GERJETS P, TRAUTWEIN U, et al. Attention flow: end-to-end joint attention estimation[C]//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision. Snowmass: IEEE, 2020: 3316-3325. DOI: 10.1109/WACV45572.2020.9093515.
[21] CHONG E, WANG Y X, RUIZ N, et al. Detecting attended visual targets in video[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 5395-5405. DOI: 10.1109/CVPR42600.2020.00544.