

Visual gaze target tracking method based on spatiotemporal attention mechanism and joint attention

WANG Zhijie, REN Jian, LIAO Lei

Citation: WANG Zhijie, REN Jian, LIAO Lei. Visual gaze target tracking method based on spatiotemporal attention mechanism and joint attention[J]. Microelectronics & Computer, 2022, 39(11): 45-53. doi: 10.19304/J.ISSN1000-7180.2022.0148


doi: 10.19304/J.ISSN1000-7180.2022.0148
Funding: 

National Natural Science Foundation of China (61901289)

Details
    About the authors:

    WANG Zhijie  Male, born in 1996, M.S. candidate. His research interest is visual tracking.

    REN Jian  Male, born in 1996, M.S. candidate. His research interest is lightweight model detection.

    Corresponding author:

    LIAO Lei (corresponding author)  Male, born in 1974, M.S., professor. His research interest is machine learning. E-mail: liaolei@sicnu.edu.cn

  • CLC number: TP391


  • Abstract:

    Current visual tracking techniques tend to overlook the relationship between the people and the scene image, and lack analysis and detection of joint attention, which leads to unsatisfactory detection performance. To address this, a visual gaze target tracking method based on a spatiotemporal attention mechanism and joint attention is proposed. For any given image, a deep neural network first extracts the head features of the people in the scene; modeling the interaction between the scene and the head helps enhance image saliency, and a reinforced attention module is introduced to filter out distracting information in depth and field of view. In addition, the attention of the other people in the scene is taken into account for the attended region, and attentional push is used to augment the standard saliency model. With the spatiotemporal attention mechanism, constraints from candidate targets, gaze direction and temporal frames are combined to identify shared locations, and the saliency information allows joint attention to be detected and localized more accurately. Finally, the attention in the image is visualized as a heat map. Experiments show that the model can effectively infer dynamic attention and joint attention in videos and achieves good performance.
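    The reinforced attention module mentioned in the abstract appears to correspond to the channel feature attention and convolutional spatial attention blocks of Fig. 2, which follow the general design of CBAM [12]. The sketch below is a minimal illustration of that channel-then-spatial attention pattern, assuming a PyTorch feature map; the class names, channel count and reduction ratio are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a channel + spatial attention block in the spirit of
# CBAM [12] and Fig. 2; layer names and hyper-parameters are assumptions.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Global average- and max-pooled descriptors share one MLP.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel-wise average and max maps are concatenated and convolved
        # into a single spatial attention map.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))

class AttentionBlock(nn.Module):
    """Channel attention followed by spatial attention (Fig. 2 style)."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

# Example: refine a scene feature map of shape (N, C, H, W).
feat = torch.randn(2, 256, 28, 28)
refined = AttentionBlock(256)(feat)
```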

     

  • Figure 1.  Gaze target detection architecture

    Figure 2.  Channel feature attention and convolutional spatial attention blocks

    Figure 3.  Example of gaze target detection

    Figure 4.  Visualizing head attention

    Figure 5.  Detection of gaze targets

    Table 1.   Performance comparison on the GazeFollow dataset

    Method AUC Dist Min Dist Ang
    Center 0.633 0.313 0.230 49.0°
    Judd et al.[18] 0.711 0.337 0.250 54.0°
    Random 0.504 0.484 0.391 69.0°
    Fixed bias 0.674 0.306 0.219 48.0°
    SVM + one grid 0.758 0.276 0.193 43.0°
    SVM + shift grid 0.788 0.268 0.186 40.0°
    Chong et al.[6] 0.896 0.187 0.112 n/a
    SalGAN[16] 0.848 0.238 0.192 36.7°
    Recasens et al.[3] 0.878 0.190 0.113 24.0°
    Recasens et al.* 0.881 0.175 0.101 22.5°
    Lian et al. [5] 0.906 0.145 0.081 17.6°
    One human 0.924 0.096 0.040 11.0°
    Ours 0.889 0.179 0.106 19.2°

    Table 2.   Performance comparison with individual components disabled

    Method AUC Dist Min Dist Ang
    No image 0.803 0.252 0.174 25.1°
    No head position 0.816 0.249 0.161 35.3°
    No head feature 0.732 0.273 0.203 32.4°
    No eltwise 0.869 0.196 0.138 14.6°
    No attention map 0.659 0.291 0.242 42.3°
    No fusion 0.834 0.225 0.156 19.6°
    No temporal 0.836 0.219 0.151 15.1°
    Ours Full 0.889 0.179 0.106 19.2°

    Table 3.   Effect of attention mechanism on joint attention localization on the VideoCoAtt dataset

    Method L2 Distance
    No Encoder 72.15
    Encoder and Generator jointly 70.36
    Reduce learning rate 68.59
    Channel-wise attention 61.87
    Spatial attention 64.58
    #Note: (1) No Encoder freezes the encoder and trains only the generator; (2) Encoder and Generator jointly trains both the encoder and the generator; (3) Reduce learning rate uses transfer learning for the encoder and trains the generator from scratch. (A minimal sketch of these three training regimes is given after Table 4.)

    Table 4.   Effect of attention mechanism on joint attention localization on the VideoCoAtt dataset

    Method AUC L2 Distance
    Random 50.8 286
    Fixed Bias 52.4 122
    GazeFollow 58.7 102
    Raw Image 52.3 188
    Only Gaze 64.0 108
    Gaze + Saliency 59.4 83
    Gaze + Saliency + LSTM 66.2 71
    Only RP 58.0 110
    Gaze + RP 68.5 74
    Gaze + RP + LSTM[19] 71.4 62
    Sümer[20] 78.1 63
    VideoAtt* [21] 83.3 57
    Ours 80.1 59
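    For reference, the three training regimes compared in the note to Table 3 differ only in which parameters are optimized and at what learning rate. The following is a minimal, hypothetical PyTorch sketch of that setup, not the authors' code; `encoder`, `generator` and the learning rates are assumptions.

```python
# Hypothetical sketch of the three training regimes in the Table 3 note;
# `encoder` and `generator` stand for any pretrained backbone and heatmap head.
import torch

def build_optimizer(encoder, generator, regime, lr=1e-4):
    if regime == "no_encoder":
        # (1) Freeze the encoder, train only the generator.
        for p in encoder.parameters():
            p.requires_grad = False
        params = generator.parameters()
    elif regime == "joint":
        # (2) Train encoder and generator together.
        params = list(encoder.parameters()) + list(generator.parameters())
    elif regime == "reduce_lr":
        # (3) Transfer-learn the encoder at a reduced learning rate while the
        #     generator is trained from scratch at the full rate.
        return torch.optim.Adam([
            {"params": encoder.parameters(), "lr": lr * 0.1},
            {"params": generator.parameters(), "lr": lr},
        ])
    else:
        raise ValueError(f"unknown regime: {regime}")
    return torch.optim.Adam(params, lr=lr)
```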
  • [1] IYER S J, SARANYA P, SIVARAM M. Human pose-estimation and low-cost interpolation for text to Indian sign language[C]//Proceedings of the 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence). Noida: IEEE, 2021: 130-135. DOI: 10.1109/Confluence51648.2021.9377047.
    [2] HUANG Y F, CAI M J, SATO Y. An ego-vision system for discovering human joint attention[J]. IEEE Transactions on Human-Machine Systems, 2020, 50(4): 306-316. DOI: 10.1109/THMS.2020.2965429.
    [3] RECASENS A, VONDRICK C, KHOSLA A, et al. Following gaze in video[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 1444-1452. DOI: 10.1109/ICCV.2017.160.
    [4] GORJI S, CLARK J J. Attentional push: a deep convolutional network for augmenting image salience with shared attention modeling in social scenes[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 3472-3481. DOI: 10.1109/CVPR.2017.370.
    [5] LIAN D Z, YU Z H, GAO S H. Believe it or not, we know what you are looking at![C]//Proceedings of the 14th Asian Conference on Computer Vision. Perth: Springer, 2018: 35-50. DOI: 10.1007/978-3-030-20893-6_3.
    [6] CHONG E, RUIZ N, WANG Y X, et al. Connecting gaze, scene, and attention: Generalized attention estimation via joint modeling of gaze and scene saliency[C]//Proceedings of the 15th European Conference on Computer Vision. Munich: Springer, 2018: 397-412. DOI: 10.1007/978-3-030-01228-1_24.
    [7] HE M, XU D W. Fault-tolerant neural network training framework based on client-server[J]. Microelectronics & Computer, 2021, 38(10): 73-78. DOI: 10.19304/J.ISSN1000-7180.2021.0035.
    [8] RANFTL R, LASINGER K, HAFNER D, et al. Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(3): 1623-1637. DOI: 10.1109/TPAMI.2020.3019967.
    [9] YANG M, JIA X, YIN H D, et al. Object tracking algorithm based on Siamese network with combined attention[J]. Chinese Journal of Scientific Instrument, 2021, 42(1): 127-136. DOI: 10.19650/j.cnki.cjsi.J2006906.
    [10] XIA J Y, TIAN J Q, XING J K, et al. Social data assisted multi-modal video analysis for saliency detection[C]//Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona: IEEE, 2020: 2278-2282. DOI: 10.1109/ICASSP40776.2020.9053705.
    [11] WANG F, JIANG M Q, QIAN C, et al. Residual attention network for image classification[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 6450-6458. DOI: 10.1109/CVPR.2017.683.
    [12] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]//Proceedings of the 15th European Conference on Computer Vision. Munich: Springer, 2018: 3-19. DOI: 10.1007/978-3-030-01234-2_1.
    [13] LAI Y Z, CHEN X Y, LI X D, et al. Research on saliency prediction of GAN network based on residual structure[J]. Microelectronics & Computer, 2021, 38(8): 95-100. DOI: 10.19304/j.cnki.issn1000_7180.2021.08.015.
    [14] CHEN Y Y, ZHANG W G, WANG S H, et al. Saliency-based spatiotemporal attention for video captioning[C]//Proceedings of the 2018 IEEE Fourth International Conference on Multimedia Big Data. Xi'an: IEEE, 2018: 1-8. DOI: 10.1109/BigMM.2018.8499257.
    [15] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 936-944. DOI: 10.1109/CVPR.2017.106.
    [16] PAN J T, FERRER C C, MCGUINNESS K, et al. SalGAN: visual saliency prediction with generative adversarial networks[J]. arXiv: 1701.01081, 2018.
    [17] MATEO-GARCÍA G, ADSUARA J E, PÉREZ-SUAY A, et al. Convolutional long short-term memory network for multitemporal cloud detection over landmarks[C]//Proceedings of 2019 IEEE International Geoscience and Remote Sensing Symposium. Yokohama: IEEE, 2019: 210-213. DOI: 10.1109/IGARSS.2019.8897832.
    [18] BYLINSKII Z, JUDD T, OLIVA A, et al. What do different evaluation metrics tell us about saliency models?[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(3): 740-757. DOI: 10.1109/TPAMI.2018.2815601.
    [19] FAN L F, CHEN Y X, WEI P, et al. Inferring shared attention in social scene videos[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6460-6468. DOI: 10.1109/CVPR.2018.00676.
    [20] SÜMER Ö, GERJETS P, TRAUTWEIN U, et al. Attention flow: end-to-end joint attention estimation[C]//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision. Snowmass: IEEE, 2020: 3316-3325. DOI: 10.1109/WACV45572.2020.9093515.
    [21] CHONG E, WANG Y X, RUIZ N, et al. Detecting attended visual targets in video[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 5395-5405. DOI: 10.1109/CVPR42600.2020.00544.
Publication history
  • Received:  2022-03-04
  • Revised:  2022-03-29
  • Published online:  2022-11-29
