

Visual question answering model based on attention feature fusion

LI Kuan, ZHANG Rongfen, LIU Yuhong, LU Xinxin

Citation: LI Kuan, ZHANG Rongfen, LIU Yuhong, LU Xinxin. Visual question answering model based on attention feature fusion[J]. Microelectronics & Computer, 2022, 39(4): 83-90. doi: 10.19304/J.ISSN1000-7180.2021.1102


doi: 10.19304/J.ISSN1000-7180.2021.1102
Funding:

Science and Technology Foundation of Guizhou Province, Grant No. 黔科合基础-ZK[2021]重点001

Details
    Author biographies:

    LI Kuan, male, b. 1993, M.S. candidate. Research interests: image processing and deep learning.

    ZHANG Rongfen, female, b. 1978, Ph.D., professor. Research interests: artificial intelligence, image processing, and computer vision.

    LU Xinxin, female, b. 1997, M.S. candidate. Research interests: big data and deep learning.

    Corresponding author:

    LIU Yuhong, male, b. 1963, M.S., professor. Research interests: big data, image processing, and computer vision. E-mail: 1459539967@qq.com

  • CLC number: TP391


  • Abstract:

    With the rise and continued development of deep learning, research on visual question answering (VQA) has made significant progress. Many current VQA models introduce attention mechanisms and iterative operations to extract the correlations between image regions and high-frequency question words, but they are less effective at capturing the spatial-semantic associations between the image and the question, which limits answer accuracy. To address this, a VQA model based on the MobileNetV3 network and attentional feature fusion is proposed. First, to optimize the image feature extraction module, the MobileNetV3 network is adopted and a spatial pyramid pooling structure is added, reducing the model's computational complexity while maintaining its accuracy. In addition, the output classifier is improved by replacing its feature fusion step with an attention-based feature fusion connection, which raises answer accuracy. Finally, comparative experiments on the public VQA 2.0 dataset show that the proposed model outperforms current mainstream models.
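The abstract names attentional feature fusion (AFF) as the replacement for the classifier's plain fusion step. The paper's exact layer configuration is not reproduced on this page, so the snippet below is a minimal PyTorch sketch of AFF as it is commonly defined in the literature: two same-shape features are weighted by a multi-scale channel attention module computed on their sum. The channel count and reduction ratio are illustrative assumptions.

```python
# Minimal sketch of attentional feature fusion (AFF); layer sizes are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention: global and local channel context."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        # Local context: pointwise convolutions at every spatial position.
        self.local = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        # Global context: squeeze the map to 1x1 before the convolutions.
        self.glob = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
        )

    def forward(self, x):
        return torch.sigmoid(self.local(x) + self.glob(x))

class AFF(nn.Module):
    """Fuse two same-shape feature maps with MS-CAM attention weights."""
    def __init__(self, channels: int):
        super().__init__()
        self.attn = MSCAM(channels)

    def forward(self, x, y):
        w = self.attn(x + y)          # weights in (0, 1), shape (B, C, H, W)
        return w * x + (1.0 - w) * y  # attention-weighted fusion

# Usage: fuse two 512-channel feature maps of matching shape.
fused = AFF(512)(torch.randn(2, 512, 13, 13), torch.randn(2, 512, 13, 13))
```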


  • Figure 1.  Overall network architecture

    Figure 2.  GRU model

    Figure 3.  Spatial pyramid pooling structure

    Figure 4.  Output classifier models: (a) basic output classifier; (b) improved output classifier

    Figure 5.  AFF architecture

    Figure 6.  Change in accuracy during training of the proposed model

    Figure 7.  Results of the proposed model on the test-dev and test-std test sets

    Figure 8.  Comparison of the three models on VQA examples: (a) simple scene; (b) more complex scene; (c) complex scene
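    Figure 2 indicates that questions are encoded with a GRU. As a minimal sketch of such a question encoder (the vocabulary size, embedding width, and hidden size below are illustrative assumptions, not the paper's settings):

```python
# Minimal sketch of a GRU question encoder (cf. Figure 2); all sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size: int = 15000, embed_dim: int = 300,
                 hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):  # tokens: (B, T) word indices
        x = self.embed(tokens)  # (B, T, embed_dim)
        _, h = self.gru(x)      # final hidden state: (1, B, hidden_dim)
        return h.squeeze(0)     # (B, hidden_dim) question feature

# Usage: encode a batch of two 14-token questions.
q = QuestionEncoder()(torch.randint(1, 15000, (2, 14)))
```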

    Table 1.  MobileNetV3-Large network structure

    Input        Operator          Exp size  #Out  NL
    416×416×3    conv2d            -         16    HS
    208×208×16   bneck, 3×3        16        16    RE
    208×208×16   bneck, 3×3        64        24    RE
    104×104×24   bneck, 3×3        72        24    RE
    104×104×24   bneck, 5×5        72        40    RE
    52×52×40     bneck, 5×5        120       40    RE
    52×52×40     bneck, 5×5        120       40    RE
    52×52×40     bneck, 3×3        240       80    HS
    26×26×80     bneck, 3×3        200       80    HS
    26×26×80     bneck, 3×3        184       80    HS
    26×26×80     bneck, 3×3        184       80    HS
    26×26×80     bneck, 3×3        480       112   HS
    26×26×112    bneck, 3×3        672       112   HS
    26×26×112    bneck, 5×5        672       160   HS
    13×13×160    bneck, 5×5        960       160   HS
    13×13×160    bneck, 5×5        960       160   HS
    13×13×160    conv2d, 1×1       -         960   HS
    13×13×960    pool, 7×7         -         -     -
    1×1×960      conv2d, 1×1       -         1280  HS
    1×1×1280     conv2d, 1×1, NBN  -         k     -
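    Table 1 shows the backbone producing a 13×13×960 feature map for a 416×416 input; the spatial pyramid pooling structure of Figure 3 turns such a map into a fixed-length vector regardless of input resolution. A minimal sketch, assuming {1, 2, 4} pooling grids (the paper's bin sizes are not given on this page):

```python
# Minimal sketch of spatial pyramid pooling; the {1, 2, 4} grid sizes are
# an assumption for illustration.
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x: torch.Tensor, bins=(1, 2, 4)) -> torch.Tensor:
    """Pool (B, C, H, W) at several grid sizes and concatenate the results
    into a fixed-length vector of size C * sum(b * b for b in bins)."""
    parts = []
    for b in bins:
        pooled = F.adaptive_max_pool2d(x, output_size=b)  # (B, C, b, b)
        parts.append(pooled.flatten(start_dim=1))         # (B, C*b*b)
    return torch.cat(parts, dim=1)

feats = torch.randn(2, 960, 13, 13)  # MobileNetV3-Large output for 416x416 input
vec = spatial_pyramid_pool(feats)    # shape (2, 960 * 21) = (2, 20160)
```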

    Table 2.  Accuracy of different models on the VQA 2.0 test-std set (%)

    Model          Yes/No  Other  Num    All
    Team-Prior     60.27   0.56   1.70   26.37
    Ban-model      81.90   56.52  54.36  64.25
    Count-model    79.49   53.19  48.93  62.06
    Graph-model    81.71   56.72  51.01  63.49
    Language-only  68.01   32.20  27.15  49.32
    MCB            79.18   41.30  54.10  62.10
    Up-down        80.98   56.71  53.28  64.03
    Ours           82.26   56.73  55.58  64.52
    Note: Bold in the original table indicates the best result in each column.

    Table 3.  Accuracy of the ablation experiments on VQA 2.0 test-std (%)

    Method           Yes/No  Other  Count  All
    Up-down          80.98   57.06  53.28  64.03
    Up-down+MSP      81.07   57.06  52.79  64.25
    Up-down+AFF      82.26   56.70  53.60  64.27
    Up-down+MSP+AFF  82.26   56.73  55.58  64.52
Publication history
  • Received: 2021-09-17
  • Revised: 2021-10-20
  • Published online: 2022-05-12
