YANG You, FANG Xiaolong, DENG Yi, WU Chunyan, YAO Lu. Visual commonsense and attention for image captioning[J]. Microelectronics & Computer, 2022, 39(6): 51-59. DOI: 10.19304/J.ISSN1000-7180.2021.1226


Visual commonsense and attention for image captioning

    Abstract: Image captioning aims to have a computer automatically generate a natural-language description of a given image. It spans computer vision and natural language processing, and can be applied to retrieval systems, navigation aids for the blind, and medical report generation. An image captioning model that incorporates visual commonsense and attention is proposed to address two problems with existing models: insufficient mining of visual semantic relations, and attention deviation in the features modeled by multi-layer attention mechanisms. Within an encoder-decoder framework, the encoder introduces visual commonsense to guide local features toward commonsense semantic relations, using Faster R-CNN and VC R-CNN to extract local features and visual commonsense features, respectively. An AoA (Attention on Attention) mechanism is then applied to the high-level semantics mined by multi-layer attention, which enhances the features, yields better relevance, and reduces the attention deviation that would otherwise mislead sequence generation at the decoder. The decoder uses an attention mechanism to weight the features and select relevant information, and generates the output word sequence with an LSTM and a gated linear unit. Experiments on the MS COCO dataset show that the proposed model improves the BLEU, METEOR, ROUGE-L, CIDEr, and SPICE scores to some extent, indicating that it expresses the semantic content of images more accurately and more richly.
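
    As a rough illustration of the AoA (Attention on Attention) step mentioned in the abstract, the following PyTorch-style sketch gates an attention result with its query. The class name, layer sizes, and the surrounding attention module are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of Attention on Attention (AoA), assuming PyTorch is available.
# Names and dimensions here are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn


class AttentionOnAttention(nn.Module):
    """Gate an attention result with its query, as in AoA-style refinement."""

    def __init__(self, dim: int):
        super().__init__()
        # One linear layer produces both the information vector and the gate.
        self.proj = nn.Linear(2 * dim, 2 * dim)

    def forward(self, query: torch.Tensor, attended: torch.Tensor) -> torch.Tensor:
        # query, attended: (batch, ..., dim); attended is the output of a
        # conventional attention module over the image features.
        x = torch.cat([query, attended], dim=-1)
        information, gate = self.proj(x).chunk(2, dim=-1)
        # The sigmoid gate controls how much attended information passes on,
        # which is the mechanism used to curb attention deviation.
        return torch.sigmoid(gate) * information


if __name__ == "__main__":
    aoa = AttentionOnAttention(dim=512)
    q = torch.randn(2, 36, 512)      # e.g. 36 region-level queries per image
    v_hat = torch.randn(2, 36, 512)  # result of a standard attention module
    print(aoa(q, v_hat).shape)       # torch.Size([2, 36, 512])
```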

