Abstract:
Image captioning has applications in retrieval systems, navigation aids for the blind, and medical report generation. We propose an image captioning model with visual commonsense and attention on attention to address two problems: captioning models built on local features do not sufficiently mine visual semantic relations, and the features extracted by multi-level attention mechanisms suffer from attention deviation. Under an encoder-decoder framework, visual commonsense is introduced in the encoder to guide local features toward commonsense semantic relations; in the decoder, attention on attention is applied to the high-level semantics mined by multi-layer attention, which enhances the features, yields better relevance, and reduces the attention deviation that would otherwise mislead sequence generation. The model was evaluated on the MS COCO dataset, and the experimental results show improvements in BLEU, CIDEr, and SPICE, indicating that the model expresses the semantic content of images more accurately and more richly.
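The attention-on-attention step described in the abstract can be illustrated with a minimal sketch. It follows the standard AoA formulation (an information vector and a sigmoid attention gate, both computed from the query and the attended result, combined elementwise); the weight-matrix names and dimensions here are hypothetical, not taken from the paper, and single-head dot-product attention is assumed for simplicity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_on_attention(q, K, V, params):
    """Attention on Attention (AoA) over one query.

    q: (d,) query vector; K, V: (n, d) keys and values.
    params: dict of weights/biases (hypothetical names, random here).
    Returns the gated information vector of shape (d,).
    """
    # Standard scaled dot-product attention produces the attended result.
    scores = K @ q / np.sqrt(q.shape[0])
    v_hat = softmax(scores) @ V
    # Information vector: a linear map of the query and the attention result.
    i = params["W_iq"] @ q + params["W_iv"] @ v_hat + params["b_i"]
    # Attention gate: a sigmoid over the same inputs, filtering irrelevant
    # attention results and thereby limiting attention deviation.
    g = sigmoid(params["W_gq"] @ q + params["W_gv"] @ v_hat + params["b_g"])
    return g * i  # elementwise gating

rng = np.random.default_rng(0)
d, n = 8, 5
params = {k: rng.standard_normal((d, d)) for k in ["W_iq", "W_iv", "W_gq", "W_gv"]}
params.update({k: rng.standard_normal(d) for k in ["b_i", "b_g"]})
out = attention_on_attention(rng.standard_normal(d),
                             rng.standard_normal((n, d)),
                             rng.standard_normal((n, d)),
                             params)
print(out.shape)  # (8,)
```

Because the gate lies in (0, 1) per dimension, the module can suppress components of an attention result that are poorly related to the query instead of passing them on to the decoder unchanged.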