Text-to-image synthesis method based on feature-enhanced generative adversarial network
-
Abstract: To address the insufficient utilization of image visual features and channel feature information in text-to-image synthesis, a text-to-image synthesis method based on a Feature-Enhanced Generative Adversarial Network (FE-GAN) was proposed. First, a Memory on Memory (MoM) module was designed to attend to and fuse the intermediate features generated during dynamic memory reading: an attention mechanism performed the first visual feature enhancement at memory-reading time, and the attention result was then fused with the image features generated by the previous generator to achieve the second visual feature enhancement. Then, channel attention was introduced into the residual blocks to capture the different semantics contained in the image features and to strengthen the correlation between channels with similar semantics, thereby achieving channel feature enhancement. Finally, instance-normalization upsampling blocks were combined with batch-normalization upsampling blocks to increase image resolution, while mitigating the influence of batch size on the generation results and improving the style diversity of the generated images. Simulation experiments on the Caltech-UCSD Birds-200-2011 (CUB-200-2011) and Oxford-102 flower datasets show that the Inception Score (IS) of the proposed method reaches 4.83 and 4.13 respectively, which is 1.68% and 5.62% higher than that of DM-GAN. The experimental results show that the images generated by FE-GAN have better detail and are more consistent with the text semantics.
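The full architectural details are given in the body of the paper; purely as an illustration of the two-step visual feature enhancement described above, a minimal PyTorch-style sketch of a MoM-like module might look as follows. All class and variable names, tensor shapes and the gated-fusion step are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MoM(nn.Module):
    """Memory on Memory (rough sketch): attend over memory-read features,
    then fuse the attention result with the previous stage's image features."""

    def __init__(self, feat_dim):
        super().__init__()
        self.query = nn.Conv2d(feat_dim, feat_dim, 1)      # queries from previous image features
        self.key = nn.Linear(feat_dim, feat_dim)            # keys from memory slots
        self.value = nn.Linear(feat_dim, feat_dim)          # values from memory slots
        self.gate = nn.Conv2d(feat_dim * 2, feat_dim, 1)    # gated fusion (assumed)

    def forward(self, img_feat, memory):
        # img_feat: (B, C, H, W) features from the previous generator stage
        # memory:   (B, N, C)    memory slots produced by dynamic memory reading
        B, C, H, W = img_feat.shape
        q = self.query(img_feat).flatten(2).transpose(1, 2)             # (B, HW, C)
        k = self.key(memory)                                            # (B, N, C)
        v = self.value(memory)                                          # (B, N, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, HW, N)
        read = (attn @ v).transpose(1, 2).reshape(B, C, H, W)           # first enhancement
        g = torch.sigmoid(self.gate(torch.cat([read, img_feat], dim=1)))
        return g * read + (1 - g) * img_feat                            # second enhancement: fusion
```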
-
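In the same spirit, the channel-attention residual block and the combination of instance-normalization and batch-normalization upsampling blocks could be sketched as follows. SE-style channel attention (as in reference [13]) and nearest-neighbour upsampling are assumed here for illustration; the exact layer configuration in FE-GAN may differ.

```python
import torch
import torch.nn as nn

class ChannelAttnResBlock(nn.Module):
    """Residual block with SE-style channel attention (sketch). The channel
    attention re-weights feature maps so that channels carrying similar
    semantics reinforce each other."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.se = nn.Sequential(                     # squeeze-and-excitation gate
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        w = self.se(y).unsqueeze(-1).unsqueeze(-1)   # per-channel weights in (0, 1)
        return x + y * w

def up_block(in_ch, out_ch, norm="batch"):
    """Upsampling block; 'instance' corresponds to an instance-normalization
    upsampling block (IUpBlock), 'batch' to the ordinary batch-normalization
    block. Instance normalization is independent of the batch size and tends
    to preserve per-image style, which motivates mixing the two."""
    norm_layer = nn.InstanceNorm2d(out_ch, affine=True) if norm == "instance" else nn.BatchNorm2d(out_ch)
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        norm_layer,
        nn.ReLU(inplace=True),
    )
```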
Table 1. IS comparison of different methods on two different datasets

Method | IS↑ (CUB-200-2011) | IS↑ (Oxford-102)
StackGAN[7] | $3.7 \pm 0.04$ | $3.20 \pm 0.01$
AttnGAN[1] | $4.36 \pm 0.03$ | -
DM-GAN[2] | $4.75 \pm 0.07$ | $3.91 \pm 0.06^{*}$
CFA-HAGAN[18] | $4.54 \pm 0.04$ | $3.98 \pm 0.03$
SegAttnGAN[19] | $4.82 \pm 0.05$ | $3.52 \pm 0.09$
CRD-CGAN[20] | $4.75 \pm 0.10$ | $3.53 \pm 0.06$
CSM-GAN[21] | $4.62 \pm 0.08$ | -
MA-GAN[22] | $4.76 \pm 0.05$ | $4.09 \pm 0.08$
FE-GAN (proposed) | $\boldsymbol{4.83 \pm 0.05}$ | $\boldsymbol{4.13 \pm 0.05}$
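For reference, IS is the standard Inception Score, $\mathrm{IS} = \exp\left(\mathbb{E}_{x}\, D_{\mathrm{KL}}\!\left(p(y\mid x)\,\|\,p(y)\right)\right)$, with higher values indicating better quality and diversity. The relative improvements over DM-GAN quoted in the abstract follow from the table entries (the tabulated scores are rounded, hence the small difference from the reported 5.62%):

$$\frac{4.83-4.75}{4.75}\times 100\% \approx 1.68\%,\qquad \frac{4.13-3.91}{3.91}\times 100\% \approx 5.63\%$$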
Table 2. FID comparison of different methods on two different datasets
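FID (Fréchet Inception Distance), reported in Tables 2 to 5, is the standard Fréchet distance between Gaussian fits to the Inception features of real and generated images, with lower values indicating better quality:

$$\mathrm{FID} = \left\lVert \mu_r - \mu_g \right\rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception features of the real and generated images, respectively.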
Table 3. Performance comparison between FE-GAN and the baseline on the Oxford-102 dataset

Method | IS↑ | FID↓
Baseline | $3.91 \pm 0.06$ | 43.92
FE-GAN | $\boldsymbol{4.13 \pm 0.05}$ | 42.61
Table 4. Results of the ablation experiments on CUB-200-2011
Module selection | IS↑ | FID↓ | R-precision↑
Baseline | $4.69 \pm 0.05$ | 16.29 | $71.95 \pm 0.71$
+IN | $4.62 \pm 0.06$ | 18.40 | $73.06 \pm 0.73$
+MoM | $4.71 \pm 0.03$ | 19.29 | $73.87 \pm 0.80$
+CAR | $4.78 \pm 0.06$ | 15.77 | $75.22 \pm 1.06$
+IN+MoM | $4.57 \pm 0.07$ | 15.99 | $74.95 \pm 0.60$
+IN+CAR | $4.79 \pm 0.04$ | 20.82 | $\boldsymbol{75.32 \pm 0.75}$
+MoM+CAR | $4.80 \pm 0.06$ | 16.80 | $72.02 \pm 0.84$
+IN+MoM+CAR | $\boldsymbol{4.83 \pm 0.05}$ | 15.32 | $74.20 \pm 0.55$
Table 5. Effect of the number of IUpBlocks on FE-GAN
Module selection | IS↑ | FID↓ | R-precision↑
Baseline | $4.69 \pm 0.05$ | 16.29 | $71.95 \pm 0.71$
AIN | $4.72 \pm 0.06$ | 18.27 | $72.93 \pm 0.68$
SIN | $\boldsymbol{4.85 \pm 0.05}$ | 15.32 | $\boldsymbol{74.20 \pm 0.55}$
-
[1] XU T, ZHANG P C, HUANG Q Y, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1316-1324.
[2] ZHU M F, PAN P B, CHEN W, et al. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 5795-5803.
[3] JU S B, XU J, LI Y F. Text-to-single image method based on self-attention[J]. Computer Engineering and Applications, 2022, 58(3): 249-258 (in Chinese). DOI: 10.3778/j.issn.1002-8331.2009-0194.
[4] SESHADRI A D, RAVINDRAN B. Multi-tailed, multi-headed, spatial dynamic memory refined text-to-image synthesis[EB/OL]. [2022-03-22]. https://arxiv.org/pdf/2110.08143.pdf.
[5] ZHANG Y F, YI Y H, TANG Z W, et al. Text-to-image synthesis method based on channel attention mechanism[J]. Computer Engineering, 2022, 48(4): 206-212, 222 (in Chinese). DOI: 10.19678/j.issn.1000-3428.0062998.
[6] REED S E, AKATA Z, YAN X C, et al. Generative adversarial text to image synthesis[C]//Proceedings of the 33rd International Conference on Machine Learning. New York: JMLR.org, 2016: 1060-1069.
[7] ZHANG H, XU T, LI H S, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5908-5916.
[8] ZHANG H, XU T, LI H, et al. StackGAN++: realistic image synthesis with stacked generative adversarial networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947-1962. DOI: 10.1109/tpami.2018.2856256.
[9] GUO M H, XU T X, LIU J J, et al. Attention mechanisms in computer vision: a survey[J]. Computational Visual Media, 2022, 8(3): 331-368. DOI: 10.1007/s41095-022-0271-y.
[10] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017: 6000-6010.
[11] ZHANG H, GOODFELLOW I J, METAXAS D N, et al. Self-attention generative adversarial networks[C]//Proceedings of the 36th International Conference on Machine Learning. Long Beach: PMLR, 2019: 7354-7363.
[12] CHEN L, ZHANG H W, XIAO J, et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 6298-6306.
[13] HU J, SHEN L, ALBANIE S, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011-2023. DOI: 10.1109/TPAMI.2019.2913372.
[14] WANG Q L, WU B G, ZHU P F, et al. ECA-Net: efficient channel attention for deep convolutional neural networks[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11531-11539.
[15] TANG H, BAI S, SEBE N. Dual attention GANs for semantic image synthesis[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle: ACM, 2020: 1994-2002.
[16] HUANG S Y, CHEN Y. Generative adversarial networks with adaptive semantic normalization for text-to-image synthesis[J]. Digital Signal Processing, 2022, 120: 103267. DOI: 10.1016/j.dsp.2021.103267.
[17] HUANG L, WANG W M, CHEN J, et al. Attention on attention for image captioning[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 4633-4642.
[18] CHENG Q R, GU X D. Cross-modal feature alignment based hybrid attentional generative adversarial networks for text-to-image synthesis[J]. Digital Signal Processing, 2020, 107: 102866. DOI: 10.1016/j.dsp.2020.102866.
[19] GOU Y C, WU Q C, LI M B, et al. SegAttnGAN: text to image generation with segmentation attention[EB/OL]. [2022-03-22]. https://arxiv.org/pdf/2005.12444.pdf.
[20] HU T, LONG C J, XIAO C X. CRD-CGAN: category-consistent and relativistic constraints for diverse text-to-image generation[EB/OL]. [2022-03-22]. https://arxiv.org/pdf/2107.13516.pdf.
[21] TAN H C, LIU X P, YIN B C, et al. Cross-modal semantic matching generative adversarial networks for text-to-image synthesis[J]. IEEE Transactions on Multimedia, 2022, 24: 832-845. DOI: 10.1109/tmm.2021.3060291.
[22] YANG Y H, WANG L, XIE D, et al. Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis[J]. IEEE Transactions on Image Processing, 2021, 30: 2798-2809. DOI: 10.1109/tip.2021.3055062.
[23] LIAO W T, HU K, YANG M Y, et al. Text to image generation with semantic-spatial aware GAN[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 18166-18175.
[24] ZHANG Z X, SCHOMAKER L. DiverGAN: an efficient and effective single-stage framework for diverse text-to-image generation[J]. Neurocomputing, 2022, 473: 182-198. DOI: 10.1016/j.neucom.2021.12.005.
-