HUANG Jingui, HUANG Yiju. Video prediction based on attention spatiotemporal decoupling 3D convolution LSTM[J]. Microelectronics & Computer, 2022, 39(9): 63-72. DOI: 10.19304/J.ISSN1000-7180.2022.0023


Video prediction based on attention spatiotemporal decoupling 3D convolution LSTM


    Abstract: To extract video spatio-temporal features efficiently and improve video prediction accuracy, an attention-based spatio-temporal decoupling 3D convolutional LSTM algorithm is proposed. First, the traditional 2D convolution inside the convolutional LSTM unit is replaced with 3D convolution, which additionally extracts short-term spatial motion information between video frames, and an attention mechanism automatically captures the correlations of long-term dynamic information across frames. Second, because the Z-shaped, layer-by-layer transfer of feature information in a convolutional LSTM network leads to vanishing gradients, inter-layer highway connections are added to the network structure to optimize the transfer of the video information flow between LSTM units in different layers. Meanwhile, temporal and spatial features interfere with each other in the network and learn redundant functions, which makes feature acquisition inefficient and degrades prediction quality; a spatio-temporal decoupling term is therefore added to the loss function to separate the learning of temporal and spatial features. Finally, for the data input process in the training (encoding) and prediction (decoding) phases, data input resampling is proposed, which uses similar but opposite input strategies during training and prediction to reduce the discrepancy between the encoder and decoder. Experimental results on synthetic datasets as well as human action databases show that the proposed model achieves better performance in spatio-temporal feature extraction.
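The abstract does not specify the exact form of the spatio-temporal decoupling term in the loss. One common formulation for this kind of decoupling penalizes the absolute cosine similarity between the temporal and spatial feature updates, so the loss is zero when the two streams are orthogonal and largest when they are redundant. A minimal NumPy sketch under that assumption (the function name and arguments are hypothetical, not from the paper):

```python
import numpy as np

def decoupling_loss(delta_temporal, delta_spatial, eps=1e-8):
    """Penalize overlap between temporal and spatial feature updates.

    Each update is flattened per sample; the loss is the mean absolute
    cosine similarity: 0 when the two streams are orthogonal (fully
    decoupled), 1 when they are parallel (fully redundant).
    """
    t = delta_temporal.reshape(delta_temporal.shape[0], -1)
    s = delta_spatial.reshape(delta_spatial.shape[0], -1)
    cos = np.sum(t * s, axis=1) / (
        np.linalg.norm(t, axis=1) * np.linalg.norm(s, axis=1) + eps
    )
    return float(np.mean(np.abs(cos)))
```

Added to the prediction loss with a small weight, this term pushes the two feature streams toward learning complementary rather than overlapping functions.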
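The "similar but opposite" data input strategy resembles scheduled sampling with a reversed schedule: during one phase the probability of feeding the model its own predictions grows over training, while in the other it shrinks. A minimal sketch under that assumption (all names are hypothetical illustrations, not the paper's implementation):

```python
import random

def choose_inputs(true_frames, pred_frames, step, total_steps, reverse=False):
    """Mix ground-truth and predicted frames as the next model input.

    With reverse=False the probability of using a predicted frame grows
    linearly over training (classic scheduled sampling); with reverse=True
    it shrinks instead, giving the two phases similar but opposite input
    distributions so the encoder and decoder see more consistent data.
    """
    p_pred = step / total_steps
    if reverse:
        p_pred = 1.0 - p_pred
    return [p if random.random() < p_pred else t
            for t, p in zip(true_frames, pred_frames)]
```

At the start of training (`step == 0`) this always feeds ground truth; by the end it always feeds predictions, matching the autoregressive prediction phase.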
