Citation: TENG S H, WANG L, LI Y, et al. Adaptive independence assumption Non-autoregressive Transformer for speech recognition[J]. Microelectronics & Computer, 2023, 40(5): 29-38. doi: 10.19304/J.ISSN1000-7180.2022.0419

Adaptive independence assumption Non-autoregressive Transformer for speech recognition

  • Abstract: End-to-end automatic speech recognition models based on the non-autoregressive Transformer decode faster than traditional models such as the autoregressive Transformer, but non-autoregressive decoding and its independence assumption reduce recognition accuracy. To address this problem, a non-autoregressive Transformer model for end-to-end Chinese speech recognition is proposed that combines speech representation fusion with an adaptive independence assumption. During training, attention-based fusion of the representation vectors mitigates the partial loss of semantic information in the decoder's input frames. During decoding, a strategy based on the adaptive independence assumption addresses the conditional independence of output characters imposed by the independence assumption of the non-autoregressive model. Finally, iterative beam search performs ranked search decoding over multiple targets, because the standard beam search algorithm is not applicable to the proposed model. Experiments on the Chinese dataset AISHELL-1 show that the model reaches a real-time factor of 0.005 and a character error rate of 8.8%, which is 20% lower than that of the non-autoregressive Transformer baseline. The model thus substantially reduces the error rate while maintaining a high recognition speed, demonstrating advanced model performance.
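The two metrics quoted in the abstract follow standard definitions, which the sketch below illustrates. It is not the authors' evaluation code: the function names `edit_distance`, `character_error_rate`, and `real_time_factor` are hypothetical, and the sketch assumes the character error rate is the character-level Levenshtein distance divided by the reference length, and the real-time factor is decoding time divided by audio duration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (hypothetical helper)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Assumed definition: character-level edit distance / reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def real_time_factor(decode_seconds, audio_seconds):
    """Assumed definition: time spent decoding / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return decode_seconds / audio_seconds


if __name__ == "__main__":
    # Illustrative inputs only; these are not data from the paper.
    print(character_error_rate("今天天气很好", "今天天汽很好"))
    print(real_time_factor(0.05, 10.0))
```

Under these definitions, a real-time factor of 0.005 corresponds to decoding a 10 s utterance in about 0.05 s, and one wrong character in a six-character reference gives a character error rate of 1/6.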

     
