段云,邵玉斌,刘晶,等.一种基音频率归一化的语种识别方法[J]. 微电子学与计算机,2023,40(5):20-28. doi: 10.19304/J.ISSN1000-7180.2022.0398
DUAN Y,SHAO Y B,LIU J,et al. A language identification method based on normalization of pitch frequency[J]. Microelectronics & Computer,2023,40(5):20-28. doi: 10.19304/J.ISSN1000-7180.2022.0398

一种基音频率归一化的语种识别方法

A language identification method based on normalization of pitch frequency

  • Abstract: Speaker-specific pronunciation characteristics interfere with language discrimination and degrade recognition performance. To address this problem, a language identification method based on pitch-frequency normalization of speech is proposed. First, endpoint detection separates the speech segments from the non-speech segments, and the pitch frequency is extracted from the speech segments and normalized to generate glottal pulses. Next, the vocal tract response is extracted, and the glottal pulses and the vocal tract response are passed through an all-pole filter to reconstruct speech with a normalized pitch frequency. Finally, low-level acoustic features are extracted and fed to a ResNet back end for language identification. The experimental results show that the proposed method reduces the influence of speaker pronunciation characteristics on language-discriminative features. The effect is notable on grayscale spectrograms, where the recognition rate reaches 94.3%. For traditional low-level acoustic features such as MFCC and GFCC, as well as the improved time-domain GF feature, the proposed method raises the recognition rate by 3% to 4%. The method thus effectively reduces the influence of speaker pronunciation characteristics and improves language identification performance.
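The reconstruction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the endpoints are detected, how the pitch frequency is normalized, or how the vocal tract response is estimated. The sketch assumes a crude short-time energy threshold for endpoint detection, a fixed target pitch (`target_f0_hz`) for every speech frame, an impulse train as the glottal pulse model, and linear prediction (autocorrelation method with the Levinson-Durbin recursion) for the vocal tract response. All function names, parameters, and default values are hypothetical.

```python
import math

def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return LPC coefficients a (with a[0] = 1) and the residual energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 1e-12:          # frame is (numerically) fully predicted
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err            # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def is_speech(frame, energy_threshold):
    """Assumed endpoint decision: mean energy above a fixed threshold."""
    return sum(s * s for s in frame) / len(frame) > energy_threshold

def pulse_train(n_samples, f0_hz, sample_rate):
    """Assumed glottal pulse model: unit impulses at the target pitch period."""
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def all_pole_filter(excitation, a, gain=1.0):
    """y[n] = gain * x[n] - sum_{k>=1} a[k] * y[n - k]."""
    out = []
    for n, x in enumerate(excitation):
        acc = gain * x
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * out[n - k]
        out.append(acc)
    return out

def normalize_pitch(frame, sample_rate, target_f0_hz=150.0, order=10,
                    energy_threshold=1e-4):
    """Resynthesize one frame at a fixed pitch, keeping its LPC envelope.

    Non-speech frames are returned unchanged (an assumption of this sketch).
    The frame must be longer than `order` samples.
    """
    if not is_speech(frame, energy_threshold):
        return list(frame)
    r = autocorrelation(frame, order)
    a, err = levinson_durbin(r, order)
    gain = math.sqrt(max(err, 0.0))
    excitation = pulse_train(len(frame), target_f0_hz, sample_rate)
    return all_pole_filter(excitation, a, gain)
```

Calling `normalize_pitch(frame, 8000)` on each frame of an 8 kHz recording would replace the speaker's own pitch with the fixed target while keeping the estimated spectral envelope. A practical system would also need overlapping windows, filter-state continuity across frames, and handling of unvoiced speech, none of which is covered here.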

     
