Research on Text Categorization Based on an Improved Hidden Markov Model

Abstract: An improved hidden Markov model (HMM) is applied to text categorization. In addition to the forward dependency between states, the model takes the backward dependency into account: the dependency of the current observation and the current state on the following state is incorporated into model learning, which can improve the accuracy of text information extraction. In the categorization procedure, the training samples are first preprocessed, the parameters of the HMM classifier are then learned, and the resulting classifier is tested on a test set and its performance is evaluated. The evaluation uses an improved criterion that gives an accurate assessment across different datasets, allows different classifiers to be compared on the same dataset, and provides feedback for improving the classifier, which greatly improves the quality of the evaluation.
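To make the classification step concrete, the following is a minimal sketch of how an HMM-based classifier can score a document. It assumes one HMM per category and a document already preprocessed into a sequence of vocabulary indices, and it uses the standard forward algorithm only; the backward-dependency extension described in the abstract is not reproduced here. The function names, the category labels, and all probability values are hypothetical and chosen purely for illustration.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: returns log P(obs | model).

    obs   : list of observation indices (e.g. vocabulary ids)
    start : start[i]    = P(first state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][k]  = P(observation k | state i)
    """
    if not obs:
        return 0.0  # an empty sequence has probability 1
    n_states = len(start)
    # alpha[i] is proportional to P(o_1..o_t, state_t = i); it is
    # renormalized at every step to avoid numerical underflow.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states))
                * emit[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def classify(obs, models):
    """Return the label whose HMM assigns obs the highest log-likelihood."""
    return max(
        models,
        key=lambda label: forward_log_likelihood(obs, *models[label]),
    )


# Hypothetical toy models over a 3-word vocabulary (ids 0, 1, 2).
# Each entry is (start, trans, emit); the numbers are illustrative only.
MODELS = {
    "sports": (
        [0.6, 0.4],
        [[0.7, 0.3], [0.4, 0.6]],
        [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],
    ),
    "finance": (
        [0.5, 0.5],
        [[0.5, 0.5], [0.5, 0.5]],
        [[0.1, 0.2, 0.7], [0.2, 0.3, 0.5]],
    ),
}

if __name__ == "__main__":
    print(classify([0, 0, 1, 0], MODELS))
    print(classify([2, 2, 1, 2], MODELS))
```

In a full pipeline the parameters would be estimated from the preprocessed training set instead of being written by hand, and the predicted labels on the test set would feed the performance evaluation.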