Feature selection mechanism based on binary queue search for unbalanced data

GUO Jia

Abstract: Unbalanced data (classes with a non-uniform distribution) and redundant features greatly increase the difficulty of accurate classification. With the goal of optimizing the prediction accuracy of the learning algorithm, and in combination with the synthetic minority oversampling technique SMOTE, a wrapper feature selection algorithm for unbalanced data based on the binary queue search method, BQSA, is designed. Experiments are conducted on fourteen software fault prediction datasets from the PROMISE repository. The influence of the oversampling ratio of the datasets is tested, confirming that synthetic minority oversampling has a positive effect on classification prediction for highly unbalanced data, and the best oversampling rate is obtained. The performance of BQSA is compared with that of similar algorithms, confirming that BQSA combined with synthetic minority oversampling achieves better prediction accuracy and performs better on classification sensitivity, specificity, and area under the curve (AUC).
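The building blocks named in the abstract can be illustrated with a minimal sketch. It is not the paper's BQSA implementation: the abstract gives no details of the queue search itself, so that part is omitted. The function names (`smote_like_oversample`, `apply_mask`, `sensitivity_specificity`), the neighbour count `k`, and the label convention (1 = minority/faulty class) are all assumptions made for illustration.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea).
    Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

def apply_mask(rows, mask):
    """Keep only the features whose bit in the binary mask is 1
    (a wrapper method scores each candidate mask with a classifier)."""
    return [[v for v, bit in zip(row, mask) if bit] for row in rows]

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity); assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)
```

In a wrapper setting, each candidate binary mask would be passed through `apply_mask`, a classifier would be trained on the oversampled data, and metrics such as those above would serve as the fitness of that mask.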
