Malware Classification Scheme Based on Multi-Feature Fusion

张冬雯; 张少华; 陈振国; 张光华; 于乃文

doi:10.19304/J.ISSN1000-7180.2021.1198


Abstract

Feature extraction for malware classification has traditionally relied on a single feature type as the basis for detection and classification, which results in low accuracy and poor performance. To address this, this paper proposes a malware family classification scheme that extracts and fuses multiple static features and classifies them with ensemble learning algorithms. First, four kinds of static features (bytecode, opcodes, API sequences, and grayscale images) are extracted from decompiled malware samples in a Kaggle dataset. Next, the chi-square test and the Pearson correlation coefficient are used to select the features most strongly correlated with the class labels. Finally, the selected features are fed into ensemble learning models, including GBDT, XGBoost, and random forest, for malware family classification. Experimental results show that the proposed multi-feature fusion scheme reaches an accuracy of 99.8% and, compared with traditional single-feature machine learning schemes, effectively improves detection and classification accuracy for unknown or variant malware.
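The extraction, fusion, and selection steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the image width of 16, the "first token is the opcode" rule, the binary (one-vs-rest) labels, and the 0.5 correlation threshold are all assumptions made for illustration. API-sequence extraction, the chi-square test, and the ensemble classifiers are omitted.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Normalised frequency of each of the 256 byte values (a simple bytecode feature)."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def grayscale_rows(data, width=16):
    """Treat each byte as a pixel intensity (0-255) and wrap the bytes into rows of `width`."""
    usable = len(data) - len(data) % width  # drop the incomplete last row
    return [list(data[i:i + width]) for i in range(0, usable, width)]


def opcode_counts(asm_lines, vocab):
    """Count mnemonics from `vocab`; assumes the first token of each line is the opcode."""
    counts = Counter(line.split()[0].lower() for line in asm_lines if line.strip())
    return [counts.get(op, 0) for op in vocab]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def select_features(rows, labels, threshold=0.5):
    """Indices of feature columns whose |r| against the labels meets the threshold."""
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns) if abs(pearson(col, labels)) >= threshold]


def fuse(*feature_vectors):
    """Feature fusion by simple concatenation of per-sample feature vectors."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused
```

In this sketch, fusion is plain concatenation, so each sample becomes one long vector (for example `fuse(byte_histogram(raw), opcode_counts(asm, vocab))`), and `select_features` then keeps only the columns that correlate with the label before any classifier is trained.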
