Abstract:
Traditional malware classification usually relies on a single feature as the basis for detection and classification, which results in low detection accuracy. This paper proposes a scheme that extracts and fuses multiple static features and uses ensemble learning algorithms to classify malware families. First, static features (byte code, operation code, API sequence, and grayscale image) are extracted from the decompiled malicious samples in the Kaggle data set. Then, the chi-square test and the Pearson correlation coefficient are used to select the important features, namely those that correlate strongly with the class labels. Finally, the selected features are fed into ensemble learning models, such as GBDT, XGBoost, and random forest, for malware family classification. Experimental results show that the ensemble learning malware classification scheme based on multi-feature fusion achieves an accuracy of 99.8%. Compared with traditional single-feature machine learning classification schemes, it effectively improves the detection and classification accuracy for unknown or variant malware.
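The feature selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pearson`, `chi2_score`, `select_top_k`), the toy feature matrix, and the two-class labels are all assumptions made for illustration. The chi-square score here treats each feature as a non-negative count (for example, an opcode frequency) and compares its per-class totals with the totals expected if the feature were independent of the class.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)


def chi2_score(column, labels):
    """Chi-square statistic of a non-negative count feature against class labels."""
    total = sum(column)
    if total == 0:
        return 0.0
    n = len(labels)
    observed = defaultdict(float)  # feature total within each class
    class_count = defaultdict(int)  # number of samples in each class
    for value, label in zip(column, labels):
        observed[label] += value
        class_count[label] += 1
    score = 0.0
    for label, count in class_count.items():
        expected = total * count / n  # total expected under independence
        score += (observed[label] - expected) ** 2 / expected
    return score


def select_top_k(rows, labels, k, score_fn):
    """Return the sorted indices of the k features with the largest |score|."""
    scored = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scored.append((abs(score_fn(column, labels)), j))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return sorted(j for _, j in scored[:k])


# Hypothetical toy data: 4 samples, 3 count features, 2 malware families.
rows = [
    [9, 1, 5],
    [8, 2, 5],
    [1, 9, 5],
    [2, 8, 5],
]
labels = [0, 0, 1, 1]

print(select_top_k(rows, labels, 2, chi2_score))
print(select_top_k(rows, labels, 2, pearson))
```

In this toy data the third feature is constant across samples, so both scores rank it last. Correlating a feature with numerically encoded labels is only straightforward in the two-class case shown here; a multi-class setting would need a per-class (one-vs-rest) encoding or a different statistic. The selected columns would then be passed to an ensemble classifier such as a gradient boosting or random forest model.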