Research on a Big Data Classification Algorithm Based on MapReduce-Improved K-NN

蒋华; 韩飞; 王鑫; 王慧娇

Big Data Classification Algorithm Based on MapReduce to Improve K-NN

Abstract

The traditional k-nearest neighbor (K-NN) classification algorithm is computationally expensive and processes high-dimensional, massive data sets inefficiently. To address these shortcomings, this paper rewrites the Map and Reduce functions on the Hadoop platform using the MapReduce distributed programming model, and proposes two improvements to traditional K-NN: principal component analysis of the data set, and distance weighting when predicting data that fall in the critical region. First, principal component analysis is applied to the high-dimensional data to reduce its dimensionality and thereby improve running efficiency. Second, the concepts of a complete region and a critical region are introduced in the prediction stage; in the critical region, the k nearest neighbors, which span n classes, are distance-weighted to improve accuracy. Finally, the algorithm is run on a Hadoop cluster to further improve its efficiency on massive data. Experimental results show that the algorithm greatly improves computational efficiency and accuracy when processing massive data.
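The prediction stage described above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not define the complete and critical regions, so the sketch assumes that a query lies in the complete region when all k nearest neighbors share one class, and in the critical region otherwise. The inverse-distance weight, the function names `map_local_top_k` and `reduce_classify`, and the `eps` parameter are likewise hypothetical. The map and reduce steps are plain Python functions that mimic the MapReduce data flow rather than calls to the Hadoop API, and the principal component analysis step is omitted.

```python
import heapq
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_local_top_k(split, query, k):
    """Map step (sketch): return the k nearest (distance, label) pairs in one data split."""
    return heapq.nsmallest(
        k, ((euclidean(features, query), label) for features, label in split)
    )


def reduce_classify(partial_lists, k, eps=1e-9):
    """Reduce step (sketch): merge per-split candidates and predict a label."""
    neighbors = heapq.nsmallest(
        k, (pair for part in partial_lists for pair in part)
    )
    labels = {label for _, label in neighbors}
    if len(labels) == 1:
        # Assumed "complete region": every neighbor agrees, so no weighting is needed.
        return next(iter(labels))
    # Assumed "critical region": neighbors span several classes,
    # so each class is scored by the sum of inverse distances (an assumed weight).
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)
    return max(scores, key=scores.get)


def classify(splits, query, k):
    """Run the map step on every split, then the reduce step on the merged candidates."""
    return reduce_classify([map_local_top_k(s, query, k) for s in splits], k)
```

Under these assumptions, a query with one very close neighbor of class "a" and two distant neighbors of class "b" is labeled "a" by the weighted vote, whereas an unweighted majority vote with k = 3 would label it "b".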
