Abstract:
To address the shortcomings of the traditional k-nearest neighbor (K-NN) classification algorithm, namely its heavy computation and its low efficiency on high-dimensional, massive data sets, this paper redesigns the Map and Reduce functions on the Hadoop platform using the MapReduce distributed programming model, and combines principal component analysis with distance weighting of data in the critical region. First, principal component analysis is applied to the high-dimensional data to reduce its dimensionality and thereby improve operational efficiency. Second, in the prediction stage of classification, the concepts of a complete region and a critical region are introduced, and for samples in the critical region the k nearest neighbors, drawn from the n classes, are weighted by distance. Finally, running the algorithm in a Hadoop cluster environment further improves its efficiency on massive data. The experimental results show that the algorithm greatly improves both computational efficiency and accuracy when processing massive data.
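
The Map and Reduce roles and the distance weighting mentioned above can be illustrated with a minimal, single-machine sketch. This is not the paper's implementation: the function names (`map_split`, `reduce_query`, `classify`) are hypothetical, Hadoop's shuffle is replaced by an in-memory dictionary, and the principal component analysis step is omitted. The abstract does not define the two regions, so the sketch assumes that a sample lies in the "complete region" when its k nearest neighbors all share one label, and in the "critical region" otherwise, where votes are weighted by inverse distance.

```python
import heapq
import math
from collections import defaultdict


def map_split(split, queries, k):
    """Map step: for one training split, emit each query's k nearest local candidates.

    split   : list of (feature_vector, label) pairs
    queries : dict mapping query_id -> feature_vector
    Yields (query_id, (distance, label)) pairs.
    """
    for qid, q in queries.items():
        candidates = ((math.dist(x, q), label) for x, label in split)
        for pair in heapq.nsmallest(k, candidates):
            yield qid, pair


def reduce_query(candidates, k, eps=1e-12):
    """Reduce step: merge candidates from all splits and choose a label."""
    nearest = heapq.nsmallest(k, candidates)
    labels = {label for _, label in nearest}
    if len(labels) == 1:
        # Assumed "complete region": the k neighbors are unanimous.
        return labels.pop()
    # Assumed "critical region": weight each vote by inverse distance.
    scores = defaultdict(float)
    for d, label in nearest:
        scores[label] += 1.0 / (d + eps)
    return max(scores, key=scores.get)


def classify(splits, queries, k):
    """Run map and reduce locally, standing in for the Hadoop shuffle."""
    grouped = defaultdict(list)
    for split in splits:
        for qid, pair in map_split(split, queries, k):
            grouped[qid].append(pair)
    return {qid: reduce_query(pairs, k) for qid, pairs in grouped.items()}


if __name__ == "__main__":
    # Two training splits of one-dimensional points with labels 0 and 1.
    splits = [
        [((0.0,), 0), ((3.0,), 1)],
        [((4.0,), 1), ((10.0,), 1)],
    ]
    queries = {"a": (1.0,), "b": (9.0,)}
    print(classify(splits, queries, k=3))
```

In this toy data, query `a` has neighbors at distances 1, 2, and 3 with labels 0, 1, and 1. A plain majority vote would choose label 1, whereas inverse-distance weighting gives label 0 a score of 1 against roughly 0.83 for label 1, so the sketch returns 0. Query `b` has three neighbors that all carry label 1, so it is labeled without weighting.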