Research on Topic Discovery Using a Hadoop-Based Gray Wolf Optimized K-means Algorithm

王林; 陈青超

doi:10.19304/J.ISSN1000-7180.2021.0862

Abstract

Discovering hot topics quickly and accurately in massive volumes of network data is important for monitoring online public opinion. To address the sensitivity of K-means to the choice of initial centers and its weak global search ability, this paper proposes IGWO-KM, a Hadoop-based K-means algorithm driven by an improved gray wolf optimizer. First, the gray wolf optimization (GWO) algorithm is combined with K-means: GWO's fast convergence and global search ability are used to find the best cluster centers for K-means, which reduces the instability caused by randomly chosen initial centers and yields better clustering results. Second, a nonlinear convergence factor is used to improve GWO and to balance its global and local search. Third, an improved sine cosine algorithm is introduced to strengthen GWO's global search, improve optimization accuracy and convergence speed, and avoid local optima. Fourth, neighbor space spheres are used to eliminate redundant distance calculations during K-means clustering, which speeds up convergence. Finally, the algorithm is parallelized by exploiting the batch-processing capability of a Hadoop cluster. Experimental results show that IGWO-KM achieves better optimization accuracy and stability. Compared with GWO-KM and K-means, it clearly improves precision, recall, and F value, and it converges quickly and scales well.
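The first step of the abstract, using gray wolf optimization to search for K-means cluster centers, can be illustrated with a minimal sketch. This is not the authors' IGWO-KM implementation: it uses the standard GWO position update, and the quadratic decay in `convergence_factor` is only an assumed stand-in for the paper's nonlinear convergence factor, whose formula is not given in the abstract. The sine cosine step, the neighbor space spheres, and the Hadoop parallelization are omitted. All function names (`sse`, `convergence_factor`, `gwo_centers`, `kmeans`) are hypothetical.

```python
import random


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )


def convergence_factor(t, max_iter, nonlinear=True):
    """Decay from 2 to 0 over the run.

    The linear form is standard GWO. The quadratic form is an ASSUMED
    example of a nonlinear factor, not the formula from the paper.
    """
    ratio = t / max_iter
    return 2.0 * (1.0 - ratio ** 2) if nonlinear else 2.0 * (1.0 - ratio)


def gwo_centers(points, k, n_wolves=10, max_iter=30, seed=0):
    """Search for k cluster centers with plain GWO, using SSE as fitness."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Each wolf is a flat vector of k centers, initialised from random data points.
    wolves = [[x for p in rng.sample(points, k) for x in p] for _ in range(n_wolves)]

    def fitness(wolf):
        return sse(points, [wolf[i * dim:(i + 1) * dim] for i in range(k)])

    best, best_fit, history = None, float("inf"), []
    for t in range(max_iter):
        # Copy the three best wolves (alpha, beta, delta) before moving the pack.
        leaders = [list(w) for w in sorted(wolves, key=fitness)[:3]]
        leader_fit = fitness(leaders[0])
        if leader_fit < best_fit:
            best, best_fit = list(leaders[0]), leader_fit
        history.append(best_fit)  # best SSE found so far
        a = convergence_factor(t, max_iter)
        for wolf in wolves:
            for j in range(len(wolf)):
                estimate = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    estimate += leader[j] - A * abs(C * leader[j] - wolf[j])
                wolf[j] = estimate / 3.0  # average of the three leader-guided moves
    return [best[i * dim:(i + 1) * dim] for i in range(k)], history


def kmeans(points, centers, n_iter=20):
    """Plain Lloyd iterations starting from the given centers."""
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [10.0, 0.0], [10.2, 0.1]]
    seeds, _ = gwo_centers(data, k=3)
    print(kmeans(data, seeds))
```

In this sketch GWO only supplies the starting centers and K-means then refines them, so the final clustering error can be no worse than the best solution GWO found.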
