Research on topic discovery based on hadoop gray wolf optimized K-means algorithm
-
摘要:
快速准确的在海量网络数据中发现热点主题对于网络舆情监控具有重要作用.针对K-means算法对初始中心点选择敏感和全局搜索能力不足的问题,提出一种基于Hadoop的改进灰狼优化K-means的IGWO-KM算法.首先,该算法将灰狼优化算法和K-means算法相结合,利用灰狼优化算法收敛速度快和可全局寻优的优势为K-means搜索最佳聚类中心,减小随机选取初始中心点而导致的聚类结果不稳定性,以获取更好的聚类结果.其次,使用非线性收敛因子改进灰狼优化算法,协调算法的全局和局部的搜索能力.然后,引入正弦余弦算法并进行改进,增强灰狼优化算法的全局搜索能力,优化寻优精度和收敛速度,避免陷入局部最优.之后,使用近邻空间球减少K-means聚类过程中冗余的距离计算加快算法收敛.最后,利用Hadoop集群可批量处理数据的特性,实现算法的并行化.实验结果表明,IGWO-KM算法具有更好的寻优精度和稳定性,相比于GWO-KM算法和K-means,该算法在查准率、召回率和F值均有明显提高,且具有良好的收敛速度和拓展性.
Abstract:Quickly and accurately discovering hot topics in massive network data plays an important role in network public opinion monitoring. Aiming at the problem that the K-means algorithm is sensitive to the initial center point selection and the global search ability is insufficient, an improved gray wolf optimization K-means IGWO-KM algorithm based on Hadoop is proposed. First, the algorithm combines the gray wolf optimization algorithm with the K-means algorithm, and takes advantage of the gray wolf optimization algorithm′s fast convergence speed and global optimization for K-means to search for the best clustering center, reducing the random selection of the initial center point The resulting clustering results are unstable to obtain better clustering results. Secondly, use nonlinear convergence factors to improve the gray wolf optimization algorithm, and coordinate the algorithm′s global and local search capabilities. Then, the sine cosine algorithm is introduced and improved to enhance the global search ability of the gray wolf optimization algorithm, optimize the optimization accuracy and convergence speed, and avoid falling into the local optimum. After that, the nearest neighbor space sphere is used to reduce the redundant distance calculation in the K-means clustering process to speed up the algorithm convergence. Finally, the Hadoop cluster can process data in batches to realize the parallelization of algorithms. The experimental results show that the IGWO-KM algorithm has better optimization accuracy and stability. Compared with the GWO-KM algorithm and K-means, the algorithm has significantly improved Precision, Recall and F value, and has good convergence speed and scalability.
-
Key words:
- text clustering /
- K-means algorithm /
- topic discovery /
- GWO algorithm /
- distributed computing
-
表 1 测试函数
Table 1. Test function
测试函数 表达式 定义域 fmin 单峰测试函数 Sphere $f_{1}(x)=\sum_{i=1}^{n} x_{i}^{2}$ [-100, 100] 0 Schwefel2.22 $f_{2}(x)=\sum_{i=1}^{n}\left|x_{i}\right|+\prod_{i=1}^{n}\left|x_{i}\right|$ [-10, 10] 0 多峰测试函数 Ackley $f_{3}(x)=-20 \exp \left(-0.2 \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_{i}^{2}}\right)-\exp \left(\frac{1}{n} \sum_{i=1}^{n} \cos \left(2 \pi x_{i}\right)\right)+20+e$ [-32, 32] 0 Griewank $f_{4}(x)=\frac{1}{40\ 000} \sum_{i=1}^{n} x_{i}^{2}-\prod_{i=1}^{n} \cos \left(\frac{x_{i}}{\sqrt{i}}\right)+1$ [-600, 600] 0 表 2 测试函数运行结果
Table 2. Test function running results
函数 GWO IGWO-KM1 IGWO-KM2 IGWO-KM fave fstd fave fstd fave fstd fave fstd f1 3.78e-59 6.69e-59 1.01e-70 1.49e-70 8.45e-82 2.59e-81 7.73e-83 1.34e-82 f2 8.73e-35 5.75e-35 2.96e-41 3.64e-41 3.98e-48 3.49e-48 2.10e-48 2.46e-48 f3 1.75e-14 3.36e-15 1.25e-14 3.68e-15 7.66e-15 1.11e-15 7.80e-15 0 f4 0.003 6 0.008 9 0.001 6 0.003 5 0.001 4 0.004 5 0 0 表 3 文本聚类结果评价指标
Table 3. Evaluation index of text clustering results
实验方法 评价指标 数据集规模/M 107.7 501.3 1 035 K-means 查准率 62.96 62.52 62.23 召回率 68.00 67.76 67.35 F值 65.38 65.03 64.69 GWO-KM 查准率 77.24 76.36 75.83 召回率 73.52 72.22 71.88 F值 75.33 74.23 73.80 本文算法 查准率 81.26 81.01 80.97 召回率 80.38 79.98 79.22 F值 80.82 80.49 80.09 表 4 速度优化百分比
Table 4. Speed optimization percentage
函数值 数据集规模/M 107.7 501.3 1 035 速度优化百分比 8.24 12.36 18.54 表 5 不同规模数据集运行时间
Table 5. Running time of data sets of different sizes
数据集 1节点/s 2节点/s 3节点/s 107.7M 281.25 169.56 136.58 501.3M 1335.18 753.66 556.67 1 035M 2 200.19 1 208.12 861.23 -
[1] 王林, 许郡蒙. 分布式K-means聚类在微博热点主题发现的应用[J]. 计算机仿真, 2020, 37(8): 121-125. DOI: 10.3969/j.issn.1006-9348.2020.08.027.WANG L, XU J M. Application of distributed K-means clustering algorithm in micro-blog hot topic discovery[J]. Computer Simulation, 2020, 37(8): 121-125. DOI: 10.3969/j.issn.1006-9348.2020.08.027. [2] 申国伟, 杨武, 王巍, 等. 面向大规模微博消息流的突发话题检测[J]. 计算机研究与发展, 2015, 52(2): 512-521. DOI: 10.7544/issn.1000-1239.2015.20131336.SHEN G W, YANG W, WANG W, et al. Burst topic detection oriented large-scale microblogs streams[J]. Journal of Computer Research and Development, 2015, 52(2): 512-521. DOI: 10.7544/issn.1000-1239.2015.20131336. [3] 李艳红, 贾丽娜, 王素格, 等. 基于动态窗口的微博突发话题检测方法[J]. 计算机应用与软件, 2020, 37(5): 30-37. DOI: 10.3969/j.issn.1000-386x.2020.05.006.LI Y H, JIA L N, WANG S G, et al. Microblog bursty topic detection method based on dynamic window[J]. Computer Applications and Software, 2020, 37(5): 30-37. DOI: 10.3969/j.issn.1000-386x.2020.05.006. [4] 柳毅, 曾昊. 改进K-means的双向采样非均衡数据分类方法[J]. 微电子学与计算机, 2020, 37(3): 60-65. http://www.journalmc.com/article/id/23347dde-7965-4427-93f8-7da401ef66bbLIU Y, ZENG H. Improved the bi-directional sampling unbalanced data classification method of K-means[J]. Microelectronics & Computer, 2020, 37(3): 60-65. http://www.journalmc.com/article/id/23347dde-7965-4427-93f8-7da401ef66bb [5] 安计勇, 高贵阁, 史志强, 等. 一种改进的K均值文本聚类算法[J]. 传感器与微系统, 2015, 34(5): 130-133. DOI: 10.13873/J.1000-9787(2015)05-0130-04.AN J Y, GAO G G, SHI Z Q, et al. An improved K-means text clustering algorithm[J]. Transducer and Microsystem Technologies, 2015, 34(5): 130-133. DOI: 10.13873/J.1000-9787(2015)05-0130-04. [6] 靳雁霞, 齐欣, 张晋瑞, 等. 一种改进的简化均值粒子群K-means聚类算法[J]. 微电子学与计算机, 2020, 37(5): 69-74. http://www.journalmc.com/article/id/7464ee05-5442-4cff-a88a-203e45152b4fJIN Y X, QI X, ZHANG J R, et al. An improved simplified mean particle swarm optimization K-means clustering algorithm[J]. Microelectronics & Computer, 2020, 37(5): 69-74. http://www.journalmc.com/article/id/7464ee05-5442-4cff-a88a-203e45152b4f [7] 谌志群, 徐宁, 王荣波. 基于主题演化图的网络论坛热点跟踪[J]. 情报科学, 2013, 31(3): 147-150. DOI: 10.13833/j.cnki.is.2013.03.021.ZHAN Z Q, XU N, WANG R B. BBS hot topic tracking based on theme evolution graph[J]. Information Science, 2013, 31(3): 147-150. DOI: 10.13833/j.cnki.is.2013.03.021. [8] 刘佳鸣, 况立群, 尹洪红, 等. 灰狼优化的k均值聚类算法[J]. 中国科技论文, 2019, 14(7): 778-782. DOI: 10.3969/j.issn.2095-2783.2019.07.013.LIU J M, KUANG L Q, YIN H H, et al. K-means clustering algorithm based on grey wolf optimization[J]. China Sciencepaper, 2019, 14(7): 778-782. DOI: 10.3969/j.issn.2095-2783.2019.07.013. [9] 刘江华. 一种基于kmeans聚类算法和LDA主题模型的文本检索方法及有效性验证[J]. 情报科学, 2017, 35(2): 16-21. DOI: 10.13833/j.cnki.is.2017.02.003.LIU J H. A text retrieval method based on Kmeans clustering algorithm and LDA topic model and its effectiveness[J]. Information Science, 2017, 35(2): 16-21. DOI: 10.13833/j.cnki.is.2017.02.003. [10] 王林, 贾钧琛. 基于改进Canopy-K-means算法的并行化研究[J]. 计算机测量与控制, 2021, 29(2): 176-179. DOI: 10.16526/j.cnki.11-4762/tp.2021.02.035.WANG L, JIA J C. Research on parallelization based on improved canopy-K-means algorithm[J]. Computer Measurement & Control, 2021, 29(2): 176-179. DOI: 10.16526/j.cnki.11-4762/tp.2021.02.035. [11] 萧婧婕, 陈志云. 基于灰狼算法的主题爬虫[J]. 计算机科学, 2018, 45(S2): 146-148. https://www.cnki.com.cn/Article/CJFDTOTAL-JSJA2018S2027.htmXIAO J J, CHEN Z Y. Focused crawling based on grey wolf algorithms[J]. Computer Science, 2018, 45(S2): 146-148. https://www.cnki.com.cn/Article/CJFDTOTAL-JSJA2018S2027.htm [12] MIRJALILI S. SCA: a sine cosine algorithm for solving optimization problems[J]. Knowledge-Based Systems, 2016, 96: 120-133. DOI: 10.1016/j.knosys.2015.12.022. [13] SHI Y, EBERHART R. A modified particle swarm optimizer[C]//1998 IEEE International Conference on Evolutionary Computation Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98TH8360). Anchorage, AK, USA: IEEE, 1998: 69-73. DOI: 10.1109/ICEC.1998.699146. [14] 余辉, 黄永峰, 胡萍. 微博舆情的Hadoop存储和管理平台设计与实现[J]. 电子技术应用, 2017, 43(3): 120-123. DOI: 10.16157/j.issn.0258-7998.2017.03.030.YU H, HUANG Y F, HU P. Design and implementation of Hadoop based storage and management platform on the Weibo public opinion[J]. Application of Electronic Technique, 2017, 43(3): 120-123. DOI: 10.16157/j.issn.0258-7998.2017.03.030. [15] 王晓黎, 王文杰. 基于向量空间模型的文本检索系统[J]. 微电子学与计算机, 2006, 23(6): 188-190. DOI: 10.3969/j.issn.1000-7180.2006.06.055.WANG X L, WANG W J. A document searching system with vector space model[J]. Microelectronics & Computer, 2006, 23(6): 188-190. DOI: 10.3969/j.issn.1000-7180.2006.06.055. [16] 郑夏, 马良. 一种Fibonacci迭代的差分进化生物地理学优化算法[J]. 小型微型计算机系统, 2019, 40(7): 1391-1396. DOI: 10.3969/j.issn.1000-1220.2019.07.007.ZHENG X, MA L. Fibonacci iteration differential evolution biogeography-based optimization algorithm[J]. Journal of Chinese Computer Systems, 2019, 40(7): 1391-1396. DOI: 10.3969/j.issn.1000-1220.2019.07.007. -