Citation: ZHAO Tong, LIU Bin, LI Tao. Research on Parallel Text Classification System Based on Non-Balanced LSH[J]. Microelectronics & Computer, 2017, 34(12): 67-73.

Research on Parallel Text Classification System Based on Non-Balanced LSH

  • Abstract: To address the low efficiency of the K-Nearest Neighbors (KNN) classification algorithm on massive text collections, a hyperplane-based non-balanced locality-sensitive hashing (LSH) classification algorithm is proposed. Compared with the traditional LSH algorithm, it significantly improves classification accuracy and real-time performance. To further reduce execution time and improve classification efficiency, the algorithm is combined with the Spark parallel computing model, and an efficient parallel text classification system is implemented on the Hadoop big-data processing platform. Experimental results show that the system achieves high classification speed while maintaining high classification accuracy.
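As a rough illustration of the general idea, the sketch below uses conventional random-hyperplane LSH to shortlist candidates before a KNN vote. It is not the non-balanced variant proposed in the paper, whose details the abstract does not give, and it runs on a single machine rather than on Spark. The class name `HyperplaneLSHClassifier`, its parameters, and the toy vectors and labels are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class HyperplaneLSHClassifier:
    """Hypothetical KNN classifier that searches only one LSH bucket."""

    def __init__(self, dim, n_planes=8, k=3, seed=0):
        rng = random.Random(seed)
        # Each random Gaussian vector is the normal of one hyperplane.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.k = k
        self.buckets = defaultdict(list)

    def signature(self, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
            for plane in self.planes
        )

    def fit(self, vectors, labels):
        for vec, label in zip(vectors, labels):
            self.buckets[self.signature(vec)].append((vec, label))
        return self

    def predict(self, vec):
        if not self.buckets:
            raise ValueError("classifier has not been fitted")
        candidates = self.buckets.get(self.signature(vec))
        if not candidates:
            # Assumed fallback: scan all stored vectors if the bucket is empty.
            candidates = [item for b in self.buckets.values() for item in b]
        # Exact KNN vote, restricted to the shortlisted candidates.
        nearest = sorted(candidates, key=lambda item: cosine(vec, item[0]),
                         reverse=True)[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy term-frequency vectors; the labels are made up for illustration.
train = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 2]]
labels = ["sports", "sports", "finance", "finance"]
clf = HyperplaneLSHClassifier(dim=4, n_planes=4, k=1).fit(train, labels)
print(clf.predict([3, 1, 0, 0]))
```

Because only the vectors sharing the query's signature are compared exactly, the cost of each query depends on the bucket size rather than on the full training set. This is the efficiency gain that LSH-based KNN aims for, at the price of occasionally missing true neighbors that hash to a different bucket.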

     
