

Multi-label and multi-level Chinese patent classification based on semantic matching

WANG Wenchuan, ZHU Quanyin, SUN Jizhou, MA Jialin

Citation: WANG Wenchuan, ZHU Quanyin, SUN Jizhou, MA Jialin. Multi-label and multi-level Chinese patent classification based on semantic matching[J]. Microelectronics & Computer, 2022, 39(4): 91-99. doi: 10.19304/J.ISSN1000-7180.2021.1083

doi: 10.19304/J.ISSN1000-7180.2021.1083
Funding:

National Natural Science Foundation of China Youth Program (62002131)

Jiangsu Province Innovation and Entrepreneurship ("Shuangchuang") Program (JSSCBS220211179)

    About the authors:

    WANG Wenchuan    male, born in 1996, M.S. candidate. His research interests include data mining and natural language processing.

    SUN Jizhou    male, born in 1985, Ph.D., lecturer. His research interests include data quality, big data computing, and database management systems.

    MA Jialin    male, born in 1981, Ph.D., master's supervisor. His research interests include big data mining, natural language processing, and machine learning.

    Corresponding author:

    ZHU Quanyin    male, born in 1966, professor, master's supervisor. His research interests include intelligent information processing, interfaces and communication, and data mining. E-mail: zqy@hyit.edu.cn

  • CLC number: TP391


  • Abstract:

    As China's 14th Five-Year Plan calls for protecting and incentivizing the creation of more high-value domestic patents, applications for interdisciplinary, cross-domain innovative patents have surged, and the demand for automatic classification methods to assist manual patent classification keeps growing. At present, Chinese patents are classified mainly by examiners who manually match the content of a submitted patent against the International Patent Classification (IPC) scheme, which is inefficient. Existing automatic patent classification methods mostly extract structural and semantic features from the patent text and match these features directly against the labels of the IPC scheme by similarity; they ignore the semantic information in the explanatory text attached to each IPC label, which easily leads to ambiguous classification. To address this, a multi-label and multi-level Chinese patent classification method based on semantic matching is proposed, which recasts the traditional text classification problem as a text matching problem over semantic features, so as to carry out multi-label, multi-level classification of patent text. Semantic features are extracted for each label at each level of the IPC scheme (section, class, subclass, main group, and subgroup), textual semantic features are extracted from published patents, and the two are matched semantically to achieve automatic classification. Experimental results on the same dataset show that the proposed method achieves better performance.
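    To make the matching idea concrete, here is a minimal sketch, not the paper's PSSM model: embed the IPC label explanation text and the patent text separately, then rank labels by cosine similarity. The encoder name, the IPC snippets, and the example patent text below are illustrative assumptions.

    from sentence_transformers import SentenceTransformer, util

    # Any multilingual sentence encoder works for this sketch; the model name is an assumption.
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # Toy stand-ins for IPC label explanation text (real IPC entries are longer).
    ipc_labels = {
        "G06F": "Electric digital data processing",
        "G06N": "Computing arrangements based on specific computational models",
        "H04L": "Transmission of digital information",
    }
    patent_text = "A neural-network method for routing encrypted packets between data centres."

    label_vecs = encoder.encode(list(ipc_labels.values()), convert_to_tensor=True)
    patent_vec = encoder.encode(patent_text, convert_to_tensor=True)

    # Rank every IPC label by cosine similarity to the patent text; labels whose
    # score clears a tuned threshold form the multi-label prediction.
    scores = util.cos_sim(patent_vec, label_vecs)[0]
    ranked = sorted(zip(ipc_labels.keys(), scores.tolist()), key=lambda kv: -kv[1])
    print(ranked)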

     

  • Figure 1.  Multi-label and multi-level semantic matching framework

    Figure 2.  Pseudo-Siamese network model based on semantic matching

    Figure 3.  Example of BERT-trained word-embedding representation

    Figure 4.  An encoder block of the Transformer

    Figure 5.  IPC hierarchical coding convolutional network
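    The pseudo-Siamese design of Figure 2 can be summarized in a few lines: unlike a standard Siamese network, the two towers do not share weights, since patent text and IPC explanation text come from different distributions. Below is a minimal PyTorch sketch assuming GRU towers with mean pooling; the paper's actual towers are BERT- and CNN-based, and all sizes here are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Tower(nn.Module):
        """One encoding branch: embedding + GRU + mean pooling."""
        def __init__(self, vocab_size: int, dim: int):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            states, _ = self.rnn(self.emb(token_ids))
            return states.mean(dim=1)  # pool token states into one sentence vector

    class PseudoSiamese(nn.Module):
        """Two towers with UNSHARED weights, one per text type."""
        def __init__(self, vocab_size: int = 30000, dim: int = 128):
            super().__init__()
            self.patent_tower = Tower(vocab_size, dim)
            self.label_tower = Tower(vocab_size, dim)

        def forward(self, patent_ids, label_ids):
            # Matching score per (patent, label) pair; trained against a 0/1 match target.
            return F.cosine_similarity(
                self.patent_tower(patent_ids), self.label_tower(label_ids)
            )

    model = PseudoSiamese()
    score = model(torch.randint(0, 30000, (2, 50)), torch.randint(0, 30000, (2, 20)))
    print(score.shape)  # torch.Size([2])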

    Table 1.  Experiment environment

    Environment            Environmental parameters
    Development language   Python 3.8
    Development tool       PyCharm Community Edition
    Operating system       Windows 10, 64-bit
    CPU                    Intel Core i5-7500 @ 3.4 GHz
    GPU                    NVIDIA GeForce RTX 2060 Super, 8 GB
    Memory                 12 GB

    Table 2.  Experiments based on different text similarity models

    Model               Accuracy (%)   Recall (%)   F1 (%)
    ABCNN[24]           80.27          79.49        81.26
    BiMPM[20]           84.92          83.01        84.79
    Siamese-LSTM[25]    84.92          85.52        85.33
    Sentence-BERT[2]    86.54          86.63        87.05
    ESIM[26]            88.39          87.83        88.86
    PSSM                90.94          91.73        91.61

    Table 3.  Experiments for multi-label and multi-level classification based on our method

    Level                                                 Accuracy (%)   Recall (%)   F1 (%)
    Section                                               94.23          94.68        95.29
    Section + class                                       92.57          92.87        93.10
    Section + class + subclass                            92.54          92.65        92.73
    Section + class + subclass + main group               90.39          91.85        91.79
    Section + class + subclass + main group + subgroup    90.94          91.73        91.61
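    A note on reading the two tables above: the accuracy column does not enter the F1 computation. Assuming the standard definition (the page does not restate it), F1 is the harmonic mean of precision $P$ and recall $R$, which is why the F1 column is not bounded by the two columns listed:

    $$ F_1 = \frac{2\,P\,R}{P + R} $$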
  • [1] LV L C, HAN T, ZHOU J, et al. Research on the method of Chinese patent automatic classification based on deep learning[J]. Library and Information Service, 2020, 64(10): 75-85. DOI: 10.13266/j.issn.0252-3116.2020.10.009.
    [2] REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, 2019: 3982-3992. DOI: 10.18653/v1/D19-1410.
    [3] ZHANG M L, ZHOU Z H. ML-KNN: A lazy learning approach to multi-label learning[J]. Pattern Recognition, 2007, 40(7): 2038-2048. DOI: 10.1016/j.patcog.2006.12.019.
    [4] WANG J R, FENG J, SUN X, et al. Simplified constraints rank-svm for multi-label classification[M]//LI S T, LIU C L, WANG Y N. 6th Chinese Conference on Pattern Recognition. Changsha, China: Springer, 2014. DOI: 10.1007/978-3-662-45646-0_23.
    [5] LIU J Z, CHANG W C, WU Y X, et al. Deep learning for extreme multi-label text classification[C]//Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, United States: Association for Computing Machinery, 2017. DOI: 10.1145/3077136.3080834.
    [6] CHALKIDIS I, FERGADIOTIS M, MALAKASIOTIS P, et al. Large-scale multi-label text classification on EU legislation[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019: 6314-6322. DOI: 10.18653/v1/p19-1636.
    [7] BAKER S, KORHONEN A L. Initializing neural networks for hierarchical multi-label text classification[C]//BioNLP 2017. Vancouver, Canada: Association for Computational Linguistics, 2017: 307-315. DOI: 10.18653/v1/w17-2339.
    [8] HUANG W, CHEN E H, LIU Q, et al. Hierarchical multi-label text classification: An attention-based recurrent network approach[C]//Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York, NY, United States: Association for Computing Machinery, 2019. DOI: 10.1145/3357384.3357885.
    [9] PAL A, SELVAKUMAR M, SANKARASUBBU M. MAGNET: multi-label text classification using attention-based graph neural network[C]//Proceedings of the 12th International Conference on Agents and Artificial Intelligence. Valletta, Malta: ICAART, 2020. DOI: 10.5220/0008940304940505.
    [10] LE Q, MIKOLOV T. Distributed representations of sentences and documents[C]//Proceedings of the 31st International Conference on Machine Learning. Beijing: ACM, 2014. DOI: 10.5555/3044805.3045025.
    [11] PETERS M E, NEUMANN M, IYYER M, et al. Deep contextualized word representations[C]//Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1(Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, 2018. DOI: 10.18653/v1/N18-1202.
    [12] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf. [2020-09-25]
    [13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of Advances in Neural Information Processing Systems. 2017: 5998-6008.
    [14] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
    [15] LEE J S, HSIANG J. PatentBERT: patent classification with fine-tuning a pre-trained BERT model[J]. arXiv preprint arXiv:1906.02124, 2019.
    [16] TSENG H C, CHEN B, CHANG T H, et al. Integrating LSA-based hierarchical conceptual space and machine learning methods for leveling the readability of domain-specific texts[J]. Natural Language Engineering, 2019, 25(3): 331-361. DOI: 10.1017/S1351324919000093.
    [17] JELODAR H, WANG Y L, YUAN C, et al. Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey[J]. Multimedia Tools and Applications, 2019, 78(11): 15169-15211. DOI: 10.1007/s11042-018-6894-4.
    [18] HAN C C, LI L, LIU T T, et al. Approaches for semantic textual similarity[J]. Journal of East China Normal University (Natural Science), 2020(5): 95-112. DOI: 10.3969/j.issn.1000-5641.202091011.
    [19] FUKUDA H, GUNJI R, HASEGAWA T, et al. DSSM: Distributed streaming data sharing manager[J]. Sensors, 2021, 21(4): 1344. DOI: 10.3390/s21041344.
    [20] WANG Z G, HAMZA W, FLORIAN R. Bilateral multi-perspective matching for natural language sentences[C]//Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. Melbourne, Australia: IJCAI, 2017. DOI: 10.24963/ijcai.2017/579.
    [21] GAO J Y, XIAO C, GLASS L M, et al. COMPOSE: cross-modal pseudo-siamese network for patient trial matching[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, NY, United States: Association for Computing Machinery, 2020. DOI: 10.1145/3394486.3403123.
    [22] MA J H, LIU Y P, LIU Y D, et al. CGGA: text classification model based on CNN and parallel gating mechanism[J]. Journal of Chinese Computer Systems, 2021, 42(3): 516-521. https://www.cnki.com.cn/Article/CJFDTOTAL-XXWX202103012.htm
    [23] ZHAO Y, ZHU Q Y, HU R L, et al. Hybrid recommendation algorithm based on autoencoder and clustering[J]. Microelectronics & Computer, 2018, 35(11): 52-56. DOI: 10.19304/j.cnki.issn1000-7180.2018.11.011.
    [24] YIN W P, SCHÜTZE H, XIANG B, et al. ABCNN: attention-based convolutional neural network for modeling sentence pairs[J]. Transactions of the Association for Computational Linguistics, 2016, 4: 259-272. DOI: 10.1162/tacl_a_00097.
    [25] MUELLER J, THYAGARAJAN A. Siamese recurrent architectures for learning sentence similarity[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Phoenix, Arizona, USA: AAAI, 2016.
    [26] CHEN Q, ZHU X D, LING Z H, et al. Enhanced LSTM for natural language inference[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, 2017. DOI: 10.18653/v1/P17-1152.
Publication history
  • Received:  2021-09-10
  • Revised:  2021-10-11
  • Published online:  2022-05-12
