Sung Hwan Jeon, Hyeonguk Lee, Jihye Park, Sungzoon Cho
{"title":"使用命名实体识别和边缘权重更新神经网络从技术文档中构建知识图谱,并利用三重损失实现实体规范化","authors":"Sung Hwan Jeon, Hyeonguk Lee, Jihye Park, Sungzoon Cho","doi":"10.3233/ida-227129","DOIUrl":null,"url":null,"abstract":"Attempts to express information from various documents in graph form are rapidly increasing. The speed and volume in which these documents are being generated call for an automated process, based on machine learning techniques, for cost-effective and timely analysis. Past studies responded to such needs by building knowledge graphs or technology trees from the bibliographic information of documents, or by relying on text mining techniques in order to extract keywords and/or phrases. While these approaches provide an intuitive glance into the technological hotspots or the key features of the select field, there still is room for improvement, especially in terms of recognizing the same entities appearing in different forms so as to interconnect closely related technological concepts properly. In this paper, we propose to build a patent knowledge network using the United States Patent and Trademark Office (USPTO) patent filings for the semiconductor device sector by fine-tuning Huggingface’s named entity recognition (NER) model with our novel edge weight updating neural network. For the named entity normalization, we employ edge weight updating neural network with positive and negative candidates that are chosen by substring matching techniques. Experiment results show that our proposed approach performs very competitively against the conventional keyword extraction models frequently employed in patent analysis, especially for the named entity normalization (NEN) and document retrieval tasks. By grouping entities with named entity normalization model, the resulting knowledge graph achieves higher scores in retrieval tasks. We also show that our model is robust to the out-of-vocabulary problem by employing the fine-tuned BERT NER model.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"173 1","pages":""},"PeriodicalIF":0.9000,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Building knowledge graphs from technical documents using named entity recognition and edge weight updating neural network with triplet loss for entity normalization\",\"authors\":\"Sung Hwan Jeon, Hyeonguk Lee, Jihye Park, Sungzoon Cho\",\"doi\":\"10.3233/ida-227129\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Attempts to express information from various documents in graph form are rapidly increasing. The speed and volume in which these documents are being generated call for an automated process, based on machine learning techniques, for cost-effective and timely analysis. Past studies responded to such needs by building knowledge graphs or technology trees from the bibliographic information of documents, or by relying on text mining techniques in order to extract keywords and/or phrases. While these approaches provide an intuitive glance into the technological hotspots or the key features of the select field, there still is room for improvement, especially in terms of recognizing the same entities appearing in different forms so as to interconnect closely related technological concepts properly. In this paper, we propose to build a patent knowledge network using the United States Patent and Trademark Office (USPTO) patent filings for the semiconductor device sector by fine-tuning Huggingface’s named entity recognition (NER) model with our novel edge weight updating neural network. For the named entity normalization, we employ edge weight updating neural network with positive and negative candidates that are chosen by substring matching techniques. Experiment results show that our proposed approach performs very competitively against the conventional keyword extraction models frequently employed in patent analysis, especially for the named entity normalization (NEN) and document retrieval tasks. By grouping entities with named entity normalization model, the resulting knowledge graph achieves higher scores in retrieval tasks. We also show that our model is robust to the out-of-vocabulary problem by employing the fine-tuned BERT NER model.\",\"PeriodicalId\":50355,\"journal\":{\"name\":\"Intelligent Data Analysis\",\"volume\":\"173 1\",\"pages\":\"\"},\"PeriodicalIF\":0.9000,\"publicationDate\":\"2023-11-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Intelligent Data Analysis\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.3233/ida-227129\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Data Analysis","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.3233/ida-227129","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
摘要
以图表形式表达各种文件信息的尝试正在迅速增加。这些文档生成的速度和数量要求基于机器学习技术的自动化流程,以便进行经济高效的及时分析。过去的研究通过从文件的书目信息中建立知识图谱或技术树,或依靠文本挖掘技术提取关键词和/或短语来满足这些需求。虽然这些方法能让人直观地了解技术热点或所选领域的关键特征,但仍有改进的余地,特别是在识别以不同形式出现的相同实体,从而将密切相关的技术概念正确地联系起来方面。在本文中,我们建议利用美国专利商标局(USPTO)半导体设备领域的专利申请,通过微调 Huggingface 的命名实体识别(NER)模型和我们新颖的边缘权重更新神经网络来构建专利知识网络。在命名实体规范化方面,我们采用边缘权重更新神经网络,通过子串匹配技术选择正负候选实体。实验结果表明,与专利分析中常用的传统关键词提取模型相比,我们提出的方法具有很强的竞争力,尤其是在命名实体规范化(NEN)和文档检索任务方面。通过命名实体归一化模型对实体进行分组,生成的知识图谱在检索任务中获得了更高的分数。我们还表明,通过采用微调 BERT NER 模型,我们的模型对词汇表外问题具有很强的鲁棒性。
Building knowledge graphs from technical documents using named entity recognition and edge weight updating neural network with triplet loss for entity normalization
Attempts to express information from various documents in graph form are rapidly increasing. The speed and volume in which these documents are being generated call for an automated process, based on machine learning techniques, for cost-effective and timely analysis. Past studies responded to such needs by building knowledge graphs or technology trees from the bibliographic information of documents, or by relying on text mining techniques in order to extract keywords and/or phrases. While these approaches provide an intuitive glance into the technological hotspots or the key features of the select field, there still is room for improvement, especially in terms of recognizing the same entities appearing in different forms so as to interconnect closely related technological concepts properly. In this paper, we propose to build a patent knowledge network using the United States Patent and Trademark Office (USPTO) patent filings for the semiconductor device sector by fine-tuning Huggingface’s named entity recognition (NER) model with our novel edge weight updating neural network. For the named entity normalization, we employ edge weight updating neural network with positive and negative candidates that are chosen by substring matching techniques. Experiment results show that our proposed approach performs very competitively against the conventional keyword extraction models frequently employed in patent analysis, especially for the named entity normalization (NEN) and document retrieval tasks. By grouping entities with named entity normalization model, the resulting knowledge graph achieves higher scores in retrieval tasks. We also show that our model is robust to the out-of-vocabulary problem by employing the fine-tuned BERT NER model.
期刊介绍:
Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are preferred that discuss development of new AI related data analysis architectures, methodologies, and techniques and their applications to various domains.