Performance evaluation of seven multi-label classification methods on real-world patent and publication datasets

IF 1.5 | Zone 3 (Management) | Q2 Information Science & Library Science | Journal of Data and Information Science | Pub Date: 2024-05-27 | DOI: 10.2478/jdis-2024-0014
Shuo Xu, Yuefu Zhang, Xin An, Sainan Pi
Citation count: 0

Abstract

Purpose: Many science, technology and innovation (STI) resources carry several different labels. To assign such labels automatically to an instance of interest, many multi-label classification approaches that perform well on benchmark datasets have been proposed in the literature, and several open-source tools implementing these approaches have been developed. However, the characteristics of real-world multi-label patent and publication datasets are not fully in line with those of the benchmark datasets. The main purpose of this paper is therefore to evaluate seven multi-label classification methods comprehensively on real-world datasets.

Design/methodology/approach: Three real-world datasets (Biological-Sciences, Health-Sciences, and USPTO) are constructed from SciGraph and the USPTO database. Seven multi-label classification methods with tuned parameters (dependency-LDA, MLkNN, LabelPowerset, RAkEL, TextCNN, TextRNN, and TextRCNN) are compared on these three datasets. Performance is evaluated with three classification-based metrics: Macro-F1, Micro-F1, and Hamming Loss.

Findings: The TextCNN and TextRCNN models show clear superiority in terms of Macro-F1, Micro-F1, and Hamming Loss on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.

Research limitations: The three real-world datasets differ in the following aspects: statement, data quality, and purposes. In addition, open-source tools designed for multi-label classification differ intrinsically in their approaches to data processing and feature selection, which in turn affects the performance of a multi-label classification approach. In the near future, we will enhance experimental precision and reinforce the validity of the conclusions by controlling variables more rigorously through expanded parameter settings.

Practical implications: The Macro-F1 and Micro-F1 scores observed on real-world datasets typically fall short of those achieved on benchmark datasets, underscoring the complexity of real-world multi-label classification tasks. Approaches based on deep learning offer promising solutions by accommodating the hierarchical relationships and interdependencies among labels. With ongoing improvements in deep learning algorithms and large-scale models, the efficacy of multi-label classification is expected to improve significantly and to reach a level of practical utility in the foreseeable future.

Originality/value: (1) Seven multi-label classification methods are comprehensively compared on three real-world datasets. (2) The TextCNN and TextRCNN models perform better on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. (3) The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.
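The three metrics named in the abstract can be illustrated on a toy example. The sketch below is not the authors' evaluation code: the function names, the four-document label matrix, and the convention of scoring a label with no true or predicted positives as F1 = 0 are all assumptions made for illustration. Macro-F1 averages the per-label F1 scores, so rare labels weigh as much as frequent ones; Micro-F1 pools the counts over all labels first; Hamming Loss is the fraction of document-label cells that are predicted wrongly.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # rows = documents, columns = labels, entries 0/1


def label_counts(y_true: Matrix, y_pred: Matrix) -> List[Tuple[int, int, int]]:
    """Return (true positives, false positives, false negatives) per label."""
    counts = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 here when the denominator is 0 (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores = [f1(*c) for c in label_counts(y_true, y_pred)]
    return sum(scores) / len(scores)


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """F1 computed from counts pooled over all labels."""
    counts = label_counts(y_true, y_pred)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of document-label cells where prediction and truth disagree."""
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / (len(y_true) * len(y_true[0]))


# Hypothetical toy data: 4 documents, 3 labels.
Y_TRUE = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Y_PRED = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

if __name__ == "__main__":
    print(f"Macro-F1:     {macro_f1(Y_TRUE, Y_PRED):.4f}")
    print(f"Micro-F1:     {micro_f1(Y_TRUE, Y_PRED):.4f}")
    print(f"Hamming Loss: {hamming_loss(Y_TRUE, Y_PRED):.4f}")
```

On this toy data the per-label F1 scores are 1, 1/2 and 2/3, so Macro-F1 is about 0.722, while the pooled counts (4 true positives, 1 false positive, 2 false negatives) give a Micro-F1 of 8/11, about 0.727. Three of the twelve cells are wrong, so Hamming Loss is 0.25. Higher is better for both F1 scores and lower is better for Hamming Loss.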
Source journal
Journal of Data and Information Science (Information Science & Library Science)
CiteScore: 3.50
Self-citation rate: 6.70%
Number of articles published: 495
About the journal: JDIS is devoted to the study and application of theories, methods, techniques, services, and infrastructural facilities that use big data to support knowledge discovery for decision and policy making. Its basic emphasis is on work that is big-data-based, analytics-centered, knowledge-discovery-driven, and supportive of decision making. Special effort goes to knowledge discovery that detects and predicts structures, trends, behaviors, relations, evolutions, and disruptions in research, innovation, business, politics, security, media and communications, and social development, where the big data may include metadata or full-content data, text or non-textual data, structured or unstructured data, domain-specific or cross-domain data, and dynamic or interactive data.

The main areas of interest are:
(1) New theories, methods, and techniques of big-data-based data mining, knowledge discovery, and informatics, including but not limited to scientometrics, communication analysis, social network analysis, tech and industry analysis, competitive intelligence, knowledge mapping, evidence-based policy analysis, and predictive analysis.
(2) New methods, architectures, and facilities to develop or improve knowledge infrastructure capable of supporting knowledge organization and sophisticated analytics, including but not limited to ontology construction, knowledge organization, semantic linked data, knowledge integration and fusion, semantic retrieval, domain-specific knowledge infrastructure, and semantic sciences.
(3) New mechanisms, methods, and tools to embed knowledge analytics and knowledge discovery into actual operation, service, or managerial processes, including but not limited to knowledge-assisted scientific discovery and data-mining-driven intelligent workflows in learning, communications, and management.

Specific topic areas may include: knowledge organization; knowledge discovery and data mining; knowledge integration and fusion; Semantic Web metrics; scientometrics; analytic and diagnostic informetrics; competitive intelligence; predictive analysis; social network analysis and metrics; semantic and interactively analytic retrieval; evidence-based policy analysis; intelligent knowledge production; knowledge-driven workflow management and decision-making; knowledge-driven collaboration and its management; domain knowledge infrastructure with knowledge fusion and analytics; development of data and information services.