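Automated recognition of innovative sentences in academic articles: semi-automatic annotation for cost reduction and SAO reconstruction for enhanced data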

IF 3.5 | Zone 3, Management | Q2 Computer Science, Interdisciplinary Applications | Scientometrics | Pub Date: 2024-08-01 | DOI: 10.1007/s11192-024-05114-z

Biao Zhang, Yunwei Chen


Citations: 0

Abstract



Research on innovative content within academic articles plays a vital role in exploring the frontiers of scientific and technological innovation while facilitating the integration of scientific and technological evaluation into academic discourse. To efficiently gather the latest innovative concepts, it is essential to accurately recognize innovative sentences within academic articles. Although several supervised methods for classifying article sentences exist, such as citation function sentences, future work sentences, and formal citation sentences, most of these methods rely on manual annotations or rule-based matching to construct datasets, often neglecting an in-depth exploration of model performance enhancement. To address the limitations of existing research in this domain, this study introduces a semi-automatic annotation method for innovative sentences (IS) with the assistance of expert comments information and proposes a data augmentation method by SAO reconstruction to augment the training dataset. Within this paper, we compared and analyzed the effectiveness of multiple algorithms for recognizing IS within academic articles. This study utilized the full text of academic articles as the research subject and employed the semi-automatic method to annotate IS for creating the training dataset. Then, this study validated the effectiveness of the semi-automatic annotation method through manual inspection and compared it with rule-based annotation methods. Additionally, the impacts of different augmentation ratios on model performance were also explored. The empirical results reveal the following: (1) The semi-automatic annotation method proposed in this study achieves an accuracy rate of 0.87239, ensuring the validity of annotated data while reducing the manual annotation cost. (2) The SAO reconstruction for data augmentation method significantly improved the accuracy of machine learning and deep learning algorithms in the recognition of IS. (3) When the augmentation ratio in the training set was set to 50%, the trained GPT-2 model was superior to other algorithms, achieving an ACC of 0.97883 in the test set and an F1 score of 0.95505 in practical application.
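The abstract does not say how SAO reconstruction is carried out or how the augmentation ratio is defined, so the following is only a toy sketch under stated assumptions, not the authors' method. It assumes that subject-action-object (SAO) triples have already been extracted from labeled sentences, that "reconstruction" means re-rendering a triple through a different template (here a naive passive-voice template), and that a ratio of 0.5 means adding reconstructed sentences equal to 50% of the original sample count. The names `SAO`, `to_active`, `to_passive`, `augment`, and `EXAMPLE` are hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical representation: (subject, action as past participle, object).
SAO = Tuple[str, str, str]


def _capitalize(text: str) -> str:
    """Upper-case only the first character, leaving the rest untouched."""
    return text[:1].upper() + text[1:]


def to_active(triple: SAO) -> str:
    """Render an SAO triple with a simple active-voice template."""
    subject, action, obj = triple
    return f"{_capitalize(subject)} {action} {obj}."


def to_passive(triple: SAO) -> str:
    """Render the same triple with a naive passive-voice template (assumption)."""
    subject, action, obj = triple
    return f"{_capitalize(obj)} was {action} by {subject}."


def augment(
    samples: List[Tuple[SAO, int]], ratio: float, seed: int = 0
) -> List[Tuple[str, int]]:
    """Return the original sentences plus round(ratio * n) reconstructed ones.

    Labels are copied unchanged to the reconstructed sentences.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    originals = [(to_active(t), label) for t, label in samples]
    k = round(ratio * len(samples))
    chosen = rng.sample(samples, k)  # sample without replacement
    reconstructed = [(to_passive(t), label) for t, label in chosen]
    return originals + reconstructed


# Made-up example data: label 1 = innovative sentence, 0 = other.
EXAMPLE: List[Tuple[SAO, int]] = [
    (("this study", "proposed", "a semi-automatic annotation method"), 1),
    (("we", "introduced", "a data augmentation method"), 1),
    (("the authors", "collected", "full-text articles"), 0),
    (("prior work", "used", "rule-based matching"), 0),
]

if __name__ == "__main__":
    for sentence, label in augment(EXAMPLE, ratio=0.5):
        print(label, sentence)
```

With four labeled triples and a ratio of 0.5, the sketch returns the four original sentences plus two reconstructed ones. A real pipeline would need an SAO extractor (for example, one built on a dependency parser) and grammatical checks on the generated text, both of which are out of scope here.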


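Source journal

Scientometrics (Management Science; Computer Science, Interdisciplinary Applications)

CiteScore: 7.20
Self-citation rate: 17.90%
Articles published: 351
Review time: 1.5 months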

About the journal: Scientometrics aims at publishing original studies, short communications, preliminary reports, review papers, letters to the editor and book reviews on scientometrics. The topics covered are results of research concerned with the quantitative features and characteristics of science. Emphasis is placed on investigations in which the development and mechanism of science are studied by means of (statistical) mathematical methods. The Journal also provides the reader with important up-to-date information about international meetings and events in scientometrics and related fields. Appropriate bibliographic compilations are published as a separate section. Due to its fully interdisciplinary character, Scientometrics is indispensable to research workers and research administrators throughout the world. It provides valuable assistance to librarians and documentalists in central scientific agencies, ministries, research institutes and laboratories. Scientometrics includes the Journal of Research Communication Studies. Consequently its aims and scope cover that of the latter, namely, to bring the results of research investigations together in one place, in such a form that they will be of use not only to the investigators themselves but also to the entrepreneurs and research workers who form the object of these studies.