Extracting problem and method sentence from scientific papers: a context-enhanced transformer using formulaic expression desensitization

Scientometrics | IF 3.5 | CAS Tier 3 (Management Science) | JCR Q2 (Computer Science, Interdisciplinary Applications) | Pub Date: 2024-05-27 | DOI: 10.1007/s11192-024-05048-6
Yingyi Zhang, Chengzhi Zhang
{"title":"从科学论文中提取问题句和方法句:使用公式化表达脱敏的语境增强转换器","authors":"Yingyi Zhang, Chengzhi Zhang","doi":"10.1007/s11192-024-05048-6","DOIUrl":null,"url":null,"abstract":"<p>Billions of scientific papers lead to the need to identify essential parts from the massive text. Scientific research is an activity from putting forward problems to using methods. To learn the main idea from scientific papers, we focus on extracting problem and method sentences. Annotating sentences within scientific papers is labor-intensive, resulting in small-scale datasets that limit the amount of information models can learn. This limited information leads models to rely heavily on specific forms, which in turn reduces their generalization capabilities. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters to generate synthetic data and reduce models’ reliance on FEs. For the third idea, we propose a context-enhanced transformer that utilizes context to measure the importance of words in target sentences and to reduce noise in the context. Furthermore, this paper conducts experiments using large language model (LLM) based in-context learning (ICL) methods. Quantitative and qualitative experiments demonstrate that our proposed models achieve a higher macro F<sub>1</sub> score compared to the baseline models on two scientific paper datasets, with improvements of 3.71% and 2.67%, respectively. The LLM based ICL methods are found to be not suitable for the task of problem and method extraction.</p>","PeriodicalId":21755,"journal":{"name":"Scientometrics","volume":null,"pages":null},"PeriodicalIF":3.5000,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Extracting problem and method sentence from scientific papers: a context-enhanced transformer using formulaic expression desensitization\",\"authors\":\"Yingyi Zhang, Chengzhi Zhang\",\"doi\":\"10.1007/s11192-024-05048-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Billions of scientific papers lead to the need to identify essential parts from the massive text. Scientific research is an activity from putting forward problems to using methods. To learn the main idea from scientific papers, we focus on extracting problem and method sentences. Annotating sentences within scientific papers is labor-intensive, resulting in small-scale datasets that limit the amount of information models can learn. This limited information leads models to rely heavily on specific forms, which in turn reduces their generalization capabilities. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters to generate synthetic data and reduce models’ reliance on FEs. For the third idea, we propose a context-enhanced transformer that utilizes context to measure the importance of words in target sentences and to reduce noise in the context. 
Furthermore, this paper conducts experiments using large language model (LLM) based in-context learning (ICL) methods. Quantitative and qualitative experiments demonstrate that our proposed models achieve a higher macro F<sub>1</sub> score compared to the baseline models on two scientific paper datasets, with improvements of 3.71% and 2.67%, respectively. The LLM based ICL methods are found to be not suitable for the task of problem and method extraction.</p>\",\"PeriodicalId\":21755,\"journal\":{\"name\":\"Scientometrics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2024-05-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Scientometrics\",\"FirstCategoryId\":\"91\",\"ListUrlMain\":\"https://doi.org/10.1007/s11192-024-05048-6\",\"RegionNum\":3,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientometrics","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1007/s11192-024-05048-6","RegionNum":3,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

The enormous volume of scientific papers creates a need to identify the essential parts of massive texts. Scientific research is an activity that moves from posing problems to applying methods. To capture the main idea of a scientific paper, we focus on extracting problem and method sentences. Annotating sentences within scientific papers is labor-intensive, resulting in small-scale datasets that limit the amount of information models can learn. This limited information leads models to rely heavily on specific surface forms, which in turn reduces their generalization capability. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters that generate synthetic data and reduce models' reliance on FEs. For the third idea, we propose a context-enhanced transformer that uses context to measure the importance of words in target sentences and to reduce noise in the context. Furthermore, this paper conducts experiments with large language model (LLM) based in-context learning (ICL) methods. Quantitative and qualitative experiments demonstrate that our proposed models achieve a higher macro F1 score than the baseline models on two scientific paper datasets, with improvements of 3.71% and 2.67%, respectively. The LLM based ICL methods are found to be unsuitable for the task of problem and method sentence extraction.
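The FE desensitization idea in the abstract lends itself to a small illustration. The sketch below is a minimal Python example written for this page, not the authors' code: it masks known formulaic cue phrases with a generic placeholder to produce synthetic training pairs. The FE list, the [FE] token, and all function names are illustrative assumptions.

```python
import re

# Hypothetical inventory of formulaic expressions (FEs): cue phrases that
# often mark problem or method sentences. The paper relies on a real FE
# resource; these entries are placeholders for illustration only.
FORMULAIC_EXPRESSIONS = [
    "in this paper, we propose",
    "we present",
    "this paper addresses",
    "the main problem is",
]

def desensitize(sentence: str, placeholder: str = "[FE]") -> str:
    """Replace any known FE in the sentence with a placeholder token,
    so a classifier can no longer key on the FE's surface form."""
    result = sentence
    for fe in FORMULAIC_EXPRESSIONS:
        # Case-insensitive literal match of the FE anywhere in the sentence.
        result = re.sub(re.escape(fe), placeholder, result, flags=re.IGNORECASE)
    return result

def augment(dataset):
    """Return the original (sentence, label) pairs plus one FE-desensitized
    copy of every sentence that actually contained an FE."""
    augmented = list(dataset)
    for sentence, label in dataset:
        masked = desensitize(sentence)
        if masked != sentence:
            augmented.append((masked, label))
    return augmented

if __name__ == "__main__":
    data = [("In this paper, we propose a graph-based ranking model.", "method")]
    for sentence, label in augment(data):
        print(label, "|", sentence)
    # method | In this paper, we propose a graph-based ranking model.
    # method | [FE] a graph-based ranking model.
```

Training on both the original and masked variants is one plausible way to grow the dataset while weakening a model's dependence on FE surface forms; the augmenters described in the paper may differ in detail.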



Source journal: Scientometrics (Management Science; Computer Science, Interdisciplinary Applications)

CiteScore: 7.20
Self-citation rate: 17.90%
Articles published: 351
Review time: 1.5 months
Journal description
Scientometrics aims at publishing original studies, short communications, preliminary reports, review papers, letters to the editor and book reviews on scientometrics. The topics covered are results of research concerned with the quantitative features and characteristics of science. Emphasis is placed on investigations in which the development and mechanism of science are studied by means of (statistical) mathematical methods. The Journal also provides the reader with important up-to-date information about international meetings and events in scientometrics and related fields. Appropriate bibliographic compilations are published as a separate section. Due to its fully interdisciplinary character, Scientometrics is indispensable to research workers and research administrators throughout the world. It provides valuable assistance to librarians and documentalists in central scientific agencies, ministries, research institutes and laboratories. Scientometrics includes the Journal of Research Communication Studies. Consequently its aims and scope cover that of the latter, namely, to bring the results of research investigations together in one place, in such a form that they will be of use not only to the investigators themselves but also to the entrepreneurs and research workers who form the object of these studies.
Latest articles in this journal
Through the secret gate: a study of member-contributed submissions in PNAS
Breach of academic values and misconduct: the case of Sci-Hub
Measuring the global and domestic technological impact of Chinese scientific output: a patent-to-paper citation analysis of science-technology linkage
Evolving patterns of extreme publishing behavior across science
Automated taxonomy alignment via large language models: bridging the gap between knowledge domains