{"title":"Are the confidence scores of reviewers consistent with the review content? Evidence from top conference proceedings in AI","authors":"Wenqing Wu, Haixu Xi, Chengzhi Zhang","doi":"10.1007/s11192-024-05070-8","DOIUrl":null,"url":null,"abstract":"<p>Peer review is a critical process used in academia to assess the quality and validity of research articles. Top-tier conferences in the field of artificial intelligence (e.g. ICLR and ACL et al.) require reviewers to provide confidence scores to ensure the reliability of their review reports. However, existing studies on confidence scores have neglected to measure the consistency between the comment text and the confidence score in a more refined way, which may overlook more detailed details (such as aspects) in the text, leading to incomplete understanding of the results and insufficient objective analysis of the results. In this work, we propose assessing the consistency between the textual content of the review reports and the assigned scores at a fine-grained level, including word, sentence and aspect levels. The data used in this paper is derived from the peer review comments of conferences in the fields of deep learning and natural language processing. We employed deep learning models to detect hedge sentences and their corresponding aspects. Furthermore, we conducted statistical analyses of the length of review reports, frequency of hedge word usage, number of hedge sentences, frequency of aspect mentions, and their associated sentiment to assess the consistency between the textual content and confidence scores. Finally, we performed correlation analysis, significance tests and regression analysis on the data to examine the impact of confidence scores on the outcomes of the papers. The results indicate that textual content of the review reports and their confidence scores have high level of consistency at the word, sentence, and aspect levels. The regression results reveal a negative correlation between confidence scores and paper outcomes, indicating that higher confidence scores given by reviewers were associated with paper rejection. This indicates that current overall assessment of the paper’s content and quality by the experts is reliable, making the transparency and fairness of the peer review process convincing. We release our data and associated codes at https://github.com/njust-winchy/confidence_score.</p>","PeriodicalId":21755,"journal":{"name":"Scientometrics","volume":"62 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientometrics","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1007/s11192-024-05070-8","RegionNum":3,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0
Abstract
Peer review is a critical process used in academia to assess the quality and validity of research articles. Top-tier conferences in artificial intelligence (e.g., ICLR and ACL) require reviewers to provide confidence scores to ensure the reliability of their review reports. However, existing studies of confidence scores have not measured the consistency between the review text and the confidence score in a fine-grained way; this can overlook finer details in the text (such as aspects) and lead to an incomplete understanding and an insufficiently objective analysis of the results. In this work, we propose assessing the consistency between the textual content of review reports and the assigned confidence scores at a fine-grained level, including the word, sentence and aspect levels. The data used in this paper are derived from peer review comments of conferences in the fields of deep learning and natural language processing. We employed deep learning models to detect hedge sentences and their corresponding aspects. Furthermore, we conducted statistical analyses of the length of review reports, the frequency of hedge word usage, the number of hedge sentences, the frequency of aspect mentions, and their associated sentiment to assess the consistency between the textual content and the confidence scores. Finally, we performed correlation analysis, significance tests and regression analysis on the data to examine the impact of confidence scores on paper outcomes. The results indicate that the textual content of review reports and their confidence scores are highly consistent at the word, sentence, and aspect levels. The regression results reveal a negative correlation between confidence scores and paper outcomes: higher confidence scores given by reviewers were associated with paper rejection. This suggests that the experts' overall assessment of a paper's content and quality is reliable, making the transparency and fairness of the peer review process convincing. We release our data and associated code at https://github.com/njust-winchy/confidence_score.
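To make the kind of analysis described above concrete, here is a minimal sketch, not the authors' released pipeline: it assumes a hypothetical CSV with one row per review and illustrative column names ('confidence', 'review_text', 'accepted'), and it substitutes a small hedge-word lexicon for the deep learning hedge detector used in the paper. It shows a word-level consistency check (correlating confidence scores with hedge-word counts) and a logistic regression of paper outcome on confidence score in the spirit of the reported regression analysis.

```python
# Illustrative sketch only; column names, file name, and the hedge lexicon are assumptions.
import pandas as pd
from scipy.stats import spearmanr
import statsmodels.api as sm

# Tiny stand-in lexicon; the paper uses deep learning models to detect hedges.
HEDGE_WORDS = {"may", "might", "could", "possibly", "perhaps",
               "likely", "somewhat", "appears", "seems", "suggest"}

def hedge_word_count(text: str) -> int:
    """Count lexicon hedge words in a review (rough word-level proxy)."""
    tokens = text.lower().split()
    return sum(tok.strip(".,;:!?") in HEDGE_WORDS for tok in tokens)

reviews = pd.read_csv("reviews.csv")  # hypothetical file of review comments
reviews["hedge_count"] = reviews["review_text"].apply(hedge_word_count)
reviews["length"] = reviews["review_text"].str.split().str.len()

# Word-level consistency: do higher-confidence reviews contain fewer hedges?
rho, p_value = spearmanr(reviews["confidence"], reviews["hedge_count"])
print(f"Spearman rho (confidence vs. hedge count): {rho:.3f} (p = {p_value:.4f})")

# Outcome analysis: logistic regression of acceptance on confidence score,
# controlling for review length.
X = sm.add_constant(reviews[["confidence", "length"]])
model = sm.Logit(reviews["accepted"], X).fit(disp=0)
print(model.summary())
```

A negative, significant coefficient on 'confidence' in such a regression would correspond to the paper's finding that higher reviewer confidence is associated with rejection; the actual data and code are available at the GitHub repository linked above.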
About the journal:
Scientometrics aims at publishing original studies, short communications, preliminary reports, review papers, letters to the editor and book reviews on scientometrics. The topics covered are results of research concerned with the quantitative features and characteristics of science. Emphasis is placed on investigations in which the development and mechanism of science are studied by means of (statistical) mathematical methods.
The Journal also provides the reader with important up-to-date information about international meetings and events in scientometrics and related fields. Appropriate bibliographic compilations are published as a separate section. Due to its fully interdisciplinary character, Scientometrics is indispensable to research workers and research administrators throughout the world. It provides valuable assistance to librarians and documentalists in central scientific agencies, ministries, research institutes and laboratories.
Scientometrics includes the Journal of Research Communication Studies. Consequently, its aims and scope cover those of the latter, namely, to bring the results of research investigations together in one place, in such a form that they will be of use not only to the investigators themselves but also to the entrepreneurs and research workers who form the object of these studies.