{"title":"人工智能的开放研究与在可再现性中寻找共同点——评“(为什么)开放研究实践是语言学习研究的未来?”","authors":"Odd Erik Gundersen, Kevin Coakley","doi":"10.1111/lang.12582","DOIUrl":null,"url":null,"abstract":"<p>Open research has a long tradition in the field of artificial intelligence (AI), which is our primary area of expertise. Richard Stallman, who has been affiliated with the AI laboratory at Massachusetts Institute of Technology since the early 1970s, launched the GNU project in 1983 and the Free Software Foundation in 1985. The goal of the free software movement has been to secure freedoms for software users to run, study, modify, and share software. GNU software grants these rights in licenses that enable anyone to read the code but also restrict anyone from changing the software without sharing these changes. The open data movement in AI was spearheaded by the Machine Learning Repository created in 1987 by David Aha and fellow graduate students at the University of California Irvine. This repository still hosts a collection of datasets that can be used for machine learning. One of the first digital-first scientific journals was the <i>Journal of Artificial Intelligence Research</i> (JAIR), established in 1993 on the initiative of Steven Minton. The journal is an open access, peer-reviewed scientific publication and has been community driven since its inception. It has no publishing fees, and all expenses have been covered by donations. Since it is hosted online, it supports publishing digital source material, such as code and data.</p><p>AI research is a young science that is continuously seeking to improve research methodology and the quality of the published research. Although there currently is a movement towards publishing research in journals, a substantial number of scientific articles in AI are still published through conference proceedings. The conferences with the highest impact, such as those of the Association for the Advancement of Artificial Intelligence, Neural Information Processing Systems, International Conference on Machine Learning, and International Joint Conference on Artificial Intelligence, are community driven, and the articles presented and published in these venues are open access. Some of the proceedings are published by the <i>Journal of Machine Learning Research</i>, established as an open access alternative to the journal <i>Machine Learning</i> in 2001 to allow authors to publish for free and retain copyright. All these venues also promote and facilitate public sharing of research artifacts.</p><p>Among many open research practices in our field of expertise, some of the most impactful have targeted research reproducibility. In this commentary, we have therefore focused on reproducibility, in the hopes that researchers in language sciences might benefit from the experience of AI scholars. One recent initiative in AI research involved reproducibility checklists introduced at all the most impactful AI conferences to improve the rigor of the research presented and published there. These checklists must be completed by all authors when submitting articles to conferences, and they cover various aspects of research methodology, including whether data and code are shared. The checklists have been introduced as a response to the reproducibility crisis and in recognition of the field's challenges with methodological rigor. Reproducibility badges have also been introduced at several conferences and journals, and soon in JAIR as well (Gundersen, Helmert, & Hoos, <span>2023</span>). 
The badges indicate whether the research artifacts, such and data and code, that are required for reproducing the research have been shared. In some cases, reviewers evaluate the artifacts as well, which could earn the authors another badge if the reviewers are able to reproduce the research. However, this is a considerable task, recognized by many as too much to ask of reviewers. Instead, AI scholars now organize reproducibility challenges, with the idea of designating a separate track at a conference or a workshop where the goal is to attempt to reproduce a scientific article of choice and write a report on this effort. Some of these reports have been published in the community driven open access journal <i>ReScience C</i>. One issue with these initiatives is that the results of the replication efforts are not linked to the original scientific article. To address this shortcoming, a new procedure is currently being introduced at JAIR, where reports documenting the effort by third parties to reproduce research are published in the journal alongside the article that is being reproduced. This closes the gap between the reproducibility effort and the original work, in the sense that high quality research that is easy to reproduce will get credit, and readers will be made aware of research that is not easily reproducible. JAIR ensures that the original authors get to provide feedback on reproducibility reports and that any mistakes or misunderstandings by the third-party researchers are corrected.</p><p>One challenge to reproducibility is conceptual. The term reproducibility has been called confused by Plesser (<span>2018</span>), and we agree. Our belief is that the reason for this confusion is caused by the term being defined without trying to operationalize it at the same time. Hence, we have tried to define reproducibility in such a way that the concept becomes operationalizable. We use machine learning to illustrate our reasoning because it is our domain, which we know well, and because machine learning is a computer science, so most of the experiments can be fully described in code and automated. We believe this is a strength as it allows us to be specific about what an experiment is and what reproducibility then should mean. However, we think that this definition of reproducibility is generalizable to all sciences.</p><p>In Gundersen (<span>2021</span>), reproducibility is defined as “the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators” (p. 10). The documentation used to conduct a reproducibility experiment defines to which reproducibility type this experiment belongs, and the way the conclusion is reached determines to which degree the experiment reproduces the conclusion.</p><p>An experiment can be documented in many ways. Traditionally, experiments have been documented only as text, and this is still the case for a large portion of all published studies, because this is the only way to document research in many settings. However, experiments do not have to be documented only as text; for example, if data collection is carried out automatically or if data analysis is performed computationally, both the data and the analysis can be shared. In computer science and AI research, most experiments can be fully automated and executed by a computer, which means that the complete experiments can be shared. 
The reproducibility type of an experiment is defined by which of these artifacts are shared with independent investigators replicating the initial study.</p><p>We will emphasize two important points. First, sharing of code would be different for a machine learning experiment, where the complete research protocol can be reproduced (from data collection to analysis) and the study's conclusion can be reached if all statistical and analytical criteria are satisfied, and for a medical experiment, where often only the digitized data and analysis can be shared. Second, the textual description is important. Although the code and data could be shared without sharing a study's textual description, this is not enough. To validate whether the right experiment is carried out, independent investigators need the textual description. Validation includes but is not limited to evaluating whether the experiment tests the hypothesis and whether the results are analyzed in an appropriate way. If the textual description is lacking, only some verification can be done, such as checking whether the code produces the expected result given the data.</p><p>We propose three different degrees of reproducibility experiments depending on how the conclusion is reached: (a) outcome reproducible (OR) is when a reproducibility experiment produces the exact same outcome (finding) as the initial experiment, (b) analysis reproducible (AR) is when the same analysis is carried out as in the initial experiment to reach the same conclusion but with a different (nonidentical) finding, and (c) interpretation reproducible (IR) is when the interpretation of the analysis is the same so that the same conclusion is reached even though the analysis is different from the initial one. Also here, let us emphasize three essential points. First, we do not distinguish between the terms reproducibility and replicability but instead cover the same concepts through introducing degrees of reproducibility. Second, in many scientific areas, a study's products or outcomes are described as data. However, in other areas, especially in machine learning, data are often an input to a study. To avoid ambiguity, we use the term outcome to refer to the finding of a study, and we use data to describe a study's input or stimuli, such as, for example, the images containing objects (data) which are classified in a visual perception task. This use of the term “data” appears to map loosely onto “materials” in language-focused studies. Third, outcome reproducibility is basically impossible in noncomputational experiments, and if achieved, is only spurious. This is often the case for highly complex computational experiments as well. However, to properly understand the concept of reproducibility, the distinction is important.</p><p>Marsden and Morgan-Short raise the issue of replicating highly influential older studies that used methods and analytical practices that do not reflect the current field-specific standards. The degrees of reproducibility, as described here, illustrate this situation. Let us explain. When trying to reproduce a highly influential older study, one could choose to use the out-of-date methods or analytical practices to validate the conclusions of the initial experiment, or one could opt for new methods and analytical practices. The experiment would be analysis reproducible (AR) if researchers reach the same conclusions by relying on old analytical practices. 
In contrast, the experiment would be interpretation reproducible (IR) if researchers reach the same conclusion by modernizing their analytical practices.</p><p>Furthermore, Marsden and Morgan-Short also remark on the difficulty of reproducing the initial study's finding when the full materials are not provided by the original author. The full materials may not be required to reproduce an experiment. This is captured through the various types of reproducibility. If only a textual description is made available by the original investigators (i.e., in a R1 description reproducibility experiment), but the independent investigators use the same analytical methods to reach the same conclusion, then this reproducibility experiment is analysis reproducible (R1AR). However, if new analytical practices are used in the same situation, then the reproducibility experiment will be classified as interpretation reproducible (R1IR).</p><p>Marsden and Morgan-Short explained that, in studies on language learning, replication results were less likely to support the initial finding when materials were not provided. This could mean that the conclusions of the initial experiment do not generalize well. The type of reproducibility study also represents the generalizability of an experiment, going from R4 (least generalizable) to R1 (most generalizable). For instance, the two situations described above, namely, analysis and interpretation reproducible experiments based on a textual description only, would be classified as R1 (most generalizable). In contrast, when an experiment can only be reproduced with full materials, then its conclusions might not be as generalizable as those from an experiment whose findings can be reproduced through only a textual description. In AI research, the original investigators are in fact incentivized to share fewer study materials because this increases other researchers’ efforts to reproduce those findings with highest degree of generalization possible. Whereas this strategy might attract the attention of individual researchers, it ultimately represents an antisocial practice with respect to the research community, in the sense that this practice, of course, makes third parties less likely to reproduce a given finding, so it is a net loss for the community (for more detail, see Gundersen, <span>2019</span>).</p><p>To further increase the understanding of reproducibility, we have not only surveyed existing literature for variables that can lead to a lack of reproducibility but also analyzed how these variables affect various degrees of reproducibility (Gundersen, Coakley, et al., <span>2023</span>). For instance, among various sources of irreproducibility, we have identified study design variables, algorithmic variables, implementation variables, observation variables, evaluation variables, and documentation variables. Understanding these sources of irreproducibility will help researchers to operationalize reproducibility research by highlighting links between a given study's degree of reproducibility and the various design decisions that allow the study to achieve that reproducibility. For example, if researchers try to reproduce an experiment and cannot achieve the degree of analysis reproducibility when evaluating a study's outcomes, those researchers could identify various potential sources of irreproducibility affecting their analysis. 
We believe that it could be very useful for scholars in other sciences, including language sciences, to identify various variables that can cause experiments to be irreproducible. This will not only help increase researchers’ methodological rigor but enhance their understanding of why reproducibility experiments sometimes fail.</p>","PeriodicalId":51371,"journal":{"name":"Language Learning","volume":null,"pages":null},"PeriodicalIF":3.5000,"publicationDate":"2023-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/lang.12582","citationCount":"0","resultStr":"{\"title\":\"Open Research in Artificial Intelligence and the Search for Common Ground in Reproducibility: A Commentary on “(Why) Are Open Research Practices the Future for the Study of Language Learning?”\",\"authors\":\"Odd Erik Gundersen, Kevin Coakley\",\"doi\":\"10.1111/lang.12582\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Open research has a long tradition in the field of artificial intelligence (AI), which is our primary area of expertise. Richard Stallman, who has been affiliated with the AI laboratory at Massachusetts Institute of Technology since the early 1970s, launched the GNU project in 1983 and the Free Software Foundation in 1985. The goal of the free software movement has been to secure freedoms for software users to run, study, modify, and share software. GNU software grants these rights in licenses that enable anyone to read the code but also restrict anyone from changing the software without sharing these changes. The open data movement in AI was spearheaded by the Machine Learning Repository created in 1987 by David Aha and fellow graduate students at the University of California Irvine. This repository still hosts a collection of datasets that can be used for machine learning. One of the first digital-first scientific journals was the <i>Journal of Artificial Intelligence Research</i> (JAIR), established in 1993 on the initiative of Steven Minton. The journal is an open access, peer-reviewed scientific publication and has been community driven since its inception. It has no publishing fees, and all expenses have been covered by donations. Since it is hosted online, it supports publishing digital source material, such as code and data.</p><p>AI research is a young science that is continuously seeking to improve research methodology and the quality of the published research. Although there currently is a movement towards publishing research in journals, a substantial number of scientific articles in AI are still published through conference proceedings. The conferences with the highest impact, such as those of the Association for the Advancement of Artificial Intelligence, Neural Information Processing Systems, International Conference on Machine Learning, and International Joint Conference on Artificial Intelligence, are community driven, and the articles presented and published in these venues are open access. Some of the proceedings are published by the <i>Journal of Machine Learning Research</i>, established as an open access alternative to the journal <i>Machine Learning</i> in 2001 to allow authors to publish for free and retain copyright. All these venues also promote and facilitate public sharing of research artifacts.</p><p>Among many open research practices in our field of expertise, some of the most impactful have targeted research reproducibility. 
In this commentary, we have therefore focused on reproducibility, in the hopes that researchers in language sciences might benefit from the experience of AI scholars. One recent initiative in AI research involved reproducibility checklists introduced at all the most impactful AI conferences to improve the rigor of the research presented and published there. These checklists must be completed by all authors when submitting articles to conferences, and they cover various aspects of research methodology, including whether data and code are shared. The checklists have been introduced as a response to the reproducibility crisis and in recognition of the field's challenges with methodological rigor. Reproducibility badges have also been introduced at several conferences and journals, and soon in JAIR as well (Gundersen, Helmert, & Hoos, <span>2023</span>). The badges indicate whether the research artifacts, such and data and code, that are required for reproducing the research have been shared. In some cases, reviewers evaluate the artifacts as well, which could earn the authors another badge if the reviewers are able to reproduce the research. However, this is a considerable task, recognized by many as too much to ask of reviewers. Instead, AI scholars now organize reproducibility challenges, with the idea of designating a separate track at a conference or a workshop where the goal is to attempt to reproduce a scientific article of choice and write a report on this effort. Some of these reports have been published in the community driven open access journal <i>ReScience C</i>. One issue with these initiatives is that the results of the replication efforts are not linked to the original scientific article. To address this shortcoming, a new procedure is currently being introduced at JAIR, where reports documenting the effort by third parties to reproduce research are published in the journal alongside the article that is being reproduced. This closes the gap between the reproducibility effort and the original work, in the sense that high quality research that is easy to reproduce will get credit, and readers will be made aware of research that is not easily reproducible. JAIR ensures that the original authors get to provide feedback on reproducibility reports and that any mistakes or misunderstandings by the third-party researchers are corrected.</p><p>One challenge to reproducibility is conceptual. The term reproducibility has been called confused by Plesser (<span>2018</span>), and we agree. Our belief is that the reason for this confusion is caused by the term being defined without trying to operationalize it at the same time. Hence, we have tried to define reproducibility in such a way that the concept becomes operationalizable. We use machine learning to illustrate our reasoning because it is our domain, which we know well, and because machine learning is a computer science, so most of the experiments can be fully described in code and automated. We believe this is a strength as it allows us to be specific about what an experiment is and what reproducibility then should mean. However, we think that this definition of reproducibility is generalizable to all sciences.</p><p>In Gundersen (<span>2021</span>), reproducibility is defined as “the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators” (p. 10). 
The documentation used to conduct a reproducibility experiment defines to which reproducibility type this experiment belongs, and the way the conclusion is reached determines to which degree the experiment reproduces the conclusion.</p><p>An experiment can be documented in many ways. Traditionally, experiments have been documented only as text, and this is still the case for a large portion of all published studies, because this is the only way to document research in many settings. However, experiments do not have to be documented only as text; for example, if data collection is carried out automatically or if data analysis is performed computationally, both the data and the analysis can be shared. In computer science and AI research, most experiments can be fully automated and executed by a computer, which means that the complete experiments can be shared. The reproducibility type of an experiment is defined by which of these artifacts are shared with independent investigators replicating the initial study.</p><p>We will emphasize two important points. First, sharing of code would be different for a machine learning experiment, where the complete research protocol can be reproduced (from data collection to analysis) and the study's conclusion can be reached if all statistical and analytical criteria are satisfied, and for a medical experiment, where often only the digitized data and analysis can be shared. Second, the textual description is important. Although the code and data could be shared without sharing a study's textual description, this is not enough. To validate whether the right experiment is carried out, independent investigators need the textual description. Validation includes but is not limited to evaluating whether the experiment tests the hypothesis and whether the results are analyzed in an appropriate way. If the textual description is lacking, only some verification can be done, such as checking whether the code produces the expected result given the data.</p><p>We propose three different degrees of reproducibility experiments depending on how the conclusion is reached: (a) outcome reproducible (OR) is when a reproducibility experiment produces the exact same outcome (finding) as the initial experiment, (b) analysis reproducible (AR) is when the same analysis is carried out as in the initial experiment to reach the same conclusion but with a different (nonidentical) finding, and (c) interpretation reproducible (IR) is when the interpretation of the analysis is the same so that the same conclusion is reached even though the analysis is different from the initial one. Also here, let us emphasize three essential points. First, we do not distinguish between the terms reproducibility and replicability but instead cover the same concepts through introducing degrees of reproducibility. Second, in many scientific areas, a study's products or outcomes are described as data. However, in other areas, especially in machine learning, data are often an input to a study. To avoid ambiguity, we use the term outcome to refer to the finding of a study, and we use data to describe a study's input or stimuli, such as, for example, the images containing objects (data) which are classified in a visual perception task. This use of the term “data” appears to map loosely onto “materials” in language-focused studies. Third, outcome reproducibility is basically impossible in noncomputational experiments, and if achieved, is only spurious. 
This is often the case for highly complex computational experiments as well. However, to properly understand the concept of reproducibility, the distinction is important.</p><p>Marsden and Morgan-Short raise the issue of replicating highly influential older studies that used methods and analytical practices that do not reflect the current field-specific standards. The degrees of reproducibility, as described here, illustrate this situation. Let us explain. When trying to reproduce a highly influential older study, one could choose to use the out-of-date methods or analytical practices to validate the conclusions of the initial experiment, or one could opt for new methods and analytical practices. The experiment would be analysis reproducible (AR) if researchers reach the same conclusions by relying on old analytical practices. In contrast, the experiment would be interpretation reproducible (IR) if researchers reach the same conclusion by modernizing their analytical practices.</p><p>Furthermore, Marsden and Morgan-Short also remark on the difficulty of reproducing the initial study's finding when the full materials are not provided by the original author. The full materials may not be required to reproduce an experiment. This is captured through the various types of reproducibility. If only a textual description is made available by the original investigators (i.e., in a R1 description reproducibility experiment), but the independent investigators use the same analytical methods to reach the same conclusion, then this reproducibility experiment is analysis reproducible (R1AR). However, if new analytical practices are used in the same situation, then the reproducibility experiment will be classified as interpretation reproducible (R1IR).</p><p>Marsden and Morgan-Short explained that, in studies on language learning, replication results were less likely to support the initial finding when materials were not provided. This could mean that the conclusions of the initial experiment do not generalize well. The type of reproducibility study also represents the generalizability of an experiment, going from R4 (least generalizable) to R1 (most generalizable). For instance, the two situations described above, namely, analysis and interpretation reproducible experiments based on a textual description only, would be classified as R1 (most generalizable). In contrast, when an experiment can only be reproduced with full materials, then its conclusions might not be as generalizable as those from an experiment whose findings can be reproduced through only a textual description. In AI research, the original investigators are in fact incentivized to share fewer study materials because this increases other researchers’ efforts to reproduce those findings with highest degree of generalization possible. Whereas this strategy might attract the attention of individual researchers, it ultimately represents an antisocial practice with respect to the research community, in the sense that this practice, of course, makes third parties less likely to reproduce a given finding, so it is a net loss for the community (for more detail, see Gundersen, <span>2019</span>).</p><p>To further increase the understanding of reproducibility, we have not only surveyed existing literature for variables that can lead to a lack of reproducibility but also analyzed how these variables affect various degrees of reproducibility (Gundersen, Coakley, et al., <span>2023</span>). 
For instance, among various sources of irreproducibility, we have identified study design variables, algorithmic variables, implementation variables, observation variables, evaluation variables, and documentation variables. Understanding these sources of irreproducibility will help researchers to operationalize reproducibility research by highlighting links between a given study's degree of reproducibility and the various design decisions that allow the study to achieve that reproducibility. For example, if researchers try to reproduce an experiment and cannot achieve the degree of analysis reproducibility when evaluating a study's outcomes, those researchers could identify various potential sources of irreproducibility affecting their analysis. We believe that it could be very useful for scholars in other sciences, including language sciences, to identify various variables that can cause experiments to be irreproducible. This will not only help increase researchers’ methodological rigor but enhance their understanding of why reproducibility experiments sometimes fail.</p>\",\"PeriodicalId\":51371,\"journal\":{\"name\":\"Language Learning\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2023-05-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1111/lang.12582\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Language Learning\",\"FirstCategoryId\":\"98\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/lang.12582\",\"RegionNum\":1,\"RegionCategory\":\"文学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION & EDUCATIONAL RESEARCH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Learning","FirstCategoryId":"98","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/lang.12582","RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Open Research in Artificial Intelligence and the Search for Common Ground in Reproducibility: A Commentary on “(Why) Are Open Research Practices the Future for the Study of Language Learning?”
Open research has a long tradition in the field of artificial intelligence (AI), which is our primary area of expertise. Richard Stallman, who has been affiliated with the AI laboratory at the Massachusetts Institute of Technology since the early 1970s, launched the GNU project in 1983 and the Free Software Foundation in 1985. The goal of the free software movement has been to secure freedoms for software users to run, study, modify, and share software. GNU software grants these rights in licenses that enable anyone to read the code but also prevent anyone from distributing modified versions of the software without sharing those changes. The open data movement in AI was spearheaded by the Machine Learning Repository created in 1987 by David Aha and fellow graduate students at the University of California, Irvine. This repository still hosts a collection of datasets that can be used for machine learning. One of the first digital-first scientific journals was the Journal of Artificial Intelligence Research (JAIR), established in 1993 on the initiative of Steven Minton. The journal is an open access, peer-reviewed scientific publication and has been community driven since its inception. It has no publishing fees, and all expenses have been covered by donations. Since it is hosted online, it supports publishing digital source material, such as code and data.
AI research is a young science that is continuously seeking to improve research methodology and the quality of the published research. Although there currently is a movement towards publishing research in journals, a substantial number of scientific articles in AI are still published through conference proceedings. The conferences with the highest impact, such as those of the Association for the Advancement of Artificial Intelligence, Neural Information Processing Systems, International Conference on Machine Learning, and International Joint Conference on Artificial Intelligence, are community driven, and the articles presented and published in these venues are open access. Some of the proceedings are published by the Journal of Machine Learning Research, established as an open access alternative to the journal Machine Learning in 2001 to allow authors to publish for free and retain copyright. All these venues also promote and facilitate public sharing of research artifacts.
Among many open research practices in our field of expertise, some of the most impactful have targeted research reproducibility. In this commentary, we have therefore focused on reproducibility, in the hopes that researchers in language sciences might benefit from the experience of AI scholars. One recent initiative in AI research is the reproducibility checklists introduced at all the most impactful AI conferences to improve the rigor of the research presented and published there. These checklists must be completed by all authors when submitting articles to conferences, and they cover various aspects of research methodology, including whether data and code are shared. The checklists have been introduced as a response to the reproducibility crisis and in recognition of the field's challenges with methodological rigor. Reproducibility badges have also been introduced at several conferences and journals, and soon in JAIR as well (Gundersen, Helmert, & Hoos, 2023). The badges indicate whether the research artifacts, such as data and code, that are required for reproducing the research have been shared. In some cases, reviewers evaluate the artifacts as well, which could earn the authors another badge if the reviewers are able to reproduce the research. However, this is a considerable task, recognized by many as too much to ask of reviewers. Instead, AI scholars now organize reproducibility challenges, with the idea of designating a separate track at a conference or a workshop where the goal is to attempt to reproduce a scientific article of their choice and write a report on this effort. Some of these reports have been published in the community-driven open access journal ReScience C. One issue with these initiatives is that the results of the replication efforts are not linked to the original scientific article. To address this shortcoming, a new procedure is currently being introduced at JAIR, where reports documenting the effort by third parties to reproduce research are published in the journal alongside the article that is being reproduced. This closes the gap between the reproducibility effort and the original work, in the sense that high-quality research that is easy to reproduce will get credit, and readers will be made aware of research that is not easily reproducible. JAIR ensures that the original authors get to provide feedback on reproducibility reports and that any mistakes or misunderstandings by the third-party researchers are corrected.
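Returning to the reproducibility checklists mentioned at the start of this paragraph, the following is a minimal sketch of how submission-time answers might be captured as structured metadata. The field names are illustrative assumptions of ours and do not reproduce any particular conference's checklist.

```python
# Illustrative sketch of a submission-time reproducibility checklist.
# The field names are assumptions; real conference checklists differ in scope and detail.
checklist = {
    "data_shared": True,
    "code_shared": True,
    "hyperparameters_reported": True,
    "random_seeds_reported": False,
    "compute_environment_described": True,
}

missing = [item for item, answered in checklist.items() if not answered]
if missing:
    print("Checklist items not satisfied:", ", ".join(missing))
else:
    print("All checklist items satisfied.")
```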
One challenge to reproducibility is conceptual. Plesser (2018) has described the terminology of reproducibility as confused, and we agree. We believe this confusion arises because the term has been defined without any attempt to operationalize it at the same time. Hence, we have tried to define reproducibility in such a way that the concept becomes operationalizable. We use machine learning to illustrate our reasoning because it is the domain we know well and because, machine learning being a computational science, most of its experiments can be fully described in code and automated. We believe this is a strength, as it allows us to be specific about what an experiment is and what reproducibility should then mean. Nevertheless, we think that this definition of reproducibility is generalizable to all sciences.
In Gundersen (2021), reproducibility is defined as “the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators” (p. 10). The documentation used to conduct a reproducibility experiment determines the reproducibility type to which the experiment belongs, and the way in which the conclusion is reached determines the degree to which the experiment reproduces the conclusion.
An experiment can be documented in many ways. Traditionally, experiments have been documented only as text, and this remains the case for a large portion of published studies because, in many settings, text is the only way to document research. However, experiments do not have to be documented only as text; for example, if data collection is carried out automatically or if data analysis is performed computationally, both the data and the analysis can be shared. In computer science and AI research, most experiments can be fully automated and executed by a computer, which means that the complete experiments can be shared. The reproducibility type of an experiment is defined by which of these artifacts are shared with the independent investigators replicating the initial study.
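Because such experiments can be scripted end to end, the entire pipeline, from loading the data to computing the reported outcome, can be shared as a single runnable artifact. The following minimal sketch illustrates this; the dataset, model, and seed are illustrative choices of ours, not part of any study discussed here.

```python
# Minimal sketch of a fully automated experiment that can be shared as-is.
# The dataset, model, and seed are illustrative, not taken from a real study.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed so independent investigators can rerun the exact experiment


def run_experiment(seed: int = SEED) -> float:
    X, y = load_iris(return_X_y=True)  # the "data" (input/stimuli)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000, random_state=seed)
    model.fit(X_train, y_train)  # the automated procedure
    return accuracy_score(y_test, model.predict(X_test))  # the "outcome" (finding)


if __name__ == "__main__":
    print(f"Test accuracy: {run_experiment():.3f}")
```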
We will emphasize two important points. First, what can be shared differs between a machine learning experiment, where the complete research protocol can be reproduced (from data collection to analysis) and the study's conclusion can be reached if all statistical and analytical criteria are satisfied, and a medical experiment, where often only the digitized data and the analysis can be shared. Second, the textual description is important. Although the code and data could be shared without sharing a study's textual description, this is not enough. To validate whether the right experiment is carried out, independent investigators need the textual description. Validation includes, but is not limited to, evaluating whether the experiment tests the hypothesis and whether the results are analyzed in an appropriate way. If the textual description is lacking, only some verification can be done, such as checking whether the code produces the expected result given the data.
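To illustrate the difference, verification in this narrow sense can be as simple as rerunning the shared analysis on the shared data and checking that the reported value comes out; the helper function, reported value, and tolerance below are hypothetical.

```python
# Minimal sketch of verification (not validation): rerun the shared analysis on the
# shared data and check whether it produces the outcome reported in the paper.
# The analysis, reported value, and tolerance are hypothetical.


def rerun_shared_analysis(data: list[float]) -> float:
    """Stand-in for rerunning the shared analysis code on the shared data."""
    return sum(data) / len(data)


REPORTED_OUTCOME = 0.50  # value claimed in the (hypothetical) textual description
TOLERANCE = 1e-6  # exact agreement is only expected for deterministic code

shared_data = [0.2, 0.4, 0.6, 0.8]
outcome = rerun_shared_analysis(shared_data)
print(f"outcome={outcome:.6f}, verified={abs(outcome - REPORTED_OUTCOME) <= TOLERANCE}")
```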
We propose three different degrees of reproducibility experiments depending on how the conclusion is reached: (a) outcome reproducible (OR), when a reproducibility experiment produces the exact same outcome (finding) as the initial experiment; (b) analysis reproducible (AR), when the same analysis is carried out as in the initial experiment to reach the same conclusion but with a different (nonidentical) finding; and (c) interpretation reproducible (IR), when the interpretation of the analysis is the same so that the same conclusion is reached even though the analysis differs from the initial one. Here, too, let us emphasize three essential points. First, we do not distinguish between the terms reproducibility and replicability but instead cover the same concepts through degrees of reproducibility. Second, in many scientific areas, a study's products or outcomes are described as data. However, in other areas, especially in machine learning, data are often an input to a study. To avoid ambiguity, we use the term outcome to refer to the finding of a study, and we use data to describe a study's input or stimuli, such as the images containing objects (data) that are classified in a visual perception task. This use of the term “data” appears to map loosely onto “materials” in language-focused studies. Third, outcome reproducibility is basically impossible in noncomputational experiments and, if achieved, is only spurious. This is often the case for highly complex computational experiments as well. However, to properly understand the concept of reproducibility, the distinction is important.
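One way to see that these degrees are operationalizable is to express the decision as a small classifier over what matched between the initial experiment and the reproducibility experiment. This is our own illustrative sketch, not code from Gundersen (2021).

```python
# Illustrative sketch: classify the degree of reproducibility of a completed
# reproducibility experiment, mirroring the definitions of OR, AR, and IR above.
# The function and argument names are our own.
from typing import Optional


def degree_of_reproducibility(
    same_conclusion: bool, same_analysis: bool, same_outcome: bool
) -> Optional[str]:
    if not same_conclusion:
        return None  # the conclusion was not reproduced at all
    if same_analysis and same_outcome:
        return "OR"  # outcome reproducible: the exact same finding
    if same_analysis:
        return "AR"  # analysis reproducible: same analysis, nonidentical finding
    return "IR"  # interpretation reproducible: different analysis, same conclusion


# Example: the same conclusion reached with the original analysis but a nonidentical finding.
print(degree_of_reproducibility(same_conclusion=True, same_analysis=True, same_outcome=False))  # AR
```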
Marsden and Morgan-Short raise the issue of replicating highly influential older studies that used methods and analytical practices that do not reflect the current field-specific standards. The degrees of reproducibility, as described here, illustrate this situation. Let us explain. When trying to reproduce a highly influential older study, one could choose to use the out-of-date methods or analytical practices to validate the conclusions of the initial experiment, or one could opt for new methods and analytical practices. The experiment would be analysis reproducible (AR) if researchers reach the same conclusions by relying on old analytical practices. In contrast, the experiment would be interpretation reproducible (IR) if researchers reach the same conclusion by modernizing their analytical practices.
Furthermore, Marsden and Morgan-Short remark on the difficulty of reproducing the initial study's finding when the full materials are not provided by the original author. The full materials may not be required to reproduce an experiment. This is captured through the various types of reproducibility. If only a textual description is made available by the original investigators (i.e., in an R1 description reproducibility experiment), but the independent investigators use the same analytical methods to reach the same conclusion, then this reproducibility experiment is analysis reproducible (R1AR). However, if new analytical practices are used in the same situation, then the reproducibility experiment will be classified as interpretation reproducible (R1IR).
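The combined labels used above (R1AR, R1IR) simply join the reproducibility type, determined by what the original investigators shared, with the degree, determined by how the conclusion was reached. In the minimal sketch below, only the R1 case (textual description only) is spelled out, matching the text; how other combinations of shared artifacts map onto R2-R4 follows Gundersen (2021) and is not reproduced here.

```python
# Illustrative sketch of the combined classification (e.g., "R1AR").
# Only R1 (textual description only) is encoded; other combinations of shared
# artifacts map onto R2-R4 as defined in Gundersen (2021).
from typing import Optional


def reproducibility_type(text_shared: bool, code_shared: bool, data_shared: bool) -> Optional[str]:
    if text_shared and not code_shared and not data_shared:
        return "R1"  # description only
    return None  # R2-R4: see Gundersen (2021)


def classify(text_shared: bool, code_shared: bool, data_shared: bool, degree: str) -> Optional[str]:
    rtype = reproducibility_type(text_shared, code_shared, data_shared)
    return rtype + degree if rtype else None


print(classify(text_shared=True, code_shared=False, data_shared=False, degree="AR"))  # R1AR
print(classify(text_shared=True, code_shared=False, data_shared=False, degree="IR"))  # R1IR
```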
Marsden and Morgan-Short explained that, in studies on language learning, replication results were less likely to support the initial finding when materials were not provided. This could mean that the conclusions of the initial experiment do not generalize well. The type of reproducibility study also represents the generalizability of an experiment, going from R4 (least generalizable) to R1 (most generalizable). For instance, the two situations described above, namely, analysis and interpretation reproducible experiments based on a textual description only, would be classified as R1 (most generalizable). In contrast, when an experiment can only be reproduced with full materials, then its conclusions might not be as generalizable as those from an experiment whose findings can be reproduced through a textual description alone. In AI research, the original investigators are in fact incentivized to share fewer study materials because this increases other researchers’ efforts to reproduce those findings with the highest degree of generalization possible. Whereas this strategy might attract the attention of individual researchers, it ultimately represents an antisocial practice with respect to the research community: it makes third parties less likely to reproduce a given finding, so it is a net loss for the community (for more detail, see Gundersen, 2019).
To further increase the understanding of reproducibility, we have not only surveyed the existing literature for variables that can lead to a lack of reproducibility but also analyzed how these variables affect various degrees of reproducibility (Gundersen, Coakley, et al., 2023). For instance, among various sources of irreproducibility, we have identified study design variables, algorithmic variables, implementation variables, observation variables, evaluation variables, and documentation variables. Understanding these sources of irreproducibility will help researchers to operationalize reproducibility research by highlighting links between a given study's degree of reproducibility and the various design decisions that allow the study to achieve that reproducibility. For example, if researchers try to reproduce an experiment and cannot achieve the degree of analysis reproducibility when evaluating a study's outcomes, those researchers could identify various potential sources of irreproducibility affecting their analysis. We believe that it could be very useful for scholars in other sciences, including the language sciences, to identify the various variables that can cause experiments to be irreproducible. This will not only help increase researchers’ methodological rigor but also enhance their understanding of why reproducibility experiments sometimes fail.
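As a simple aid to that kind of diagnosis, the six categories named above can be kept as a structured checklist to walk through when a reproduction attempt fails. The example items under each category below are illustrative assumptions of ours, not the full taxonomy from Gundersen, Coakley, et al. (2023).

```python
# Illustrative sketch: the six categories of sources of irreproducibility named above,
# with hypothetical example items to inspect when a reproduction attempt fails.
SOURCES_OF_IRREPRODUCIBILITY = {
    "study design": ["hypothesis or experimental conditions underspecified"],
    "algorithmic": ["nondeterministic procedures without fixed seeds"],
    "implementation": ["library, compiler, or hardware differences"],
    "observation": ["differences in how the input data were sampled or measured"],
    "evaluation": ["different metrics or statistical tests"],
    "documentation": ["missing details in the textual description"],
}


def diagnose(failed_degree: str) -> None:
    """Print the categories to inspect when a given degree (e.g., 'AR') was not reached."""
    print(f"Degree not reached: {failed_degree}. Candidate sources to inspect:")
    for category, examples in SOURCES_OF_IRREPRODUCIBILITY.items():
        print(f"  - {category} variables, e.g., {examples[0]}")


diagnose("AR")
```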
About the Journal:
Language Learning is a scientific journal dedicated to the understanding of language learning broadly defined. It publishes research articles that systematically apply methods of inquiry from disciplines including psychology, linguistics, cognitive science, educational inquiry, neuroscience, ethnography, sociolinguistics, sociology, and anthropology. It is concerned with fundamental theoretical issues in language learning such as child, second, and foreign language acquisition, language education, bilingualism, literacy, language representation in mind and brain, culture, cognition, pragmatics, and intergroup relations. A subscription includes one or two annual supplements, alternating among a volume from the Language Learning Cognitive Neuroscience Series, the Currents in Language Learning Series or the Language Learning Special Issue Series.