Open Research in Artificial Intelligence and the Search for Common Ground in Reproducibility: A Commentary on “(Why) Are Open Research Practices the Future for the Study of Language Learning?”
{"title":"Open Research in Artificial Intelligence and the Search for Common Ground in Reproducibility: A Commentary on “(Why) Are Open Research Practices the Future for the Study of Language Learning?”","authors":"Odd Erik Gundersen, Kevin Coakley","doi":"10.1111/lang.12582","DOIUrl":null,"url":null,"abstract":"<p>Open research has a long tradition in the field of artificial intelligence (AI), which is our primary area of expertise. Richard Stallman, who has been affiliated with the AI laboratory at Massachusetts Institute of Technology since the early 1970s, launched the GNU project in 1983 and the Free Software Foundation in 1985. The goal of the free software movement has been to secure freedoms for software users to run, study, modify, and share software. GNU software grants these rights in licenses that enable anyone to read the code but also restrict anyone from changing the software without sharing these changes. The open data movement in AI was spearheaded by the Machine Learning Repository created in 1987 by David Aha and fellow graduate students at the University of California Irvine. This repository still hosts a collection of datasets that can be used for machine learning. One of the first digital-first scientific journals was the <i>Journal of Artificial Intelligence Research</i> (JAIR), established in 1993 on the initiative of Steven Minton. The journal is an open access, peer-reviewed scientific publication and has been community driven since its inception. It has no publishing fees, and all expenses have been covered by donations. Since it is hosted online, it supports publishing digital source material, such as code and data.</p><p>AI research is a young science that is continuously seeking to improve research methodology and the quality of the published research. Although there currently is a movement towards publishing research in journals, a substantial number of scientific articles in AI are still published through conference proceedings. The conferences with the highest impact, such as those of the Association for the Advancement of Artificial Intelligence, Neural Information Processing Systems, International Conference on Machine Learning, and International Joint Conference on Artificial Intelligence, are community driven, and the articles presented and published in these venues are open access. Some of the proceedings are published by the <i>Journal of Machine Learning Research</i>, established as an open access alternative to the journal <i>Machine Learning</i> in 2001 to allow authors to publish for free and retain copyright. All these venues also promote and facilitate public sharing of research artifacts.</p><p>Among many open research practices in our field of expertise, some of the most impactful have targeted research reproducibility. In this commentary, we have therefore focused on reproducibility, in the hopes that researchers in language sciences might benefit from the experience of AI scholars. One recent initiative in AI research involved reproducibility checklists introduced at all the most impactful AI conferences to improve the rigor of the research presented and published there. These checklists must be completed by all authors when submitting articles to conferences, and they cover various aspects of research methodology, including whether data and code are shared. The checklists have been introduced as a response to the reproducibility crisis and in recognition of the field's challenges with methodological rigor. 
Reproducibility badges have also been introduced at several conferences and journals, and soon in JAIR as well (Gundersen, Helmert, & Hoos, <span>2023</span>). The badges indicate whether the research artifacts, such and data and code, that are required for reproducing the research have been shared. In some cases, reviewers evaluate the artifacts as well, which could earn the authors another badge if the reviewers are able to reproduce the research. However, this is a considerable task, recognized by many as too much to ask of reviewers. Instead, AI scholars now organize reproducibility challenges, with the idea of designating a separate track at a conference or a workshop where the goal is to attempt to reproduce a scientific article of choice and write a report on this effort. Some of these reports have been published in the community driven open access journal <i>ReScience C</i>. One issue with these initiatives is that the results of the replication efforts are not linked to the original scientific article. To address this shortcoming, a new procedure is currently being introduced at JAIR, where reports documenting the effort by third parties to reproduce research are published in the journal alongside the article that is being reproduced. This closes the gap between the reproducibility effort and the original work, in the sense that high quality research that is easy to reproduce will get credit, and readers will be made aware of research that is not easily reproducible. JAIR ensures that the original authors get to provide feedback on reproducibility reports and that any mistakes or misunderstandings by the third-party researchers are corrected.</p><p>One challenge to reproducibility is conceptual. The term reproducibility has been called confused by Plesser (<span>2018</span>), and we agree. Our belief is that the reason for this confusion is caused by the term being defined without trying to operationalize it at the same time. Hence, we have tried to define reproducibility in such a way that the concept becomes operationalizable. We use machine learning to illustrate our reasoning because it is our domain, which we know well, and because machine learning is a computer science, so most of the experiments can be fully described in code and automated. We believe this is a strength as it allows us to be specific about what an experiment is and what reproducibility then should mean. However, we think that this definition of reproducibility is generalizable to all sciences.</p><p>In Gundersen (<span>2021</span>), reproducibility is defined as “the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators” (p. 10). The documentation used to conduct a reproducibility experiment defines to which reproducibility type this experiment belongs, and the way the conclusion is reached determines to which degree the experiment reproduces the conclusion.</p><p>An experiment can be documented in many ways. Traditionally, experiments have been documented only as text, and this is still the case for a large portion of all published studies, because this is the only way to document research in many settings. However, experiments do not have to be documented only as text; for example, if data collection is carried out automatically or if data analysis is performed computationally, both the data and the analysis can be shared. 
In computer science and AI research, most experiments can be fully automated and executed by a computer, which means that the complete experiments can be shared. The reproducibility type of an experiment is defined by which of these artifacts are shared with independent investigators replicating the initial study.</p><p>We will emphasize two important points. First, sharing of code would be different for a machine learning experiment, where the complete research protocol can be reproduced (from data collection to analysis) and the study's conclusion can be reached if all statistical and analytical criteria are satisfied, and for a medical experiment, where often only the digitized data and analysis can be shared. Second, the textual description is important. Although the code and data could be shared without sharing a study's textual description, this is not enough. To validate whether the right experiment is carried out, independent investigators need the textual description. Validation includes but is not limited to evaluating whether the experiment tests the hypothesis and whether the results are analyzed in an appropriate way. If the textual description is lacking, only some verification can be done, such as checking whether the code produces the expected result given the data.</p><p>We propose three different degrees of reproducibility experiments depending on how the conclusion is reached: (a) outcome reproducible (OR) is when a reproducibility experiment produces the exact same outcome (finding) as the initial experiment, (b) analysis reproducible (AR) is when the same analysis is carried out as in the initial experiment to reach the same conclusion but with a different (nonidentical) finding, and (c) interpretation reproducible (IR) is when the interpretation of the analysis is the same so that the same conclusion is reached even though the analysis is different from the initial one. Also here, let us emphasize three essential points. First, we do not distinguish between the terms reproducibility and replicability but instead cover the same concepts through introducing degrees of reproducibility. Second, in many scientific areas, a study's products or outcomes are described as data. However, in other areas, especially in machine learning, data are often an input to a study. To avoid ambiguity, we use the term outcome to refer to the finding of a study, and we use data to describe a study's input or stimuli, such as, for example, the images containing objects (data) which are classified in a visual perception task. This use of the term “data” appears to map loosely onto “materials” in language-focused studies. Third, outcome reproducibility is basically impossible in noncomputational experiments, and if achieved, is only spurious. This is often the case for highly complex computational experiments as well. However, to properly understand the concept of reproducibility, the distinction is important.</p><p>Marsden and Morgan-Short raise the issue of replicating highly influential older studies that used methods and analytical practices that do not reflect the current field-specific standards. The degrees of reproducibility, as described here, illustrate this situation. Let us explain. When trying to reproduce a highly influential older study, one could choose to use the out-of-date methods or analytical practices to validate the conclusions of the initial experiment, or one could opt for new methods and analytical practices. 
The experiment would be analysis reproducible (AR) if researchers reach the same conclusions by relying on old analytical practices. In contrast, the experiment would be interpretation reproducible (IR) if researchers reach the same conclusion by modernizing their analytical practices.</p><p>Furthermore, Marsden and Morgan-Short also remark on the difficulty of reproducing the initial study's finding when the full materials are not provided by the original author. The full materials may not be required to reproduce an experiment. This is captured through the various types of reproducibility. If only a textual description is made available by the original investigators (i.e., in a R1 description reproducibility experiment), but the independent investigators use the same analytical methods to reach the same conclusion, then this reproducibility experiment is analysis reproducible (R1AR). However, if new analytical practices are used in the same situation, then the reproducibility experiment will be classified as interpretation reproducible (R1IR).</p><p>Marsden and Morgan-Short explained that, in studies on language learning, replication results were less likely to support the initial finding when materials were not provided. This could mean that the conclusions of the initial experiment do not generalize well. The type of reproducibility study also represents the generalizability of an experiment, going from R4 (least generalizable) to R1 (most generalizable). For instance, the two situations described above, namely, analysis and interpretation reproducible experiments based on a textual description only, would be classified as R1 (most generalizable). In contrast, when an experiment can only be reproduced with full materials, then its conclusions might not be as generalizable as those from an experiment whose findings can be reproduced through only a textual description. In AI research, the original investigators are in fact incentivized to share fewer study materials because this increases other researchers’ efforts to reproduce those findings with highest degree of generalization possible. Whereas this strategy might attract the attention of individual researchers, it ultimately represents an antisocial practice with respect to the research community, in the sense that this practice, of course, makes third parties less likely to reproduce a given finding, so it is a net loss for the community (for more detail, see Gundersen, <span>2019</span>).</p><p>To further increase the understanding of reproducibility, we have not only surveyed existing literature for variables that can lead to a lack of reproducibility but also analyzed how these variables affect various degrees of reproducibility (Gundersen, Coakley, et al., <span>2023</span>). For instance, among various sources of irreproducibility, we have identified study design variables, algorithmic variables, implementation variables, observation variables, evaluation variables, and documentation variables. Understanding these sources of irreproducibility will help researchers to operationalize reproducibility research by highlighting links between a given study's degree of reproducibility and the various design decisions that allow the study to achieve that reproducibility. For example, if researchers try to reproduce an experiment and cannot achieve the degree of analysis reproducibility when evaluating a study's outcomes, those researchers could identify various potential sources of irreproducibility affecting their analysis. 
We believe that it could be very useful for scholars in other sciences, including language sciences, to identify various variables that can cause experiments to be irreproducible. This will not only help increase researchers’ methodological rigor but enhance their understanding of why reproducibility experiments sometimes fail.</p>","PeriodicalId":51371,"journal":{"name":"Language Learning","volume":null,"pages":null},"PeriodicalIF":3.5000,"publicationDate":"2023-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/lang.12582","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Learning","FirstCategoryId":"98","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/lang.12582","RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
引用次数: 0
Abstract
Open research has a long tradition in the field of artificial intelligence (AI), which is our primary area of expertise. Richard Stallman, who has been affiliated with the AI laboratory at the Massachusetts Institute of Technology since the early 1970s, launched the GNU project in 1983 and the Free Software Foundation in 1985. The goal of the free software movement has been to secure freedoms for software users to run, study, modify, and share software. GNU software grants these rights through licenses that allow anyone to read the code but prevent anyone from distributing changed versions of the software without sharing those changes. The open data movement in AI was spearheaded by the Machine Learning Repository, created in 1987 by David Aha and fellow graduate students at the University of California, Irvine. This repository still hosts a collection of datasets that can be used for machine learning. One of the first digital-first scientific journals was the Journal of Artificial Intelligence Research (JAIR), established in 1993 on the initiative of Steven Minton. The journal is an open access, peer-reviewed scientific publication and has been community driven since its inception. It has no publishing fees, and all expenses have been covered by donations. Because it is hosted online, it supports publishing digital source material, such as code and data.
AI research is a young science that continuously seeks to improve its research methodology and the quality of its published research. Although there is currently a movement toward publishing research in journals, a substantial number of scientific articles in AI are still published in conference proceedings. The conferences with the highest impact, such as those of the Association for the Advancement of Artificial Intelligence, Neural Information Processing Systems, the International Conference on Machine Learning, and the International Joint Conference on Artificial Intelligence, are community driven, and the articles presented and published in these venues are open access. Some of the proceedings are published by the Journal of Machine Learning Research, which was established in 2001 as an open access alternative to the journal Machine Learning, allowing authors to publish for free and retain copyright. All these venues also promote and facilitate public sharing of research artifacts.
Among the many open research practices in our field of expertise, some of the most impactful have targeted research reproducibility. In this commentary, we have therefore focused on reproducibility, in the hope that researchers in the language sciences might benefit from the experience of AI scholars. One recent initiative in AI research involved reproducibility checklists, introduced at all the most impactful AI conferences to improve the rigor of the research presented and published there. These checklists must be completed by all authors when submitting articles to conferences, and they cover various aspects of research methodology, including whether data and code are shared. The checklists were introduced as a response to the reproducibility crisis and in recognition of the field's challenges with methodological rigor. Reproducibility badges have also been introduced at several conferences and journals, and soon in JAIR as well (Gundersen, Helmert, & Hoos, 2023). The badges indicate whether the research artifacts, such as data and code, that are required for reproducing the research have been shared. In some cases, reviewers evaluate the artifacts as well, which can earn the authors another badge if the reviewers are able to reproduce the research. However, this is a considerable task, recognized by many as too much to ask of reviewers. Instead, AI scholars now organize reproducibility challenges: a separate track at a conference or a workshop where the goal is to attempt to reproduce a scientific article of choice and to write a report on this effort. Some of these reports have been published in the community-driven, open access journal ReScience C. One issue with these initiatives is that the results of the replication efforts are not linked to the original scientific article. To address this shortcoming, a new procedure is currently being introduced at JAIR, where reports documenting third-party efforts to reproduce research are published in the journal alongside the article being reproduced. This closes the gap between the reproducibility effort and the original work, in the sense that high-quality research that is easy to reproduce will get credit, and readers will be made aware of research that is not easily reproducible. JAIR ensures that the original authors get to provide feedback on reproducibility reports and that any mistakes or misunderstandings by the third-party researchers are corrected.
One challenge to reproducibility is conceptual. The terminology around reproducibility has been called confused by Plesser (2018), and we agree. We believe this confusion arises because the term is usually defined without any attempt to operationalize it at the same time. Hence, we have tried to define reproducibility in such a way that the concept becomes operationalizable. We use machine learning to illustrate our reasoning because it is our domain, which we know well, and because machine learning is a computational science, so most of its experiments can be fully described in code and automated. We believe this is a strength, as it allows us to be specific about what an experiment is and what reproducibility should then mean. However, we think that this definition of reproducibility is generalizable to all sciences.
In Gundersen (2021), reproducibility is defined as "the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators" (p. 10). The documentation used to conduct a reproducibility experiment determines the reproducibility type of that experiment, and the way the conclusion is reached determines the degree to which the experiment reproduces the conclusion.
An experiment can be documented in many ways. Traditionally, experiments have been documented only as text, and this is still the case for a large portion of all published studies, because this is the only way to document research in many settings. However, experiments do not have to be documented only as text; for example, if data collection is carried out automatically or if data analysis is performed computationally, both the data and the analysis can be shared. In computer science and AI research, most experiments can be fully automated and executed by a computer, which means that the complete experiments can be shared. The reproducibility type of an experiment is defined by which of these artifacts are shared with independent investigators replicating the initial study.
We want to emphasize two important points. First, what sharing code entails differs between a machine learning experiment, where the complete research protocol (from data collection to analysis) can be reproduced and the study's conclusion reached if all statistical and analytical criteria are satisfied, and a medical experiment, where often only the digitized data and the analysis can be shared. Second, the textual description is important. Although the code and data could be shared without sharing a study's textual description, this is not enough. To validate whether the right experiment has been carried out, independent investigators need the textual description. Validation includes, but is not limited to, evaluating whether the experiment tests the hypothesis and whether the results are analyzed in an appropriate way. If the textual description is lacking, only some verification can be done, such as checking whether the code produces the expected result given the data.
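As a concrete illustration of this verification step, the sketch below shows, in Python, what checking shared code against shared data might look like. It is our own minimal example; the data file, analysis function, and reported accuracy are hypothetical placeholders rather than artifacts of any actual study.

```python
from typing import Callable

import pandas as pd

REPORTED_ACCURACY = 0.87  # hypothetical number copied from the original article
TOLERANCE = 1e-6


def verify_shared_artifacts(data_path: str,
                            run_analysis: Callable[[pd.DataFrame], float]) -> bool:
    """Re-run the shared analysis on the shared data and compare to the report.

    This checks only that the code yields the reported number (verification);
    it cannot tell us whether the experiment tests the stated hypothesis
    (validation), which requires the textual description.
    """
    data = pd.read_csv(data_path)
    reproduced_accuracy = run_analysis(data)
    return abs(reproduced_accuracy - REPORTED_ACCURACY) < TOLERANCE
```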
We propose three different degrees of reproducibility, depending on how the conclusion of a reproducibility experiment is reached: (a) outcome reproducible (OR) is when a reproducibility experiment produces exactly the same outcome (finding) as the initial experiment, (b) analysis reproducible (AR) is when the same analysis as in the initial experiment is carried out to reach the same conclusion but with a different (nonidentical) finding, and (c) interpretation reproducible (IR) is when the interpretation of the analysis is the same, so that the same conclusion is reached even though the analysis differs from the initial one. Here, too, let us emphasize three essential points. First, we do not distinguish between the terms reproducibility and replicability but instead cover the same concepts through degrees of reproducibility. Second, in many scientific areas, a study's products or outcomes are described as data. However, in other areas, especially in machine learning, data are often an input to a study. To avoid ambiguity, we use the term outcome to refer to the finding of a study, and we use data to describe a study's input or stimuli, such as the images containing objects (data) that are classified in a visual perception task. This use of the term "data" appears to map loosely onto "materials" in language-focused studies. Third, outcome reproducibility is basically impossible in noncomputational experiments and, if achieved, is only spurious. This is often the case for highly complex computational experiments as well. However, the distinction is important for properly understanding the concept of reproducibility.
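To show how these degrees could be operationalized in practice, consider the following minimal Python sketch. It is an illustration of the taxonomy above, not an implementation from any cited work, and the fields used to represent an experiment (a numeric outcome, an analysis identifier, and a conclusion) are simplifying assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Degree(Enum):
    OUTCOME = "OR"           # identical outcome (finding)
    ANALYSIS = "AR"          # same analysis and conclusion, nonidentical finding
    INTERPRETATION = "IR"    # different analysis, same conclusion
    NOT_REPRODUCED = "none"  # the same conclusion was not reached


@dataclass
class Experiment:
    outcome: float    # the quantitative finding, e.g., an accuracy or effect size
    analysis: str     # identifier of the analysis method used
    conclusion: str   # the conclusion drawn from interpreting the analysis


def degree_of_reproducibility(initial: Experiment, reproduction: Experiment,
                              tolerance: float = 1e-9) -> Degree:
    """Classify a reproduction attempt against the initial experiment."""
    if reproduction.conclusion != initial.conclusion:
        return Degree.NOT_REPRODUCED
    if reproduction.analysis == initial.analysis:
        if abs(reproduction.outcome - initial.outcome) <= tolerance:
            return Degree.OUTCOME
        return Degree.ANALYSIS
    return Degree.INTERPRETATION


# Example with hypothetical numbers:
# initial = Experiment(outcome=0.87, analysis="t-test", conclusion="effect present")
# repro = Experiment(outcome=0.84, analysis="t-test", conclusion="effect present")
# degree_of_reproducibility(initial, repro)  # -> Degree.ANALYSIS, i.e., "AR"
```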
Marsden and Morgan-Short raise the issue of replicating highly influential older studies that used methods and analytical practices that do not reflect the current field-specific standards. The degrees of reproducibility, as described here, illustrate this situation. Let us explain. When trying to reproduce a highly influential older study, one could choose to use the out-of-date methods or analytical practices to validate the conclusions of the initial experiment, or one could opt for new methods and analytical practices. The experiment would be analysis reproducible (AR) if researchers reach the same conclusions by relying on old analytical practices. In contrast, the experiment would be interpretation reproducible (IR) if researchers reach the same conclusion by modernizing their analytical practices.
Furthermore, Marsden and Morgan-Short also remark on the difficulty of reproducing an initial study's finding when the full materials are not provided by the original author. The full materials may not be required to reproduce an experiment. This is captured through the various types of reproducibility. If only a textual description is made available by the original investigators (i.e., in an R1 description reproducibility experiment), but the independent investigators use the same analytical methods to reach the same conclusion, then this reproducibility experiment is analysis reproducible (R1AR). However, if new analytical practices are used in the same situation, then the reproducibility experiment will be classified as interpretation reproducible (R1IR).
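Reusing the Degree enumeration from the earlier sketch, labeling a reproducibility experiment by type and degree could be expressed as follows; note that only R1 (description only) and R4 (all artifacts shared) are named explicitly in this commentary, so any other type identifiers would be our own assumptions.

```python
def reproducibility_label(repro_type: str, degree: Degree) -> str:
    """Combine a reproducibility type with a degree, e.g., ("R1", AR) -> "R1AR"."""
    return f"{repro_type}{degree.value}"


# reproducibility_label("R1", Degree.ANALYSIS)        # -> "R1AR"
# reproducibility_label("R1", Degree.INTERPRETATION)  # -> "R1IR"
```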
Marsden and Morgan-Short explained that, in studies on language learning, replication results were less likely to support the initial finding when materials were not provided. This could mean that the conclusions of the initial experiment do not generalize well. The type of reproducibility study also reflects the generalizability of an experiment, going from R4 (least generalizable) to R1 (most generalizable). For instance, the two situations described above, namely, analysis and interpretation reproducible experiments based on a textual description only, would be classified as R1 (most generalizable). In contrast, when an experiment can only be reproduced with the full materials, its conclusions might not be as generalizable as those from an experiment whose findings can be reproduced from a textual description alone. In AI research, the original investigators are in fact incentivized to share fewer study materials, because doing so forces other researchers to expend more effort and to reproduce those findings at the highest degree of generalization possible. Whereas this strategy might attract attention to individual researchers, it ultimately represents an antisocial practice with respect to the research community: it makes third parties less likely to reproduce a given finding, so it is a net loss for the community (for more detail, see Gundersen, 2019).
To further increase the understanding of reproducibility, we have not only surveyed the existing literature for variables that can lead to a lack of reproducibility but also analyzed how these variables affect the various degrees of reproducibility (Gundersen, Coakley, et al., 2023). Among the sources of irreproducibility, we have identified study design variables, algorithmic variables, implementation variables, observation variables, evaluation variables, and documentation variables. Understanding these sources of irreproducibility will help researchers to operationalize reproducibility research by highlighting links between a given study's degree of reproducibility and the various design decisions that allow the study to achieve that reproducibility. For example, if researchers try to reproduce an experiment and cannot achieve analysis reproducibility when evaluating a study's outcomes, they could identify the potential sources of irreproducibility affecting their analysis. We believe that it could be very useful for scholars in other sciences, including the language sciences, to identify the variables that can cause experiments to be irreproducible. This will not only help increase researchers' methodological rigor but also enhance their understanding of why reproducibility experiments sometimes fail.
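As a rough illustration of how such an analysis could be put to work, the sketch below arranges the six categories into a simple checklist to consult when a reproduction attempt falls short of a given degree. The example items under each category are our own illustrative guesses, not quoted from Gundersen, Coakley, et al. (2023).

```python
# The six categories of irreproducibility sources named above, arranged as a
# checklist; the example items under each category are illustrative guesses.
IRREPRODUCIBILITY_SOURCES = {
    "study design variables": ["hypothesis not clearly stated", "choice of benchmark or sample"],
    "algorithmic variables": ["random seeds", "nondeterministic operations"],
    "implementation variables": ["library versions", "hardware differences"],
    "observation variables": ["data collection procedure", "data preprocessing"],
    "evaluation variables": ["choice of metric", "statistical test used"],
    "documentation variables": ["missing hyperparameters", "ambiguous textual description"],
}


def review_checklist(failed_degree: str) -> None:
    """Print the categories to review when a reproduction falls short of a degree."""
    print(f"Reproduction did not reach degree {failed_degree}; review:")
    for category, examples in IRREPRODUCIBILITY_SOURCES.items():
        print(f"- {category} (e.g., {', '.join(examples)})")


# review_checklist("AR")
```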