Beyond PRISMA 2020 and AMSTAR 2: Further Actions Are Needed to Deal With Problematic Meta-Analyses

Learned Publishing · Pub Date: 2025-02-04 · DOI: 10.1002/leap.1666
Philippe C. Baveye

Abstract

For more than 10 years, researchers have routinely complained that, because of the fast expansion of the scholarly literature, it is becoming very challenging to keep abreast of novel developments in even a very narrow portion of their discipline (e.g., Baveye 2014). At the same time, journal editors have experienced increasing difficulty recruiting reviewers (Siegel and Baveye 2010). Over the last few years, the situation does not appear to have improved significantly (West and Bergstrom 2021; Baveye 2021a, 2021b). The scholarly literature keeps expanding at an exponential rate: according to some estimates, 5.14 million articles were published during 2022, substantially more than the 4.18 million published just 4 years earlier (Curcic 2023). More than ever, with competing demands on their time for teaching, supervising undergraduate and graduate students, reviewing for journals, and writing numerous proposals to compete for limited funding, researchers generally find it virtually impossible to devote the hours that would be needed to read articles of direct interest to them in sufficient depth.
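
As an aside, the two figures just cited are enough to quantify that growth: assuming steady compounding between 2018 and 2022, they imply an annual growth rate of roughly 5%, as the quick check below shows.

```python
# Quick check of the growth implied by the figures from Curcic (2023):
# 4.18 million articles in 2018 rising to 5.14 million in 2022.
growth = (5.14 / 4.18) ** (1 / 4) - 1   # compound annual growth rate
print(f"implied annual growth rate: {growth:.1%}")  # prints ~5.3%
```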

Not surprisingly in this context, a significant effort has unfolded to review and synthesise relatively large bodies of literature and make their content more readily accessible to researchers and policy-makers. In recent years, tens of thousands of systematic reviews, and especially of “meta-analyses”, have been written. The staggering scale of the endeavour is evinced by the fact that the article by Page et al. (2021), proposing revised reporting guidelines for meta-analyses, has already been cited over 79,000 times in only 3 years, according to Google Scholar (https://scholar.google.com; last retrieved January 29, 2025). Because even staying abreast of meta-analyses is proving time-consuming in virtually all disciplines, a trend is currently emerging of synthesising meta-analyses via what have been referred to as “second-order” meta-analyses (e.g., Schmidt and Oh 2013; Bergquist et al. 2023), or of carrying out “overviews of systematic reviews” (Lunny et al. 2024). In 2023 alone, more than 7,000 articles referred to these practices, according to Google Scholar.

No doubt part of the appeal of the meta-analysis method over the years has been its original description as a robust technique with a strong statistical foundation (Glass 1976; Shadish and Lecy 2015). Nevertheless, implementations of the method in practice have been the object of strong criticism, in particular in research on education (Abrami, Cohen, and d'Apollonia 1988; Ropovik, Adamkovic, and Greger 2021), medicine (Haidich 2010; Hedin et al. 2016; Chapman 2020), plant ecology (Koricheva and Gurevitch 2014), agronomy (Philibert, Loyce, and Makowski 2012; Beillouin, Ben-Ari, and Makowski 2019; Krupnik et al. 2019) and soil science (Fohrafellner et al. 2023), where researchers who have assessed the quality of meta-analyses found it to be low overall, and noticed that the core quality criteria necessary to conduct sound meta-analyses did not appear to be well understood by authors. Among the criteria that seem to be the most problematic in these various assessment exercises are the theory-based requirements for studies included in meta-analyses to be weighted according to the inverse of their variance, for meta-analyses to avoid mixing primary studies that have no connection with each other—the “apples and oranges” problem (Sharpe 1997)—and for authors of meta-analyses to pay close attention to any bias that may exist in the literature, for example when journals only publish articles describing positive, statistically significant results.
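
To make the first of these criteria concrete, the sketch below shows what inverse-variance weighting looks like in the simplest (fixed-effect) setting; the effect sizes and variances are invented for illustration and are not drawn from any real meta-analysis.

```python
# A minimal sketch of inverse-variance weighting in a fixed-effect
# meta-analysis; effect sizes and sampling variances are hypothetical.
import numpy as np

effects = np.array([0.42, 0.15, 0.30, 0.51])    # per-study effect sizes
variances = np.array([0.04, 0.01, 0.09, 0.02])  # per-study sampling variances

weights = 1.0 / variances                       # each study weighted by 1/variance
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))      # standard error of the pooled effect

print(f"pooled effect = {pooled:.3f}, 95% CI half-width = {1.96 * pooled_se:.3f}")
```

Precise studies (those with small variance) thus dominate the pooled estimate, which is precisely the requirement the quality surveys cited above found to be poorly understood.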

Mounting concern about the soundness of published meta-analyses and systematic reviews encouraged various groups to develop guidelines for either their rigorous implementation or their proper reporting. Among them, the collaborative effort of QUOROM (QUality Of Reporting Of Meta-analyses), initiated in 1996, eventually led to the development of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (Moher et al. 2009), which was recently revised as PRISMA 2020 (Page et al. 2021). It appears to be adopted voluntarily by a growing number of researchers, and adherence to it is currently encouraged or even mandated by an increasing number of journals (Baveye 2024). Another set of guidelines is provided by AMSTAR (A MeaSurement Tool to Assess systematic Reviews), published in 2007 and revised in 2017 as AMSTAR 2 (Shea et al. 2007, 2017). In some ways, AMSTAR 2 provides a more comprehensive assessment of systematic reviews, since it focuses not only on their adherence to reporting standards but also on their methodological rigour, and addresses in detail issues such as protocol registration, risk-of-bias appraisals, and study selection processes. As with PRISMA 2020, some scholarly journals are now making the use of AMSTAR 2 mandatory for all new manuscripts (e.g., Chapman 2020).

The current practice of adopting PRISMA 2020, AMSTAR 2, or other similar checklists in the conduct and reporting of meta-analyses and systematic reviews is undoubtedly a step in the right direction. However, it raises two crucial questions. The first relates to whether this step will be sufficient, in and of itself, to weed out inadequate research syntheses in the future. There are indeed reasons to doubt whether, as both PRISMA 2020 and AMSTAR 2 implicitly assume, it is possible at this stage to diagnose accurately whether there is a significant publication bias in a given body of literature, whether something can be done to adjust for it, and, therefore, whether a meta-analysis or systematic review is meaningful. If this assumption is not warranted, the risk is that some meta-analyses that are currently published might eventually turn out, after additional research, to be fundamentally erroneous and misleading, which would clearly have devastating repercussions on the public's confidence in science. The question therefore is whether something can be done relatively quickly to improve this situation. The second question relates to how we should handle research syntheses that were published in the past, did not follow checklists like PRISMA or AMSTAR, and are demonstrably deficient in one or more of their components. They are routinely cited at present as entirely admissible evidence, when one might reasonably argue that they should not be. The question is then how, practically, we can prevent citations to such potentially misleading publications. Following brief allusions by Baveye (2024) to these two questions and to how journal editors might deal with them, we address them in more detail and from a broader perspective in what follows.

Over the years, it has become the implicit rule for scholarly journals in most disciplines not to publish articles describing ‘negative’ or non-statistically-significant results. Ritchie (2020) refers to this tradition as ‘one of science's most embarrassing secrets.’ It is so familiar to researchers that journal editors no longer have to spell it out explicitly. Authors know that if they submit manuscripts that do not conform to the unwritten rule in this respect, it is almost guaranteed that they will not fare well in the review process, so eventually they self-censor. In the least bad scenario, supposedly unpublishable results end up in the ‘grey’ (i.e., non-peer-reviewed) literature in the form of technical reports, unpublished Ph.D. dissertations, or papers in conference proceedings. Evidence suggests, however, that researchers often do not even bother writing up and sharing with colleagues results that are generally considered to be of negligible import, because these cannot lead to publications in peer-reviewed journals (Franco, Malhotra, and Simonovits 2014).

Years ago, researchers realised that the reluctance of editors to publish negative or non-statistically-significant results meant that only a portion of all results would be available for review in meta-analyses, and therefore that the latter might be biased, leading to potentially erroneous conclusions (e.g., Eysenck 1984, 1994; Rothstein 2008; Haidich 2010; Lin and Chu 2018; Furuya-Kanamori, Barendregt, and Doi 2018; Nair 2019; Ritchie 2020; Borenstein et al. 2021). Various methods have been devised to assess whether a significant publication bias exists in a given body of literature. Unfortunately, most of these methods have been shown time and again to have shortcomings, so that after several decades a consensus still does not appear to exist among specialists about the best assessment approach (e.g., Rothstein 2008; Fragkos, Tsagris, and Frangos 2014; McShane, Böckenholt, and Hansen 2016; Lin and Chu 2018; Shi and Lin 2019; Mathur and VanderWeele 2020; Afonso et al. 2024; Lunny et al. 2024).

One way out of this conundrum involves trying to convince statisticians to keep developing novel, more sophisticated methods to assess whether a given body of literature exhibits publication bias. This is clearly a daunting challenge, since assessing publication bias basically amounts to trying to say something conclusive about a situation in which a portion of the relevant data is entirely missing and unknown. In this respect, Ritchie (2020) refers derisively to techniques like the commonly used ‘funnel’ plots for detecting publication bias (Egger et al. 1997), and the thorough analysis by Shi and Lin (2019) of the “trim and fill” method, which relies on filling the data gaps, shows that it is fraught with difficulties. Nevertheless, research in this area needs to be encouraged, new approaches need to be explored, and precise information needs to be obtained on the conditions under which publication bias can be effectively detected and adjusted for, so that meta-analyses and systematic reviews become feasible.
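
For readers unfamiliar with these tools, the sketch below illustrates the regression-based asymmetry test usually associated with funnel plots (Egger et al. 1997): the standardised effect is regressed on precision, and an intercept far from zero is taken as a warning sign of small-study effects. The numbers are invented for illustration, and the test inherits all the limitations discussed above.

```python
# A minimal sketch of Egger's regression test for funnel-plot asymmetry
# (Egger et al. 1997); effect sizes and standard errors are hypothetical.
import numpy as np
from scipy import stats

effects = np.array([0.62, 0.55, 0.40, 0.33, 0.28, 0.20, 0.15, 0.10])
ses = np.array([0.30, 0.28, 0.22, 0.18, 0.15, 0.10, 0.08, 0.05])

z = effects / ses        # standardised effects
precision = 1.0 / ses    # precision of each study

res = stats.linregress(precision, z)
t_int = res.intercept / res.intercept_stderr            # requires SciPy >= 1.7
p_int = 2 * stats.t.sf(abs(t_int), df=len(effects) - 2)
print(f"Egger intercept = {res.intercept:.2f} (p = {p_int:.4f})")
# An intercept significantly different from zero suggests asymmetry, but, as
# noted above, a clean result does not prove the absence of publication bias.
```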

Based on attempts to deal with the publication bias issue over the last 20 years, one could argue that the only foolproof way to deal with it is to avoid it in the first place. As Rothstein (2008) wrote 16 years ago already, ‘Over the long run, the best way to deal with publication bias is to prevent it’. In practice, that means scholarly journals should start routinely publishing what is currently considered to be ‘negative’ results. Some journals have tried to convey the idea that they would be open to this approach. The editors of the journal PLOS ONE, for example, state on the journal's website that they ‘evaluate research on scientific validity, strong methodology, and high ethical standards—not perceived significance’, which appears to head in the right direction. However, this attitude is neither widely shared nor adopted by high-impact-factor journals at this point. What it would take to make rapid progress in this respect would be for journal editors to let prospective authors know, clearly and unambiguously, that their journals welcome manuscripts regardless of the direction, positive or negative, of the effects observed, or of their level of statistical significance, as long as the research was carried out with an appropriate level of rigour (Baveye 2024). Editors willing to follow this path could find support for it in the work of the philosopher Karl Popper (1963), who, in a nutshell, argued that theories cannot be validated, only invalidated or falsified. From that perspective, which many researchers have espoused over the years, one could argue that the rigorous, carefully carried out experiments that are the most valuable for the advancement of science are those that clearly invalidate a given hypothesis, or at least show that differences predicted on its basis to be significant under specific circumstances turn out not to be so in practice. In editorials describing the steps implemented by their journal to reduce the incidence of publication bias, editors could invoke this Popperian viewpoint to rationalise the shift in their policy.

Some advocates for the status quo may consider that this new editorial policy would amount to a drastic change in our publishing culture, one that researchers would find difficult to adjust to or may suspect would generate articles of low quality. However, one might argue that in many cases it would not be such a major revolution, and would only require a minor adjustment of how we set up the working hypotheses that structure reports on our work. Especially in disciplines like mine (soil science), in which we deal with very complex systems, many aspects of which we still do not understand, much of the research we carry out is exploratory in nature, driven by questions but not necessarily by definite hypotheses. In spite of that, when writing the ensuing manuscripts, we often feel we have to formulate a hypothesis a posteriori, to which the experiments are then supposed to lend support. Under these conditions, without any loss of rigour or scientific integrity, nothing would prevent us from setting up working hypotheses at the write-up stage in such a way that the experiments appear to produce a positive outcome, thereby circumventing any reluctance on the part of journal editors or reviewers toward publishing negative results. The soil science article of Wu et al. (2010) illustrates this approach. It is very likely that it would never have been accepted for publication if the working hypothesis had stated that a particular measurement technique (visible–near-infrared spectroscopy) can directly detect heavy metals in soils, and the results had turned out negative. Instead, the article posited, based on prior knowledge, that the spectroscopic method should theoretically not be expected to ‘see’ heavy metals in soils, and it ended up confirming this working hypothesis, an apparently positive message that was very well received by reviewers.

Whenever it is suggested that editors of scholarly journals should allow the publication of ‘negative’ or non-statistically-significant results, it is common for someone to ask what kind of incentive structure would be needed for that to finally happen, more than 16 years after Rothstein (2008) pointed out how crucial it was. Insofar as editors and researchers are concerned, it is not obvious that incentives would be absolutely required. As the situation currently stands, both groups would lose a lot, and their work would become more complicated, if the research syntheses being published in increasing numbers were eventually to turn out partly or entirely erroneous and misleading. That prospect alone, one would think, accompanied by the loss of funding that would likely result from the public's and decision makers' diminishing trust in research, should make it compelling for editors and researchers alike to try to sanitise the field and prevent the publication of problematic meta-analyses and systematic reviews. For the same reason, funding agencies should be interested in promoting the publication of all outcomes of the research they sponsor, regardless of the direction of the observed effects, and could exert a significant influence in that sense. Finally, the stakeholders who ultimately stand to benefit the most from the advancement of science (for example, the private sector taking advantage of research breakthroughs to develop new products, or governmental agencies basing some of their key decisions on scientific findings) should also be interested in making sure that scientific publications are as trustworthy as possible and, in particular, that research syntheses are based on complete sets of results instead of very biased subsets.

The previous section indicated what should be done to ensure, as much as possible, that meta-analyses published in the future meet appropriate quality standards. However, this still leaves open the question of how one should handle meta-analyses that were published in the past, did not meet one or more quality criteria, may be misleading, and therefore, one could argue, should not be cited. To a large extent, this issue extends beyond flawed meta-analyses and is pertinent to all manuscripts. Many articles that have been withdrawn by their authors, or retracted by the journals in which they were published, continue to be heavily cited years later, as if they were still legitimate.

In fields where quality surveys have shown some meta-analyses to be seriously flawed (e.g., Haidich 2010; Koricheva and Gurevitch 2014; Philibert, Loyce, and Makowski 2012; Beillouin et al. 2019; Krupnik et al. 2019; Fohrafellner et al. 2023), their number is small enough that reviewers could be instructed to flag them in manuscripts submitted for publication, and to recommend either that they not be cited or that citations be accompanied by cautionary statements. However, the number of meta-analyses in that category is infinitesimal compared to the many tens of thousands of meta-analyses in the literature as a whole. Editors cannot realistically ask already over-solicited reviewers to systematically assess every single meta-analysis cited in the manuscripts they are asked to evaluate, and to weed out inappropriate citations. Another approach is needed.

As suggested by Baveye (2024), one could imagine creating a kind of registry of articles that were either withdrawn or demonstrated in the peer-reviewed literature to be flawed in some way, and using this registry to inform reviewers. To ensure that such an approach would be useful, one needs to think carefully about how such a registry should be set up, what the most practical way would be for technical editors to access it, and what information resulting from it should be conveyed to reviewers. Perhaps the easiest way to answer these questions without getting overly technical is to proceed backward through them, that is, to describe first the type of information that would be useful to reviewers and the format in which it should be delivered, before thinking about how this information could be generated.

Whenever reviewers are asked to assess manuscripts containing citations to meta-analyses, they need to know whether these citations are appropriate, that is, whether the cited meta-analyses fare well on checklists like PRISMA 2020, AMSTAR 2, or similar ones. AMSTAR 2 can be particularly useful in this context, as it could serve as an additional filter to identify flawed studies that did not adhere to acceptable methodological standards, offering a systematic way to flag these works so that future citations of them can be handled with caution. For each cited meta-analysis, regardless of the checklist that is used, an overall score could be computed, following, for example, Fohrafellner et al. (2023). However, such a score may not be sufficiently instructive, since a high score on one criterion might compensate for a low score on another. At the other extreme, just listing the scores for each individual quality criterion in the checklists might overwhelm reviewers, even if a graphical summary is provided (Bougioukas et al. 2024; Karakasis et al. 2024). A happy medium might consist of singling out a few key criteria (like the weighting of primary data with the inverse of their variance), reported individually, and compounding the others, which may be viewed as less essential, as in the sketch below.
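
A minimal sketch of what such a summary could look like, assuming hypothetical criterion names and scores (they are illustrative, not items quoted from PRISMA 2020 or AMSTAR 2):

```python
# Hypothetical summary of checklist scores for one cited meta-analysis:
# key criteria are reported individually, the rest are compounded into a mean.
KEY_CRITERIA = {"inverse_variance_weighting", "publication_bias_assessed"}

def summarise_checklist(scores: dict) -> dict:
    """Report key criteria individually; average the less essential ones."""
    key = {c: s for c, s in scores.items() if c in KEY_CRITERIA}
    rest = [s for c, s in scores.items() if c not in KEY_CRITERIA]
    return {"key_criteria": key,
            "other_criteria_mean": sum(rest) / len(rest) if rest else None}

print(summarise_checklist({
    "inverse_variance_weighting": 0.0,   # a failed key criterion stays visible
    "publication_bias_assessed": 1.0,
    "protocol_registered": 1.0,
    "search_strategy_reported": 0.5,
}))
```

A failed key criterion thus remains visible instead of being averaged away, which addresses the compensation concern raised above.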

Who would generate the assessments of cited meta-analyses provided to reviewers? The most straightforward approach would be for technical editors to use dedicated software that would scan the files of submitted manuscripts, use artificial-intelligence tools—as Scite.AI (https://scite.ai) does—to determine whether they cite previously published meta-analyses in more than a casual way (i.e., not merely as part of the background publications on a given topic, but specifically in terms of their conclusions), assess the adequacy of the cited meta-analyses criterion by criterion using checklists like PRISMA 2020 or AMSTAR 2, and issue a report along the lines described in the previous paragraph to inform reviewers. This dedicated software could be upgraded periodically as new tools to deal with the problem of publication bias emerge in the future.
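
As a rough illustration of how such a screening step could be structured, consider the skeleton below; every helper function in it is a hypothetical stub standing in for components (a citation extractor, a Scite.AI-style context classifier, a checklist-based assessor) that would need to be developed.

```python
# Illustrative skeleton of the screening pipeline; every helper below is a
# hypothetical stub, not a call to any real citation-analysis library.
def extract_citations(text: str) -> list:
    # Stub: a real version would parse in-text citations and the reference list.
    return [ref.strip() for ref in text.split(";") if ref.strip()]

def cites_conclusions_of_meta_analysis(ref: str) -> bool:
    # Stub: a real version would classify whether the citing manuscript
    # actually leans on the meta-analysis' conclusions.
    return "meta-analysis" in ref.lower()

def assess_against_checklist(ref: str) -> dict:
    # Stub: a real version would score the cited article criterion by
    # criterion against PRISMA 2020 or AMSTAR 2.
    return {"inverse_variance_weighting": 1.0, "publication_bias_assessed": 0.5}

def screen_manuscript(text: str) -> list:
    """Build one per-citation report to forward to reviewers."""
    return [{"citation": ref, "scores": assess_against_checklist(ref)}
            for ref in extract_citations(text)
            if cites_conclusions_of_meta_analysis(ref)]

print(screen_manuscript("Smith 2020 meta-analysis of cover crops; Jones 2018 field trial"))
```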

A possible objection to technical editors being asked to subject all manuscripts to this scrutiny is that it would increase their workload. However, one could argue that the process would not be very different from what they already do routinely to check for plagiarism. In that case as well, they use dedicated software, which scans the internet very rapidly to determine whether sections of text, figures, or pictures in the manuscripts are copied from other documents. Running the software itself does not take much time, whereas interpreting the results may in some cases require some attention. The use of ‘meta-analysis checking software’ would be very similar in a number of ways, although it might actually be faster and easier, for two reasons. One is that this software might be sped up considerably by systematically storing information about assessed meta-analyses in a database that the program would consult before proceeding to any new assessment. That way, the database would become more extensive the more it is used, and the time needed to evaluate the citations in a given manuscript would become progressively shorter. The other reason why the adoption of such software would constitute a lesser burden for technical editors than plagiarism detection is that reports on meta-analyses could simply be forwarded to reviewers, who would then evaluate whether there is a problem worth pointing out to the authors.
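
A minimal sketch of that look-up-before-reassessing logic, assuming a simple SQLite table keyed by DOI (the schema and the assess placeholder are illustrative assumptions, not a published design):

```python
# Cache of past assessments: each DOI is assessed once, then reused.
import json
import sqlite3

conn = sqlite3.connect("meta_assessments.db")
conn.execute("CREATE TABLE IF NOT EXISTS assessments (doi TEXT PRIMARY KEY, scores TEXT)")

def get_or_assess(doi: str, assess) -> dict:
    row = conn.execute("SELECT scores FROM assessments WHERE doi = ?", (doi,)).fetchone()
    if row:                               # already assessed: reuse the stored result
        return json.loads(row[0])
    scores = assess(doi)                  # first encounter: run the (slow) assessment
    conn.execute("INSERT INTO assessments VALUES (?, ?)", (doi, json.dumps(scores)))
    conn.commit()
    return scores

# Usage with a hypothetical assessor:
print(get_or_assess("10.1000/example.doi", lambda d: {"overall": 0.7}))
```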

To develop the computer program needed to check published meta-analyses, a previous effort initiated a decade ago can serve as a useful blueprint. Nuijten (2016) came up with a program, ‘Statcheck’, to detect statistical errors in peer-reviewed psychology articles by searching articles for statistical results, redoing the calculations described in each article, and comparing the two values to see if they match. Using Statcheck, Nuijten et al. (2016) showed that half of the 30,000 psychology articles they re-analysed ‘contained at least one p-value that was inconsistent with its test’. These observations caused quite a stir in the field (Baker 2016; Wren 2018; Ritchie 2020). To date, more than 100,000 articles have been checked with Statcheck. Hartgerink (2016) alone ‘content mined’ more than 160,000 articles and eventually ran Statcheck on more than 50,000 of them. At this juncture, a growing number of journals are using Statcheck as part of their peer-review process. The content-mining tools used by Statcheck, and other similar ones developed in the past decade based on artificial intelligence, could relatively easily be used for the development of a computer program designed to systematically check past meta-analyses in the literature. Likewise, one could find guidance in the efforts of Shi and Lin (2019), who retrieved primary data from 29,932 meta-analyses in order to find out to what extent the so-called ‘trim and fill’ method could be effective in adjusting for publication bias.
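
The core of Statcheck's consistency check is simple enough to sketch: recompute the p-value implied by a reported test statistic and compare it with the reported one. Below is a hedged re-implementation for t tests, with invented reported values.

```python
# Recompute the p-value implied by a reported t statistic (in the spirit of
# Statcheck, for results such as 't(28) = 2.20, p = .04') and flag mismatches.
from scipy import stats

def t_test_consistent(t_value: float, df: int, reported_p: float,
                      tol: float = 0.005) -> bool:
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)   # two-sided p from t and df
    return abs(recomputed_p - reported_p) <= tol

print(t_test_consistent(2.20, 28, 0.04))   # True: recomputed p is about .036
print(t_test_consistent(1.00, 28, 0.01))   # False: recomputed p is about .326
```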

With tens of thousands of first- and second-order meta-analyses published each year, and serious concerns having been raised in the past about the quality of some of them, the research enterprise runs the risk of having to deal in the future with many published research-synthesis articles containing erroneous and/or misleading conclusions. This would likely have a devastating effect on the public's confidence in science and limit science's ability to positively inform policy-making in many areas (e.g., Ritchie 2020; Baveye 2024).

The development, and adoption by many journals, of comprehensive reporting and implementation guidelines for meta-analyses and systematic reviews (e.g., PRISMA 2020 and AMSTAR 2) definitely constitutes a positive step in this context, and may in the future effectively discourage authors from submitting meta-analyses that do not meet acceptable standards. Unfortunately, the use of these guidelines still leaves two basic questions unanswered. First, both PRISMA 2020 and AMSTAR 2 implicitly assume that it is possible at this point to diagnose accurately the absence of publication bias in a given field, which is a precondition for meaningfully carrying out any kind of research synthesis. Further research is definitely needed in this area. In the meantime, the only reasonable solution in this regard would be for scholarly journals to start routinely publishing ‘negative’ or non-statistically-significant results. This might seem like a drastic departure from a long tradition, but for various reasons it does not have to be, if one is willing to set up working hypotheses slightly differently when writing up manuscripts, or if one considers, following Popper, that the ‘falsification’ of theories is crucial to the advancement of science. Second, the current use of checklists like PRISMA 2020 or AMSTAR 2, while it will undoubtedly make the conclusions of future meta-analyses more reliable, raises the question of the appropriate way to deal with the many thousands of meta-analyses and systematic reviews that were published at a time when the use of these checklists was not required, and that continue to be cited extensively. A solution in this respect might be the development of a dedicated computer program, which technical editors of scholarly journals could use systematically to provide reviewers with information on the soundness of the meta-analyses cited in the manuscripts they are asked to evaluate. Hopefully, with both of these lines of action, we could reach a point, a couple of years from now, when the risks associated with problematic meta-analyses would be adequately minimised or even entirely eliminated.

Philippe C. Baveye is the sole author of the article, came up with the opinion that is presented, wrote the text, and handled its revision after reviews.

The author declares that the research was conducted in the absence of any commercial or financial relationships of any kind that could be construed as potential conflicts of interest, in the broader sense of the concept described in Baveye (2023).

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文