Beyond PRISMA 2020 and AMSTAR 2: Further Actions Are Needed to Deal With Problematic Meta-Analyses

Learned Publishing · Pub Date: 2025-02-04 · DOI: 10.1002/leap.1666
Philippe C. Baveye

Abstract

For more than 10 years, researchers have routinely complained that, because of the fast expansion of the scholarly literature, it is becoming very challenging to keep abreast of novel developments in even a very narrow portion of their discipline (e.g., Baveye 2014). At the same time, journal editors have experienced increasing difficulty recruiting reviewers (Siegel and Baveye 2010). Over the last few years, the situation does not appear to have improved significantly (West and Bergstrom 2021; Baveye 2021a, 2021b). The scholarly literature keeps expanding at an exponential rate: according to some estimates, 5.14 million articles were published during 2022, substantially more than the 4.18 million published just 4 years earlier (Curcic 2023). More than ever, with competing demands on their time for teaching, supervising undergraduate and graduate students, reviewing for journals, and writing numerous proposals to compete for limited funding, researchers generally find it virtually impossible to devote the hours that would be needed to read articles of direct interest to them in sufficient depth.
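
As an aside, the two figures just cited are enough to quantify that growth: assuming steady compounding between 2018 and 2022, they imply an annual growth rate of roughly 5%, as the quick check below shows.

```python
# Quick check of the growth implied by the figures from Curcic (2023):
# 4.18 million articles in 2018 rising to 5.14 million in 2022.
growth = (5.14 / 4.18) ** (1 / 4) - 1   # compound annual growth rate
print(f"implied annual growth rate: {growth:.1%}")  # prints ~5.3%
```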

Not surprisingly in this context, a significant effort has unfolded to review and synthesise relatively large bodies of literature and make their content more readily accessible to researchers and policy-makers. In recent years, tens of thousands of systematic reviews, and especially of “meta-analyses”, have been written. The staggering scale of the endeavour is evinced by the fact that the article by Page et al. (2021), proposing revised reporting guidelines for meta-analyses, has already been cited over 79,000 times in only 3 years, according to Google Scholar (https://scholar.google.com; last retrieved January 29, 2025). Because even staying abreast of meta-analyses is proving time-consuming in virtually all disciplines, a trend is currently emerging of synthesising meta-analyses via what have been referred to as “second-order” meta-analyses (e.g., Schmidt and Oh 2013; Bergquist et al. 2023), or of carrying out “overviews of systematic reviews” (Lunny et al. 2024). In 2023 alone, more than 7,000 articles referred to these practices, according to Google Scholar.

No doubt part of the appeal of the meta-analysis method over the years has been its original description as a robust technique with a strong statistical foundation (Glass 1976; Shadish and Lecy 2015). Nevertheless, implementations of the method in practice have been the object of strong criticism, in particular in research on education (Abrami, Cohen, and d'Apollonia 1988; Ropovik, Adamkovic, and Greger 2021), medicine (Haidich 2010; Hedin et al. 2016; Chapman 2020), plant ecology (Koricheva and Gurevitch 2014), agronomy (Philibert, Loyce, and Makowski 2012; Beillouin, Ben-Ari, and Makowski 2019; Krupnik et al. 2019) and soil science (Fohrafellner et al. 2023), where researchers who have assessed the quality of meta-analyses found it to be low overall, and noticed that the core quality criteria necessary to conduct sound meta-analyses did not appear to be well understood by authors. Among the criteria that seem to be the most problematic in these various assessment exercises are the theory-based requirements for studies included in meta-analyses to be weighted according to the inverse of their variance, for meta-analyses to avoid mixing primary studies that have no connection with each other—the “apples and oranges” problem (Sharpe 1997)—and for authors of meta-analyses to pay close attention to any bias that may exist in the literature, for example when journals only publish articles describing positive, statistically significant results.
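
To make the first of these criteria concrete, the sketch below shows what inverse-variance weighting looks like in the simplest (fixed-effect) setting; the effect sizes and variances are invented for illustration and are not drawn from any real meta-analysis.

```python
# A minimal sketch of inverse-variance weighting in a fixed-effect
# meta-analysis; effect sizes and sampling variances are hypothetical.
import numpy as np

effects = np.array([0.42, 0.15, 0.30, 0.51])    # per-study effect sizes
variances = np.array([0.04, 0.01, 0.09, 0.02])  # per-study sampling variances

weights = 1.0 / variances                       # each study weighted by 1/variance
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))      # standard error of the pooled effect

print(f"pooled effect = {pooled:.3f}, 95% CI half-width = {1.96 * pooled_se:.3f}")
```

Precise studies (those with small variance) thus dominate the pooled estimate, which is precisely the requirement the quality surveys cited above found to be poorly understood.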

Mounting concern about the soundness of published meta-analyses and systematic reviews encouraged various groups to develop guidelines for either their rigorous implementation or their proper reporting. Among them, the collaborative effort of QUOROM (QUality Of Reporting Of Meta-analyses), initiated in 1996, eventually led to the development of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (Moher et al. 2009), which was recently revised as PRISMA 2020 (Page et al. 2021). It appears to be adopted voluntarily by a growing number of researchers, and adherence to it is currently encouraged or even mandated by an increasing number of journals (Baveye 2024). Another set of guidelines is provided by AMSTAR (A MeaSurement Tool to Assess systematic Reviews), published in 2007 and revised in 2017 as AMSTAR 2 (Shea et al. 2007, 2017). In some ways, AMSTAR 2 provides a more comprehensive assessment of systematic reviews, since it focuses not only on their adherence to reporting standards but also on their methodological rigour, and addresses in detail issues such as protocol registration, risk-of-bias appraisals, and study selection processes. As with PRISMA 2020, some scholarly journals are now making the use of AMSTAR 2 mandatory for all new manuscripts (e.g., Chapman 2020).

The current practice of adopting PRISMA 2020, AMSTAR 2, or other similar checklists in the conduct and reporting of meta-analyses and systematic reviews is undoubtedly a step in the right direction. However, it raises two crucial questions. The first relates to whether this step will be sufficient, in and of itself, to weed out inadequate research syntheses in the future. There are indeed reasons to doubt whether, as both PRISMA 2020 and AMSTAR 2 implicitly assume, it is possible at this stage to diagnose accurately whether there is a significant publication bias in a given body of literature, whether something can be done to adjust for it, and, therefore, whether a meta-analysis or systematic review is meaningful. If this assumption is not warranted, the risk is that some meta-analyses that are currently published might eventually turn out, after additional research, to be fundamentally erroneous and misleading, which would clearly have devastating repercussions on the public's confidence in science. The question therefore is whether something can be done relatively quickly to improve this situation. The second question relates to how we should handle research syntheses that were published in the past, did not follow checklists like PRISMA or AMSTAR, and are demonstrably deficient in one or more of their components. They are routinely cited at present as entirely admissible evidence, when one might reasonably argue that they should not be. The question is then how, practically, we can prevent citations to such potentially misleading publications. Following brief allusions by Baveye (2024) to these two questions and to how journal editors might deal with them, we address them in more detail and from a broader perspective in what follows.

Over the years, it has become the implicit rule for scholarly journals in most disciplines not to publish articles describing ‘negative’ or non-statistically-significant results. Ritchie (2020) refers to this tradition as ‘one of science's most embarrassing secrets.’ It is so familiar to researchers that journal editors no longer have to spell it out explicitly. Authors know that if they submit manuscripts that do not conform to the unwritten rule in this respect, it is almost guaranteed that they will not fare well in the review process, so eventually they self-censor. In the least bad scenario, supposedly unpublishable results end up in the ‘grey’ (i.e., non-peer-reviewed) literature in the form of technical reports, unpublished Ph.D. dissertations, or papers in conference proceedings. Evidence suggests, however, that researchers often do not even bother writing up and sharing with colleagues results that are generally considered to be of negligible import, because these cannot lead to publications in peer-reviewed journals (Franco, Malhotra, and Simonovits 2014).

Years ago, researchers realised that the reluctance of editors to publish negative or non-statistically-significant results meant that only a portion of all results would be available for review in meta-analyses, and therefore that the latter might be biased, leading to potentially erroneous conclusions (e.g., Eysenck 1984, 1994; Rothstein 2008; Haidich 2010; Lin and Chu 2018; Furuya-Kanamori, Barendregt, and Doi 2018; Nair 2019; Ritchie 2020; Borenstein et al. 2021). Various methods have been devised to assess whether a significant publication bias exists in a given body of literature. Unfortunately, most of these methods have been shown time and again to have shortcomings, so that after several decades a consensus still does not appear to exist among specialists about the best assessment approach (e.g., Rothstein 2008; Fragkos, Tsagris, and Frangos 2014; McShane, Böckenholt, and Hansen 2016; Lin and Chu 2018; Shi and Lin 2019; Mathur and VanderWeele 2020; Afonso et al. 2024; Lunny et al. 2024).

One way out of this conundrum involves trying to convince statisticians to keep developing novel, more sophisticated methods to assess whether a given body of literature exhibits publication bias. This is clearly a daunting challenge, since assessing publication bias basically amounts to trying to say something conclusive about a situation in which a portion of the relevant data is entirely missing and unknown. In this respect, Ritchie (2020) refers derisively to techniques like the commonly used ‘funnel’ plots for detecting publication bias (Egger et al. 1997), and the thorough analysis by Shi and Lin (2019) of the “trim and fill” method, which relies on filling the data gaps, shows that it is fraught with difficulties. Nevertheless, research in this area needs to be encouraged, new approaches need to be explored, and precise information needs to be obtained on the conditions under which publication bias can be effectively detected and adjusted for, so that meta-analyses and systematic reviews become feasible.
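
For readers unfamiliar with these tools, the sketch below illustrates the regression-based asymmetry test usually associated with funnel plots (Egger et al. 1997): the standardised effect is regressed on precision, and an intercept far from zero is taken as a warning sign of small-study effects. The numbers are invented for illustration, and the test inherits all the limitations discussed above.

```python
# A minimal sketch of Egger's regression test for funnel-plot asymmetry
# (Egger et al. 1997); effect sizes and standard errors are hypothetical.
import numpy as np
from scipy import stats

effects = np.array([0.62, 0.55, 0.40, 0.33, 0.28, 0.20, 0.15, 0.10])
ses = np.array([0.30, 0.28, 0.22, 0.18, 0.15, 0.10, 0.08, 0.05])

z = effects / ses        # standardised effects
precision = 1.0 / ses    # precision of each study

res = stats.linregress(precision, z)
t_int = res.intercept / res.intercept_stderr            # requires SciPy >= 1.7
p_int = 2 * stats.t.sf(abs(t_int), df=len(effects) - 2)
print(f"Egger intercept = {res.intercept:.2f} (p = {p_int:.4f})")
# An intercept significantly different from zero suggests asymmetry, but, as
# noted above, a clean result does not prove the absence of publication bias.
```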

Based on attempts to deal with the publication bias issue over the last 20 years, one could argue that the only foolproof way to deal with it is to avoid it in the first place. As Rothstein (2008) wrote 16 years ago already, ‘Over the long run, the best way to deal with publication bias is to prevent it’. In practice, that means scholarly journals should start routinely publishing what is currently considered to be ‘negative’ results. Some journals have tried to convey the idea that they would be open to this approach. The editors of the journal PLOS ONE, for example, state on the journal's website that they ‘evaluate research on scientific validity, strong methodology, and high ethical standards—not perceived significance’, which appears to head in the right direction. However, this attitude is neither widely shared nor adopted by high-impact-factor journals at this point. What it would take to make rapid progress in this respect would be for journal editors to let prospective authors know, clearly and unambiguously, that their journals welcome manuscripts regardless of the direction, positive or negative, of the effects observed, or of their level of statistical significance, as long as the research was carried out with an appropriate level of rigour (Baveye 2024). Editors willing to follow this path could find support for it in the work of the philosopher Karl Popper (1963), who, in a nutshell, argued that theories cannot be validated, only invalidated or falsified. From that perspective, which many researchers have espoused over the years, one could argue that the rigorous, carefully carried out experiments that are the most valuable for the advancement of science are those that clearly invalidate a given hypothesis, or at least show that differences predicted on its basis to be significant under specific circumstances turn out not to be so in practice. In editorials describing the steps implemented by their journal to reduce the incidence of publication bias, editors could invoke this Popperian viewpoint to rationalise the shift in their policy.

Some advocates for the status quo may consider that this new editorial policy would amount to a drastic change in our publishing culture, one that researchers would find difficult to adjust to or may suspect would generate articles of low quality. However, one might argue that in many cases it would not be such a major revolution, and would only require a minor adjustment of how we set up the working hypotheses that structure reports on our work. Especially in disciplines like mine (soil science), in which we deal with very complex systems, many aspects of which we still do not understand, much of the research we carry out is exploratory in nature, driven by questions but not necessarily by definite hypotheses. In spite of that, when writing the ensuing manuscripts, we often feel we have to formulate a hypothesis a posteriori, to which the experiments are then supposed to lend support. Under these conditions, without any loss of rigour or scientific integrity, nothing would prevent us from setting up working hypotheses at the write-up stage in such a way that the experiments appear to produce a positive outcome, thereby circumventing any reluctance on the part of journal editors or reviewers toward publishing negative results. The soil science article of Wu et al. (2010) illustrates this approach. It is very likely that it would never have been accepted for publication if the working hypothesis had stated that a particular measurement technique (visible–near-infrared spectroscopy) can directly detect heavy metals in soils, and the results had turned out negative. Instead, the article posited, based on prior knowledge, that the spectroscopic method should theoretically not be expected to ‘see’ heavy metals in soils, and it ended up confirming this working hypothesis, an apparently positive message that was very well received by reviewers.

Whenever it is suggested that editors of scholarly journals should allow the publication of ‘negative’ or non-statistically-significant results, it is common for someone to ask what kind of incentive structure would be needed for that to finally happen, more than 16 years after Rothstein (2008) pointed out how crucial it was. Insofar as editors and researchers are concerned, it is not obvious that incentives would be absolutely required. As the situation currently stands, both groups would lose a lot, and their work would become more complicated, if the research syntheses being published in increasing numbers were eventually to turn out partly or entirely erroneous and misleading. That prospect alone, one would think, accompanied by the loss of funding that would likely result from the public's and decision makers' diminishing trust in research, should make it compelling for editors and researchers alike to try to sanitise the field and prevent the publication of problematic meta-analyses and systematic reviews. For the same reason, funding agencies should be interested in promoting the publication of all outcomes of the research they sponsor, regardless of the direction of the observed effects, and could exert a significant influence in that sense. Finally, the stakeholders who ultimately stand to benefit the most from the advancement of science (for example, the private sector taking advantage of research breakthroughs to develop new products, or governmental agencies basing some of their key decisions on scientific findings) should also be interested in making sure that scientific publications are as trustworthy as possible and, in particular, that research syntheses are based on complete sets of results instead of very biased subsets.

The previous section indicated what should be done to ensure, as much as possible, that meta-analyses published in the future meet appropriate quality standards. However, this still leaves open the question of how one should handle meta-analyses that were published in the past, did not meet one or more quality criteria, may be misleading, and therefore, one could argue, should not be cited. To a large extent, this issue extends beyond flawed meta-analyses and is pertinent to all manuscripts. Many articles that have been withdrawn by their authors, or retracted by the journals in which they were published, continue to be heavily cited years later, as if they were still legitimate.

In fields where quality surveys have shown some meta-analyses to be seriously flawed (e.g., Haidich 2010; Koricheva and Gurevitch 2014; Philibert, Loyce, and Makowski 2012; Beillouin et al. 2019; Krupnik et al. 2019; Fohrafellner et al. 2023), their number is small enough that reviewers could be instructed to flag them in manuscripts submitted for publication, and to recommend either that they not be cited or that citations be accompanied by cautionary statements. However, the number of meta-analyses in that category is infinitesimal compared to the many tens of thousands of meta-analyses in the literature as a whole. Editors cannot realistically ask already over-solicited reviewers to systematically assess every single meta-analysis cited in the manuscripts they are asked to evaluate, and to weed out inappropriate citations. Another approach is needed.

As suggested by Baveye (2024), one could imagine creating a kind of registry of articles that were either withdrawn or demonstrated in the peer-reviewed literature to be flawed in some way, and using this registry to inform reviewers. To ensure that such an approach would be useful, one needs to think carefully about how such a registry should be set up, what the most practical way would be for technical editors to access it, and what information resulting from it should be conveyed to reviewers. Perhaps the easiest way to answer these questions without getting overly technical is to proceed backward through them, that is, to describe first the type of information that would be useful to reviewers and the format in which it should be delivered, before thinking about how this information could be generated.

Whenever reviewers are asked to assess manuscripts containing citations to meta-analyses, they need to know whether these citations are appropriate, that is, whether the cited meta-analyses fare well on checklists like PRISMA 2020, AMSTAR 2, or similar ones. AMSTAR 2 can be particularly useful in this context, as it could serve as an additional filter to identify flawed studies that did not adhere to acceptable methodological standards, offering a systematic way to flag these works so that future citations of them can be handled with caution. For each cited meta-analysis, regardless of the checklist that is used, an overall score could be computed, following, for example, Fohrafellner et al. (2023). However, such a score may not be sufficiently instructive, since a high score on one criterion might compensate for a low score on another. At the other extreme, just listing the scores for each individual quality criterion in the checklists might overwhelm reviewers, even if a graphical summary is provided (Bougioukas et al. 2024; Karakasis et al. 2024). A happy medium might consist of singling out a few key criteria (like the weighting of primary data with the inverse of their variance), reported individually, and compounding the others, which may be viewed as less essential, as in the sketch below.
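
A minimal sketch of what such a summary could look like, assuming hypothetical criterion names and scores (they are illustrative, not items quoted from PRISMA 2020 or AMSTAR 2):

```python
# Hypothetical summary of checklist scores for one cited meta-analysis:
# key criteria are reported individually, the rest are compounded into a mean.
KEY_CRITERIA = {"inverse_variance_weighting", "publication_bias_assessed"}

def summarise_checklist(scores: dict) -> dict:
    """Report key criteria individually; average the less essential ones."""
    key = {c: s for c, s in scores.items() if c in KEY_CRITERIA}
    rest = [s for c, s in scores.items() if c not in KEY_CRITERIA]
    return {"key_criteria": key,
            "other_criteria_mean": sum(rest) / len(rest) if rest else None}

print(summarise_checklist({
    "inverse_variance_weighting": 0.0,   # a failed key criterion stays visible
    "publication_bias_assessed": 1.0,
    "protocol_registered": 1.0,
    "search_strategy_reported": 0.5,
}))
```

A failed key criterion thus remains visible instead of being averaged away, which addresses the compensation concern raised above.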

Who would generate the assessments of cited meta-analyses provided to reviewers? The most straightforward approach would be for technical editors to use dedicated software that would scan the files of submitted manuscripts, use artificial-intelligence tools—as Scite.AI (https://scite.ai) does—to determine whether they cite previously published meta-analyses in more than a casual way (i.e., not merely as part of the background publications on a given topic, but specifically in terms of their conclusions), assess the adequacy of the cited meta-analyses criterion by criterion using checklists like PRISMA 2020 or AMSTAR 2, and issue a report along the lines described in the previous paragraph to inform reviewers. This dedicated software could be upgraded periodically as new tools to deal with the problem of publication bias emerge in the future.
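
As a rough illustration of how such a screening step could be structured, consider the skeleton below; every helper function in it is a hypothetical stub standing in for components (a citation extractor, a Scite.AI-style context classifier, a checklist-based assessor) that would need to be developed.

```python
# Illustrative skeleton of the screening pipeline; every helper below is a
# hypothetical stub, not a call to any real citation-analysis library.
def extract_citations(text: str) -> list:
    # Stub: a real version would parse in-text citations and the reference list.
    return [ref.strip() for ref in text.split(";") if ref.strip()]

def cites_conclusions_of_meta_analysis(ref: str) -> bool:
    # Stub: a real version would classify whether the citing manuscript
    # actually leans on the meta-analysis' conclusions.
    return "meta-analysis" in ref.lower()

def assess_against_checklist(ref: str) -> dict:
    # Stub: a real version would score the cited article criterion by
    # criterion against PRISMA 2020 or AMSTAR 2.
    return {"inverse_variance_weighting": 1.0, "publication_bias_assessed": 0.5}

def screen_manuscript(text: str) -> list:
    """Build one per-citation report to forward to reviewers."""
    return [{"citation": ref, "scores": assess_against_checklist(ref)}
            for ref in extract_citations(text)
            if cites_conclusions_of_meta_analysis(ref)]

print(screen_manuscript("Smith 2020 meta-analysis of cover crops; Jones 2018 field trial"))
```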

A possible objection to technical editors being asked to subject all manuscripts to this scrutiny is that it would increase their workload. However, one could argue that the process would not be very different from what they already do routinely to check for plagiarism. In that case as well, they use dedicated software, which scans the internet very rapidly to determine whether sections of text, figures, or pictures in the manuscripts are copied from other documents. Running the software itself does not take much time, whereas interpreting the results may in some cases require some attention. The use of ‘meta-analysis checking software’ would be very similar in a number of ways, although it might actually be faster and easier, for two reasons. One is that this software might be sped up considerably by systematically storing information about assessed meta-analyses in a database that the program would consult before proceeding to any new assessment. That way, the database would become more extensive the more it is used, and the time needed to evaluate the citations in a given manuscript would become progressively shorter. The other reason why the adoption of such software would constitute a lesser burden for technical editors than plagiarism detection is that reports on meta-analyses could simply be forwarded to reviewers, who would then evaluate whether there is a problem worth pointing out to the authors.
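
A minimal sketch of that look-up-before-reassessing logic, assuming a simple SQLite table keyed by DOI (the schema and the assess placeholder are illustrative assumptions, not a published design):

```python
# Cache of past assessments: each DOI is assessed once, then reused.
import json
import sqlite3

conn = sqlite3.connect("meta_assessments.db")
conn.execute("CREATE TABLE IF NOT EXISTS assessments (doi TEXT PRIMARY KEY, scores TEXT)")

def get_or_assess(doi: str, assess) -> dict:
    row = conn.execute("SELECT scores FROM assessments WHERE doi = ?", (doi,)).fetchone()
    if row:                               # already assessed: reuse the stored result
        return json.loads(row[0])
    scores = assess(doi)                  # first encounter: run the (slow) assessment
    conn.execute("INSERT INTO assessments VALUES (?, ?)", (doi, json.dumps(scores)))
    conn.commit()
    return scores

# Usage with a hypothetical assessor:
print(get_or_assess("10.1000/example.doi", lambda d: {"overall": 0.7}))
```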

To develop the computer program needed to check published meta-analyses, a previous effort initiated a decade ago can serve as a useful blueprint. Nuijten (2016) came up with a program, ‘Statcheck’, to detect statistical errors in peer-reviewed psychology articles by searching articles for statistical results, redoing the calculations described in each article, and comparing the two values to see if they match. Using Statcheck, Nuijten et al. (2016) showed that half of the 30,000 psychology articles they re-analysed ‘contained at least one p-value that was inconsistent with its test’. These observations caused quite a stir in the field (Baker 2016; Wren 2018; Ritchie 2020). To date, more than 100,000 articles have been checked with Statcheck. Hartgerink (2016) alone ‘content mined’ more than 160,000 articles and eventually ran Statcheck on more than 50,000 of them. At this juncture, a growing number of journals are using Statcheck as part of their peer-review process. The content-mining tools used by Statcheck, and other similar ones developed in the past decade based on artificial intelligence, could relatively easily be used for the development of a computer program designed to systematically check past meta-analyses in the literature. Likewise, one could find guidance in the efforts of Shi and Lin (2019), who retrieved primary data from 29,932 meta-analyses in order to find out to what extent the so-called ‘trim and fill’ method could be effective in adjusting for publication bias.
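
The core of Statcheck's consistency check is simple enough to sketch: recompute the p-value implied by a reported test statistic and compare it with the reported one. Below is a hedged re-implementation for t tests, with invented reported values.

```python
# Recompute the p-value implied by a reported t statistic (in the spirit of
# Statcheck, for results such as 't(28) = 2.20, p = .04') and flag mismatches.
from scipy import stats

def t_test_consistent(t_value: float, df: int, reported_p: float,
                      tol: float = 0.005) -> bool:
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)   # two-sided p from t and df
    return abs(recomputed_p - reported_p) <= tol

print(t_test_consistent(2.20, 28, 0.04))   # True: recomputed p is about .036
print(t_test_consistent(1.00, 28, 0.01))   # False: recomputed p is about .326
```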

With tens of thousands of first- and second-order meta-analyses published each year, and serious concerns having been raised in the past about the quality of some of them, the research enterprise runs the risk of having to deal in the future with many published research-synthesis articles containing erroneous and/or misleading conclusions. This would likely have a devastating effect on the public's confidence in science and limit science's ability to positively inform policy-making in many areas (e.g., Ritchie 2020; Baveye 2024).

The development, and adoption by many journals, of comprehensive reporting and implementation guidelines for meta-analyses and systematic reviews (e.g., PRISMA 2020 and AMSTAR 2) definitely constitutes a positive step in this context, and may in the future effectively discourage authors from submitting meta-analyses that do not meet acceptable standards. Unfortunately, the use of these guidelines still leaves two basic questions unanswered. First, both PRISMA 2020 and AMSTAR 2 implicitly assume that it is possible at this point to diagnose accurately the absence of publication bias in a given field, which is a precondition for meaningfully carrying out any kind of research synthesis. Further research is definitely needed in this area. In the meantime, the only reasonable solution in this regard would be for scholarly journals to start routinely publishing ‘negative’ or non-statistically-significant results. This might seem like a drastic departure from a long tradition, but for various reasons it does not have to be, if one is willing to set up working hypotheses slightly differently when writing up manuscripts, or if one considers, following Popper, that the ‘falsification’ of theories is crucial to the advancement of science. Second, the current use of checklists like PRISMA 2020 or AMSTAR 2, while it will undoubtedly make the conclusions of future meta-analyses more reliable, raises the question of the appropriate way to deal with the many thousands of meta-analyses and systematic reviews that were published at a time when the use of these checklists was not required, and that continue to be cited extensively. A solution in this respect might be the development of a dedicated computer program, which technical editors of scholarly journals could use systematically to provide reviewers with information on the soundness of the meta-analyses cited in the manuscripts they are asked to evaluate. Hopefully, with both of these lines of action, we could reach a point, a couple of years from now, when the risks associated with problematic meta-analyses would be adequately minimised or even entirely eliminated.

Philippe C. Baveye is the sole author of the article, came up with the opinion that is presented, wrote the text, and handled its revision after reviews.

The author declares that the research was conducted in the absence of any commercial or financial relationships of any kind that could be construed as potential conflicts of interest, in the broader sense of the concept described in Baveye (2023).

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文