{"title":"关于 \"类固醇反应性脑膜炎-动脉炎犬只接受 6 周或 6 个月泼尼松龙方案治疗后复发率比较的前瞻性随机试验 \"的信件","authors":"Andrew Woodward","doi":"10.1111/jvim.17188","DOIUrl":null,"url":null,"abstract":"<p>I read with interest the article “Prospective randomized trial comparing relapse rates in dogs with steroid-responsive meningitis-arteritis treated with a 6-week or 6-month prednisolone protocol.”<span><sup>1</sup></span> I am concerned that the article contains substantial misinterpretations of statistical evidence, which undermine the reliability of the authors' conclusions.</p><p>Unfortunately, the phrase “no significant difference” may be deeply misleading unless it is correctly interpreted, under the unintuitive logic of frequentist hypothesis tests, and has led the authors to a mistake; it may be true that the interventions are practically exchangeable, but that conclusion cannot be reached from “not significant.” The <i>P</i>-value resulting from frequentist hypothesis tests represents an approximation of the probability that data more extreme than the data at hand would be observed, if the test hypothesis was true; where “probability” represents the frequency in a hypothetical series of identical repeated trials, and the test hypothesis is some statistical model including its parameters.<span><sup>2</sup></span></p><p>The <i>P</i>-value is calculated under the assumption that the test hypothesis (whatever it is) is true, so never indicates support for the test hypothesis, which would involve circular reasoning. Though the “evidential” meaning of <i>P</i>-values is contested and generally dubious,<span><sup>3</sup></span> in simple terms the <i>P</i>-value can be considered a summary of the evidence provided by the data to <i>refute</i> the test hypothesis,<span><sup>4</sup></span> or equivalently, an expression of how surprising it would be to observe data at least as extreme as these, if the test hypothesis was in fact true.<span><sup>2</sup></span> Though a small <i>P</i>-value may suggest (charitably) that some aspect of the test hypothesis is untrue, the usage advocated by Fisher, a large <i>P</i>-value does not support that it is true, because it says nothing about other test hypotheses (in this case, difference between interventions) with which the data may be compatible. It is, therefore, incorrect to conclude anything substantive from a large <i>P</i>-value. Unfortunately, the incorrect interpretation that a large <i>P</i>-value indicates that an effect or association is absent appears common in clinical trials reporting.<span><sup>5</sup></span></p><p>An emphasis on confidence intervals may mitigate some of the limitations of reasoning based on hypothesis tests, even if their exact meaning is unintuitive. This is a popular view,<span><sup>6</sup></span> and is generally encouraged by relevant reporting guidelines. Considering the authors' estimate of the relative incidence risk (which they express as odds ratio) of at least one relapse, which they state as 1.40 (95% CI: 0.40, 4.96, <i>P</i> = 0.60), the confidence interval represents, in simple terms, the set of values of the parameter of interest (test hypotheses) with which the data are reasonably consistent; in some sense, those values of the parameter that the data at hand cannot rule out. 
Although I profess no expertise with the assessment of intervention effect sizes in neurology, at face value the interval is fairly wide; the data are consistent with benefit of the 6-weekly intervention of about 2.5 reduction (1/0.40) in the odds of relapse, or detriment of about 5 times increase in the odds of relapse. The extent of evidence this study provided in support of the 6-weekly intervention depends strongly on a contextual interpretation of all the effects within the nominated uncertainty interval, which was not provided by the authors.</p><p>The conflation of “not statistically significant” with “not different” is an apparently common, but fundamental, error. The apparent effect of this and related misinterpretations of frequentist statistics have led to recent calls for major reform.<span><sup>8</sup></span> For the majority of situations, I contend that veterinary clinical researchers would be wise to set aside “significance,” <i>P</i>-value thresholds, and related concepts altogether, and focus on estimates and their uncertainty.</p>","PeriodicalId":49958,"journal":{"name":"Journal of Veterinary Internal Medicine","volume":"38 5","pages":"2412-2413"},"PeriodicalIF":2.1000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jvim.17188","citationCount":"0","resultStr":"{\"title\":\"Letter regarding “Prospective randomized trial comparing relapse rates in dogs with steroid-responsive meningitis-arteritis treated with a 6-week or 6-month prednisolone protocol”\",\"authors\":\"Andrew Woodward\",\"doi\":\"10.1111/jvim.17188\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>I read with interest the article “Prospective randomized trial comparing relapse rates in dogs with steroid-responsive meningitis-arteritis treated with a 6-week or 6-month prednisolone protocol.”<span><sup>1</sup></span> I am concerned that the article contains substantial misinterpretations of statistical evidence, which undermine the reliability of the authors' conclusions.</p><p>Unfortunately, the phrase “no significant difference” may be deeply misleading unless it is correctly interpreted, under the unintuitive logic of frequentist hypothesis tests, and has led the authors to a mistake; it may be true that the interventions are practically exchangeable, but that conclusion cannot be reached from “not significant.” The <i>P</i>-value resulting from frequentist hypothesis tests represents an approximation of the probability that data more extreme than the data at hand would be observed, if the test hypothesis was true; where “probability” represents the frequency in a hypothetical series of identical repeated trials, and the test hypothesis is some statistical model including its parameters.<span><sup>2</sup></span></p><p>The <i>P</i>-value is calculated under the assumption that the test hypothesis (whatever it is) is true, so never indicates support for the test hypothesis, which would involve circular reasoning. 
Though the “evidential” meaning of <i>P</i>-values is contested and generally dubious,<span><sup>3</sup></span> in simple terms the <i>P</i>-value can be considered a summary of the evidence provided by the data to <i>refute</i> the test hypothesis,<span><sup>4</sup></span> or equivalently, an expression of how surprising it would be to observe data at least as extreme as these, if the test hypothesis was in fact true.<span><sup>2</sup></span> Though a small <i>P</i>-value may suggest (charitably) that some aspect of the test hypothesis is untrue, the usage advocated by Fisher, a large <i>P</i>-value does not support that it is true, because it says nothing about other test hypotheses (in this case, difference between interventions) with which the data may be compatible. It is, therefore, incorrect to conclude anything substantive from a large <i>P</i>-value. Unfortunately, the incorrect interpretation that a large <i>P</i>-value indicates that an effect or association is absent appears common in clinical trials reporting.<span><sup>5</sup></span></p><p>An emphasis on confidence intervals may mitigate some of the limitations of reasoning based on hypothesis tests, even if their exact meaning is unintuitive. This is a popular view,<span><sup>6</sup></span> and is generally encouraged by relevant reporting guidelines. Considering the authors' estimate of the relative incidence risk (which they express as odds ratio) of at least one relapse, which they state as 1.40 (95% CI: 0.40, 4.96, <i>P</i> = 0.60), the confidence interval represents, in simple terms, the set of values of the parameter of interest (test hypotheses) with which the data are reasonably consistent; in some sense, those values of the parameter that the data at hand cannot rule out. Although I profess no expertise with the assessment of intervention effect sizes in neurology, at face value the interval is fairly wide; the data are consistent with benefit of the 6-weekly intervention of about 2.5 reduction (1/0.40) in the odds of relapse, or detriment of about 5 times increase in the odds of relapse. The extent of evidence this study provided in support of the 6-weekly intervention depends strongly on a contextual interpretation of all the effects within the nominated uncertainty interval, which was not provided by the authors.</p><p>The conflation of “not statistically significant” with “not different” is an apparently common, but fundamental, error. 
The apparent effect of this and related misinterpretations of frequentist statistics have led to recent calls for major reform.<span><sup>8</sup></span> For the majority of situations, I contend that veterinary clinical researchers would be wise to set aside “significance,” <i>P</i>-value thresholds, and related concepts altogether, and focus on estimates and their uncertainty.</p>\",\"PeriodicalId\":49958,\"journal\":{\"name\":\"Journal of Veterinary Internal Medicine\",\"volume\":\"38 5\",\"pages\":\"2412-2413\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jvim.17188\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Veterinary Internal Medicine\",\"FirstCategoryId\":\"97\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/jvim.17188\",\"RegionNum\":2,\"RegionCategory\":\"农林科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"VETERINARY SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Veterinary Internal Medicine","FirstCategoryId":"97","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jvim.17188","RegionNum":2,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"VETERINARY SCIENCES","Score":null,"Total":0}
Citations: 0
Abstract
I read with interest the article “Prospective randomized trial comparing relapse rates in dogs with steroid-responsive meningitis-arteritis treated with a 6-week or 6-month prednisolone protocol.”1 I am concerned that the article contains substantial misinterpretations of statistical evidence, which undermine the reliability of the authors' conclusions.
Unfortunately, the phrase “no significant difference” can be deeply misleading unless it is interpreted correctly, under the unintuitive logic of frequentist hypothesis tests, and here it has led the authors into error; it may be true that the interventions are practically exchangeable, but that conclusion cannot be reached from “not significant.” The P-value resulting from a frequentist hypothesis test approximates the probability of observing data at least as extreme as the data at hand, if the test hypothesis were true; here, “probability” refers to the frequency in a hypothetical series of identical repeated trials, and the test hypothesis is some statistical model, including its parameters.2
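To make this repeated-sampling definition concrete, the short simulation below approximates a P-value by brute force: it generates many hypothetical replicate trials under a test hypothesis of identical relapse probabilities in both arms and counts how often a difference at least as extreme as the observed one arises. The group sizes, relapse probability, and observed counts are invented for illustration only and are not taken from the trial; this is a minimal sketch of the concept, not a reanalysis.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical, invented numbers (NOT the trial's data): 20 dogs per arm,
# observed relapses 7 vs 4, and a test hypothesis that both arms share
# the same relapse probability (the pooled observed proportion).
n_per_arm = 20
observed = abs(7 / n_per_arm - 4 / n_per_arm)   # observed difference in relapse proportions
p_null = (7 + 4) / (2 * n_per_arm)              # pooled proportion under the test hypothesis

# A "hypothetical series of identical repeated trials" under the test hypothesis.
n_replicates = 100_000
relapses_a = rng.binomial(n_per_arm, p_null, n_replicates)
relapses_b = rng.binomial(n_per_arm, p_null, n_replicates)
diffs = np.abs(relapses_a - relapses_b) / n_per_arm

# The P-value is (approximately) the frequency of replicate trials giving a
# difference at least as extreme as the one observed, assuming the test
# hypothesis is true. It says nothing about whether that hypothesis IS true.
p_value = np.mean(diffs >= observed)
print(f"Monte Carlo P-value under the test hypothesis: {p_value:.2f}")
```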
The P-value is calculated under the assumption that the test hypothesis (whatever it is) is true, so it never indicates support for the test hypothesis; that would involve circular reasoning. Though the “evidential” meaning of P-values is contested and generally dubious,3 in simple terms the P-value can be considered a summary of the evidence the data provide to refute the test hypothesis,4 or, equivalently, an expression of how surprising it would be to observe data at least as extreme as these if the test hypothesis were in fact true.2 Though a small P-value may suggest (charitably) that some aspect of the test hypothesis is untrue, which is the usage advocated by Fisher, a large P-value does not indicate that the hypothesis is true, because it says nothing about the other test hypotheses (in this case, differences between the interventions) with which the data may also be compatible. It is therefore incorrect to conclude anything substantive from a large P-value. Unfortunately, the incorrect interpretation that a large P-value indicates that an effect or association is absent appears to be common in clinical trial reporting.5
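This asymmetry can be demonstrated directly. In the hypothetical simulation below, one protocol truly increases the odds of relapse (a true odds ratio of about 2.5), yet with arms of 20 dogs a conventional test very often returns P > 0.05. The sample sizes and relapse probabilities are assumptions chosen only to illustrate that a large P-value is compatible with a clinically important effect; they do not describe the trial.

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(7)

# Assumed relapse probabilities (illustrative only): 0.20 in one arm and
# 0.385 in the other, i.e. a true odds ratio of about 2.5.
p_a, p_b = 0.20, 0.385
n_per_arm = 20
n_trials = 5_000

large_p = 0
for _ in range(n_trials):
    relapse_a = rng.binomial(n_per_arm, p_a)
    relapse_b = rng.binomial(n_per_arm, p_b)
    table = [[relapse_a, n_per_arm - relapse_a],
             [relapse_b, n_per_arm - relapse_b]]
    _, p = fisher_exact(table)
    large_p += p > 0.05

# Even though the true effect is substantial, most simulated trials of this
# size are "not significant"; a large P-value here clearly does not mean
# the interventions are exchangeable.
print(f"Proportion of trials with P > 0.05 despite a true OR of ~2.5: {large_p / n_trials:.2f}")
```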
An emphasis on confidence intervals may mitigate some of the limitations of reasoning based on hypothesis tests, even if their exact meaning is unintuitive. This is a popular view,6 and is generally encouraged by the relevant reporting guidelines. Consider the authors' estimate of the relative incidence risk (which they express as an odds ratio) of at least one relapse, which they state as 1.40 (95% CI: 0.40, 4.96, P = 0.60). The confidence interval represents, in simple terms, the set of values of the parameter of interest (test hypotheses) with which the data are reasonably consistent; in some sense, those values of the parameter that the data at hand cannot rule out. Although I profess no expertise in the assessment of intervention effect sizes in neurology, at face value the interval is fairly wide; the data are consistent with a benefit of the 6-week intervention of about a 2.5-fold reduction (1/0.40) in the odds of relapse, or a detriment of about a 5-fold increase in the odds of relapse. The extent of the evidence this study provides in support of the 6-week intervention therefore depends strongly on a contextual interpretation of all the effects within the nominated uncertainty interval, which the authors did not provide.
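As an illustration of why an interval this wide arises, the sketch below computes a Wald confidence interval for an odds ratio from a small two-by-two table. The counts are hypothetical, chosen only because they yield an odds ratio and interval of roughly the reported magnitude (1.40; 0.40 to 4.96); they are not the trial's data. The final lines simply restate the reported bounds as fold-changes in the odds of relapse.

```python
import math

def wald_ci_odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and 95% Wald CI for a 2x2 table [[a, b], [c, d]]."""
    or_hat = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_hat) - z * se_log_or)
    hi = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lo, hi

# Hypothetical counts (relapse / no relapse per arm), NOT taken from the article;
# they merely show that arms of about 20 dogs produce intervals of this width.
or_hat, lo, hi = wald_ci_odds_ratio(a=7, b=15, c=6, d=17)
print(f"OR = {or_hat:.2f}, 95% CI {lo:.2f} to {hi:.2f}")

# Restating the reported interval (0.40 to 4.96) as fold-changes in odds:
print(f"Lower bound 0.40 -> about a {1 / 0.40:.1f}-fold reduction in the odds of relapse")
print(f"Upper bound 4.96 -> about a {4.96:.1f}-fold increase in the odds of relapse")
```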
The conflation of “not statistically significant” with “not different” is an apparently common, but fundamental, error. The apparent effect of this and related misinterpretations of frequentist statistics has led to recent calls for major reform.8 For the majority of situations, I contend that veterinary clinical researchers would be wise to set aside “significance,” P-value thresholds, and related concepts altogether, and to focus on estimates and their uncertainty.
Journal Introduction:
The mission of the Journal of Veterinary Internal Medicine is to advance veterinary medical knowledge and improve the lives of animals by publication of authoritative scientific articles of animal diseases.