Developing partnerships for academic data science consulting and collaboration units
Marianne Huebner, Laura Bond, Felesia Stukes, Joel Herndon, David J. Edwards, Gina-Maria Pomann
Stat, DOI: 10.1002/sta4.644 (published 2024-01-11)

Data science consulting and collaboration units (DSUs) are core infrastructure for research at universities. Activities span data management, study design, data analysis, data visualization, predictive modelling, report preparation, manuscript writing and advising on statistical methods, and may include an experiential or teaching component. Partnerships are needed for a thriving DSU as an active part of the larger university network. Guidance for identifying, developing and managing successful partnerships for DSUs can be summarized in six rules: (1) align with institutional strategic plans, (2) cultivate partnerships that fit your mission, (3) ensure sustainability and prepare for growth, (4) define clear expectations in a partnership agreement, (5) communicate and (6) expect the unexpected. While these rules are not exhaustive, they are derived from experiences in a diverse set of DSUs, which vary by administrative home, mission, staffing and funding model. As the examples in this paper illustrate, the rules can be adapted to different organizational models for DSUs. Clear expectations in partnership agreements are essential for high-quality, consistent collaborations and should address core activities, duration, staffing, cost and evaluation. A DSU is an organizational asset that warrants thoughtful investment if the institution is to gain real value.
Equivalence testing for multiple groups
Tony Pourmohamad, Herbert K. H. Lee
Stat, DOI: 10.1002/sta4.645 (published 2024-01-10)

Testing for equivalence, rather than for a difference, is an important component of some scientific studies. While the existing literature focuses on comparing two groups for equivalence, real-world applications regularly require testing across more than two groups. This paper reviews the existing approaches for testing across multiple groups and proposes a novel framework for multigroup equivalence testing under a Bayesian paradigm. This approach allows a more scientifically meaningful definition of the equivalence margin and a more powerful test than the few existing alternatives, and it also permits a new definition of equivalence based on future differences.
Iterative estimating equations for disease mapping with spatial zero-inflated Poisson data
Pei-Sheng Lin, Jun Zhu, Feng-Chang Lin
Stat, DOI: 10.1002/sta4.646 (published 2024-01-01)

Spatial epidemiology often involves the analysis of spatial count data with an unusually high proportion of zero observations. While Bayesian hierarchical models perform very well for zero-inflated data in many situations, a smooth response surface is usually required for the Bayesian methods to converge. For infectious disease data with excessive zeros, however, a wombling issue with large spatial variation can make the Bayesian methods infeasible. To address this issue, we develop estimating equations for disease mapping by including over-dispersion and spatial noise in a spatial zero-inflated Poisson model. Asymptotic properties are derived for the parameter estimates. Simulations and a data analysis are used to assess and illustrate the proposed method.
Significance of modes in the torus by topological data analysis
Changjo Yu, Sungkyu Jung, Jisu Kim
Stat, DOI: 10.1002/sta4.636 (published 2023-12-17)

This paper addresses the problem of identifying modes, or density bumps, in multivariate angular or circular data, which arise in fields such as medicine, biology and physics. We focus on topological data analysis and persistent homology for this task. Specifically, we extend methods for uncertainty quantification to the torus sample space in which circular data lie. To this end, we employ two density estimators, the von Mises kernel density estimator and the von Mises mixture model, to compute persistent homology, and we propose a scale-space view for finding significant bumps in the density. The results of bump hunting are summarised and visualised through a scale-space diagram. Using the mixture model for persistent homology offers advantages over conventional methods, allowing dendrogram visualisation of components and identification of mode locations. To test whether a detected mode is genuinely present, we propose several inference tools based on bootstrap resampling and concentration inequalities and establish their theoretical applicability. Experimental results on SARS-CoV-2 spike glycoprotein torsion angle data demonstrate the effectiveness of the proposed methods in practice.
An asymptotically efficient closed-form estimator for the Dirichlet distribution
Jae Ho Chang, Sang Kyu Lee, Hyoung-Moon Kim
Stat, DOI: 10.1002/sta4.640 (published 2023-12-13)

The maximum likelihood estimator (MLE) of the Dirichlet distribution is usually obtained with the Newton–Raphson algorithm. In some settings, however, the computational cost can be burdensome, for example, in real-time processes. It is therefore useful to have a closed-form estimator that is as efficient as the MLE in large samples. Here, we suggest an asymptotically efficient closed-form estimator based on classical large-sample theory.
Robust nonparametric estimation of average treatment effects: A propensity score-based varying coefficient approach
Zhaoqing Tian, Peng Wu, Zixin Yang, Dingjiao Cai, Qirui Hu
Stat, DOI: 10.1002/sta4.637 (published 2023-12-12)

We present a novel nonparametric approach for estimating average treatment effects (ATEs), addressing a fundamental challenge in causal inference in both theory and empirical studies. Our method offers an effective solution to the instability caused by propensity scores close to zero or one, which are commonly encountered in (augmented) inverse probability weighting approaches. Notably, our method is straightforward to implement and does not depend on an outcome model specification. We introduce an estimator of the ATE and establish its consistency and asymptotic normality through rigorous analysis. To demonstrate the robustness of our method against extreme propensity scores, we conduct an extensive simulation study. We also apply the proposed methods to estimate the impact of disengagement from social activity on cognitive ability using a nationally representative cohort study, and we extend the method to estimate the ATE on the treated population.
Observation-driven exponential smoothing
Dimitris Karlis, Xanthi Pedeli, Cristiano Varin
Stat, DOI: 10.1002/sta4.642 (published 2023-12-07)

This article presents an approach to forecasting count time series with a form of exponential smoothing built from observation-driven models. The proposed method is easy to implement and simple to interpret. A variant of the approach is also proposed to handle the impact of outliers on the forecast. The performance of the methodology is studied with simulations and illustrated with an analysis of the number of monthly cases of dengue fever observed in Italy in the years 2008–2021. An R package is made available to enable the reader to reproduce the results discussed in the article.
A comparative analysis of contractual risks in statistical consulting
David Shilane, Nicole L. Lorenzetti, David K. Kruetter
Stat, DOI: 10.1002/sta4.639 (published 2023-12-07)

This study enumerates and compares the risks and rewards of different forms of statistical consulting contracts. We assess three contract models (project-based fees, hourly fees, and retainer agreements) and three planned durations (project-based, time-based, and evergreen contracts). The requirements of time and effort vary considerably across many aspects of consulting work. The risks of statistical consulting contracts include both the general risks of consulting projects and the specialized risks of statistical investigations. We enumerate general risks in the categories of unanticipated developments, revisions and collaboration, and changing project scopes; the specialized statistical risks involve study design, data quality, statistical investigation, and communication of statistical issues. These specialized risks add considerably to the general risks of consulting projects, and they can be exacerbated or mitigated by the form of the consulting agreement. With a greater understanding of the risks and benefits of each type of contract, statistical consultants and clients can negotiate agreements that better serve both parties. Through this discussion, we hope to raise awareness of these issues and help create working conditions that make successful projects more likely for both statistical consultants and their clients.
Estimation of the ROC curve and the area under it with complex survey data
Amaia Iparragirre, Irantzu Barrio, Inmaculada Arostegui
Stat, DOI: 10.1002/sta4.635 (published 2023-12-04)

Logistic regression models are widely applied in daily practice. Hence, it is necessary to ensure that they have adequate predictive performance, which is usually assessed by means of the receiver operating characteristic (ROC) curve and the area under it (AUC). Traditional estimators of these parameters are designed for simple random samples and are not appropriate for complex survey data. The goal of this work is to propose new weighted estimators of the ROC curve and AUC based on sampling weights, which, in the context of complex survey data, indicate the number of population units that each sampled observation represents. The behaviour of the proposed estimators is evaluated and compared with the traditional unweighted ones by means of a simulation study. Finally, the weighted and unweighted ROC curve and AUC estimators are applied to real survey data in order to compare the estimates in a real scenario. The results support the use of the weighted estimators proposed in this work to obtain unbiased estimates of the ROC curve and AUC for logistic regression models fitted to complex survey data.
Although the era of digitalization has enabled access to large quantities of data, insufficient structuring means that values are often missing, sometimes for a substantial share of the sample. Most statistical methodology, on the other hand, is designed for complete data. Here, we explore the asymptotic properties of non-degenerate U-statistics when the data are missing completely at random and a complete-case approach is utilized. The obtained results are applied to the estimator of Kendall's