Journal of the Royal Statistical Society Series C-Applied Statistics: latest articles

Non-parametric calibration of multiple related radiocarbon determinations and their calendar age summarisation
IF 1.6 | Zone 4 (Mathematics) | Q2 Mathematics | Pub Date: 2022-10-17 | DOI: 10.1111/rssc.12599
Timothy J. Heaton

Due to fluctuations in past radiocarbon ($^{14}$C) levels, calibration is required to convert $^{14}$C determinations $X_i$ into calendar ages $\theta_i$. In many studies, we wish to calibrate a set of related samples taken from the same site or context, which have calendar ages drawn from the same shared, but unknown, density $f(\theta)$. Calibration of $X_1, \dots, X_n$ can be improved significantly by incorporating the knowledge that the samples are related. Furthermore, summary estimates of the underlying shared $f(\theta)$ can provide valuable information on changes in population size/activity over time. Most current approaches require a parametric specification for $f(\theta)$, which is often not appropriate. We develop a rigorous non-parametric Bayesian approach using a Dirichlet process mixture model, with slice sampling to address the multimodality typical within $^{14}$C calibration. Our approach simultaneously calibrates the set of $^{14}$C determinations and provides a predictive estimate for the underlying calendar age of a future sample. In a simulation study, we show the improvement in calendar age estimation when jointly calibrating related samples using our approach, compared with calibrating each $^{14}$C determination independently. We also illustrate the use of the predictive calendar age estimate to provide insight into activity levels over time using three real-life case studies.
Citations: 1
Optimal approximate choice designs for a two-step coffee choice, taste and choice again experiment
IF 1.6 | Zone 4 (Mathematics) | Q2 Mathematics | Pub Date: 2022-10-03 | DOI: 10.1111/rssc.12601
Nedka Dechkova Nikiforova, Rossella Berni, Jesús Fernando López-Fidalgo

This work deals with consumers' preferences about coffee. Firstly, a choice experiment is performed on a sample of potential consumers. Following this, a sensory test involving the tasting of two varieties of coffee is carried out with the respondents, after which the same choice experiment is supplied to them again. An innovative approach for building heterogeneous choice designs is specifically developed for the case-study, based on approximate design theory and compound design criterion. Panel Mixed Logit models are used, thereby allowing for the inclusion of correlation among consumers' responses; choice-sets are supplied to a proportion of respondents according to optimal weights. The estimation results of the Panel Mixed Logit model are satisfactory, confirming the validity of the proposed approach.

Citations: 0
Flexible domain prediction using mixed effects random forests
IF 1.6 | Zone 4 (Mathematics) | Q2 Mathematics | Pub Date: 2022-10-02 | DOI: 10.1111/rssc.12600
Patrick Krennmair, Timo Schmid

This paper promotes the use of random forests as versatile tools for estimating spatially disaggregated indicators in the presence of small area-specific sample sizes. Small area estimators are predominantly conceptualised within the regression-setting and rely on linear mixed models to account for the hierarchical structure of the survey data. In contrast, machine learning methods offer non-linear and non-parametric alternatives, combining excellent predictive performance and a reduced risk of model-misspecification. Mixed effects random forests combine advantages of regression forests with the ability to model hierarchical dependencies. This paper provides a coherent framework based on mixed effects random forests for estimating small area averages and proposes a non-parametric bootstrap estimator for assessing the uncertainty of the estimates. We illustrate advantages of our proposed methodology using Mexican income-data from the state Nuevo León. Finally, the methodology is evaluated in model-based and design-based simulations comparing the proposed methodology to traditional regression-based approaches for estimating small area averages.

Citations: 6
A Bayesian model for estimating Sustainable Development Goal indicator 4.1.2: School completion rates
IF 1.6 | Zone 4 (Mathematics) | Q2 Mathematics | Pub Date: 2022-09-25 | DOI: 10.1111/rssc.12595
Ameer Dharamshi, Bilal Barakat, Leontine Alkema, Manos Antoninis

Estimating school completion is crucial for monitoring Sustainable Development Goal (SDG) 4 on education. The recently introduced SDG indicator 4.1.2, defined as the percentage of children aged 3–5 years above the expected completion age of a given level of education that have completed the respective level, differs from enrolment indicators in that it relies primarily on household surveys. This introduces a number of challenges including gaps between survey waves, conflicting estimates, age misreporting and delayed completion. We introduce the Adjusted Bayesian Completion Rates (ABCR) model to address these challenges and produce the first complete and consistent time series for SDG indicator 4.1.2, by school level and sex, for 164 countries. Validation exercises indicate that the model appears well-calibrated and offers a meaningful improvement over simpler approaches in predictive performance. The ABCR model is now used by the United Nations to monitor completion rates for all countries with available survey data.

Citations: 3
Efficient estimation of the marginal mean of recurrent events
IF 1.6 | Zone 4 (Mathematics) | Q2 Mathematics | Pub Date: 2022-09-21 | DOI: 10.1111/rssc.12586
Giuliana Cortese, Thomas H. Scheike

Recurrent events are often encountered in clinical and epidemiological studies where a terminal event is also observed. With recurrent events data it is of great interest to estimate the marginal mean of the cumulative number of recurrent events experienced prior to the terminal event. The standard nonparametric estimator was suggested in Cook and Lawless and further developed in Ghosh and Lin. We here investigate the efficiency of this estimator that, surprisingly, has not been studied before. We rewrite the standard estimator as an inverse probability of censoring weighted estimator. From this representation we derive an efficient augmented estimator using efficient estimation theory for right-censored data. We show that the standard estimator is efficient in settings with no heterogeneity. In other settings with different sources of heterogeneity, we show theoretically and by simulations that the efficiency can be greatly improved when an efficient augmented estimator based on dynamic predictions is employed, at no extra cost to robustness. The estimators are applied and compared to study the mean number of catheter-related bloodstream infections in heterogeneous patients with chronic intestinal failure who can possibly die, and the efficiency gain is highlighted in the resulting point-wise confidence intervals.
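
As a rough illustration of the standard nonparametric estimator mentioned in this abstract, the sketch below uses the form in which it is commonly presented: the sum, over event times u up to t, of S(u-) * dN(u) / Y(u), where S is the Kaplan-Meier survival estimate for the terminal event, dN(u) counts recurrent events at u, and Y(u) is the number of subjects still under observation. The data layout (a tuple of end time, death flag and event times per subject) and the function name `marginal_mean` are assumptions made for this example, not taken from the paper.

```python
def marginal_mean(subjects, t):
    """Estimate the mean number of recurrent events up to time t.

    subjects: list of (end_time, died, event_times); event times are assumed
    to be no later than the subject's end_time (hypothetical layout).
    """
    # All time points up to t where a recurrent event or a death occurs.
    event_points = {u for _, _, ev in subjects for u in ev if u <= t}
    death_points = {end for end, died, _ in subjects if died and end <= t}
    surv = 1.0   # Kaplan-Meier survival just before the current time, S(u-)
    mu = 0.0
    for u in sorted(event_points | death_points):
        at_risk = sum(1 for end, _, _ in subjects if end >= u)
        n_events = sum(ev.count(u) for end, _, ev in subjects if end >= u)
        n_deaths = sum(1 for end, died, _ in subjects if died and end == u)
        mu += surv * n_events / at_risk      # increment uses S(u-)
        surv *= 1.0 - n_deaths / at_risk     # then update survival at u
    return mu


# Two subjects: A has an event at 1 and dies at 2; B has events at 1 and 3
# and is censored at 10.
example = [(2, True, [1]), (10, False, [1, 3])]
print(marginal_mean(example, 5))
```

In this toy data set there is no censoring before time 5, so the estimate coincides with the plain average number of events per subject.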

Citations: 0
Contour models for physical boundaries enclosing star-shaped and approximately star-shaped polygons
IF 1.6 | Zone 4 (Mathematics) | Q2 Mathematics | Pub Date: 2022-09-19 | DOI: 10.1111/rssc.12592
Hannah M. Director, Adrian E. Raftery

Boundaries on spatial fields divide regions with particular features from surrounding background areas. Methods to identify boundary lines from interpolated spatial fields are well established. Less attention has been paid to how to model sequences of connected spatial points. Such models are needed for physical boundaries. For example, in the Arctic ocean, large contiguous areas are covered by sea ice, or frozen ocean water. We define the ice edge contour as the ordered sequences of spatial points that connect to form a line around set(s) of contiguous grid boxes with sea ice present. Polar scientists need to describe how this contiguous area behaves in present and historical data and under future climate change scenarios. We introduce the Gaussian Star-shaped Contour Model (GSCM) for modelling boundaries represented as connected sequences of spatial points such as the sea ice edge. GSCMs generate sequences of spatial points via generating sets of distances in various directions from a fixed starting point. The GSCM can be applied to contours that enclose regions that are star-shaped polygons or approximately star-shaped polygons. Metrics are introduced to assess the extent to which a polygon deviates from star-shapedness. Simulation studies illustrate the performance of the GSCM in different situations.
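
The idea of generating a contour from distances in various directions from a fixed starting point can be sketched as follows. This is only a toy version: the abstract does not give the covariance structure of the distances, so they are drawn here as independent Gaussians along evenly spaced directions, which is a simplification. The function name `sample_star_contour` and all parameter values are hypothetical.

```python
import math
import random


def sample_star_contour(center, mean_dist, sd_dist, n_directions, rng):
    """Draw one contour as points at random distances along fixed directions."""
    cx, cy = center
    points = []
    for k in range(n_directions):
        angle = 2.0 * math.pi * k / n_directions
        # Independent Gaussian distance (an assumption of this sketch),
        # floored at a small positive value so the point stays off the centre.
        dist = max(rng.gauss(mean_dist, sd_dist), 1e-6)
        points.append((cx + dist * math.cos(angle), cy + dist * math.sin(angle)))
    return points


contour = sample_star_contour((0.0, 0.0), 10.0, 1.5, 36, random.Random(1))
print(len(contour), contour[0])
```

Because every point lies on its own ray from the centre and the rays are visited in angular order, joining consecutive points yields a polygon that is star-shaped with respect to the centre.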

Citations: 0
Sequential one-step estimator by sub-sampling for customer churn analysis with massive data sets
IF 1.6 | Zone 4 (Mathematics) | Q2 Mathematics | Pub Date: 2022-09-19 | DOI: 10.1111/rssc.12597
Feifei Wang, Danyang Huang, Tianchen Gao, Shuyuan Wu, Hansheng Wang

Customer churn is one of the most important concerns for large companies. Currently, massive data are often encountered in customer churn analysis, which bring new challenges for model computation. To cope with these concerns, sub-sampling methods are often used to accomplish data analysis tasks of large scale. To cover more informative samples in one sampling round, classic sub-sampling methods need to compute non-uniform sampling probabilities for all data points. However, this method creates a huge computational burden for data sets of large scale and therefore, is not applicable in practice. In this study, we propose a sequential one-step (SOS) estimation method based on repeated sub-sampling data sets. In the SOS method, data points need to be sampled only with uniform probabilities, and the sampling step is conducted repeatedly. In each sampling step, a new estimate is computed via one-step updating based on the newly sampled data points. This leads to a sequence of estimates, of which the final SOS estimate is their average. We theoretically show that both the bias and the standard error of the SOS estimator can decrease with increasing sub-sampling sizes or sub-sampling times. The finite sample SOS performances are assessed through simulations. Finally, we apply this SOS method to analyse a real large-scale customer churn data set in a securities company. The results show that the SOS method has good interpretability and prediction power in this real application.
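
The sampling-and-averaging scheme described in this abstract can be sketched roughly as below. Several details are assumptions made for illustration and are not confirmed by the abstract: the model is taken to be a logistic regression with an intercept and one covariate, the one-step update is taken to be a single Newton-Raphson step, the pilot estimate comes from a few Newton iterations on a first uniform subsample, and each update starts from the previous estimate. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def newton_step(beta, sample):
    """One Newton-Raphson step for logistic regression (intercept + one slope)."""
    b0, b1 = beta
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in sample:
        p = sigmoid(b0 + b1 * x)
        w = p * (1.0 - p)
        g0 += y - p            # score, intercept component
        g1 += (y - p) * x      # score, slope component
        h00 += w               # information matrix entries
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return (b0 + d0, b1 + d1)


def sos_estimate(data, subsample_size, n_steps, rng):
    """Average of one-step estimates over repeated uniform subsamples."""
    # Pilot estimate (assumption): a few Newton iterations on one subsample.
    pilot = rng.sample(data, subsample_size)
    beta = (0.0, 0.0)
    for _ in range(10):
        beta = newton_step(beta, pilot)
    estimates = []
    for _ in range(n_steps):
        sub = rng.sample(data, subsample_size)   # uniform sampling probabilities
        beta = newton_step(beta, sub)            # one-step update on new subsample
        estimates.append(beta)
    return (sum(b[0] for b in estimates) / n_steps,
            sum(b[1] for b in estimates) / n_steps)


def simulate_data(n, beta, rng):
    """Synthetic (x, y) pairs from a logistic model, for demonstration only."""
    out = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(beta[0] + beta[1] * x) else 0
        out.append((x, y))
    return out
```

A call such as `sos_estimate(simulate_data(20000, (-0.5, 1.0), random.Random(0)), 500, 30, random.Random(1))` touches only 500 points per step, which is the point of the scheme: no per-observation sampling probabilities have to be computed over the full data set.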

Citations: 0
The saturated pairwise interaction Gibbs point process as a joint species distribution model
IF 1.6 | Zone 4 (Mathematics) | Q2 Mathematics | Pub Date: 2022-09-19 | DOI: 10.1111/rssc.12596
Ian Flint, Nick Golding, Peter Vesk, Yan Wang, Aihua Xia

In an effort to effectively model observed patterns in the spatial configuration of individuals of multiple species in nature, we introduce the saturated pairwise interaction Gibbs point process. Its main strength lies in its ability to model both attraction and repulsion within and between species, over different scales. As such, it is particularly well-suited to the study of associations in complex ecosystems. Based on the existing literature, we provide an easy to implement fitting procedure as well as a technique to make inference for the model parameters. We also prove that under certain hypotheses the point process is locally stable, which allows us to use the well-known ‘coupling from the past’ algorithm to draw samples from the model. Different numerical experiments show the robustness of the model. We study three different ecological data sets, demonstrating in each one that our model helps disentangle competing ecological effects on species' distribution.

Citations: 3
Score test for assessing the conditional dependence in latent class models and its application to record linkage
IF 1.6 | Zone 4 (Mathematics) | Q2 Mathematics | Pub Date: 2022-09-18 | DOI: 10.1111/rssc.12590
Huiping Xu, Xiaochun Li, Zuoyi Zhang, Shaun Grannis

The Fellegi–Sunter model has been widely used in probabilistic record linkage despite its often invalid conditional independence assumption. Prior research has demonstrated that conditional dependence latent class models yield improved match performance when using the correct conditional dependence structure. With a misspecified conditional dependence structure, these models can yield worse performance. It is, therefore, critically important to correctly identify the conditional dependence structure. Existing methods for identifying the conditional dependence structure include the correlation residual plot, the log-odds ratio check, and the bivariate residual, all of which have been shown to perform inadequately. A bootstrap bivariate residual approach and a score test have also been proposed and found to perform better, with the score test having greater power and a lower computational burden. In this paper, we extend the score-test-based approach to account for different conditional dependence structures. Through a simulation study, we develop practical recommendations on the utilisation of the score test and assess the match performance with conditional dependence identified by the proposed method. Performance of the proposed method is further evaluated using a real-world record linkage example. Findings show that the proposed method leads to improved matching accuracy relative to the Fellegi–Sunter model.

Citations: 0
Leveraging network structure to improve pooled testing efficiency
IF 1.6, Zone 4 (Mathematics), Q2 Mathematics, Pub Date: 2022-09-16, DOI: 10.1111/rssc.12594
Daniel K. Sewell

Screening is a powerful tool for infection control, allowing for infectious individuals, whether they be symptomatic or asymptomatic, to be identified and isolated. The resource burden of regular and comprehensive screening can often be prohibitive, however. One such measure to address this is pooled testing, whereby groups of individuals are each given a composite test; should a group receive a positive diagnostic test result, those comprising the group are then tested individually. Infectious disease is spread through a transmission network, and this paper shows how assigning individuals to pools based on this underlying network can improve the efficiency of the pooled testing strategy, thereby reducing the resource burden. We designed a simulated annealing algorithm to improve the pooled testing efficiency as measured by the ratio of the expected number of correct classifications to the expected number of tests performed. We then evaluated our approach using an agent-based model designed to simulate the spread of SARS-CoV-2 in a school setting. Our results suggest that our approach can decrease the number of tests required to regularly screen the student body, and that these reductions are quite robust to assigning pools based on partially observed or noisy versions of the network.
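The two-stage pooling scheme and the efficiency ratio described in this abstract can be sketched as follows. This is a simplified illustration, not the paper's method: it assumes a perfect diagnostic test and independent infections with a common prevalence, whereas the paper exploits the transmission network precisely because infections are not independent. The function names are hypothetical.

```python
def expected_tests_per_pool(pool_size: int, prevalence: float) -> float:
    """Expected number of tests used on one pool under two-stage pooling.

    Assumptions (for illustration only): the test is perfect, and each
    individual is infected independently with probability `prevalence`.
    """
    if pool_size == 1:
        # A pool of one is just an individual test.
        return 1.0
    # The pool tests positive if at least one member is infected.
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    # One composite test, plus individual retests if the pool is positive.
    return 1.0 + pool_size * p_pool_positive


def pooling_efficiency(pool_size: int, prevalence: float) -> float:
    """Expected correct classifications divided by expected tests performed.

    With a perfect test every pool member is classified correctly, so the
    numerator is simply the pool size.
    """
    expected_correct = float(pool_size)
    return expected_correct / expected_tests_per_pool(pool_size, prevalence)


# Compare several pool sizes at a hypothetical 1% prevalence.
efficiencies = {k: pooling_efficiency(k, 0.01) for k in (1, 5, 10, 20)}
```

Under these assumptions efficiency falls as the chance of a positive pool rises, which suggests why grouping individuals who are likely to share infection status can reduce the number of retests.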

Citations: 2