Journal of the Royal Statistical Society Series C-Applied Statistics最新文献_第9页

Sequential one-step estimator by sub-sampling for customer churn analysis with massive data sets 基于子抽样的大规模客户流失分析的序贯一步估计方法

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-09-19 DOI: 10.1111/rssc.12597

Feifei Wang, Danyang Huang, Tianchen Gao, Shuyuan Wu, Hansheng Wang

Customer churn is one of the most important concerns for large companies. Currently, massive data are often encountered in customer churn analysis, which bring new challenges for model computation. To cope with these concerns, sub-sampling methods are often used to accomplish data analysis tasks of large scale. To cover more informative samples in one sampling round, classic sub-sampling methods need to compute non-uniform sampling probabilities for all data points. However, this method creates a huge computational burden for data sets of large scale and therefore, is not applicable in practice. In this study, we propose a sequential one-step (SOS) estimation method based on repeated sub-sampling data sets. In the SOS method, data points need to be sampled only with uniform probabilities, and the sampling step is conducted repeatedly. In each sampling step, a new estimate is computed via one-step updating based on the newly sampled data points. This leads to a sequence of estimates, of which the final SOS estimate is their average. We theoretically show that both the bias and the standard error of the SOS estimator can decrease with increasing sub-sampling sizes or sub-sampling times. The finite sample SOS performances are assessed through simulations. Finally, we apply this SOS method to analyse a real large-scale customer churn data set in a securities company. The results show that the SOS method has good interpretability and prediction power in this real application.

客户流失是大公司最关心的问题之一。目前，客户流失分析中经常会遇到海量数据，这给模型计算带来了新的挑战。为了解决这些问题，通常采用子抽样方法来完成大规模的数据分析任务。为了在一轮抽样中覆盖更多的信息样本，经典的子抽样方法需要计算所有数据点的非均匀抽样概率。但是，这种方法对于大规模的数据集产生了巨大的计算负担，因此在实际应用中并不适用。在本研究中，我们提出了一种基于重复子抽样数据集的顺序一步(SOS)估计方法。在SOS方法中，只需要对数据点进行均匀概率采样，并且重复进行采样步骤。在每个采样步骤中，通过基于新采样数据点的一步更新计算新的估计。这导致一系列估计，其中最终的SOS估计是它们的平均值。我们从理论上证明了SOS估计器的偏差和标准误差都可以随着子抽样大小或子抽样次数的增加而减小。通过仿真评估了有限样本SOS的性能。最后，我们将此方法应用于某证券公司实际大规模客户流失数据集的分析。结果表明，SOS方法在实际应用中具有良好的可解释性和预测能力。

{"title":"Sequential one-step estimator by sub-sampling for customer churn analysis with massive data sets","authors":"Feifei Wang, Danyang Huang, Tianchen Gao, Shuyuan Wu, Hansheng Wang","doi":"10.1111/rssc.12597","DOIUrl":"10.1111/rssc.12597","url":null,"abstract":"Customer churn is one of the most important concerns for large companies. Currently, massive data are often encountered in customer churn analysis, which bring new challenges for model computation. To cope with these concerns, sub-sampling methods are often used to accomplish data analysis tasks of large scale. To cover more informative samples in one sampling round, classic sub-sampling methods need to compute non-uniform sampling probabilities for all data points. However, this method creates a huge computational burden for data sets of large scale and therefore, is not applicable in practice. In this study, we propose a sequential one-step (SOS) estimation method based on repeated sub-sampling data sets. In the SOS method, data points need to be sampled only with uniform probabilities, and the sampling step is conducted repeatedly. In each sampling step, a new estimate is computed via one-step updating based on the newly sampled data points. This leads to a sequence of estimates, of which the final SOS estimate is their average. We theoretically show that both the bias and the standard error of the SOS estimator can decrease with increasing sub-sampling sizes or sub-sampling times. The finite sample SOS performances are assessed through simulations. Finally, we apply this SOS method to analyse a real large-scale customer churn data set in a securities company. The results show that the SOS method has good interpretability and prediction power in this real application.","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1753-1786"},"PeriodicalIF":1.6,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88578893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

The saturated pairwise interaction Gibbs point process as a joint species distribution model 饱和两向相互作用Gibbs点过程作为联合物种分布模型

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-09-19 DOI: 10.1111/rssc.12596

Ian Flint, Nick Golding, Peter Vesk, Yan Wang, Aihua Xia

In an effort to effectively model observed patterns in the spatial configuration of individuals of multiple species in nature, we introduce the saturated pairwise interaction Gibbs point process. Its main strength lies in its ability to model both attraction and repulsion within and between species, over different scales. As such, it is particularly well-suited to the study of associations in complex ecosystems. Based on the existing literature, we provide an easy to implement fitting procedure as well as a technique to make inference for the model parameters. We also prove that under certain hypotheses the point process is locally stable, which allows us to use the well-known ‘coupling from the past’ algorithm to draw samples from the model. Different numerical experiments show the robustness of the model. We study three different ecological data sets, demonstrating in each one that our model helps disentangle competing ecological effects on species' distribution.

为了有效地模拟自然界中多物种个体空间配置的观测模式，我们引入了饱和双相互作用吉布斯点过程。它的主要优势在于它能够模拟不同尺度的物种内部和物种之间的吸引力和排斥力。因此，它特别适合于研究复杂生态系统中的关联。在现有文献的基础上，我们提供了一种易于实现的拟合程序以及模型参数的推理技术。我们还证明了在某些假设下，点过程是局部稳定的，这允许我们使用众所周知的“过去耦合”算法从模型中抽取样本。不同的数值实验表明了该模型的鲁棒性。我们研究了三个不同的生态数据集，在每个数据集中都证明了我们的模型有助于理清物种分布中相互竞争的生态效应。

引用次数: 3

Score test for assessing the conditional dependence in latent class models and its application to record linkage 潜在类别模型条件依赖性评估的得分检验及其在记录关联中的应用

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-09-18 DOI: 10.1111/rssc.12590

Huiping Xu, Xiaochun Li, Zuoyi Zhang, Shaun Grannis

The Fellegi–Sunter model has been widely used in probabilistic record linkage despite its often invalid conditional independence assumption. Prior research has demonstrated that conditional dependence latent class models yield improved match performance when using the correct conditional dependence structure. With a misspecified conditional dependence structure, these models can yield worse performance. It is, therefore, critically important to correctly identify the conditional dependence structure. Existing methods for identifying the conditional dependence structure include the correlation residual plot, the log-odds ratio check, and the bivariate residual, all of which have been shown to perform inadequately. Bootstrap bivariate residual approach and score test have also been proposed and found to have better performance, with the score test having greater power and lower computational burden. In this paper, we extend the score-test-based approach to account for different conditional dependence structures. Through a simulation study, we develop practical recommendations on the utilisation of the score test and assess the match performance with conditional dependence identified by the proposed method. Performance of the proposed method is further evaluated using a real-world record linkage example. Findings show that the proposed method leads to improved matching accuracy relative to the Fellegi–Sunter model.

尽管Fellegi-Sunter模型的条件独立假设常常是无效的，但它在概率记录关联中得到了广泛的应用。已有研究表明，当使用正确的条件依赖结构时，条件依赖潜类模型的匹配性能得到了提高。如果使用错误指定的条件依赖结构，这些模型可能会产生更差的性能。因此，正确识别条件依赖结构是至关重要的。现有的识别条件依赖结构的方法包括相关残差图、对数-比值比检查和二元残差，但这些方法都表现不佳。Bootstrap双变量残差法和分数检验也被提出，结果表明分数检验具有更好的性能，分数检验具有更大的能力和更低的计算负担。在本文中，我们扩展了基于分数测试的方法来考虑不同的条件依赖结构。通过模拟研究，我们提出了关于分数测试使用的实用建议，并评估了由所提出的方法确定的条件依赖的匹配性能。使用实际记录链接示例进一步评估了所提出方法的性能。研究结果表明，相对于Fellegi-Sunter模型，该方法具有更高的匹配精度。

{"title":"Score test for assessing the conditional dependence in latent class models and its application to record linkage","authors":"Huiping Xu, Xiaochun Li, Zuoyi Zhang, Shaun Grannis","doi":"10.1111/rssc.12590","DOIUrl":"10.1111/rssc.12590","url":null,"abstract":"The Fellegi–Sunter model has been widely used in probabilistic record linkage despite its often invalid conditional independence assumption. Prior research has demonstrated that conditional dependence latent class models yield improved match performance when using the correct conditional dependence structure. With a misspecified conditional dependence structure, these models can yield worse performance. It is, therefore, critically important to correctly identify the conditional dependence structure. Existing methods for identifying the conditional dependence structure include the correlation residual plot, the log-odds ratio check, and the bivariate residual, all of which have been shown to perform inadequately. Bootstrap bivariate residual approach and score test have also been proposed and found to have better performance, with the score test having greater power and lower computational burden. In this paper, we extend the score-test-based approach to account for different conditional dependence structures. Through a simulation study, we develop practical recommendations on the utilisation of the score test and assess the match performance with conditional dependence identified by the proposed method. Performance of the proposed method is further evaluated using a real-world record linkage example. Findings show that the proposed method leads to improved matching accuracy relative to the Fellegi–Sunter model.","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1663-1687"},"PeriodicalIF":1.6,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82870632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Leveraging network structure to improve pooled testing efficiency 利用网络结构提高池化测试效率

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-09-16 DOI: 10.1111/rssc.12594

Daniel K. Sewell

Screening is a powerful tool for infection control, allowing for infectious individuals, whether they be symptomatic or asymptomatic, to be identified and isolated. The resource burden of regular and comprehensive screening can often be prohibitive, however. One such measure to address this is pooled testing, whereby groups of individuals are each given a composite test; should a group receive a positive diagnostic test result, those comprising the group are then tested individually. Infectious disease is spread through a transmission network, and this paper shows how assigning individuals to pools based on this underlying network can improve the efficiency of the pooled testing strategy, thereby reducing the resource burden. We designed a simulated annealing algorithm to improve the pooled testing efficiency as measured by the ratio of the expected number of correct classifications to the expected number of tests performed. We then evaluated our approach using an agent-based model designed to simulate the spread of SARS-CoV-2 in a school setting. Our results suggest that our approach can decrease the number of tests required to regularly screen the student body, and that these reductions are quite robust to assigning pools based on partially observed or noisy versions of the network.

筛查是控制感染的有力工具，可以识别和隔离有症状或无症状的感染个体。然而，定期和全面筛查的资源负担往往令人望而却步。解决这一问题的一个这样的措施是集合测试，即每组个体都接受一个复合测试;如果一个组的诊断测试结果呈阳性，则该组的成员将分别接受测试。传染病是通过传播网络传播的，本文展示了如何基于这个底层网络将个体分配到池中，从而提高池检测策略的效率，从而减少资源负担。我们设计了一种模拟退火算法，通过期望正确分类数与期望执行的测试数之比来提高池测试效率。然后，我们使用基于代理的模型评估了我们的方法，该模型旨在模拟SARS-CoV-2在学校环境中的传播。我们的结果表明，我们的方法可以减少定期筛选学生群体所需的测试次数，并且这些减少对于基于部分观察到的或有噪声的网络版本分配池非常稳健。

{"title":"Leveraging network structure to improve pooled testing efficiency","authors":"Daniel K. Sewell","doi":"10.1111/rssc.12594","DOIUrl":"10.1111/rssc.12594","url":null,"abstract":"Screening is a powerful tool for infection control, allowing for infectious individuals, whether they be symptomatic or asymptomatic, to be identified and isolated. The resource burden of regular and comprehensive screening can often be prohibitive, however. One such measure to address this is pooled testing, whereby groups of individuals are each given a composite test; should a group receive a positive diagnostic test result, those comprising the group are then tested individually. Infectious disease is spread through a transmission network, and this paper shows how assigning individuals to pools based on this underlying network can improve the efficiency of the pooled testing strategy, thereby reducing the resource burden. We designed a simulated annealing algorithm to improve the pooled testing efficiency as measured by the ratio of the expected number of correct classifications to the expected number of tests performed. We then evaluated our approach using an agent-based model designed to simulate the spread of SARS-CoV-2 in a school setting. Our results suggest that our approach can decrease the number of tests required to regularly screen the student body, and that these reductions are quite robust to assigning pools based on partially observed or noisy versions of the network.","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1648-1662"},"PeriodicalIF":1.6,"publicationDate":"2022-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/0b/29/RSSC-71-1648.PMC9826453.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10257743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Semi-parametric time-to-event modelling of lengths of hospital stays 住院时间长度的半参数时间-事件模型

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-09-15 DOI: 10.1111/rssc.12593

Yang Li, Hao Liu, Xiaoshen Wang, Wanzhu Tu

Length of stay (LOS) is an essential metric for the quality of hospital care. Published works on LOS analysis have primarily focused on skewed LOS distributions and the influences of patient diagnostic characteristics. Few authors have considered the events that terminate a hospital stay: Both successful discharge and death could end a hospital stay but with completely different implications. Modelling the time to the first occurrence of discharge or death obscures the true nature of LOS. In this research, we propose a structure that simultaneously models the probabilities of discharge and death. The model has a flexible formulation that accounts for both additive and multiplicative effects of factors influencing the occurrence of death and discharge. We present asymptotic properties of the parameter estimates so that valid inference can be performed for the parametric as well as nonparametric model components. Simulation studies confirmed the good finite-sample performance of the proposed method. As the research is motivated by practical issues encountered in LOS analysis, we analysed data from two real clinical studies to showcase the general applicability of the proposed model.

住院时间(LOS)是衡量医院护理质量的重要指标。已发表的LOS分析作品主要集中在斜斜的LOS分布和患者诊断特征的影响。很少有作者考虑到终止住院的事件:成功出院和死亡都可能结束住院，但具有完全不同的含义。模拟第一次放电或死亡的时间模糊了LOS的真实性质。在本研究中，我们提出了一个同时模拟放电和死亡概率的结构。该模型具有灵活的公式，可以考虑影响死亡和放电发生的因素的加性和乘法效应。我们给出了参数估计的渐近性质，从而可以对参数和非参数模型分量进行有效的推理。仿真研究证实了该方法具有良好的有限样本性能。由于研究的动机是LOS分析中遇到的实际问题，我们分析了两个真实临床研究的数据，以展示所提出模型的一般适用性。

{"title":"Semi-parametric time-to-event modelling of lengths of hospital stays","authors":"Yang Li, Hao Liu, Xiaoshen Wang, Wanzhu Tu","doi":"10.1111/rssc.12593","DOIUrl":"10.1111/rssc.12593","url":null,"abstract":"Length of stay (LOS) is an essential metric for the quality of hospital care. Published works on LOS analysis have primarily focused on skewed LOS distributions and the influences of patient diagnostic characteristics. Few authors have considered the events that terminate a hospital stay: Both successful discharge and death could end a hospital stay but with completely different implications. Modelling the time to the first occurrence of discharge or death obscures the true nature of LOS. In this research, we propose a structure that simultaneously models the probabilities of discharge and death. The model has a flexible formulation that accounts for both additive and multiplicative effects of factors influencing the occurrence of death and discharge. We present asymptotic properties of the parameter estimates so that valid inference can be performed for the parametric as well as nonparametric model components. Simulation studies confirmed the good finite-sample performance of the proposed method. As the research is motivated by practical issues encountered in LOS analysis, we analysed data from two real clinical studies to showcase the general applicability of the proposed model.","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1623-1647"},"PeriodicalIF":1.6,"publicationDate":"2022-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/e7/b9/RSSC-71-1623.PMC9826400.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10525190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Utility-based Bayesian personalized treatment selection for advanced breast cancer 基于效用的晚期乳腺癌贝叶斯个性化治疗选择

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-09-09 DOI: 10.1111/rssc.12582

Juhee Lee, Peter F. Thall, Bora Lim, Pavlos Msaouel

A Bayesian method is proposed for personalized treatment selection in settings where data are available from a randomized clinical trial with two or more outcomes. The motivating application is a randomized trial that compared letrozole plus bevacizumab to letrozole alone as first-line therapy for hormone receptor-positive advanced breast cancer. The combination treatment arm had larger median progression-free survival time, but also a higher rate of severe toxicities. This suggests that the risk-benefit trade-off between these two outcomes should play a central role in selecting each patient's treatment, particularly since older patients are less likely to tolerate severe toxicities. To quantify the desirability of each possible outcome combination for an individual patient, we elicited from breast cancer oncologists a utility function that varied with age. The utility was used as an explicit criterion for quantifying risk-benefit trade-offs when making personalized treatment selections. A Bayesian nonparametric multivariate regression model with a dependent Dirichlet process prior was fit to the trial data. Under the fitted model, a new patient's treatment can be selected based on the posterior predictive utility distribution. For the breast cancer trial dataset, the optimal treatment depends on the patient's age, with the combination preferable for patients 70 years or younger and the single agent preferable for patients older than 70.

在具有两个或多个结果的随机临床试验中，提出了一种贝叶斯方法，用于个性化治疗选择。激励应用是一项随机试验，比较来曲唑加贝伐单抗与来曲唑单独作为激素受体阳性晚期乳腺癌一线治疗。联合治疗组的中位无进展生存时间更长，但严重毒性发生率也更高。这表明，这两种结果之间的风险-收益权衡应该在选择每个患者的治疗方案时发挥核心作用，特别是因为老年患者不太可能耐受严重的毒性。为了量化每位患者的每种可能结果组合的可取性，我们从乳腺癌肿瘤学家那里获得了一个随年龄变化的效用函数。在做出个性化治疗选择时，效用被用作量化风险-收益权衡的明确标准。对试验数据拟合了一个具有相关Dirichlet过程先验的贝叶斯非参数多元回归模型。在拟合模型下，根据后验预测效用分布选择新患者的治疗方案。对于乳腺癌试验数据集，最佳治疗取决于患者的年龄，70岁或以下的患者优选联合用药，70岁以上的患者优选单药。

{"title":"Utility-based Bayesian personalized treatment selection for advanced breast cancer","authors":"Juhee Lee, Peter F. Thall, Bora Lim, Pavlos Msaouel","doi":"10.1111/rssc.12582","DOIUrl":"10.1111/rssc.12582","url":null,"abstract":"A Bayesian method is proposed for personalized treatment selection in settings where data are available from a randomized clinical trial with two or more outcomes. The motivating application is a randomized trial that compared letrozole plus bevacizumab to letrozole alone as first-line therapy for hormone receptor-positive advanced breast cancer. The combination treatment arm had larger median progression-free survival time, but also a higher rate of severe toxicities. This suggests that the risk-benefit trade-off between these two outcomes should play a central role in selecting each patient's treatment, particularly since older patients are less likely to tolerate severe toxicities. To quantify the desirability of each possible outcome combination for an individual patient, we elicited from breast cancer oncologists a utility function that varied with age. The utility was used as an explicit criterion for quantifying risk-benefit trade-offs when making personalized treatment selections. A Bayesian nonparametric multivariate regression model with a dependent Dirichlet process prior was fit to the trial data. Under the fitted model, a new patient's treatment can be selected based on the posterior predictive utility distribution. For the breast cancer trial dataset, the optimal treatment depends on the patient's age, with the combination preferable for patients 70 years or younger and the single agent preferable for patients older than 70.","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1605-1622"},"PeriodicalIF":1.6,"publicationDate":"2022-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10116488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Measuring diachronic sense change: New models and Monte Carlo methods for Bayesian inference 测量历时感变化:贝叶斯推理的新模型和蒙特卡罗方法

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-09-06 DOI: 10.1111/rssc.12591

Schyan Zafar, Geoff K. Nicholls

In a bag-of-words model, the senses of a word with multiple meanings, for example ‘bank’ (used either in a river-bank or an institution sense), are represented as probability distributions over context words, and sense prevalence is represented as a probability distribution over senses. Both of these may change with time. Modelling and measuring this kind of sense change are challenging due to the typically high-dimensional parameter space and sparse datasets. A recently published corpus of ancient Greek texts contains expert-annotated sense labels for selected target words. Automatic sense-annotation for the word ‘kosmos’ (meaning decoration, order or world) has been used as a test case in recent work with related generative models and Monte Carlo methods. We adapt an existing generative sense change model to develop a simpler model for the main effects of sense and time, and give Markov Chain Monte Carlo methods for Bayesian inference on all these models that are more efficient than existing methods. We carry out automatic sense-annotation of snippets containing ‘kosmos’ using our model, and measure the time-evolution of its three senses and their prevalence. As far as we are aware, ours is the first analysis of this data, within the class of generative models we consider, that quantifies uncertainty and returns credible sets for evolving sense prevalence in good agreement with those given by expert annotation.

在词袋模型中，一个词的多个含义的意义，例如“bank”(用于河岸或机构意义)，被表示为上下文词的概率分布，而意义流行度被表示为意义的概率分布。这两者都可能随着时间而改变。由于典型的高维参数空间和稀疏数据集，这种感觉变化的建模和测量具有挑战性。最近出版的古希腊文本语料库包含专家注释的意义标签为选定的目标词。单词“kosmos”(意为装饰、秩序或世界)的自动意义注释已被用作最近与相关生成模型和蒙特卡罗方法一起工作的测试用例。我们对现有的生成式感觉变化模型进行了改进，建立了一个更简单的模型来描述感觉和时间的主要影响，并给出了在所有这些模型上进行贝叶斯推理的马尔可夫链蒙特卡罗方法，该方法比现有方法更有效。我们使用我们的模型对包含“宇宙”的片段进行自动意义注释，并测量其三种意义的时间演化及其流行程度。据我们所知，在我们考虑的生成模型类别中，我们的分析是对这些数据的第一次分析，它量化了不确定性，并返回了与专家注释给出的一致的进化感觉流行度的可信集。

{"title":"Measuring diachronic sense change: New models and Monte Carlo methods for Bayesian inference","authors":"Schyan Zafar, Geoff K. Nicholls","doi":"10.1111/rssc.12591","DOIUrl":"10.1111/rssc.12591","url":null,"abstract":"In a bag-of-words model, the senses of a word with multiple meanings, for example ‘bank’ (used either in a river-bank or an institution sense), are represented as probability distributions over context words, and sense prevalence is represented as a probability distribution over senses. Both of these may change with time. Modelling and measuring this kind of sense change are challenging due to the typically high-dimensional parameter space and sparse datasets. A recently published corpus of ancient Greek texts contains expert-annotated sense labels for selected target words. Automatic sense-annotation for the word ‘kosmos’ (meaning decoration, order or world) has been used as a test case in recent work with related generative models and Monte Carlo methods. We adapt an existing generative sense change model to develop a simpler model for the main effects of sense and time, and give Markov Chain Monte Carlo methods for Bayesian inference on all these models that are more efficient than existing methods. We carry out automatic sense-annotation of snippets containing ‘kosmos’ using our model, and measure the time-evolution of its three senses and their prevalence. As far as we are aware, ours is the first analysis of this data, within the class of generative models we consider, that quantifies uncertainty and returns credible sets for evolving sense prevalence in good agreement with those given by expert annotation.","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1569-1604"},"PeriodicalIF":1.6,"publicationDate":"2022-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/rssc.12591","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82774142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Environmental Engel curves: A neural network approach 环境恩格尔曲线:一种神经网络方法

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-08-31 DOI: 10.1111/rssc.12588

Tullio Mancini, Hector Calvo-Pardo, Jose Olmo

Environmental Engel curves describe how households' income relates to the pollution associated with the services and goods consumed. This paper estimates these curves with neural networks using the novel dataset constructed in Levinson and O'Brien. We provide further statistical rigor to the empirical analysis by constructing prediction intervals obtained from novel neural network methods such as extra-neural nets and MC dropout. The application of these techniques for five different pollutants allow us to confirm statistically that Environmental Engel curves are upward sloping, have income elasticities smaller than one and shift down, becoming more concave, over time. Importantly, for the last year of the sample, we find an inverted U shape that suggests the existence of a maximum in pollution for medium-to-high levels of household income beyond which pollution flattens or decreases for top income earners.

环境恩格尔曲线描述了家庭收入与所消费的服务和商品相关的污染之间的关系。本文使用Levinson和O'Brien构建的新数据集用神经网络估计这些曲线。我们通过构建新的神经网络方法(如extra-neural networks和MC dropout)获得的预测区间，为实证分析提供进一步的统计严谨性。这些技术对五种不同污染物的应用使我们能够在统计上确认环境恩格尔曲线是向上倾斜的，收入弹性小于1，并且随着时间的推移向下移动，变得更加凹。重要的是，在样本的最后一年，我们发现了一个倒U形，表明中高收入家庭的污染存在一个最大值，超过这个最大值，高收入者的污染就会持平或减少。

引用次数: 0

Non-parametric Bayesian covariate-dependent multivariate functional clustering: An application to time-series data for multiple air pollutants 非参数贝叶斯协变量相关多变量函数聚类:多空气污染物时间序列数据的应用

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-08-30 DOI: 10.1111/rssc.12589

Daewon Yang, Taeryon Choi, Eric Lavigne, Yeonseung Chung

Air pollution is a major threat to public health. Understanding the spatial distribution of air pollution concentration is of great interest to government or local authorities, as it informs about target areas for implementing policies for air quality management. Cluster analysis has been popularly used to identify groups of locations with similar profiles of average levels of multiple air pollutants, efficiently summarising the spatial pattern. This study aimed to cluster locations based on the seasonal patterns of multiple air pollutants incorporating the location-specific characteristics such as socio-economic indicators. For this purpose, we proposed a novel non-parametric Bayesian sparse latent factor model for covariate-dependent multivariate functional clustering. Furthermore, we extend this model to conduct clustering with temporal dependency. The proposed methods are illustrated through a simulation study and applied to time-series data for daily mean concentrations of ozone ( $� � � �_{� O � � � 3 � �} � � �$ ), nitrogen dioxide ( $� � � N � �_{� O � � � 2 � �} � � �$ ), and fine particulate matter ( $� � � P � �_{� M � � � 2 � . � 5 � �} � � �$ ) collected for 25 cities in Canada in 1986–2015.

空气污染是对公众健康的重大威胁。了解空气污染浓度的空间分布对政府或地方当局非常有意义，因为它可以为实施空气质量管理政策的目标区域提供信息。聚类分析已被广泛用于识别具有多种空气污染物平均水平相似概况的地点组，有效地总结空间格局。本研究旨在根据多种空气污染物的季节性模式，结合社会经济指标等地点特定特征，对地点进行聚类。为此，我们提出了一种新的非参数贝叶斯稀疏潜因子模型，用于协变量相关的多元函数聚类。此外，我们将该模型扩展到具有时间依赖性的聚类。通过模拟研究说明了所提出的方法，并将其应用于臭氧日平均浓度的时间序列数据(o3 $$ {mathrm{O}}_3 $$)。二氧化氮(n2 $$ mathrm{N}{mathrm{O}}_2 $$);细颗粒物(pm2)。5 $$ mathrm{P}{mathrm{M}}_{2.5} $$)，收集了1986-2015年加拿大25个城市的数据。

{"title":"Non-parametric Bayesian covariate-dependent multivariate functional clustering: An application to time-series data for multiple air pollutants","authors":"Daewon Yang, Taeryon Choi, Eric Lavigne, Yeonseung Chung","doi":"10.1111/rssc.12589","DOIUrl":"10.1111/rssc.12589","url":null,"abstract":"Air pollution is a major threat to public health. Understanding the spatial distribution of air pollution concentration is of great interest to government or local authorities, as it informs about target areas for implementing policies for air quality management. Cluster analysis has been popularly used to identify groups of locations with similar profiles of average levels of multiple air pollutants, efficiently summarising the spatial pattern. This study aimed to cluster locations based on the seasonal patterns of multiple air pollutants incorporating the location-specific characteristics such as socio-economic indicators. For this purpose, we proposed a novel non-parametric Bayesian sparse latent factor model for covariate-dependent multivariate functional clustering. Furthermore, we extend this model to conduct clustering with temporal dependency. The proposed methods are illustrated through a simulation study and applied to time-series data for daily mean concentrations of ozone (<math>\u0000 <semantics>\u0000 <mrow>\u0000 <msub>\u0000 <mrow>\u0000 <mi>O</mi>\u0000 </mrow>\u0000 <mrow>\u0000 <mn>3</mn>\u0000 </mrow>\u0000 </msub>\u0000 </mrow>\u0000 <annotation>$$ {mathrm{O}}_3 $$</annotation>\u0000 </semantics></math>), nitrogen dioxide (<math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>N</mi>\u0000 <msub>\u0000 <mrow>\u0000 <mi>O</mi>\u0000 </mrow>\u0000 <mrow>\u0000 <mn>2</mn>\u0000 </mrow>\u0000 </msub>\u0000 </mrow>\u0000 <annotation>$$ mathrm{N}{mathrm{O}}_2 $$</annotation>\u0000 </semantics></math>), and fine particulate matter (<math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>P</mi>\u0000 <msub>\u0000 <mrow>\u0000 <mi>M</mi>\u0000 </mrow>\u0000 <mrow>\u0000 <mn>2</mn>\u0000 <mo>.</mo>\u0000 <mn>5</mn>\u0000 </mrow>\u0000 </msub>\u0000 </mrow>\u0000 <annotation>$$ mathrm{P}{mathrm{M}}_{2.5} $$</annotation>\u0000 </semantics></math>) collected for 25 cities in Canada in 1986–2015.","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1521-1542"},"PeriodicalIF":1.6,"publicationDate":"2022-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89028372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A model-based approach to predict employee compensation components 基于模型的员工薪酬预测方法

IF 1.6 4区数学 Q3 STATISTICS & PROBABILITY

Journal of the Royal Statistical Society Series C-Applied Statistics

Pub Date : 2022-08-26 DOI: 10.1111/rssc.12587

Andreea L. Erciulescu, Jean D. Opsomer

The demand for official statistics at fine levels is motivating researchers to explore estimation methods that extend beyond the traditional survey-based estimation. For this work, the challenge originated with the US Bureau of Labor Statistics, who conducts the National Compensation Survey to collect compensation data from a nationwide sample of establishments. The objective is to obtain predictions of the wage and non-wage components of compensation for a large number of employment domains defined by detailed job characteristics. Survey estimates are only available for a small subset of these domains. To address the objective, we developed a bivariate hierarchical Bayes model that jointly predicts the wage and non-wage compensation components for a large number of employment domains defined by detailed job characteristics. We also discuss solutions to some practical challenges encountered in implementing small area estimation methods in large-scale settings, including methods for defining the prediction space, for constructing and selecting the information that serves as model input, and for obtaining stable survey variance and covariance estimates.

对精细官方统计的需求促使研究者探索超越传统的基于调查的估计方法。对于这项工作，挑战源于美国劳工统计局，他们进行了全国薪酬调查，从全国范围内的企业样本中收集薪酬数据。其目的是对由详细的工作特征界定的大量就业领域的工资和非工资部分的薪酬进行预测。调查估计仅适用于这些领域的一小部分。为了实现这一目标，我们开发了一个双变量分层贝叶斯模型，该模型可以共同预测由详细工作特征定义的大量就业领域的工资和非工资补偿成分。我们还讨论了在大规模环境下实施小面积估计方法时遇到的一些实际挑战的解决方案，包括定义预测空间的方法，构建和选择作为模型输入的信息的方法，以及获得稳定的调查方差和协方差估计的方法。

{"title":"A model-based approach to predict employee compensation components","authors":"Andreea L. Erciulescu, Jean D. Opsomer","doi":"10.1111/rssc.12587","DOIUrl":"10.1111/rssc.12587","url":null,"abstract":"The demand for official statistics at fine levels is motivating researchers to explore estimation methods that extend beyond the traditional survey-based estimation. For this work, the challenge originated with the US Bureau of Labor Statistics, who conducts the National Compensation Survey to collect compensation data from a nationwide sample of establishments. The objective is to obtain predictions of the wage and non-wage components of compensation for a large number of employment domains defined by detailed job characteristics. Survey estimates are only available for a small subset of these domains. To address the objective, we developed a bivariate hierarchical Bayes model that jointly predicts the wage and non-wage compensation components for a large number of employment domains defined by detailed job characteristics. We also discuss solutions to some practical challenges encountered in implementing small area estimation methods in large-scale settings, including methods for defining the prediction space, for constructing and selecting the information that serves as model input, and for obtaining stable survey variance and covariance estimates.","PeriodicalId":49981,"journal":{"name":"Journal of the Royal Statistical Society Series C-Applied Statistics","volume":"71 5","pages":"1503-1520"},"PeriodicalIF":1.6,"publicationDate":"2022-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88880927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4