Latest articles from Journal of data science : JDS

Quantifying Gender Disparity in Pre-Modern English Literature using Natural Language Processing
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1100
M. Kejriwal, Akarsh Nagaraj
Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the data science and Natural Language Processing (NLP) communities have been proposed for measuring such disparity at scale using empirically rigorous methodologies. In this article, we contribute to this line of research by studying gender disparity in 2,443 copyright-expired literary texts published in the pre-modern period, defined in this work as the period ranging from the beginning of the nineteenth through the early twentieth century. Using a replicable data science methodology relying on publicly available and established NLP components, we extract three different gendered character prevalence measures within these texts. We use an extensive set of statistical tests to robustly demonstrate a significant disparity between the prevalence of female characters and male characters in pre-modern literature. We also show that the proportion of female characters is significantly higher in female-authored texts than in male-authored texts. However, regression-based analysis shows that, over the 120-year period covered by the corpus, female character prevalence does not change significantly over time, and remains below the parity level of 50%, regardless of the gender of the author. Qualitative analyses further show that descriptions associated with female characters across the corpus are markedly different (and stereotypical) from the descriptions associated with male characters.
Citations: 0
Association Between Body Fat and Body Mass Index from Incomplete Longitudinal Proportion Data: Findings from the Fels Study
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1104
Xin Tong, Seohyun Kim, D. Bandyopadhyay, Shumei S. Sun
Obesity rates continue to rise, particularly in the US, and obesity is the underlying cause of several comorbidities, including but not limited to high blood pressure, high cholesterol, diabetes, heart disease, stroke, and cancers. To monitor obesity, body mass index (BMI) and proportion body fat (PBF) are two commonly used measurements. Although BMI and PBF change over an individual's lifespan and their relationship may also change dynamically, existing work has mostly remained cross-sectional or has modeled BMI and PBF separately. A combined longitudinal assessment is expected to be more effective in unravelling their complex interplay. To address this gap, we consider Bayesian cross-domain latent growth curve models within a structural equation modeling framework, which simultaneously handle issues such as individually varying time metrics, proportion data, and potential missing-not-at-random data for joint assessment of the longitudinal changes of BMI and PBF. Through simulation studies, we observe that our proposed models and estimation method generally yield parameter estimates with small bias and mean squared error; however, a mis-specified missing data mechanism may cause inaccurate and inefficient parameter estimates. Furthermore, we demonstrate application of our method to a motivating longitudinal obesity study, controlling for both time-invariant (such as sex) and time-varying (such as diastolic and systolic blood pressure, biceps skinfold, bioelectrical impedance, and waist circumference) covariates in separate models. Under time-invariance, we observe that the initial BMI level and the rate of change in BMI influenced PBF. However, in the presence of time-varying covariates, only the initial BMI level influenced the initial PBF. The added-on selection model estimation indicated that observations with higher PBF values were less likely to be missing.
Citations: 0
The Effects of County-Level Socioeconomic and Healthcare Factors on Controlling COVID-19 in the Southern and Southeastern United States
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1111
Jackson Barth, Guanqing Cheng, Webb Williams, Ming Zhang, H. K. T. Ng
This paper aims to determine the effects of socioeconomic and healthcare factors on the performance of controlling COVID-19 in both the Southern and Southeastern United States. This analysis will provide government agencies with information to determine which communities need additional COVID-19 assistance, to identify counties that effectively control COVID-19, and to apply effective strategies on a broader scale. The statistical analysis uses data from 328 counties with a population of more than 65,000 from 13 states. We define a new response variable by considering infection and mortality rates to capture how well each county controls COVID-19. We collect 14 factors from the 2019 American Community Survey Single-Year Estimates and obtain county-level infection and mortality rates from USAfacts.org. We use least absolute shrinkage and selection operator (LASSO) regression to fit a multiple linear regression model and develop an interactive system programmed in R Shiny to deliver all results. The interactive system at https://asa-competition-smu.shinyapps.io/COVID19/ provides many options for users to explore our data, models, and results.
Citations: 0
Editorial: Symposium Data Science and Statistics 2022
Pub Date : 2023-01-01 DOI: 10.6339/23-jds212edi
C. Bowen, M. Grosskopf
Citations: 1
Efficient Bayesian High-Dimensional Classification via Random Projection with Application to Gene Expression Data
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1102
Abhisek Chakraborty
Inspired by the impressive successes of compressed sensing-based machine learning algorithms, data augmentation-based efficient Gibbs samplers for Bayesian high-dimensional classification models are developed by compressing the design matrix to a much lower dimension. Great care is exercised in the choice of the projection mechanism, and an adaptive voting rule is employed to reduce sensitivity to the random projection matrix. Focusing on the high-dimensional Probit regression model, we note that the naive implementation of the data augmentation-based Gibbs sampler is not robust to the presence of collinearity in the design matrix, a setup ubiquitous in $n<p$ problems. We demonstrate that a simple fix based on joint updates of parameters in the latent space circumvents this issue. With a computationally efficient MCMC scheme in place, we introduce an ensemble classifier by creating R (approximately 25 to 50) projected copies of the design matrix, and subsequently running R classification models with the R projected design matrices in parallel. We combine the output from the R replications via an adaptive voting scheme. Our scheme is inherently parallelizable and capable of taking advantage of modern computing environments often equipped with multiple cores. The empirical success of our methodology is illustrated in elaborate simulations and gene expression data applications. We also extend our methodology to a high-dimensional logistic regression model and carry out numerical studies to showcase its efficacy.
Citations: 0
Identification of Optimal Combined Moderators for Time to Relapse
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1107
Bang Wang, Yu Cheng, M. Levine
Identifying treatment effect modifiers (i.e., moderators) plays an essential role in improving treatment efficacy when substantial treatment heterogeneity exists. However, studies are often underpowered for detecting treatment effect modifiers, and exploratory analyses that examine one moderator per statistical model often yield spurious interactions. Therefore, in this work, we focus on creating an intuitive and readily implementable framework to facilitate the discovery of treatment effect modifiers and to make treatment recommendations for time-to-event outcomes. To minimize the impact of a misspecified main effect and avoid complex modeling, we construct the framework by matching the treated with the controls and modeling the conditional average treatment effect via regressing the difference in the observed outcomes of a matched pair on the averaged moderators. Inverse-probability-of-censoring weighting is used to handle censored observations. As matching is the foundation of the proposed methods, we explore different matching metrics and recommend the use of Mahalanobis distance when both continuous and categorical moderators are present. After matching, the proposed framework can be flexibly combined with popular variable selection and prediction methods such as linear regression, least absolute shrinkage and selection operator (Lasso), and random forest to create different combinations of potential moderators. The optimal combination is determined by the out-of-bag prediction error and the area under the receiver operating characteristic curve in making correct treatment recommendations. We compare the performance of various combined moderators through extensive simulations and the analysis of real trial data. Our approach can be easily implemented using existing R packages, resulting in a straightforward optimal combined moderator to make treatment recommendations.
Citations: 0
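The abstract for "Identification of Optimal Combined Moderators for Time to Relapse" recommends Mahalanobis distance for matching the treated with the controls. A minimal sketch of that matching step follows. It assumes numeric moderators (categorical ones would first need a numeric encoding), nearest-neighbour matching with replacement, and a covariance matrix pooled over both groups; the function name `mahalanobis_match` and these choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mahalanobis_match(X_treated, X_control):
    """For each treated unit, find the nearest control in Mahalanobis distance.

    Hypothetical helper: matching is with replacement, and the covariance is
    estimated from the pooled treated and control moderators.
    """
    X_t = np.asarray(X_treated, dtype=float)
    X_c = np.asarray(X_control, dtype=float)
    # Pooled covariance of the moderators (rows are units, columns are moderators).
    cov = np.atleast_2d(np.cov(np.vstack([X_t, X_c]), rowvar=False))
    cov_inv = np.linalg.pinv(cov)
    # Pairwise differences: shape (n_treated, n_control, n_moderators).
    diff = X_t[:, None, :] - X_c[None, :, :]
    # Squared Mahalanobis distance for every treated/control pair.
    d2 = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative rounding errors
    match_index = d2.argmin(axis=1)
    match_distance = np.sqrt(d2.min(axis=1))
    return match_index, match_distance
```

The returned indices could then be used to form matched pairs whose outcome differences are regressed on the averaged moderators, as the abstract describes; that regression step and the censoring weights are not shown here.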
Revisiting the Use of Generalized Least Squares in Time Series Regression Models
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1108
Yue Fang, S. Koreisha, Q. Shao
Linear regression models are widely used in empirical studies. When serial correlation is present in the residuals, generalized least squares (GLS) estimation is commonly used to improve estimation efficiency. This paper proposes the use of an alternative estimator, the approximate generalized least squares estimator based on high-order AR(p) processes (GLS-AR). We show that GLS-AR estimators are asymptotically as efficient as GLS estimators, as both the number of AR lags, p, and the number of observations, n, increase together so that $p=o(n^{1/4})$ in the limit. The proposed GLS-AR estimators do not require the identification of the residual serial autocorrelation structure and perform more robustly in finite samples than the conventional FGLS-based tests. Finally, we illustrate the usefulness of the GLS-AR method by applying it to the global warming data from 1850–2012.
Citations: 0
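The GLS-AR idea in the abstract for "Revisiting the Use of Generalized Least Squares in Time Series Regression Models" can be illustrated with a generic two-step procedure: fit OLS, approximate the residual autocorrelation with an AR(p) model, filter the data with the estimated AR polynomial, and refit. The sketch below is one plausible reading under those assumptions; the function name `gls_ar`, the least-squares AR fit, and dropping the first p observations are illustrative choices, and the paper's exact estimator may differ.

```python
import numpy as np

def gls_ar(y, X, p):
    """Approximate GLS via an AR(p) filter estimated from OLS residuals.

    Hypothetical helper: returns the filtered-regression coefficients and
    the estimated AR coefficients phi_1, ..., phi_p.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = len(y)
    # Step 1: ordinary least squares and its residuals.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    # Step 2: regress e_t on e_{t-1}, ..., e_{t-p} to estimate the AR(p) coefficients.
    lags = np.column_stack([resid[p - k : n - k] for k in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(lags, resid[p:], rcond=None)

    # Step 3: apply the filter 1 - phi_1 L - ... - phi_p L^p to y and to each column of X.
    def prewhiten(a):
        out = a[p:].copy()
        for k in range(1, p + 1):
            out -= phi[k - 1] * a[p - k : n - k]
        return out

    # Step 4: OLS on the filtered data approximates GLS.
    beta_gls, *_ = np.linalg.lstsq(prewhiten(X), prewhiten(y), rcond=None)
    return beta_gls, phi
```

In line with the rate condition $p=o(n^{1/4})$ stated in the abstract, p would be kept small relative to the sample size, for example a handful of lags for a few thousand observations.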
Analyzing the Rainfall Pattern in Honduras Through Non-Homogeneous Hidden Markov Models
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1091
Gustavo Alexis Sabillón, D. Zuanetti
One of the major climatic interests of the last decades has been to understand and describe the rainfall patterns of specific areas of the world as functions of other climate covariates. We do so for the historical climate monitoring data from Tegucigalpa, Honduras, using non-homogeneous hidden Markov models (NHMMs), which are dynamic models usually used to identify and predict heterogeneous regimes. For estimating the NHMM in an efficient and scalable way, we propose the stochastic Expectation-Maximization (EM) algorithm and a Bayesian method, and compare their performance on synthetic data. Although these methodologies have already been used for estimating several other statistical models, this is not the case for NHMMs, which are still widely fitted by the traditional EM algorithm. We observe that, under the tested conditions, the performance of the Bayesian and stochastic EM algorithms is similar, and we discuss their slight differences. Analyzing the Honduras rainfall data set, we identify three heterogeneous rainfall periods and select temperature and humidity as relevant covariates for explaining the dynamic relation among these periods.
Citations: 0
Network A/B Testing: Nonparametric Statistical Significance Test Based on Cluster-Level Permutation
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1112
Hongwei Shang, Xiaolin Shi, Bai Jiang
A/B testing is widely used for comparing two versions of a product and evaluating newly proposed product features. It is of great importance for decision-making and has been applied as a gold standard in the IT industry. It is essentially a form of two-sample statistical hypothesis testing. The average treatment effect (ATE) and the corresponding p-value can be obtained under certain assumptions. One key assumption in traditional A/B testing is the stable-unit-treatment-value assumption (SUTVA): there is no interference among different units. It means that the observation on one unit is unaffected by the particular assignment of treatments to the other units. Nonetheless, interference is very common in social network settings, where people communicate and spread information to their neighbors, so SUTVA is violated. Analysis ignoring this network effect will lead to biased estimation of the ATE. Most existing work focuses mainly on the design of experiments and data analysis in order to produce estimators with good performance with regard to bias and variance. Little attention has been paid to the calculation of p-values. We work on the calculation of p-values for the ATE estimator in network A/B tests. After a brief review of existing research methods on design of experiments based on graph cluster randomization and different ATE estimation methods, we propose a method for calculating p-values based on a permutation test at the cluster level. The effectiveness of the method relative to individual-level permutation is validated in a simulation study mimicking realistic settings.
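A minimal sketch of a cluster-level permutation test may help make the idea concrete. It assumes treatment was assigned to whole clusters, uses a plain difference in means as the ATE estimate (the abstract mentions several ATE estimators without singling one out here), and re-draws which clusters are treated while keeping the number of treated clusters fixed. The function name `cluster_permutation_pvalue` and the add-one correction are illustrative assumptions, not the authors' code.

```python
import random
from statistics import mean

def cluster_permutation_pvalue(y, cluster, treated_clusters, n_perm=2000, seed=0):
    """Two-sided permutation p-value with treatment labels shuffled per cluster.

    Hypothetical helper: y holds unit outcomes, cluster holds each unit's
    cluster id, and treated_clusters is the set of treated cluster ids.
    """
    rng = random.Random(seed)
    ids = sorted(set(cluster))
    treated = set(treated_clusters)

    def diff_in_means(treated_set):
        # Units inherit the treatment status of their cluster.
        t = [v for v, c in zip(y, cluster) if c in treated_set]
        u = [v for v, c in zip(y, cluster) if c not in treated_set]
        return mean(t) - mean(u)

    observed = diff_in_means(treated)
    extreme = 0
    for _ in range(n_perm):
        # Reassign treatment to whole clusters, keeping the number treated fixed.
        shuffled = set(rng.sample(ids, len(treated)))
        if abs(diff_in_means(shuffled)) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return observed, (1 + extreme) / (n_perm + 1)
```

Shuffling at the cluster level keeps units that may interfere with each other under the same label in every permutation, which is the distinction from individual-level permutation that the abstract draws.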
Citations: 1
Editorial: Advances in Network Data Science
Pub Date : 2023-01-01 DOI: 10.6339/23-jds213edi
Yuguo Chen, Daniel Sewell, Panpan Zhang, Xuening Zhu
This special issue features nine articles on “Advances in Network Data Science”. Data science is an interdisciplinary research field that uses scientific methods to extract knowledge and insights from structured and unstructured data across a broad range of domains. Network data are proliferating in many fields, and network data analysis has become a burgeoning research area in the data science community. Because network data are heterogeneous and complex, classical statistical approaches to network model fitting face many challenges, especially for large-scale network data. It is therefore crucial to develop advanced methodological and computational tools to cope with the challenges of massive and complex network data analyses. This special issue highlights some recent studies in network data analysis, showcasing a variety of contributions in statistical methodology, two real-world applications, a software package for network generation, and a survey on handling missing values in networks. Five articles are published in the Statistical Data Science Section. Wang and Resnick (2023) employed point processes to investigate the macroscopic growth dynamics of geographically concentrated regional networks. They discovered that during the startup phase, a self-exciting point process effectively modeled the growth process, and subsequently, the growth of links could be suitably described by a non-homogeneous Poisson process. Komolafe
Citations: 0