
Journal of data science : JDS, latest articles

Maximum Likelihood Estimation for Shape-restricted Single-index Hazard Models.
Pub Date : 2023-10-01 Epub Date: 2022-11-04 DOI: 10.6339/22-jds1061
Jing Qin, Yifei Sun, Ao Yuan, Chiung-Yu Huang

Single-index models are becoming increasingly popular in many scientific applications as they offer the advantages of flexibility in regression modeling as well as interpretable covariate effects. In the context of survival analysis, the single-index hazards models are natural extensions of the Cox proportional hazards models. In this paper, we propose a novel estimation procedure for single-index hazard models under a monotone constraint of the index. We apply the profile likelihood method to obtain the semiparametric maximum likelihood estimator, where the novelty of the estimation procedure lies in estimating the unknown monotone link function by embedding the problem in isotonic regression with exponentially distributed random variables. The consistency of the proposed semiparametric maximum likelihood estimator is established under suitable regularity conditions. Numerical simulations are conducted to examine the finite-sample performance of the proposed method. An analysis of breast cancer data is presented for illustration.
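As a rough illustration of the isotonic-regression step mentioned in the abstract, the sketch below implements the generic pool-adjacent-violators algorithm (PAVA) for a weighted, non-decreasing least-squares fit. It is not the authors' estimator: the function name `pava` and the toy data are assumptions made here, and the paper's formulation with exponentially distributed variables is not reproduced.

```python
def pava(y, w=None):
    """Weighted non-decreasing least-squares fit via pool-adjacent-violators."""
    n = len(y)
    if w is None:
        w = [1.0] * n
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while neighbouring blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand the pooled blocks back to one fitted value per observation.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical toy responses, already sorted by the value of the index.
fitted = pava([1.0, 3.0, 2.0, 4.0, 3.5])
print(fitted)
```

Pooling adjacent violators replaces each out-of-order run with its weighted mean, which is why the fit is piecewise constant.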

Pages: 681-695 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11017303/pdf/
Citations: 0
Central Posterior Envelopes for Bayesian Functional Principal Component Analysis.
Pub Date : 2023-10-01 Epub Date: 2023-01-19 DOI: 10.6339/23-jds1085
Joanna Boland, Donatello Telesca, Catherine Sugar, Shafali Jeste, Abigail Dickinson, Charlotte DiStefano, Damla Şentürk

Bayesian methods provide direct inference in functional data analysis applications without reliance on bootstrap techniques. A major tool in functional data applications is functional principal component analysis, which decomposes the data around a common mean function and identifies leading directions of variation. Bayesian functional principal components analysis (BFPCA) provides uncertainty quantification on the estimated functional model components via the posterior samples obtained. We propose central posterior envelopes (CPEs) for BFPCA based on functional depth as a descriptive visualization tool to summarize variation in the posterior samples of the estimated functional model components, contributing to uncertainty quantification in BFPCA. The proposed BFPCA relies on a latent factor model and targets model parameters within a mixed effects modeling framework using modified multiplicative gamma process shrinkage priors on the variance components. Functional depth provides a center-outward order to a sample of functions. We utilize modified band depth and modified volume depth for ordering a sample of functions and surfaces, respectively, to arrive at CPEs of the mean and eigenfunctions within the BFPCA framework. The proposed CPEs are showcased in extensive simulations. Finally, the proposed CPEs are applied to the analysis of a sample of power spectral densities (PSD) from resting state electroencephalography (EEG), where they lead to novel insights on diagnostic group differences among children diagnosed with autism spectrum disorder and their typically developing peers across age.
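To make the center-outward ordering concrete, here is a minimal sketch of modified band depth for curves observed on a shared grid, using bands formed by pairs of curves. The function name `modified_band_depth` and the three toy curves are assumptions made for illustration; the paper applies depth to posterior samples of model components, which this sketch does not attempt.

```python
from itertools import combinations


def modified_band_depth(curves):
    """Modified band depth with bands defined by pairs of curves.

    For each curve, average over all pairs the fraction of grid points
    at which the curve lies inside the band spanned by the pair.
    """
    n = len(curves)
    n_points = len(curves[0])
    pairs = list(combinations(range(n), 2))
    depth = [0.0] * n
    for j, k in pairs:
        for i in range(n):
            inside = sum(
                min(curves[j][t], curves[k][t])
                <= curves[i][t]
                <= max(curves[j][t], curves[k][t])
                for t in range(n_points)
            )
            depth[i] += inside / n_points
    return [d / len(pairs) for d in depth]


# Three hypothetical parallel curves; the middle one should be deepest.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
sample = [[g + shift for g in grid] for shift in (-1.0, 0.0, 1.0)]
depth = modified_band_depth(sample)
# Center-outward order: indices sorted from deepest to most outlying.
center_outward = sorted(range(len(sample)), key=lambda i: -depth[i])
print(depth, center_outward)
```

An envelope in this spirit would be the pointwise range of the deepest fraction of curves, though how the paper constructs its CPEs is not spelled out in the abstract.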

Pages: 715-734 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11178334/pdf/
Citations: 0
Optimal Physician Shared-Patient Networks and the Diffusion of Medical Technologies.
Pub Date : 2023-07-01 Epub Date: 2022-08-30 DOI: 10.6339/22-jds1064
A James O'Malley, Xin Ran, Chuankai An, Daniel Rockmore

Social network analysis has created a productive framework for the analysis of the histories of patient-physician interactions and physician collaboration. Notable is the construction of networks based on the data of "referral paths" - sequences of patient-specific temporally linked physician visits - in this case, culled from a large set of Medicare claims data in the United States. Network constructions depend on a range of choices regarding the underlying data. In this paper we introduce the use of a five-factor experiment that produces 80 distinct projections of the bipartite patient-physician mixing matrix to a unipartite physician network derived from the referral path data, which is further analyzed at the level of the 2,219 hospitals in the final analytic sample. We summarize the networks of physicians within a given hospital using a range of directed and undirected network features (quantities that summarize structural properties of the network such as its size, density, and reciprocity). The different projections and their underlying factors are evaluated in terms of the heterogeneity of the network features across the hospitals. We also evaluate the projections relative to their ability to improve the predictive accuracy of a model estimating a hospital's adoption of implantable cardiac defibrillators, a novel cardiac intervention. Because it optimizes the knowledge learned about the overall and interactive effects of the factors, we anticipate that the factorial design setting for network analysis may be useful more generally as a methodological advance in network analysis.
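The projection from patient-level referral paths to a physician network can be sketched as follows. This shows only one hypothetical projection rule (a directed edge between consecutive visits of the same patient), not any of the 80 projections studied in the paper; the function names and the toy paths are assumptions. Density and reciprocity are computed with their usual definitions for simple directed graphs.

```python
def project_referral_paths(paths):
    """Directed physician network: edge (u, v) if some patient saw u, then v next."""
    edges = set()
    for visits in paths.values():
        for u, v in zip(visits, visits[1:]):
            if u != v:  # ignore repeat visits to the same physician
                edges.add((u, v))
    return edges


def density(edges, n_nodes):
    """Share of the n * (n - 1) possible directed edges that are present."""
    return len(edges) / (n_nodes * (n_nodes - 1))


def reciprocity(edges):
    """Share of directed edges whose reverse edge is also present."""
    mutual = sum(1 for (u, v) in edges if (v, u) in edges)
    return mutual / len(edges)


# Hypothetical referral paths: patient -> temporally ordered physician visits.
paths = {
    "patient_1": ["A", "B", "C"],
    "patient_2": ["B", "A"],
    "patient_3": ["C", "A", "B"],
}
edges = project_referral_paths(paths)
nodes = {physician for visits in paths.values() for physician in visits}
print(sorted(edges), density(edges, len(nodes)), reciprocity(edges))
```

Changing the rule, for example linking all physicians on a path rather than only consecutive ones, yields a different network from the same data, which is the kind of choice the factorial experiment varies.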

Pages: 578-598 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10956597/pdf/
Citations: 0
Generating General Preferential Attachment Networks with R Package wdnet
Pub Date : 2023-01-31 DOI: 10.6339/23-jds1110
Yelie Yuan, Tiandong Wang, Jun Yan, Panpan Zhang
Preferential attachment (PA) network models have a wide range of applications in various scientific disciplines. Efficient generation of large-scale PA networks helps uncover their structural properties and facilitates the development of associated analytical methodologies. Existing software packages provide only limited functions for this purpose, with restricted configurations and efficiency. We present a generic, user-friendly implementation of weighted, directed PA network generation with R package wdnet. The core algorithm is based on an efficient binary tree approach. The package further allows adding multiple edges at a time, heterogeneous reciprocal edges, and user-specified preference functions. The engine under the hood is implemented in C++. Usage of the package is illustrated with detailed explanations. A benchmark study shows that wdnet is efficient for generating general PA networks not available in other packages. In restricted settings that can be handled by existing packages, wdnet provides comparable efficiency.
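For readers unfamiliar with PA models, the following is a minimal sketch of a directed, unweighted PA generator in which each new node sends one edge to an existing node chosen with probability proportional to its in-degree plus an offset. It is written in Python for illustration and does not use or mirror the wdnet API; the function name `generate_pa_edges`, the `delta` offset, and the seed graph are assumptions. Sampling here is a linear scan per step, not the binary tree approach described in the abstract.

```python
import random


def generate_pa_edges(n_steps, delta=1.0, seed=0):
    """Grow a directed PA network by one node and one edge per step."""
    rng = random.Random(seed)
    edges = [(1, 0)]  # assumed seed graph: node 1 points to node 0
    in_degree = [1, 0]  # in_degree[i] is the in-degree of node i
    for new_node in range(2, n_steps + 2):
        # Preference function: in-degree plus a constant offset.
        weights = [d + delta for d in in_degree]
        target = rng.choices(range(len(in_degree)), weights=weights, k=1)[0]
        edges.append((new_node, target))
        in_degree[target] += 1
        in_degree.append(0)  # the new node starts with no incoming edges
    return edges, in_degree


edges, in_degree = generate_pa_edges(n_steps=100)
print(len(edges), max(in_degree))
```

Each step costs time linear in the current number of nodes, so this sketch scales quadratically overall; a tree-based sampler of the kind the abstract mentions is one way to avoid that cost.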
Citations: 2
Random Forest of Interaction Trees for Estimating Individualized Treatment Regimes with Ordered Treatment Levels in Observational Studies
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1084
Justine Thorp, R. Levine, Luo Li, J. Fan
Traditional methods for evaluating a potential treatment have focused on the average treatment effect. However, there exist situations where individuals can experience significantly heterogeneous responses to a treatment. In these situations, one needs to account for the differences among individuals when estimating the treatment effect. Li et al. (2022) proposed a method based on random forest of interaction trees (RFIT) for a binary or categorical treatment variable, while incorporating the propensity score in the construction of random forest. Motivated by the need to evaluate the effect of tutoring sessions at a Math and Stat Learning Center (MSLC), we extend their approach to an ordinal treatment variable. Our approach improves upon RFIT for multiple treatments by incorporating the ordered structure of the treatment variable into the tree growing process. To illustrate the effectiveness of our proposed method, we conduct simulation studies where the results show that our proposed method has a lower mean squared error and higher optimal treatment classification, and is able to identify the most important variables that impact the treatment effect. We then apply the proposed method to estimate how the number of visits to the MSLC impacts an individual student’s probability of passing an introductory statistics course. Our results show that every student is recommended to go to the MSLC at least once and some can drastically improve their chance of passing the course by going the optimal number of times suggested by our analysis.
Citations: 2
Quantifying Gender Disparity in Pre-Modern English Literature using Natural Language Processing
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1100
M. Kejriwal, Akarsh Nagaraj
Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the data science and Natural Language Processing (NLP) communities have been proposed for measuring such disparity at scale using empirically rigorous methodologies. In this article, we contribute to this line of research by studying gender disparity in 2,443 copyright-expired literary texts published in the pre-modern period, defined in this work as the period ranging from the beginning of the nineteenth through the early twentieth century. Using a replicable data science methodology relying on publicly available and established NLP components, we extract three different gendered character prevalence measures within these texts. We use an extensive set of statistical tests to robustly demonstrate a significant disparity between the prevalence of female characters and male characters in pre-modern literature. We also show that the proportion of female characters is significantly higher in female-authored texts than in male-authored texts. However, regression-based analysis shows that, over the 120-year period covered by the corpus, female character prevalence does not change significantly over time, and remains below the parity level of 50%, regardless of the gender of the author. Qualitative analyses further show that descriptions associated with female characters across the corpus are markedly different (and stereotypical) from the descriptions associated with male characters.
Citations: 0
Association Between Body Fat and Body Mass Index from Incomplete Longitudinal Proportion Data: Findings from the Fels Study
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1104
Xin Tong, Seohyun Kim, D. Bandyopadhyay, Shumei S. Sun
Obesity rates continue to rise, particularly in the US, and obesity is the underlying cause of several comorbidities, including but not limited to high blood pressure, high cholesterol, diabetes, heart disease, stroke, and cancers. To monitor obesity, body mass index (BMI) and proportion body fat (PBF) are two commonly used measurements. Although BMI and PBF change over an individual's lifespan and their relationship may also change dynamically, existing work has mostly remained cross-sectional or has modeled BMI and PBF separately. A combined longitudinal assessment is expected to be more effective in unravelling their complex interplay. To address this gap, we consider Bayesian cross-domain latent growth curve models within a structural equation modeling framework, which simultaneously handle issues such as individually varying time metrics, proportion data, and potentially missing-not-at-random data for joint assessment of the longitudinal changes of BMI and PBF. Through simulation studies, we observe that our proposed models and estimation method yield parameter estimates with small bias and mean squared error in general; however, a mis-specified missing data mechanism may cause inaccurate and inefficient parameter estimates. Furthermore, we demonstrate application of our method to a motivating longitudinal obesity study, controlling for both time-invariant (such as sex) and time-varying (such as diastolic and systolic blood pressure, biceps skinfold, bioelectrical impedance, and waist circumference) covariates in separate models. Under time-invariance, we observe that the initial BMI level and the rate of change in BMI influenced PBF. However, in the presence of time-varying covariates, only the initial BMI level influenced the initial PBF. The added-on selection model estimation indicated that observations with higher PBF values were less likely to be missing.
Citations: 0
The Effects of County-Level Socioeconomic and Healthcare Factors on Controlling COVID-19 in the Southern and Southeastern United States
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1111
Jackson Barth, Guanqing Cheng, Webb Williams, Ming Zhang, H. K. T. Ng
This paper aims to determine the effects of socioeconomic and healthcare factors on the performance of controlling COVID-19 in both the Southern and Southeastern United States. This analysis will provide government agencies with information to determine what communities need additional COVID-19 assistance, to identify counties that effectively control COVID-19, and to apply effective strategies on a broader scale. The statistical analysis uses data from 328 counties with a population of more than 65,000 from 13 states. We define a new response variable by considering infection and mortality rates to capture how well each county controls COVID-19. We collect 14 factors from the 2019 American Community Survey Single-Year Estimates and obtain county-level infection and mortality rates from USAfacts.org. We use the least absolute shrinkage and selection operator (LASSO) regression to fit a multiple linear regression model and develop an interactive system programmed in R shiny to deliver all results. The interactive system at https://asa-competition-smu.shinyapps.io/COVID19/ provides many options for users to explore our data, models, and results.
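The LASSO fit mentioned in the abstract can be illustrated with a small coordinate-descent sketch that minimizes (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1. This is a generic textbook version in Python, not the authors' R code: there is no intercept or standardization, and the function names, the penalty value, and the toy design matrix are assumptions.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning exactly 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the LASSO objective stated above."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0  # squared norm of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Hypothetical toy data: a strong first predictor and a weak second one.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.05, 1.95, -1.95, -2.05]
beta = lasso_coordinate_descent(X, y, lam=0.1)
print(beta)
```

The soft-threshold step is what sets weak coefficients exactly to zero, which is how LASSO performs variable selection among candidate factors.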
Citations: 0
Efficient Bayesian High-Dimensional Classification via Random Projection with Application to Gene Expression Data
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1102
Abhisek Chakraborty
Inspired by the impressive successes of compressed sensing-based machine learning algorithms, data augmentation-based efficient Gibbs samplers for Bayesian high-dimensional classification models are developed by compressing the design matrix to a much lower dimension. Ardent care is exercised in the choice of the projection mechanism, and an adaptive voting rule is employed to reduce sensitivity to the random projection matrix. Focusing on the high-dimensional Probit regression model, we note that the naive implementation of the data augmentation-based Gibbs sampler is not robust to the presence of collinearity in the design matrix, a setup ubiquitous in $n<p$ problems. We demonstrate that a simple fix based on joint updates of parameters in the latent space circumvents this issue. With a computationally efficient MCMC scheme in place, we introduce an ensemble classifier by creating R (∼25–50) projected copies of the design matrix, and subsequently running R classification models with the R projected design matrices in parallel. We combine the output from the R replications via an adaptive voting scheme. Our scheme is inherently parallelizable and capable of taking advantage of modern computing environments often equipped with multiple cores. The empirical success of our methodology is illustrated in elaborate simulations and gene expression data applications. We also extend our methodology to a high-dimensional logistic regression model and carry out numerical studies to showcase its efficacy.
Citations: 0
Editorial: Symposium Data Science and Statistics 2022
Pub Date : 2023-01-01 DOI: 10.6339/23-jds212edi
C. Bowen, M. Grosskopf
Citations: 1