Latest articles: Journal of data science : JDS

The Impact of COVID-19 on Subjective Well-Being: Evidence from Twitter Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1066
Tiziana Carpi, Airo Hino, S. Iacus, G. Porro
This study analyzes the impact of the COVID-19 pandemic on subjective well-being, as measured through Twitter, for Japan and Italy. In the first nine months of 2020, the Twitter indicators dropped by 11.7% for Italy and 8.3% for Japan compared to the last two months of 2019, and even more compared to their historical means. To understand what affected the Twitter mood so strongly, the study considers a pool of potential factors, including climate and air quality data, the number of COVID-19 cases and deaths, Facebook COVID-19 and flu-like symptoms global survey data, coronavirus-related Google search data, policy intervention measures, human mobility data, macroeconomic variables, and health and stress proxy variables. The study proposes a framework to analyze and assess the relative impact of these external factors on the dynamics of Twitter mood, and further implements a structural model to describe the underlying concept of subjective well-being. It turns out that prolonged mobility restrictions, flu and COVID-like symptoms, economic uncertainty, and low-quality social interactions have a negative impact on well-being.
Citations: 2
What Kind of Music Do You Like? A Statistical Analysis of Music Genre Popularity Over Time
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1040
Aimée M. Petitbon, D. B. Hitchcock
Citations: 2
Sampling-based Gaussian Mixture Regression for Big Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1057
Joochul Lee, E. Schifano, Haiying Wang
This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce large data computational tasks. A general estimator based on a subsample is investigated, and its asymptotic normality is established. We assign optimal subsampling probabilities to data points that minimize the asymptotic mean squared errors of the general estimator and linearly transformed estimators. Since the proposed probabilities depend on unknown parameters, an implementable algorithm is developed. We first approximate the optimal subsampling probabilities using a pilot sample. After that, we select a subsample using the approximated subsampling probabilities and compute estimates using the subsample. We evaluate the proposed method in a simulation study and present a real data example using appliance energy data.
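The pilot-then-subsample workflow described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' estimator: ordinary least squares stands in for the finite mixture of regressions, and the subsampling scores (absolute pilot residual times covariate norm, mixed with a uniform component for stability) are a generic choice rather than the paper's optimal probabilities. The function name `two_step_subsample_ols` and the `mix` parameter are hypothetical.

```python
import numpy as np

def two_step_subsample_ols(X, y, pilot_size, sub_size, rng, mix=0.1):
    """Pilot-then-subsample least squares (illustrative stand-in only)."""
    n = X.shape[0]
    # Step 1: uniform pilot sample and a rough pilot estimate.
    pilot = rng.choice(n, size=pilot_size, replace=False)
    beta_pilot, *_ = np.linalg.lstsq(X[pilot], y[pilot], rcond=None)
    # Step 2: approximate nonuniform probabilities from the pilot fit.
    # Assumed score: |residual| * ||x||, mixed with a uniform component
    # so that no probability is close to zero.
    scores = np.abs(y - X @ beta_pilot) * np.linalg.norm(X, axis=1)
    probs = (1.0 - mix) * scores / scores.sum() + mix / n
    # Step 3: draw the subsample and fit with inverse-probability weights.
    idx = rng.choice(n, size=sub_size, replace=True, p=probs)
    sw = np.sqrt(1.0 / probs[idx])
    beta_hat, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
    return beta_hat, probs
```

The inverse-probability weights in step 3 correct for the nonuniform sampling, so the weighted fit targets the same parameter as a fit on the full data.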
Citations: 0
A Joint Analysis for Field Goal Attempts and Percentages of Professional Basketball Players: Bayesian Nonparametric Resource
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1062
Eliot Wong-Toi, Hou‐Cheng Yang, Weining Shen, Guanyu Hu
Understanding shooting patterns among different players is a fundamental problem in basketball game analysis. In this paper, we quantify the shooting pattern via the field goal attempts and percentages over twelve non-overlapping regions around the front court. A joint Bayesian nonparametric mixture model is developed to find latent clusters of players based on their shooting patterns. We apply our proposed model to learn the heterogeneity among selected players from National Basketball Association (NBA) games over the 2018–2019 regular season and the 2019–2020 bubble season. Thirteen clusters are identified for the 2018–2019 regular season and seven clusters for the 2019–2020 bubble season. We further examine the shooting patterns of players in these clusters and discuss their relation to other available information about the players. The results offer new insights into the effect of the NBA COVID bubble and may provide useful guidance for players' shot selection and for teams' in-game and recruiting strategy planning.
Citations: 0
Hierarchical Ridge Regression for Incorporating Prior Information in Genomic Studies.
Pub Date : 2022-01-01 Epub Date: 2021-12-13 DOI: 10.6339/21-jds1030
Eric S Kawaguchi, Sisi Li, Garrett M Weaver, Juan Pablo Lewinger

There is a great deal of prior knowledge about gene function and regulation in the form of annotations or prior results that, if directly integrated into individual prognostic or diagnostic studies, could improve predictive performance. For example, in a study to develop a predictive model for cancer survival based on gene expression, effect sizes from previous studies or the grouping of genes based on pathways constitute such prior knowledge. However, this external information is typically only used post-analysis to aid in the interpretation of any findings. We propose a new hierarchical two-level ridge regression model that can integrate external information in the form of "meta features" to predict an outcome. We show that the model can be fit efficiently using cyclic coordinate descent by recasting the problem as a single-level regression model. In a simulation-based evaluation we show that the proposed method outperforms standard ridge regression and competing methods that integrate prior information, in terms of prediction performance when the meta features are informative on the mean of the features, and that there is no loss in performance when the meta features are uninformative. We demonstrate our approach with applications to the prediction of chronological age based on methylation features and breast cancer mortality based on gene expression features.
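The recasting of a two-level ridge model as a single-level regression can be sketched as follows. The parameterization is an assumption made for illustration: feature effects `beta` are shrunk toward `Z @ alpha`, where `Z` holds the meta features, with penalties `lam1` on `beta - Z @ alpha` and `lam2` on `alpha`. The sketch solves the resulting single-level problem in closed form rather than with the cyclic coordinate descent the abstract mentions, and it omits an intercept. The function name `hierarchical_ridge` is hypothetical.

```python
import numpy as np

def hierarchical_ridge(X, y, Z, lam1, lam2):
    """Two-level ridge fit via an equivalent single-level ridge problem.

    Assumed objective (illustrative parameterization):
        ||y - X beta||^2 + lam1 * ||beta - Z alpha||^2 + lam2 * ||alpha||^2
    Substituting gamma = beta - Z alpha gives an ordinary ridge regression
    on the stacked design [X, X Z] with per-block penalties.
    """
    p, q = Z.shape
    design = np.hstack([X, X @ Z])
    penalty = np.concatenate([np.full(p, lam1), np.full(q, lam2)])
    theta = np.linalg.solve(design.T @ design + np.diag(penalty), design.T @ y)
    gamma, alpha = theta[:p], theta[p:]
    beta = gamma + Z @ alpha  # map back to the original feature effects
    return beta, alpha
```

Because the stacked problem is a standard penalized least squares fit, any ridge solver that accepts per-coefficient penalties could replace the closed-form solve here.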

Journal of data science : JDS, vol. 20, no. 1, pp. 34-50. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9581069/pdf/
Citations: 0
Accelerating Fixed-Point Algorithms in Statistics and Data Science: A State-of-Art Review
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1051
Bohao Tang, Nicholas C. Henderson, Ravi Varadhan
Fixed-point algorithms are popular in statistics and data science due to their simplicity, guaranteed convergence, and applicability to high-dimensional problems. Well-known examples include the expectation-maximization (EM) algorithm, majorization-minimization (MM), and gradient-based algorithms such as gradient descent (GD) and proximal gradient descent. A characteristic weakness of these algorithms is their slow convergence. We discuss several state-of-the-art techniques for accelerating their convergence. We demonstrate and evaluate these techniques in terms of their efficiency and robustness in six distinct applications. Among the acceleration schemes, SQUAREM shows robust acceleration with a mean 18-fold speedup. DAAREM and restarted-Nesterov schemes also demonstrate consistently impressive accelerations. Thus, it is possible to accelerate the original fixed-point algorithm by using one of the SQUAREM, DAAREM, or restarted-Nesterov acceleration schemes. We describe implementation details and software packages to facilitate the application of the acceleration schemes. We also discuss strategies for selecting a particular acceleration scheme for a given problem.
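To make the idea of accelerating a fixed-point map concrete, here is a minimal sketch of a squared-extrapolation step in the spirit of SQUAREM. It is a simplified variant written for illustration, not the implementation in any published package: the step length is the ratio of norms `-||r|| / ||v||`, and the safeguard (fall back to two plain iterations whenever the extrapolated point does not reduce the residual) is an assumption of this sketch. The function name `squarem` is hypothetical.

```python
import numpy as np

def squarem(F, x0, tol=1e-10, max_iter=1000):
    """Squared-extrapolation acceleration of x <- F(x) (simplified sketch)."""
    x = np.asarray(x0, dtype=float)
    Fx = F(x)
    for it in range(max_iter):
        r = Fx - x                      # residual of the fixed-point map
        r_norm = np.linalg.norm(r)
        if r_norm < tol:
            return x, it
        x2 = F(Fx)                      # two plain iterations from x
        v = (x2 - Fx) - r               # second difference of the iterates
        v_norm = np.linalg.norm(v)
        if v_norm > 0:
            alpha = -r_norm / v_norm    # step length (ratio of norms)
            x_ext = x - 2.0 * alpha * r + alpha**2 * v
            cand = F(x_ext)             # stabilizing application of F
            F_cand = F(cand)
            # Assumed safeguard: accept only if the residual shrinks.
            if np.all(np.isfinite(F_cand)) and np.linalg.norm(F_cand - cand) < r_norm:
                x, Fx = cand, F_cand
                continue
        x, Fx = x2, F(x2)               # fallback: plain double step
    return x, max_iter
```

For example, on a linear map `F(x) = M @ x + b` whose slowest mode contracts by only 0.99 per step, the plain iteration needs well over a thousand steps to reach a residual of `1e-10`, while the extrapolated scheme typically gets there in a handful.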
Citations: 1
Editorial: Data Science Meets Social Sciences
Pub Date : 2022-01-01 DOI: 10.6339/22-jds203edi
E. Erosheva, Shahryar Minhas, Gongjun Xu, Ran Xu
Citations: 0
Propensity Score Modeling in Electronic Health Records with Time-to-Event Endpoints: Application to Kidney Transplantation
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1046
Jonathan W. Yu, D. Bandyopadhyay, Shu Yang, Le Kang, G. Gupta
For large observational studies lacking a control group (unlike randomized controlled trials, RCTs), propensity scores (PS) are often the method of choice to account for pre-treatment confounding in baseline characteristics, and thereby avoid substantial bias in treatment effect estimation. The vast majority of PS techniques focus on average treatment effect estimation, without any clear consensus on how to account for confounders, especially in a multiple-treatment setting. Furthermore, for time-to-event outcomes, the analytical framework is further complicated by high censoring rates (sometimes due to non-susceptibility of study units to a disease), imbalance between treatment groups, and the clustered nature of the data (where survival outcomes appear in groups). Motivated by a right-censored kidney transplantation dataset derived from the United Network for Organ Sharing (UNOS), we investigate and compare two recent promising PS procedures, (a) the generalized boosted model (GBM) and (b) the covariate-balancing propensity score (CBPS), in an attempt to decouple the causal effects of treatments (here, study subgroups such as hepatitis C virus (HCV) positive/negative donors and positive/negative recipients) on time to death of kidney recipients due to kidney failure post-transplantation. For estimation, we employ a two-step procedure that addresses various complexities observed in the UNOS database within a unified paradigm. First, to adjust for the large number of confounders across the multiple subgroups, we fit multinomial PS models via procedures (a) and (b). In the next stage, the estimated PS is incorporated into the likelihood of a semi-parametric cure rate Cox proportional hazards frailty model via inverse probability of treatment weighting, adjusted for multi-center clustering and excess censoring. Our data analysis reveals a more informative and superior performance of the full model in terms of treatment effect estimation, over sub-models that relax the various features of the event time dataset.
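The weighting step in the second stage can be illustrated with a short sketch. It assumes the multinomial propensity scores have already been estimated (for instance by GBM or CBPS, which are not implemented here) and are supplied as an `n x K` matrix whose rows sum to one; it does not fit the cure rate frailty model itself. The function name `iptw_weights` and the optional stabilization by marginal treatment frequencies are assumptions of this sketch.

```python
import numpy as np

def iptw_weights(ps, treatment, stabilize=True):
    """Inverse probability of treatment weights for K treatment groups.

    ps        : (n, K) array of estimated propensity scores.
    treatment : length-n integer array of observed groups in {0, ..., K-1}.
    """
    ps = np.asarray(ps, dtype=float)
    treatment = np.asarray(treatment, dtype=int)
    n, k = ps.shape
    # Probability each unit had of receiving the group it actually received.
    p_observed = ps[np.arange(n), treatment]
    weights = 1.0 / p_observed
    if stabilize:
        # Assumed stabilization: multiply by the marginal group frequency.
        marginal = np.bincount(treatment, minlength=k) / n
        weights = weights * marginal[treatment]
    return weights
```

Each unit's weight would then multiply its contribution to the survival model's likelihood, so that the weighted sample resembles one in which group membership is unrelated to the measured confounders.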
Citations: 1
An Effective Tensor Regression with Latent Sparse Regularization
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1048
Ko-Shin Chen, Tingyang Xu, Guannan Liang, Qianqian Tong, Minghu Song, J. Bi
As data acquisition technologies advance, longitudinal analysis faces the challenges of exploring complex feature patterns in high-dimensional data and modeling potential temporally lagged effects of features on a response. We propose a tensor-based model to analyze multidimensional data. It simultaneously discovers patterns in features and reveals whether features observed at past time points have an impact on current outcomes. The model coefficient, a k-mode tensor, is decomposed into a summation of k tensors of the same dimension. We introduce a so-called latent F-1 norm that can be applied to the coefficient tensor to perform structured selection of features. Specifically, features are selected along each mode of the tensor. The proposed model takes into account within-subject correlations by employing a tensor-based quadratic inference function. An asymptotic analysis shows that our model can identify the true support as the sample size approaches infinity. To solve the corresponding optimization problem, we develop a linearized block coordinate descent algorithm and prove its convergence for a fixed sample size. Computational results on synthetic datasets and real-life fMRI and EEG datasets demonstrate the superior performance of the proposed approach over existing techniques.
Citations: 1
Does Aging Make Us Grittier? Disentangling the Age and Generation Effect on Passion and Perseverance
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1041
S. Sanders, Nuwan Indika Millagaha Gedara, Bhavneet Walia, C. Boudreaux, M. Silverstein
Defined as perseverance and passion for long-term goals, grit represents an important psychological skill for goal attainment in academic and less-stylized settings. An outstanding issue of primary importance is whether age affects grit, ceteris paribus. The 12-item Grit-O Scale and the 8-item Grit-S Scale (from which grit scores are calculated) have not existed for a long period of time. Therefore, Duckworth (2016, p. 37) states in her book, Grit: The Power of Passion and Perseverance, that “we need a different kind of study” to distinguish between rival explanations, namely whether generational cohort or age is more important in explaining variation in grit across individuals. Despite this clear data constraint, we obtain a glimpse into the future in the present study by using a within- and between-generational-cohort age difference-in-difference approach. By specifying generation as a categorical variable and age-in-generation as a count variable in the same regression specifications, we are able to account for the effects of variation in age and generation simultaneously, while avoiding problems of multicollinearity that would hinder post-regression statistical inference. We find robust, significant evidence that the negative-parabolic shape of the grit-age profile is driven by generational variation and not by age variation. Our findings suggest that, absent a grit-mindset intervention, individual-level grit may be persistent over time.
砂砾被定义为对长期目标的毅力和激情,在学术和不那么程式化的环境中,砂砾代表了实现目标的重要心理技能。最重要的一个突出问题是,在其他条件不变的情况下,年龄是否会影响砂砾。12项“勇气- 0”量表和8项“勇气- s”量表——用来计算勇气得分——已经不存在很长时间了。因此,Duckworth (2016, p. 37)在她的书《毅力:毅力的力量和激情》中指出,“我们需要一种不同的研究”来区分不同的解释,即世代或年龄在解释个体之间的毅力差异方面更重要。尽管有这种明确的数据限制,我们在本研究中通过使用代际队列内和代际之间的年龄差异方法对未来进行了一瞥。通过在相同的回归规范中指定世代作为分类变量,代中年龄作为计数变量,我们能够同时解释年龄和世代变化的影响,同时避免多重共线性问题,这将阻碍回归后的统计推断。我们得出了强有力的、有意义的证据,表明砂年龄剖面的负抛物线形状是由代际变化而不是年龄变化驱动的。我们的研究结果表明,如果没有勇气心态的干预,个人层面的勇气可能会持续一段时间。
Citations: 2