Journal of data science : JDS, latest articles

EMixed: Probabilistic Multi-Omics Cellular Deconvolution of Bulk Omics Data.
Pub Date : 2025-02-26 DOI: 10.6339/25-jds1170
Manqi Cai, Kangyi Zhao, Penghui Huang, Juan C Celedón, Chris McKennan, Wei Chen, Jiebiao Wang

Cellular deconvolution is a key approach to deciphering the complex cellular makeup of tissues by inferring the composition of cell types from bulk data. Traditionally, deconvolution methods have focused on a single molecular modality, relying either on RNA sequencing (RNA-seq) to capture gene expression or on DNA methylation (DNAm) to reveal epigenetic profiles. While these single-modality approaches have provided important insights, they often lack the depth needed to fully understand the intricacies of cellular compositions, especially in complex tissues. To address these limitations, we introduce EMixed, a versatile framework designed for both single-modality and multi-omics cellular deconvolution. EMixed models raw RNA counts and DNAm counts or frequencies via allocation models that assign RNA transcripts and DNAm reads to cell types, and uses an expectation-maximization (EM) algorithm to estimate parameters. Benchmarking results demonstrate that EMixed significantly outperforms existing methods across both single-modality and multi-modality applications, underscoring the broad utility of this approach in enhancing our understanding of cellular heterogeneity.
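
As a rough illustration of the allocation idea, the sketch below runs EM on a single modality with known cell-type profiles: each bulk transcript is softly assigned to a cell type (E-step), and the proportions are re-estimated from the allocated counts (M-step). It is a simplified stand-in, not the EMixed model itself; the function name `em_proportions`, the fixed reference `profiles`, and the toy numbers are assumptions made for this example.

```python
def em_proportions(counts, profiles, n_iter=200):
    """Estimate cell-type proportions from bulk counts by EM.

    counts[g]      : bulk count observed for gene g
    profiles[k][g] : assumed known probability that a transcript from
                     cell type k belongs to gene g (each row sums to 1)
    """
    n_types = len(profiles)
    pi = [1.0 / n_types] * n_types  # start from equal proportions
    for _ in range(n_iter):
        allocated = [0.0] * n_types
        for g, y in enumerate(counts):
            # E-step: posterior probability that a gene-g transcript
            # originated from each cell type
            weights = [pi[k] * profiles[k][g] for k in range(n_types)]
            norm = sum(weights)
            if norm == 0.0:
                continue  # no cell type can produce this gene; skip it
            for k in range(n_types):
                allocated[k] += y * weights[k] / norm
        # M-step: proportions are the shares of allocated transcripts
        total = sum(allocated)
        pi = [a / total for a in allocated]
    return pi


# Toy bulk sample whose gene frequencies match a 25% / 75% mixture.
profiles = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
bulk_counts = [250, 275, 475]
estimated = em_proportions(bulk_counts, profiles)
```

In this toy setting the estimates are shares of transcripts rather than shares of cells, and the reference profiles are treated as fixed; a full model would also have to estimate further parameters and handle the DNAm modality.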

Publication Type : Journal Article PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12530062/pdf/
Citations: 0
An Innovative Method of Singular Spectrum Analysis to Conduct Gap-filling and Denoising on Time Series Data.
Pub Date : 2025-01-28 DOI: 10.6339/25-jds1164
James J Yang, Anne Buu

Heart rate data collected from wearable devices - one type of time series data - could provide insights into activities, stress levels, and health. Yet, consecutive missing segments (i.e., gaps) that commonly occur due to improper device placement or device malfunction could distort the temporal patterns inherent in the data and undermine the validity of downstream analyses. This study proposes an innovative iterative procedure to fill gaps in time series data that capitalizes on the denoising capability of Singular Spectrum Analysis (SSA) and eliminates SSA's requirement of pre-specifying the window length and number of groups. The results of simulations demonstrate that the performance of SSA-based gap-filling methods depends on the choice of window length, number of groups, and the percentage of missing values. In contrast, the proposed method consistently achieves the lowest rates of reconstruction error and gap-filling error across a variety of combinations of the factors manipulated in the simulations. The simulation findings also highlight that the commonly recommended long window length - half of the time series length - may not apply to time series with varying frequencies such as heart rate data. The initialization step of the proposed method that involves a large window length and the first four singular values in the iterative singular value decomposition process not only avoids convergence issues but also facilitates imputation accuracy in subsequent iterations. The proposed method provides the flexibility for researchers to conduct gap-filling solely or in combination with denoising on time series data and thus widens the applications.
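
For reference, the classical iterative SSA gap-filling scheme with fixed tuning parameters can be written as below. This is a sketch of the standard procedure in which the window length $L$ and the number of retained components $r$ are chosen in advance; it is not the proposed method, whose details the abstract does not give. The notation ($N$, $L$, $K$, $r$, $\mathcal{M}$, $A_t$) is assumed here for illustration.

```latex
% Series x_1, ..., x_N with missing index set \mathcal{M};
% window length L, K = N - L + 1; gaps initialised, e.g., at the observed mean.
\begin{align*}
\mathbf{X}^{(m)} &= \bigl[x^{(m)}_{i+j-1}\bigr]_{i=1,\dots,L;\; j=1,\dots,K}
  && \text{(trajectory matrix)}\\
\mathbf{X}^{(m)} &= \sum_{i=1}^{\min(L,K)} \sigma_i \mathbf{u}_i \mathbf{v}_i^{\top},
  \qquad
  \widehat{\mathbf{X}}^{(m)} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^{\top}
  && \text{(SVD and rank-$r$ truncation)}\\
\hat{x}^{(m)}_t &= \frac{1}{|A_t|} \sum_{(i,j)\in A_t} \widehat{X}^{(m)}_{ij},
  \qquad
  A_t = \{(i,j) : 1 \le i \le L,\ 1 \le j \le K,\ i + j - 1 = t\}
  && \text{(diagonal averaging)}\\
x^{(m+1)}_t &=
  \begin{cases}
    x_t, & t \notin \mathcal{M},\\
    \hat{x}^{(m)}_t, & t \in \mathcal{M}
  \end{cases}
  && \text{(update the gaps only)}
\end{align*}
```

The dependence on $L$ and $r$ is explicit in the first two lines, which is the sensitivity the simulations in the abstract examine.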

Publication Type : Journal Article PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12439824/pdf/
Citations: 0
Neural Network for Correlated Survival Outcomes Using Frailty Model.
Pub Date : 2025-01-01 Epub Date: 2025-03-26 DOI: 10.6339/25-jds1173
Ruiwen Zhou, Kevin He, Di Wang, Lili Liu, Shujie Ma, Annie Qu, J Philip Miller, Lei Liu

There is an extensive literature on the analysis of correlated survival data. Subjects within a cluster share some common characteristics, e.g., genetic and environmental factors, so their time-to-event outcomes are correlated. The frailty model under the proportional hazards assumption has been widely applied to the analysis of clustered survival outcomes. However, the prediction performance of this method can be unsatisfactory when the risk factors have complicated effects, e.g., nonlinear effects and interactions. To deal with these issues, we propose a neural network frailty Cox model that replaces the linear risk function with the output of a feed-forward neural network. The estimation is based on a quasi-likelihood obtained using the Laplace approximation. A simulation study suggests that the proposed method performs best among the methods compared. The method is applied to the prediction of time to failure, clustered within kidney transplantation facilities, using the national kidney transplant registry data from the U.S. Organ Procurement and Transplantation Network. All computer programs are available at https://github.com/rivenzhou/deep_learning_clustered.
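
To make the model structure concrete, here is a minimal sketch in which a small feed-forward network supplies the risk score, a cluster-level frailty is added to it, and the two enter a Cox partial likelihood with a Gaussian penalty on the frailties. The penalized form, the one-hidden-layer architecture, and all names (`risk_score`, `penalized_neg_log_lik`, `theta`) are assumptions for illustration; the authors' actual implementation is in the repository linked in the abstract.

```python
import math


def risk_score(x, hidden_weights, hidden_bias, out_weights):
    """One-hidden-layer feed-forward network with tanh units (scalar output)."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(hidden_weights, hidden_bias)
    ]
    return sum(w * h for w, h in zip(out_weights, hidden))


def penalized_neg_log_lik(times, events, scores, clusters, frailty, theta):
    """Negative Cox partial log-likelihood plus a Gaussian frailty penalty.

    scores[i]   : network output for subject i
    clusters[i] : cluster label of subject i
    frailty[c]  : random effect of cluster c (assumed N(0, theta))
    """
    # Linear predictor = network risk score + shared cluster effect
    eta = [s + frailty[c] for s, c in zip(scores, clusters)]
    nll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i:  # only observed events contribute a term
            risk_set = sum(
                math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i
            )
            nll -= eta[i] - math.log(risk_set)
    # Quadratic penalty from the Gaussian frailty density (constants dropped)
    penalty = sum(b * b for b in frailty.values()) / (2.0 * theta)
    return nll + penalty
```

With all scores and frailties set to zero, the value reduces to the sum of the log risk-set sizes at the event times, which gives a quick sanity check of the likelihood term.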

Publication Type : Journal Article Volume : 23 4 Pages : 624-637 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12829921/pdf/
Citations: 0
Magnitude Pruning of Large Pretrained Transformer Models with a Mixture Gaussian Prior.
Pub Date : 2024-11-26 DOI: 10.6339/24-jds1156
Mingxuan Zhang, Yan Sun, Faming Liang

Large pretrained transformer models have revolutionized modern AI applications with their state-of-the-art performance in natural language processing (NLP). However, their substantial parameter count poses challenges for real-world deployment. To address this, researchers often reduce model size by pruning parameters based on their magnitude or sensitivity. Previous research has demonstrated the limitations of magnitude pruning, especially in the context of transfer learning for modern NLP tasks. In this paper, we introduce a new magnitude-based pruning algorithm called mixture Gaussian prior pruning (MGPP), which employs a mixture Gaussian prior for regularization. MGPP prunes non-expressive weights under the guidance of the mixture Gaussian prior, aiming to retain the model's expressive capability. Extensive evaluations across various NLP tasks, including natural language understanding, question answering, and natural language generation, demonstrate the superiority of MGPP over existing pruning methods, particularly in high sparsity settings. Additionally, we provide a theoretical justification for the consistency of the sparse transformer, shedding light on the effectiveness of the proposed pruning method.
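
The two ingredients named in the abstract can be sketched separately: a mixture Gaussian log-prior that could serve as a regularizer, and plain magnitude pruning at a target sparsity. This is a generic illustration, not the MGPP algorithm; the two-component zero-mean form of the prior, the parameter names (`lam`, `sigma0`, `sigma1`), and the way the two pieces would be combined during training are assumptions.

```python
import math


def mixture_gaussian_log_prior(w, lam, sigma0, sigma1):
    """Log density of lam * N(0, sigma1^2) + (1 - lam) * N(0, sigma0^2) at w."""

    def normal_pdf(x, sigma):
        return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

    return math.log(lam * normal_pdf(w, sigma1) + (1.0 - lam) * normal_pdf(w, sigma0))


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.5, -0.01, 0.2, -0.9, 0.03, 0.0004]
sparse_weights = magnitude_prune(weights, sparsity=0.5)
```

With a narrow component (small `sigma0`) and a wide one (larger `sigma1`), the prior places most of its mass near zero while still allowing large weights, which is the usual motivation for pairing such a prior with magnitude-based pruning.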

Publication Type : Journal Article PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12629628/pdf/
Citations: 0
A Meta-Learner Framework to Estimate Individualized Treatment Effects for Survival Outcomes.
Pub Date : 2024-10-01 Epub Date: 2024-02-05 DOI: 10.6339/24-jds1119
Na Bo, Yue Wei, Lang Zeng, Chaeryon Kang, Ying Ding

One crucial aspect of precision medicine is to allow physicians to recommend the most suitable treatment for their patients. This requires understanding the treatment heterogeneity from a patient-centric view, quantified by estimating the individualized treatment effect (ITE). With a large amount of genetics data and medical factors being collected, a complete picture of individuals' characteristics is forming, which provides more opportunities to accurately estimate ITE. Recent development using machine learning methods within the counterfactual outcome framework shows excellent potential in analyzing such data. In this research, we propose to extend meta-learning approaches to estimate individualized treatment effects with survival outcomes. Two meta-learning algorithms are considered, T-learner and X-learner, each combined with three types of machine learning methods: random survival forest, Bayesian accelerated failure time model and survival neural network. We examine the performance of the proposed methods and provide practical guidelines for their application in randomized clinical trials (RCTs). Moreover, we propose to use the Boruta algorithm to identify risk factors that contribute to treatment heterogeneity based on ITE estimates. The finite sample performances of these methods are compared through extensive simulations under different randomization designs. The proposed approach is applied to a large RCT of eye disease, namely, age-related macular degeneration (AMD), to estimate the ITE on delaying time-to-AMD progression and to make individualized treatment recommendations.
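
The T-learner idea can be sketched with a deliberately simple base learner: fit one survival model per treatment arm and take the difference in predicted survival probability at a time point as the ITE. Here the base learner is a Kaplan-Meier estimate within a covariate stratum, standing in for the random survival forest, Bayesian accelerated failure time model, or survival neural network named in the abstract; the data layout and function names are assumptions for this example, and the X-learner is not shown.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of the survival probability S(t)."""
    surv = 1.0
    event_times = sorted({ti for ti, d in zip(times, events) if d and ti <= t})
    for u in event_times:
        at_risk = sum(1 for ti in times if ti >= u)
        deaths = sum(1 for ti, d in zip(times, events) if d and ti == u)
        surv *= 1.0 - deaths / at_risk
    return surv


def t_learner_ite(data, stratum, t):
    """ITE at time t as S_treated(t | stratum) - S_control(t | stratum).

    data: list of (time, event, treated, stratum) tuples, with
          event = 1 for an observed event and treated in {0, 1}.
    """

    def arm_survival(arm):
        # Fit the base learner on one arm only (the "T" in T-learner)
        rows = [(ti, d) for ti, d, a, s in data if a == arm and s == stratum]
        return km_survival([r[0] for r in rows], [r[1] for r in rows], t)

    return arm_survival(1) - arm_survival(0)


trial = [
    (5, 1, 1, 0), (6, 1, 1, 0),  # treated subjects in stratum 0
    (1, 1, 0, 0), (6, 1, 0, 0),  # control subjects in stratum 0
]
ite_at_3 = t_learner_ite(trial, stratum=0, t=3)
```

A positive value means the treated arm has the higher estimated probability of remaining event-free at time `t` for that stratum.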

Publication Type : Journal Article Volume : 22 4 Pages : 505-523 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12440118/pdf/
Citations: 0
Maximum Likelihood Estimation for Shape-restricted Single-index Hazard Models.
Pub Date : 2023-10-01 Epub Date: 2022-11-04 DOI: 10.6339/22-jds1061
Jing Qin, Yifei Sun, Ao Yuan, Chiung-Yu Huang

Single-index models are becoming increasingly popular in many scientific applications as they offer the advantages of flexibility in regression modeling as well as interpretable covariate effects. In the context of survival analysis, the single-index hazards models are natural extensions of the Cox proportional hazards models. In this paper, we propose a novel estimation procedure for single-index hazard models under a monotone constraint of the index. We apply the profile likelihood method to obtain the semiparametric maximum likelihood estimator, where the novelty of the estimation procedure lies in estimating the unknown monotone link function by embedding the problem in isotonic regression with exponentially distributed random variables. The consistency of the proposed semiparametric maximum likelihood estimator is established under suitable regularity conditions. Numerical simulations are conducted to examine the finite-sample performance of the proposed method. An analysis of breast cancer data is presented for illustration.
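
Isotonic regression, the building block mentioned in the abstract, is commonly computed with the pool-adjacent-violators algorithm (PAVA). The sketch below is the generic weighted least-squares version, shown only to illustrate how a monotone fit is obtained; it is not the paper's estimator for the link function, and the function name `pava` is an assumption.

```python
def pava(y, w=None):
    """Weighted least-squares fit of a non-decreasing sequence to y."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block holds [mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while adjacent block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)  # every point in a block gets the block mean
    return fit


fitted = pava([1, 3, 2, 4])
```

Each violation of the ordering is resolved by replacing the offending neighbours with their weighted mean, so the output is piecewise constant and non-decreasing.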

Publication Type : Journal Article Volume : 1 1 Pages : 681-695 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11017303/pdf/
Citations: 0
Central Posterior Envelopes for Bayesian Functional Principal Component Analysis.
Pub Date : 2023-10-01 Epub Date: 2023-01-19 DOI: 10.6339/23-jds1085
Joanna Boland, Donatello Telesca, Catherine Sugar, Shafali Jeste, Abigail Dickinson, Charlotte DiStefano, Damla Şentürk

Bayesian methods provide direct inference in functional data analysis applications without reliance on bootstrap techniques. A major tool in functional data applications is functional principal component analysis, which decomposes the data around a common mean function and identifies leading directions of variation. Bayesian functional principal components analysis (BFPCA) provides uncertainty quantification on the estimated functional model components via the posterior samples obtained. We propose central posterior envelopes (CPEs) for BFPCA based on functional depth as a descriptive visualization tool to summarize variation in the posterior samples of the estimated functional model components, contributing to uncertainty quantification in BFPCA. The proposed BFPCA relies on a latent factor model and targets model parameters within a mixed effects modeling framework using modified multiplicative gamma process shrinkage priors on the variance components. Functional depth provides a center-outward ordering of a sample of functions. We utilize modified band depth and modified volume depth to order a sample of functions and surfaces, respectively, and so arrive at CPEs of the mean and eigenfunctions within the BFPCA framework. The proposed CPEs are showcased in extensive simulations. Finally, the proposed CPEs are applied to the analysis of a sample of power spectral densities (PSD) from resting state electroencephalography (EEG), where they lead to novel insights on diagnostic group differences among children diagnosed with autism spectrum disorder and their typically developing peers across age.
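
The center-outward ordering can be illustrated with modified band depth for curves observed on a common grid: for each curve, average over all pairs of curves the fraction of grid points at which it lies inside the band spanned by the pair. The envelope function below then takes the pointwise range of the deepest curves. Treating the envelope this way, and the names `modified_band_depth`, `central_envelope`, and `alpha`, are assumptions for illustration; in the paper's setting the curves would be posterior samples of a model component.

```python
import math
from itertools import combinations


def modified_band_depth(curves):
    """Modified band depth (bands formed by pairs) of each curve in a sample."""
    n_points = len(curves[0])
    pairs = list(combinations(range(len(curves)), 2))
    depths = []
    for x in curves:
        total = 0.0
        for i, j in pairs:
            # Fraction of grid points where x lies inside the band of curves i, j
            inside = sum(
                1
                for t in range(n_points)
                if min(curves[i][t], curves[j][t]) <= x[t] <= max(curves[i][t], curves[j][t])
            )
            total += inside / n_points
        depths.append(total / len(pairs))
    return depths


def central_envelope(curves, alpha=0.5):
    """Pointwise lower/upper bounds of the deepest alpha-fraction of curves."""
    depths = modified_band_depth(curves)
    n_keep = max(1, math.ceil(alpha * len(curves)))
    deepest = sorted(range(len(curves)), key=lambda i: depths[i], reverse=True)[:n_keep]
    n_points = len(curves[0])
    lower = [min(curves[i][t] for i in deepest) for t in range(n_points)]
    upper = [max(curves[i][t] for i in deepest) for t in range(n_points)]
    return lower, upper


sample = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
depths = modified_band_depth(sample)
```

In the three-curve sample the middle curve lies inside every band and receives the largest depth, so it is ranked as the most central.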

Publication Type : Journal Article Volume : 1 1 Pages : 715-734 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11178334/pdf/
Citations: 0
Optimal Physician Shared-Patient Networks and the Diffusion of Medical Technologies.
Pub Date : 2023-07-01 Epub Date: 2022-08-30 DOI: 10.6339/22-jds1064
A James O'Malley, Xin Ran, Chuankai An, Daniel Rockmore

Social network analysis has created a productive framework for the analysis of the histories of patient-physician interactions and physician collaboration. Notable is the construction of networks based on the data of "referral paths" - sequences of patient-specific temporally linked physician visits - in this case, culled from a large set of Medicare claims data in the United States. Network constructions depend on a range of choices regarding the underlying data. In this paper we introduce the use of a five-factor experiment that produces 80 distinct projections of the bipartite patient-physician mixing matrix to a unipartite physician network derived from the referral path data, which is further analyzed at the level of the 2,219 hospitals in the final analytic sample. We summarize the networks of physicians within a given hospital using a range of directed and undirected network features (quantities that summarize structural properties of the network such as its size, density, and reciprocity). The different projections and their underlying factors are evaluated in terms of the heterogeneity of the network features across the hospitals. We also evaluate the projections relative to their ability to improve the predictive accuracy of a model estimating a hospital's adoption of implantable cardiac defibrillators, a novel cardiac intervention. Because it optimizes the knowledge learned about the overall and interactive effects of the factors, we anticipate that the factorial design setting for network analysis may be useful more generally as a methodological advance in network analysis.
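
One possible projection from referral paths to a physician network can be sketched as follows: order each patient's visits in time, draw a directed edge between consecutive distinct physicians, and summarize the result with density and reciprocity. This is a single illustrative choice, not any specific one of the 80 projections studied; the visit tuple layout and the function names are assumptions.

```python
from collections import defaultdict


def referral_network(visits):
    """Build directed physician edges from (patient, physician, time) visits.

    Returns a dict mapping (from_physician, to_physician) to the number of
    times that transition occurs along patients' referral paths.
    """
    by_patient = defaultdict(list)
    for patient, physician, time in visits:
        by_patient[patient].append((time, physician))
    edges = defaultdict(int)
    for path in by_patient.values():
        path.sort()  # temporal order within each patient
        for (_, src), (_, dst) in zip(path, path[1:]):
            if src != dst:  # ignore repeat visits to the same physician
                edges[(src, dst)] += 1
    return dict(edges)


def density(edges, n_physicians):
    """Share of possible directed edges that are present."""
    return len(edges) / (n_physicians * (n_physicians - 1))


def reciprocity(edges):
    """Share of directed edges whose reverse edge is also present."""
    mutual = sum(1 for (a, b) in edges if (b, a) in edges)
    return mutual / len(edges)


visits = [
    ("p1", "A", 1), ("p1", "B", 2),
    ("p2", "B", 1), ("p2", "A", 3),
    ("p3", "A", 1), ("p3", "C", 2),
]
network = referral_network(visits)
```

Changing any of the construction choices, such as whether edges are directed or how far apart in time two visits may be, yields a different projection of the same patient-physician data.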

Journal of data science : JDS, pages 578-598 (published 2023-07-01). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10956597/pdf/
Citations: 0
Generating General Preferential Attachment Networks with R Package wdnet
Pub Date : 2023-01-31 DOI: 10.6339/23-jds1110
Yelie Yuan, Tiandong Wang, Jun Yan, Panpan Zhang
Preferential attachment (PA) network models have a wide range of applications in various scientific disciplines. Efficient generation of large-scale PA networks helps uncover their structural properties and facilitates the development of associated analytical methodologies. Existing software packages provide only limited functions for this purpose, with restricted configurations and efficiency. We present a generic, user-friendly implementation of weighted, directed PA network generation with the R package wdnet. The core algorithm is based on an efficient binary tree approach. The package further allows adding multiple edges at a time, heterogeneous reciprocal edges, and user-specified preference functions. The engine under the hood is implemented in C++. Usage of the package is illustrated with detailed explanations. A benchmark study shows that wdnet is efficient for generating general PA networks not available in other packages. In restricted settings that can be handled by existing packages, wdnet provides comparable efficiency.
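
To make the idea of preferential attachment concrete, here is a minimal Python sketch of a directed PA generator. It is not wdnet's API and does not use the binary tree approach mentioned in the abstract: it samples each target by a plain linear scan with probability proportional to in-strength plus an offset. The function name, the `delta` offset, and the unit edge weights are assumptions for illustration only.

```python
import random


def generate_pa_edges(n_steps, delta=1.0, seed=0):
    """Grow a directed network one node and one edge per step.

    Each new node sends a unit-weight edge to an existing node chosen with
    probability proportional to (in-strength + delta).
    Returns a list of (source, target, weight) tuples.
    """
    rng = random.Random(seed)
    edges = [(0, 1, 1.0)]        # seed network: a single edge 0 -> 1
    in_strength = [0.0, 1.0]     # total incoming weight per node

    for _ in range(n_steps):
        new_node = len(in_strength)
        # Preference of each existing node (linear scan, O(n) per step).
        prefs = [s + delta for s in in_strength]
        target = rng.choices(range(new_node), weights=prefs, k=1)[0]
        in_strength.append(0.0)
        in_strength[target] += 1.0
        edges.append((new_node, target, 1.0))
    return edges


edges = generate_pa_edges(100)
print(len(edges), edges[:5])
```

The linear scan makes each step cost time proportional to the current network size. Avoiding that cost is presumably the motivation for a tree-based sampler in a package aimed at large networks.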
Citations: 2
Random Forest of Interaction Trees for Estimating Individualized Treatment Regimes with Ordered Treatment Levels in Observational Studies
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1084
Justine Thorp, R. Levine, Luo Li, J. Fan
Traditional methods for evaluating a potential treatment have focused on the average treatment effect. However, there are situations in which individuals respond to a treatment in substantially different ways. In these situations, one needs to account for the differences among individuals when estimating the treatment effect. Li et al. (2022) proposed a method based on a random forest of interaction trees (RFIT) for a binary or categorical treatment variable that incorporates the propensity score in the construction of the random forest. Motivated by the need to evaluate the effect of tutoring sessions at a Math and Stat Learning Center (MSLC), we extend their approach to an ordinal treatment variable. Our approach improves upon RFIT for multiple treatments by incorporating the ordered structure of the treatment variable into the tree growing process. To illustrate the effectiveness of our proposed method, we conduct simulation studies, which show that the method has a lower mean squared error and a higher rate of correct optimal treatment classification, and is able to identify the most important variables that impact the treatment effect. We then apply the proposed method to estimate how the number of visits to the MSLC impacts an individual student's probability of passing an introductory statistics course. Our results show that every student is recommended to go to the MSLC at least once, and some can drastically improve their chance of passing the course by going the optimal number of times suggested by our analysis.
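
The two simulation criteria named in the abstract, mean squared error and optimal treatment classification, can be sketched as follows. This is an assumed reading rather than the authors' evaluation code: it takes the error on predicted outcomes at each ordered treatment level (the abstract does not say which quantity the error is computed on), and the function name and toy numbers are hypothetical.

```python
def evaluate_regime(true_outcomes, est_outcomes):
    """Compare estimated and true outcomes across ordered treatment levels.

    Each argument is a list with one inner list per individual; the inner
    list holds the outcome under treatment levels 0, 1, 2, ... in order.
    Returns (mean squared error, share of individuals whose estimated best
    level matches the true best level).
    """
    sq_err, count, correct = 0.0, 0, 0
    for truth, est in zip(true_outcomes, est_outcomes):
        for t, e in zip(truth, est):
            sq_err += (t - e) ** 2
            count += 1
        # Index of the level with the highest outcome (first one on ties).
        best_true = max(range(len(truth)), key=truth.__getitem__)
        best_est = max(range(len(est)), key=est.__getitem__)
        correct += int(best_true == best_est)
    return sq_err / count, correct / len(true_outcomes)


# Hypothetical pass probabilities for two students at 0, 1, and 2 visits.
truth = [[0.2, 0.5, 0.4], [0.6, 0.3, 0.1]]
estimate = [[0.1, 0.6, 0.4], [0.4, 0.5, 0.1]]
mse, rate = evaluate_regime(truth, estimate)
print(mse, rate)
```

In this toy case the estimate picks the right level for the first student and the wrong one for the second, so the classification rate is one half.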
Citations: 2