Journal of Machine Learning Research: Latest Articles

D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data.
IF 4.3, Zone 3 Computer Science, Q1 AUTOMATION & CONTROL SYSTEMS, Pub Date: 2022-01-01
Hai Shu, Zhe Qu, Hongtu Zhu

Modern biomedical studies often collect multi-view data, that is, multiple types of data measured on the same set of objects. A popular model in high-dimensional multi-view data analysis is to decompose each view's data matrix into a low-rank common-source matrix generated by latent factors common across all data views, a low-rank distinctive-source matrix corresponding to each view, and an additive noise matrix. We propose a novel decomposition method for this model, called decomposition-based generalized canonical correlation analysis (D-GCCA). The D-GCCA rigorously defines the decomposition on the L^2 space of random variables in contrast to the Euclidean dot product space used by most existing methods, thereby being able to provide the estimation consistency for the low-rank matrix recovery. Moreover, to well calibrate common latent factors, we impose a desirable orthogonality constraint on distinctive latent factors. Existing methods, however, inadequately consider such orthogonality and may thus suffer from substantial loss of undetected common-source variation. Our D-GCCA takes one step further than generalized canonical correlation analysis by separating common and distinctive components among canonical variables, while enjoying an appealing interpretation from the perspective of principal component analysis. Furthermore, we propose to use the variable-level proportion of signal variance explained by common or distinctive latent factors for selecting the variables most influenced. Consistent estimators of our D-GCCA method are established with good finite-sample numerical performance, and have closed-form expressions leading to efficient computation especially for large-scale data. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.
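
The decomposition model in this abstract (each view = common-source matrix + distinctive-source matrix + noise) can be made concrete by simulating data from it. The following is a minimal sketch under stated assumptions: Gaussian latent factors and loadings, and hypothetical names (`simulate_views`, `r_common`, `r_distinct`, `noise`). It only generates data from the model and is not the D-GCCA estimator.

```python
import numpy as np

def simulate_views(n=200, dims=(50, 80), r_common=2, r_distinct=3, noise=0.1, seed=0):
    """Simulate multi-view data X_k = C_k + D_k + E_k (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Latent factors shared by every view (assumed Gaussian for illustration).
    common_factors = rng.standard_normal((r_common, n))
    views = []
    for p in dims:
        # Low-rank common-source matrix: view-specific loadings times shared factors.
        C = rng.standard_normal((p, r_common)) @ common_factors
        # Low-rank distinctive-source matrix: factors drawn separately for this view.
        D = rng.standard_normal((p, r_distinct)) @ rng.standard_normal((r_distinct, n))
        # Additive noise matrix.
        E = noise * rng.standard_normal((p, n))
        views.append({"X": C + D + E, "C": C, "D": D, "E": E})
    return views
```

Each returned dictionary holds the observed matrix `X` (variables by objects) alongside its three components, so a decomposition method can be checked against the known `C` and `D`.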

Citations: 0
Interpretable Classification of Categorical Time Series Using the Spectral Envelope and Optimal Scalings.
IF 6, Zone 3 Computer Science, Q1 Mathematics, Pub Date: 2022-01-01
Zeda Li, Scott A Bruce, Tian Cai

This article introduces a novel approach to the classification of categorical time series under the supervised learning paradigm. To construct meaningful features for categorical time series classification, we consider two relevant quantities: the spectral envelope and its corresponding set of optimal scalings. These quantities characterize oscillatory patterns in a categorical time series as the largest possible power at each frequency, or spectral envelope, obtained by assigning numerical values, or scalings, to categories that optimally emphasize oscillations at each frequency. Our procedure combines these two quantities to produce an interpretable and parsimonious feature-based classifier that can be used to accurately determine group membership for categorical time series. Classification consistency of the proposed method is investigated, and simulation studies are used to demonstrate accuracy in classifying categorical time series with various underlying group structures. Finally, we use the proposed method to explore key differences in oscillatory patterns of sleep stage time series for patients with different sleep disorders and accurately classify patients accordingly. The code for implementing the proposed method is available at https://github.com/zedali16/envsca.
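
The spectral envelope described above (the largest power at each frequency over all numerical scalings of the categories) can be sketched as an eigenvalue problem. This is a minimal illustration under stated assumptions: a raw, unsmoothed periodogram stands in for the spectral estimate, the last category is used as the reference level, and the name `spectral_envelope` is hypothetical. It is not the authors' implementation, which is available at the URL given in the abstract.

```python
import numpy as np

def spectral_envelope(series, categories):
    """Raw-periodogram sketch of the spectral envelope of a categorical series."""
    x = np.asarray(series)
    n = len(x)
    # Indicator vectors for every category except the last (reference) one.
    Y = np.stack([(x == c).astype(float) for c in categories[:-1]], axis=1)
    Yc = Y - Y.mean(axis=0)
    V = Yc.T @ Yc / n                        # covariance of the indicators
    L = np.linalg.cholesky(V)                # V = L L^T
    D = np.fft.fft(Yc, axis=0) / np.sqrt(n)  # row j holds the DFT at frequency j/n
    half = n // 2
    freqs = np.arange(1, half + 1) / n
    env = np.empty(half)
    for idx, j in enumerate(range(1, half + 1)):
        d = D[j]
        I_re = np.real(np.outer(d, d.conj()))  # real part of the periodogram matrix
        # Largest eigenvalue of L^{-1} I_re L^{-T}: the maximal power of the
        # scaled series relative to its variance, over all real scalings.
        M = np.linalg.solve(L, np.linalg.solve(L, I_re).T)
        env[idx] = np.linalg.eigvalsh((M + M.T) / 2).max()
    return freqs, env
```

For a series that cycles through three categories, this sketch places all of the envelope's mass at frequency 1/3. In practice the periodogram would usually be smoothed before the eigenvalue step.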

Citations: 0
Spatial Multivariate Trees for Big Data Bayesian Regression.
IF 6, Zone 3 Computer Science, Q1 Mathematics, Pub Date: 2022-01-01
Michele Peruzzi, David B Dunson

High resolution geospatial data are challenging because standard geostatistical models based on Gaussian processes are known to not scale to large data sizes. While progress has been made towards methods that can be computed more efficiently, considerably less attention has been devoted to methods for large scale data that allow the description of complex relationships between several outcomes recorded at high resolutions by different sensors. Our Bayesian multivariate regression models based on spatial multivariate trees (SpamTrees) achieve scalability via conditional independence assumptions on latent random effects following a treed directed acyclic graph. Information-theoretic arguments and considerations on computational efficiency guide the construction of the tree and the related efficient sampling algorithms in imbalanced multivariate settings. In addition to simulated data examples, we illustrate SpamTrees using a large climate data set which combines satellite data with land-based station data. Software and source code are available on CRAN at https://CRAN.R-project.org/package=spamtree.

Citations: 0
Bayesian subset selection and variable importance for interpretable prediction and classification.
IF 6, Zone 3 Computer Science, Q1 Mathematics, Pub Date: 2022-01-01
Daniel R Kowal

Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model ℳ, we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single "best" subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via ℳ. For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.

Citations: 0
Hoeffding's inequality for general Markov chains with its applications to statistical learning.
IF 6, Zone 3 Computer Science, Q1 Mathematics, Pub Date: 2021-08-01
Jianqing Fan, Bai Jiang, Qiang Sun

This paper establishes Hoeffding's lemma and inequality for bounded functions of general-state-space and not necessarily reversible Markov chains. The sharpness of these results is characterized by the optimality of the ratio between variance proxies in the Markov-dependent and independent settings. The boundedness of functions is shown necessary for such results to hold in general. To showcase the usefulness of the new results, we apply them for non-asymptotic analyses of MCMC estimation, respondent-driven sampling and high-dimensional covariance matrix estimation on time series data with a Markovian nature. In addition to statistical problems, we also apply them to study the time-discounted rewards in econometric models and the multi-armed bandit problem with Markovian rewards arising from the field of machine learning.
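
For orientation, the independent setting that the abstract uses as its baseline is the classical Hoeffding inequality: for n independent draws bounded in [a, b], P(|sample mean - mean| >= eps) <= 2 exp(-2 n eps^2 / (b - a)^2). The sketch below evaluates that classical bound only; the function names are hypothetical, and the Markov-chain version established in the paper (which changes the variance proxy) is not reproduced here.

```python
import math

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Classical two-sided Hoeffding bound for n independent draws in [a, b]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps**2 / (b - a) ** 2))

def samples_needed(eps, delta, a=0.0, b=1.0):
    """Smallest n for which the classical bound is at most delta."""
    return math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * eps**2))
```

For example, `samples_needed(0.1, 0.05)` gives the number of independent [0, 1]-valued draws after which the sample mean is within 0.1 of the true mean with probability at least 0.95 according to this bound.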

Citations: 0
A flexible model-free prediction-based framework for feature ranking.
IF 6, Zone 3 Computer Science, Q1 Mathematics, Pub Date: 2021-05-01
Jingyi Jessica Li, Yiling Elaine Chen, Xin Tong

Despite the availability of numerous statistical and machine learning tools for joint feature modeling, many scientists investigate features marginally, i.e., one feature at a time. This is partly due to training and convention but is also rooted in scientists' strong interest in simple visualization and interpretability. As such, marginal feature ranking for some predictive tasks, e.g., prediction of cancer driver genes, is widely practiced in the process of scientific discoveries. In this work, we focus on marginal ranking for binary classification, one of the most common predictive tasks. We argue that the most widely used marginal ranking criteria, including the Pearson correlation, the two-sample t test, and two-sample Wilcoxon rank-sum test, do not fully take feature distributions and prediction objectives into account. To address this gap in practice, we propose two ranking criteria corresponding to two prediction objectives: the classical criterion (CC) and the Neyman-Pearson criterion (NPC), both of which use model-free nonparametric implementations to accommodate diverse feature distributions. Theoretically, we show that under regularity conditions, both criteria achieve sample-level ranking that is consistent with their population-level counterpart with high probability. Moreover, NPC is robust to sampling bias when the two class proportions in a sample deviate from those in the population. This property endows NPC with good potential in biomedical research where sampling biases are ubiquitous. We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. Our model-free objective-based ranking idea is extendable to ranking feature subsets and generalizable to other prediction tasks and learning objectives.
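
To make "marginal ranking" concrete, the sketch below ranks features one at a time by the absolute two-sample t statistic, one of the conventional criteria the abstract names as a baseline. It assumes binary labels coded 0/1 and uses the unequal-variance (Welch) form of the statistic; the function name `rank_features_by_t` is hypothetical. It does not implement the CC or NPC criteria proposed in the paper.

```python
import numpy as np

def rank_features_by_t(X, y):
    """Rank columns of X by |Welch t statistic| between classes y == 1 and y == 0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    x0, x1 = X[y == 0], X[y == 1]
    # Standard error of the difference in class means, computed per feature.
    se = np.sqrt(x0.var(axis=0, ddof=1) / len(x0) + x1.var(axis=0, ddof=1) / len(x1))
    t = (x1.mean(axis=0) - x0.mean(axis=0)) / se
    order = np.argsort(-np.abs(t))  # most separated feature first
    return order, t
```

Each feature is scored without reference to the others, which is what distinguishes marginal ranking from joint feature modeling.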

Citations: 0
Integrative High Dimensional Multiple Testing with Heterogeneity under Data Sharing Constraints.
IF 4.3, Zone 3 Computer Science, Q1 AUTOMATION & CONTROL SYSTEMS, Pub Date: 2021-04-01
Molei Liu, Yin Xia, Kelly Cho, Tianxi Cai

Identifying informative predictors in a high dimensional regression model is a critical step for association analysis and predictive modeling. Signal detection in the high dimensional setting often fails due to the limited sample size. One approach to improving power is through meta-analyzing multiple studies which address the same scientific question. However, integrative analysis of high dimensional data from multiple studies is challenging in the presence of between-study heterogeneity. The challenge is even more pronounced with additional data sharing constraints under which only summary data can be shared across different sites. In this paper, we propose a novel data shielding integrative large-scale testing (DSILT) approach to signal detection allowing between-study heterogeneity and not requiring the sharing of individual level data. Assuming the underlying high dimensional regression models of the data differ across studies yet share similar support, the proposed method incorporates proper integrative estimation and debiasing procedures to construct test statistics for the overall effects of specific covariates. We also develop a multiple testing procedure to identify significant effects while controlling the false discovery rate (FDR) and false discovery proportion (FDP). Theoretical comparisons of the new testing procedure with the ideal individual-level meta-analysis (ILMA) approach and other distributed inference methods are investigated. Simulation studies demonstrate that the proposed testing procedure performs well in both controlling false discovery and attaining power. The new method is applied to a real example detecting interaction effects of the genetic variants for statins and obesity on the risk for type II diabetes.
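
As background for the false discovery rate (FDR) control mentioned above, the sketch below implements the standard Benjamini-Hochberg step-up procedure on a vector of p-values. This is a generic textbook procedure swapped in purely for illustration; it is not the DSILT multiple testing procedure, whose details are not given in the abstract. The function name `benjamini_hochberg` is hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        # Step-up rule: reject every hypothesis up to the largest passing rank.
        k = np.max(np.nonzero(passed)[0])
        reject[order[: k + 1]] = True
    return reject
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value further down the sorted list passes.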

Citations: 0
Inference for Multiple Heterogeneous Networks with a Common Invariant Subspace.
IF 4.3, Zone 3 Computer Science, Q1 AUTOMATION & CONTROL SYSTEMS, Pub Date: 2021-03-01
Jesús Arroyo, Avanti Athreya, Joshua Cape, Guodong Chen, Carey E Priebe, Joshua T Vogelstein

The development of models and methodology for the analysis of data from multiple heterogeneous networks is of importance both in statistical network theory and across a wide spectrum of application domains. Although single-graph analysis is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. This paper addresses exactly this gap, by introducing a new model, the common subspace independent-edge multiple random graph model, which describes a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The model encompasses many popular network representations, including the stochastic blockmodel. The model is both flexible enough to meaningfully account for important graph differences, and tractable enough to allow for accurate inference in multiple networks. In particular, a joint spectral embedding of adjacency matrices, the multiple adjacency spectral embedding, leads to simultaneous consistent estimation of underlying parameters for each graph. Under mild additional assumptions, the estimates satisfy asymptotic normality and yield improvements for graph eigenvalue estimation. In both simulated and real data, the model and the embedding can be deployed for a number of subsequent network inference tasks, including dimensionality reduction, classification, hypothesis testing, and community detection. Specifically, when the embedding is applied to a data set of connectomes constructed through diffusion magnetic resonance imaging, the result is an accurate classification of brain scans by human subject and a meaningful determination of heterogeneity across scans of different individuals.
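
The joint spectral embedding mentioned in the abstract can be sketched in a few lines. The version below is an assumption about the details, which the abstract does not spell out: each graph contributes its d leading eigenvectors (by absolute eigenvalue, unscaled), these are concatenated, and the d leading left singular vectors of the concatenation serve as the common subspace estimate V; each graph then gets its own d-by-d score matrix V^T A V. The names `top_eigvecs` and `mase` are hypothetical.

```python
import numpy as np

def top_eigvecs(A, d):
    """Eigenvectors of symmetric A for the d eigenvalues largest in absolute value."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(-np.abs(vals))[:d]
    return vecs[:, idx]

def mase(adjacencies, d):
    """Sketch of a multiple adjacency spectral embedding for symmetric matrices."""
    # Step 1: embed each graph separately and stack the embeddings side by side.
    U = np.hstack([top_eigvecs(A, d) for A in adjacencies])
    # Step 2: leading left singular vectors estimate the shared invariant subspace.
    V = np.linalg.svd(U, full_matrices=False)[0][:, :d]
    # Step 3: project each graph onto the shared subspace to get its score matrix.
    scores = [V.T @ A @ V for A in adjacencies]
    return V, scores
```

When applied to noise-free edge-probability matrices from two stochastic blockmodels with the same community memberships, `V @ scores[i] @ V.T` reproduces each input matrix, which illustrates how a single subspace can carry graph-specific connectivity patterns.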

PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8513708/pdf/
Citations: 0
Bayesian time-aligned factor analysis of paired multivariate time series.
IF 6 Zone 3 Computer Science Q1 Mathematics Pub Date: 2021-01-01
Arkaprava Roy, Jana Schaich Borg, David B Dunson

Many modern data sets require inference methods that can estimate the shared and individual-specific components of variability in collections of matrices that change over time. Promising methods have been developed to analyze these types of data in static cases, but only a few approaches are available for dynamic settings. To address this gap, we consider novel models and inference methods for pairs of matrices in which the columns correspond to multivariate observations at different time points. In order to characterize common and individual features, we propose a Bayesian dynamic factor modeling framework called Time Aligned Common and Individual Factor Analysis (TACIFA) that includes uncertainty in time alignment through an unknown warping function. We provide theoretical support for the proposed model, showing identifiability and posterior concentration. The structure enables efficient computation through a Hamiltonian Monte Carlo (HMC) algorithm. We show excellent performance in simulations, and illustrate the method through application to a social mimicry experiment.
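
To make the data structure concrete, here is a toy simulation of a pair of matrices whose columns are time points, with one shared factor that the second series sees on a warped time axis, plus one individual factor per series. It is only an illustrative sketch: the power warping `warp`, the sinusoidal factors, the loadings, and the noise level are all assumptions chosen for the example, and it does not implement the TACIFA model or its HMC sampler.

```python
import numpy as np

def warp(t, a):
    """Hypothetical monotone warping of [0, 1] onto itself (a = 1 is identity)."""
    return t ** a

def shared_factor(s):
    """Illustrative shared latent trajectory."""
    return np.sin(2 * np.pi * s)

rng = np.random.default_rng(1)
T, p, q = 100, 5, 4            # time points, rows of X, rows of Y
t = np.linspace(0.0, 1.0, T)

# Loadings for the shared factor and for one individual factor per series.
load_x_shared = rng.standard_normal((p, 1))
load_y_shared = rng.standard_normal((q, 1))
load_x_indiv = rng.standard_normal((p, 1))
load_y_indiv = rng.standard_normal((q, 1))

# X sees the shared factor on the original time axis.
X = (load_x_shared @ shared_factor(t)[None, :]
     + load_x_indiv @ np.cos(4 * np.pi * t)[None, :]
     + 0.1 * rng.standard_normal((p, T)))

# Y sees the same shared factor on a warped time axis.
Y = (load_y_shared @ shared_factor(warp(t, 1.5))[None, :]
     + load_y_indiv @ np.cos(6 * np.pi * t)[None, :]
     + 0.1 * rng.standard_normal((q, T)))
```

An inference method for this setting would have to recover the shared and individual parts without knowing the warping, which is the source of the alignment uncertainty the abstract mentions.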

PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9221555/pdf/
Citations: 0
Soft Tensor Regression.
IF 4.3 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2021-01-01
Georgia Papadogeorgou, Zhengwu Zhang, David B Dunson

Statistical methods relating tensor predictors to scalar outcomes in a regression model generally vectorize the tensor predictor and estimate the coefficients of its entries employing some form of regularization, use summaries of the tensor covariate, or use a low dimensional approximation of the coefficient tensor. However, low rank approximations of the coefficient tensor can suffer if the true rank is not small. We propose a tensor regression framework which assumes a soft version of the parallel factors (PARAFAC) approximation. In contrast to classic PARAFAC where each entry of the coefficient tensor is the sum of products of row-specific contributions across the tensor modes, the soft tensor regression (Softer) framework allows the row-specific contributions to vary around an overall mean. We follow a Bayesian approach to inference, and show that softening the PARAFAC increases model flexibility, leads to improved estimation of coefficient tensors, more accurate identification of important predictor entries, and more precise predictions, even for a low approximation rank. From a theoretical perspective, we show that employing Softer leads to a weakly consistent posterior distribution of the coefficient tensor, irrespective of the true or approximation tensor rank, a result that is not true when employing the classic PARAFAC for tensor regression. In the context of our motivating application, we adapt Softer to symmetric and semi-symmetric tensor predictors and analyze the relationship between brain network characteristics and human traits.
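
The contrast between the classic and the softened PARAFAC can be sketched for a two-mode coefficient tensor (a matrix). In the classic form, each entry is a sum over components of products of row-specific contributions; in the softened form, each entry gets its own contributions that vary around those row-specific means. This is a minimal illustration, not the paper's Bayesian model: the Gaussian perturbation, the scale `sigma`, and the function names are assumptions introduced for the example.

```python
import numpy as np

def parafac_coefficients(g1, g2):
    """Classic rank-D form: B[i, j] = sum_d g1[i, d] * g2[j, d]."""
    return np.einsum("id,jd->ij", g1, g2)

def soft_parafac_coefficients(g1, g2, sigma, rng):
    """Softened form: entry-specific contributions centered on the
    row-specific means (Gaussian noise is an illustrative assumption)."""
    p1, D = g1.shape
    p2, _ = g2.shape
    # b1[d, i, j] varies around g1[i, d]; b2[d, i, j] varies around g2[j, d].
    b1 = g1.T[:, :, None] + sigma * rng.standard_normal((D, p1, p2))
    b2 = g2.T[:, None, :] + sigma * rng.standard_normal((D, p1, p2))
    return (b1 * b2).sum(axis=0)

rng = np.random.default_rng(2)
p1, p2, D = 6, 6, 2
g1 = rng.standard_normal((p1, D))
g2 = rng.standard_normal((p2, D))

hard = parafac_coefficients(g1, g2)
soft_zero = soft_parafac_coefficients(g1, g2, 0.0, rng)
soft = soft_parafac_coefficients(g1, g2, 0.5, rng)
```

With `sigma = 0.0` the softened construction collapses to the classic one, whose rank is at most `D`; with `sigma > 0` the result is no longer restricted to rank `D`, which is the added flexibility the abstract describes.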

PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9222480/pdf/
Citations: 0