
Latest publications in Statistical Analysis and Data Mining: The ASA Data Science Journal

Exponential calibration for correlation coefficient with additive distortion measurement errors
Pub Date : 2021-04-20 DOI: 10.1002/sam.11509
Jun Zhang, Zhuoer Xu
This paper studies the estimation of the correlation coefficient between unobserved variables of interest. These unobservable variables are distorted in an additive fashion by an observed confounding variable. We propose a new identifiability condition that uses exponential calibration to obtain calibrated variables, and we propose a direct-plug-in estimator for the correlation coefficient. We show that the direct-plug-in estimator is asymptotically efficient. Next, we suggest an asymptotic normal approximation and an empirical-likelihood-based statistic to construct confidence intervals. Finally, we propose several test statistics for testing whether the true correlation coefficient is zero, and we examine their asymptotic properties. Monte Carlo simulation experiments assess the performance of the proposed estimators and test statistics, and the methods are illustrated on a temperature forecast data set.
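The additive-distortion setup can be sketched numerically. The snippet below is an illustrative stand-in, not the paper's exponential calibration: it removes the additive distortions with a simple binned-mean estimate of E[observed | confounder] (assuming mean-zero distortion functions for identifiability) and then computes the correlation of the calibrated variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Latent variables of interest with true correlation 0.6
cov = [[1.0, 0.6], [0.6, 1.0]]
x, y = rng.multivariate_normal([0, 0], cov, size=n).T

# An observed confounder distorts both variables additively
u = rng.uniform(0, 1, n)
x_obs = x + np.sin(2 * np.pi * u)   # distortion phi(u), mean zero on [0, 1]
y_obs = y + u**2 - 1 / 3            # distortion psi(u), mean zero on [0, 1]

def calibrate(v_obs, u, n_bins=50):
    """Remove the additive distortion by subtracting a binned
    nonparametric estimate of E[v_obs | u], recentered at the mean."""
    bins = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(u, bins) - 1, 0, n_bins - 1)
    bin_means = np.array([v_obs[idx == k].mean() for k in range(n_bins)])
    return v_obs - bin_means[idx] + v_obs.mean()

naive = np.corrcoef(x_obs, y_obs)[0, 1]
calibrated = np.corrcoef(calibrate(x_obs, u), calibrate(y_obs, u))[0, 1]
print(naive, calibrated)
```

The naive correlation of the observed variables is biased by the shared confounder, while the calibrated estimate lands close to the true value of 0.6.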
Citations: 8
Supervised compression of big data
Pub Date : 2021-04-08 DOI: 10.1002/sam.11508
V. R. Joseph, Simon Mak
The phenomenon of big data has become ubiquitous in nearly all disciplines, from science to engineering. A key challenge is the use of such data for fitting statistical and machine learning models, which can incur high computational and storage costs. One solution is to perform model fitting on a carefully selected subset of the data. Various data reduction methods have been proposed in the literature, ranging from random subsampling to optimal experimental-design-based methods. However, when the goal is to learn the underlying input–output relationship, such reduction methods may not be ideal, since they do not make use of the information contained in the output. To this end, we propose a supervised data compression method called supercompress, which integrates output information by sampling data from the regions most important for modeling the desired input–output relationship. An advantage of supercompress is that it is nonparametric: the compression method does not rely on parametric modeling assumptions between inputs and output. As a result, the proposed method is robust to a wide range of modeling choices. We demonstrate the usefulness of supercompress over existing data reduction methods, in both simulations and a taxicab predictive modeling application.
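The general idea of letting the output guide compression can be illustrated with a toy sketch. This is not the authors' supercompress algorithm: here a cheap pilot model is fit first, and points are subsampled with probability proportional to their pilot residuals, so the compressed set concentrates where the input–output relation is hardest to capture.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20000, 500                 # full data size, compressed subset size

x = rng.uniform(-3, 3, n)
# Output varies only for x > 0; the left half is essentially flat
y = np.sin(6 * x) * (x > 0) + 0.05 * rng.normal(size=n)

# Unsupervised baseline: a uniform random subsample
uniform_idx = rng.choice(n, m, replace=False)

# Supervised sketch: oversample where a cheap pilot fit leaves large
# residuals, i.e. where the input-output relationship is most complex
pilot = np.poly1d(np.polyfit(x, y, 3))
resid = np.abs(y - pilot(x)) + 1e-6
supervised_idx = rng.choice(n, m, replace=False, p=resid / resid.sum())

frac_active = (x[supervised_idx] > 0).mean()  # share in the wiggly region
print((x[uniform_idx] > 0).mean(), frac_active)
```

The uniform subsample splits roughly 50/50, while the residual-weighted subsample concentrates in the region where the response actually varies.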
Citations: 14
Emulated order identification for models of big time series data
Pub Date : 2021-04-01 DOI: 10.1002/sam.11504
Brian Wu, Dorin Drignei
This interdisciplinary research includes elements of computing, optimization, and statistics for big data. Specifically, it addresses model order identification for big time series data. Computing and minimizing information criteria, such as BIC, on a grid of integer orders becomes prohibitive for time series recorded at a large number of time points. We propose to compute information criteria only for a sample of integer orders and to use kriging-based methods to emulate the information criteria on the rest of the grid. Then we use an efficient global optimization (EGO) algorithm to identify the orders. The method is applied to both ARMA and ARMA-GARCH models. We simulated time series from each type of model with prespecified orders and applied the method to identify the orders. We also used real big time series with tens of thousands of time points to illustrate the method. In particular, we used sentiment scores for news headlines on the economy for the ARMA models, and the NASDAQ daily returns for the ARMA-GARCH models, from 1971 to mid-April 2020, in the early stages of the COVID-19 pandemic. The proposed method identifies the orders of models for big time series data efficiently and accurately.
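The core trick, evaluating the information criterion only at a sample of orders and predicting it elsewhere, can be sketched for an AR model. This is a deliberately simplified stand-in: linear interpolation takes the place of the kriging emulator, and there is no EGO refinement loop; the sampled orders are fixed for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate an AR(2) series: y_t = 0.6 y_{t-1} - 0.3 y_{t-2} + e_t
n = 3000
y = np.zeros(n)
e = rng.normal(size=n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + e[t]

def ar_bic(y, p):
    """BIC of a least-squares AR(p) fit."""
    Y = y[p:]
    X = np.column_stack([y[p - k:-k] for k in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss = np.sum((Y - X @ beta) ** 2)
    m = len(Y)
    return m * np.log(rss / m) + p * np.log(m)

grid = np.arange(1, 21)
# Evaluate BIC only at a few sampled orders...
sample = np.array([1, 2, 4, 8, 12, 16, 20])
bic_sample = np.array([ar_bic(y, p) for p in sample])
# ...and emulate it on the rest of the grid (interpolation stands in
# for the kriging emulator), then minimize the emulated criterion
bic_emulated = np.interp(grid, sample, bic_sample)
best = int(grid[np.argmin(bic_emulated)])
print(best)
```

With long series the BIC minimum at the true order 2 is pronounced, so even the crude surrogate recovers it from a handful of evaluations.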
Citations: 1
Trees, forests, chickens, and eggs: when and why to prune trees in a random forest
Pub Date : 2021-03-30 DOI: 10.1002/sam.11594
Siyu Zhou, L. Mentch
Due to their long-standing reputation as excellent off-the-shelf predictors, random forests (RFs) remain a go-to model of choice for applied statisticians and data scientists. Despite their widespread use, however, until recently little was known about their inner workings and about which aspects of the procedure were driving their success. Very recently, two competing hypotheses have emerged: one based on interpolation and the other based on regularization. This work argues in favor of the latter by utilizing the regularization framework to reexamine the decades-old question of whether individual trees in an ensemble ought to be pruned. Despite the fact that default constructions of RFs use near full-depth trees in most popular software packages, here we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that RFs with shallow trees are advantageous when the signal-to-noise ratio in the data is low. In building up this argument, we also critique the newly popular notion of "double descent" in RFs by drawing parallels to U-statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.
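The low-SNR claim can be checked with a bare-bones bagged ensemble of depth-limited trees. This is an illustrative sketch, not the paper's experiments: a minimal 1-D regression tree plus bootstrap averaging, comparing a shallow against a deep depth limit on data where a weak trend is buried in unit-variance noise.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_tree(x, y, depth, min_leaf=5):
    """Greedy least-squares regression tree on 1-D input, grown to a
    fixed maximum depth; returns a leaf mean or (split, left, right)."""
    if depth == 0 or len(x) < 2 * min_leaf:
        return float(y.mean())
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    csum = np.cumsum(ys)
    sizes = np.arange(min_leaf, n - min_leaf + 1)   # candidate left sizes
    left_mean = csum[sizes - 1] / sizes
    right_mean = (csum[-1] - csum[sizes - 1]) / (n - sizes)
    # minimizing SSE is equivalent to maximizing this between-group score
    score = sizes * left_mean**2 + (n - sizes) * right_mean**2
    k = int(sizes[np.argmax(score)])
    split = (xs[k - 1] + xs[k]) / 2
    return (split,
            fit_tree(xs[:k], ys[:k], depth - 1, min_leaf),
            fit_tree(xs[k:], ys[k:], depth - 1, min_leaf))

def predict(tree, x):
    if not isinstance(tree, tuple):
        return np.full(len(x), tree)
    split, left, right = tree
    out = np.empty(len(x))
    mask = x < split
    out[mask] = predict(left, x[mask])
    out[~mask] = predict(right, x[~mask])
    return out

def forest(x, y, x_test, depth, n_trees=100):
    """Bagged ensemble of depth-limited trees (a bare-bones RF)."""
    preds = np.zeros(len(x_test))
    for _ in range(n_trees):
        idx = rng.integers(0, len(x), len(x))       # bootstrap resample
        preds += predict(fit_tree(x[idx], y[idx], depth), x_test)
    return preds / n_trees

# Low signal-to-noise data: a weak trend buried in unit-variance noise
n = 300
x = rng.uniform(0, 1, n)
signal = lambda t: 0.2 * t
y = signal(x) + rng.normal(0, 1, n)
x_test = rng.uniform(0, 1, 1000)

mse_shallow = np.mean((forest(x, y, x_test, depth=2) - signal(x_test)) ** 2)
mse_deep = np.mean((forest(x, y, x_test, depth=12) - signal(x_test)) ** 2)
print(mse_shallow, mse_deep)
```

The deep ensemble tracks the noise and pays for it in test error, while the shallow one, acting as implicit regularization, sits closer to the weak signal.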
Citations: 10
A comparison of Gaussian processes and neural networks for computer model emulation and calibration
Pub Date : 2021-03-30 DOI: 10.1002/sam.11507
Samuel Myren, E. Lawrence
The Department of Energy relies on complex physics simulations for prediction in domains like cosmology, nuclear theory, and materials science. These simulations are often extremely computationally intensive, with some requiring days or weeks for a single run. To ensure their accuracy, these models are calibrated against observational data in order to estimate inputs and systematic biases. Because of their great computational complexity, this process typically requires the construction of an emulator, a fast approximation to the simulation. In this paper, two emulator approaches are compared: Gaussian process regression and neural networks. Their emulation accuracy and calibration performance on three real problems of Department of Energy interest are considered. On these problems, the Gaussian process emulator tends to be more accurate, with narrower but still well-calibrated uncertainty estimates. The neural network emulator is accurate but tends to have large uncertainty on its predictions. As a result, calibration with the Gaussian process emulator produces more constrained posteriors that still perform well in prediction.
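A minimal Gaussian process emulator can be sketched in a few lines. This is not the paper's setup: the "simulator" is a cheap stand-in function, the RBF kernel and its lengthscale are assumed rather than estimated, and a tiny nugget is added purely for numerical stability.

```python
import numpy as np

# Cheap stand-in for an expensive physics simulator
def simulator(x):
    return np.sin(3 * x) + 0.5 * x

# Small design: each simulator run is assumed to be costly
x_train = np.linspace(0, 2, 8)
y_train = simulator(x_train)

def rbf(a, b, length=0.5):
    """Squared-exponential (RBF) covariance between point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

# Zero-mean GP regression (kriging) with a small nugget
K = rbf(x_train, x_train) + 1e-8 * np.eye(len(x_train))
x_test = np.linspace(0, 2, 100)
Ks = rbf(x_test, x_train)
mean = Ks @ np.linalg.solve(K, y_train)            # emulator prediction
var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
sd = np.sqrt(np.maximum(var, 0.0))                 # predictive uncertainty

err = np.abs(mean - simulator(x_test))
print(err.max(), sd.max())
```

Eight runs suffice for an accurate interpolating surrogate here, and the posterior standard deviation collapses to near zero at the design points while widening between them, which is the uncertainty behavior that makes GP emulators attractive for calibration.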
Citations: 9
Approximation error of Fourier neural networks
Pub Date : 2021-03-23 DOI: 10.1002/sam.11506
Abylay Zhumekenov, Rustem Takhanov, Alejandro J. Castro, Z. Assylbekov
The paper investigates the approximation error of two-layer feedforward Fourier Neural Networks (FNNs). Such networks are motivated by the approximation properties of Fourier series. Several implementations of FNNs have been proposed since the 1980s: by Gallant and White, Silvescu, Tan, Zuo and Cai, and Liu. The main focus of our work is Silvescu's FNN, because its activation function does not fit into the category of networks in which a linearly transformed input is passed to the activation function; the latter were extensively described by Hornik. For the non-trivial Silvescu FNN, the convergence rate is proven to be of order O(1/n). The paper continues by investigating classes of functions approximated by the Silvescu FNN, which turn out to lie in the Schwartz space and the space of positive definite functions.
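The motivating connection to Fourier series can be made concrete with a tiny experiment. The sketch below is illustrative, not any of the specific FNN constructions cited above: a two-layer network with sinusoidal activations and fixed integer frequencies reduces to linear least squares on a Fourier basis, and its approximation error shrinks rapidly as terms are added.

```python
import numpy as np

def fourier_net_fit(x, y, n_terms):
    """Fit f_n(x) = a_0 + sum_k (a_k cos(kx) + b_k sin(kx)) by least
    squares; a two-layer net with sinusoidal activations and fixed
    integer frequencies is exactly this linear model."""
    cols = [np.ones_like(x)]
    for k in range(1, n_terms + 1):
        cols += [np.cos(k * x), np.sin(k * x)]
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

x = np.linspace(-np.pi, np.pi, 2000)
target = np.exp(np.sin(x))                   # smooth 2*pi-periodic target
err = [np.max(np.abs(fourier_net_fit(x, target, n) - target))
       for n in (2, 4, 8)]
print(err)
```

For a smooth periodic target the Fourier coefficients decay extremely fast, so the sup-norm error drops by orders of magnitude between 2, 4, and 8 frequency terms; the O(1/n) rate studied in the paper is a worst-case guarantee over much larger function classes.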
Citations: 0
Feature selection for imbalanced data with deep sparse autoencoders ensemble
Pub Date : 2021-03-22 DOI: 10.1002/sam.11567
M. Massi, F. Ieva, Francesca Gasperoni, A. Paganoni
Class imbalance is a common issue in many domain applications of learning algorithms. Oftentimes, in these domains it is much more relevant to correctly classify and profile minority-class observations. This need can be addressed by feature selection (FS), which offers several further advantages, such as decreasing computational costs and aiding inference and interpretability. However, traditional FS techniques may become suboptimal in the presence of strongly imbalanced data. To achieve the advantages of FS in this setting, we propose a filtering FS algorithm that ranks feature importance on the basis of the reconstruction error of a deep sparse autoencoders ensemble (DSAEE). We use each DSAE, trained only on the majority class, to reconstruct both classes. From the analysis of the aggregated reconstruction error, we determine the features for which the minority class presents a different distribution of values with respect to the overrepresented one, thus identifying the most relevant features for discriminating between the two. We empirically demonstrate the efficacy of our algorithm in several experiments, both simulated and on high-dimensional datasets of varying sample size, showcasing its capability to select relevant and generalizable features to profile and classify the minority class, outperforming other benchmark FS methods. We also briefly present a real application in radiogenomics, where the methodology was applied successfully.
Citations: 5
Generalized mixed-effects random forest: A flexible approach to predict university student dropout
Pub Date : 2021-03-09 DOI: 10.1002/sam.11505
Massimo Pellagatti, Chiara Masci, F. Ieva, A. Paganoni
We propose a new statistical method, called generalized mixed-effects random forest (GMERF), that extends the use of random forests to the analysis of hierarchical data, for any type of response variable in the exponential family. The method maintains the flexibility and the ability to model complex patterns within the data that are typical of tree-based ensemble methods, and it can handle both continuous and discrete covariates. At the same time, GMERF takes into account the nested structure of hierarchical data, modeling the dependence structure that exists at the highest level of the hierarchy and allowing statistical inference on this structure. In the case study, we apply GMERF to Higher Education data to analyze the university student dropout phenomenon. We predict engineering students' dropout probability by means of student-level information, considering the degree program students are enrolled in as the grouping factor.
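The mixed-effects-plus-flexible-learner idea behind MERF-style methods can be sketched for a Gaussian response. This is a simplification of GMERF: a binned-mean smoother stands in for the random forest, the random effects are plain group intercepts, and the two parts are estimated by alternating, a pattern the EM-style fitting of such models follows.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hierarchical data: observations nested in 20 groups (e.g. students
# within degree programs), each group with its own random intercept
g_count, n_per = 20, 50
groups = np.repeat(np.arange(g_count), n_per)
b = rng.normal(0, 1.0, g_count)              # true random intercepts
x = rng.uniform(0, 1, g_count * n_per)
y = np.sin(2 * np.pi * x) + b[groups] + 0.3 * rng.normal(size=len(x))

def base_learner(x, y, n_bins=20):
    """Cheap nonparametric fixed-effects fit (binned means), standing
    in for the random forest component."""
    bins = np.linspace(0, 1, n_bins + 1)
    to_bin = lambda v: np.clip(np.digitize(v, bins) - 1, 0, n_bins - 1)
    means = np.array([y[to_bin(x) == j].mean() for j in range(n_bins)])
    return lambda xn: means[to_bin(xn)]

# Alternate: fit the fixed part on y minus current random effects,
# then re-estimate the group effects from the residuals
b_hat = np.zeros(g_count)
for _ in range(10):
    f = base_learner(x, y - b_hat[groups])
    resid = y - f(x)
    b_hat = np.array([resid[groups == g].mean() for g in range(g_count)])

corr = np.corrcoef(b_hat, b)[0, 1]
print(corr)
```

After a few alternations the estimated group intercepts track the true ones closely (up to an overall shift absorbed by the fixed part), which is the separation of population-level and group-level structure that GMERF formalizes.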
Citations: 27
Model-based clustering of time-dependent categorical sequences with application to the analysis of major life event patterns
Pub Date : 2021-03-08 DOI: 10.1002/sam.11502
Yingying Zhang, Volodymyr Melnykov, Xuwen Zhu
Clustering categorical sequences is a problem that arises in many fields. A few techniques are available in this framework, but none of them takes into account the possibly temporal character of transitions from one state to another. A mixture of Markov models is proposed in which transition probabilities are represented as functions of time. The corresponding expectation–maximization algorithm is discussed along with related computational challenges. The effectiveness of the proposed procedure is illustrated in a set of simulation studies, in which it outperforms four alternative approaches. The method is applied to major life event sequences from the British Household Panel Survey. As reflected by the Bayesian Information Criterion, the proposed model demonstrates substantially better performance than its competitors. The analysis of the obtained results and related transition-probability plots reveals two groups of individuals: people with a conventional development of life course and those encountering some challenges.
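The key modeling device, transition probabilities that are functions of time, can be sketched directly. The example below is illustrative, not the paper's model: it uses two known components with logistic-in-time switching probabilities and performs only the E-step-style likelihood comparison, omitting the full EM fit.

```python
import numpy as np

rng = np.random.default_rng(6)

def trans_matrix(t, a):
    """Two-state transition matrix whose switching probability changes
    smoothly (logistically) with time t; a controls the direction."""
    p = 1.0 / (1.0 + np.exp(-a * (t - 5)))   # P(switch state) at time t
    return np.array([[1 - p, p], [p, 1 - p]])

def simulate(a, T=20):
    s = [0]
    for t in range(1, T):
        s.append(rng.choice(2, p=trans_matrix(t, a)[s[-1]]))
    return np.array(s)

def loglik(seq, a):
    return sum(np.log(trans_matrix(t, a)[seq[t - 1], seq[t]])
               for t in range(1, len(seq)))

# Two components: "early switchers" (a < 0) vs "late switchers" (a > 0)
seqs = [simulate(-1.0) for _ in range(50)] + [simulate(1.0) for _ in range(50)]
labels = np.array([0] * 50 + [1] * 50)

# Assign each sequence to the component with the higher likelihood
# (what the E-step does, here with known component parameters)
pred = np.array([int(loglik(s, 1.0) > loglik(s, -1.0)) for s in seqs])
acc = (pred == labels).mean()
print(acc)
```

Because the two components switch states in opposite halves of the observation window, the time-varying likelihoods separate them almost perfectly, whereas a time-homogeneous Markov model would see two components with similar overall switching rates.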
Citations: 2
Subsampling from features in large regression to find “winning features”
Pub Date : 2021-02-27 DOI: 10.1002/sam.11499
Yiying Fan, Jiayang Sun
Feature (or variable) selection from a large number of p features continuously challenges data science, especially for ever‐enlarging data and in discovering scientifically important features in a regression setting. For example, to develop valid drug targets for ovarian cancer, we must control the false‐discovery rate (FDR) of a selection procedure. The popular approach to feature selection in large‐p regression uses a penalized likelihood or a shrinkage estimation, such as a LASSO, SCAD, Elastic Net, or MCP procedure. We present a different approach called the Subsampling Winner algorithm (SWA), which subsamples from p features. The idea of SWA is analogous to selecting US National Merit Scholars: semifinalists are chosen based on students' performance in tests done at local schools (a.k.a. subsample analyses), and the finalists (a.k.a. winning features) are then determined from the semifinalists. Due to its subsampling nature, SWA can scale to data of any dimension. SWA also has the best‐controlled FDR compared to the penalized and Random Forest procedures while having a competitive true‐feature discovery rate. Our application of SWA to an ovarian cancer dataset revealed functionally important genes and pathways.
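The subsample-then-select idea in the abstract above can be sketched as follows. This is a simplified illustration, not the published SWA procedure: the paper's scoring rule and FDR control are replaced here by a plain OLS coefficient score, and all names and parameters are invented for the example.

```python
import numpy as np

def subsampling_winners(X, y, m=5, n_rounds=200, n_semifinalists=10, seed=0):
    """Sketch of a subsample-then-select feature screen (not the exact SWA).

    Repeatedly draw m of the p features at random, fit OLS on just that
    subsample ('tests at local schools'), and credit each drawn feature
    with the magnitude of its coefficient. The features with the best
    average scores become 'semifinalists', and a final OLS on the
    semifinalists ranks the 'winning features'."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    score_sum = np.zeros(p)
    score_cnt = np.zeros(p)
    for _ in range(n_rounds):
        idx = rng.choice(p, size=m, replace=False)   # one random feature subsample
        beta, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        score_sum[idx] += np.abs(beta)
        score_cnt[idx] += 1
    avg = score_sum / np.maximum(score_cnt, 1)       # average score per feature
    semis = np.argsort(avg)[-n_semifinalists:]       # top-scoring semifinalists
    beta, *_ = np.linalg.lstsq(X[:, semis], y, rcond=None)
    order = np.argsort(-np.abs(beta))                # final fit on semifinalists only
    return semis[order]                              # ranked winning features
```

Because each round touches only m features at a time, the screen never fits a full large-p regression, which is the property that lets a subsampling scheme scale in dimension.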
{"title":"Subsampling from features in large regression to find “winning features”","authors":"Yiying Fan, Jiayang Sun","doi":"10.1002/sam.11499","DOIUrl":"https://doi.org/10.1002/sam.11499","url":null,"abstract":"Feature (or variable) selection from a large number of p features continuously challenges data science, especially for ever‐enlarging data and in discovering scientifically important features in a regression setting. For example, to develop valid drug targets for ovarian cancer, we must control the false‐discovery rate (FDR) of a selection procedure. The popular approach to feature selection in large‐p regression uses a penalized likelihood or a shrinkage estimation, such as a LASSO, SCAD, Elastic Net, or MCP procedure. We present a different approach called the Subsampling Winner algorithm (SWA), which subsamples from p features. The idea of SWA is analogous to selecting US national merit scholars' that selects semifinalists based on student's performance in tests done at local schools (a.k.a. subsample analyses), and then determine the finalists (a.k.a. winning features) from the semifinalists. Due to its subsampling nature, SWA can scale to data of any dimension. SWA also has the best‐controlled FDR compared to the penalized and Random Forest procedures while having a competitive true‐feature discovery rate. Our application of SWA to an ovarian cancer data revealed functionally important genes and pathways.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114316481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0