
Latest publications in Statistical Analysis and Data Mining: The ASA Data Science Journal

Data‐driven sparse partial least squares
Pub Date : 2021-10-18 DOI: 10.1002/sam.11558
Hadrien Lorenzo, O. Cloarec, R. Thiébaut, J. Saracco
In supervised high-dimensional settings with a large number of variables and a small number of individuals, variable selection allows simpler interpretation and more reliable predictions. This subspace selection is often managed with supervised tools when the real question is motivated by variable prediction. We propose a partial least squares (PLS) based method, called data-driven sparse PLS (ddsPLS), that allows variable selection in both the covariate and the response parts using a single hyperparameter per component. The subspace estimation is also performed by tuning a number of underlying parameters. Through numerical simulations, the ddsPLS method is compared with existing methods such as classical PLS and two well-established sparse PLS methods. The observed results are promising in terms of both variable selection and prediction performance. The methodology is based on new prediction-quality descriptors associated with the classical R2 and Q2, and uses bootstrap sampling to tune parameters and select an optimal regression model.
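The sparsity mechanism can be illustrated with a soft-thresholded cross-covariance construction that is common in sparse PLS variants; this is a hedged sketch with a single threshold hyperparameter `lam` standing in for the per-component hyperparameter, not the authors' ddsPLS algorithm:

```python
import numpy as np

def sparse_pls_component(X, Y, lam):
    """One sparse PLS component: soft-threshold the leading singular
    vectors of the empirical cross-covariance X'Y. `lam` is an
    illustrative stand-in for the single per-component sparsity
    hyperparameter; this is not the ddsPLS algorithm itself."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    M = Xc.T @ Yc                                   # p x q cross-covariance
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0, :]
    u = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)  # soft-threshold
    v = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
    u = u / np.linalg.norm(u) if np.any(u) else u
    v = v / np.linalg.norm(v) if np.any(v) else v
    return u, v

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
# response driven by the first two covariates only
Y = X[:, :2].sum(axis=1, keepdims=True) + 0.1 * rng.normal(size=(50, 1))
u, v = sparse_pls_component(X, Y, lam=0.3)
print(np.flatnonzero(u))   # indices of covariates kept by the threshold
```

Variables whose cross-covariance weight falls below the threshold are dropped from the component, which is what makes the selected subspace interpretable.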
Citations: 2
Factor analysis for high‐dimensional time series: Consistent estimation and efficient computation
Pub Date : 2021-10-15 DOI: 10.1002/sam.11557
Qiang Xia, H. Wong, Shirun Shen, Kejun He
To deal with factor analysis for high-dimensional stationary time series, this paper suggests a novel method that integrates three ideas. First, based on the eigenvalues of a non-negative definite matrix, we propose a new approach for consistently determining the number of factors. The proposed method is computationally efficient, using a single-step procedure, especially when both weak and strong factors exist in the factor model. Second, a fresh measurement of the difference between the factor loading matrix and its estimate is recommended to overcome the nonidentifiability of the loading matrix due to geometric rotation. The asymptotic results of our proposed method are also studied under this measurement, which enjoys a "blessing of dimensionality." Finally, with the estimated factors, the latent vector autoregressive (VAR) model is analyzed such that the convergence rate of the estimated coefficients is as fast as when the samples of the VAR model are observed directly. In support of our results on consistency and computational efficiency, the finite-sample performance of the proposed method is examined by simulations and the analysis of a real data example.
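The eigenvalue-based idea can be conveyed with the classical eigenvalue-ratio criterion for choosing the number of factors; this is a generic sketch of that family of estimators applied to the sample covariance (an assumed choice of non-negative definite matrix), not the paper's exact single-step procedure:

```python
import numpy as np

def eigenvalue_ratio_factors(X, kmax=8):
    """Estimate the number of factors as the index that maximizes the
    ratio of consecutive eigenvalues of the sample covariance matrix.
    An illustrative criterion, not the paper's estimator."""
    S = np.cov(X, rowvar=False)
    w = np.sort(np.linalg.eigvalsh(S))[::-1]      # descending eigenvalues
    ratios = w[:kmax] / w[1:kmax + 1]
    return int(np.argmax(ratios)) + 1

rng = np.random.default_rng(1)
T, p, r = 200, 20, 3
F = rng.normal(size=(T, r))                       # latent factors
L = rng.normal(size=(p, r))                       # factor loadings
X = F @ L.T + 0.5 * rng.normal(size=(T, p))       # observed series
print(eigenvalue_ratio_factors(X))                # recovers r = 3 here
```

The ratio spikes at the gap between the (large) factor eigenvalues and the (small) noise eigenvalues, which is why the criterion is consistent when that gap exists.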
Citations: 2
High‐dimensional classification based on nonparametric maximum likelihood estimation under unknown and inhomogeneous variances
Pub Date : 2021-10-12 DOI: 10.1002/sam.11554
Hoyoung Park, Seungchul Baek, Junyong Park
We propose a new method for high-dimensional classification based on estimation of the high-dimensional mean vector under unknown and unequal variances. The proposed method is based on a semiparametric model that combines nonparametric and parametric models for the mean and variance, respectively. It is designed to be robust to the structure of the mean vector, whereas most existing methods are developed for specific cases such as either the sparse or the non-sparse case. In addition, we consider estimating the mean and variance separately under a nonparametric empirical Bayes framework, which has an advantage over existing nonparametric empirical Bayes classifiers based on standardization. We present simulation studies showing that our proposed method outperforms a variety of existing methods. Application to real data sets demonstrates the robustness of our method to various types of data, while all other methods produce either sensitive or poor results for different data sets.
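For orientation in the unequal-variance setting, a purely parametric baseline can be sketched: a diagonal Gaussian classifier with class- and feature-specific (inhomogeneous) variances. The paper's nonparametric empirical Bayes method refines this kind of estimate; everything below (data generation included) is an illustrative assumption:

```python
import numpy as np

def diag_gaussian_classify(X_train, y_train, X_test):
    """Classify by per-feature Gaussian log-likelihoods with
    class-specific, feature-specific variances. A parametric baseline,
    not the paper's nonparametric empirical Bayes method."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-6
        ll = -0.5 * (np.log(2 * np.pi * var) + (X_test - mu) ** 2 / var)
        scores.append(ll.sum(axis=1))              # independence across features
    return classes[np.argmax(scores, axis=0)]

rng = np.random.default_rng(5)
n, p = 100, 50
y = rng.integers(0, 2, n)
means = np.where(rng.random(p) < 0.2, 1.2, 0.0)   # sparse mean shifts
sd = rng.uniform(0.5, 2.0, p)                      # inhomogeneous variances
X = rng.normal(size=(n, p)) * sd + y[:, None] * means
pred = diag_gaussian_classify(X, y, X)
print(np.mean(pred == y))                          # training accuracy
```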
Citations: 0
Tracking clusters and anomalies in evolving data streams
Pub Date : 2021-10-08 DOI: 10.1002/sam.11552
Sreelekha Guggilam, V. Chandola, A. Patra
Data‐driven anomaly detection methods typically build a model for the normal behavior of the target system, and score each data instance with respect to this model. A threshold is invariably needed to identify data instances with high (or low) scores as anomalies. This presents a practical limitation on the applicability of such methods, since most methods are sensitive to the choice of the threshold, and it is challenging to set optimal thresholds. The issue is exacerbated in a streaming scenario, where the optimal thresholds vary with time. We present a probabilistic framework to explicitly model the normal and anomalous behaviors and probabilistically reason about the data. An extreme value theory based formulation is proposed to model the anomalous behavior as the extremes of the normal behavior. As a specific instantiation, a joint nonparametric clustering and anomaly detection algorithm (INCAD) is proposed that models the normal behavior as a Dirichlet process mixture model. Results on a variety of datasets, including streaming data, show that the proposed method provides effective and simultaneous clustering and anomaly detection without requiring strong initialization and threshold parameters.
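The extreme-value idea of modeling anomalies as the extremes of normal behavior can be sketched with a generic peaks-over-threshold fit: a generalized Pareto distribution is fit to the upper tail of scores from normal behavior, and new points are flagged by their estimated tail probability rather than a hand-picked score threshold. This is an illustrative EVT construction with assumed function names and defaults, not the INCAD algorithm:

```python
import numpy as np
from scipy.stats import genpareto

def fit_evt_tail(train_scores, tail_frac=0.10):
    """Fit a generalized Pareto model to the exceedances of anomaly
    scores from normal behavior over a high empirical quantile."""
    u = np.quantile(train_scores, 1.0 - tail_frac)
    excess = train_scores[train_scores > u] - u
    c, _, scale = genpareto.fit(excess, floc=0)    # MLE, location fixed at 0
    return u, c, scale, tail_frac

def evt_flag(scores, u, c, scale, tail_frac, alpha=1e-4):
    """Flag points whose estimated tail probability falls below alpha;
    no direct threshold on the raw score is needed."""
    tail_prob = tail_frac * genpareto.sf(scores - u, c, loc=0, scale=scale)
    return (scores > u) & (tail_prob < alpha)

rng = np.random.default_rng(2)
normal_scores = rng.normal(size=1000)              # scores under normal behavior
params = fit_evt_tail(normal_scores)
new = np.array([0.5, 2.0, 8.0])                    # 8.0 is a gross anomaly
print(evt_flag(new, *params))
```

In a streaming setting the tail fit would be updated as data arrive, which is how the time-varying optimal threshold problem is sidestepped.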
Citations: 3
Data Twinning
Pub Date : 2021-10-06 DOI: 10.1002/sam.11574
Akhil Vakayil, V. R. Joseph
In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation.
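A greedy nearest-neighbor pairing conveys the flavor of partitioning a dataset into statistically similar twins: each point and its closest unassigned neighbor are sent to opposite halves. This simplified sketch is in the spirit of Twinning but is not the published algorithm:

```python
import numpy as np
from scipy.spatial import cKDTree

def twin_split(X):
    """Partition the rows of X into two statistically similar halves by
    greedy nearest-neighbor pairing. A simplified sketch, not the
    Twinning algorithm itself."""
    n = len(X)
    unassigned = set(range(n))
    a, b = [], []
    tree = cKDTree(X)
    while len(unassigned) > 1:
        i = min(unassigned)
        _, idx = tree.query(X[i], k=n)             # neighbors by distance
        j = next(j for j in idx if j != i and j in unassigned)
        a.append(i); b.append(int(j))              # pair goes to opposite twins
        unassigned -= {i, int(j)}
    if unassigned:                                 # odd n: leftover to twin A
        a.append(unassigned.pop())
    return np.array(a), np.array(b)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
a, b = twin_split(X)
print(np.abs(X[a].mean(axis=0) - X[b].mean(axis=0)))  # twin means nearly coincide
```

Because paired points are close in feature space, the two halves have nearly identical empirical distributions, which is the property one wants in a training/testing split.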
Citations: 11
Regrouped design in privacy analysis for multinomial microdata
Pub Date : 2021-10-04 DOI: 10.1002/sam.11553
Shu-Mei Wan, Danny Wen-Yaw Chung, Monica Mayeni Manurung, Kwang-Hwa Chang, Chien-Hua Wu
In this paper, we address the dual goals of protecting privacy and making statistical inferences from disseminated data using the regrouped design. It is not difficult to protect the privacy of patients by perturbing data; the problem is to perturb the data in such a way that privacy is protected while the released data remain useful for research. Under the regrouped design, the dataset is released with dummy groups associated with the actual groups via a pre-specified transition probability matrix. Small stagnation probabilities are recommended for the regrouped design, to achieve a small disclosure risk and higher power in hypothesis testing. The power of the test statistic in the released data increases as the stagnation probabilities depart from 0.5. The disclosure risk can be reduced further if more quasi-identifiers are relocated. An example based on the National Health Insurance Research Database illustrates the use of the regrouped design to protect privacy while supporting statistical inference.
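The pre-specified transition matrix idea can be sketched as a randomized-response-style relabeling. This is a minimal sketch assuming a 3-group design with stagnation probability 0.2 on the diagonal, not the paper's full design:

```python
import numpy as np

def regroup(labels, P, seed=0):
    """Release dummy group labels: a record with actual group g is
    published as group g' with probability P[g, g']. A generic
    randomized-response sketch of the transition-matrix idea."""
    rng = np.random.default_rng(seed)
    k = P.shape[0]
    return np.array([rng.choice(k, p=P[g]) for g in labels])

# small stagnation probability: a record rarely keeps its own label
P = np.array([[0.2, 0.4, 0.4],
              [0.4, 0.2, 0.4],
              [0.4, 0.4, 0.2]])
labels = np.repeat([0, 1, 2], 1000)
dummy = regroup(labels, P)
# empirical transition frequencies approximate P
emp = np.array([[np.mean(dummy[labels == g] == h) for h in range(3)]
                for g in range(3)])
print(np.round(emp, 2))
```

Since P is known, the true group proportions can in principle be recovered from the released proportions by inverting P, which is what makes statistical inference from the perturbed data possible.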
Citations: 0
Residual's influence index (RINFIN), bad leverage and unmasking in high dimensional L2‐regression
Pub Date : 2021-09-26 DOI: 10.1002/sam.11550
Y. Yatracos
In linear regression of Y on X (∈ R^p) with parameters β (∈ R^(p+1)), statistical inference is unreliable when observations are obtained from the gross-error model Fϵ,G = (1 − ϵ)F + ϵG instead of the assumed probability F; here G is the gross-error distribution and 0 < ϵ < 1. The residual's influence index (RINFIN) at (x, y) is introduced, with components that also measure the local influence of x on the residual; a large value flags a bad leverage case (from G), thus enabling unmasking. Large-sample properties of RINFIN are presented to confirm the significance of the findings, though a large difference in the RINFIN scores of the data is often already indicative. RINFIN is successful with microarray data, simulated high-dimensional data, and classic regression data sets. Its performance improves as p increases, and it can be used in multiple-response linear regression.
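For orientation, the classical residual-and-leverage reasoning that RINFIN refines can be illustrated with Cook's distance flagging a planted bad-leverage case in simple linear regression; this shows only the classical baseline, not RINFIN itself:

```python
import numpy as np

# Simple regression with one planted gross-error point: high leverage in x
# and a response far from the true line (a "bad leverage" case from G).
rng = np.random.default_rng(6)
n = 50
x = rng.normal(size=n)
y = 2 * x + 0.1 * rng.normal(size=n)
x[0], y[0] = 5.0, -10.0                       # the contaminated observation

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
h = np.diag(H)                                # leverages
resid = y - H @ y
s2 = resid @ resid / (n - 2)
cooks = resid**2 / (2 * s2) * h / (1 - h)**2  # Cook's distance (p = 2)
print(int(np.argmax(cooks)))                  # index of the flagged case
```

Cook's distance combines a large residual with high leverage, exactly the combination that a bad leverage point exhibits; RINFIN extends this style of diagnostic to the high-dimensional setting.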
Citations: 0
Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets
Pub Date : 2021-09-13 DOI: 10.1002/sam.11549
J. Krepel, Magdalena Kircher, Moritz Kohls, K. Jung
High-dimensional gene expression data are regularly studied for their ability to separate different groups of samples by means of machine learning (ML) models, and a large number of such data sets are publicly available. Several approaches for meta-analysis on independent sets of gene expression data have been proposed, mainly focusing on feature selection, a typical step in fitting an ML model. Here, we compare different strategies for merging the information of such independent data sets to train a classifier model. Specifically, we compare the strategy of merging the data sets directly (strategy A) and the strategy of merging the classification results (strategy B). We use simulations with purely artificial data, as well as evaluations based on independent gene expression data from lung fibrosis studies, to compare the two merging approaches. In the simulations, the number of studies, the strength of batch effects, and the separability are varied. The comparison incorporates five standard ML techniques typically used for high-dimensional data: discriminant analysis, support vector machines, the least absolute shrinkage and selection operator, random forests, and artificial neural networks. Using cross-study validation, we found that direct data merging yields higher accuracies when training data from three or four studies are available, whereas merging the classification results performs better with only two training studies. In the evaluation with the lung fibrosis data, both strategies showed similar performance.
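The two strategies can be sketched on synthetic data: strategy A merges the study data before fitting one classifier, strategy B fits per-study classifiers and merges their predictions by majority vote. The data generator, batch shifts, and choice of logistic regression below are all illustrative assumptions, not the paper's simulation design:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

def make_study(n, shift):
    """Two-class expression-like data with a study-specific batch shift."""
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 5)) + 1.5 * y[:, None]   # class signal
    return X + shift, y                               # batch effect

# three training studies and one held-out study (cross-study validation)
studies = [make_study(100, s) for s in (0.0, 0.5, -0.5)]
X_test, y_test = make_study(200, 0.2)

# Strategy A: merge the data sets, then fit a single classifier
XA = np.vstack([X for X, _ in studies])
yA = np.hstack([y for _, y in studies])
acc_A = LogisticRegression().fit(XA, yA).score(X_test, y_test)

# Strategy B: fit per-study classifiers, merge predictions by majority vote
preds = [LogisticRegression().fit(X, y).predict(X_test) for X, y in studies]
vote = (np.mean(preds, axis=0) > 0.5).astype(int)
acc_B = float(np.mean(vote == y_test))
print(acc_A, acc_B)
```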
Citations: 1
Bayesian modeling of location, scale, and shape parameters in skew‐normal regression models
Pub Date : 2021-09-09 DOI: 10.1002/sam.11548
Martha Lucía Corrales, Edilberto Cepeda Cuervo
In this paper, we propose Bayesian skew‐normal regression models where the location, scale and shape parameters follow (linear or nonlinear) regression structures, and the variable of interest follows the Azzalini skew‐normal distribution. A Bayesian method is developed to fit the proposed models, using working variables to build the kernel transition functions. To illustrate the performance of the proposed Bayesian method and application of the model to analyze statistical data, we present results of simulated studies and of the application to studies of forced displacement in Colombia.
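The response distribution referenced above is the Azzalini skew-normal, whose standard density is f(z) = 2 φ(z) Φ(αz) with shape parameter α (location and scale enter through z = (y − ξ)/ω). A quick check of this definition against `scipy.stats.skewnorm`:

```python
import numpy as np
from scipy.stats import skewnorm, norm

# Azzalini skew-normal density: f(z) = 2 * phi(z) * Phi(alpha * z),
# where alpha is the shape parameter that the regression structure models.
alpha = 3.0
z = np.linspace(-3, 3, 7)
azzalini = 2 * norm.pdf(z) * norm.cdf(alpha * z)
print(np.allclose(skewnorm.pdf(z, alpha), azzalini))  # True
```

Positive α skews the density to the right and negative α to the left; α = 0 recovers the normal distribution, which is why separate regression structures on location, scale, and shape are natural for this family.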
Citations: 0
Predictive models with end user preference
Pub Date : 2021-08-26 DOI: 10.1002/sam.11545
Yifan Zhao, Xian Yang, Carolina Bolnykh, Steve Harenberg, Nodirbek Korchiev, Saavan Raj Yerramsetty, Bhanu Prasad Vellanki, Ramakanth Kodumagulla, N. Samatova
Classical machine learning models typically optimize with respect to the most discriminatory features of the data; however, they do not usually account for end-user preferences. In certain applications this can be a serious issue, as models unaware of user preferences can become costly, untrustworthy, or privacy-intrusive to use, and thus irrelevant and/or uninterpretable. Ideally, end users with domain knowledge could propose preferable features that the predictive model would then take into account. In this paper, we propose a generic modeling method that respects end-user preferences via a relative ranking system to express multi-criteria preferences, together with a regularization term in the model's objective function that incorporates the ranked preferences. More generally, this method can plug user preferences into existing predictive models without creating completely new ones. We implement this method in the context of decision trees and achieve comparable classification accuracy while reducing the use of undesirable features.
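The ranked-preference regularization can be caricatured with a single split decision in a tree: the adjusted score of each feature is its split gain minus a penalty proportional to the user's dispreference. A hedged sketch of the idea with made-up numbers, not the paper's objective:

```python
import numpy as np

def preference_aware_split_score(gains, pref_rank, lam=0.5):
    """Adjust per-feature split gains by a preference regularizer:
    `pref_rank` maps each feature to a dispreference in [0, 1]
    (0 = most preferred), and `lam` trades accuracy against preference.
    An illustrative sketch of the regularization idea."""
    return gains - lam * pref_rank

# feature 0 is slightly more informative but strongly dispreferred
gains = np.array([0.60, 0.55, 0.30])
pref_rank = np.array([1.0, 0.1, 0.5])
best = int(np.argmax(preference_aware_split_score(gains, pref_rank)))
print(best)  # regularization shifts the chosen split to feature 1
```

Without the penalty the split would use feature 0; with it, the nearly-as-informative but preferred feature 1 wins, which is how comparable accuracy can be kept while undesirable features are avoided.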
Citations: 1