Statistical Analysis and Data Mining: The ASA Data Science Journal: Latest Articles

Doubly robust estimation for non‐probability samples with modified intertwined probabilistic factors decoupling
Pub Date : 2023-02-18 DOI: 10.1002/sam.11614
Zhanxu Liu, Junbo Zheng, Yingli Pan
In recent years, non‐probability samples, such as web survey samples, have become increasingly popular in many fields, but they may be subject to selection bias, which makes inference from them difficult. Doubly robust (DR) estimation is one approach to making inferences from non‐probability samples. When many covariates are available, variable selection becomes important in DR estimation. In this paper, a new DR estimator for the finite population mean is constructed, where intertwined probabilistic factors decoupling (IPAD) and a modified IPAD are used to select important variables in the propensity score model and the outcome superpopulation model, respectively. Unlike traditional variable selection approaches, such as the adaptive least absolute shrinkage and selection operator and smoothly clipped absolute deviation, IPAD and the modified IPAD not only select important variables and estimate parameters but also control the false discovery rate, which can yield more accurate population estimators. Asymptotic theory and variance estimation for the DR estimator with the modified IPAD are established. Results from simulation studies indicate that the proposed estimator performs well. We apply the proposed method to the analysis of the Pew Research Center data and the Behavioral Risk Factor Surveillance System data.
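For orientation, one commonly used form of a doubly robust estimator of a finite population mean adds inverse-propensity-weighted residuals from the non-probability sample to outcome-model predictions averaged over a reference probability sample. The sketch below assumes that the propensities and the outcome-model predictions have already been estimated. The function name, the argument names, and the ratio-type normalisation are illustrative assumptions, not the paper's exact estimator, and the IPAD variable-selection step is not shown.

```python
def dr_mean(y_np, m_np, pi_np, m_ref, d_ref):
    """Doubly robust estimate of a finite population mean (illustrative form).

    y_np   observed outcomes in the non-probability sample
    m_np   outcome-model predictions for those same units
    pi_np  estimated participation propensities for those same units
    m_ref  outcome-model predictions for the reference probability sample
    d_ref  design weights of the reference probability sample
    """
    # Estimated population sizes implied by each set of weights.
    n_hat_np = sum(1.0 / p for p in pi_np)
    n_hat_ref = sum(d_ref)
    # Propensity-weighted residuals correct the bias of the outcome model.
    residual_term = sum((y - m) / p for y, m, p in zip(y_np, m_np, pi_np)) / n_hat_np
    # Design-weighted mean of the model predictions over the reference sample.
    prediction_term = sum(d * m for d, m in zip(d_ref, m_ref)) / n_hat_ref
    return residual_term + prediction_term
```

If the outcome model fits perfectly, the residual term vanishes and the estimate is the design-weighted mean of the predictions; if the predictions are all zero, the estimate reduces to a propensity-weighted mean of the observed outcomes. This is the sense in which the estimator leans on either model.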
Citations: 0
Estimation of disease progression for ischemic heart disease using latent Markov with covariates
Pub Date : 2023-02-01 DOI: 10.1002/sam.11589
Zarina Oflaz, Ceylan Yozgatlıgil, A. S. Selcuk-Kestel
Contemporaneous monitoring of disease progression, in addition to early diagnosis, is important for the treatment of patients with chronic conditions. Chronic disease‐related factors are not easily tractable, and the existing data sets do not clearly reflect them, making diagnosis difficult. The primary issue is that databases maintained by health care, insurance, or governmental organizations typically do not contain clinical information and instead focus on patient appointments and demographic profiles. Due to the lack of thorough information on potential risk factors for a single patient, investigations on the nature of disease are imprecise. We suggest the use of a latent Markov model with variables in a latent process because it enables the panel analysis of many forms of data. The purpose of this study is to evaluate unobserved factors in ischemic heart disease (IHD) using longitudinal data from electronic health records. Based on the results we designate states as healthy, light, moderate, and severe to represent stages of disease progression. This study demonstrates that gender, patient age, and hospital visit frequency are all significant factors in the development of the disease. Females acquire IHD more rapidly than males, frequently developing from moderate and severe disease. In addition, it demonstrates that individuals under the age of 20 bypass the light state of IHD and proceed directly to the moderate state.
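To make the latent Markov idea concrete, the sketch below computes the likelihood of an observed sequence with the standard forward recursion, using four latent states named after the stages in the abstract. All probabilities are hypothetical numbers chosen for illustration, the observation is reduced to a single binary indicator (0 = no hospital visit in a period, 1 = visit), and covariates are left out entirely, so this is not the model fitted in the study.

```python
STATES = ["healthy", "light", "moderate", "severe"]

# Hypothetical initial, transition, and emission probabilities (rows sum to 1).
INIT = [0.70, 0.20, 0.08, 0.02]
TRANS = [
    [0.85, 0.10, 0.04, 0.01],
    [0.05, 0.80, 0.12, 0.03],
    [0.01, 0.04, 0.80, 0.15],
    [0.00, 0.01, 0.09, 0.90],
]
EMIT = [  # P(observation | state) for observations 0 and 1
    [0.9, 0.1],
    [0.7, 0.3],
    [0.4, 0.6],
    [0.1, 0.9],
]


def forward_likelihood(init, trans, emit, obs):
    """Probability of the observed sequence, summed over all latent paths."""
    n_states = len(init)
    # alpha[s] = P(observations so far, current latent state = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


likelihood = forward_likelihood(INIT, TRANS, EMIT, [0, 0, 1])
```

In a model with covariates, the entries of `INIT` and `TRANS` would depend on patient characteristics such as gender or age instead of being fixed constants.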
Citations: 2
Adaptive boosting for ordinal target variables using neural networks
Pub Date : 2023-01-26 DOI: 10.1002/sam.11613
Insung Um, Geonseok Lee, K. Lee
Boosting has proven effective across a wide range of classification problems, largely by increasing the diversity of base classifiers. In practice, target variables in classification are often derived from numerical variables and therefore carry ordinal information. However, existing boosting algorithms for classification cannot exploit such ordinal target variables, which leads to suboptimal solutions. In this paper, we propose a novel ordinal encoding adaptive boosting (AdaBoost) algorithm that uses a multi‐dimensional encoding scheme for ordinal target variables. Extending the original binary‐class AdaBoost, the proposed algorithm is equipped with a multi‐class exponential loss function. We show that it achieves the Bayes classifier and establishes forward stagewise additive modeling. We demonstrate the performance of the proposed algorithm with a neural network as the base learner. Our experiments show that it outperforms existing boosting algorithms on various ordinal datasets.
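As background for the extension described above, the sketch below shows one round of the classical binary AdaBoost update that the abstract takes as its starting point: the classifier weight is computed from the weighted error, and the sample weights are rescaled under the exponential loss. The function name is an illustrative choice, and neither the ordinal encoding nor the multi‐class loss of the proposed algorithm is reproduced here.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One round of binary AdaBoost.

    Labels and predictions are +1 or -1; `weights` must sum to 1.
    Returns the classifier weight alpha and the renormalised sample weights.
    """
    # Weighted error of the current base classifier.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Exponential-loss reweighting: misclassified points gain weight.
    unnormalised = [
        w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)
    ]
    total = sum(unnormalised)
    return alpha, [w / total for w in unnormalised]


alpha, new_weights = adaboost_round(
    [0.25, 0.25, 0.25, 0.25], [1, 1, -1, -1], [1, -1, -1, -1]
)
```

After the update, the misclassified points carry exactly half of the total weight, so the next base learner cannot do well by repeating the previous one.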
Citations: 0
Bilateral‐Weighted Online Adaptive Isolation Forest for anomaly detection in streaming data
Pub Date : 2023-01-14 DOI: 10.1002/sam.11612
Gabor Hannak, G. Horváth, Attila Kádár, Márk Dániel Szalai
We propose a method called Bilateral‐Weighted Online Adaptive Isolation Forest (BWOAIF) for unsupervised anomaly detection based on Isolation Forest (IF), which is applicable to streaming data and able to cope with concept drift. Similar to IF, the proposed method has only a few hyperparameters, whose effects on performance are intuitive to interpret and therefore easy to tune. BWOAIF ingests data and classifies it as normal or anomalous, and simultaneously adapts its classifier by removing old trees as well as by creating new ones. We show that BWOAIF adapts gradually to slow concept drift and, at the same time, adapts quickly to sudden changes in the data distribution. Numerical results show the efficacy of the proposed algorithm and its ability to learn different classes of concept drift, such as slow/fast concept shift, concept split, concept appearance, and concept disappearance.
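The general mechanism of adapting a forest by removing old trees and creating new ones can be sketched with a plain sliding window over isolation trees. The code below is a deliberately simplified stand-in: it handles one-dimensional data only, builds one tree per incoming batch, and drops the oldest tree once a fixed capacity is reached. The class name and all parameters are hypothetical, and the bilateral weighting that gives BWOAIF its name is not implemented.

```python
import math
import random
from collections import deque

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful search in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth=0, max_depth=10):
    """Recursively split 1-D points at a uniformly random value."""
    if len(points) <= 1 or depth >= max_depth or min(points) == max(points):
        return ("leaf", len(points))
    split = rng.uniform(min(points), max(points))
    left = [p for p in points if p < split]
    right = [p for p in points if p >= split]
    return ("node", split,
            build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))


def path_length(tree, x, depth=0):
    """Depth at which x lands, plus the usual correction for unsplit leaves."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, split, left, right = tree
    return path_length(left if x < split else right, x, depth + 1)


class SlidingWindowForest:
    """Keeps the most recent `max_trees` trees; older trees are discarded."""

    def __init__(self, max_trees=40, seed=0):
        self.trees = deque(maxlen=max_trees)  # appending past capacity drops the oldest
        self.rng = random.Random(seed)
        self.batch_size = None

    def update(self, batch):
        """Grow one new tree from the latest batch of observations."""
        self.batch_size = len(batch)
        self.trees.append(build_tree(list(batch), self.rng))

    def score(self, x):
        """Isolation Forest anomaly score in (0, 1); higher means more anomalous."""
        mean_path = sum(path_length(t, x) for t in self.trees) / len(self.trees)
        return 2.0 ** (-mean_path / c_factor(self.batch_size))
```

A typical loop would call `update` on each new batch and `score` on each incoming point; because only recent trees survive, the scores gradually follow the current data distribution.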
Citations: 0
Specifying composites in structural equation modeling: A refinement of the Henseler–Ogasawara specification
Pub Date : 2023-01-05 DOI: 10.1002/sam.11608
Xi Yu, Florian Schuberth, J. Henseler
Structural equation modeling (SEM) plays an important role in business and social science and so do composites, that is, linear combinations of variables. However, existing approaches to integrate composites into structural equation models still have limitations. A major leap forward has been the Henseler–Ogasawara (H–O) specification, which for the first time allows for seamlessly integrating composites into structural equation models. In doing so, it relies on emergent variables, that is, the composite of interest, and one or more orthogonal excrescent variables, that is, composites that have no surplus meaning but just span the remaining space of the emergent variable's components. Although the H–O specification enables researchers to flexibly model composites in SEM, it comes along with several practical problems: (i) The H–O specification is difficult to visualize graphically; (ii) its complexity could create difficulties for analysts, and (iii) at times SEM software packages seem to encounter convergence issues with it. In this paper, we present a refinement of the original H–O specification that addresses these three problems. In this new specification, only two components load on each excrescent variable, whereas the excrescent variables are allowed to covary among themselves. This results in a simpler graphical visualization. Additionally, researchers facing convergence issues of the original H–O specification are provided with an alternative specification. Finally, we illustrate the new specification's application by means of an empirical example and provide guidance on how (standardized) weights including their standard errors can be calculated in the R package lavaan. The corresponding Mplus model syntax is provided in the Supplementary Material.
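Since a composite is a linear combination of observed variables, its variance follows directly from the weights and the covariance matrix of its components. The sketch below rescales a weight vector so that the resulting composite has unit variance, which is one common convention for reporting standardized weights. It is plain arithmetic with hypothetical function names and example values; it is neither lavaan syntax nor the H–O specification itself.

```python
import math


def composite_variance(weights, cov):
    """Variance of sum_i w_i * x_i given the covariance matrix of the x_i."""
    k = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j] for i in range(k) for j in range(k))


def unit_variance_weights(weights, cov):
    """Rescale the weights so that the composite has variance 1."""
    scale = math.sqrt(composite_variance(weights, cov))
    return [w / scale for w in weights]


# Hypothetical example: two standardized indicators correlated at 0.5.
cov = [[1.0, 0.5], [0.5, 1.0]]
scaled = unit_variance_weights([1.0, 1.0], cov)
```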
Citations: 3
Model selection with bootstrap validation
Pub Date : 2023-01-04 DOI: 10.1002/sam.11606
Rafael Savvides, Jarmo Mäkelä, K. Puolamäki
Model selection is one of the most central tasks in supervised learning. Validation set methods are the standard way to accomplish this task: models are trained on training data, and the model with the smallest loss on the validation data is selected. However, it is generally not obvious how much validation data is required to make a reliable selection, which is essential when labeled data are scarce or expensive. We propose a bootstrap‐based algorithm, bootstrap validation (BSV), that uses the bootstrap to adjust the validation set size and to find the best‐performing model within a tolerance parameter specified by the user. We find that BSV works well in practice and can be used as a drop‐in replacement for validation set methods or k‐fold cross‐validation. The main advantage of BSV is that less validation data is typically needed, so more data can be used to train the model, resulting in better approximations and efficient use of validation data.
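The core ingredient, using the bootstrap to judge how reliable a validation-set selection is, can be illustrated as follows. Given per-example validation losses for each candidate model, the sketch picks the model with the smallest mean loss and then estimates how often that model stays within a tolerance of the best model when the validation set is resampled with replacement. The function name, the dictionary layout, and the default parameters are assumptions made for illustration; the actual BSV algorithm, including how it adjusts the validation set size, is not reproduced.

```python
import random


def bootstrap_selection_confidence(losses, tolerance=0.0, n_boot=200, seed=0):
    """Select the model with the lowest mean validation loss and bootstrap its stability.

    losses     dict mapping model name -> list of per-example validation losses
               (all lists refer to the same validation examples, in the same order)
    tolerance  how far above the best resampled mean loss the selected model may be
    Returns (selected model, fraction of resamples in which it is within tolerance).
    """
    rng = random.Random(seed)
    names = list(losses)
    n = len(losses[names[0]])

    def mean_loss(name, indices):
        return sum(losses[name][i] for i in indices) / len(indices)

    chosen = min(names, key=lambda name: mean_loss(name, range(n)))
    hits = 0
    for _ in range(n_boot):
        # Paired bootstrap: the same resampled examples are used for every model.
        indices = [rng.randrange(n) for _ in range(n)]
        means = {name: mean_loss(name, indices) for name in names}
        if means[chosen] <= min(means.values()) + tolerance:
            hits += 1
    return chosen, hits / n_boot


example = {"model_a": [0.1] * 20, "model_b": [0.9] * 20}
selected, confidence = bootstrap_selection_confidence(example)
```

A low confidence value would suggest that the validation set is too small to separate the candidates at the chosen tolerance, which is the situation in which enlarging it is worthwhile.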
Citations: 1
Hierarchy‐assisted gene expression regulatory network analysis
Pub Date : 2023-01-04 DOI: 10.1002/sam.11609
Han Yan, Sanguo Zhang, Shuangge Ma
Gene expressions have been extensively studied in biomedical research. With gene expression, network analysis, which takes a system perspective and examines the interconnections among genes, has been established as highly important and meaningful. In the construction of gene expression networks, a commonly adopted technique is high‐dimensional regularized regression. Network construction can be unadjusted (which focuses on gene expressions only) and adjusted (which also incorporates regulators of gene expressions), and the two types of construction have different implications and can be equally important. In this article, we propose a variable selection hierarchy to connect the unadjusted regression‐based network construction with the adjusted construction that incorporates two or more types of regulators. This hierarchy is sensible and amounts to additional information for both constructions, thus having the potential of improving variable selection and estimation. An effective computational algorithm is developed, and extensive simulation demonstrates the superiority of the proposed construction over multiple closely relevant alternatives. The analysis of TCGA data further demonstrates the practical utility of the proposed approach.
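Regression-based network construction of the kind mentioned above typically regresses each gene on the remaining genes under a sparsity penalty and draws an edge wherever a coefficient is nonzero. The sketch below shows the basic building block, a lasso fitted by cyclic coordinate descent, on a tiny made-up design. It is an ordinary lasso with hypothetical function names; the variable selection hierarchy linking the unadjusted and adjusted constructions is not implemented.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0.0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, kept up to date after every update
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm_sq = sum(c * c for c in col) / n
            if norm_sq == 0.0:
                continue
            # Correlation of column j with the partial residual that excludes column j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / norm_sq
            delta = new_beta - beta[j]
            residual = [r - c * delta for c, r in zip(col, residual)]
            beta[j] = new_beta
    return beta


# Made-up data: the response depends strongly on column 0 and barely on column 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.05, -1.95, 1.95, -2.05]
beta = lasso_cd(X, y, lam=0.1)
```

In a network setting, the nonzero entries of `beta` for a given response gene would be read as its selected neighbours; here the weak second predictor is shrunk to exactly zero.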
Citations: 0
Robust deep neural network surrogate models with uncertainty quantification via adversarial training
Pub Date : 2023-01-04 DOI: 10.1002/sam.11610
Lixiang Zhang, Jia Li
Surrogate models have been used to emulate mathematical simulators of physical or biological processes for computational efficiency. High‐speed simulation is crucial for conducting uncertainty quantification (UQ) when the simulation must repeat over many randomly sampled input points (aka the Monte Carlo method). A simulator can be so computationally intensive that UQ is only feasible with a surrogate model. Recently, deep neural network (DNN) surrogate models have gained popularity for their state‐of‐the‐art emulation accuracy. However, it is well‐known that DNN is prone to severe errors when input data are perturbed in particular ways, the very phenomenon which has inspired great interest in adversarial training. In the case of surrogate models, the concern is less about a deliberate attack exploiting the vulnerability of a DNN but more of the high sensitivity of its accuracy to input directions, an issue largely ignored by researchers using emulation models. In this paper, we show the severity of this issue through empirical studies and hypothesis testing. Furthermore, we adopt methods in adversarial training to enhance the robustness of DNN surrogate models. Experiments demonstrate that our approaches significantly improve the robustness of the surrogate models without compromising emulation accuracy.
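The sensitivity to input direction described above can be seen even in a toy setting. The sketch below applies a fast-gradient-sign style perturbation to the input of a linear stand-in surrogate: every coordinate moves by a small step in the direction that increases the squared error. The linear model, the function names, and the numbers are assumptions for illustration; the abstract does not specify which adversarial training method the authors adopt, and a real surrogate would be a DNN with gradients obtained by automatic differentiation.

```python
def predict(w, b, x):
    """Linear stand-in for a surrogate model: f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_perturb(w, b, x, y, eps):
    """Shift each coordinate of x by eps in the direction that raises the squared error."""
    error = predict(w, b, x) - y
    # Gradient of (f(x) - y)^2 with respect to x for the linear model above.
    grad = [2.0 * error * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


w, b = [1.0, -2.0], 0.0
x, y = [1.0, 1.0], 0.0
x_adv = fgsm_perturb(w, b, x, y, eps=0.1)
clean_error = (predict(w, b, x) - y) ** 2
adversarial_error = (predict(w, b, x_adv) - y) ** 2
```

Adversarial training then adds such perturbed inputs, paired with the simulator's output, to the training set so that the surrogate's accuracy degrades less along these worst-case directions.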
Citations: 0
Semi‐supervised multi‐label learning with missing labels by exploiting feature‐label correlations
Pub Date : 2022-12-31 DOI: 10.1002/sam.11607
Runxin Li, Xuefeng Zhao, Zhenhong Shang, Lianyin Jia
Most multi‐label learning techniques in use today presuppose that enough fully labeled instances are available. In real‐world applications, however, each training instance frequently comes with only partial labels, because obtaining a fully labeled training set is time‐consuming, laborious, or expensive. Multi‐label learning with missing labels therefore has greater practical value. In this paper, we propose a new semi‐supervised multi‐label learning method (SMLMFC) that specifically addresses missing‐label scenarios. After filling in the missing labels of instances using two‐stage label correlations, SMLMFC trains a semi‐supervised multi‐label classifier by imposing feature‐label correlation constraints directly on the label outputs. In particular, the complex relationships between features and labels can be learned and implicitly captured through feature‐label correlations. Experimental results on a number of real‐world multi‐label datasets confirm that SMLMFC is highly competitive with other state‐of‐the‐art methods.
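To illustrate what filling in missing labels from label correlations can look like, the sketch below uses a very simple rule: a missing label is imputed from how often it co-occurs with the labels that are observed as positive for the same instance. This is a hypothetical stand-in with made-up names and a made-up threshold, not the two‐stage procedure of SMLMFC, and it ignores the features entirely.

```python
def fill_missing_labels(Y, threshold=0.5):
    """Fill None entries of a 0/1 label matrix using pairwise co-occurrence rates.

    Y is a list of rows; each entry is 1, 0, or None (missing).
    Entries with no usable evidence are left as None.
    """
    n_labels = len(Y[0])

    def cond_rate(j, k):
        # Estimate P(label k = 1 | label j = 1) from rows where k is observed.
        observed = [row[k] for row in Y if row[j] == 1 and row[k] is not None]
        return sum(observed) / len(observed) if observed else None

    filled = [list(row) for row in Y]
    for i, row in enumerate(Y):
        for k in range(n_labels):
            if row[k] is not None:
                continue
            # Collect evidence from every label observed as positive in this row.
            rates = [cond_rate(j, k) for j in range(n_labels) if j != k and row[j] == 1]
            rates = [r for r in rates if r is not None]
            if rates:
                filled[i][k] = 1 if sum(rates) / len(rates) >= threshold else 0
    return filled


Y = [[1, 1], [1, 1], [1, None], [0, None]]
completed = fill_missing_labels(Y)
```

In the example, the third instance receives the second label because the two labels always co-occur where both are observed, while the fourth instance has no positive label to draw evidence from and stays missing.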
Citations: 0
Simplicial depth and its median: Selected properties and limitations
Pub Date : 2022-12-02 DOI: 10.1002/sam.11605
Stanislav Nagy
Depth functions are important tools of nonparametric statistics that extend orderings, ranks, and quantiles to the setup of multivariate data. We revisit the classical definition of the simplicial depth and explore its theoretical properties when evaluated with respect to datasets or measures that do not necessarily possess a symmetric density. Recent advances from discrete geometry are used to refine the results about the robustness and continuity of the simplicial depth and its induced multivariate median. Further, we compute the exact simplicial depth in several scenarios and point out some undesirable behavior: (i) the simplicial depth does not have to be maximized at the center of symmetry of the distribution, (ii) it is not necessarily unimodal, and can possess local extremes, and (iii) the sets of the induced multivariate medians or other central regions do not have to be connected.
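As a rough illustration of the quantity discussed in this abstract, the sketch below computes the classical sample simplicial depth in the plane: the fraction of closed triangles, formed by triples of data points, that contain a query point. It is a brute-force sketch under stated assumptions, not the paper's own computation; the function names `simplicial_depth` and `in_closed_triangle` are hypothetical, and the data points are assumed to be in general position (no three collinear).

```python
from itertools import combinations


def _orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_closed_triangle(x, a, b, c):
    """True if x lies inside or on the boundary of triangle abc.

    Assumes a, b, c are not collinear (general position).
    """
    d1, d2, d3 = _orient(a, b, x), _orient(b, c, x), _orient(c, a, x)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # x is inside (or on an edge) when the three orientations do not disagree in sign.
    return not (has_neg and has_pos)


def simplicial_depth(x, points):
    """Fraction of closed triangles spanned by triples of `points` that contain x."""
    triples = list(combinations(points, 3))
    if not triples:
        raise ValueError("need at least three data points")
    hits = sum(in_closed_triangle(x, a, b, c) for a, b, c in triples)
    return hits / len(triples)


if __name__ == "__main__":
    square = [(0, 0), (4, 0), (0, 4), (4, 4)]
    print(simplicial_depth((1, 2), square))    # point inside the square
    print(simplicial_depth((10, 10), square))  # point far outside
```

The enumeration over all triples costs O(n^3) time, so this is suitable only for small samples; it is meant to make the definition concrete rather than to be efficient.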
Citations: 1
Journal
Statistical Analysis and Data Mining: The ASA Data Science Journal