
Wiley Interdisciplinary Reviews-Computational Statistics: Latest Articles

Stability estimation for unsupervised clustering: A review.
IF 4.4 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-11-01 Epub Date: 2022-01-09 Vol. 14, Issue 6, e1575 DOI: 10.1002/wics.1575
Tianmou Liu, Han Yu, Rachael Hageman Blair

Cluster analysis remains one of the most challenging yet fundamental tasks in unsupervised learning. This is due in part to the fact that there are no labels or gold standards by which performance can be measured. Moreover, the wide range of available clustering methods is governed by different objective functions, different parameters, and different dissimilarity measures. Clustering serves many purposes, often playing critical roles in the early stages of exploratory data analysis and as an endpoint for knowledge and discovery. Thus, understanding the quality of a clustering is of critical importance. The concept of stability has emerged as a strategy for assessing the performance and reproducibility of data clustering. The key idea is to produce perturbed data sets that are very close to the original, and cluster them. If the clustering is stable, then the clusters from the original data will be preserved in the perturbed data clustering. The nature of the perturbation and the methods for quantifying similarity between clusterings are nontrivial, and they are ultimately what set many of the stability estimation methods apart. In this review, we provide an overview of the very active research area of cluster stability estimation and discuss some of the open questions and challenges that remain in the field. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification.
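The perturb-and-compare idea described in this abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the review: the perturbation is random subsampling, the similarity measure is the Rand index, and the clustering method is a toy one-dimensional 2-means (`two_means_1d`, `rand_index`, and `stability` are hypothetical names).

```python
import random
from itertools import combinations


def two_means_1d(xs, iters=20):
    """Toy 2-means on 1-D data, initialised at the min and max (deterministic)."""
    c0, c1 = min(xs), max(xs)
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign each point to the nearer centre, then recompute the centres.
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (together vs. apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(xs, n_perturb=20, frac=0.8, seed=0):
    """Mean Rand index between the original clustering and clusterings of subsamples."""
    rng = random.Random(seed)
    base = two_means_1d(xs)
    scores = []
    for _ in range(n_perturb):
        idx = rng.sample(range(len(xs)), int(frac * len(xs)))  # perturbed data set
        sub_labels = two_means_1d([xs[i] for i in idx])
        scores.append(rand_index([base[i] for i in idx], sub_labels))
    return sum(scores) / len(scores)


# Two well-separated groups: the clustering should be preserved under subsampling.
data = [0.1 * i for i in range(30)] + [10 + 0.1 * i for i in range(30)]
print(stability(data))
```

A score close to 1 suggests that the clusters found in the original data reappear in the perturbed data. Other perturbations (bootstrap, added noise) and other similarity measures could be swapped in, and that choice is where the methods surveyed in the review differ.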

Citations: 0
A survey of numerical algorithms that can solve the Lasso problems
IF 3.2 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-10-24 DOI: 10.1002/wics.1602
Yujie Zhao, X. Huo
In statistics, the least absolute shrinkage and selection operator (Lasso) is a regression method that performs both variable selection and regularization. A large literature discusses the statistical properties of the regression coefficients estimated by the Lasso. However, a comprehensive review of the algorithms that solve the Lasso optimization problem is lacking. In this review, we summarize five representative algorithms for optimizing the Lasso objective function: the iterative shrinkage-thresholding algorithm (ISTA), the fast iterative shrinkage-thresholding algorithm (FISTA), the coordinate gradient descent algorithm (CGDA), the smooth L1 algorithm (SLA), and the path following algorithm (PFA). We also compare their convergence rates and their potential strengths and weaknesses.
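As a rough illustration of the first algorithm named above, here is a minimal ISTA sketch in pure Python. It assumes the objective (1/2)·||y − Xb||² + lam·||b||₁ and a fixed step size 1/L, with L taken as the trace of XᵀX (an upper bound on its largest eigenvalue). The scaling convention, the step-size rule, and the function names are assumptions of this sketch, not details from the survey.

```python
def soft_threshold(z, thr):
    """Proximal operator of thr * |.|: shrink z toward zero by thr."""
    if z > thr:
        return z - thr
    if z < -thr:
        return z + thr
    return 0.0


def ista(X, y, lam, n_iter=500):
    """ISTA for (1/2)*||y - X b||^2 + lam*||b||_1, with X given as a list of rows."""
    n, p = len(X), len(X[0])
    # trace(X^T X) bounds the largest eigenvalue of X^T X, so 1/L is a safe step.
    L = sum(v * v for row in X for v in row)
    t = 1.0 / L  # assumes X is not identically zero
    b = [0.0] * p
    for _ in range(n_iter):
        resid = [sum(X[i][j] * b[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        # Gradient step on the smooth part, then soft-thresholding.
        b = [soft_threshold(b[j] - t * grad[j], lam * t) for j in range(p)]
    return b


# With an identity design the Lasso solution is the soft-thresholded response.
print(ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0))
```

With an orthonormal design the iterates approach `soft_threshold(y_j, lam)` in each coordinate, which makes this a convenient sanity check. FISTA adds a momentum step on top of the same update.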
Citations: 4
Data validation and statistical issues such as power and other considerations in genome-wide association study (GWAS)
IF 3.2 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-10-06 DOI: 10.1002/wics.1601
Makoto Tomita
A series of steps in genomic data analysis is presented. For data validation, starting with marker quality control, the review covers population structure problems arising from ethnic populations, genome-wide significance levels, Manhattan plots, and Haploview. Statistical issues such as power, sample size calculation, false discovery rate, and QQ plots of p-values are also introduced.
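Two of the multiple-testing ideas mentioned above can be sketched briefly. The block assumes a Bonferroni-style genome-wide threshold (alpha divided by the number of tests) and the Benjamini-Hochberg step-up procedure for false discovery rate control. These are standard choices, but the abstract does not say which procedures the article itself uses, and the p-values below are made up for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject flags in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank k whose ordered p-value satisfies p_(k) <= k * q / m.
        if pvals[i] <= rank * q / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# One million tests at alpha = 0.05 gives the conventional 5e-8 threshold.
print(bonferroni_threshold(0.05, 1_000_000))

# Illustrative (made-up) p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bh_reject(pvals, q=0.05))
```

The Bonferroni threshold controls the chance of any false positive across all markers, while Benjamini-Hochberg controls the expected proportion of false positives among the rejections and is usually less conservative.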
Citations: 0
On unbiasedness and biasedness of the Wilcoxon and some nonparametric tests
IF 3.2 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-09-23 DOI: 10.1002/wics.1600
H. Murakami, Seong-Keon Lee
In many fields of application, the underlying distribution is unknown and cannot be assumed to follow a specific parametric form such as the normal distribution. Nonparametric statistical methods are preferable in these cases. Nonparametric hypothesis tests have been among the most widely used statistical procedures for nearly a century, and the power of a test is an important property of such procedures. This review discusses the unbiasedness of nonparametric tests. Among nonparametric tests, the best-known, the Wilcoxon–Mann–Whitney (WMW) test, combines robustness with good power. The WMW test is therefore widely used for inference on the location parameter. This review mainly investigates the unbiasedness and biasedness of the WMW test for the location-parameter family of distributions. An overview of historical developments, detailed discussions, and work on the unbiasedness/biasedness of several nonparametric tests is presented, with references to numerous studies. Finally, we conclude with a discussion of the unbiasedness/biasedness of nonparametric test procedures. This article is categorized under: Statistical and Graphical Methods of Data Analysis > Nonparametric Methods.
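For readers unfamiliar with the WMW test, the following sketch computes the Mann-Whitney U statistic and its large-sample normal approximation. It assumes two independent samples and ignores the tie correction to the variance. It illustrates only the test statistic, not the unbiasedness results the review discusses, and the function names are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """U for sample x: number of pairs with x_i > y_j, counting ties as 1/2."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def wmw_z(x, y):
    """Normal approximation to U under the null (no tie correction)."""
    m, n = len(x), len(y)
    mean = m * n / 2.0
    var = m * n * (m + n + 1) / 12.0
    return (mann_whitney_u(x, y) - mean) / math.sqrt(var)


x = [1, 2, 3]
y = [4, 5, 6]
print(mann_whitney_u(x, y), mann_whitney_u(y, x))  # the two U values sum to m * n
print(wmw_z(x, y))
```

A strongly negative or positive z suggests a location shift between the two samples. For small samples an exact permutation distribution would be preferable to the normal approximation.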
Citations: 0
A review of recent advances in empirical likelihood
IF 3.2 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-09-20 DOI: 10.1002/wics.1599
Pang-Chi Liu, Yichuan Zhao
Empirical likelihood is widely used in many statistical problems. In this article, we review the empirical likelihood method in light of its significant development in recent years. Since its introduction, variants of empirical likelihood have been proposed, and its applications in high dimensions have also been studied, so a summary of these new developments is needed. We review the Bayesian empirical likelihood, the bias-corrected empirical likelihood, the jackknife empirical likelihood, the adjusted empirical likelihood, the extended empirical likelihood, the transformed empirical likelihood, the mean empirical likelihood, and the empirical likelihood in high dimensions. Finally, we briefly survey the computation and implementation of empirical likelihood methods.
Citations: 5
Sequential Monte Carlo optimization and statistical inference
IF 3.2 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-09-20 DOI: 10.1002/wics.1598
J. Duan, Shuping Li, Yaxian Xu
Sequential Monte Carlo (SMC) is a powerful technique originally developed for particle filtering and Bayesian inference. As a generic optimizer for statistical and nonstatistical objectives, its role is far less known. Density‐tempered SMC is a highly efficient sampling technique ideally suited for challenging global optimization problems and is implementable with a somewhat arbitrary initialization sampler instead of relying on a prior distribution. SMC optimization is anchored at the fact that all optimization tasks (continuous, discontinuous, combinatorial, or noisy objective function) can be turned into sampling under a density or probability function short of a norming constant. The point with the highest functional value is the SMC estimate for the maximum. Through examples, we systematically present various density‐tempered SMC algorithms and their superior performance vs. other techniques like Markov Chain Monte Carlo. Data cloning and k‐fold duplication are two easily implementable accuracy accelerators, and their complementarity is discussed. The Extreme Value Theorem on the maximum order statistic can also help assess the quality of the SMC optimum. Our coverage includes the algorithmic essence of the density‐tempered SMC with various enhancements and solutions for (1) a bi‐modal nonstatistical function without and with constraints, (2) a multidimensional step function, (3) offline and online optimizations, (4) combinatorial variable selection, and (5) noninvertibility of the Hessian.
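The idea of turning an optimization task into sampling can be sketched as follows. This is a simplified one-dimensional density-tempered SMC written for illustration only. The uniform initialization sampler, the geometric temperature schedule, multinomial resampling, and the random-walk Metropolis move are all assumptions of this sketch rather than the algorithms presented in the article, and `smc_maximize` is a hypothetical name.

```python
import math
import random
import statistics


def smc_maximize(f, lo, hi, n=300, gammas=None, mh_steps=3, seed=0):
    """Maximise f on [lo, hi] by sampling from densities proportional to exp(gamma * f)."""
    rng = random.Random(seed)
    if gammas is None:
        gammas = [0.05 * 2 ** k for k in range(11)]  # increasing temperatures (assumed)
    xs = [rng.uniform(lo, hi) for _ in range(n)]  # initialization sampler
    prev = 0.0
    for g in gammas:
        # Reweight by exp((g - prev) * f(x)); subtracting the max avoids overflow.
        fx = [f(x) for x in xs]
        top = max(fx)
        weights = [math.exp((g - prev) * (v - top)) for v in fx]
        # Multinomial resampling in proportion to the weights.
        xs = rng.choices(xs, weights=weights, k=n)
        # Move step: random-walk Metropolis targeting exp(g * f) restricted to [lo, hi].
        step = statistics.pstdev(xs) or 1e-3
        for _ in range(mh_steps):
            for i in range(n):
                prop = xs[i] + rng.gauss(0.0, step)
                if lo <= prop <= hi:
                    log_ratio = g * (f(prop) - f(xs[i]))
                    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                        xs[i] = prop
        prev = g
    # The particle with the highest function value is the estimate of the maximiser.
    return max(xs, key=f)


x_best = smc_maximize(lambda x: -(x - 2.0) ** 2, -10.0, 10.0)
print(x_best)
```

As the temperature parameter grows, the tempered density concentrates around the global maximum, so the particle population should gather near x = 2 for this test function. The normalizing constant is never needed because only weight ratios and Metropolis ratios are used.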
Citations: 0
Issue Information
IF 3.2 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-09-01 DOI: 10.1111/papq.12360
Citations: 0
Cluster analysis: A modern statistical review
IF 3.2 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-08-19 DOI: 10.1002/wics.1597
Adam Jaeger, David Banks
Cluster analysis is a big, sprawling field. This review paper cannot hope to fully survey the territory. Instead, it focuses on hierarchical agglomerative clustering, k‐means clustering, mixture models, and then several related topics of which any cluster analysis practitioner should be aware. Even then, this review cannot do justice to the chosen topics. There is a lot of literature, and often it is somewhat ad hoc. That is generally the nature of cluster analysis—each application requires a bespoke analysis. Nonetheless, clustering has proven itself to be incredibly useful as an exploratory data analysis tool in biology, advertising, recommender systems, and genomics.
Citations: 8
Robust regression using probabilistically linked data
IF 3.2 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-07-07 DOI: 10.1002/wics.1596
R. Chambers, E. Fabrizi, M. Ranalli, N. Salvati, Suojin Wang
There is growing interest in a data integration approach to survey sampling, particularly where population registers are linked for sampling and subsequent analysis. The reason for doing this is simple: it is only by linking the same individuals in the different sources that it becomes possible to create a data set suitable for analysis. But data linkage is not error free. Many linkages are nondeterministic, based on how likely a linking decision corresponds to a correct match, that is, it brings together the same individual in all sources. High quality linking will ensure that the probability of this happening is high. Analysis of the linked data should take account of this additional source of error when this is not the case. This is especially true for secondary analysis carried out without access to the linking information, that is, the often confidential data that agencies use in their record matching. We describe an inferential framework that allows for linkage errors when sampling from linked registers. After first reviewing current research activity in this area, we focus on secondary analysis and linear regression modeling, including the important special case of estimation of subpopulation and small area means. In doing so we consider both robustness and efficiency of the resulting linked data inferences.
Citations: 1
SAREV: A review on statistical analytics of single-cell RNA sequencing data.
IF 3.2 Zone 2 (Mathematics) Q1 STATISTICS & PROBABILITY Pub Date: 2022-07-01 Epub Date: 2021-05-20 Vol. 14, Issue 4 DOI: 10.1002/wics.1558
Dorothy Ellis, Dongyuan Wu, Susmita Datta

Due to the development of next-generation RNA sequencing (NGS) technologies, there has been tremendous progress in research involving determining the role of genomics, transcriptomics and epigenomics in complex biological systems. However, scientists have realized that information obtained using earlier technology, frequently called 'bulk RNA-seq' data, provides information averaged across all the cells present in a tissue. Relatively newly developed single cell (scRNA-seq) technology allows us to provide transcriptomic information at a single-cell resolution. Nevertheless, these high-resolution data have their own complex natures and demand novel statistical data analysis methods to provide effective and highly accurate results on complex biological systems. In this review, we cover many such recently developed statistical methods for researchers wanting to pursue scRNA-seq statistical and computational research as well as scientific research about these existing methods and free software tools available for their generated data. This review is certainly not exhaustive due to page limitations. We have tried to cover the popular methods starting from quality control to the downstream analysis of finding differentially expressed genes and concluding with a brief description of network analysis.

Citations: 2