
Journal of Data Science, Statistics, and Visualisation: latest articles

Casting multiple shadows: interactive data visualisation with tours and embeddings
Pub Date : 2022-05-30 DOI: 10.52933/jdssv.v2i3.21
Stuart Lee, U. Laa, D. Cook
Non-linear dimensionality reduction (NLDR) methods such as t-distributed stochastic neighbour embedding (t-SNE) are ubiquitous in the natural sciences; however, the appropriate use of these methods is difficult because of their complex parameterisations, and analysts must make trade-offs in order to identify structure in the visualisation of an NLDR technique. We present visual diagnostics for the pragmatic usage of NLDR methods by combining them with a technique called the tour. A tour is a sequence of interpolated linear projections of multivariate data onto a lower dimensional space. The sequence is displayed as a dynamic visualisation, allowing a user to see the shadows the high-dimensional data casts in a lower dimensional view. By linking the tour to an NLDR view, we can preserve global structure and, through user interactions like linked brushing, observe where the NLDR view may be misleading. We display several case studies, from both simulations and single-cell transcriptomics, that show our approach is useful for cluster orientation tasks. The implementation of our framework is available as an R package called liminal at https://github.com/sa-lee/liminal.
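The idea of a tour as "a sequence of interpolated linear projections" can be sketched in a few lines of Python. This is only an illustration and not the liminal package's implementation: the function names are hypothetical, and the interpolation is a simplification (a linear blend of two bases followed by re-orthonormalisation) rather than the geodesic interpolation that tour software typically uses.

```python
import numpy as np

def random_basis(p, d, rng):
    """Return a p x d matrix with orthonormal columns (a projection basis)."""
    q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return q

def interpolate_bases(start, end, n_frames):
    """Yield orthonormal bases that move from `start` to `end`.

    Simplifying assumption: blend the two bases linearly, then
    re-orthonormalise each blend with a QR decomposition.
    Assumes every blend has full column rank.
    """
    for t in np.linspace(0.0, 1.0, n_frames):
        blended = (1.0 - t) * start + t * end
        q, r = np.linalg.qr(blended)
        # Flip column signs so the diagonal of R is positive; this makes
        # the first and last frames equal `start` and `end` exactly.
        yield q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
data = rng.standard_normal((100, 5))   # 100 observations, 5 variables
start = random_basis(5, 2, rng)
end = random_basis(5, 2, rng)

# Each frame is a 2-D "shadow" of the 5-D data.
frames = [data @ basis for basis in interpolate_bases(start, end, 10)]
```

Plotting the frames in order gives the animated view; linking each point to its position in a t-SNE layout is the part the paper adds on top.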
Citations: 0
INTEREST: INteractive Tool for Exploring REsults from Simulation sTudies
Pub Date : 2021-12-31 DOI: 10.52933/jdssv.v1i4.9
Alessandro Gasparini, Tim P Morris, Michael J Crowther

Simulation studies allow us to explore the properties of statistical methods. They provide a powerful tool with a multiplicity of aims, among others: evaluating and comparing new or existing statistical methods, assessing violations of modelling assumptions, helping with the understanding of statistical concepts, and supporting the design of clinical trials. The increased availability of powerful computational tools and usable software has contributed to the rise of simulation studies in the current literature. However, simulation studies involve increasingly complex designs, making it difficult to provide all relevant results clearly. Dissemination of results plays a focal role in simulation studies: it can drive applied analysts to use methods that have been shown to perform well in their settings, guide researchers to develop new methods in a promising direction, and provide insights into less established methods. It is crucial that we can digest relevant results of simulation studies. Therefore, we developed INTEREST: an INteractive Tool for Exploring REsults from Simulation sTudies. The tool has been developed using the Shiny framework in R and is available as a web app or as a standalone package. It requires uploading a tidy format dataset with the results of a simulation study in R, Stata, SAS, SPSS, or comma-separated format. A variety of performance measures are estimated automatically along with Monte Carlo standard errors; results and performance summaries are displayed both in tabular and graphical fashion, with a wide variety of available plots. Consequently, the reader can focus on simulation parameters and estimands of most interest. In conclusion, INTEREST can facilitate the investigation of results from simulation studies and supplement the reporting of results, allowing researchers to share detailed results from their simulations and readers to explore them freely.
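To make "performance measures ... along with Monte Carlo standard errors" concrete, here is a minimal Python sketch for two common measures, bias and confidence-interval coverage. It uses the textbook formulas; the function names are hypothetical, and it is an assumption that INTEREST computes these two measures in the same way.

```python
import math
import statistics

def bias_with_mcse(estimates, true_value):
    """Bias of the estimates and its Monte Carlo standard error.

    bias = mean(estimates) - true_value
    MCSE = sample standard deviation / sqrt(number of repetitions)
    """
    n = len(estimates)
    bias = statistics.fmean(estimates) - true_value
    mcse = statistics.stdev(estimates) / math.sqrt(n)
    return bias, mcse

def coverage_with_mcse(lowers, uppers, true_value):
    """Share of intervals containing the true value, with its MCSE.

    MCSE = sqrt(coverage * (1 - coverage) / number of repetitions)
    """
    n = len(lowers)
    hits = sum(lo <= true_value <= hi for lo, hi in zip(lowers, uppers))
    coverage = hits / n
    mcse = math.sqrt(coverage * (1.0 - coverage) / n)
    return coverage, mcse

# Toy results from five repetitions of a hypothetical simulation.
bias, bias_mcse = bias_with_mcse([0.9, 1.1, 1.0, 1.2, 0.8], true_value=1.0)
cov, cov_mcse = coverage_with_mcse(
    [0.5, 1.1, 0.7, 0.9], [1.5, 1.6, 1.2, 1.4], true_value=1.0
)
```

A real study would apply these per method and per data-generating scenario, which is the grouping a tidy results dataset makes easy.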

Citations: 3
On Generalization and Computation of Tukey's Depth: Part II
Pub Date : 2021-12-15 DOI: 10.52933/jdssv.v2i2.61
Yiyuan She, Shao Tang, Jingze Liu
This paper studies how to generalize Tukey's depth to problems defined in a restricted space that may be curved or have boundaries, and to problems with a nondifferentiable objective. First, using a manifold approach, we propose a broad class of Riemannian depth for smooth problems defined on a Riemannian manifold, and showcase its applications in spherical data analysis, principal component analysis, and multivariate orthogonal regression. Moreover, for nonsmooth problems, we introduce additional slack variables and inequality constraints to define a novel slacked data depth, which can perform center-outward rankings of estimators arising from sparse learning and reduced rank regression. Real data examples illustrate the usefulness of some proposed data depths.  
Citations: 2
On Generalization and Computation of Tukey's Depth: Part I
Pub Date : 2021-12-15 DOI: 10.52933/jdssv.v2i1.23
Yiyuan She, S. Tang, Jingze Liu
Tukey's depth offers a powerful tool for nonparametric inference and estimation, but also encounters serious computational and methodological difficulties in modern statistical data analysis. This paper studies how to generalize and compute Tukey-type depths in multi-dimensions. A general framework of influence-driven polished subspace depth, which emphasizes the importance of the underlying influence space and discrepancy measure, is introduced. The new matrix formulation enables us to utilize state-of-the-art optimization techniques to develop scalable algorithms with implementation ease and guaranteed fast convergence. In particular, half-space depth as well as regression depth can now be computed much faster than previously possible, with the support from extensive experiments. A companion paper is also offered to the reader in the same issue of this journal.
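For readers unfamiliar with half-space (Tukey) depth, the quantity itself is easy to state: the depth of a point is the smallest fraction of observations contained in any closed half-space whose boundary passes through that point. The Python sketch below approximates it by searching over random directions. It is a baseline for intuition only and is not the optimisation-based algorithm the paper proposes; the function name and defaults are hypothetical.

```python
import numpy as np

def approx_halfspace_depth(point, data, n_directions=2000, seed=0):
    """Approximate the Tukey half-space depth of `point` within `data`.

    For each random direction u, compute the fraction of observations x
    with u . (x - point) >= 0, then take the minimum over directions.
    Because only finitely many directions are tried, the result is an
    upper bound on the exact depth.
    """
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((n_directions, data.shape[1]))
    # Signed projections of each centred observation on each direction;
    # only the signs matter, so the directions need not be normalised.
    projections = (data - point) @ directions.T   # shape (n, n_directions)
    fractions = (projections >= 0).mean(axis=0)
    return float(fractions.min())

rng = np.random.default_rng(1)
cloud = rng.standard_normal((500, 2))
depth_centre = approx_halfspace_depth(np.zeros(2), cloud)
depth_far = approx_halfspace_depth(np.array([10.0, 10.0]), cloud)
```

A point near the centre of the cloud gets a depth close to one half, while a point far outside the cloud gets depth zero, which is what makes depth usable as a centre-outward ranking.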
Citations: 3
Editorial Founding Issue
Pub Date : 2021-09-30 DOI: 10.52933/jdssv.v1i1.52
S. Aelst, P. Groenen
The Journal of Data Science, Statistics, and Visualisation (JDSSV) is an electronic journal which welcomes contributions to data science, statistics, and visualisation, and in particular, those aspects which link and integrate these subject areas. Articles can cover topics such as machine learning and statistical learning, the visualisation and verbalisation of data, visual analytics, big data infrastructures and analytics, interactive learning, and advanced computing. Articles that discuss two or more research areas of the journal are favoured. Scientific contributions should be of a high standard. Articles should be oriented towards a wide scientific audience of statisticians, data scientists, computer scientists, data analysts, etc. The journal welcomes original contributions that are not being considered for publication elsewhere and contain a high level of novelty. Articles with a thorough but concise review of a certain topic with the potential to provide new insights are also welcome. Manuscripts submitted to the journal generally are accompanied by supplementary material containing software code, data, technical derivations or detailed explanations, additional examples, etc. All submitted material will be reviewed by the assigned associate editor and reviewers of the manuscript.
Citations: 0
A Spatial SEIR Model for COVID-19 in South Africa
Pub Date : 2021-06-09 DOI: 10.20944/PREPRINTS202106.0262.V1
I. Fabris-Rotelli, Jenny P. Holloway, Zaid Kimmie, S. Archibald, P. Debba, Raeesa Manjoo-Docrat, A. Roux, Nontembeko Dudeni-Tlhone, Charl Janse van Rensburg, R. Thiede, N. Abdelatif, Sibusisiwe Makhanya, Arminn Potgieter
The virus SARS-CoV-2 has resulted in numerous modelling approaches arising rapidly to understand the spread of the disease COVID-19 and to plan for future interventions. Herein, we present an SEIR model with a spatial spread component as well as four infectious compartments to account for the variety of symptom levels and transmission rate. The model takes into account the pattern of spatial vulnerability in South Africa through a vulnerability index that is based on socioeconomic and health susceptibility characteristics. Another spatially relevant factor in this context is level of mobility throughout. The thesis of this study is that without the contextual spatial spread modelling, the heterogeneity in COVID-19 prevalence in the South African setting would not be captured. The model is illustrated on South African COVID-19 case counts and hospitalisations.
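As background for the abstract above, the following Python sketch simulates the basic SEIR model (susceptible, exposed, infectious, removed) for a single region with one infectious compartment. It is deliberately much simpler than the paper's model, which adds a spatial spread component and four infectious compartments; the function name, parameter values, and the forward-Euler time stepping are illustrative assumptions.

```python
def simulate_seir(s, e, i, r, beta, sigma, gamma, days, dt=0.1):
    """Simulate a basic SEIR model with forward-Euler steps.

    beta  : transmission rate
    sigma : rate of leaving the exposed state (1 / latent period)
    gamma : rate of leaving the infectious state (1 / infectious period)
    Returns a list of (S, E, I, R) tuples, one per time step.
    """
    n = s + e + i + r                      # total population, held constant
    history = [(s, e, i, r)]
    for _ in range(int(round(days / dt))):
        new_exposed = beta * s * i / n * dt
        new_infectious = sigma * e * dt
        new_removed = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        history.append((s, e, i, r))
    return history

# Illustrative values only; they are not estimates from the paper.
history = simulate_seir(990.0, 0.0, 10.0, 0.0,
                        beta=0.5, sigma=0.2, gamma=0.1, days=100)
```

A spatial version would run one such system per region and couple the regions through mobility, which is where a vulnerability index and mobility data could enter.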
Citations: 3
A Review of Containerization for Interactive and Reproducible Analysis
Pub Date : 2021-03-30 DOI: 10.52933/jdssv.v3i1.53
Gregory J. Hunt, Johann A. Gagnon-Bartsch
In recent decades the analysis of data has become increasingly computational. Correspondingly, this has changed how scientific and statistical work is shared. For example, it is now commonplace for underlying analysis code and data to be proffered alongside journal publications and conference talks. Unfortunately, sharing code faces several challenges. First, it is often difficult to take code from one computer and run it on another. Code configuration, version, and dependency issues often make this challenging. Secondly, even if the code runs, it is often hard to understand or interact with the analysis. This makes it difficult to assess the code and its findings, for example, in a peer review process. In this review we describe the combination of two computing technologies that help make analyses shareable, interactive, and completely reproducible. These technologies are (1) analysis containerization, which leverages virtualization to fully encapsulate analysis, data, code and dependencies into an interactive and shareable format, and (2) code notebooks, a literate programming format for interacting with analyses. The fusion of these two technologies offers significant advantages over using either individually. This review surveys how the combination enhances the accessibility and reproducibility of code, analyses, and ideas.
Citations: 0
Robust Model-Based Clustering
Pub Date : 2021-02-13 DOI: 10.1201/b18358-20
Juan D. González, R. Maronna, V. Yohai, R. Zamar
We propose a class of Fisher-consistent robust estimators for mixture models. These estimators are then used to build a robust model-based clustering procedure. We study in detail the case of multivariate Gaussian mixtures and propose an algorithm, similar to the EM algorithm, to compute the proposed estimators and build the robust clusters. An extensive Monte Carlo simulation study shows that our proposal outperforms other state-of-the-art model-based clustering procedures, both robust and non-robust. We apply our proposal to a real data set and show that again it outperforms alternative procedures.
Citations: 0
Handling Cellwise Outliers by Sparse Regression and Robust Covariance
Pub Date : 2020-12-07 DOI: 10.52933/jdssv.v1i3.18
Jakob Raymaekers, P. Rousseeuw
We propose a data-analytic method for detecting cellwise outliers. Given a robust covariance matrix, outlying cells (entries) in a row are found by the cellFlagger technique which combines lasso regression with a stepwise application of constructed cutoff values. The penalty term of the lasso has a physical interpretation as the total distance that suspicious cells need to move in order to bring their row into the fold. For estimating a cellwise robust covariance matrix we construct a detection-imputation method which alternates between flagging outlying cells and updating the covariance matrix as in the EM algorithm. The proposed methods are illustrated by simulations and on real data about volatile organic compounds in children.
Citations: 8
Compressed sensing with a jackknife and a bootstrap
Pub Date : 2018-09-18 DOI: 10.52933/jdssv.v2i4.43
Aaron Defazio, M. Tygert, Rachel A. Ward, Jure Zbontar
Compressed sensing proposes to reconstruct more degrees of freedom in a signal than the number of values actually measured (based on a potentially unjustified regularizer or prior distribution). Compressed sensing therefore risks introducing errors -- inserting spurious artifacts or masking the abnormalities that medical imaging seeks to discover. Estimating errors using the standard statistical tools of a jackknife and a bootstrap yields "error bars" in the form of full images that are remarkably qualitatively representative of the actual errors (at least when evaluated and validated on data sets for which the ground truth and hence the actual error is available). These images show the structure of possible errors -- without recourse to measuring the entire ground truth directly -- and build confidence in regions of the images where the estimated errors are small. Further visualizations and summary statistics can aid in the interpretation of such error estimates. Visualizations include suitable colorizations of the reconstruction, as well as the obvious "correction" of the reconstruction by subtracting off the error estimates. The canonical summary statistic would be the root-mean-square of the error estimates. Unfortunately, colorizations appear likely to be too distracting for actual clinical practice in medical imaging, and the root-mean-square gets swamped by background noise in the error estimates. Fortunately, straightforward displays of the error estimates and of the "corrected" reconstruction are illuminating, and the root-mean-square improves greatly after mild blurring of the error estimates; the blurring is barely perceptible to the human eye yet smooths away background noise that would otherwise overwhelm the root-mean-square.
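The jackknife and the bootstrap mentioned above are general resampling tools, and their mechanics can be shown on a one-dimensional sample. The Python sketch below estimates the standard error of an arbitrary estimator both ways. It illustrates the resampling idea only: the paper applies it to image reconstruction from undersampled measurements, which this toy example does not attempt, and the function names are hypothetical.

```python
import numpy as np

def jackknife_se(sample, estimator):
    """Jackknife standard error: recompute the estimator with each
    observation left out in turn, then scale the spread of the results."""
    n = len(sample)
    leave_one_out = np.array(
        [estimator(np.delete(sample, k)) for k in range(n)]
    )
    centred = leave_one_out - leave_one_out.mean()
    return float(np.sqrt((n - 1) / n * np.sum(centred ** 2)))

def bootstrap_se(sample, estimator, n_boot=2000, seed=0):
    """Bootstrap standard error: recompute the estimator on samples
    drawn with replacement, then take the standard deviation."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    replicates = np.array(
        [estimator(sample[rng.integers(0, n, n)]) for _ in range(n_boot)]
    )
    return float(replicates.std(ddof=1))

rng = np.random.default_rng(42)
sample = rng.normal(size=50)
se_jack = jackknife_se(sample, np.mean)
se_boot = bootstrap_se(sample, np.mean)
```

For the sample mean, the jackknife value coincides with the usual standard deviation divided by the square root of the sample size, which is a convenient sanity check before using either routine on a more complicated estimator.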
Citations: 4