Persistent homology (PH) is one of the most popular tools in topological data analysis (TDA), while graph theory has had a significant impact on data science. Our earlier work introduced the persistent spectral graph (PSG) theory as a unified multiscale paradigm to encompass TDA and geometric analysis. In PSG theory, families of persistent Laplacian matrices (PLMs) corresponding to various topological dimensions are constructed via a filtration to sample a given dataset at multiple scales. The harmonic spectra from the null spaces of PLMs offer the same topological invariants, namely persistent Betti numbers, at various dimensions as those provided by PH, while the non-harmonic spectra of PLMs give rise to additional geometric analysis of the shape of the data. In this work, we develop an open-source software package, called highly efficient robust multidimensional evolutionary spectra (HERMES), to enable broad applications of PSGs in science, engineering, and technology. To ensure the reliability and robustness of HERMES, we have validated the software with simple geometric shapes and complex datasets from three-dimensional (3D) protein structures. We found that the smallest non-zero eigenvalues are very sensitive to data abnormality.
HERMES: Persistent spectral graph software. Rui Wang, Rundong Zhao, Emily Ribando-Gros, Jiahui Chen, Yiying Tong, Guo-Wei Wei. Foundations of Data Science, 3(1):67-97, March 2021. DOI: 10.3934/fods.2021006. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8411887/pdf/nihms-1717421.pdf
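The relationship between harmonic spectra and Betti numbers described above can be illustrated in the 0-dimensional case with nothing more than a graph Laplacian built at a single filtration radius. The sketch below (illustrative function names, not the HERMES implementation) counts zero eigenvalues to obtain β0 and reports the smallest non-zero eigenvalue, the quantity found to be sensitive to data abnormality:

```python
import numpy as np

def laplacian_spectrum(points, radius):
    """Spectrum of the 0-dimensional (graph) Laplacian at one filtration
    radius: vertices are the points, edges join pairs within `radius`."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) <= radius:
                A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))

def betti0_and_fiedler(points, radius, tol=1e-8):
    """Harmonic part (multiplicity of eigenvalue 0) gives beta_0; the
    smallest non-zero eigenvalue carries the geometric information."""
    ev = laplacian_spectrum(points, radius)
    betti0 = int(np.sum(ev < tol))
    nonzero = ev[ev >= tol]
    return betti0, (float(nonzero[0]) if nonzero.size else None)
```

For two well-separated pairs of points and a small radius this yields β0 = 2; once the radius exceeds the gap, the components merge and β0 drops to 1.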
Emmanuel Fleurantin, C. Sampson, Daniel P. Maes, Justin P. Bennett, Tayler Fernandes-Nunez, S. Marx, G. Evensen
The disparity in the impact of COVID-19 on minority populations in the United States has been well established in the available data on deaths, case counts, and adverse outcomes. However, critical metrics used by public health officials and epidemiologists, such as a time-dependent viral reproductive number ($R_t$), can be hard to calculate from these data, especially for individual populations. Furthermore, disparities in the availability of testing, record-keeping infrastructure, or government funding in disadvantaged populations can produce incomplete data sets. In this work, we apply ensemble data assimilation techniques, which optimally combine model and data, to produce a more complete data set providing better estimates of the critical metrics used by public health officials and epidemiologists. We employ a multi-population SEIR (Susceptible, Exposed, Infected, and Recovered) model with a time-dependent reproductive number and an age-stratified contact rate matrix for each population. We assimilate the daily death data for populations separated by ethnic/racial groupings using a technique called Ensemble Smoothing with Multiple Data Assimilation (ESMDA) to estimate model parameters and produce an $R_t(n)$ for the $n^{th}$ population. We do this with three distinct approaches: (1) using the same contact matrices and prior $R_t(n)$ for each population; (2) assigning contact matrices with increased contact rates for working-age and older adults to populations experiencing disparity; and (3) as in (2) but with a time-continuous update to $R_t(n)$. We study 9 U.S. states and the District of Columbia, providing a complete time series of the pandemic in each and, in some cases, identifying disparities not otherwise evident in the aggregate statistics.
A study of disproportionately affected populations by race/ethnicity during the SARS-CoV-2 pandemic using multi-population SEIR modeling and ensemble data assimilation. Emmanuel Fleurantin, C. Sampson, Daniel P. Maes, Justin P. Bennett, Tayler Fernandes-Nunez, S. Marx, G. Evensen. Foundations of Data Science, 2021. DOI: 10.3934/fods.2021022.
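A single-population, forward-Euler sketch of SEIR dynamics driven by a time-dependent reproductive number may help fix ideas (parameter values and names are illustrative; the paper's age-stratified contact matrices and ESMDA updates are omitted):

```python
def seir_step(state, R_t, sigma=1 / 5.5, gamma=1 / 7.0, dt=1.0):
    """One forward-Euler step of an SEIR model in which the transmission
    rate is tied to the time-dependent reproductive number: beta = R_t * gamma.
    sigma: incubation rate (1/latent period); gamma: recovery rate."""
    S, E, I, R = state
    N = S + E + I + R
    beta = R_t * gamma
    new_exposed = beta * S * I / N * dt
    new_infectious = sigma * E * dt
    new_recovered = gamma * I * dt
    return (S - new_exposed,
            E + new_exposed - new_infectious,
            I + new_infectious - new_recovered,
            R + new_recovered)
```

Varying `R_t` step by step is what lets an assimilation scheme such as ESMDA absorb intervention effects into the inferred reproductive number.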
We use persistent cohomology and circular coordinates to investigate three datasets related to infectious diseases. We show that all three datasets exhibit circular coordinates that carry information about the data itself. For one of the datasets, we were able to recover time post-infection from the circular coordinate itself; for the other datasets this information was not available, but in one we were able to relate the circular coordinate to red blood cell counts and weight changes in the subjects.
Intrinsic disease maps using persistent cohomology. Daniel Amin, Mikael Vejdemo-Johansson. Foundations of Data Science, 2021. DOI: 10.3934/FODS.2021008.
Christopher Oballe, D. Boothe, P. Franaszczuk, V. Maroulas
We propose ToFU, a new trainable neural network unit with a persistence diagram dissimilarity function as its activation. Since persistence diagrams are topological summaries of structures, this new activation measures and learns the topology of data to leverage it in machine learning tasks. We showcase the utility of ToFU in two experiments: one involving the classification of discrete-time autoregressive signals, and another involving a variational autoencoder. In the former, ToFU yields results competitive with networks that use spectral features while outperforming CNN architectures. In the latter, ToFU produces topologically interpretable latent-space representations of inputs without sacrificing reconstruction fidelity.
ToFU: Topology functional units for deep learning. Christopher Oballe, D. Boothe, P. Franaszczuk, V. Maroulas. Foundations of Data Science, 2021. DOI: 10.3934/fods.2021021.
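A persistence-diagram dissimilarity of the kind ToFU uses as its activation can be approximated greedily: each diagram point is charged either the cost of matching to its nearest template point or the cost of collapsing onto the diagonal. This is a hypothetical surrogate, not the trainable unit itself, which learns its template diagrams and would use an optimal (e.g. Wasserstein) matching rather than this greedy assignment:

```python
def greedy_diagram_dissimilarity(diagram, template):
    """Greedy surrogate for a persistence-diagram dissimilarity. Each
    (birth, death) point of `diagram` pays the cheaper of: the L-inf
    distance to its nearest template point, or the L-inf distance
    (death - birth) / 2 to the diagonal."""
    total = 0.0
    for b, d in diagram:
        to_diagonal = (d - b) / 2.0
        to_template = min((max(abs(b - tb), abs(d - td)) for tb, td in template),
                          default=float("inf"))
        total += min(to_diagonal, to_template)
    return total
```

A point far from every template point is cheaper to treat as noise (diagonal), mirroring how short-lived features are discounted.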
A. Lawson, Tyler Hoffman, Yu-Min Chung, K. Keegan, S. Day
Topological data analysis, and in particular persistence diagrams, are gaining popularity as tools for extracting topological information from noisy point cloud and digital data. Persistence diagrams track topological features in the form of $k$-dimensional holes in the data. Here, we construct a new, automated approach for identifying persistence diagram points that represent robust long-life features. These features may be used to provide a more accurate estimate of Betti numbers for the underlying space. This approach extends the established practice of using a lifespan cutoff on the features in order to take advantage of the observation that noisy features typically appear in clusters in the persistence diagram. We show that this approach offers more flexibility in partitioning features in the persistence diagram, resulting in greater accuracy in computed Betti numbers, especially in the case of high noise levels and varying image illumination. This work is motivated by 3-dimensional Micro-CT imaging of ice core samples, and is applicable for separating noise from robust signals in persistence diagrams from noisy data.
A density-based approach to feature detection in persistence diagrams for firn data. A. Lawson, Tyler Hoffman, Yu-Min Chung, K. Keegan, S. Day. Foundations of Data Science, 2021. DOI: 10.3934/FODS.2021012.
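One simple realization of the clustering observation above is to cut the sorted lifespans at their largest gap instead of at a fixed cutoff. The paper's density-based partition is more refined, so treat this as an illustrative stand-in:

```python
def split_by_largest_gap(lifespans):
    """Partition persistence lifespans into 'noise' and 'robust' groups at
    the largest gap in the sorted values, exploiting the observation that
    noisy features cluster together near zero lifespan."""
    xs = sorted(lifespans)
    if len(xs) < 2:
        return [], list(xs)
    gaps = [xs[i + 1] - xs[i] for i in range(len(xs) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return xs[:cut], xs[cut:]
```

The length of the robust group then serves as the estimated Betti number for that dimension.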
Paul E. Anderson, T. Chartier, A. Langville, Kathryn E. Pedings-Behling
In prior work [4], Anderson et al. introduced a new problem, the rankability problem, which refers to a dataset's inherent ability to produce a meaningful ranking of its items. Ranking is a fundamental data science task with numerous applications that include web search, data mining, cybersecurity, machine learning, and statistical learning theory. Yet little attention has been paid to the question of whether a dataset is suitable for ranking. As a result, when a ranking method is applied to a dataset with low rankability, the resulting ranking may not be reliable. The rankability work of [4] and its methods studied unweighted data, for which the dominance relations are binary, i.e., an item either dominates or is dominated by another item. In this paper, we extend rankability methods to weighted data, for which an item may dominate another by any finite amount. We present combinatorial approaches to a weighted rankability measure and apply our new measure to several weighted datasets.
The rankability of weighted data from pairwise comparisons. Paul E. Anderson, T. Chartier, A. Langville, Kathryn E. Pedings-Behling. Foundations of Data Science, 2021. DOI: 10.3934/FODS.2021002.
We study the problem of diffeomorphometric geodesic landmark matching where the objective is to find a diffeomorphism that, via its group action, maps between two sets of landmarks. It is well-known that the motion of the landmarks, and thereby the diffeomorphism, can be encoded by an initial momentum leading to a formulation where the landmark matching problem can be solved as an optimisation problem over such momenta. The novelty of our work lies in the application of a derivative-free Bayesian inverse method for learning the optimal momentum encoding the diffeomorphic mapping between the template and the target. The method we apply is the ensemble Kalman filter, an extension of the Kalman filter to nonlinear operators. We describe an efficient implementation of the algorithm and show several numerical results for various target shapes.
Learning landmark geodesics using the ensemble Kalman filter. Andreas Bock, C. Cotter. Foundations of Data Science, 2021. DOI: 10.3934/fods.2021020.
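The derivative-free update at the heart of this approach is the analysis step of a stochastic (perturbed-observation) ensemble Kalman filter. A generic sketch follows (names are illustrative; in the landmark setting the nonlinear forward map from momenta to landmark positions plays the role of H):

```python
import numpy as np

def enkf_update(X, y, H, R, rng):
    """Stochastic ensemble Kalman analysis step (perturbed observations).
    X: (n, m) ensemble of m parameter members; y: (p,) observation;
    H: callable mapping one member to observation space; R: (p, p)
    observation-error covariance. No derivatives of H are required."""
    n, m = X.shape
    HX = np.column_stack([H(X[:, j]) for j in range(m)])
    A = X - X.mean(axis=1, keepdims=True)
    HA = HX - HX.mean(axis=1, keepdims=True)
    Cxy = A @ HA.T / (m - 1)              # parameter-observation covariance
    Cyy = HA @ HA.T / (m - 1)             # observation-space covariance
    K = Cxy @ np.linalg.inv(Cyy + R)      # Kalman gain
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=m).T
    return X + K @ (Y - HX)
```

Because only ensemble evaluations of H appear, the same code applies unchanged when H is a nonlinear operator, which is the extension exploited here.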
Yossi Bokor Bleile, Katharine Turner, Christopher Williams
In this paper, we consider the simplest class of stratified spaces – linearly embedded graphs. We present an algorithm that learns the abstract structure of an embedded graph and models the specific embedding from a point cloud sampled from it. We use tools and inspiration from computational geometry, algebraic topology, and topological data analysis and prove the correctness of the identified abstract structure under assumptions on the embedding. The algorithm is implemented in the Julia package Skyler, which we used for the numerical simulations in this paper.
Reconstructing linearly embedded graphs: A first step to stratified space learning. Yossi Bokor Bleile, Katharine Turner, Christopher Williams. Foundations of Data Science, 2021. DOI: 10.3934/fods.2021026.
We present an ensemble filtering method based on a linear model for the precision matrix (the inverse of the covariance), with the parameters determined by score matching estimation. The method provides a rigorous covariance regularization when the underlying random field is Gaussian Markov. The parameters are found by solving a system of linear equations. The analysis step uses the inverse formulation of the Kalman update. Several filter versions, differing in the construction of the analysis ensemble, are proposed, as well as a score matching version of the extended Kalman filter.
Score matching filters for Gaussian Markov random fields with a linear model of the precision matrix. Marie Turčičová, J. Mandel, K. Eben. Foundations of Data Science, 2021. DOI: 10.3934/fods.2021030.
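The "inverse formulation of the Kalman update" admits a compact sketch in the linear-Gaussian information form, where the forecast precision matrix is updated additively; this additivity is what makes maintaining a direct (e.g. sparse Gaussian Markov) model of the precision attractive. A generic sketch under those assumptions, not the paper's score-matching filter:

```python
import numpy as np

def information_filter_update(Lam_f, mu_f, H, Rinv, y):
    """Kalman analysis in information (precision) form. Lam_f: forecast
    precision; mu_f: forecast mean; H: linear observation operator;
    Rinv: inverse observation-error covariance; y: observation."""
    Lam_a = Lam_f + H.T @ Rinv @ H             # posterior precision (additive)
    eta_a = Lam_f @ mu_f + H.T @ Rinv @ y      # posterior information vector
    mu_a = np.linalg.solve(Lam_a, eta_a)       # posterior mean
    return Lam_a, mu_a
```

If Lam_f comes from a sparse linear model of the precision and H is local, Lam_a inherits much of that sparsity, which is the regularization benefit in the Gaussian Markov case.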
G. Evensen, Javier Amezcua, M. Bocquet, A. Carrassi, A. Farchi, A. Fowler, P. Houtekamer, C. Jones, R. Moraes, M. Pulido, C. Sampson, F. Vossepoel
This work demonstrates the efficiency of using iterative ensemble smoothers to estimate the parameters of an SEIR model. We have extended a standard SEIR model with age classes and compartments for the sick, hospitalized, and dead. The data conditioned on are the daily numbers of accumulated deaths and the number of hospitalized; it is also possible to condition the model on the number of cases obtained from testing. We start from a wide prior distribution for the model parameters; the ensemble conditioning then leads to a posterior ensemble of estimated parameters yielding model predictions in close agreement with the observations. The updated ensemble of model simulations has predictive capabilities and includes uncertainty estimates. In particular, we estimate the effective reproductive number as a function of time, and we can assess the impact of different intervention measures. By starting from the updated set of model parameters, we can make accurate short-term predictions of the epidemic development, assuming knowledge of the future effective reproductive number. The model system also allows for the computation of long-term scenarios of the epidemic under different assumptions. We have applied the model system to data sets from several countries and regions: the four European countries Norway, England, The Netherlands, and France; the province of Quebec in Canada; the South American countries Argentina and Brazil; and the four US states Alabama, North Carolina, California, and New York. These countries and states all have vastly different developments of the epidemic, and we could accurately model the SARS-CoV-2 outbreak in all of them. We recognize that more complex models, e.g., with regional compartments, may be desirable, and we suggest that the approach used here should also be applicable to such models.
An international initiative of predicting the SARS-CoV-2 pandemic using ensemble data assimilation. G. Evensen, Javier Amezcua, M. Bocquet, A. Carrassi, A. Farchi, A. Fowler, P. Houtekamer, C. Jones, R. Moraes, M. Pulido, C. Sampson, F. Vossepoel. Foundations of Data Science, December 2020. DOI: 10.3934/fods.2021001.
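The iterative ensemble smoother used in studies like this one is commonly ESMDA, which assimilates the same data several times with observation-error covariances inflated by factors alpha_i satisfying sum(1/alpha_i) = 1. A minimal parameter-estimation sketch under that standard recipe (illustrative names, not the authors' production code):

```python
import numpy as np

def esmda(theta, g, y, R, alphas, rng):
    """Ensemble Smoother with Multiple Data Assimilation. theta: (n, m)
    prior parameter ensemble; g: forward model mapping one member to
    observation space; y: (p,) data; R: (p, p) observation-error
    covariance; alphas: inflation factors with sum(1/a) == 1."""
    for a in alphas:
        m = theta.shape[1]
        G = np.column_stack([g(theta[:, j]) for j in range(m)])
        A = theta - theta.mean(axis=1, keepdims=True)
        HA = G - G.mean(axis=1, keepdims=True)
        Cxy = A @ HA.T / (m - 1)
        Cyy = HA @ HA.T / (m - 1)
        K = Cxy @ np.linalg.inv(Cyy + a * R)       # gain with inflated R
        Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)),
                                                 a * R, size=m).T
        theta = theta + K @ (Y - G)
    return theta
```

For a linear forward model the repeated small steps recover the exact Gaussian posterior; for nonlinear models such as SEIR, the iterations make the linearized updates more accurate than a single smoother pass.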