Test Data Reuse for the Evaluation of Continuously Evolving Classification Algorithms Using the Area under the Receiver Operating Characteristic Curve
Alexej Gossmann, Aria Pezeshk, Yu-ping Wang, B. Sahiner
Pub Date: 2021-01-01; DOI: 10.1137/20M1333110
SIAM Journal on Mathematics of Data Science, pp. 692-714
Error Bounds for Dynamical Spectral Estimation
Pub Date: 2021-01-01; Epub Date: 2021-02-11; DOI: 10.1137/20m1335984
Robert J Webber, Erik H Thiede, Douglas Dow, Aaron R Dinner, Jonathan Weare
Dynamical spectral estimation is a well-established numerical approach for estimating eigenvalues and eigenfunctions of the Markov transition operator from trajectory data. Although the approach has been widely applied in biomolecular simulations, its error properties remain poorly understood. Here we analyze the error of a dynamical spectral estimation method called "the variational approach to conformational dynamics" (VAC). We bound the approximation error and estimation error for VAC estimates. Our analysis establishes VAC's convergence properties and suggests new strategies for tuning VAC to improve accuracy.
{"title":"Error Bounds for Dynamical Spectral Estimation.","authors":"Robert J Webber, Erik H Thiede, Douglas Dow, Aaron R Dinner, Jonathan Weare","doi":"10.1137/20m1335984","DOIUrl":"10.1137/20m1335984","url":null,"abstract":"<p><p>Dynamical spectral estimation is a well-established numerical approach for estimating eigenvalues and eigenfunctions of the Markov transition operator from trajectory data. Although the approach has been widely applied in biomolecular simulations, its error properties remain poorly understood. Here we analyze the error of a dynamical spectral estimation method called \"the variational approach to conformational dynamics\" (VAC). We bound the approximation error and estimation error for VAC estimates. Our analysis establishes VAC's convergence properties and suggests new strategies for tuning VAC to improve accuracy.</p>","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"3 1","pages":"225-252"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8336423/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39281319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global Minima of Overparameterized Neural Networks","authors":"Y. Cooper","doi":"10.1137/19M1308943","DOIUrl":"https://doi.org/10.1137/19M1308943","url":null,"abstract":"","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"3 2","pages":"676-691"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72476727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approximation Properties of Ridge Functions and Extreme Learning Machines
P. Jorgensen, D. Stewart
Pub Date: 2021-01-01; DOI: 10.1137/20m1356348
SIAM Journal on Mathematics of Data Science, pp. 815-832
For a compact set $D \subset \mathbb{R}^{m}$ we consider the problem of approximating a function $f$ over $D$ by sums of ridge functions $x \mapsto \varphi(w^{T}x)$ with $w$ in a given set ...
Nonbacktracking Eigenvalues under Node Removal: X-Centrality and Targeted Immunization
Leonardo A. B. Tôrres, Kevin S. Chan, Hanghang Tong, T. Eliassi-Rad
Pub Date: 2021-01-01; DOI: 10.1137/20M1352132
SIAM Journal on Mathematics of Data Science, pp. 656-675
The non-backtracking matrix and its eigenvalues have many applications in network science and graph mining, such as node and edge centrality, community detection, length spectrum theory, graph distance, and epidemic and percolation thresholds. In network epidemiology, the reciprocal of the largest eigenvalue of the non-backtracking matrix is a good approximation for the epidemic threshold of certain network dynamics. In this work, we develop techniques that identify which nodes have the largest impact on this leading eigenvalue. We do so by studying the behavior of the spectrum of the non-backtracking matrix after a node is removed from the graph. From this analysis we derive two new centrality measures: X-degree and X-non-backtracking centrality. We perform extensive experimentation with targeted immunization strategies derived from these two centrality measures. Our spectral analysis and centrality measures can be broadly applied and will be of interest to theorists and practitioners alike.
Spectral Neighbor Joining for Reconstruction of Latent Tree Models
Pub Date: 2021-01-01; Epub Date: 2021-02-01; DOI: 10.1137/20m1365715
Ariel Jaffe, Noah Amsel, Yariv Aizenbud, Boaz Nadler, Joseph T Chang, Yuval Kluger
A common assumption in multiple scientific applications is that the distribution of observed data can be modeled by a latent tree graphical model. An important example is phylogenetics, where the tree models the evolutionary lineages of a set of observed organisms. Given a set of independent realizations of the random variables at the leaves of the tree, a key challenge is to infer the underlying tree topology. In this work we develop Spectral Neighbor Joining (SNJ), a novel method to recover the structure of latent tree graphical models. Given a matrix that contains a measure of similarity between all pairs of observed variables, SNJ computes a spectral measure of cohesion between groups of observed variables. We prove that SNJ is consistent, and derive a sufficient condition for correct tree recovery from an estimated similarity matrix. Combining this condition with a concentration of measure result on the similarity matrix, we bound the number of samples required to recover the tree with high probability. We illustrate via extensive simulations that in comparison to several other reconstruction methods, SNJ requires fewer samples to accurately recover trees with a large number of leaves or long edges.
{"title":"Spectral neighbor joining for reconstruction of latent tree Models.","authors":"Ariel Jaffe, Noah Amsel, Yariv Aizenbud, Boaz Nadler, Joseph T Chang, Yuval Kluger","doi":"10.1137/20m1365715","DOIUrl":"https://doi.org/10.1137/20m1365715","url":null,"abstract":"<p><p>A common assumption in multiple scientific applications is that the distribution of observed data can be modeled by a latent tree graphical model. An important example is phylogenetics, where the tree models the evolutionary lineages of a set of observed organisms. Given a set of independent realizations of the random variables at the leaves of the tree, a key challenge is to infer the underlying tree topology. In this work we develop Spectral Neighbor Joining (SNJ), a novel method to recover the structure of latent tree graphical models. Given a matrix that contains a measure of similarity between all pairs of observed variables, SNJ computes a spectral measure of cohesion between groups of observed variables. We prove that SNJ is consistent, and derive a sufficient condition for correct tree recovery from an estimated similarity matrix. Combining this condition with a concentration of measure result on the similarity matrix, we bound the number of samples required to recover the tree with high probability. We illustrate via extensive simulations that in comparison to several other reconstruction methods, SNJ requires fewer samples to accurately recover trees with a large number of leaves or long edges.</p>","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"3 1","pages":"113-141"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8194222/pdf/nihms-1702804.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39091867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Doubly Stochastic Normalization of the Gaussian Kernel Is Robust to Heteroskedastic Noise
Pub Date: 2021-01-01; Epub Date: 2021-03-23; DOI: 10.1137/20M1342124
Boris Landa, Ronald R Coifman, Yuval Kluger
A fundamental step in many data-analysis techniques is the construction of an affinity matrix describing similarities between data points. When the data points reside in Euclidean space, a widespread approach is to form an affinity matrix by applying the Gaussian kernel to pairwise distances, and to follow with a certain normalization (e.g., the row-stochastic normalization or its symmetric variant). We demonstrate that the doubly-stochastic normalization of the Gaussian kernel with zero main diagonal (i.e., no self-loops) is robust to heteroskedastic noise. That is, the doubly-stochastic normalization is advantageous in that it automatically accounts for observations with different noise variances. Specifically, we prove that in a suitable high-dimensional setting where heteroskedastic noise does not concentrate too much in any particular direction in space, the resulting (doubly-stochastic) noisy affinity matrix converges to its clean counterpart with rate $m^{-1/2}$, where $m$ is the ambient dimension. We demonstrate this result numerically, and show that in contrast, the popular row-stochastic and symmetric normalizations behave unfavorably under heteroskedastic noise. Furthermore, we provide examples of simulated and experimental single-cell RNA sequence data with intrinsic heteroskedasticity, where the advantage of the doubly-stochastic normalization for exploratory analysis is evident.
{"title":"Doubly Stochastic Normalization of the Gaussian Kernel Is Robust to Heteroskedastic Noise.","authors":"Boris Landa, Ronald R Coifman, Yuval Kluger","doi":"10.1137/20M1342124","DOIUrl":"https://doi.org/10.1137/20M1342124","url":null,"abstract":"<p><p>A fundamental step in many data-analysis techniques is the construction of an affinity matrix describing similarities between data points. When the data points reside in Euclidean space, a widespread approach is to from an affinity matrix by the Gaussian kernel with pairwise distances, and to follow with a certain normalization (e.g. the row-stochastic normalization or its symmetric variant). We demonstrate that the doubly-stochastic normalization of the Gaussian kernel with zero main diagonal (i.e., no self loops) is robust to heteroskedastic noise. That is, the doubly-stochastic normalization is advantageous in that it automatically accounts for observations with different noise variances. Specifically, we prove that in a suitable high-dimensional setting where heteroskedastic noise does not concentrate too much in any particular direction in space, the resulting (doubly-stochastic) noisy affinity matrix converges to its clean counterpart with rate <i>m</i> <sup>-1/2</sup>, where <i>m</i> is the ambient dimension. We demonstrate this result numerically, and show that in contrast, the popular row-stochastic and symmetric normalizations behave unfavorably under heteroskedastic noise. Furthermore, we provide examples of simulated and experimental single-cell RNA sequence data with intrinsic heteroskedasticity, where the advantage of the doubly-stochastic normalization for exploratory analysis is evident.</p>","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"3 1","pages":"388-413"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8194191/pdf/nihms-1702812.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39091868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MR-GAN: Manifold Regularized Generative Adversarial Networks for Scientific Data
Qunwei Li, B. Kailkhura, R. Anirudh, Jize Zhang, Yi Zhou, Yingbin Liang, T. Y. Han, P. Varshney
Pub Date: 2021-01-01; DOI: 10.1137/20m1344299
SIAM Journal on Mathematics of Data Science, pp. 1197-1222
An Optimal Algorithm for Strict Circular Seriation
Santiago Armstrong, Cristóbal Guzmán, C. Sing-Long
Pub Date: 2021-01-01; DOI: 10.1137/21m139356x
SIAM Journal on Mathematics of Data Science, pp. 1223-1250
We study the problem of circular seriation, where we are given a matrix of pairwise dissimilarities between $n$ objects, and the goal is to find a circular order of the objects in a manner that is consistent with their dissimilarity. This problem is a generalization of the classical linear seriation problem, where the goal is to find a linear order and for which optimal $\mathcal{O}(n^2)$ algorithms are known. Our contributions can be summarized as follows. First, we introduce circular Robinson matrices as the natural class of dissimilarity matrices for the circular seriation problem. Second, for the case of strict circular Robinson dissimilarity matrices, we provide an optimal $\mathcal{O}(n^2)$ algorithm for the circular seriation problem. Finally, we propose a statistical model to analyze the well-posedness of the circular seriation problem for large $n$. In particular, we establish $\mathcal{O}(\log(n)/n)$ rates on the distance, in the Kendall-tau metric, between any circular ordering found by solving the circular seriation problem and the underlying order of the model.
{"title":"An Optimal Algorithm for Strict Circular Seriation","authors":"Santiago Armstrong, Crist'obal Guzm'an, C. Sing-Long","doi":"10.1137/21m139356x","DOIUrl":"https://doi.org/10.1137/21m139356x","url":null,"abstract":"We study the problem of circular seriation, where we are given a matrix of pairwise dissimilarities between $n$ objects, and the goal is to find a {em circular order} of the objects in a manner that is consistent with their dissimilarity. This problem is a generalization of the classical {em linear seriation} problem where the goal is to find a {em linear order}, and for which optimal ${cal O}(n^2)$ algorithms are known. Our contributions can be summarized as follows. First, we introduce {em circular Robinson matrices} as the natural class of dissimilarity matrices for the circular seriation problem. Second, for the case of {em strict circular Robinson dissimilarity matrices} we provide an optimal ${cal O}(n^2)$ algorithm for the circular seriation problem. Finally, we propose a statistical model to analyze the well-posedness of the circular seriation problem for large $n$. In particular, we establish ${cal O}(log(n)/n)$ rates on the distance between any circular ordering found by solving the circular seriation problem to the underlying order of the model, in the Kendall-tau metric.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"25 1","pages":"1223-1250"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84685942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
k-Variance: A Clustered Notion of Variance
J. Solomon, Kristjan H. Greenewald, H. Nagaraja
Pub Date: 2020-12-13; DOI: 10.1137/20m1385895
SIAM Journal on Mathematics of Data Science, pp. 957-978
We introduce $k$-variance, a generalization of variance built on the machinery of random bipartite matchings. The $k$-variance measures the expected cost of matching two sets of $k$ samples from a distribution to each other, capturing local rather than global information about a measure as $k$ increases; it is easily approximated stochastically using sampling and linear programming. In addition to defining $k$-variance and proving its basic properties, we provide in-depth analysis of this quantity in several key cases, including one-dimensional measures, clustered measures, and measures concentrated on low-dimensional subsets of $\mathbb{R}^n$. We conclude with experiments and open problems motivated by this new way to summarize distributional shape.
{"title":"k-Variance: A Clustered Notion of Variance","authors":"J. Solomon, Kristjan H. Greenewald, H. Nagaraja","doi":"10.1137/20m1385895","DOIUrl":"https://doi.org/10.1137/20m1385895","url":null,"abstract":"We introduce $k$-variance, a generalization of variance built on the machinery of random bipartite matchings. $K$-variance measures the expected cost of matching two sets of $k$ samples from a distribution to each other, capturing local rather than global information about a measure as $k$ increases; it is easily approximated stochastically using sampling and linear programming. In addition to defining $k$-variance and proving its basic properties, we provide in-depth analysis of this quantity in several key cases, including one-dimensional measures, clustered measures, and measures concentrated on low-dimensional subsets of $mathbb R^n$. We conclude with experiments and open problems motivated by this new way to summarize distributional shape.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"59 1","pages":"957-978"},"PeriodicalIF":0.0,"publicationDate":"2020-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90851468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}