
Foundations of data science (Springfield, Mo.): latest articles

ASPECTS OF TOPOLOGICAL APPROACHES FOR DATA SCIENCE.
Q2 MATHEMATICS, APPLIED Pub Date : 2022-06-01 DOI: 10.3934/fods.2022002
Jelena Grbić, Jie Wu, Kelin Xia, Guo-Wei Wei

We establish a new theory which unifies various aspects of topological approaches for data science, by being applicable both to point cloud data and to graph data, including networks beyond pairwise interactions. We generalize simplicial complexes and hypergraphs to super-hypergraphs and establish super-hypergraph homology as an extension of simplicial homology. Driven by applications, we also introduce super-persistent homology.
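As background for the simplicial homology that this abstract extends, here is a minimal sketch of computing Betti numbers of a small simplicial complex from its boundary matrices. It covers ordinary simplicial complexes of dimension at most 2 only, not the super-hypergraph homology of the paper; the function name `betti_numbers` and the input convention (vertices numbered from 0, edges and triangles given as tuples) are assumptions made for illustration.

```python
import numpy as np

def betti_numbers(n_vertices, edges, triangles):
    """Return (b0, b1, b2) over the rationals for a complex of dimension <= 2."""
    edges = [tuple(sorted(e)) for e in edges]
    triangles = [tuple(sorted(t)) for t in triangles]
    edge_index = {e: i for i, e in enumerate(edges)}

    # Boundary map d1: edge [u, v] -> v - u
    d1 = np.zeros((n_vertices, len(edges)))
    for j, (u, v) in enumerate(edges):
        d1[u, j] = -1.0
        d1[v, j] = 1.0

    # Boundary map d2: triangle [a, b, c] -> [b, c] - [a, c] + [a, b]
    d2 = np.zeros((len(edges), len(triangles)))
    for j, (a, b, c) in enumerate(triangles):
        d2[edge_index[(b, c)], j] = 1.0
        d2[edge_index[(a, c)], j] = -1.0
        d2[edge_index[(a, b)], j] = 1.0

    # Guard against empty matrices, for which matrix_rank is undefined
    rank1 = int(np.linalg.matrix_rank(d1)) if edges else 0
    rank2 = int(np.linalg.matrix_rank(d2)) if triangles else 0

    b0 = n_vertices - rank1               # connected components
    b1 = len(edges) - rank1 - rank2       # independent loops
    b2 = len(triangles) - rank2           # enclosed voids
    return b0, b1, b2

# Hollow triangle: one component, one loop
print(betti_numbers(3, [(0, 1), (1, 2), (0, 2)], []))
```

Filling in the triangle with the 2-simplex `(0, 1, 2)` removes the loop, so the same call with `triangles=[(0, 1, 2)]` gives `b1 = 0`.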

Vol. 4, No. 2, pp. 165-216. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9881677/pdf/nihms-1825620.pdf
Citations: 0
A log-Gaussian Cox process with sequential Monte Carlo for line narrowing in spectroscopy
Q2 MATHEMATICS, APPLIED Pub Date : 2022-02-26 DOI: 10.3934/fods.2023008
T. Harkonen, Emma Hannula, M. Moores, E. Vartiainen, L. Roininen
We propose a statistical model for narrowing line shapes in spectroscopy that are well approximated as linear combinations of Lorentzian or Voigt functions. We introduce a log-Gaussian Cox process to represent the peak locations thereby providing uncertainty quantification for the line narrowing. Bayesian formulation of the method allows for robust and explicit inclusion of prior information as probability distributions for parameters of the model. Estimation of the signal and its parameters is performed using a sequential Monte Carlo algorithm followed by an optimization step to determine the peak locations. Our method is validated using a simulation study and applied to a mineralogical Raman spectrum.
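The signal model described in this abstract, a linear combination of Lorentzian line shapes, can be sketched as follows. This shows only the deterministic line-shape part; the log-Gaussian Cox process prior and the sequential Monte Carlo estimation are not reproduced. The function names and the parametrization by peak height and half-width at half-maximum are assumptions chosen for illustration.

```python
import numpy as np

def lorentzian(x, center, width, amplitude=1.0):
    """Lorentzian with peak height `amplitude` and half-width at half-maximum `width`."""
    return amplitude * width**2 / ((x - center) ** 2 + width**2)

def spectrum(x, centers, widths, amplitudes):
    """Evaluate a sum of Lorentzian peaks on the grid `x`."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for c, w, a in zip(centers, widths, amplitudes):
        total += lorentzian(x, c, w, a)
    return total

# Two overlapping peaks on a hypothetical wavenumber axis
x = np.linspace(0.0, 10.0, 501)
y = spectrum(x, centers=[4.0, 5.0], widths=[0.5, 0.3], amplitudes=[1.0, 2.0])
```

With this parametrization a single peak takes the value `amplitude` at its center and half of that at a distance `width` from the center, which makes the width parameter directly comparable before and after line narrowing.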
Citations: 0
Data based quantification of synchronization
Q2 MATHEMATICS, APPLIED Pub Date : 2022-01-01 DOI: 10.3934/fods.2022020
Citations: 1
Addressing confirmation bias in middle school data science education
Q2 MATHEMATICS, APPLIED Pub Date : 2022-01-01 DOI: 10.3934/fods.2021035
S. Hedges, Kim Given
More research is needed involving middle school students' engagement in the statistical problem-solving process, particularly the beginning process steps: formulate a question and make a plan to collect data/consider the data. Further, the increased availability of large-scale electronically accessible data sets is an untapped area of study. This interpretive study examined middle school students' understanding of statistical concepts involved in making a plan to collect data to answer a statistical question within a social issue context using data available on the internet. Student artifacts, researcher notes, and audio and video recordings from nine groups of 20 seventh-grade students in two gifted education pull-out classes at a suburban middle school were used to answer the study research questions. Data were analyzed using a priori codes from previously developed frameworks and by using an inductive approach to find themes. Three themes that emerged from the data related to confirmation bias. Some middle school students held preconceptions about the social issues they chose to study that biased their statistical questions. This in turn influenced the sources of data students used to answer their questions. Confirmation bias is a serious issue that is exacerbated by the endless sources of data available electronically. We argue that this type of bias should be addressed early in students' educational experiences. Based on the findings from this study, we offer recommendations for future research and implications for statistics and data science education.
Citations: 1
Statistical inference for persistent homology applied to simulated fMRI time series data
Q2 MATHEMATICS, APPLIED Pub Date : 2022-01-01 DOI: 10.3934/fods.2022014
H. Abdallah, Adam J. Regalski, Mohammad Behzad Kang, Maria Berishaj, N. Nnadi, Asadur Chowdury, V. Diwadkar, A. Salch
Time-series data are among the most widely used in biomedical sciences, including domains such as functional Magnetic Resonance Imaging (fMRI). Structure within time series data can be captured by the tools of topological data analysis (TDA). Persistent homology is the most commonly used data-analytic tool in TDA, and can effectively summarize complex high-dimensional data into an interpretable 2-dimensional representation called a persistence diagram. Existing methods for statistical inference for persistent homology of data depend on an independence assumption being satisfied. While persistent homology can be computed for each time index in a time-series, time-series data often fail to satisfy the independence assumption. This paper develops a statistical test that obviates the independence assumption by implementing a multi-level block sampled Monte Carlo test with sets of persistence diagrams. Its efficacy for detecting task-dependent topological organization is then demonstrated on simulated fMRI data. This new statistical test is therefore suitable for analyzing persistent homology of fMRI data, and of non-independent data in general.
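To make the notion of a persistence diagram concrete, the sketch below computes the 0-dimensional part of persistent homology for a point cloud, using the pairwise distance as the filtration scale. It is a generic union-find illustration, not the block-sampled Monte Carlo test of the paper, and the function name `h0_persistence` is hypothetical.

```python
import itertools
import math

def h0_persistence(points):
    """Death scales of 0-dimensional classes in a Vietoris-Rips filtration.

    Every point is born at scale 0. A class dies at the length of the edge
    that merges its component into another one. One class never dies and is
    reported with death scale math.inf.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the component representative, with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all pairwise edges from shortest to longest
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # two components merge: one class dies
            deaths.append(length)
    if n:
        deaths.append(math.inf)      # the surviving component
    return deaths

print(h0_persistence([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]))  # [1.0, 2.0, inf]
```

Each returned value pairs with a birth scale of 0 to give one point of the degree-0 persistence diagram; higher-dimensional features need a full boundary-matrix reduction, which dedicated TDA libraries provide.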
Citations: 3
Teaching data science to students in biology using R, RStudio and Learnr: Analysis of three years data
Q2 MATHEMATICS, APPLIED Pub Date : 2022-01-01 DOI: 10.3934/fods.2022022
G. Engels, P. Grosjean, Frédérique Artus
We examine the impact of implementing active pedagogical methodologies in three successive data science courses for a biology curriculum at the University of Mons, Belgium. Blended learning and flipped classroom approaches were adopted, with an emphasis on project-based biological data analysis. Four successive types of exercises of increasing difficulty were proposed to the students. Tutorials written with the R package learnr were identified as a critical step in the transition between theory and the application of the concepts. The cognitive workload needed to complete the learnr tutorials was measured for the three courses and was lower only for the last course, suggesting students needed a long time to get used to their software environment (R, RStudio and git). Data relative to students' activity, collected primarily from the ongoing assessment, were also used to establish student profiles according to their learning strategies. Several suboptimal strategies were observed and discussed. Finally, the timing of students' contributions, and the intensity of teacher-learner interactions related to these contributions, were analyzed before, during and after the mandatory distance learning due to the COVID-19 lockdown. A lag phase was visible at the beginning of the first lockdown, but the students' work was not markedly affected during the second lockdown period, which lasted much longer.
Citations: 0
Applying topological data analysis to local search problems
Q2 MATHEMATICS, APPLIED Pub Date : 2022-01-01 DOI: 10.3934/fods.2022006
Erik Carlsson, J. Carlsson, Shannon Sweitzer

We present an application of topological data analysis (TDA) to discrete optimization problems, which we show can improve the performance of the 2-opt local search method for the traveling salesman problem by simply applying standard Vietoris-Rips construction to a data set of trials. We then construct a simplicial complex which is specialized for this sort of simulated data set, determined by a stochastic matrix with a steady state vector $(P,\pi)$. When $P$ is induced from a random walk on a finite metric space, this complex exhibits similarities with standard constructions such as Vietoris-Rips on the data set, but is not sensitive to outliers, as sparsity is a natural feature of the construction. We interpret the persistent homology groups in several examples coming from random walks and discrete optimization, and illustrate how higher dimensional Betti numbers can be used to classify connected components, i.e. zero dimensional features in higher dimensions.
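For readers unfamiliar with the 2-opt local search mentioned in this abstract, the following is a minimal sketch of the plain method for the Euclidean traveling salesman problem. It does not include the TDA-based improvement the paper proposes; the function names and the first-improvement sweep order are assumptions made for illustration.

```python
import math

def tour_length(points, tour):
    """Total length of the closed tour visiting `points` in the order `tour`."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))

def two_opt(points, tour):
    """Apply improving 2-opt moves (segment reversals) until none remains."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a vertex, nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length if edges (a, b), (c, d) become (a, c), (b, d)
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    tour[i + 1 : j + 1] = reversed(tour[i + 1 : j + 1])
                    improved = True
    return tour

# A self-crossing tour of the unit square is untangled into its perimeter
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
best = two_opt(square, [0, 2, 1, 3])
print(best, tour_length(square, best))
```

The result is only a local optimum and depends on the starting tour, which is why a collection of such trials forms a natural data set for the topological analysis the abstract describes.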

Citations: 2
Multimodal correlations-based data clustering
Q2 MATHEMATICS, APPLIED Pub Date : 2022-01-01 DOI: 10.3934/fods.2022011
Jia Chen, I. Schizas
This work proposes a novel technique for clustering multimodal data according to their information content. Statistical correlations present in data that contain similar information are exploited to perform the clustering task. Specifically, multiset canonical correlation analysis is equipped with norm-one regularization mechanisms to identify clusters within different types of data that share the same information content. A pertinent minimization formulation is put forth, while block coordinate descent is employed to derive a batch clustering algorithm which achieves better clustering performance than existing alternatives. Relying on subgradient descent, an online clustering approach is derived which substantially lowers computational complexity compared to the batch approach, without significantly compromising the clustering performance. It is established that, as the number of data increases, the novel regularized multiset framework is able to correctly cluster the multimodal data entries. Further, it is proved that the online clustering scheme converges with probability one to a stationary point of the ensemble regularized multiset correlations cost, which has the potential to recover the correct clusters. Extensive numerical tests demonstrate that the novel clustering scheme outperforms existing alternatives, while the online scheme achieves substantial computational savings.
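The combination of norm-one regularization and coordinate descent named in this abstract can be illustrated on a much simpler problem. The sketch below solves an ordinary lasso regression by cyclic coordinate descent; it is not the regularized multiset canonical correlation formulation of the paper, and the function names and the fixed number of sweeps are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal map of t * |.|: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize 0.5 * ||y - X w||^2 + lam * ||w||_1 one coordinate at a time."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)  # squared column norms
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot reduce the residual
            # Residual with the contribution of coordinate j removed
            r = y - X @ w + X[:, j] * w[j]
            # Exact minimizer of the one-dimensional subproblem
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return w

# With an identity design the solution is the soft-thresholded response
w = lasso_coordinate_descent(np.eye(3), [3.0, -0.5, 1.5], lam=1.0)
print(w)
```

The norm-one penalty sets small coefficients exactly to zero, and that sparsity is what lets a regularized formulation single out the entries belonging to a cluster.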
Citations: 0
HOMOTOPY CONTINUATION FOR THE SPECTRA OF PERSISTENT LAPLACIANS.
Q2 MATHEMATICS, APPLIED Pub Date : 2021-12-01 DOI: 10.3934/fods.2021017
Xiaoqi Wei, Guo-Wei Wei

The p-persistent q-combinatorial Laplacian defined for a pair of simplicial complexes is a generalization of the q-combinatorial Laplacian. Given a filtration, the spectra of persistent combinatorial Laplacians not only recover the persistent Betti numbers of persistent homology but also provide extra multiscale geometrical information of the data. Paired with machine learning algorithms, the persistent Laplacian has many potential applications in data science. Seeking different ways to find the spectrum of an operator is an active research topic, and becomes especially interesting when ideas originate from multiple fields. In this work, we explore an alternative approach for the spectrum of persistent Laplacians. As the eigenvalues of a persistent Laplacian matrix are the roots of its characteristic polynomial, one may attempt to find the roots of the characteristic polynomial by homotopy continuation, thus resolving the spectrum of the corresponding persistent Laplacian. We consider a set of simple polytopes and small molecules to prove the principle that algebraic topology, combinatorial graph, and algebraic geometry can be integrated to understand the shape of data.
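The observation that the eigenvalues of a Laplacian matrix are the roots of its characteristic polynomial can be sketched as follows for an ordinary graph Laplacian. Two substitutions are made and should be read as assumptions: the polynomial coefficients come from the Faddeev-LeVerrier recursion, and `numpy.roots` (a companion-matrix solver) stands in for the homotopy continuation root finder used in the paper. Persistent Laplacians are not constructed here.

```python
import numpy as np

def char_poly_coeffs(A):
    """Coefficients of det(x I - A), highest degree first (Faddeev-LeVerrier)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    coeffs = [1.0]
    M = np.eye(n)
    for k in range(1, n + 1):
        AM = A @ M
        c = -np.trace(AM) / k
        coeffs.append(c)
        M = AM + c * np.eye(n)
    return np.array(coeffs)

def spectrum_from_char_poly(A):
    """Eigenvalues of a symmetric matrix obtained as roots of its characteristic polynomial."""
    roots = np.roots(char_poly_coeffs(A))
    return np.sort(roots.real)

# Graph Laplacian of the path graph on three vertices
L = np.array([[1.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])
print(spectrum_from_char_poly(L))   # approximately [0, 1, 3]
print(np.linalg.eigvalsh(L))        # reference values from a direct eigensolver
```

The single zero eigenvalue matches the one connected component of the path graph, which is the simplest instance of the spectrum recovering a Betti number. Root finding on the characteristic polynomial is numerically delicate for repeated eigenvalues, so a direct eigensolver remains the practical reference for checking results.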

Vol. 3, No. 4, pp. 677-700. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9273002/pdf/nihms-1768199.pdf
Citations: 2
Analysis of the feedback particle filter with diffusion map based approximation of the gain
Q2 MATHEMATICS, APPLIED Pub Date : 2021-09-06 DOI: 10.3934/fods.2021023
S. Pathiraja, W. Stannat

Control-type particle filters have been receiving increasing attention over the last decade as a means of obtaining sample based approximations to the sequential Bayesian filtering problem in the nonlinear setting. Here we analyse one such type, namely the feedback particle filter and a recently proposed approximation of the associated gain function based on diffusion maps. The key purpose is to provide analytic insights on the form of the approximate gain, which are of interest in their own right. These are then used to establish a roadmap to obtaining well-posedness and convergence of the finite $N$ system to its mean field limit. A number of possible future research directions are also discussed.

Citations: 3