
Journal of Computational Mathematics and Data Science: Latest Articles

Estimating data complexity and drift through a multiscale generalized impurity approach
Pub Date : 2024-08-26 DOI: 10.1016/j.jcmds.2024.100098
Diogo Costa , Eugénio M. Rocha , Nelson Ferreira

The quality of machine learning solutions, and of classifier models in general, depends largely on the performance of the chosen algorithm and on the intrinsic characteristics of the input data. Although work on the former aspect has been extensive, the latter has received comparably less attention. In this paper, we introduce the Multiscale Impurity Complexity Analysis (MICA) algorithm for quantifying the class separability and decision-boundary complexity of datasets. MICA is both model- and dimensionality-independent and provides a measure of separability based on regional impurity values, which makes it sensitive to both global and local data conditions. We show that MICA properly describes class separability on a comprehensive set of both synthetic and real datasets, and we compare it against other state-of-the-art methods. After establishing the robustness of the proposed method, alternative applications are discussed, including a streaming-data variant of MICA (MICA-S) that can be repurposed into a model-independent method for concept drift detection.
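The regional-impurity idea behind such separability scores can be illustrated with a toy measure. This is a generic sketch, not the authors' MICA: the k-nearest-neighbour neighbourhood scheme and all function names here are assumptions for illustration only.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a label multiset: 1 - sum_c p_c^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def regional_impurity(X, y, k=5):
    """Mean Gini impurity over each point's k nearest neighbours
    (brute-force distances; an illustrative score, not MICA itself)."""
    X = np.asarray(X, dtype=float)
    scores = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # exclude the point itself
        scores.append(gini_impurity(y[nn]))
    return float(np.mean(scores))

# Two well-separated Gaussian blobs: regional impurity near 0
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
sep = regional_impurity(X, y, k=5)

# Same features with shuffled labels: neighbourhoods become mixed
mix = regional_impurity(X, rng.permutation(y), k=5)
```

A low score signals locally pure (separable) classes; a high score signals entangled decision boundaries, which is the kind of condition the abstract describes MICA as detecting at multiple scales.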

Citations: 0
Structured stochastic curve fitting without gradient calculation
Pub Date : 2024-07-26 DOI: 10.1016/j.jcmds.2024.100097
Jixin Chen

Optimization of parameters and hyperparameters is a general step in any data analysis. Because not all models are mathematically well behaved, stochastic optimization, which randomly chooses parameters in each optimization iteration, can be useful in many analyses. Many such algorithms have been reported and applied in chemistry data analysis; the one reported here is worth examining: a naïve algorithm searches each parameter sequentially and randomly within its bounds, then picks the best value for the next iteration. One can thus ignore irrational solutions of the model itself, or of its gradient in parameter space, and continue the optimization.
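The sequential random search described above can be sketched as follows. This is a minimal reading of the abstract, not the paper's exact algorithm; the function names, iteration counts, and the exponential-fit example are assumptions.

```python
import numpy as np

def sequential_random_search(loss, bounds, n_iter=200, n_trials=20, seed=0):
    """Gradient-free fit in the spirit of the abstract: each iteration
    perturbs one parameter at a time with uniform random draws inside
    its bounds and keeps any value that improves the loss."""
    rng = np.random.default_rng(seed)
    theta = np.array([(lo + hi) / 2 for lo, hi in bounds], dtype=float)
    best = loss(theta)
    for _ in range(n_iter):
        for j, (lo, hi) in enumerate(bounds):
            for cand in rng.uniform(lo, hi, n_trials):
                trial = theta.copy()
                trial[j] = cand
                val = loss(trial)
                if np.isfinite(val) and val < best:  # skip irrational values
                    best, theta = val, trial
    return theta, best

# Fit y = a * exp(-b x) to noiseless data generated with a = 2, b = 0.5
x = np.linspace(0, 5, 30)
y = 2.0 * np.exp(-0.5 * x)
sse = lambda p: float(np.sum((y - p[0] * np.exp(-p[1] * x)) ** 2))
theta_hat, sse_hat = sequential_random_search(sse, [(0.0, 10.0), (0.0, 5.0)])
```

Because only finite, improving losses are accepted, candidate values where the model misbehaves (NaNs, overflow) are simply discarded, which is the gradient-free robustness the abstract emphasizes.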

Citations: 0
A DEIM-CUR factorization with iterative SVDs
Pub Date : 2024-06-17 DOI: 10.1016/j.jcmds.2024.100095
Perfect Y. Gidisu, Michiel E. Hochstenbach

A CUR factorization is often utilized as a substitute for the singular value decomposition (SVD), especially when a concrete interpretation of the singular vectors is challenging. Moreover, if the original data matrix possesses properties like nonnegativity and sparsity, a CUR decomposition can better preserve them compared to the SVD. An essential aspect of this approach is the methodology used for selecting a subset of columns and rows from the original matrix. This study investigates the effectiveness of one-round sampling and iterative subselection techniques and introduces new iterative subselection strategies based on iterative SVDs. One provably appropriate technique for index selection in constructing a CUR factorization is the discrete empirical interpolation method (DEIM). Our contribution aims to improve the approximation quality of the DEIM scheme by iteratively invoking it in several rounds, in the sense that we select subsequent columns and rows based on the previously selected ones. Thus, we modify A after each iteration by removing the information that has been captured by the previously selected columns and rows. We also discuss how iterative procedures for computing a few singular vectors of large data matrices can be integrated with the new iterative subselection strategies. We present the results of numerical experiments, providing a comparison of one-round sampling and iterative subselection techniques, and demonstrating the improved approximation quality associated with using the latter.
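The DEIM index-selection step that the paper builds on can be sketched in a few lines. This is the standard single-round DEIM plus a basic CUR assembly, without the iterative subselection refinements the article introduces.

```python
import numpy as np

def deim(U):
    """Discrete empirical interpolation: pick one index per basis vector
    (column of U), each time at the largest interpolation residual."""
    n, k = U.shape
    idx = [int(np.argmax(np.abs(U[:, 0])))]
    for j in range(1, k):
        # Interpolate u_j on the chosen indices, then take the worst residual
        c = np.linalg.solve(U[idx, :j], U[idx, j])
        r = U[:, j] - U[:, :j] @ c
        idx.append(int(np.argmax(np.abs(r))))
    return np.array(idx)

# CUR of an exactly rank-3 matrix via DEIM on its singular vectors
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 30))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
rows, cols = deim(U[:, :k]), deim(Vt[:k].T)
C, R = A[:, cols], A[rows, :]                  # actual columns/rows of A
M = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # middle factor
err = np.linalg.norm(A - C @ M @ R) / np.linalg.norm(A)
```

Since C and R are taken directly from A, the factorization inherits properties such as sparsity or nonnegativity from the data, which is the advantage over the SVD noted in the abstract.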

Citations: 0
Bayesian sparsity and class sparsity priors for dictionary learning and coding
Pub Date : 2024-03-21 DOI: 10.1016/j.jcmds.2024.100094
A. Bocchinfuso , D. Calvetti, E. Somersalo

Dictionary learning methods continue to gain popularity for the solution of challenging inverse problems. In the dictionary learning approach, the computational forward model is replaced by a large dictionary of possible outcomes, and the problem is to identify the dictionary entries that best match the data, akin to traditional query matching in search engines. Sparse coding techniques are used to guarantee that the dictionary matching identifies only a few of the dictionary entries, and dictionary compression methods are used to reduce the complexity of the matching problem. In this article, we propose a workflow to facilitate the dictionary matching process. First, the full dictionary is divided into subdictionaries that are separately compressed. The error introduced by the dictionary compression is handled in the Bayesian framework as a modeling error. Furthermore, we propose a new Bayesian data-driven group sparsity coding method to help identify subdictionaries that are not relevant for the dictionary matching. After discarding irrelevant subdictionaries, the dictionary matching is addressed as a deflated problem using sparse coding. The compression and deflation steps can lead to substantial decreases in computational complexity. The effectiveness of compensating for the dictionary compression error and of using the novel group sparsity promotion to deflate the original dictionary is illustrated by applying the methodology to real-world problems: glitch detection in the LIGO experiment and hyperspectral remote sensing.
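The sparse-coding step at the core of dictionary matching can be illustrated with plain orthogonal matching pursuit. The article uses Bayesian sparsity priors instead; this generic sketch (with an orthonormal toy dictionary, so that exact recovery is guaranteed) only shows what "identify only a few dictionary entries" means computationally.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Orthogonal matching pursuit: greedily pick the dictionary atom most
    correlated with the residual, then refit coefficients by least squares."""
    residual, support = y.copy(), []
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

# Orthonormal dictionary and a 3-sparse code to recover
rng = np.random.default_rng(2)
D, _ = np.linalg.qr(rng.normal(size=(64, 64)))
x_true = np.zeros(64)
x_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
y = D @ x_true
x_hat = omp(D, y, n_nonzero=3)
```

In the article's setting the dictionary is large and compressed blockwise, and the Bayesian group-sparsity prior plays the role that the greedy atom selection plays here.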

Citations: 0
Simulation of Erlang and negative binomial distributions using the generalized Lambert W function
Pub Date : 2024-02-08 DOI: 10.1016/j.jcmds.2024.100092
C.Y. Chew , G. Teng , Y.S. Lai

We present a simulation method for generating random variables from Erlang and negative binomial distributions using the generalized Lambert W function. The generalized Lambert W function is utilized to solve the quantile functions of these distributions, allowing for efficient and accurate generation of random variables. The simulation procedure is based on Halley’s method and is demonstrated through the generation of 100,000 random variables for each distribution. The results show close agreement with the theoretical mean and variance values, indicating the effectiveness of the proposed method. This approach offers a valuable tool for generating random variables from Erlang and negative binomial distributions in various applications.
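The abstract's recipe, inverting the quantile function and polishing the root with Halley's method, can be sketched for the Erlang case. The closed-form generalized Lambert W solution from the paper is not reproduced here; this sketch brackets the root numerically first, and all function names are assumptions.

```python
import math
import random

def erlang_cdf(x, n, lam):
    """Erlang(n, lam) CDF: 1 - exp(-lam*x) * sum_{k<n} (lam*x)^k / k!."""
    s = sum((lam * x) ** k / math.factorial(k) for k in range(n))
    return 1.0 - math.exp(-lam * x) * s

def erlang_pdf(x, n, lam):
    return lam ** n * x ** (n - 1) * math.exp(-lam * x) / math.factorial(n - 1)

def erlang_quantile(u, n, lam):
    """Invert the CDF: bisection to bracket the root, then a few Halley
    steps, x -= 2*g*g' / (2*g'^2 - g*g''), with g(x) = CDF(x) - u."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, n, lam) < u:       # expand until the root is bracketed
        hi *= 2.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, n, lam) < u:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    for _ in range(3):                      # Halley polish
        g = erlang_cdf(x, n, lam) - u
        gp = erlang_pdf(x, n, lam)
        gpp = gp * ((n - 1) / x - lam)      # derivative of the pdf
        denom = 2.0 * gp * gp - g * gpp
        if denom != 0.0:
            x -= 2.0 * g * gp / denom
    return x

# Inverse-transform sampling: Erlang(n=3, lam=2) has mean 1.5, variance 0.75
random.seed(0)
n, lam, N = 3, 2.0, 10000
samples = [erlang_quantile(random.random(), n, lam) for _ in range(N)]
mean = sum(samples) / N
var = sum((s - mean) ** 2 for s in samples) / N
```

The sample mean and variance should land close to the theoretical n/lam and n/lam^2, mirroring the agreement the abstract reports for its 100,000-sample runs.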

Citations: 0
Approximate Bayesian computational methods to estimate the strength of divergent selection in population genomics models
Pub Date : 2024-02-07 DOI: 10.1016/j.jcmds.2024.100091
Martyna Lukaszewicz , Ousseini Issaka Salia , Paul A. Hohenlohe , Erkan O. Buzbas

Statistical estimation of parameters in large models of evolutionary processes is often too computationally inefficient to pursue using exact model likelihoods, even with single-nucleotide polymorphism (SNP) data, which offers a way to reduce the size of genetic data while retaining relevant information. Approximate Bayesian Computation (ABC) performs statistical inference about the parameters of large models by taking advantage of simulations to bypass direct evaluation of model likelihoods. We develop a mechanistic model to simulate forward-in-time divergent selection with variable migration rates, modes of reproduction (sexual, asexual), and length and number of migration-selection cycles. We investigate the computational feasibility of ABC for statistical inference and study the quality of estimates of the position of loci under selection and the strength of selection. To expand the parameter space of positions under selection, we enhance the model by implementing an outlier scan on summarized observed data. We evaluate the usefulness of summary statistics well known to capture the strength of selection, and assess their informativeness under divergent selection. We also evaluate the effect of genetic drift with respect to an idealized deterministic model with single-locus selection. We discuss the role of the recombination rate as a confounding factor in estimating the strength of divergent selection, and emphasize its importance in the breakdown of linkage disequilibrium (LD). We identify the part of the model's parameter space in which a strong signal for estimating selection is recovered, and determine whether population differentiation-based or LD-based summary statistics perform well in estimating selection.
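The ABC principle the study relies on, simulate under candidate parameters and keep those whose summary statistics land near the observed ones, can be demonstrated on a toy model. This is a plain rejection sampler estimating a Gaussian mean; the paper's forward-in-time population-genetics simulator and its summary statistics are far richer.

```python
import random
import statistics

def abc_rejection(observed_summary, simulate, prior_sample,
                  n_draws=20000, eps=0.1):
    """ABC rejection: draw theta from the prior, simulate a dataset, and
    accept theta when the simulated summary is within eps of the observed."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample()
        if abs(simulate(theta) - observed_summary) < eps:
            accepted.append(theta)
    return accepted

random.seed(0)
n_obs, true_theta = 100, 1.0
observed = statistics.fmean(random.gauss(true_theta, 1.0) for _ in range(n_obs))

simulate = lambda th: statistics.fmean(random.gauss(th, 1.0) for _ in range(n_obs))
prior = lambda: random.uniform(-5.0, 5.0)   # flat prior on the mean

post = abc_rejection(observed, simulate, prior)
post_mean = statistics.fmean(post)
```

The accepted draws approximate the posterior without ever evaluating a likelihood, which is exactly what makes the approach feasible for models where the likelihood is intractable.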

Citations: 0
Escaping saddle points efficiently with occupation-time-adapted perturbations
Pub Date : 2024-01-14 DOI: 10.1016/j.jcmds.2024.100090
Xin Guo , Jiequn Han , Mahan Tajrobehkar , Wenpin Tang

Motivated by the super-diffusivity of self-repelling random walk, which has roots in statistical physics, this paper develops a new perturbation mechanism for optimization algorithms. In this mechanism, perturbations are adapted to the history of states via the notion of occupation time. After integrating this mechanism into the framework of perturbed gradient descent (PGD) and perturbed accelerated gradient descent (PAGD), two new algorithms are proposed: perturbed gradient descent adapted to occupation time (PGDOT) and its accelerated version (PAGDOT). PGDOT and PAGDOT are guaranteed to avoid getting stuck at non-degenerate saddle points, and are shown to converge to second-order stationary points at least as fast as PGD and PAGD, respectively. The theoretical analysis is corroborated by empirical studies in which the new algorithms consistently escape saddle points and outperform not only their counterparts, PGD and PAGD, but also other popular alternatives including stochastic gradient descent, Adam, and several state-of-the-art adaptive gradient methods.
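The occupation-time idea can be illustrated on a toy saddle. This sketch only conveys the mechanism, perturbations that grow with the time spent in a low-gradient region; PGDOT's actual update rule differs in detail, and the thresholds and radii below are assumptions.

```python
import numpy as np

def pgd_occupation(grad, x0, lr=0.1, n_steps=500, g_thresh=1e-3, r0=1e-3,
                   seed=0):
    """Perturbed gradient descent sketch: when the gradient is small (a
    possible saddle), add a Gaussian kick whose radius scales with the
    occupation time, i.e. the number of steps spent in that region."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    occupation = 0                        # steps spent near the suspected saddle
    for _ in range(n_steps):
        g = grad(x)
        if np.linalg.norm(g) < g_thresh:
            occupation += 1
            x = x + rng.normal(0.0, r0 * occupation, size=x.shape)
        else:
            x = x - lr * g
        if np.linalg.norm(x) > 10.0:      # far from the saddle: escaped
            break
    return x

# f(x, y) = (x^2 - y^2) / 2 has a saddle at the origin; plain gradient
# descent started on the x-axis converges to it and never leaves.
grad = lambda v: np.array([v[0], -v[1]])
x_final = pgd_occupation(grad, [1.0, 0.0])
escaped = bool(np.linalg.norm(x_final) > 1.0)
```

The growing radius mimics the self-repelling-walk intuition: the longer the iterate lingers near a stationary point, the harder it is pushed away, while regions with healthy gradients receive no perturbation at all.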

Citations: 0
London street crime analysis and prediction using crowdsourced dataset
Pub Date : 2023-12-18 DOI: 10.1016/j.jcmds.2023.100089
Ahmed Yunus, Jonathan Loo

To effectively prevent crimes, it is vital to anticipate their patterns and likely occurrences. Our efforts focused on analyzing diverse open-source datasets related to London, such as Met police records, public social media posts, and data from transportation hubs such as bus and rail stations. These datasets provided rich insights into human behaviors, activities, and demographics across different parts of London, paving the way for a machine-learning-driven prediction system. We developed this system using unique crime-related features extracted from these datasets. Furthermore, our study outlined methods to gather detailed street-level information from local communities using various applications. This approach significantly enhances our ability to understand and predict crime patterns. The proposed predictive system has the potential to forecast potential crimes in advance, enabling government bodies to proactively deploy targeted interventions, ultimately aiming to prevent and address criminal incidents more effectively.

Citations: 0
Transfer learning across datasets with different input dimensions: An algorithm and analysis for the linear regression case
Pub Date : 2023-11-02 DOI: 10.1016/j.jcmds.2023.100086
Luis Pedro Silvestrin , Harry van Zanten , Mark Hoogendoorn , Ger Koole

With the development of new sensors and monitoring devices, more data sources become available as inputs for machine learning models. On the one hand, these can help improve a model's accuracy. On the other hand, combining these new inputs with historical data remains a challenge that has not yet been studied in enough detail. In this work, we propose a transfer learning algorithm that combines new and historical data with different input dimensions. The approach is easy to implement and efficient, with computational complexity equivalent to that of ordinary least squares, and requires no hyperparameter tuning, making it straightforward to apply when the new data is limited. Unlike other approaches, we provide a rigorous theoretical study of its robustness, showing that it cannot be outperformed by a baseline that uses only the new data. Our approach achieves state-of-the-art performance on 9 real-life datasets, outperforming the linear DSFT, another linear transfer learning algorithm, and performing comparably to the non-linear DSFT.1

{"title":"Transfer learning across datasets with different input dimensions: An algorithm and analysis for the linear regression case","authors":"Luis Pedro Silvestrin ,&nbsp;Harry van Zanten ,&nbsp;Mark Hoogendoorn ,&nbsp;Ger Koole","doi":"10.1016/j.jcmds.2023.100086","DOIUrl":"https://doi.org/10.1016/j.jcmds.2023.100086","url":null,"abstract":"<div><p>With the development of new sensors and monitoring devices, more sources of data become available to be used as inputs for machine learning models. These can on the one hand help to improve the accuracy of a model. On the other hand, combining these new inputs with historical data remains a challenge that has not yet been studied in enough detail. In this work, we propose a transfer learning algorithm that combines new and historical data with different input dimensions. This approach is easy to implement, efficient, with computational complexity equivalent to the ordinary least-squares method, and requires no hyperparameter tuning, making it straightforward to apply when the new data is limited. Different from other approaches, we provide a rigorous theoretical study of its robustness, showing that it cannot be outperformed by a baseline that utilizes only the new data. 
Our approach achieves state-of-the-art performance on 9 real-life datasets, outperforming the linear DSFT, another linear transfer learning algorithm, and performing comparably to non-linear DSFT.<span><sup>1</sup></span></p></div>","PeriodicalId":100768,"journal":{"name":"Journal of Computational Mathematics and Data Science","volume":"9 ","pages":"Article 100086"},"PeriodicalIF":0.0,"publicationDate":"2023-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772415823000135/pdfft?md5=8c5d403909a1ea698959ce44c171ed61&pid=1-s2.0-S2772415823000135-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134657055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
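One simple way to realize the idea in the abstract — combining historical samples that have fewer input columns with new samples that have more, at ordinary-least-squares cost and with no hyperparameters — is sketched below. This is an illustrative construction under stated assumptions, not necessarily the algorithm of Silvestrin et al.:

```python
import numpy as np

def transfer_ls(X_hist, y_hist, X_new, y_new):
    """Hedged sketch: historical data has only the first d_hist input
    columns; new data has more columns.  Fit the shared coefficients on
    the pooled samples, then fit the extra columns on the residuals of
    the new data.  Both steps are ordinary least squares."""
    d_hist = X_hist.shape[1]
    # shared part: pool old and new samples over the common columns
    X_shared = np.vstack([X_hist, X_new[:, :d_hist]])
    y_shared = np.concatenate([y_hist, y_new])
    beta_shared, *_ = np.linalg.lstsq(X_shared, y_shared, rcond=None)
    # extra part: explain the remaining signal with the new columns only
    resid = y_new - X_new[:, :d_hist] @ beta_shared
    beta_extra, *_ = np.linalg.lstsq(X_new[:, d_hist:], resid, rcond=None)
    return np.concatenate([beta_shared, beta_extra])
```

With noise-free data and an irrelevant extra feature, the pooled fit recovers the shared coefficients exactly and the residual fit returns (near) zero for the extra column, which illustrates why pooling cannot hurt when the historical data is informative.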
Quarter match non-local means algorithm for noise removal
Pub Date : 2023-09-30 DOI: 10.1016/j.jcmds.2023.100085
Chartese Jones

The notion of improvement plays out in many forms in our lives. We look for better quality, faster speed, and smoother connections. To achieve our desired goals, we must ask questions. How can a process be made stronger? How can it be made more efficient? How can it be made more effective? Image denoising plays a vital role in many professions, and understanding how noise arises in images has led to multiple denoising techniques, including total variation regularization, non-local regularization, sparse representation, and low-rank minimization, to name a few. Many of these techniques exist because of the concept of improvement. First, there is a change (a problem). This change invokes thoughts and questions. How these changes occur and how they are handled plays an essential role in the success or failure of the process. With this understanding, we first seek to fully understand the process in order to achieve success. As it relates to image denoising, the non-local means method is remarkably effective for image reconstruction. In particular, the non-local means filter removes noise and sharpens edges without losing too many fine structures and details, and the algorithm is highly accurate. The disadvantage that plagues non-local means filtering, however, is its computational burden, which stems from the non-local averaging. In this paper, we investigate ways to reduce that burden and enhance the effectiveness of the filtering process. Research on image analysis shows a tension between noise reduction and the preservation of actual features, which makes noise reduction a difficult task. To explore this, we propose a quarter-match non-local means denoising filtering algorithm. 
The filters help to identify a more concentrated region of the image, thereby improving the computational efficiency of existing non-local means denoising methods and producing a richer comparison for overlays in the restoration process. To evaluate the new algorithm, the authors test its effectiveness and efficiency against the original non-local means filtering algorithm, termed the "state of the art", and other selective processes. Compared with the original non-local means, the new quarter-match filtering algorithm reduces the computational cost by half on average while improving image quality. To further test the new algorithm, magnetic resonance (MR) and synthetic aperture radar (SAR) images are used as specimens for real-world applications.

{"title":"Quarter match non-local means algorithm for noise removal","authors":"Chartese Jones","doi":"10.1016/j.jcmds.2023.100085","DOIUrl":"https://doi.org/10.1016/j.jcmds.2023.100085","url":null,"abstract":"<div><p>The notion of improving plays out in many forms in our lives. We look for better quality, faster speed, and leisurelier connections. To achieve our desired goals, we must ask questions. How to make a process stronger? How to make a process more efficient? How to make a process more effective? Image denoising plays a vital role in many professions and understanding how noise can be present in images has led to multiple denoising techniques. These techniques include total variation regularization, non-local regularization, sparse representation, and low-rank minimization just to name a few. Many of these techniques exist because of the concept of improvement. First, we have a change (problem). This change invokes thoughts and questions. How these changes occur and how they are handled play an essential role in the realization or malfunction of that process. With this understanding, first, we look to fully understand the process to achieve success. As it relates to image denoising, the non-local means is incredibly effective in image reconstruction. In particular, the non-local means filter removes noise and sharpens edges without losing too many fine structures and details. Also, the non-local means algorithm is amazingly accurate. Consequently, the disadvantage that plagues the non-local means filtering algorithm is the computational burden and it is due to the non-local averaging. In this paper, we investigate innovative ways to reduce the computational burden and enhance the effectiveness of this filtering process. Research examining image analysis shows there is a battle between noise reduction and the preservation of actual features, which makes the reduction of noise a difficult task. 
For exploration, we propose a quarter-match non-local means denoising filtering algorithm. The filters help to classify a more concentrated region in the image and thereby enhance the computational efficiency of the existing non-local means denoising methods and produce an enriched comparison for overlying in the restoration process. To survey the constructs of this new algorithm, the authors use the original non-local means filtering algorithm, which is coined, “State of the Art” and other selective processes to test the effectiveness and efficiency of the new model. When comparing the original non-local means with the new quarter match filtering algorithm, on average, we can reduce the computational cost by half, while improving the quality of the image. To further test our new algorithm, magnetic resonance (MR) and synthetic aperture radar (SAR) images are used as specimens for real-world applications.</p></div>","PeriodicalId":100768,"journal":{"name":"Journal of Computational Mathematics and Data Science","volume":"9 ","pages":"Article 100085"},"PeriodicalIF":0.0,"publicationDate":"2023-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50194986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
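The computational burden the abstract refers to comes from comparing every pixel's patch against every patch in its search window. A minimal, unoptimized non-local means sketch makes this cost explicit; the quarter-match filtering step itself is not reproduced here, only the baseline it accelerates:

```python
import numpy as np

def nlm_denoise(img, patch=3, search=7, h=0.1):
    """Minimal non-local means: each pixel becomes a weighted average of
    pixels whose surrounding patches look similar; weights decay
    exponentially with the mean squared patch distance (parameter h)."""
    pad = patch // 2          # half patch size
    s = search // 2           # half search-window size
    padded = np.pad(img, pad + s, mode="reflect")
    H, W = img.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            ci, cj = i + pad + s, j + pad + s      # centre in padded coords
            ref = padded[ci - pad:ci + pad + 1, cj - pad:cj + pad + 1]
            num = den = 0.0
            for di in range(-s, s + 1):            # scan the search window
                for dj in range(-s, s + 1):
                    ni, nj = ci + di, cj + dj
                    cand = padded[ni - pad:ni + pad + 1, nj - pad:nj + pad + 1]
                    d2 = np.mean((ref - cand) ** 2)
                    w = np.exp(-d2 / (h * h))      # similarity weight
                    num += w * padded[ni, nj]
                    den += w
            out[i, j] = num / den
    return out
```

Each output pixel costs O(search² · patch²) operations, which is exactly the burden that restricting the comparison to a better-matched subset of the window aims to cut.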