Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in applications such as chemometrics, signal processing, and machine learning. A common method for computing such decompositions is the Alternating Least Squares (ALS) algorithm. When the number of components is small, ALS exhibits low arithmetic intensity regardless of its implementation, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. This fusion makes it possible to use GPUs efficiently for further speedups; at the same time, the technique remains compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers a MATLAB interface and a mechanism to deal effectively with decompositions that complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
{"title":"Algorithm XXX: Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions","authors":"C. Psarras, L. Karlsson, R. Bro, P. Bientinesi","doi":"10.1145/3519383","DOIUrl":"https://doi.org/10.1145/3519383","journal":{"name":"ACM Transactions on Mathematical Software"},"publicationDate":"2022-04-29","publicationTypes":"Journal Article"}
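The fusion idea in the abstract above can be illustrated on the kernel that usually dominates CP-ALS, the matricized-tensor times Khatri-Rao product (MTTKRP). The following is a minimal sketch, not the Concurrent ALS library's code: it assumes a dense 3-way tensor, and the names `mttkrp_mode1`, `hstack`, and `rand_matrix` are hypothetical helpers introduced only for this illustration. It shows that concatenating the factor matrices of several independent decompositions column-wise and performing one MTTKRP yields the same result as performing one MTTKRP per decomposition, while each tensor entry is read once and reused across all columns.

```python
import random

def mttkrp_mode1(X, B, C):
    """Mode-1 MTTKRP: M[i][r] = sum over j,k of X[i][j][k] * B[j][r] * C[k][r]."""
    I, J, K, R = len(X), len(B), len(C), len(B[0])
    M = [[0.0] * R for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                x = X[i][j][k]          # each tensor entry is read once ...
                for r in range(R):      # ... and reused for all R columns
                    M[i][r] += x * B[j][r] * C[k][r]
    return M

def hstack(mats):
    """Concatenate matrices (lists of rows) column-wise."""
    return [[v for m in mats for v in m[row]] for row in range(len(mats[0]))]

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]; stands in for ALS factor matrices."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
I, J, K = 4, 3, 5
X = [[[rng.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(J)] for _ in range(I)]

# Three independent decompositions of the same tensor, each with few components.
ranks = [2, 3, 1]
Bs = [rand_matrix(J, R, rng) for R in ranks]
Cs = [rand_matrix(K, R, rng) for R in ranks]

# Separate: one pass over X per decomposition.
separate = hstack([mttkrp_mode1(X, B, C) for B, C in zip(Bs, Cs)])
# Fused: a single pass over X with sum(ranks) columns.
fused = mttkrp_mode1(X, hstack(Bs), hstack(Cs))

max_diff = max(abs(s - f) for rs, rf in zip(separate, fused) for s, f in zip(rs, rf))
print(max_diff)
```

In an optimized setting the inner loops would typically be a single matrix-matrix product, so the fused version performs more floating-point work per tensor entry loaded, which is the sense in which arithmetic intensity increases.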
This paper describes a cubic integral smoothing spline with a roughness penalty for restoring a function from its integrals, and presents the mathematical method for building such a spline in detail. The spline minimizes the sum of squared differences between the observed integrals of the unknown function and the corresponding integrals of the spline, plus a penalty for the nonlinearity (roughness) of the spline. The method has a matrix form, and the paper shows in detail how to fill in each matrix. The parameter α governs the desired smoothness of the restored function. Spline knots can be chosen independently of the observations, and a weight can be assigned to each observation for more control over the shape of the resulting spline. An implementation in the R language is given as the function int_spline, which is easy to use: all arguments are fully described and accompanied by examples. An application of the method to rare event analysis and forecasting is also given.
{"title":"Algorithm xxx: Restoration of function by integrals with cubic integral smoothing spline in R","authors":"Yu. D. Korablev","doi":"10.1145/3519384","DOIUrl":"https://doi.org/10.1145/3519384","journal":{"name":"ACM Transactions on Mathematical Software"},"publicationDate":"2022-03-30","publicationTypes":"Journal Article"}
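The penalized criterion described in the abstract above can be written out as follows. This is a sketch under assumed notation, not the paper's own formulation: the symbols Y_i, [a_i, b_i], w_i, V, W, Q, and c are introduced here for illustration, the roughness penalty is taken to be the usual integrated squared second derivative, and placing α in front of the penalty term is an assumption, since the abstract does not say how α enters.

```latex
% Assumed notation (not from the source):
%   Y_i          observed integral of the unknown function over [a_i, b_i]
%   w_i          weight of observation i
%   S            cubic spline on knots x_0 < ... < x_m
\[
  J(S) \;=\; \sum_{i=1}^{n} w_i \Bigl( Y_i - \int_{a_i}^{b_i} S(t)\,dt \Bigr)^{2}
        \;+\; \alpha \int_{x_0}^{x_m} \bigl( S''(t) \bigr)^{2}\,dt .
\]
% If S depends linearly on a coefficient vector c, the integrals of S are Vc,
% and the penalty is the quadratic form c^T Q c, then with W = diag(w_i):
\[
  J(c) \;=\; (Y - Vc)^{\mathsf T} W (Y - Vc) \;+\; \alpha\, c^{\mathsf T} Q\, c ,
  \qquad
  \bigl( V^{\mathsf T} W V + \alpha Q \bigr)\, c \;=\; V^{\mathsf T} W\, Y .
\]
```

Under these assumptions, a larger α pulls the solution toward a straight line, while α close to zero lets the spline follow the observed integrals closely.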
N. Heavner, F. D. Igual, G. Quintana-Ortí, P.G. Martinsson
The randomized singular value decomposition (RSVD) is by now a well-established technique for efficiently computing an approximate singular value decomposition of a matrix. Building on the ideas that underpin the RSVD, the recently proposed algorithm “randUTV” computes a full factorization of a given matrix that provides low-rank approximations with near-optimal error. Because the bulk of randUTV is cast in terms of communication-efficient operations such as matrix-matrix multiplication and unpivoted QR factorization, it is faster than competing rank-revealing factorization methods such as column-pivoted QR in most high-performance computational settings. In this article, optimized randUTV implementations are presented for both shared-memory and distributed-memory computing environments. For shared memory, randUTV is redesigned in terms of an algorithm-by-blocks that, together with a runtime task scheduler, eliminates bottlenecks from data synchronization points to achieve acceleration over the standard blocked algorithm, which is based on a purely fork-join approach. The distributed-memory implementation is based on the ScaLAPACK library. The performance of our new codes compares favorably with competing factorizations available on both shared-memory and distributed-memory architectures.
{"title":"Algorithm xxx: Efficient algorithms for computing a rank-revealing UTV factorization on parallel computing architectures","authors":"N. Heavner, F. D. Igual, G. Quintana-Ortí, P.G. Martinsson","doi":"10.1145/3507466","DOIUrl":"https://doi.org/10.1145/3507466","journal":{"name":"ACM Transactions on Mathematical Software"},"publicationDate":"2022-03-25","publicationTypes":"Journal Article"}
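For orientation, the structure of a UTV factorization and of one blocked, randomized step can be sketched as follows. The abstract above gives no formulas, so this is an illustrative outline under assumptions rather than the article's algorithm: the block size b, the power parameter q, the Gaussian test matrix G, and the MATLAB-style column notation are all names introduced here.

```latex
% Target: U (m x m) and V (n x n) orthogonal, T (m x n) upper triangular.
\[
  A \;=\; U\, T\, V^{*}
\]
% One blocked step on an m x n matrix A (b and q are assumed parameter names):
\begin{align*}
  G &\in \mathbb{R}^{m \times b} \ \text{with i.i.d.\ Gaussian entries}, \\
  Y &= (A^{*}A)^{q} A^{*} G
      && \text{(matrix-matrix products only)}, \\
  [V, \sim] &= \mathrm{qr}(Y)
      && \text{(unpivoted QR; leading $b$ columns of $V$ span the range of $Y$)}, \\
  [U, \sim] &= \mathrm{qr}\bigl( A\, V(:, 1\!:\!b) \bigr)
      && \text{(unpivoted QR)}, \\
  U^{*} A V &=
  \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix},
      && T_{11} \in \mathbb{R}^{b \times b} \ \text{upper triangular}.
\end{align*}
% The same step is then applied to the trailing block T_22.
```

Every operation in this outline is either a matrix-matrix product or an unpivoted QR factorization, which is consistent with the abstract's explanation of why the method suits high-performance settings better than column-pivoted QR.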