In the quest for tighter relaxations of combinatorial optimization problems, semidefinite programming generalizes linear programming, offering better bounds while remaining polynomially solvable. Yet, in practice, a semidefinite program is still significantly harder to solve than a Linear Program (LP) of similar size. It is well known that a semidefinite program can be written as an LP with infinitely many cuts, which could in principle be solved by repeated separation in a Cutting-Planes scheme; in practice, this approach is likely to fail. We proposed in [Projective Cutting-Planes, Daniel Porumbel, SIAM Journal on Optimization, 2020] the Projective Cutting-Planes method, which upgrades the well-known separation sub-problem to the projection sub-problem: given a feasible $y$ inside a polytope $P$ and a direction $d$, find the maximum $t^*$ such that $y+t^*d \in P$. Using this new sub-problem, one can generate a sequence of both inner and outer solutions that converge to the optimum over $P$. This paper shows that the projection sub-problem can be solved very efficiently in a semidefinite programming context, enabling the resulting method to compete very well with state-of-the-art semidefinite optimization software (refined over decades). Results suggest it may be the fastest method for matrix sizes larger than $2000 \times 2000$.
Semidefinite Programming by Projective Cutting Planes. Daniel Porumbel. arXiv - CS - Mathematical Software, 2023-11-15. https://doi.org/arxiv-2311.09365
Max von Hippel (Northeastern University), Panagiotis Manolios (Northeastern University), Kenneth L. McMillan (University of Texas at Austin), Cristina Nita-Rotaru (Northeastern University), Lenore Zuck (University of Illinois Chicago)
When verifying computer systems we sometimes want to study their asymptotic behaviors, i.e., how they behave in the long run. In such cases, we need real analysis, the area of mathematics that deals with limits and the foundations of calculus. In a prior work, we used real analysis in ACL2s to study the asymptotic behavior of the RTO computation, commonly used in congestion control algorithms across the Internet. One key component in our RTO computation analysis was proving in ACL2s that for all alpha in [0, 1), the limit as n approaches infinity of alpha raised to n is zero. Whereas the most obvious proof strategy involves the logarithm, whose codomain includes irrationals, by default ACL2 only supports rationals, which forced us to take a non-standard approach. In this paper, we explore different approaches to proving the above result in ACL2(r) and ACL2s, from the perspective of a relatively new user to each. We also contextualize the theorem by showing how it allowed us to prove important asymptotic properties of the RTO computation. Finally, we discuss tradeoffs between the various proof strategies and directions for future research.
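One logarithm-free argument for this limit (not necessarily the one the authors formalized) keeps every intermediate quantity rational, via Bernoulli's inequality:

```latex
\text{For } \alpha \in (0,1),\ \text{write } x = \frac{1-\alpha}{\alpha} > 0,
\ \text{so } \alpha = \frac{1}{1+x}. \quad
\text{Bernoulli: } (1+x)^n \ge 1 + nx
\ \Rightarrow\
0 \le \alpha^n = \frac{1}{(1+x)^n} \le \frac{1}{1+nx}.
```

Hence for any rational $\epsilon > 0$, choosing $n > 1/(\epsilon x)$ gives $\alpha^n < \epsilon$; the case $\alpha = 0$ is immediate. Every quantity involved is rational whenever $\alpha$ is.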
A Case Study in Analytic Protocol Analysis in ACL2. arXiv - CS - Mathematical Software, 2023-11-15. https://doi.org/arxiv-2311.08855
Factorization and multiplication of dense matrices and tensors are critical, yet extremely expensive pieces of the scientific toolbox. Careful use of low rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be designed to efficiently exploit modern hardware architectures. The majority of existing work relies on batched BLAS libraries to handle the computation of many small dense matrices. We show that through careful analysis of the cache utilization, register accumulation using SIMD registers, and a redesign of the implementation, one can achieve significantly higher throughput for these types of batched low-rank matrices across a large range of block and batch sizes. We test our algorithm on three CPUs with diverse ISAs -- the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148 using AVX-512, and the AMD EPYC 7502 using AVX2 -- and show that our new batching methodology obtains more than twice the throughput of vendor-optimized libraries for all CPU architectures and problem sizes.
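To illustrate the data layout being targeted (the shapes and names below are ours, not the paper's kernels): a batch of small blocks stored in low-rank factored form is applied with two thin products per block instead of one dense one, which is exactly the workload that batched BLAS, or the hand-tuned kernels described above, must make fast.

```python
import numpy as np

# A batch of small low-rank blocks A_i ~= U_i @ V_i applied to vectors x_i.
# Factored application costs O(batch * m * r) instead of O(batch * m * n).
rng = np.random.default_rng(0)
batch, m, n, r = 64, 32, 32, 4          # hypothetical block/batch sizes
U = rng.standard_normal((batch, m, r))
V = rng.standard_normal((batch, r, n))
x = rng.standard_normal((batch, n))

# Two rank-r products per block; einsum keeps the whole batch in one call.
y_factored = np.einsum('bmr,br->bm', U, np.einsum('brn,bn->br', V, x))

# Reference: form each dense block explicitly and multiply.
y_dense = np.einsum('bmn,bn->bm', U @ V, x)
assert np.allclose(y_factored, y_dense)
```

The per-block inner products are small enough that, in a native implementation, the accumulators fit in SIMD registers, which is where the cache and register analysis in the paper pays off.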
Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors. Sameer Deshmukh, Rio Yokota, George Bosilca. arXiv - CS - Mathematical Software, 2023-11-11. https://doi.org/arxiv-2311.07602
Mitchell Tong Harris, Pierre-David Letourneau, Dalton Jones, M. Harper Langston
We present an efficient framework for solving constrained global non-convex polynomial optimization problems. We prove the existence of an equivalent nonlinear reformulation of such problems that possesses essentially no spurious local minima. We show through numerical experiments that polynomial scaling in dimension and degree is achievable for computing the optimal value and location of previously intractable global constrained polynomial optimization problems in high dimension.
An Efficient Framework for Global Non-Convex Polynomial Optimization with Nonlinear Polynomial Constraints. arXiv - CS - Mathematical Software, 2023-11-03. https://doi.org/arxiv-2311.02037
Sameer Deshmukh, Qinxiang Ma, Rio Yokota, George Bosilca
Structured dense matrices result from boundary integral problems in electrostatics and geostatistics, and also Schur complements in sparse preconditioners such as multi-frontal methods. Exploiting the structure of such matrices can reduce the time for dense direct factorization from $O(N^3)$ to $O(N)$. The Hierarchically Semi-Separable (HSS) matrix is one such low rank matrix format that can be factorized using a Cholesky-like algorithm called ULV factorization. The HSS-ULV algorithm is highly parallel because it removes the dependency on trailing sub-matrices at each HSS level. However, a key merge step that links two successive HSS levels remains a challenge for efficient parallelization. In this paper, we use an asynchronous runtime system PaRSEC with the HSS-ULV algorithm. We compare our work with STRUMPACK and LORAPO, both state-of-the-art implementations of dense direct low rank factorization, and achieve up to 2x better factorization time for matrices arising from a diverse set of applications on up to 128 nodes of Fugaku for similar or better accuracy for all the problems that we survey.
$O(N)$ distributed direct factorization of structured dense matrices using runtime systems. arXiv - CS - Mathematical Software, 2023-11-02. https://doi.org/arxiv-2311.00921
In this paper, we focus on three sparse matrix operations that are relevant for machine learning applications, namely, the sparse-dense matrix multiplication (SPMM), the sampled dense-dense matrix multiplication (SDDMM), and the composition of the SDDMM with SPMM, also termed FusedMM. We develop optimized implementations for SPMM, SDDMM, and FusedMM operations utilizing Intel oneAPI's Explicit SIMD (ESIMD) SYCL extension API. In contrast to CUDA or SYCL, the ESIMD API enables the writing of explicitly vectorized kernel code. Sparse matrix algorithms implemented with the ESIMD API achieved performance close to the peak of the targeted Intel Data Center GPU. We compare our performance results to Intel's oneMKL library on Intel GPUs and to a recent CUDA implementation for the sparse matrix operations on NVIDIA's V100 GPU and demonstrate that our implementations for sparse matrix operations outperform both.
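For readers unfamiliar with the three kernels, their semantics can be sketched with SciPy (this is only a reference definition of the operations, not the paper's GPU implementation; the variable names are ours):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((4, 6))
S = sp.random(6, 6, density=0.3, random_state=1, format='csr')
H = rng.standard_normal((6, 4))

# SDDMM: dense-dense product A @ B, sampled at the sparsity pattern of S.
sddmm = S.multiply(A @ B)          # sparse result, same pattern as S

# SPMM: sparse-dense product.  FusedMM is the composition of the two.
fusedmm = sddmm @ H
assert fusedmm.shape == (6, 4)
```

Fusing the two steps avoids materializing the intermediate sparse matrix, which is the point of the FusedMM kernel.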
Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU. Mohammad Zubair, Christoph Bauinger. arXiv - CS - Mathematical Software, 2023-11-01. https://doi.org/arxiv-2311.00368
NoMoPy is a code for fitting, analyzing, and generating noise modeled as a hidden Markov model (HMM) or, more generally, factorial hidden Markov model (FHMM). This code, written in Python, implements approximate and exact expectation maximization (EM) algorithms for performing the parameter estimation process, model selection procedures via cross-validation, and parameter confidence region estimation. Here, we describe in detail the functionality implemented in NoMoPy and provide examples of its use and performance on example problems.
NoMoPy: Noise Modeling in Python. Dylan Albrecht, N. Tobias Jacobson. arXiv - CS - Mathematical Software, 2023-10-31. https://doi.org/arxiv-2311.00084
Tetiana Parshakova, Trevor Hastie, Eric Darve, Stephen Boyd
We consider multilevel low rank (MLR) matrices, defined as a row and column permutation of a sum of matrices, each one a block diagonal refinement of the previous one, with all blocks low rank given in factored form. MLR matrices extend low rank matrices but share many of their properties, such as the total storage required and complexity of matrix-vector multiplication. We address three problems that arise in fitting a given matrix by an MLR matrix in the Frobenius norm. The first problem is factor fitting, where we adjust the factors of the MLR matrix. The second is rank allocation, where we choose the ranks of the blocks in each level, subject to the total rank having a given value, which preserves the total storage needed for the MLR matrix. The final problem is to choose the hierarchical partition of rows and columns, along with the ranks and factors. This paper is accompanied by an open source package that implements the proposed methods.
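A minimal two-level instance makes the definition concrete (a sketch with hypothetical sizes; the row/column permutations are omitted for brevity, and this is not the accompanying package): level 0 is one low-rank term, and level 1 refines it with two low-rank diagonal blocks, so storage and matrix-vector cost stay linear in the dimension for fixed ranks.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 8, 2
U0, V0 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
U1 = [rng.standard_normal((n // 2, r)) for _ in range(2)]
V1 = [rng.standard_normal((n // 2, r)) for _ in range(2)]

def mlr_matvec(x):
    y = U0 @ (V0.T @ x)                       # level-0 low-rank term
    for i in range(2):                        # level-1 block-diagonal terms
        s = slice(i * n // 2, (i + 1) * n // 2)
        y[s] += U1[i] @ (V1[i].T @ x[s])
    return y

# Dense reference: level-0 matrix plus the block-diagonal level-1 matrix.
A = U0 @ V0.T
for i in range(2):
    s = slice(i * n // 2, (i + 1) * n // 2)
    A[s, s] += U1[i] @ V1[i].T

x = rng.standard_normal(n)
assert np.allclose(mlr_matvec(x), A @ x)
```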
Factor Fitting, Rank Allocation, and Partitioning in Multilevel Low Rank Matrices. arXiv - CS - Mathematical Software, 2023-10-30. https://doi.org/arxiv-2310.19214
The Hurst exponent is a significant indicator for characterizing the self-similarity and long-term memory properties of time sequences. It has wide applications in physics, technology, engineering, mathematics, statistics, economics, psychology, and other fields. Currently, available methods for estimating the Hurst exponent of time sequences can be divided into different categories: time-domain methods and spectrum-domain methods, based on the representation of the time sequence; and linear regression methods and Bayesian methods, based on the parameter estimation approach. Although various methods are discussed in the literature, there are still some deficiencies: the descriptions of the estimation algorithms are purely mathematics-oriented and pseudo-code is missing; the effectiveness and accuracy of the estimation algorithms are not clear; and the classification of estimation methods is not considered, so there is a lack of guidance for selecting among them. In this work, the emphasis is put on thirteen dominant methods for estimating the Hurst exponent. To reduce the difficulty of implementing the estimation methods in computer programs, the mathematical principles are discussed briefly and the pseudo-code of each algorithm is presented with the necessary details.
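As a taste of the time-domain family, the classical rescaled-range (R/S) estimator can be sketched in a few lines (an illustration only; the survey itself covers thirteen methods and their trade-offs):

```python
import numpy as np

def hurst_rs(x, min_win=8):
    """Estimate the Hurst exponent by rescaled-range (R/S) analysis."""
    x = np.asarray(x, dtype=float)
    sizes, rs = [], []
    win = min_win
    while win <= len(x) // 2:
        vals = []
        for start in range(0, len(x) - win + 1, win):
            seg = x[start:start + win]
            dev = np.cumsum(seg - seg.mean())    # cumulative deviations
            r = dev.max() - dev.min()            # range of deviations
            s = seg.std()                        # scale
            if s > 0:
                vals.append(r / s)
        sizes.append(win)
        rs.append(np.mean(vals))
        win *= 2
    # Hurst exponent = slope of log(R/S) against log(window size)
    return np.polyfit(np.log(sizes), np.log(rs), 1)[0]

rng = np.random.default_rng(3)
h = hurst_rs(rng.standard_normal(4096))   # i.i.d. noise: H should be near 0.5
assert 0.25 < h < 0.75
```

The small-sample bias of the plain R/S statistic is one of the accuracy issues that motivates comparing it against the other estimators surveyed.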
A Survey of Methods for Estimating Hurst Exponent of Time Sequence. Hong-Yan Zhang, Zhi-Qiang Feng, Si-Yu Feng, Yu Zhou. arXiv - CS - Mathematical Software, 2023-10-29. https://doi.org/arxiv-2310.19051
The optimization of matrix multiplication (GEMM) has been a constant need over the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The GEMM is usually implemented following the GotoBLAS philosophy, which tiles the GEMM operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernels. However, this approach forces developers to generate, with a non-negligible effort, a dedicated micro-kernel for each new piece of hardware. In this work, we present a step-by-step procedure for generating micro-kernels with the Exo compiler that perform close to (or even better than) manually developed micro-kernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions.
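The GotoBLAS loop structure referred to above can be sketched schematically (in Python for clarity, not speed; the blocking parameters are illustrative, and the innermost update stands in for the hand-written or Exo-generated micro-kernel):

```python
import numpy as np

def tiled_gemm(A, B, mc=64, kc=64, nc=64, mr=4, nr=4):
    """GotoBLAS-style five-loop GEMM; the innermost block update plays
    the role of the micro-kernel holding an mr x nr tile of C in registers."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for jc in range(0, n, nc):            # loop 5: panels of B (L3 cache)
        for pc in range(0, k, kc):        # loop 4: depth blocking (L2 cache)
            for ic in range(0, m, mc):    # loop 3: panels of A (L1/L2 cache)
                for jr in range(jc, min(jc + nc, n), nr):      # loop 2
                    for ir in range(ic, min(ic + mc, m), mr):  # loop 1
                        # "micro-kernel": rank-kc update of an mr x nr tile
                        C[ir:ir + mr, jr:jr + nr] += (
                            A[ir:ir + mr, pc:pc + kc]
                            @ B[pc:pc + kc, jr:jr + nr]
                        )
    return C

rng = np.random.default_rng(4)
A, B = rng.standard_normal((100, 80)), rng.standard_normal((80, 90))
assert np.allclose(tiled_gemm(A, B), A @ B)
```

In a real library only the innermost update is hardware-specific, which is why generating that micro-kernel automatically, as done here with Exo, removes most of the porting effort.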
Tackling the Matrix Multiplication Micro-kernel Generation with Exo. Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi, Héctor Martínez. arXiv - CS - Mathematical Software, 2023-10-26. https://doi.org/arxiv-2310.17408