多核cpu上低阶有限元装配算法的内存流量与优化

ACM Transactions on Mathematical Software (TOMS) Pub Date : 2022-03-04 DOI:10.1145/3503925

James D. Trotter, Xing Cai, S. Funke

{"title":"多核cpu上低阶有限元装配算法的内存流量与优化","authors":"James D. Trotter, Xing Cai, S. Funke","doi":"10.1145/3503925","DOIUrl":null,"url":null,"abstract":"Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.","PeriodicalId":7036,"journal":{"name":"ACM Transactions on Mathematical Software (TOMS)","volume":"89 1","pages":"1 - 31"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs\",\"authors\":\"James D. Trotter, Xing Cai, S. Funke\",\"doi\":\"10.1145/3503925\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.\",\"PeriodicalId\":7036,\"journal\":{\"name\":\"ACM Transactions on Mathematical Software (TOMS)\",\"volume\":\"89 1\",\"pages\":\"1 - 31\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Mathematical Software (TOMS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3503925\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Mathematical Software (TOMS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3503925","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

为了理解非结构化计算网格上有限元装配的可实现性能，我们将标准单元装配算法分解为四个核，其中两个核由不规则内存流量主导。研究了几种优化方案，并给出了相应的内存流量估计下界和上界。除了正确地重新排序网格实体之外，两个最重要的优化包括采用查找表将元素矩阵或向量添加到其全局对应项中，以及使用逐行组装算法进行多线程并行化。严格的基准测试表明，由于各种优化，在许多情况下，实际内存流量非常接近估计的下限。这些结果证实了优化的有效性，同时也为开发高效的有限元装配软件提供了一个方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

On Memory Traffic and Optimisations for Low-order Finite Element Assembly Algorithms on Multi-core CPUs

Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Mathematical Software (TOMS)

自引率

0.00%

发文量

期刊最新文献

Configurable Open-source Data Structure for Distributed Conforming Unstructured Homogeneous Meshes with GPU Support Algorithm 1027: NOMAD Version 4: Nonlinear Optimization with the MADS Algorithm Toward Accurate and Fast Summation Algorithm 1028: VTMOP: Solver for Blackbox Multiobjective Optimization Problems Parallel QR Factorization of Block Low-rank Matrices