OpenFFT: An Adaptive Tuning Framework for 3D FFT on ARM Multicore CPUs
Tun Chen, Haipeng Jia, Yunquan Zhang, Kun Li, Zhihao Li, Xiang Zhao, Jianyu Yao, Chendi Li
Proceedings of the 37th International Conference on Supercomputing, 2023-06-21
DOI: 10.1145/3577193.3593735
Abstract
The deep hierarchy and shared nature of caches in multicore CPU architectures make it difficult to improve the performance of fundamental algorithms, and 3D FFT is a notable case. 3D FFT is a memory-bound algorithm with many scattered, non-contiguous memory accesses. As the working set grows, data locality degrades and memory access overhead becomes severe, especially for high-dimensional data transposition. This paper proposes OpenFFT, a 3D FFT optimization framework that improves memory access through three methods: 1) Z-OpenFFT, a novel tiling algorithm based on the column-order algorithm for high-dimensional vectorization, which improves data locality and eliminates transposition; 2) the Section-cache-aware algorithm, an efficient search algorithm that optimizes the memory access of the butterfly network of the 1D FFT; and 3) a multi-thread allocation model that analyzes the characteristics of the cache hierarchy and the task size to allocate threads adaptively. Experiments show that OpenFFT achieves better performance than the best configurations of FFTW and ARMPL on ARM CPUs.
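For readers unfamiliar with why a 3D FFT stresses the memory system, the following is a minimal sketch of the textbook baseline, not of OpenFFT itself: a 3D DFT computed as 1D transforms along each axis of a flat row-major buffer. All function names are hypothetical, the row-major layout is an assumption, and a naive O(n^2) DFT stands in for a real 1D FFT kernel to keep the example short.

```python
import cmath


def dft_1d(x):
    """Naive O(n^2) DFT of a sequence of complex numbers (stand-in for a 1D FFT)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def row_major_strides(shape):
    """Element strides of a row-major (C-order) array with the given shape."""
    strides = []
    step = 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def dft_3d_by_axes(data, shape):
    """3D DFT of a flat row-major buffer, as 1D DFTs along each axis in turn."""
    out = list(data)
    strides = row_major_strides(shape)
    total = shape[0] * shape[1] * shape[2]
    for axis in range(3):
        n, stride = shape[axis], strides[axis]
        for base in range(total):
            # Only indices whose coordinate along `axis` is zero start a line.
            if (base // stride) % n != 0:
                continue
            # Strided gather: elements of one line are `stride` apart in memory.
            line = [out[base + i * stride] for i in range(n)]
            for i, value in enumerate(dft_1d(line)):
                out[base + i * stride] = value  # strided scatter
    return out


def dft_3d_direct(data, shape):
    """Reference: the 3D DFT evaluated directly from its definition."""
    n0, n1, n2 = shape
    result = []
    for k0 in range(n0):
        for k1 in range(n1):
            for k2 in range(n2):
                acc = 0j
                for j0 in range(n0):
                    for j1 in range(n1):
                        for j2 in range(n2):
                            phase = j0 * k0 / n0 + j1 * k1 / n1 + j2 * k2 / n2
                            value = data[(j0 * n1 + j1) * n2 + j2]
                            acc += value * cmath.exp(-2j * cmath.pi * phase)
                result.append(acc)
    return result


if __name__ == "__main__":
    shape = (2, 3, 4)
    data = [complex(i % 5, (i * 3) % 7) for i in range(24)]
    by_axes = dft_3d_by_axes(data, shape)
    direct = dft_3d_direct(data, shape)
    print(row_major_strides(shape))
    print(max(abs(a - b) for a, b in zip(by_axes, direct)))
```

In this layout only the last axis is contiguous; lines along the other two axes touch elements that are a full row or a full plane apart, and that gap grows with the problem size. This is the kind of access pattern the abstract associates with poor locality, and which conventional implementations often mitigate by transposing the data between passes.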