Data-layout optimization based on memory-access-pattern analysis for source-code performance improvement
Riyane Sid Lakhdar, H. Charles, Maha Kooli
DOI: 10.1145/3378678.3391874

With the rising impact of the memory wall, selecting an adequate data-structure implementation for a given kernel has become a performance-critical issue. This paper presents a new methodology to solve the data-layout decision problem by adapting an input implementation to the host hardware's memory hierarchy. For a given input program, the proposed method automatically identifies the best-performing data-layout implementation for each selected variable by analyzing its memory-access pattern. The method is designed to be embedded within a general-purpose compiler. Experiments on the PolyBench/C benchmarks, a recursive bilateral filter, and JPEG-compression kernels show that our method accurately determines the optimized data-structure implementation. The optimized implementations reach an execution-time speed-up of up to 48.9X and an L3-miss reduction of up to 98.1X on an x86 Intel Xeon processor with three levels of data caches and a least-recently-used cache-replacement policy.
Analog implementation of arithmetic operations on real memristors
Thore Kolms, Andreas Waldner, Christine Lang, Philipp Grothe, Jan Haase
DOI: 10.1145/3378678.3391883

The emerging field of in-memory computing aims to relieve CPUs by taking over simple calculations that can be performed directly in memory. This reduces both the performance drain caused by those calculations and the energy consumption of the whole system, which is particularly important for embedded systems. Memristors are variable, non-volatile resistors that can store analog values, which makes them suitable for in-memory computing. This paper describes a prototypical implementation of analog calculations (addition, subtraction, multiplication) on real memristors. The prototype is based on an ESP32 microcontroller. Typical calculations currently take around 1 μs.
Design space exploration for layer-parallel execution of convolutional neural networks on CGRAs
C. Heidorn, Frank Hannig, J. Teich
DOI: 10.1145/3378678.3391878

In this work, we systematically explore the design space of throughput, energy, and hardware cost for layer-parallel mappings of convolutional neural networks (CNNs) onto coarse-grained reconfigurable arrays (CGRAs). We derive an analytical model that computes the resources (processing elements) and buffer memory, and thus the hardware cost C, required to sustain a given throughput T, as well as the resulting overall energy consumption E for inference. Further, we propose an efficient design space exploration (DSE) to determine the fronts of Pareto-optimal (T, E, C) solutions. This exploration helps to determine the limits of scalability of the presented tiled CGRA accelerator architectures in terms of throughput, the number of layers that can be processed in parallel, and memory requirements. Finally, we evaluate the energy savings achievable on our architecture in comparison to implementations that execute a CNN sequentially, layer by layer. Experiments show that layer-parallel processing reduces energy consumption E by 3.6X and hardware cost C by 1.2X, and increases the achievable throughput T by 6.2X for MobileNet.
Portable exploitation of parallel and heterogeneous HPC architectures in neural simulation using SkePU
Sotirios Panagiotou, August Ernstsson, Johan Ahlqvist, Lazaros Papadopoulos, C. Kessler, D. Soudris
DOI: 10.1145/3378678.3391889

The complexity of modern HPC systems requires new tools that support advanced programming models and offer portability and programmability across parallel and heterogeneous architectures. In this work, we evaluate the use of the SkePU framework in an HPC application from the neural-computing domain. We demonstrate the successful deployment of the application on top of SkePU using multiple back-ends (OpenMP, OpenCL, and MPI) and present lessons learned towards future extensions of the SkePU framework.
Compiler-based WCET prediction performing function specialization
Kateryna Muts, H. Falk
DOI: 10.1145/3378678.3391879

The worst-case execution time (WCET) is one of the most important criteria for hard real-time systems. Many optimizations have been proposed to improve the WCET of an embedded application at compile time. Since modern embedded systems must also satisfy additional design criteria such as code size or energy consumption, compiler optimizations increasingly turn into multi-objective optimization problems. Evolutionary algorithms are the most widely used method to solve such problems, but finding the set of best trade-offs between the objectives requires extensive evaluations of the objective functions. Treating the WCET as one objective is therefore infeasible in many cases, because WCET analysis at compile time can be very time-consuming. For this reason, we propose a machine-learning-based method to predict WCET values at compile time. A well-known compiler optimization, function specialization, serves as the basis for the proposed prediction model. We analyze a regression method with respect to making WCET predictions as precise as possible when performing function specialization.
Efficient parallel reduction on GPUs with Hipacc
Bo Qiao, Oliver Reiche, M. A. Özkan, J. Teich, Frank Hannig
DOI: 10.1145/3378678.3391885

Hipacc is a domain-specific language that eases the programming of image-processing applications on hardware accelerators such as GPUs. Drawing on domain- and architecture-specific knowledge, it relieves developers of the burden of manually porting algorithms to hardware. One fundamental operation in image processing is reduction: global reduction operators are the building blocks of many widely used algorithms, including image normalization and similarity estimation. This paper presents an efficient approach to performing parallel reductions on GPUs with Hipacc. Our approach benefits from hardware vendors' continuous efforts to improve performance and programmability, for example by utilizing the latest low-level primitives from Nvidia. Results show that our approach achieves a speedup of up to 3.43 over an existing Hipacc implementation with traditional optimization methods, and a speedup of up to 9.02 over an implementation using the Thrust library from Nvidia.
Compiling synchronous languages to optimal move code for exposed datapath architectures
Marc Dahlem, K. Schneider
DOI: 10.1145/3378678.3391877

Conventional processor architectures are limited in exploiting instruction-level parallelism (ILP), one reason being their relatively low number of registers. Recent processor architectures therefore expose their datapaths so that the compiler can directly transport results from one processing unit to another. Among these, the Synchronous Control Asynchronous Dataflow (SCAD) architecture is a recently proposed exposed datapath architecture whose goal is to completely bypass the use of registers. Processor architectures with a high degree of ILP like SCAD are particularly useful for executing synchronous programs: the execution of a synchronous program is a sequence of reaction steps consisting of atomic actions that have to be executed in dataflow order. Synchronous programs typically provide a lot of ILP, so exposed datapath architectures may execute them efficiently. However, optimal code generation for SCAD is a major challenge. Previous work showed how basic blocks can be compiled to optimal move code for SCAD by means of answer set programming (ASP). This paper extends that approach to compile complete synchronous programs, instead of only basic blocks, to optimal move code. As a result, an ASP-based compiler was developed that translates Quartz programs to move code for the SCAD architecture, maximizing the use of the available ILP in the program while respecting the resource limitations of the processor.