Entropy-Based Analysis of Benchmarks for Instruction Set Simulators
Nils Bosbach, Lukas Jünger, Rebecca Pelke, Niko Zurstraßen, R. Leupers
DOI: https://doi.org/10.1145/3579170.3579267

Instruction-Set Simulators (ISSs) are widely used to simulate the execution of programs for a target architecture on a host machine. They translate the instructions of the program to be executed into instructions of the host Instruction-Set Architecture (ISA). The performance of an ISS strongly depends on its implementation and on the instructions it has to execute. Benchmarks used to compare the performance of ISSs should therefore contain a wide variety of instructions. Since many benchmarks are written in high-level programming languages, it is usually not clear to the user which instructions underlie a benchmark. In this work, we present a tool that analyzes the variety of instructions used by a benchmark. A multi-stage analysis collects the properties of the benchmark, and an entropy-based metric measures the diversity of the instructions it executes. In a case study, we present results for the benchmarks Whetstone, Dhrystone, Coremark, STREAM, and stdcbench. We show the diversity of these benchmarks for different compiler optimizations and indicate which benchmarks should be used to test the general performance of an ISS.

Fast Yet Accurate Timing and Power Prediction of Artificial Neural Networks Deployed on Clock-Gated Multi-Core Platforms
Quentin Dariol, S. Le Nours, D. Helms, R. Stemmer, S. Pillement, Kim Grüttner
DOI: https://doi.org/10.1145/3579170.3579263

When deploying Artificial Neural Networks (ANNs) onto multi-core embedded platforms, an intensive evaluation flow is necessary to find implementations that optimize resource usage, timing, and power. ANNs require significant computational and memory resources, while embedded execution platforms offer limited resources under strict power budgets. Concurrent accesses from processors to shared resources on multi-core platforms can lead to bottlenecks that affect performance and power. Existing approaches are limited in their ability to deliver fast yet accurate evaluation ahead of ANN deployment on the targeted hardware. In this paper, we present a modeling flow for timing and power prediction of fully-connected ANNs on multi-core platforms in early design stages. Our flow offers fast yet accurate predictions that account for shared communication resources and scale with the number of cores used. The flow is evaluated against real measurements for 42 mappings of 3 fully-connected ANNs executed on a clock-gated multi-core platform featuring two different communication modes: polling and interrupt-based. Our modeling flow predicts timing and power accurately on the tested mappings, with an average simulation time of 0.23 s for 100 iterations. We then illustrate the application of our approach for efficient design space exploration of ANN implementations.

Towards a European Network of Enabling Technologies for Drones
R. Nouacer, Mahmoud Hussein, Paul Detterer, E. Villar, F. Herrera, Carlo Tieri, E. Grolleau
DOI: https://doi.org/10.1145/3579170.3579264

Drone-based service and product innovation is curtailed by the growing dependence on poorly interoperable proprietary technologies, as well as by the risks posed to people on the ground, to other vehicles, and to property (e.g., critical infrastructure). On the innovation side, the Single European Sky ATM Research (SESAR) Joint Undertaking is developing U-space, a set of services and procedures to help drones access airspace safely and efficiently. The aim of COMP4DRONES is to complement the SESAR JU efforts by providing a framework of key enabling technologies for safe and autonomous drones, with a specific focus on U2 and U3. The COMP4DRONES project has contributed to support (1) efficient customization and incremental assurance of drone-embedded platforms, (2) safe autonomous decision making for individual or cooperative missions, (3) trustworthy drone-to-drone and drone-to-ground communications, even in the presence of malicious attackers and under intrinsic platform constraints, and (4) agile and cost-effective design and assurance of drone modules and systems. In this paper, we discuss the results of the COMP4DRONES project complementing the SESAR JU efforts, with a particular focus on safe software and hardware drone architectures.

Fast Instruction Cache Simulation is Trickier than You Think
M. Badaroux, J. Dumas, F. Pétrot
DOI: https://doi.org/10.1145/3579170.3579261

Given the performance it achieves, dynamic binary translation is the most compelling simulation approach for cross-emulation of software-centric systems. This speed comes at a cost: simulation is purely functional. Modeling instruction caches by instrumenting each target instruction is feasible, but severely degrades performance. Since translation occurs per target instruction block, we propose to model instruction caches at that granularity. This raises a few issues, which we detail and mitigate. We implement this solution in the QEMU dynamic binary translation engine, which brings up an interesting problem inherent to this simulation strategy. Using a multicore RISC-V-based platform as a test vehicle, we show that a properly constructed model can be nearly as accurate as an instruction-accurate one. On the PolyBench/C and PARSEC benchmarks, our model slows down simulation by a factor of 2 to 10 compared to vanilla QEMU. Although not negligible, this should be weighed against the factor of 20 to 60 for the instruction-accurate approach.

An Analytical Model of Configurable Systolic Arrays to find the Best-Fitting Accelerator for a given DNN Workload
Tim Hotfilter, Patrick Schmidt, Julian Höfer, Fabian Kreß, T. Harbaum, Juergen Becker
DOI: https://doi.org/10.1145/3579170.3579258

Since their breakthrough, the complexity of Deep Neural Networks (DNNs) has been rising steadily. As a result, accelerators for DNNs are now used in many domains. However, designing and configuring an accelerator that perfectly meets the requirements of a given application is a challenging task. In this paper, we therefore present our approach to support the accelerator design process. With an analytical model of a systolic array, we can estimate performance, energy consumption, and area for each design option. To determine these metrics, a cycle-accurate simulation is usually performed, which is time-consuming; hence, the design space has to be restricted heavily. Analytical modelling, in contrast, allows fast evaluation of a design using a mathematical abstraction of the accelerator. For DNNs, this works especially well since the dataflow and memory accesses are highly regular. To show the correctness of our model, we perform an exemplary realization with the state-of-the-art systolic array generator Gemmini and compare it with a cycle-accurate simulation and state-of-the-art modelling tools, showing less than 1% deviation. We also conducted a design space exploration, showing the analytical model's capability to support accelerator design. In a case study on ResNet-34, we demonstrate that our model and DSE tool reduce the time to find the best-fitting solution by four and two orders of magnitude compared to a cycle-accurate simulation and state-of-the-art modelling tools, respectively.

Automatic DRAM Subsystem Configuration with irace
Lukas Steiner, Gustavo Delazeri, Iron Prando da Silva, Matthias Jung, N. Wehn
DOI: https://doi.org/10.1145/3579170.3579259

Nowadays, DRAM subsystem configuration involves a large number of parameters, resulting in an extensive design space. Setting these parameters is a challenging step in system design, as the parameter-workload interactions are complex. Since design space exploration by exhaustive simulation is infeasible given limited computing resources and development time, semi-automatic configuration involving both manual and simulation-based decisions is the state of the art. However, it requires considerable expertise in the DRAM domain as well as application knowledge, and there is no guarantee that the resulting subsystem performs well. In this paper, we present a new framework that fully automates DRAM subsystem configuration for a given parameter space and set of target applications. It is based on irace, a software package originally developed for the automatic configuration of optimization algorithms. We show that the framework finds near-optimal configurations while evaluating only a fraction of all application-configuration combinations. In addition, all returned configurations perform better than a predefined standard configuration. Our framework thus enables designers to automatically determine a suitable DRAM subsystem for their platform.

Towards an Ontological Methodology for Dynamic Dependability Management of Unmanned Aerial Vehicles
Guillaume Ollier, F. Arnez, Morayo Adedjouma, Raphaël Lallement, Simos Gerasimou, C. Mraidha
DOI: https://doi.org/10.1145/3579170.3579265

Dynamic Dependability Management (DDM) is a promising approach to guarantee and monitor the ability of safety-critical Automated Systems (ASs) to deliver the intended service with an acceptable risk level. However, the non-interpretability and lack of specifications of the Learning-Enabled Components (LECs) used in ASs make this mission particularly challenging. Some existing DDM techniques overcome these limitations by combining probabilistic environmental perception knowledge with predictions of behavior changes for the agents in the environment. Ontology-based methods allow a formal and traceable representation of AS usage scenarios to be used to support the design of the DDM component of such ASs. This paper presents a methodology for this design process, starting from the AS specification stage and including threat analysis and requirements identification. The paper focuses on the formalization of an ontology modeling language that allows the interpretation of logical usage scenarios, i.e., formal descriptions of scenarios represented by state variables. The proposed supervisory system also considers uncertainty estimation and the interaction between AS components throughout the perception-planning-control pipeline. The methodology is illustrated on a use case involving Unmanned Aerial Vehicles (UAVs).

Non-Intrusive Runtime Monitoring for Manycore Prototypes
Fabian Lesniak, Nidhi Anantharajaiah, T. Harbaum, Juergen Becker
DOI: https://doi.org/10.1145/3579170.3579262

Rapid prototyping is a widely used, essential technique for developing novel computing architectures. While simulation-based approaches make it easy to examine the Design Under Test, the observability of FPGA-based prototypes is limited, as they can behave like a black box. However, for verification and design space exploration it is crucial to obtain detailed information on the internal state of such a prototype. In this work, we propose an architecture that gathers detailed internal measurements during execution and extracts them from the design under test without affecting its runtime behavior. It is specifically designed for low resource usage and minimal impact on timing, leaving more resources for the actual prototyped system. The proposed architecture offers several interface modules for various signal sources, including register capturing, event counters, and bus snooping. We present an estimate of the achievable bandwidth and maximum sample rate, as well as a demanding case study with a tiled manycore platform on a multi-FPGA prototyping system. Experimental results show up to 32 million 4-byte measurements per second, saturating a gigabit Ethernet connection. The monitoring system has proven very useful when working with an FPGA-based manycore prototype, as it is an essential tool to reveal incorrect behavior and bottlenecks in hardware, operating system, and applications at an early stage.

Faster Functional Warming with Cache Merging
Gustaf Borgström, C. Rohner, D. Black-Schaffer
DOI: https://doi.org/10.1145/3579170.3579256

SMARTS-like sampled hardware simulation techniques achieve good accuracy by simulating many small portions of an application in detail. However, while this reduces simulation time, it results in extensive cache warming times, as each of the many simulation points requires warming the whole memory hierarchy. Adaptive Cache Warming reduces this time by iteratively increasing the warming until sufficient accuracy is achieved. Unfortunately, each increase requires that the previous warming be redone, nearly doubling the total warming. We address this re-warming by developing a technique to merge the cache states from the previous and additional warming iterations. We demonstrate our merging approach on a multi-level LRU cache hierarchy and evaluate and address the errors it introduces. Our experiments show that Cache Merging delivers an average speedup of 1.44x, 1.84x, and 1.87x for 128 kB, 2 MB, and 8 MB L2 caches, respectively (vs. a 2x theoretical maximum speedup), with 95th-percentile absolute IPC errors of only 0.029, 0.015, and 0.006, respectively. These results demonstrate that Cache Merging yields significantly higher simulation speed with minimal loss of accuracy.

ReDroSe — Reconfigurable Drone Setup for Resource-Efficient SLAM
Sebastian Rahn, Philipp Gehricke, Can-Leon Petermöller, Eric Neumann, Philipp Schlinge, Leon Rabius, Henning Termühlen, Christopher Sieh, M. Tassemeier, T. Wiemann, Mario Porrmann
DOI: https://doi.org/10.1145/3579170.3579266

In this paper, we present ReDroSe, a heterogeneous compute system based on embedded CPUs, FPGAs, and GPUs, which is integrated into an existing UAV platform to enable real-time SLAM based on a Truncated Signed Distance Field (TSDF) directly on the drone. The system is fully integrated into the existing infrastructure, allowing ground control to manage and monitor the data acquisition process. ReDroSe is evaluated in terms of power consumption and computing capabilities. The results show that the proposed architecture enables computations on the UAV that were previously only possible in post-processing, while keeping the power consumption low enough to match the available flight time of the UAV.