Run-Time Power Modelling in Embedded GPUs with Dynamic Voltage and Frequency Scaling
J. Núñez-Yáñez, Kris Nikov, K. Eder, Mohammad Hosseinabady
DOI: 10.1145/3381427.3381429 (published 2020-01-21)
This paper investigates the application to embedded GPUs of a robust CPU-based power modelling methodology that performs an automatic search for explanatory events derived from performance counters. A 64-bit Tegra TX1 SoC is configured with DVFS enabled, and multiple CUDA benchmarks are used to train and test models optimized for each frequency and voltage point. These optimized models are then compared with a simpler unified model that uses a single set of model coefficients for all frequency and voltage points of interest. To obtain this unified model, a number of experiments are conducted to extract information on idle, clock and static power, so that power usage can be derived from a single reference equation. The results show that the unified model offers competitive accuracy, with an average error of 5% using four explanatory variables on the test data set, and that it correctly predicts the impact of voltage, frequency and temperature on power consumption. This model could be used to replace direct power measurements when these are not available due to hardware limitations, or for worst-case analysis in emulation platforms.
Automated Precision Tuning in Activity Classification Systems: A Case Study
Nicola Fossati, Daniele Cattaneo, M. Chiari, Stefano Cherubin, G. Agosta
DOI: 10.1145/3381427.3381432 (published 2020-01-21)
Greater availability and lower production costs make wearable IoT platforms perfect candidates for continuously monitoring people at risk, such as the elderly. These platforms, combined with artificial intelligence algorithms, can be exploited to detect and monitor people's activities, in particular potentially harmful situations such as falling. However, wearable devices have limited computational power and battery life. We optimize a situation-recognition application via the well-known practice of precision tuning, using a dedicated state-of-the-art toolchain. After the optimization we evaluate how the reduced-precision version better fits the use case of resource-limited platforms such as wearable devices. In particular, we achieve a speedup of over 500% in execution time and consume about 6 times less energy to carry out the classification.
An OpenMP Parallel Genetic Algorithm for Design Space Exploration of Heterogeneous Multi-processor Embedded Systems
V. Muttillo, Paolo Giammatteo, Giuseppe Fiorilli, L. Pomante
DOI: 10.1145/3381427.3381431 (published 2020-01-21)
Heterogeneous multiprocessor platforms are becoming widespread in the embedded system domain, mainly because of the opportunity to improve timing performance and to minimize energy/power consumption and costs. Therefore, when using such platforms, it is important to adopt a Design Space Exploration (DSE) strategy that considers trade-offs among different objectives. Existing DSE approaches are generally based on evolutionary algorithms that solve Multi-Objective Optimization Problems (MOOPs) by minimizing a linear combination of weighted cost functions (i.e., the Weighted Sum Method, WSM). The main issue is then to reduce execution time while improving the performance of the evolutionary algorithm, introducing strategies that lead to better solutions. Code parallelization is one of the most common approaches in this field, but no standard method has emerged, since many different aspects can affect performance. This approach exploits parallel and distributed processing elements to implement evolutionary algorithms; when genetic algorithms are used, the result is referred to as a Parallel Genetic Algorithm (PGA). In this context, this paper focuses on DSE for heterogeneous multi-processor embedded systems and introduces an improvement that reduces execution time by using a parallel programming language (i.e., OpenMP) inside the main genetic algorithm, while trying to reach better partitioning solutions. The description of the adopted DSE activities and of the OpenMP implementation, validated by means of a case study, represents the core of the paper.
Sparse Matrix-Dense Matrix Multiplication on Heterogeneous CPU+FPGA Embedded System
Mohammad Hosseinabady, J. Núñez-Yáñez
DOI: 10.1145/3381427.3381428 (published 2020-01-21)
Embedded intelligence is becoming the primary driver for new applications in industry, healthcare, and automotive, to name a few. The main characteristics of these applications are high computational demand, real-time interaction with the environment, security, low power consumption, and local autonomy, among others. To address these diverse characteristics, researchers have proposed heterogeneous multicore embedded systems comprising CPUs, GPUs, FPGAs, and ASICs. While each computing element provides a unique capability that enables one of the application characteristics, making these processing cores collaborate on a single application to obtain maximum performance is a crucial challenge. This paper considers the collaborative use of a multicore CPU and an FPGA in a heterogeneous embedded system to improve the performance of sparse matrix operations, which are essential for reducing inference complexity in machine learning, especially in deep convolutional neural networks. Experimental results show that collaborative execution of sparse-matrix-dense-matrix multiplication on the Xilinx Zynq MPSoC, a heterogeneous CPU+FPGA embedded system, can improve performance by up to 42% compared with using the FPGA alone as an accelerator.
Fault-Tolerant Online Scheduling Algorithms for CubeSats
Petr Dobiáš, E. Casseau, O. Sinnen
DOI: 10.1145/3381427.3381430 (published 2020-01-21)
CubeSats are small satellites operating in the harsh space environment. In order to ensure correct on-board functionality despite faults, fault-tolerant techniques that take spatial, timing and energy constraints into account should be considered. This paper presents a software-level solution that takes advantage of the several processors available on board. Two online scheduling algorithms are introduced and evaluated. The results show their performance and the trade-off between rejection rate and energy consumption. Finally, for the algorithm that schedules all tasks as aperiodic, the ordering policies achieving a low rejection rate are "Earliest Deadline" and "Earliest Arrival Time"; for the algorithm that treats arriving tasks as either aperiodic or periodic, the "Minimum Slack" ordering policy provides reasonable results.
{"title":"Proceedings of the 11th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures / 9th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms","authors":"","doi":"10.1145/3381427","DOIUrl":"https://doi.org/10.1145/3381427","url":null,"abstract":"","PeriodicalId":38836,"journal":{"name":"Meta: Avaliacao","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74102113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FlashFreeze: low-overhead JavaScript instrumentation for function serialization
Jonathan Van der Cruysse, L. Hoste, W. V. Raemdonck
DOI: 10.1145/3358502.3361268 (published 2019-10-20)
Object serialization is important to a variety of applications, including session migration and distributed computing. A general JavaScript object serializer must support function serialization, as functions are first-class objects. However, JavaScript offers no built-in function serialization and limits custom serializers by exposing no meta operator to query a function's captured variables. Code instrumentation can expose captured variables, but state-of-the-art instrumentation techniques introduce high overheads, vary in supported syntax and/or use complex (de)serialization algorithms. We introduce FlashFreeze, an instrumentation technique based on capture lists. FlashFreeze achieves a tiny run-time overhead: an Octane score reduction of 3%, compared to 76% for the state-of-the-art ThingsMigrate tool and 1% for the work-in-progress FSM tool. FlashFreeze supports all self-contained ECMAScript 5 programs except for specific uses of eval, with, and source code inspection. FlashFreeze's construction gives rise to simple (de)serialization algorithms.
Ambiguous, informal, and unsound: metaprogramming for naturalness
Toni Mattis, Patrick Rein, R. Hirschfeld
DOI: 10.1145/3358502.3361270 (published 2019-10-20)
Program code needs to be understood by both machines and programmers. While the goal of executing programs requires the unambiguity of a formal language, programmers use natural language within these formal constraints to explain implemented concepts to each other. This so-called naturalness, the property of programs to resemble human communication, has motivated many statistical and machine learning (ML) approaches aimed at improving software engineering activities. The metaprogramming facilities of most programming environments model the formal elements of a program (meta-objects). If ML is used to support engineering or analysis tasks, complex infrastructure is needed to bridge the gap between meta-objects and ML models, changes are not reflected in the ML model, and mapping an ML output back into the program's meta-object domain is laborious. In this work, we propose to extend metaprogramming facilities to give tool developers access to the representations of program elements within an exchangeable ML model. We demonstrate the usefulness of this abstraction in two case studies on test prioritization and refactoring. We conclude that aligning ML representations with the program's formal structure lowers the entry barrier for exploiting statistical properties in tool development.
From definitional interpreter to symbolic executor
Adrian D. Mensing, H. V. Antwerpen, Casper Bach Poulsen, E. Visser
DOI: 10.1145/3358502.3361269 (published 2019-10-20)
Symbolic execution is a technique for automatic software validation and verification. New symbolic executors regularly appear for both existing and new languages, and such symbolic executors are generally (re)implemented by hand each time a new language is to be supported. We propose to automatically generate symbolic executors from language definitions, and present a technique for mechanically (but, as yet, manually) deriving a symbolic executor from a definitional interpreter. The idea is that language designers define their language as a monadic definitional interpreter, where the monad of the interpreter defines the meaning of branch points. Developing a symbolic executor for a language is then a matter of changing the monadic interpretation of branch points. In this paper, we illustrate the technique on a language with recursive functions and pattern matching, and use the derived symbolic executor to automatically generate test cases for definitional interpreters implemented in our defined language.