In general, embedded systems can be designed at different levels of abstraction, e.g., as pure hardware circuit designs, as bare-iron level programs (without an operating system), as programs based on a real-time operating system, and as models in a model-driven development flow. This paper focuses on Averest, a synchronous model-driven development tool. Using Averest, we describe how we consider and combine system descriptions at these four levels of abstraction. We discuss a case study targeting a distributed embedded system in which these different levels have been used.
{"title":"Generating hardware specific code at different abstraction levels using Averest","authors":"Omair Rafique, Manuel Gesell, K. Schneider","doi":"10.1145/2463596.2486154","DOIUrl":"https://doi.org/10.1145/2463596.2486154","url":null,"abstract":"In general, embedded systems can be designed at different levels of abstraction, e.g., as pure hardware circuit designs, as bare-iron level programs (without an operating system), as programs based on a real-time operating system, and as models of a model-driven development. This paper focuses on a synchronous model-driven development tool called Averest. Using Averest, we describe how we consider and combine system descriptions at the mentioned four levels of abstraction. We discuss a case study targeting a distributed embedded system where these different levels have been used.","PeriodicalId":344517,"journal":{"name":"M-SCOPES","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128602858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPUs have evolved into programmable, energy-efficient compute accelerators for massively parallel applications. Still, compute power is lost in many applications because cycles are spent on data movement and control instead of computations on actual data. Additional cycles can also be lost to pipeline stalls caused by long-latency operations. To improve performance and energy efficiency, we introduce GPU-CC: a reconfigurable GPU architecture with communicating cores. It is based on a contemporary GPU, which can still be used as such, but it can also reorganize the cores of the GPU into a reconfigurable network. In GPU-CC, data movement and control are implicit in the configuration of the communication network. Additionally, each core executes a fixed instruction, reducing instruction decode count and increasing energy efficiency. We show a large performance potential for GPU-CC, e.g., 1.9x and 2.4x for 3x3 and 5x5 convolution applications, respectively. The hardware cost of GPU-CC is mainly determined by the buffers in the added network, which amount to 12.4% of extra memory space.
{"title":"GPU-CC: a reconfigurable GPU architecture with communicating cores","authors":"Gert-Jan van den Braak, H. Corporaal","doi":"10.1145/2463596.2486153","DOIUrl":"https://doi.org/10.1145/2463596.2486153","url":null,"abstract":"GPUs have evolved to programmable, energy efficient compute accelerators for massively parallel applications. Still, compute power is lost in many applications because of cycles spent on data movement and control instead of computations on actual data. Additional cycles can be lost as well on pipeline stalls due to long latency operations. To improve performance and energy efficiency, we introduce GPU-CC: a reconfigurable GPU architecture with communicating cores. It is based on a contemporary GPU, which can still be used as such, but also has the ability to reorganize the cores of a GPU in a reconfigurable network. In GPU-CC data movement and control is implicit in the configuration of the communication network. Additionally each core executes a fixed instruction, reducing instruction decode count and increasing energy efficiency. We show a large performance potential for GPU-CC, e.g. 1.9x and 2.4x for a 3x3 and 5x5 convolution application. The hardware cost of GPU-CC is mainly determined by the buffers in the added network, which amounts to 12.4% of extra memory space.","PeriodicalId":344517,"journal":{"name":"M-SCOPES","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126892857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present an exact approach to the Simple Offset Assignment problem arising in the domain of address code generation for digital signal processors. It is based on transformations to weighted Hamiltonian cycle problems and integer linear programming. To the best of our knowledge, it is the first approach capable of solving all instances of the established OffsetStone benchmark set to optimality within reasonable time. It therefore enables the first evaluation of the quality of several heuristics relative to the optimum solutions. Further, using the same transformations, we present a novel improvement heuristic that provides a well-tunable trade-off between running time and solution quality.
{"title":"Solving the simple offset assignment problem as a traveling salesman","authors":"M. Jünger, Sven Mallach","doi":"10.1145/2463596.2463601","DOIUrl":"https://doi.org/10.1145/2463596.2463601","url":null,"abstract":"In this paper, we present an exact approach to the Simple Offset Assignment problem arising in the domain of address code generation for digital signal processors. It is based on transformations to weighted Hamiltonian cycle problems and integer linear programming. To the best of our knowledge, it is the first approach capable to solve all instances of the established OffsetStone benchmark set to optimality within reasonable time. It therefore enables the first evaluation of the quality of several heuristics relative to the optimum solutions. Further, using the same transformations, we present a novel improvement heuristic that provides a well-tunable trade-off between running time and solution quality.","PeriodicalId":344517,"journal":{"name":"M-SCOPES","volume":"202 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133935240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
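To make the Simple Offset Assignment objective concrete, the following is a minimal sketch under the usual textbook formulation: variables are placed at consecutive memory offsets, a step between neighbouring offsets is free (auto-increment or auto-decrement), and any other step costs one explicit address-register update. The function names `soa_cost` and `brute_force_soa` are hypothetical, and the exhaustive search is only a stand-in for tiny inputs; it is not the Hamiltonian cycle and integer linear programming approach described in the abstract.

```python
from itertools import permutations


def soa_cost(access_sequence, layout):
    """Count transitions that need an explicit address-register update.

    `layout` lists the variables in memory order; a transition between
    two accesses is free when their offsets differ by at most 1.
    """
    offset = {var: i for i, var in enumerate(layout)}
    return sum(
        1
        for a, b in zip(access_sequence, access_sequence[1:])
        if abs(offset[a] - offset[b]) > 1
    )


def brute_force_soa(access_sequence):
    """Try every layout (factorial time, illustration only)."""
    variables = sorted(set(access_sequence))
    best = min(
        permutations(variables),
        key=lambda layout: soa_cost(access_sequence, layout),
    )
    return list(best), soa_cost(access_sequence, best)


if __name__ == "__main__":
    sequence = ["a", "b", "c", "d", "a", "c"]
    layout, cost = brute_force_soa(sequence)
    print(layout, cost)
```

In this toy sequence there are five distinct transitions, and a layout of four variables has only three neighbouring pairs, so at least two transitions must pay for an update. Exhaustive search is only practical for a handful of variables, which is the gap an exact but scalable method is meant to close.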
As single-threaded performance is reaching its limits, the prevailing trend in multi-core and embedded MPSoC architectures is to provide an ever-increasing number of processing units. This convergence leads to shared concerns, like scalability and programmability. Exploiting such architectures poses tremendous challenges to application programmers and to compiler/runtime developers alike. Uncovering raw parallelism is often insufficient in and of itself: improving performance requires changing the code structure to harness complex parallel hardware and memory hierarchies; translating more processing units into effective performance gains involves a combination of target-specific optimizations, subtle concurrency concepts and non-deterministic algorithms. In this presentation, we examine the limitations of current von Neumann architectures and the impact on programmability of the drift from hardware-managed complexity to an increasing reliance on software solutions. We first propose OpenStream, a high-level data-flow programming model, as a pragmatic answer from the application programmer's perspective. Recognizing that the burden cannot be borne by either programmers or compilers alone, OpenStream is designed to strike a fair balance: programmers provide abstract information about their applications and leave the compiler and runtime system with the responsibility of lowering these abstractions to well-orchestrated threads and memory management. In the second part, we adopt the runtime developer's perspective and examine these impacts through the example of the implementation and proof of concurrent lock-free algorithms, a cornerstone of runtime system implementation that is critically important in the context of relaxed memory consistency models.
{"title":"OpenStream: a data-flow approach to solving the von Neumann bottlenecks","authors":"Antoniu Pop","doi":"10.1145/2463596.2486782","DOIUrl":"https://doi.org/10.1145/2463596.2486782","url":null,"abstract":"As single-threaded performance is reaching its limits, the prevailing trend in multi-core and embedded MPSoC architectures is to provide an ever increasing number of processing units. This convergence leads to shared concerns, like scalability and programmability. Exploiting such architectures poses tremendous challenges to application programmers and to compiler/runtime developers alike. Uncovering raw parallelism is often insufficient in and of itself: improving performance requires changing the code structure to harness complex parallel hardware and memory hierarchies; translating more processing units into effective performance gains involves a combination of target-specific optimizations, subtle concurrency concepts and non-deterministic algorithms. In this presentation, we examine the limitations of current, von Neumann architectures and the impact on programmability of the drift from hardware-managed complexity to an increasing reliance on software solutions. We first propose OpenStream, a high-level data-flow programming model, as a pragmatic answer from the application programmer's perspective. Recognizing that the burden cannot be borne by either programmers or compilers alone, OpenStream is designed to strike a fair balance: programmers provide abstract information about their applications and leave the compiler and runtime system with the responsibility of lowering these abstractions to well-orchestrated threads and memory management. In the second part, we adopt the runtime developer's perspective and examine these impacts through the example of the implementation and proof of concurrent lock-free algorithms, a cornerstone of runtime system implementation, critically important in the context of relaxed memory consistency models.","PeriodicalId":344517,"journal":{"name":"M-SCOPES","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129167519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Safety-critical Java (SCJ) is designed to enable development of applications that are amenable to certification under safety-critical standards. However, its shared-memory concurrency model causes several problems such as data races, deadlocks, and priority inversion. We therefore propose a dataflow design model of SCJ applications in which periodic and aperiodic tasks communicate only through lock-free channels. We provide the necessary tools that compute scheduling parameters of tasks (i.e., periods, phases, priorities, etc.) so that uniprocessor/multiprocessor preemptive fixed-priority schedulability is ensured and the throughput is maximized. Furthermore, the resulting schedule together with the computed channel sizes ensures underflow/overflow-free communications. The scheduling approach consists in constructing an abstract affine schedule of the dataflow graph and then concretizing it.
{"title":"Design of safety-critical Java level 1 applications using affine abstract clocks","authors":"A. Bouakaz, J. Talpin","doi":"10.1145/2463596.2463600","DOIUrl":"https://doi.org/10.1145/2463596.2463600","url":null,"abstract":"Safety-critical Java (SCJ) is designed to enable development of applications that are amenable to certification under safety-critical standards. However, its shared-memory concurrency model causes several problems such as data races, deadlocks, and priority inversion. We propose therefore a dataflow design model of SCJ applications in which periodic and aperiodic tasks communicate only through lock-free channels. We provide the necessary tools that compute scheduling parameters of tasks (i.e. periods, phases, priorities, etc) so that uniprocessor/multiprocessor preemptive fixed-priority schedulability is ensured and the throughput is maximized. Furthermore, the resulted schedule together with the computed channel sizes ensure underflow/overflow-free communications. The scheduling approach consists in constructing an abstract affine schedule of the dataflow graph and then concretizing it.","PeriodicalId":344517,"journal":{"name":"M-SCOPES","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115032824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
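The requirement of underflow/overflow-free communication can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not taken from the paper: one producer and one consumer, each strictly periodic with an integer period and phase in ticks, one token per firing, and the producer writing first when both fire on the same tick. The function name `channel_occupancy` is hypothetical. The code checks a given choice of parameters; it does not compute them, which is what the affine scheduling approach in the abstract is for.

```python
def channel_occupancy(producer, consumer, horizon):
    """Simulate one channel between two periodic tasks.

    `producer` and `consumer` are (period, phase) pairs in ticks.
    Returns (peak, underflow): the largest number of buffered tokens
    seen, and whether the consumer ever fired on an empty channel.
    """
    p_period, p_phase = producer
    c_period, c_phase = consumer
    tokens = 0
    peak = 0
    for t in range(horizon):
        # Assumption: on a shared tick the producer writes before the read.
        if t >= p_phase and (t - p_phase) % p_period == 0:
            tokens += 1
            peak = max(peak, tokens)
        if t >= c_phase and (t - c_phase) % c_period == 0:
            if tokens == 0:
                return peak, True  # read from an empty channel
            tokens -= 1
    return peak, False


if __name__ == "__main__":
    # Same rate, consumer delayed by one tick: a one-slot channel suffices.
    print(channel_occupancy((2, 0), (2, 1), horizon=20))
    # Consumer starts before the producer: immediate underflow.
    print(channel_occupancy((2, 1), (2, 0), horizon=20))
```

A channel sized to the reported peak cannot overflow within the simulated horizon, and a `False` underflow flag means every read found a token. With matching rates the phases alone decide both outcomes, which is why periods and phases have to be chosen together with the channel sizes.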