W. V. Teijlingen, R. V. Leuken, C. Galuzzi, B. Kienhuis
We can significantly reduce the time required to realize designs if it is possible to find limits to the performance of an embedded system based solely on high-level system specifications. For that purpose, we present the cprof profiler, which determines the number of clock cycles needed to execute a C program in hardware. The cprof tool uses the Clang compiler front-end to parse C programs and to produce instrumented source code for profiling. Using cprof, we determine a lower and an upper bound for all 29 cases of the PolyBench/C benchmark suite. The two bounds are obtained from absolute performance estimations, which assume that all statements are mapped onto the same processing resource, and from unbounded performance estimations, which assume unlimited resources. We also compared the clock cycles found by cprof with RTL implementations for all 29 PolyBench/C cases and found that cprof determines the correct number of clock cycles with 1.2% accuracy. It does this in a fraction of the time needed for a full RTL simulation.
{"title":"Determining Performance Boundaries on High-Level System Specifications","authors":"W. V. Teijlingen, R. V. Leuken, C. Galuzzi, B. Kienhuis","doi":"10.1145/2906363.2906386","DOIUrl":"https://doi.org/10.1145/2906363.2906386","url":null,"abstract":"We can significantly reduce the time required to realize designs if it is possible to find limits to the performance of an embedded system, solely based on high-level system specifications. For that purpose, we present in this paper the cprof profiler, which determines the number of clock cycles needed to execute a C-program in hardware. The cprof tool is based on the Clang compiler front-end to parse C-programs and to produce instrumented source code for the profiling. Using cprof, we determine a lower and upper bound limit for all 29 cases of the PolyBench/C benchmark suite. The lower and upper bound are determined using the absolute performance estimations assuming all statement are mapped onto the same processing resource and unbounded performance estimations assuming unlimited resources. We also compared the clock cycles found by cprof with RTL implementations for all 29 Polybench/C cases and found that cprof determines with 1.2% accuracy the correct number of clock cycles. It does this in a fraction of the time compared to the time needed to do a full RTL simulation.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126537545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
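The two estimation modes described in the cprof abstract can be illustrated with a minimal sketch. It is not the cprof implementation: the function name `cycle_bounds`, the operation names, and the latencies are hypothetical, and the program is assumed to be given as an acyclic dependency graph of operations. Mapping every operation onto one resource makes the latencies add up, while unlimited resources leave only the longest dependency chain.

```python
from typing import Dict, List, Tuple


def cycle_bounds(latency: Dict[str, int],
                 deps: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (unbounded_cycles, sequential_cycles) for an acyclic operation graph.

    latency maps each operation to its cycle count (hypothetical values);
    deps maps an operation to the operations whose results it consumes.
    """
    # One shared resource: operations run one after another, so latencies add.
    sequential = sum(latency.values())

    # Unlimited resources: an operation starts once all of its inputs are ready.
    finish: Dict[str, int] = {}

    def finish_time(op: str) -> int:
        if op not in finish:
            start = max((finish_time(p) for p in deps.get(op, [])), default=0)
            finish[op] = start + latency[op]
        return finish[op]

    unbounded = max((finish_time(op) for op in latency), default=0)
    return unbounded, sequential


# Hypothetical statement d = (a * b) + (c * e):
# two 3-cycle multiplies feeding one 1-cycle add.
example_latency = {"mul1": 3, "mul2": 3, "add": 1}
example_deps = {"add": ["mul1", "mul2"]}
print(cycle_bounds(example_latency, example_deps))  # → (4, 7)
```

In this toy case the cycle count of any resource-constrained implementation would fall between 4 and 7, which is the kind of bracket the abstract reports per benchmark.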
Lin Li, Philipp Wagner, Ramesh Ramaswamy, A. Mayer, Thomas Wild, A. Herkersdorf
As the complexity of multicore SoCs increases, more potential system issues arise. Hardware-related configuration issues are becoming more complicated owing to the introduction of more cores and various complex peripherals. Given the complexity of multicore programming, consulting the main source of guidance, i.e., the user manual, is not an efficient way to identify such problems. Improper hardware-related configurations can lead to either functional or performance issues, some of which are subtle and hard to detect. Therefore, a rule-based validation methodology is proposed to deal with hardware-related configuration issues in an efficient and reliable way. Hardware trace is applied in this methodology to detect issues even before symptoms appear. The method directly observes the register accesses and detects bugs based on trace data. It is independent of the application as long as the application runs on the given platform, so the same implementation of the method can be applied to any application on that platform. In this paper, an initial proof-of-concept for the proposed methodology has been implemented and demonstrated on the Infineon TC29 device.
{"title":"A Rule-based Methodology for Hardware Configuration Validation in Embedded Systems","authors":"Lin Li, Philipp Wagner, Ramesh Ramaswamy, A. Mayer, Thomas Wild, A. Herkersdorf","doi":"10.1145/2906363.2906377","DOIUrl":"https://doi.org/10.1145/2906363.2906377","url":null,"abstract":"As the complexity of multicore SoCs increases, more potential system issues are arising. Hardware-related configuration issues are becoming more complicated owing to the introduction of more cores and various complex peripherals. Considering the complexity of multicore programming, consultation of the main source of guidance, i.e. the user manual, is not an efficient approach to identify such problems. Improper hardware-related configurations could lead to either functional or performance issues. Some of these issues are even subtle and hard to detect. Therefore, a rule-based validation methodology is proposed to deal with hardware-related configuration issues in an efficient and reliable way. Hardware trace is applied in this methodology to detect issues even before symptoms appear. The method directly observes the register accesses and detects bugs based on trace data. It is independent of the application as long as they are run on the given platform, which means the same method implementation could be applied to any applications on the same platform. 
In this paper, an initial proof-of-concept for the proposed methodology has been implemented and demonstrated on the Infineon TC29 device.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122137018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
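As a rough illustration of what a rule over traced register accesses might look like, the sketch below checks one hypothetical rule: a guarded register must not be written before an enable bit has been set. The rule, the trace format, and the register names `CLK_EN` and `TIMER_CTRL` are assumptions made for this example. They are not taken from the paper or from the Infineon TC29 documentation.

```python
from typing import Iterable, List, Set, Tuple

# One traced access: (kind, register name, value); kind is "R" or "W".
Access = Tuple[str, str, int]


def check_enable_before_use(trace: Iterable[Access],
                            enable_reg: str,
                            enable_mask: int,
                            guarded_regs: Set[str]) -> List[Tuple[int, str]]:
    """Return (trace index, register) for each write to a guarded register
    that happens while the enable bit is clear."""
    enabled = False
    violations: List[Tuple[int, str]] = []
    for index, (kind, reg, value) in enumerate(trace):
        if kind == "W" and reg == enable_reg:
            # Track the current state of the enable bit.
            enabled = bool(value & enable_mask)
        elif kind == "W" and reg in guarded_regs and not enabled:
            violations.append((index, reg))
    return violations


# Hypothetical trace: the timer is configured once before its clock is enabled.
example_trace = [
    ("W", "TIMER_CTRL", 0x1),
    ("W", "CLK_EN", 0x4),
    ("W", "TIMER_CTRL", 0x1),
]
print(check_enable_before_use(example_trace, "CLK_EN", 0x4, {"TIMER_CTRL"}))
# → [(0, 'TIMER_CTRL')]
```

Because the check looks only at register accesses, the same rule can be run against the trace of any application on the platform, which matches the application independence the abstract claims.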
Scheduling of real-time applications is an important research topic. We consider a large-scale application consisting of 100 to 1000 tasks with inter-task communications, which can be represented by a task graph. For scheduling such applications, previous research has shown that the time-triggered scheduling approach can utilize real-time platforms effectively. However, the time-triggered scheduling approach only supports periodically activated tasks. Sporadic (aperiodic) tasks, which are also common in industrial applications, require additional treatment in time-triggered approaches. In this paper, we present a method to handle sporadic tasks by shifting the time-triggered schedule. This method improves the responsiveness of the real-time sporadic tasks, while the schedule of the time-triggered tasks remains feasible. We define a time-triggered server to handle sporadic events and reserve time slots to ensure a safe recovery of the delayed time-triggered schedule. If a sporadic task arrives, it starts its execution during the time-triggered server slot and the current time-triggered schedule is shifted. This paper provides the feasibility analysis for the time-triggered and the sporadic tasks under this slot shifting method. We determine time-triggered scheduling parameters to maximize the performance of the time-triggered server. Experiments confirm that our slot shifting approach reaches a higher system utilization.
{"title":"Sporadic Task Handling in Time-Triggered Systems","authors":"Matthias Freier, Jian-Jia Chen","doi":"10.1145/2906363.2906383","DOIUrl":"https://doi.org/10.1145/2906363.2906383","url":null,"abstract":"Scheduling of real-time applications is an important research topic. We consider a large-scale application consisting of 100--1000 tasks with inter-task communications, which can be represented by a task graph. For scheduling these applications, previous research results have shown that the time-triggered scheduling approach is capable to effectively utilize real-time platforms. However, the time-triggered scheduling approach only supports periodically activated tasks. Sporadic (aperiodic) tasks, which are also common in industrial applications, require additional treatments in time-triggered approaches. In this paper, we present a method to handle the sporadic tasks (that are not periodic) by shifting the time-triggered schedule. This method improves the responsiveness of the real-time sporadic tasks, whereas the schedule of the time-triggered tasks remains feasible. We define a time-triggered server to handle sporadic events and reserve time slots to ensure a safe recovery of the delayed time-triggered schedule. If a sporadic task arrives, this task starts its execution during the time-triggered server slot and the current time-triggered schedule is shifted. This paper provides the feasibility analysis for the time-triggered and the sporadic tasks under this slot shifting method. We determine time-triggered scheduling parameters to maximize the performance of the time-triggered server. 
Experiments confirm higher reachable system utilization by using our slot shifting approach.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130904573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Programming stream processing multiprocessor systems is a challenging task, especially if there are real-time requirements. Therefore, it is desirable to use formal models and real-time analysis techniques. However, the classical periodic task model does not match well with stream processing applications, which results in suboptimal designs. In this talk, we show that data-driven execution of stream processing applications improves the robustness against faulty workload assumptions. Using the earlier-the-better refinement theory, practically useful deterministic timed-dataflow analysis models of these applications can be created. Strong analytical properties are obtained by reserving resources in the multiprocessor systems. Compilation tools can hide the modelling effort from the programmers of the multiprocessor systems. Future cyber-physical systems can benefit from the higher level of non-determinism that is supported by the presented timed-dataflow analysis techniques.
{"title":"From dataflow analysis basics to the programming of ASICs","authors":"M. Bekooij","doi":"10.1145/2906363.2930673","DOIUrl":"https://doi.org/10.1145/2906363.2930673","url":null,"abstract":"Programming stream processing multiprocessor systems is a challenging task especially if there are real-time requirements. Therefore it is desirable to use formal models and real-time analysis techniques. However the classical periodic task-model does not match well with stream processing applications which results in suboptimal designs. In this talk we show that data-driven execution of stream processing application improves the robustness against faulty workload assumptions. Using the earlier-the-better-refinement theory practically useful deterministic timed-dataflow analysis models can be created of these applications. Strong analytical properties are obtained by reservation of resources in the multiprocessor systems. Compilation tools can hide the modelling effort for the programmers of the multiprocessor systems. Future cyber-physical systems can benefit from the higher level of non-determinism that is supported by the presented timed-dataflow analysis techniques.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133861578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Debugging reactive programs requires providing a lot of inputs at each reaction step. Moreover, because a reactive system reacts to an environment it tries to control, providing realistic inputs can be hard. The same considerations apply to automatic testing. This work takes advantage of previous work on automated testing of reactive programs that closes this feedback loop. This article demonstrates how to opportunistically implement a debugging command interpreter by taking advantage of an existing (OCaml) toplevel Read-Eval-Print Loop (REPL). It then shows how a small kernel is enough to build a full-featured debugger with little effort. The given examples provide a tutorial for end users who wish to write their own debugging primitives, fitted to their needs, or to tune existing ones. An orthogonal contribution of this article is to present an efficient way to implement the debugger coroutining using continuations. The Reactive programs DeBuGger (RDBG) prototype aims at being versatile and general enough to deal with any reactive language. We have experimented with it on two synchronous programming languages: Lustre and Lutin.
{"title":"RDBG: a Reactive Programs Extensible Debugger","authors":"Erwan Jahier","doi":"10.1145/2906363.2906372","DOIUrl":"https://doi.org/10.1145/2906363.2906372","url":null,"abstract":"Debugging reactive programs requires to provide a lot of inputs -- at each reaction step. Moreover, because a reactive system reacts to an environment it tries to control, providing realistic inputs can be hard. The same considerations apply for automatic testing. This work take advantage on previous work on automated testing of reactive programs that close this feedback loop. This article demonstrates how to implement opportunistically such a debugging commands interpreter by taking advantage of an existing (ocaml) toplevel Read-Eval-Print Loop (REPL). Then it shows how a small kernel is enough to build a full-featured debugger with little effort. The given examples provide a tutorial for end-users that wish to write their own debugging primitives, fitting to their needs, or to tune existing ones. An orthogonal contribution of this article is to present an efficient way to implement the debugger coroutining using continuations. The Reactive programs DeBuGger (RDBG) prototype aims at being versatile and general enough to be able to deal with any reactive languages. We have experimented it on 2 synchronous programming: Lustre and Lutin.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127298495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peter Koek, Stefan J. Geuns, J. Hausmans, H. Corporaal, M. Bekooij
Real-time stream processing applications, such as Software Defined Radio applications, are often executed concurrently on multiprocessor systems. A unified data flow model and analysis method have been proposed that can be used to simultaneously determine the amount of pipeline and coarse-grained data parallelism required to meet the temporal constraints of such applications. However, this unified model is only defined for Synchronous Data Flow (SDF) graphs. Defining a unified model for a more expressive model such as Cyclo-Static Data Flow (CSDF) is not possible, because auto-concurrency can cause a time-dependent order of tokens and dependencies. This paper introduces the Cyclo-Static Data Flow with Auto-concurrency (CSDFa) model. In CSDFa, tokens have indices and the consumption order of tokens is static and time-independent. This allows expressing and trading off pipeline and coarse-grained data parallelism in a single, unified model. Furthermore, we introduce a new type of circular buffer that implements the same static order as is used by the CSDFa model. The overhead of operations on this buffer is independent of the amount of auto-concurrency, which corresponds to the constant firing durations in the CSDFa model. Exploiting the trade-off between data and pipeline parallelism with the CSDFa model is demonstrated with a part of an FMCW radar processing pipeline. We show that the CSDFa model enables optimizing the balance between processing units and memory, resulting in a significant reduction of silicon area. Additionally, it is shown that reducing the maximum allowed latency increases the minimum required amount of data parallelism by up to a factor of 16.
{"title":"CSDFa: A Model for Exploiting the Trade-Off between Data and Pipeline Parallelism","authors":"Peter Koek, Stefan J. Geuns, J. Hausmans, H. Corporaal, M. Bekooij","doi":"10.1145/2906363.2906364","DOIUrl":"https://doi.org/10.1145/2906363.2906364","url":null,"abstract":"Real-time stream processing applications, such as Software Defined Radio applications, are often executed concurrently on multiprocessor systems. A unified data flow model and analysis method have been proposed that can be used to simultaneously determine the amount of pipeline and coarse-grained data parallelism required to meet the temporal constraints of such applications. However, this unified model is only defined for Synchronous Data Flow (SDF) graphs. Defining a unified model for a more expressive model such as Cyclo-Static Data Flow (CSDF) is not possible, because auto-concurrency can cause a time-dependent order of tokens and dependencies. This paper introduces the Cyclo-Static Data Flow with Auto-concurrency (CSDFa) model. In CSDFa, tokens have indices and the consumption order of tokens is static and time-independent. This allows expressing and trading off pipeline and coarse-grained data parallelism in a single, unified model. Furthermore, we introduce a new type of circular buffer that implements the same static order as is used by the CSDFa model. The overhead of operations on this buffer is independent of the amount of auto-concurrency, which corresponds to the constant firing durations in the CSDFa model. Exploiting the trade-off between data and pipeline parallelism with the CSDFa model is demonstrated with a part of a FMCW radar processing pipeline. We show that the CSDFa model enables optimizing the balance between processing units and memory, resulting in a significant reduction of silicon area. 
Additionally, it is shown that reducing the maximum allowed latency increases the minimum required amount of data parallelism by up to a factor of 16.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116236285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Kuiper, Stefan J. Geuns, J. Hausmans, M. Bekooij
Modal real-time stream processing applications often contain cyclic dependencies and are typically executed on multiprocessor systems with processor sharing. Most real-time operating system kernels for these systems support Static Priority Pre-emptive (SPP) scheduling; however, there is currently no suitable temporal analysis technique available for this type of system. In this paper, we present a compositional temporal analysis approach for modal and cyclic stream processing applications executed on SPP-scheduled multiprocessor systems. In this approach, locks and barriers are added such that the temporal behavior of modes can be characterized independently. As a result, the composition of modes does not change their characterization. This enables the use of an existing dataflow analysis technique, based on the Structured Variable-Rate Phased Dataflow (SVPDF) model, to determine the worst-case temporal behavior. The SVPDF model and the parallel implementation, including locks and barriers, are generated by a multiprocessor compiler. The applicability of the analysis approach is demonstrated using a WLAN 802.11p application. Conditions under which pipelined execution can be achieved are identified. The analysis results are verified with a dataflow simulator that supports sharing of resources.
{"title":"Compositional Temporal Analysis Method for Fixed Priority Pre-emptive Scheduled Modal Stream Processing Applications","authors":"G. Kuiper, Stefan J. Geuns, J. Hausmans, M. Bekooij","doi":"10.1145/2906363.2906375","DOIUrl":"https://doi.org/10.1145/2906363.2906375","url":null,"abstract":"Modal real-time stream processing applications often contain cyclic dependencies and are typically executed on multiprocessor systems with processor sharing. Most real-time operating system kernels for these systems support Static Priority Pre-emptive (SPP) scheduling, however there is currently no suitable temporal analysis technique available for this type of systems. In this paper, we present a compositional temporal analysis approach for modal and cyclic stream processing applications executed on SPP scheduled multiprocessor systems. In this approach, locks and barriers are added such that the temporal behavior of modes can be characterized independently. As a result, the composition of modes does not change their characterization. This enables the use of an existing Structured Variable-Rate Phased Dataflow (SVPDF) model based dataflow analysis technique to determine the worst-case temporal behavior. The SVPDF model and the parallel implementation including locks and barriers are generated by a multiprocessor compiler. The applicability of the analysis approach is demonstrated using a WLAN 802.11p application. Conditions under which pipelined execution can be achieved are identified. 
The analysis results are verified with a dataflow simulator that supports sharing of resources.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126350743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In an ideal top-down system design flow, graphical diagrams are designed before an executable specification in a System Level Description Language (SLDL) is derived. Such initial charts typically also serve as visual documentation of the textual specification and aid in maintaining the model. In the absence of graphical charts, e.g., in the case of legacy or third-party code, a textual SLDL model is hard to comprehend for any designer unfamiliar with it. Here, we propose to automatically extract graphical charts from given SystemC code to ease the understanding of the source code with a visual representation. Specifically, we extract the communication flow between the threads from the design model by use of an automatic SystemC compiler infrastructure that statically analyzes the code and generates custom Thread Communication Graphs (TCGs) similar to message sequence charts. Our experimental results on embedded applications demonstrate that our novel static analysis can quickly extract accurate TCGs that are highly useful for designers in becoming familiar with new source code.
{"title":"Automatic Generation of Thread Communication Graphs from SystemC Source Code","authors":"T. Schmidt, Guantao Liu, R. Dömer","doi":"10.1145/2906363.2906365","DOIUrl":"https://doi.org/10.1145/2906363.2906365","url":null,"abstract":"In an ideal top-down system design flow, graphical diagrams are designed before an executable specification in a System Level Description Language (SLDL) is derived. Such initial charts typically also serve as visual documentation of the textual specification and aid in maintaining the model. In the absence of graphical charts, e.g. in case of legacy or 3rd party code, a textual SLDL model is hard to comprehend for any unfamiliar designer. Here, we propose to automatically extract graphical charts from given SystemC code to ease the understanding of the source code with a visual representation. Specifically, we extract the communication flow between the threads from the design model by use of an automatic SystemC compiler infrastructure that statically analyzes the code and generates custom Thread Communication Graphs (TCG) similar to message sequence charts. Our experimental results on embedded applications demonstrate that our novel static analysis can quickly extract accurate TCG that are highly useful for designers in becoming familiar with new source code.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128269865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PyPy is a widely known virtual machine for the Python programming language. PyPy itself is implemented in the statically typed subset of Python called RPython. RPython includes a tracing Just-In-Time (JIT) compiler and is capable of generating the compiler for a language from the specification of the interpreter for that language. In PyPy 4.0.0 we extended the tracing JIT compiler to support vectorization of loops and to emit code for the SSE4 vector operations of the x86 instruction set. This article presents the details of the new vectorizer of PyPy. The vectorizer uses a loop unrolling approach to vectorization. It has been designed for efficient compilation, as the compilation is done during the execution of the application. The scientific library NumPy introduced arrays which are homogeneous, of primitive type, and contiguous in memory. These kinds of arrays are used to avoid the problems with dynamic typing. Our contribution to PyPy's new vectorizer supports scalar and constant expansion, accumulator splitting for reductions, guard strengthening, and array bounds check removal. The empirical evaluation shows that the vectorizer can achieve speedups close to the theoretical optimum of the SSE4 instruction set.
{"title":"Vectorization in PyPy's Tracing Just-In-Time Compiler","authors":"Richard Plangger, A. Krall","doi":"10.1145/2906363.2906384","DOIUrl":"https://doi.org/10.1145/2906363.2906384","url":null,"abstract":"PyPy is a widely known virtual machine for the Python programming language. PyPy itself is implemented in the statically typed subset of Python called RPython. RPython includes a tracing Just-In-Time (JIT) compiler and is capable of generating the compiler for a language from the specification of the interpreter for that language. In PyPy 4.0.0 we extended the tracing JIT compiler to support vectorization of loops and emit code for the SSE4 vector operations of the x86 instruction set. This article presents the details of the new vectorizer of PyPy. The vectorizer uses a loop unrolling approach to vectorization. It has been designed for efficient compilation as the compilation is done during the execution of the application. The scientific library NumPy introduced arrays which are homogeneous, primitive typed and contiguous in memory. These kind of arrays are used to avoid the problems with dynamic typing. Our contribution to PyPy's new vectorizer supports scalar and constant expansion, accumulator splitting for reductions, guard strengthening and array bounds check removal. 
The empirical evaluation shows that the vectorizer can gain speedups close to the theoretical optimum of the SSE4 instruction set.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123981073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
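Accumulator splitting for reductions, one of the techniques named in the PyPy abstract, can be sketched at the source level. This is plain Python that only mimics the shape of the transformation and is not PyPy's JIT code. The function names and the `lanes` parameter are hypothetical. The single accumulator is replaced by one partial accumulator per vector lane, and the partial results are combined after the loop.

```python
from typing import List, Sequence


def sum_scalar(values: Sequence[int]) -> int:
    """Reference reduction: a single accumulator, one element per iteration."""
    acc = 0
    for v in values:
        acc += v
    return acc


def sum_split(values: Sequence[int], lanes: int = 2) -> int:
    """Same reduction with the accumulator split into `lanes` partial sums."""
    partial: List[int] = [0] * lanes
    # Largest prefix length that is a multiple of the lane count.
    full = len(values) - len(values) % lanes
    for i in range(0, full, lanes):
        # In vectorized code this inner loop would be one packed add.
        for lane in range(lanes):
            partial[lane] += values[i + lane]
    # Horizontal reduction of the partial sums, then the scalar tail.
    total = 0
    for p in partial:
        total += p
    for v in values[full:]:
        total += v
    return total


data = list(range(10))
print(sum_scalar(data), sum_split(data, lanes=4))  # → 45 45
```

The example uses integers on purpose: for floating-point data, splitting the accumulator changes the order of additions and can therefore change the rounding of the result.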
This paper presents a cross-layer reliability modeling and optimization approach that leverages multiple software layers, such as the compiler and the run-time system, to improve overall reliability in the presence of unreliable or partially reliable hardware. To bridge the gap between hardware and software and achieve high efficiency, our technique incorporates knowledge from the hardware layers into reliability modeling and into the design of optimization techniques. We demonstrate how different software layers operate synergistically to achieve a high degree of reliability.
{"title":"Cross-Layer Reliability Modeling and Optimization: Compiler and Run-Time System Interactions","authors":"M. Shafique, Semeen Rehman, F. Kriebel, J. Henkel","doi":"10.1145/2906363.2911171","DOIUrl":"https://doi.org/10.1145/2906363.2911171","url":null,"abstract":"This paper presents a cross-layer reliability modeling and optimization approach that leverages multiple software layers like compiler and run-time system to improve the overall reliability considering unreliable or partially-reliable hardware. In order to bridge the gap between hardware and software to achieve high efficiency, our technique incorporates the knowledge from hardware layers during reliability modeling and design of optimization techniques. We demonstrate how different software layers operate synergistically to achieve a high degree of reliability.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124190746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}