Pub Date: 2015-12-17
DOI: 10.1109/ESTIMedia.2015.7351765
H. R. Mendis, L. Indrusiak, N. Audsley
Centralised management of distributed systems requires a significant amount of monitoring traffic to maintain an accurate view of the global system state. The communication overhead of these systems becomes a bottleneck as the number of processing elements in the network and the workload increase. State-of-the-art decentralised resource management techniques address this issue by allowing individual nodes or clusters of nodes to make decisions at runtime to manage the dynamic workload. The primary contribution of this paper is the use of a bio-inspired, distributed task remapping technique to manage dynamic workloads that decode multiple video streams. Our proposed technique has a low communication overhead and is used to reduce the cumulative job lateness of the video streams. Secondary contributions include several improvements to an existing cluster-based resource management approach to introduce awareness of task blocking and relocation distance. We evaluate these two remapping methods by comparing the improvement in job lateness, communication overhead and distribution of utilisation via simulation of several workload patterns.
Title: Bio-inspired distributed task remapping for multiple video stream decoding on homogeneous NoCs
Venue: 2015 13th IEEE Symposium on Embedded Systems For Real-time Multimedia (ESTIMedia)
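As a rough illustration of the lateness metric named in the abstract above, the sketch below sums job lateness over a handful of jobs. It assumes that lateness is the amount by which a job's completion time exceeds its deadline, and zero when the job finishes on time. The abstract does not define the metric, so the `Job` type, its fields and `cumulative_lateness` are hypothetical names, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical record for one decoding job; field names are assumptions.
    completion_time: float  # time at which the job finished
    deadline: float         # absolute deadline of the job


def lateness(job: Job) -> float:
    """Assumed definition: time past the deadline, never negative."""
    return max(0.0, job.completion_time - job.deadline)


def cumulative_lateness(jobs) -> float:
    """Sum of the per-job lateness over all jobs."""
    return sum(lateness(j) for j in jobs)


# Three made-up jobs: one on time, two late.
jobs = [Job(10.0, 12.0), Job(15.5, 14.0), Job(21.0, 20.0)]
total = cumulative_lateness(jobs)
```

Under this assumed definition a remapping decision improves the metric when it lowers `total`, regardless of how early the on-time jobs finish.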
Pub Date: 2015-12-17
DOI: 10.1109/ESTIMedia.2015.7351766
J. Falk, T. Schwarzer, M. Glaß, J. Teich, C. Zebelein, C. Haubelt
Signal processing algorithms, as found in multimedia applications, are often modeled by dynamic Data Flow Graphs (DFGs), especially when targeting heterogeneous multi-core platforms. However, there is often a mismatch between the fine granularity of the application and the coarse granularity of the platform. Tailoring the granularity of the DFG to a given platform by employing Quasi-Static Schedules (QSSs) promises performance gains by reducing dynamic scheduling overhead and enabling optimizations that target groups of actors instead of individual actors in isolation. Unfortunately, all approaches known from the literature to compute QSSs implicitly assume DFGs with unbounded First In First Out (FIFO) channels. In contrast, mappings of DFGs to multi-core platforms must adhere to FIFO channels with limited capacities. In this paper, we present a novel FIFO channel capacity adjustment algorithm that enables QSSs for DFGs with limited channel capacities, thus extending the scope of QSS refinements to general multi-core targets.
Title: Quasi-static scheduling of data flow graphs in the presence of limited channel capacities
Venue: 2015 13th IEEE Symposium on Embedded Systems For Real-time Multimedia (ESTIMedia)
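To make the problem with bounded channels concrete, the sketch below replays a fixed firing sequence on a small data flow graph and reports whether every firing finds enough input tokens and enough free output space. It only illustrates why a schedule that is valid for unbounded FIFOs can fail under limited capacities; it is not the capacity adjustment algorithm from the paper. The `Channel` type, `schedule_is_feasible`, and the rates and capacities used are all assumptions made for this example, and self-loop channels are not handled.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    # Hypothetical FIFO channel description (names are assumptions).
    src: str         # producing actor
    dst: str         # consuming actor
    produce: int     # tokens written per firing of src
    consume: int     # tokens read per firing of dst
    capacity: float  # maximum tokens held; math.inf models an unbounded FIFO


def schedule_is_feasible(channels, schedule):
    """Replay the firing sequence; fail on token underflow or FIFO overflow."""
    tokens = {ch: 0 for ch in channels}  # all channels start empty
    for actor in schedule:
        # Check every channel touched by this actor before changing any state.
        for ch in channels:
            if ch.dst == actor and tokens[ch] < ch.consume:
                return False  # not enough input tokens
            if ch.src == actor and tokens[ch] + ch.produce > ch.capacity:
                return False  # not enough free space in the FIFO
        # The firing is possible: consume inputs and produce outputs.
        for ch in channels:
            if ch.dst == actor:
                tokens[ch] -= ch.consume
            if ch.src == actor:
                tokens[ch] += ch.produce
    return True


# A produces 2 tokens per firing, B consumes 3 per firing.
unbounded = [Channel("A", "B", 2, 3, math.inf)]
bounded = [Channel("A", "B", 2, 3, 4)]

batch_first = ["A", "A", "A", "B", "B"]  # needs room for 6 tokens
interleaved = ["A", "A", "B", "A", "B"]  # never holds more than 4 tokens

ok_unbounded = schedule_is_feasible(unbounded, batch_first)
ok_bounded_batch = schedule_is_feasible(bounded, batch_first)
ok_bounded_interleaved = schedule_is_feasible(bounded, interleaved)
```

Both sequences fire A three times and B twice, but only the interleaved order fits in a FIFO of capacity 4, which is the kind of mismatch that a schedule computed under the unbounded-channel assumption can run into.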
Pub Date: 2015-12-17
DOI: 10.1109/ESTIMedia.2015.7351762
L. Bauer, Artjom Grudnitsky, Marvin Damschen, Srinivas Rao Kerekare, J. Henkel
Runtime reconfigurable processors provide a large degree of flexibility that allows them to dynamically adapt to different applications and requirements. They couple a standard processor with a runtime reconfigurable fabric (like an embedded FPGA) to offload computationally intensive kernels. In this paper we present the design and architecture of a flexible accelerator for floating point operations in stream processing applications. To integrate it in an existing reconfigurable processor, the different frequencies between the sequential processor (high frequency) and parallel accelerators (low frequencies) have to be managed. The results show 63.70× and 3.85× better performance-per-area efficiency when using our accelerator and the reconfigurable processor compared to the baseline processor with a soft-float implementation and a high-performance floating point unit, respectively.
Title: Floating point acceleration for stream processing applications in dynamically reconfigurable processors
Venue: 2015 13th IEEE Symposium on Embedded Systems For Real-time Multimedia (ESTIMedia)
Pub Date: 2015-12-17
DOI: 10.1109/ESTIMedia.2015.7351771
A. Lifa, P. Eles, Zebo Peng
The increasing computational demands of next generation multimedia systems require innovative optimization methods. Modern heterogeneous architectures bring together multiple general-purpose CPUs and multiple GPUs and FPGAs, in an attempt to answer the performance, energy-efficiency and flexibility requirements of today's complex multimedia applications. However, in order to leverage the advantages of such architectures, careful optimization is essential. In modern systems, more and more multimedia applications need real-time support (e.g. automotive systems that use image processing for active safety features). Real-time multi-mode systems are a good model for a wide range of applications that dynamically change their computational requirements over time. In this context, intelligent on-line resource management is needed, such that the heterogeneous resources are used in an energy-efficient manner, while meeting the real-time constraints. This paper proposes a resource manager that implements run-time policies to decide on-the-fly task admission and the mapping of active tasks to resources, such that the energy consumption of the system is minimized and all task deadlines are met.
Title: On-the-fly energy minimization for multi-mode real-time systems on heterogeneous platforms
Venue: 2015 13th IEEE Symposium on Embedded Systems For Real-time Multimedia (ESTIMedia)
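The mapping problem described in the abstract above can be illustrated with a toy exhaustive search: pick a resource for each task so that total energy is lowest while every deadline is met. This is only a sketch of the problem statement, not the run-time policies proposed in the paper, which the abstract does not detail. It assumes that all tasks are released at time 0, that tasks sharing a resource run back to back in deadline order, and that the task names, execution times and energy values are invented for the example.

```python
from itertools import product


def meets_deadlines(tasks, mapping):
    """Assumed model: tasks on one resource run sequentially, earliest deadline first."""
    for resource in set(mapping.values()):
        assigned = sorted(
            (name for name in mapping if mapping[name] == resource),
            key=lambda name: tasks[name]["deadline"],
        )
        finish = 0.0
        for name in assigned:
            finish += tasks[name]["cost"][resource][0]  # execution time
            if finish > tasks[name]["deadline"]:
                return False
    return True


def best_mapping(tasks, resources):
    """Exhaustively search all mappings; return (energy, mapping) or None."""
    names = list(tasks)
    best = None
    for choice in product(resources, repeat=len(names)):
        mapping = dict(zip(names, choice))
        # Skip mappings that place a task on a resource it cannot run on.
        if any(mapping[n] not in tasks[n]["cost"] for n in names):
            continue
        if not meets_deadlines(tasks, mapping):
            continue
        energy = sum(tasks[n]["cost"][mapping[n]][1] for n in names)
        if best is None or energy < best[0]:
            best = (energy, mapping)
    return best


# Hypothetical tasks: cost[resource] = (execution time, energy).
tasks = {
    "decode": {"deadline": 10.0, "cost": {"cpu": (8.0, 4.0), "gpu": (3.0, 6.0)}},
    "filter": {"deadline": 10.0, "cost": {"cpu": (6.0, 3.0), "gpu": (2.0, 4.5)}},
}
result = best_mapping(tasks, ["cpu", "gpu"])
```

Placing both tasks on the CPU would use the least energy but misses a deadline, so the search settles on a split mapping. Exhaustive search grows exponentially with the number of tasks, which is one reason an on-line manager would rely on cheaper policies instead.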
Pub Date: 2015-12-17
DOI: 10.1109/ESTIMedia.2015.7351775
Chen-Ying Hsieh, Jurn-Gyu Park, N. Dutt, Sung-Soo Lim
Modern mobile heterogeneous platforms have GPUs integrated with multicore processors to enable execution of high-end graphics-intensive games. However, these gaming applications consume significant power due to heavy utilization of CPU-GPU resources, which drains battery resources that are critical for mobile devices. While Dynamic Voltage and Frequency Scaling (DVFS) techniques have been exploited previously for dynamic power management, contemporary techniques do not fully exploit the memory access footprint of graphics-intensive gaming applications, missing opportunities for energy efficiency. In this paper, we propose, for the first time, a memory-aware cooperative CPU-GPU DVFS governor that considers both the memory access footprint and the CPU/GPU frequency to improve the energy efficiency of high-end mobile game workloads. Our experimental results show that the proposed game governor improves energy efficiency by 13% and 5% on average, with minor performance degradation, compared to default governors and state-of-the-art game governors, respectively.
Title: Memory-aware cooperative CPU-GPU DVFS governor for mobile games
Venue: 2015 13th IEEE Symposium on Embedded Systems For Real-time Multimedia (ESTIMedia)
Pub Date: 2015-10-08
DOI: 10.1109/ESTIMedia.2015.7351770
Yonghui Li, Hrishikesh Salunkhe, J. Bastos, Orlando Moreira, B. Akesson, K. Goossens
SDRAM is a shared resource in modern multi-core platforms executing multiple real-time (RT) streaming applications. It is crucial to analyze the minimum guaranteed SDRAM bandwidth to ensure that the requirements of the RT streaming applications are always satisfied. However, deriving the worst-case bandwidth (WCBW) is challenging because of the diverse memory traffic with variable transaction sizes. In fact, existing RT memory controllers either do not efficiently support variable transaction sizes or do not provide an analysis that tightly bounds the WCBW in their presence. We propose a new mode-controlled data-flow (MCDF) model to capture the command scheduling dependencies of memory transactions with variable sizes. The WCBW can be obtained by employing an existing tool to automatically analyze our MCDF model rather than using existing static analysis techniques, which, in contrast to our model, are hard to extend to cover different RT memory controllers. Moreover, the MCDF analysis can exploit static information about known transaction sequences provided by the applications or by the memory arbiter. Experimental results show that a 77% improvement in WCBW can be achieved compared to the case without known transaction sequences. In addition, the results demonstrate that the proposed MCDF model outperforms state-of-the-art analysis approaches and improves the WCBW by 22% without known transaction sequences.
Title: Mode-controlled data-flow modeling of real-time memory controllers
Venue: 2015 13th IEEE Symposium on Embedded Systems For Real-time Multimedia (ESTIMedia)