A. Nikitakis, Savvas Papaioannou, I. Papaefstathiou
One important challenge in the field of multimedia is the implementation of fast and detailed object detection and recognition systems. In particular, in current state-of-the-art mobile multimedia systems, it is highly desirable to detect and locate certain objects within a video frame in real time. In this paper, we present a novel FPGA-based embedded implementation of an efficient object recognition algorithm, the Receptive Field Cooccurrence Histograms (RFCH) algorithm. Our main focus was to increase its performance so that it can handle the object recognition tasks of today's sophisticated embedded multimedia systems while keeping its energy consumption very low. Our low-power embedded reconfigurable system is at least 15 times faster than the software implementation on a low-voltage high-end CPU, while consuming at least 60 times less energy. It is also 88 times more energy efficient than the recently introduced low-power multi-core Intel devices that are optimized for embedded systems.
Title: A novel low-power embedded object recognition system working at multi-frames per second (Extended abstract). Pub Date: 2012-10-01. DOI: 10.1145/2435227.2435229. Venue: 2012 IEEE 10th Symposium on Embedded Systems for Real-time Multimedia.
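The speed and energy figures quoted for the FPGA-based RFCH system can be combined into a rough power comparison. The following is only a back-of-the-envelope sketch: it assumes both lower bounds (15 times faster, 60 times less energy) are met exactly on the same workload, which the abstract does not state, and the symbols $\bar{P}$ (average power), $t$ (run time), and $E$ (energy) are introduced here for illustration.

```latex
\[
E = \bar{P}\,t
\quad\Longrightarrow\quad
\frac{\bar{P}_{\mathrm{CPU}}}{\bar{P}_{\mathrm{FPGA}}}
= \frac{E_{\mathrm{CPU}} / E_{\mathrm{FPGA}}}{t_{\mathrm{CPU}} / t_{\mathrm{FPGA}}}
= \frac{60}{15} = 4 .
\]
```

Under that assumption, the energy saving would come from a run time about 15 times shorter combined with an average power draw about 4 times lower.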
Pub Date: 2012-10-01. DOI: 10.1109/ESTIMedia.2012.6507027
D. Tetzlaff, S. Glesner
Recursion poses a severe problem for static optimizations because its execution frequency usually depends on runtime values and is therefore rarely predictable at compile time. As a consequence, optimization potential is sacrificed, since hot paths, where most of the execution time is spent and where optimization would be most beneficial, may go undiscovered. In this paper, we propose a machine-learning-based approach to statically predict the recursion frequency of functions for programs in real-world application domains, which can be used to guide various hot spot optimizations. Our experiments with 369 programs from 25 benchmark suites in different domains demonstrate that our approach is applicable to a wide range of programs with different behavior and yields more precise heuristics than those generated by pure static analyses. Moreover, our results provide valuable insights into recursive structures in general: when they appear and how deep they are.
Title: Static prediction of recursion frequency using machine learning to enable hot spot optimizations
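To illustrate why recursion frequency is hard to know at compile time, the small sketch below counts the calls made by a naive recursive Fibonacci function. It is not the prediction approach from the paper; the function name `fib_calls` is hypothetical and only shows that the number of recursive calls depends entirely on a runtime argument.

```python
def fib_calls(n):
    """Return (fib(n), number of calls made) for naive recursive Fibonacci."""
    if n < 2:
        return n, 1  # base case: one call, no further recursion
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    # This call itself plus all calls made by the two recursive branches.
    return a + b, calls_a + calls_b + 1


for n in (5, 10):
    value, calls = fib_calls(n)
    print(f"fib({n}) = {value}, calls = {calls}")
```

The same source code performs 15 calls for `n = 5` and 177 calls for `n = 10`, so a static predictor has to estimate the frequency without seeing the input.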
Pub Date: 2012-10-01. DOI: 10.1109/ESTIMedia.2012.6507024
Jianhua Li, Liang Shi, Qing'an Li, C. Xue, Yinlong Xu
Hybrid coherence protocols can simultaneously provide the scalability of directory protocols and the low-latency sharing-miss handling of snooping protocols. Unfortunately, how to adapt hybrid protocols at runtime is not well studied. This paper proposes Thread ProgrEss Aware Coherence Adaption (TEACA), which uses thread progress information as hints to adapt hybrid coherence protocols. Specifically, TEACA fuses memory system statistics to estimate the progress of threads. Based on the estimated thread progress, TEACA dynamically categorizes threads into leader threads and laggard threads. These categorization decisions are then leveraged for efficient coherence adaption in hybrid coherence protocols. A case study on a recently proposed hybrid protocol (PATCH [29]) shows that, with the hints from TEACA, the enhanced hybrid protocol outperforms its baseline in both application execution time and energy dissipation.
Title: TEACA: Thread ProgrEss Aware Coherence Adaption for hybrid coherence protocols
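The leader/laggard categorization described in the TEACA abstract can be pictured with a toy sketch. Everything specific here is an assumption made for illustration: the abstract does not say which memory system statistics are fused or how the threshold is chosen, so the score in `estimate_progress` (accesses per stall cycle) and the median cutoff in `categorize` are hypothetical stand-ins.

```python
from statistics import median


def estimate_progress(stats):
    """Hypothetical progress score: memory accesses per miss-stall cycle.

    A thread that stalls less on misses is assumed to be further ahead.
    """
    return stats["accesses"] / (1 + stats["miss_stall_cycles"])


def categorize(threads):
    """Label each thread 'leader' or 'laggard' relative to the median score."""
    scores = {tid: estimate_progress(s) for tid, s in threads.items()}
    cutoff = median(scores.values())
    return {
        tid: "leader" if score >= cutoff else "laggard"
        for tid, score in scores.items()
    }


sample = {
    0: {"accesses": 1000, "miss_stall_cycles": 99},
    1: {"accesses": 1000, "miss_stall_cycles": 999},
    2: {"accesses": 1000, "miss_stall_cycles": 499},
}
print(categorize(sample))
```

In this sample, thread 1 stalls the most and is labeled a laggard, while threads 0 and 2 are labeled leaders. A real protocol would then use such labels as hints when adapting coherence behavior.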
Pub Date: 2012-10-01. DOI: 10.1109/ESTIMedia.2012.6507034
Yi-Fan Chung, Yin-Tsung Lo, C. King
The growing number of multimedia applications on smartphones places ever more stringent demands on user experience. A key factor affecting user experience is the delay in launching applications, which shapes a user's perception of the responsiveness of both the phone and its multimedia applications.
Title: Enhancing user experiences by exploiting energy and launch delay tradeoff of mobile multimedia applications (Extended abstract)
Streaming applications often require a parallel Model of Computation (MoC) to specify their behavior and to facilitate mapping onto Multi-Processor System-on-Chip (MPSoC) platforms. The varied performance requirements and resource budgets of embedded systems call for an efficient design space exploration (DSE) approach that selects the best design from a design space consisting of a large number of design choices. However, existing DSE approaches explore a design space that includes only architecture and mapping alternatives for the initial application specification given by the application designer. In this paper, we first show that a design is often not optimal if alternative specifications of a given application are not taken into account. We further argue that the best alternative specification consists of only independent and load-balanced application tasks. Based on the Polyhedral Process Network (PPN) MoC, we present an approach to analyze an initial PPN and transform it into an alternative one that contains only independent processes, if possible. Finally, by prototyping real-life applications on both FPGA-based MPSoCs and desktop multi-core platforms, we demonstrate that mapping the alternative application specification results in a large performance gain compared to approaches in which alternative application specifications are not taken into account.
Title: Mapping of streaming applications considering alternative application specifications (Extended abstract). Authors: J. Zhai, Hristo Nikolov, T. Stefanov. Pub Date: 2012-10-01. DOI: 10.1145/2435227.2435230
Pub Date: 2012-10-01. DOI: 10.1109/ESTIMedia.2012.6507036
Ji Gu, T. Ishihara, Kyungsoo Lee
With the exponential increase of power consumption across processor generations, energy dissipation has become one of the most critical constraints in system design. Cache memories are usually the most energy-consuming components on the processor chip because they occupy a large share of the die and are accessed frequently. Furthermore, in step with the increased complexity of modern embedded applications, microprocessors are increasingly executing multitasking applications. In multitasking processors, the conventional L1 instruction cache (I-cache) is usually shared by multiple tasks and therefore suffers highly intensive read/write operations, which can consume even more energy than in a single-task system. This paper presents an energy-efficient shared multitasking loop instruction cache (SMLIC), which is designed to address the task-sharing and context-switch issues so that it can be used efficiently to reduce I-cache accesses and save energy in multitasking processors. Experiments on a set of multitasking applications demonstrate that the proposed SMLIC design can reduce I-cache accesses by 12% to 86% and instruction-supply energy consumption by 11% to 79% in multitasking systems, depending on the context-switch frequency.
Title: Loop instruction caching for energy-efficient embedded multitasking processors
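The dependence of loop-cache savings on context-switch frequency, mentioned in the SMLIC abstract, can be illustrated with a small simulation. This is a simplified sketch rather than the SMLIC design: it assumes a tiny buffer with least-recently-used replacement whose entries are tagged by a hypothetical `(task_id, pc)` pair, and it counts only the fetches that still have to reach the I-cache.

```python
from collections import OrderedDict


def icache_accesses(trace, buffer_entries):
    """Count fetches that reach the I-cache.

    trace: iterable of (task_id, pc) instruction fetches.
    buffer_entries: capacity of the small loop buffer (0 disables it).
    """
    buffer = OrderedDict()  # keys ordered from least to most recently used
    accesses = 0
    for key in trace:
        if key in buffer:
            buffer.move_to_end(key)  # hit: served without touching the I-cache
            continue
        accesses += 1  # miss: fetch from the I-cache
        if buffer_entries > 0:
            buffer[key] = None
            if len(buffer) > buffer_entries:
                buffer.popitem(last=False)  # evict the least recently used entry
    return accesses


def interleaved_loops(loop_len, iterations, quantum):
    """Two tasks each run a loop, switching context every `quantum` iterations."""
    trace = []
    for start in range(0, iterations, quantum):
        for task in (0, 1):
            for _ in range(min(quantum, iterations - start)):
                trace.extend((task, pc) for pc in range(loop_len))
    return trace


frequent = interleaved_loops(loop_len=4, iterations=10, quantum=2)
rare = interleaved_loops(loop_len=4, iterations=10, quantum=10)
print(icache_accesses(frequent, 0), icache_accesses(frequent, 4), icache_accesses(rare, 4))
```

With a 4-entry buffer that holds only one task's loop, the 80 baseline fetches drop to 40 when the tasks switch every 2 iterations and to 8 when they switch only once, which mirrors the abstract's point that the savings depend on how often context switches occur.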
Pub Date: 2012-10-01. DOI: 10.1109/ESTIMedia.2012.6507016
S. Bampi
Increasingly demanding, complex algorithms for multimedia systems and higher resolutions for multiview video hit power and memory walls in portable hardware. Silicon IC technology scaling is reaching two-dimensional limits, accompanied by an escalating technology cost wall. In this scenario, the severe costs of power density, circuit performance variability, and energy constraints call for new algorithm-to-architecture approaches. This talk will highlight the architecture and circuit techniques that will influence multimedia system architectures in the future. Design challenges and specific solutions that deal with energy dissipation in the case of multiview video are addressed. The presentation identifies the interactions among technology, design, architecture, and algorithms as drivers for new cross-layer optimizations in energy-constrained multimedia systems.
Title: Keynote: “Design space exploration and run-time resource management in the embedded multi-core era”
Pub Date: 2012-10-01. DOI: 10.1109/ESTIMedia.2012.6507031
Cheng-yan Yang, Yi-jui Wu, S. Liao
More than half a billion Android devices make up the world's most impactful open-source, real-time, interactive multimedia systems. Google introduced the Renderscript language and runtime in Android releases starting in 2011. Renderscript delivers performance and portability without sacrificing usability. However, it is difficult to reuse software written in existing compute languages such as OpenCL. We therefore developed the O2render system to enable OpenCL programs on Android devices. We analyze the fundamental differences between OpenCL and Renderscript and present the design of a translator between them based on the Low Level Virtual Machine (LLVM). We extend LLVM's frontend, Clang, and show that we achieve about the same performance in Renderscript with minimal translation overhead.
Title: O2render: An OpenCL-to-Renderscript translator for porting across various GPUs or CPUs