E. Vermij, Leandro Fiorin, C. Hagleitner, K. Bertels
Big data workloads have recently assumed great importance in many business and scientific applications, and sorting elements efficiently is a key operation in such workloads. In this work, we analyze the implementation of the mergesort algorithm on heterogeneous systems composed of CPUs and near-data processors located on the system memory channels. For configurations with an equal number of active CPU cores and near-data processors, our experiments show a performance speedup of up to 2.5x, as well as an energy-per-solution reduction of up to 2.5x.
{"title":"Sorting big data on heterogeneous near-data processing systems","authors":"E. Vermij, Leandro Fiorin, C. Hagleitner, K. Bertels","doi":"10.1145/3075564.3078885","DOIUrl":"https://doi.org/10.1145/3075564.3078885","url":null,"abstract":"Big data workloads assumed recently a relevant importance in many business and scientific applications. Sorting elements efficiently in big data workloads is a key operation. In this work, we analyze the implementation of the mergesort algorithm on heterogeneous systems composed of CPUs and near-data processors located on the system memory channels. For configurations with equal number of active CPU cores and near-data processors, our experiments show a performance speedup of up to 2.5, as well as up to 2.5x energy-per-solution reduction.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132188271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Martini, G. Benetti, Filippo Cipolla, Davide Caprino, M. L. D. Vedova, T. Facchinetti
The use of real-time scheduling methods to coordinate a set of power loads is being explored in the field of Cyber-Physical Energy Systems, with the goal of optimizing the aggregated peak power used by many electric loads. Real-time scheduling has attractive features in this domain: its inherent resource optimization, which limits the number of tasks running at the same time, directly benefits peak load optimization. This paper shows the combined use of a two-dimensional bin-packing method and an optimal multi-processor real-time scheduling algorithm to coordinate the activation of electric loads. The result is an effective global scheduling approach in which the activation of loads is organized into a pattern that takes into account both the timing constraints of the loads and the actual combination of active loads. The approach is validated by scheduling a set of thermal loads (heaters) in a building with accurately modeled temperature dynamics. The proposed method is shown to achieve a significant peak load reduction, up to around 70%, with respect to a traditional thermostat controller.
{"title":"Peak load optimization through 2-dimensional packing and multi-processor real-time scheduling","authors":"D. Martini, G. Benetti, Filippo Cipolla, Davide Caprino, M. L. D. Vedova, T. Facchinetti","doi":"10.1145/3075564.3075587","DOIUrl":"https://doi.org/10.1145/3075564.3075587","url":null,"abstract":"The use of real-time scheduling methods to coordinate a set of power loads is being explored in the field of Cyber-Physical Energy Systems, with the goal of optimizing the aggregated peak load of power used by many electric loads. Real-time scheduling has attractive features in this domain. Thanks to its inherent resource optimization, which limits the number of concurrent tasks that are running at the same time, real-time scheduling provides direct benefits to peak load optimization. This paper shows the combined use of a two-dimensional bin-packing method and an optimal multi-processor real-time scheduling algorithm to coordinate the activation of electric loads. The result is an effective global scheduling approach where the activation of loads is organized into a pattern that takes into account the timing constraints of the loads and the actual combination of active loads. The validation is done by scheduling a set of thermal loads (heaters) in a building, with accurately modeled temperature dynamics. The proposed method is shown to achieve a significant peak load reduction, up to around 70%, w.r.t. 
the traditional thermostat controller.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133640174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
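The packing idea can be roughly illustrated with a hypothetical first-fit heuristic in Python (not the paper's optimal two-dimensional bin-packing and real-time scheduling algorithms): each load is a rectangle with a duration and a power draw, placed on a discrete timeline so that the aggregated profile never exceeds a cap.

```python
def first_fit_start(profile, duration, power, cap):
    """Earliest start slot where adding `power` for `duration` slots
    keeps the aggregated power profile at or below `cap` (None if none)."""
    for start in range(len(profile) - duration + 1):
        if all(profile[t] + power <= cap for t in range(start, start + duration)):
            return start
    return None

def pack_loads(loads, horizon, cap):
    """Greedily place (duration, power) loads on a discrete timeline,
    largest energy first; returns (plan, peak) or None if infeasible."""
    profile = [0.0] * horizon
    plan = []
    for duration, power in sorted(loads, key=lambda l: l[0] * l[1], reverse=True):
        start = first_fit_start(profile, duration, power, cap)
        if start is None:
            return None  # no feasible placement under this cap
        for t in range(start, start + duration):
            profile[t] += power
        plan.append((start, duration, power))
    return plan, max(profile)
```

Lowering `cap` until the packing becomes infeasible gives a crude peak-minimization loop; the paper's approach additionally honors per-load timing constraints, which this sketch omits.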
Gabriel Ortiz, L. Svensson, Erik Alveflo, P. Larsson-Edefors
Processor energy models can be used by developers to estimate the power consumption of software applications without the need for a hardware implementation or additional measurement setups. Furthermore, these energy models can be used for energy-aware compiler optimization. This paper presents a measurement-based instruction-level energy characterization of the Adapteva Epiphany processor, a 16-core shared-memory architecture connected by a 2D network-on-chip. Based on a number of microbenchmarks, the instruction-level characterization was used to build an energy model that includes essential Epiphany instructions such as remote memory loads and stores. To validate the model, an FFT application was developed; this validation showed that the energy estimated by the model is within 0.4% of the measured energy.
{"title":"Instruction level energy model for the Adapteva Epiphany multi-core processor","authors":"Gabriel Ortiz, L. Svensson, Erik Alveflo, P. Larsson-Edefors","doi":"10.1145/3075564.3078892","DOIUrl":"https://doi.org/10.1145/3075564.3078892","url":null,"abstract":"Processor energy models can be used by developers to estimate, without the need of hardware implementation or additional measurement setups, the power consumption of software applications. Furthermore, these energy models can be used for energy-aware compiler optimization. This paper presents a measurement-based instruction-level energy characterization for the Adapteva Epiphany processor, which is a 16-core shared-memory architecture connected by a 2D network-on-chip. Based on a number of microbenchmarks, the instruction-level characterization was used to build an energy model that includes essential Epiphany instructions such as remote memory loads and stores. To validate the model, an FFT application was developed. This validation showed that the energy estimated by the model is within 0.4% of the measured energy.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116383567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The replacement policies known as MIN and OPT are optimal for a two-level memory hierarchy. The computation of the cache content for these policies requires the off-line knowledge of the entire address trace. However, the stack distance of a given access, that is, the smallest capacity of a cache for which that access results in a hit, is independent of future accesses and can be computed on-line. Off-line and on-line algorithms to compute the stack distance in time O(V) per access have been known for several decades, where V denotes the number of distinct addresses within the trace. The off-line time bound was recently improved to O(√V log V). This paper introduces the Critical Stack Algorithm for the on-line computation of the stack distance of MIN and OPT, in time O(log V) per access. The result exploits a novel analysis of properties of OPT and data structures based on balanced binary trees. A corresponding Ω(log V) lower bound is derived by a reduction from element distinctness; this bound holds in a variety of models of computation and applies even to the off-line simulation of just one cache capacity.
{"title":"Optimal On-Line Computation of Stack Distances for MIN and OPT","authors":"G. Bilardi, K. Ekanadham, P. Pattnaik","doi":"10.1145/3075564.3075571","DOIUrl":"https://doi.org/10.1145/3075564.3075571","url":null,"abstract":"The replacement policies known as MIN and OPT are optimal for a two-level memory hierarchy. The computation of the cache content for these policies requires the off-line knowledge of the entire address trace. However, the stack distance of a given access, that is, the smallest capacity of a cache for which that access results in a hit, is independent of future accesses and can be computed on-line. Off-line and on-line algorithms to compute the stack distance in time O(V) per access have been known for several decades, where V denotes the number of distinct addresses within the trace. The off-line time bound was recently improved to O(√V log V). This paper introduces the Critical Stack Algorithm for the online computation of the stack distance of MIN and OPT, in time O(log V) per access. The result exploits a novel analysis of properties of OPT and data structures based on balanced binary trees. A corresponding Ω(log V) lower bound is derived by a reduction from element distinctness; this bound holds in a variety of models of computation and applies even to the off-line simulation of just one cache capacity.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"52 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126005457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jun Zhang, Rui Hou, Junfeng Fan, KeKe Liu, Lixin Zhang, S. Mckee
Control-flow integrity (CFI) is considered a general and promising method for preventing code-reuse attacks, which exploit benign code sequences to realize arbitrary computation. Current approaches can efficiently protect control-flow transfers caused by indirect jumps and function calls (forward-edge CFI). However, they cannot effectively protect control-flow transfers caused by function returns (backward-edge CFI), because the set of return addresses of frequently called functions can be very large, which allows the backward-edge CFI to be bent. We address this problem by proposing a novel hardware-assisted mechanism (RAGuard) that binds a message authentication code (MAC) to each return address and enhances security via a physical unclonable function and a hardware hash function. The MACs can be stored on the program stack with the return addresses, and RAGuard hardware automatically verifies the integrity of return addresses. Our experiments show that, for a subset of the SPEC CPU2006 benchmarks, RAGuard incurs a 1.86% runtime overhead on average and requires no OS support.
{"title":"RAGuard: A Hardware Based Mechanism for Backward-Edge Control-Flow Integrity","authors":"Jun Zhang, Rui Hou, Junfeng Fan, KeKe Liu, Lixin Zhang, S. Mckee","doi":"10.1145/3075564.3075570","DOIUrl":"https://doi.org/10.1145/3075564.3075570","url":null,"abstract":"Control-flow integrity (CFI) is considered as a general and promising method to prevent code-reuse attacks, which utilize benign code sequences to realize arbitrary computation. Current approaches can efficiently protect control-flow transfers caused by indirect jumps and function calls (forward-edge CFI). However, they cannot effectively protect control-flow caused by the function return (backward-edge CFI). The reason is that the set of return addresses of the functions that are frequently called can be very large, which might bend the backward-edge CFI. We address this backward-edge CFI problem by proposing a novel hardware-assisted mechanism (RAGuard) that binds a message authentication code to each return address and enhances security via a physical unclonable function and a hardware hash function. The message authentication codes can be stored on the program stack with return address. RAGuard hardware automatically verifies the integrity of return addresses. 
Our experiments show that for a subset of the SPEC CPU2006 benchmarks, RAGuard incurs 1.86% runtime overheads on average with no need for OS support.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127387059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
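A software analogue of the idea, hypothetical and greatly simplified relative to RAGuard's PUF-keyed hardware hash, is to MAC each saved return address with a device-secret key and verify the tag before returning:

```python
import hmac
import hashlib

DEVICE_KEY = b'per-device-secret'  # stand-in for a PUF-derived hardware key

def tag_return_address(addr: int) -> bytes:
    """Compute a MAC over a return address; stored alongside it on the stack."""
    return hmac.new(DEVICE_KEY, addr.to_bytes(8, 'little'), hashlib.sha256).digest()

def verify_return_address(addr: int, tag: bytes) -> bool:
    """Check the MAC before using the return address: a mismatch means the
    saved return address (or its tag) was tampered with."""
    return hmac.compare_digest(tag, tag_return_address(addr))
```

Because an attacker without the key cannot forge a valid tag for an overwritten return address, checking the tag on every return closes the backward edge that forward-edge CFI leaves open.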
Mark will begin with a brief overview of deep learning and what has led to its recent popularity. He will provide a few demonstrations and examples of deep learning applications based on recent work at Intel Nervana. He will explain some of the challenges to continued progress in deep learning - such as high compute requirements and lengthy training time - and will discuss some of the solutions (e.g. custom deep learning hardware) that Intel Nervana is developing to usher in a new era of even more powerful AI.
{"title":"The Future of Deep Learning: Challenges & Solutions","authors":"M. Robins","doi":"10.1145/3075564.3097267","DOIUrl":"https://doi.org/10.1145/3075564.3097267","url":null,"abstract":"Mark will begin with a brief overview of deep learning and what has led to its recent popularity. He will provide a few demonstrations and examples of deep learning applications based on recent work at Intel Nervana. He will explain some of the challenges to continued progress in deep learning - such as high compute requirements and lengthy training time - and will discuss some of the solutions (e.g. custom deep learning hardware) that Intel Nervana is developing to usher in a new era of even more powerful AI.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129944172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Input/Output (I/O) performance is very important when running desktop applications in virtualized environments. Previous research has focused on cold execution or installation of desktop applications, where the I/O requests are obvious; in many other scenarios, such as warm launch or web page browsing, however, I/O behaviors are less clear. In this paper, we analyze the I/O behavior of these desktop scenarios. Our analysis reveals several interesting I/O behaviors of desktop applications; for example, we show that many warm applications send random read requests during their launch, which makes these applications storage-sensitive. We also find that the write requests from web page browsing generate considerable I/O pressure, even when the user only opens a simple news page and takes no further action. Our results have strong ramifications for the management of storage systems and the deployment of virtual machines in virtualized environments.
{"title":"Understanding the I/O Behavior of Desktop Applications in Virtualization","authors":"Yan Sui, Chun Yang, Xu Cheng","doi":"10.1145/3075564.3076263","DOIUrl":"https://doi.org/10.1145/3075564.3076263","url":null,"abstract":"Input/Output (I/O) performance is very important when running desktop applications in virtualized environments. Previous research has focused on cold execution or installation of desktop applications, where the I/O requests are obvious; in many other scenarios such as warm launch or web page browsing however, I/O behaviors are less clear, and in this paper, we analyze the I/O behavior of these desktop scenarios. Our analysis reveals several interesting I/O behaviors of desktop applications; for example, we show that many warm applications will send random read requests during their launch, which leads to storage-sensitivity of these applications. We also find that the write requests from web page browsing generates considerable I/O pressure, even when the users only open a simple news page and take no further action. Our results have strong ramifications for the management of storage systems and the deployment of virtual machines in virtualized environments.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"198 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121192289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper discusses the potential of applying deep learning techniques to plant classification and its usage for citizen science in large-scale biodiversity monitoring. We show that plant classification using near state-of-the-art convolutional network architectures like ResNet50 achieves significant improvements in accuracy, compared with the most widespread plant classification application, on test sets composed of thousands of different species labels. We find that the predictions can be confidently used as a baseline classification in citizen science communities like iNaturalist (or its Spanish fork, Natusfera), which in turn can share their data with biodiversity portals like GBIF.
{"title":"Large-Scale Plant Classification with Deep Neural Networks","authors":"Ignacio Heredia","doi":"10.1145/3075564.3075590","DOIUrl":"https://doi.org/10.1145/3075564.3075590","url":null,"abstract":"This paper discusses the potential of applying deep learning techniques for plant classification and its usage for citizen science in large-scale biodiversity monitoring. We show that plant classification using near state-of-the-art convolutional network architectures like ResNet50 achieves significant improvements in accuracy compared to the most widespread plant classification application in test sets composed of thousands of different species labels. We find that the predictions can be confidently used as a baseline classification in citizen science communities like iNaturalist (or its Spanish fork, Natusfera) which in turn can share their data with biodiversity portals like GBIF.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131321792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assessing the error resilience inherent to digital processing workloads provides application-specific insights for approximate computing strategies that improve power efficiency and/or performance. Using radio astronomy calibration as a case study, our contributions to improving error resilience analysis focus primarily on iterative methods that use a convergence criterion as a quality metric to terminate the iterative computation. We propose an adaptive statistical approximation model for high-level resilience analysis that makes it possible to divide a workload into exact and approximate iterations. This improves the existing error resilience analysis methodology by quantifying the number of approximate iterations (23% of the total iterations in our case study) in addition to the parameters used in state-of-the-art techniques. In this way, heterogeneous architectures composed of exact and inexact computing cores, as well as adaptive-accuracy architectures, can be exploited efficiently. Moreover, we demonstrate the importance of reconsidering the quality function for convergence-based iterative processes, as the original quality function (the convergence criterion) is not necessarily sufficient in the resilience analysis phase. If that is the case, an additional quality function has to be defined to assess the viability of the approximate techniques.
{"title":"Improving Error Resilience Analysis Methodology of Iterative Workloads for Approximate Computing","authors":"G. Gillani, A. Kokkeler","doi":"10.1145/3075564.3078891","DOIUrl":"https://doi.org/10.1145/3075564.3078891","url":null,"abstract":"Assessing error resilience inherent to the digital processing workloads provides application-specific insights towards approximate computing strategies for improving power efficiency and/or performance. With the case study of radio astronomy calibration, our contributions for improving the error resilience analysis are focused primarily on iterative methods that use a convergence criterion as a quality metric to terminate the iterative computations. We propose an adaptive statistical approximation model for high-level resilience analysis that provides an opportunity to divide a workload into exact and approximate iterations. This improves the existing error resilience analysis methodology by quantifying the number of approximate iterations (23% of the total iterations in our case study) in addition to other parameters used in the state-of-the-art techniques. This way heterogeneous architectures comprised of exact and inexact computing cores and adaptive accuracy architectures can be exploited efficiently. Moreover, we demonstrate the importance of quality function reconsideration for convergence based iterative processes as the original quality function (the convergence criterion) is not necessarily sufficient in the resilience analysis phase. 
If such is the case, an additional quality function has to be defined to assess the viability of the approximate techniques.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127534527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
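The exact/approximate iteration split can be illustrated with a toy convergence-based kernel (a Newton iteration for square roots in Python; the rounding step is a hypothetical stand-in for an inexact compute core, not the paper's statistical model):

```python
def sqrt_mixed(x, approx_iters=3, tol=1e-12):
    """Newton iteration for sqrt(x) with an exact/approximate split.

    The first `approx_iters` iterations are rounded to 3 decimals to
    emulate running on an inexact compute core; the remaining
    iterations run at full precision until the convergence criterion
    (the quality metric) is met.
    """
    guess, it = x, 0
    while abs(guess * guess - x) > tol:
        guess = 0.5 * (guess + x / guess)
        it += 1
        if it <= approx_iters:
            guess = round(guess, 3)  # emulated approximation
    return guess, it
```

The early iterations tolerate error because Newton's method corrects it; only the final iterations, which refine the answer below the convergence tolerance, must run exactly. Choosing how many iterations can be approximated is precisely what the paper's resilience analysis quantifies.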
Increasing rates of transient hardware faults pose a problem for computing applications. Current and future trends are likely to exacerbate this problem. When a transient fault occurs during program execution, data in the output can become corrupted. The severity of output corruptions depends on the application domain. Hence, different applications require different levels of fault tolerance. We present an LLVM-based AN encoder that can equip programs with an error detection mechanism at configurable levels of rigor. Based on our AN encoder, the trade-off between fault tolerance and runtime overhead is analyzed. It is found that, by suitably configuring our AN encoder, the runtime overhead can be reduced from 9.9x to 2.1x. At the same time, however, the probability that a hardware fault in the CPU will result in silent data corruption rises from 0.007 to over 0.022. The same probability for memory faults increases from 0.009 to over 0.032. It is further demonstrated, by applying different configurations of our AN encoder to the components of an arithmetic expression interpreter, that having fine-grained control over levels of fault tolerance can be beneficial.
{"title":"Trading Fault Tolerance for Performance in AN Encoding","authors":"Norman A. Rink, J. Castrillón","doi":"10.1145/3075564.3075565","DOIUrl":"https://doi.org/10.1145/3075564.3075565","url":null,"abstract":"Increasing rates of transient hardware faults pose a problem for computing applications. Current and future trends are likely to exacerbate this problem. When a transient fault occurs during program execution, data in the output can become corrupted. The severity of output corruptions depends on the application domain. Hence, different applications require different levels of fault tolerance. We present an LLVM-based AN encoder that can equip programs with an error detection mechanism at configurable levels of rigor. Based on our AN encoder, the trade-off between fault tolerance and runtime overhead is analyzed. It is found that, by suitably configuring our AN encoder, the runtime overhead can be reduced from 9.9x to 2.1x. At the same time, however, the probability that a hardware fault in the CPU will result in silent data corruption rises from 0.007 to over 0.022. The same probability for memory faults increases from 0.009 to over 0.032. It is further demonstrated, by applying different configurations of our AN encoder to the components of an arithmetic expression interpreter, that having fine-grained control over levels of fault tolerance can be beneficial.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128134798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}