Disaggregated memory has recently been proposed as a way to allow flexible and fine-grained allocation of memory capacity to compute jobs. This paper makes an important step towards effective resource allocation on disaggregated memory systems. Specifically, we propose a generic approach to predict the performance degradation due to sharing of disaggregated memory. In contrast to prior work, cache capacity is not shared among multiple applications, which removes a major source of interference with application performance. For this reason, our analysis is driven by the demand for memory bandwidth, which has been shown to have an important effect on application performance. We show that profiling the application slowdown often involves significant experimental error and noise, and we improve accuracy by linear smoothing of the sensitivity curves. We also show that contention is sensitive to the ratio between read and write memory accesses, and we address this sensitivity by building a family of sensitivity curves according to the read/write ratio. Our results show that the methodology predicts the slowdown of applications subject to memory contention with an average error of 1.19% and a maximum error of 14.6%. Compared with the state of the art, the relative improvement is almost 24% on average and 33% in the worst case.
"Contention-aware application performance prediction for disaggregated memory systems," F. V. Zacarias, Rajiv Nishtala, P. Carpenter. Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3392625
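The smoothing step is described only as "linear smoothing of the sensitivity curves"; one plausible reading is a symmetric moving average. A minimal sketch in Python, where the `smooth` helper and the sample slowdown values are illustrative, not taken from the paper:

```python
def smooth(curve, window=3):
    """Symmetric moving average; `window` is the half-width of the filter.
    Near the ends, the window is truncated rather than padded."""
    n = len(curve)
    out = []
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n, i + window + 1)
        out.append(sum(curve[lo:hi]) / (hi - lo))
    return out

# Hypothetical profiled slowdown samples at increasing contended bandwidth:
noisy = [1.00, 1.02, 1.01, 1.08, 1.05, 1.12, 1.10, 1.18, 1.16, 1.22]
smoothed = smooth(noisy, window=1)
```

Smoothing each profiled point against its neighbours reduces the run-to-run experimental noise before the curve is used for prediction.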
Matheus A. Cavalcante, Andreas Kurth, Fabian Schuiki, L. Benini
In heterogeneous computer architectures, the serial part of an application is coupled with domain-specific accelerators that promise high computing throughput and efficiency across a wide range of applications. In such systems, the serial part of a program is executed on a Central Processing Unit (CPU) core optimized for single-thread performance, while parallel sections are offloaded to Programmable Manycore Accelerators (PMCAs). This heterogeneity requires CPU cores and PMCAs to share data in memory efficiently, even though CPUs rely on a coherent memory system where data is transferred in cache lines, while PMCAs are based on non-coherent scratchpad memories where data is transferred in bursts by DMA engines. In this paper, we tackle the challenges and hardware complexity of bridging the gap from a non-coherent, burst-based memory hierarchy to a coherent, cache-line-based one. We design and implement an open-source hardware module that reaches 97% peak throughput over a wide range of realistic linear algebra kernels and is suited to a wide spectrum of memory architectures. Implemented in a state-of-the-art 22 nm FD-SOI technology, our module bridges up to 650 Gbps at 130 fJ/bit and has a complexity of less than 1 kGE/Gbps.
"Design of an open-source bridge between non-coherent burst-based and coherent cache-line-based memory systems." Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3392631
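The module itself is hardware, but the address arithmetic such a bridge must perform can be illustrated in software: splitting a DMA byte burst into cache-line-aligned transactions. In this sketch the 64-byte line size and the `burst_to_lines` helper are assumptions for illustration only:

```python
LINE = 64  # assumed cache-line size in bytes

def burst_to_lines(addr, length, line=LINE):
    """Split the byte burst [addr, addr + length) into per-cache-line
    transactions, returned as (line_base, offset_in_line, nbytes) tuples."""
    txns = []
    end = addr + length
    while addr < end:
        base = addr - (addr % line)           # align down to the line base
        nbytes = min(end, base + line) - addr  # bytes left in this line
        txns.append((base, addr - base, nbytes))
        addr += nbytes
    return txns

# A 100-byte burst starting 16 bytes into a line spans two cache lines:
print(burst_to_lines(0x1010, 100))  # → [(4096, 16, 48), (4160, 0, 52)]
```

Only the first and last transactions can be partial lines; everything in between is a full line, which is what lets a hardware bridge sustain near-peak throughput on long bursts.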
As the density and capability of FPGA-based reconfigurable computing continue to grow and access to large-scale ASIC integration becomes more widespread, research activity around high-level synthesis flows has expanded at a similar rate. The goal of these efforts is to reduce the time and effort required to construct and deploy application-specific architectures. However, these synthesis techniques often force users to consider the entire circuit design space in order to arrive at a successful implementation. This lack of design specificity often yields hardware implementations that are difficult to program, difficult to reuse in future designs, and make sub-optimal use of hardware resources. In this work we introduce the StoneCutter instruction-set design language and tool infrastructure. StoneCutter provides a familiar, C-like language in which to implement individual, programmable instructions. The LLVM-based StoneCutter compiler performs per-instruction and whole-ISA optimizations to generate a high-performance Chisel HDL representation of the target design. Using the existing Chisel tools, users can also generate cycle-accurate C++ simulation models as well as Verilog representations of the target design. As a result, StoneCutter provides a very rapid environment for design development and experimentation.
"StoneCutter," J. Leidel, D. Donofrio, Frank Conlon. Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3394029
Jonathan M. Baker, Casey Duckering, Alexander P. Hoover, F. Chong
Current quantum computer designs will not scale. To scale beyond small prototypes, quantum architectures will likely adopt a modular approach with clusters of tightly connected quantum bits and sparser connections between clusters. We exploit this clustering and the statically-known control flow of quantum programs to create tractable partitioning heuristics which map quantum circuits to modular physical machines one time slice at a time. Specifically, we create optimized mappings for each time slice, accounting for the cost to move data from the previous time slice and using a tunable lookahead scheme to reduce the cost to move to future time slices. We compare our approach to a traditional statically-mapped, owner-computes model. Our results show strict improvement over the static mapping baseline. We reduce the non-local communication overhead by 89.8% in the best case and by 60.9% on average. Our techniques, unlike many exact solver methods, are computationally tractable.
"Time-sliced quantum circuit partitioning for modular architectures." Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3392617
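As a toy illustration of the lookahead idea (not the paper's partitioner), the sketch below plans one qubit's cluster assignment across time slices: slices with interactions pin the qubit to a cluster, while free slices consult the next few slices so that a needed move can start early instead of just before the interaction. The `place` helper, the encoding, and the lookahead policy are all simplifications:

```python
def place(required, lookahead=2):
    """Choose a cluster for one qubit at each time slice.
    required[t] is the cluster the qubit must occupy at slice t if it
    interacts then, or None when any cluster works."""
    plan = []
    cur = required[0] if required[0] is not None else 0
    for t, need in enumerate(required):
        if need is not None:
            nxt = need                     # an interaction pins the qubit
        else:
            # Free slice: peek at the next `lookahead` slices and move
            # toward the earliest upcoming requirement, if any.
            upcoming = [r for r in required[t + 1:t + 1 + lookahead]
                        if r is not None]
            nxt = upcoming[0] if upcoming else cur
        plan.append(nxt)
        cur = nxt
    return plan

# The move to cluster 1 begins during the free slices, ahead of the need:
print(place([0, None, None, 1, 1]))  # → [0, 1, 1, 1, 1]
```

In the paper this trade-off is made per slice over whole circuit partitions, weighing the cost of moving data from the previous slice against discounted future moves; the tunable lookahead here plays the same role in miniature.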
A. Macaluso, L. Clissa, Stefano Lodi, Claudio Sartori
Quantum Computing offers a new paradigm for efficient computing and many AI applications could benefit from its potential boost in performance. However, the main limitation is the constraint to linear operations that hampers the representation of complex relationships in data. In this work, we propose an efficient implementation of quantum splines for non-linear approximation. In particular, we first discuss possible parametrisations, and select the most convenient for exploiting the HHL algorithm to obtain the estimates of spline coefficients. Then, we investigate QSpline performance as an evaluation routine for some of the most popular activation functions adopted in ML. Finally, a detailed comparison with classical alternatives to the HHL is also presented.
"Quantum splines for non-linear approximations." Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3394032
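Each linear spline segment is determined by a small linear system in its coefficients, which is the kind of system HHL would solve on quantum hardware. Below is the classical counterpart for one segment of a sigmoid approximation; the `segment_coeffs` helper is illustrative, not the paper's routine:

```python
import math

def segment_coeffs(x0, x1, f):
    """Solve the 2x2 system [[x0, 1], [x1, 1]] @ (a, b) = (f(x0), f(x1))
    for the slope a and intercept b of one linear spline segment.
    HHL targets exactly this kind of linear system, in superposition."""
    det = x0 - x1
    a = (f(x0) - f(x1)) / det
    b = (x0 * f(x1) - x1 * f(x0)) / det
    return a, b

sigmoid = lambda x: 1 / (1 + math.exp(-x))
a, b = segment_coeffs(-1.0, 1.0, sigmoid)
# The segment a*x + b interpolates the sigmoid exactly at both endpoints.
```

Repeating this over many knots yields the piecewise-linear activation approximation that the paper evaluates as a QSpline routine.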
Yifan Shen, Ke Liu, Ziting Guo, Wenli Zhang, Guanghui Zhang, V. Aggarwal, Mingyu Chen
"Freeway." Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3394028
Nazareno Bruschi, Angelo Garofalo, Francesco Conti, Giuseppe Tagliavini, D. Rossi
The deployment of Quantized Neural Networks (QNN) on advanced microcontrollers requires optimized software to exploit digital signal processing (DSP) extensions of modern instruction set architectures (ISA). As such, recent research proposed optimized libraries for QNNs (from 8-bit to 2-bit) such as CMSIS-NN and PULP-NN. This work presents an extension to the PULP-NN library targeting the acceleration of mixed-precision Deep Neural Networks, an emerging paradigm able to significantly shrink the memory footprint of deep neural networks with negligible accuracy loss. The library, composed of 27 kernels, one for each permutation of input feature maps, weights, and output feature maps precision (considering 8-bit, 4-bit and 2-bit), enables efficient inference of QNN on parallel ultra-low-power (PULP) clusters of RISC-V based processors, featuring the RV32IMCXpulpV2 ISA. The proposed solution, benchmarked on an 8-cores GAP-8 PULP cluster, reaches peak performance of 16 MACs/cycle on 8 cores, performing 21× to 25× faster than an STM32H7 (powered by an ARM Cortex M7 processor) with 15× to 21× better energy efficiency.
"Enabling mixed-precision quantized neural networks in extreme-edge devices." Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3394038
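To give a flavor of what a sub-byte mixed-precision kernel must do (this is plain Python for illustration, not PULP-NN's vectorized RISC-V code), the sketch packs signed 4-bit weights two per byte and sign-extends each nibble during a dot product with 8-bit activations:

```python
def pack4(ws):
    """Pack signed 4-bit values (-8..7) two per byte, low nibble first."""
    if len(ws) % 2:
        ws = ws + [0]
    return bytes((a & 0xF) | ((b & 0xF) << 4)
                 for a, b in zip(ws[::2], ws[1::2]))

def dot_u8_s4(acts, packed):
    """Dot product of 8-bit activations with packed signed 4-bit weights."""
    acc = 0
    for i, a in enumerate(acts):
        nib = (packed[i // 2] >> (4 * (i % 2))) & 0xF
        w = nib - 16 if nib & 0x8 else nib   # sign-extend the nibble
        acc += a * w
    return acc

w = [1, -2, 3, -4]
x = [10, 20, 30, 40]
print(dot_u8_s4(x, pack4(w)))  # 10 - 40 + 90 - 160 = -100
```

The unpack-and-sign-extend step is exactly what the per-permutation kernels specialize away: with the precision fixed at compile time, the extraction shifts and masks become straight-line DSP instructions.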
To ensure data durability and crash consistency, LSM-tree-based key-value stores suffer high write-ahead log (WAL) synchronization overhead. Fortunately, the advent of NVM offers an opportunity to address this issue. However, NVM is currently too expensive to meet the demand of massive storage systems, so a hybrid NVM and SSD storage system provides a more cost-efficient solution. This paper proposes HiLSM, a key-value store for hybrid NVM-SSD storage systems. Matching the characteristics of the two storage media, HiLSM adopts hybrid data structures consisting of a log-structured memory and an LSM-tree. To address write stalls in write-intensive scenarios, a fine-grained data migration strategy starts migration as early as possible. To bridge the performance gap between NVM and SSD, a multi-threaded migration strategy completes migration as quickly as possible. To counter the LSM-tree's inherent write amplification, a data filtering strategy absorbs as many data updates as possible in NVM. We compare HiLSM with state-of-the-art key-value stores in extensive experiments; the results show that HiLSM achieves 1.3x higher write throughput, 10x higher read throughput, and 79% less write traffic under a skewed workload.
"HiLSM," Wenjie Li, Dejun Jiang, Jin Xiong, Yungang Bao. Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3392621
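A toy sketch of the early-migration idea, not HiLSM's implementation: a two-tier store whose fast tier (standing in for NVM) starts migrating its oldest entries to a slow tier (standing in for SSD) once a low watermark is crossed, rather than stalling writes when it is full. `TwoTierStore` and its parameters are hypothetical:

```python
class TwoTierStore:
    """Toy two-tier key-value store with watermark-triggered migration."""

    def __init__(self, low_watermark=2):
        self.fast = {}   # fast tier; dicts preserve insertion order
        self.slow = {}   # slow tier
        self.low = low_watermark

    def put(self, k, v):
        self.fast[k] = v             # updates are absorbed in the fast tier
        if len(self.fast) > self.low:
            self._migrate()          # start migrating early, before full

    def _migrate(self):
        # Move oldest-inserted entries down until back at the watermark.
        while len(self.fast) > self.low:
            k = next(iter(self.fast))
            self.slow[k] = self.fast.pop(k)

    def get(self, k):
        # The fast tier is authoritative for recently written keys.
        return self.fast.get(k, self.slow.get(k))
```

In HiLSM the same policy questions (when to start migration, how many threads to run it on, which updates to filter out before they reach the SSD) are answered by the three strategies the abstract describes.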
The proliferation of smart connected devices using voice-activated digital assistants (e.g., Apple Siri, Google Assistant, Amazon Alexa) is raising interest in algorithms that localize and recognize audio sources. Among these, deep neural networks (DNNs) are seen as a promising approach: unlike other methods, DNNs can categorize received events, discriminating events of interest from others even in the presence of noise. Despite their advantages, DNNs require large training datasets, so tools for generating such datasets are of great value in accelerating the development of advanced learning models. This paper presents SoundFactory, a framework for simulating the propagation of sound waves (also considering noise, reverberation, reflection, attenuation, and other interfering waves) and the microphone array response to them. SoundFactory thus makes it easy to generate datasets for training the deep neural networks that underpin modern applications, and it is flexible enough to simulate many different microphone array configurations, covering a large set of use cases. To demonstrate its capabilities, we generated a dataset and trained two different (rather simple) learning models on it, achieving up to 97% accuracy. The quality of the generated dataset was also assessed by comparing the microphone array model's responses with real ones.
"SoundFactory," A. Scionti, S. Ciccia, O. Terzo. Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3394036
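The core of a free-field microphone-array model is a per-microphone propagation delay and distance attenuation; a minimal sketch follows, where the `mic_response` helper, the sample rate, and the array geometry are illustrative (a full simulator like SoundFactory additionally models reverberation, reflections, and interfering sources):

```python
import math

C = 343.0  # speed of sound in air, m/s

def mic_response(src, mics, fs=16000):
    """For a point source in free field, return (delay_in_samples,
    amplitude) per microphone: delay = distance / c, amplitude ~ 1/r."""
    out = []
    for m in mics:
        r = math.dist(src, m)
        out.append((r / C * fs, 1.0 / max(r, 1e-6)))
    return out

# Hypothetical 2-mic array with 10 cm spacing, source 1 m away broadside:
resp = mic_response((0.0, 1.0), [(-0.05, 0.0), (0.05, 0.0)])
```

The inter-microphone delay difference is what localization algorithms (and the DNNs trained on such datasets) exploit; for a broadside source, as here, the two delays coincide.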
Rupam Mondal, H. Ngo, James Shey, R. Rakvic, Owens Walker, Dane Brown
Many applications make use of edge devices in wireless sensor networks (WSNs), including video surveillance, traffic monitoring and enforcement, personal and health care, gaming, habitat monitoring, and industrial process control. However, these edge devices are resource-limited embedded systems that require a low-cost, low-power, high-performance encryption/decryption solution to prevent attacks such as eavesdropping, message modification, and impersonation. This paper proposes a field-programmable gate array (FPGA) based design and implementation of the Advanced Encryption Standard (AES) algorithm for encryption and decryption, using a parallel-pipeline architecture with a data-forwarding mechanism that efficiently utilizes on-chip memory modules and massively parallel processing units to sustain a high throughput rate. Hardware designs that optimize the implementation of the AES algorithm are proposed to minimize resource allocation and maximize throughput, and are shown to outperform existing solutions in the literature. Additionally, a rapid prototype of a complete system-on-chip (SoC) solution employing the proposed design on a configurable platform has been developed and proven suitable for real-time applications.
"Efficient architecture design for the AES-128 algorithm on embedded systems." Proceedings of the 17th ACM International Conference on Computing Frontiers, May 11, 2020. https://doi.org/10.1145/3387902.3392624
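Why a parallel-pipeline AES datapath sustains high throughput follows from basic pipeline arithmetic: after a fill latency of one cycle per stage, one block completes every initiation interval. A sketch of that arithmetic, under the assumption (not stated in the paper) of a fully unrolled round-per-stage AES-128 pipeline with its 10 rounds:

```python
def pipeline_cycles(blocks, stages, ii=1):
    """Cycles for a pipelined datapath to process `blocks` independent
    inputs: `stages` cycles to fill the pipe, then one result every
    `ii` cycles (the initiation interval)."""
    if blocks == 0:
        return 0
    return stages + (blocks - 1) * ii

# AES-128 has 10 rounds; with one round per stage and ii = 1, throughput
# approaches one block per cycle for long streams:
print(pipeline_cycles(1000, stages=10))  # → 1009 cycles for 1000 blocks
```

The same arithmetic explains the cost of dependent blocks: a feedback mode like CBC forces ii up to the full pipeline depth, which is why high-throughput hardware designs favor parallelizable modes such as CTR.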