Amir Hossein Jalilvand, Seyedeh Newsha Estiri, S. Naderi, M. Najafi, M. Imani
Hardware-efficient implementation of the sorting operation is crucial for numerous applications, particularly when fast and energy-efficient sorting of data is desired. Unary computing has been used for low-cost hardware sorting. This work proposes a comparison-free unary sorting engine that iteratively finds maximum values. Synthesis results show up to an 81% reduction in hardware area compared to the state-of-the-art unary sorting design. By processing right-aligned unary bit-streams, our unary sorter is able to sort many inputs in fewer clock cycles.
{"title":"A fast and low-cost comparison-free sorting engine with unary computing: late breaking results","authors":"Amir Hossein Jalilvand, Seyedeh Newsha Estiri, S. Naderi, M. Najafi, M. Imani","doi":"10.1145/3489517.3530615","DOIUrl":"https://doi.org/10.1145/3489517.3530615","url":null,"abstract":"Hardware-efficient implementation of sorting operation is crucial for numerous applications, particularly when fast and energy-efficient sorting of data is desired. Unary computing has been used for low-cost hardware sorting. This work proposes a comparison-free unary sorting engine by iteratively finding maximum values. Synthesis results show up to 81% reduction in hardware area compared to the state-of-the-art unary sorting design. By processing right-aligned unary bit-streams, our unary sorter is able to sort many inputs in fewer clock cycles.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133472485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huimin Wang, Xingyu Tong, Chenyue Ma, Runming Shi, Jianli Chen, Kun Wang, Jun Yu, Yao-Wen Chang
The fast-growing capacity and complexity of FPGAs pose challenges for global placement. Moreover, while many recent studies have adopted eDensity-based placement for its efficiency and quality, these methods suffer from redundant frequency translation. This paper presents a CNN-inspired analytical placement algorithm that effectively handles the redundant frequency-translation problem for large-scale FPGAs. Specifically, we compute the density penalty through a fully-connected forward propagation and back-propagate its gradient through a discrete differential convolution. To handle FPGA heterogeneity, vectorization plays a vital role in self-adjusting the density penalty factor and the learning rate. In addition, a pseudo-net model further optimizes the site constraints by establishing connections between blocks and their nearest available regions. Finally, we formulate a refined objective function and a degree-specific gradient preconditioning to achieve a robust, high-quality solution. Experimental results show that our algorithm achieves an 8% reduction in HPWL and 15% less global-placement runtime on average over leading commercial tools.
{"title":"CNN-inspired analytical global placement for large-scale heterogeneous FPGAs","authors":"Huimin Wang, Xingyu Tong, Chenyue Ma, Runming Shi, Jianli Chen, Kun Wang, Jun Yu, Yao-Wen Chang","doi":"10.1145/3489517.3530566","DOIUrl":"https://doi.org/10.1145/3489517.3530566","url":null,"abstract":"The fast-growing capacity and complexity are challenging for FPGA global placement. Besides, while many recent studies have focused on the eDensity-based placement as its great efficiency and quality, they suffer from redundant frequency translation. This paper presents a CNN-inspired analytical placement algorithm to effectively handle the redundant frequency translation problem for large-scale FPGAs. Specifically, we compute the density penalty by a fully-connected propagation and gradient to a discrete differential convolution backward. With the FPGA heterogeneity, vectorization plays a vital role in self-adjusting the density penalty factor and the learning rate. In addition, a pseudo net model is used to further optimize the site constraints by establishing connections between blocks and their nearest available regions. Finally, we formulate a refined objective function and a degree-specific gradient preconditioning to achieve a robust, high-quality solution. Experimental results show that our algorithm achieves an 8% reduction on HPWL and 15% less global placement runtime on average over leading commercial tools.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130172005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianqiang Wang, Pouya Mahmoody, Ferdinand Brasser, Patrick Jauernig, A. Sadeghi, D.Y. Yu, Dahan Pan, Yuanyuan Zhang
Modern security architectures provide Trusted Execution Environments (TEEs) to protect critical data and applications against malicious privileged software in so-called enclaves. However, the seamless integration of existing TEEs into the cloud is hindered, as they require substantial adaptation of both the software executing inside an enclave and the cloud management software that handles enclaved workloads. We tackle these challenges by presenting VirTEE, the first TEE architecture that allows strongly isolated execution of unmodified virtual machines (VMs) in enclaves, as well as secure live migration of VM enclaves between VirTEE-enabled servers. Combined with its secure I/O capabilities, VirTEE enables the integration of enclaved computing into today's complex cloud infrastructure. We thoroughly evaluate our RISC-V-based prototype and show its effectiveness and efficiency.
{"title":"VirTEE","authors":"Jianqiang Wang, Pouya Mahmoody, Ferdinand Brasser, Patrick Jauernig, A. Sadeghi, D.Y. Yu, Dahan Pan, Yuanyuan Zhang","doi":"10.1145/3489517.3530436","DOIUrl":"https://doi.org/10.1145/3489517.3530436","url":null,"abstract":"Modern security architectures provide Trusted Execution Environments (TEEs) to protect critical data and applications against malicious privileged software in so-called enclaves. However, the seamless integration of existing TEEs into the cloud is hindered, as they require substantial adaptation of the software executing inside an enclave as well as the cloud management software to handle enclaved workloads. We tackle these challenges by presenting VirTEE, the first TEE architecture that allows strongly isolated execution of unmodified virtual machines (VMs) in enclaves, as well as secure live migration of VM enclaves between VirTEE-enabled servers. Combined with its secure I/O capabilities, VirTEE enables the integration of enclaved computing in today's complex cloud infrastructure. We thoroughly evaluate our RISC-V-based prototype, and show its effectiveness and efficiency.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114967060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qi Sun, Xinyun Zhang, Hao Geng, Yuxuan Zhao, Yang Bai, Haisheng Zheng, Bei Yu
Compiling DNN models for GPUs and optimizing their performance remains an open problem. We propose GTuner, a novel framework that jointly learns from the structures of computational graphs and the statistical features of the generated code to find optimal code implementations. A Graph ATtention network (GAT) is designed as the performance estimator in GTuner. In GAT, graph neural layers propagate information across the graph, and a multi-head self-attention module learns the complicated relationships among the features. Under the guidance of GAT, the GPU codes are generated through auto-tuning. Experimental results demonstrate that our method remarkably outperforms previous approaches.
{"title":"GTuner","authors":"Qi Sun, Xinyun Zhang, Hao Geng, Yuxuan Zhao, Yang Bai, Haisheng Zheng, Bei Yu","doi":"10.1145/3489517.3530584","DOIUrl":"https://doi.org/10.1145/3489517.3530584","url":null,"abstract":"It is an open problem to compile DNN models on GPU and improve the performance. A novel framework, GTuner, is proposed to jointly learn from the structures of computational graphs and the statistical features of codes to find the optimal code implementations. A Graph ATtention network (GAT) is designed as the performance estimator in GTuner. In GAT, graph neural layers are used to propagate the information in the graph and a multi-head self-attention module is designed to learn the complicated relationships between the features. Under the guidance of GAT, the GPU codes are generated through auto-tuning. Experimental results demonstrate that our method outperforms the previous arts remarkably.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116518521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pengwen Chen, Chung-Kuan Cheng, Albert Chern, Chester Holtz, Aoxi Li, Yucheng Wang
Canonical methods for analytical placement of VLSI designs rely on solving nonlinear programs to minimize wirelength and cell overlap. We focus on producing initial layouts such that a global analytical placer performs better than with existing initialization heuristics. We reduce the initialization problem to a quadratically constrained quadratic program, and our formulation is aware of fixed macros. We propose an efficient algorithm that can quickly generate initializations for test cases with millions of cells. We show that our method for parameter initialization results in superior post-detailed-placement wirelength.
{"title":"Placement initialization via a projected eigenvector algorithm: late breaking results","authors":"Pengwen Chen, Chung-Kuan Cheng, Albert Chern, Chester Holtz, Aoxi Li, Yucheng Wang","doi":"10.1145/3489517.3530620","DOIUrl":"https://doi.org/10.1145/3489517.3530620","url":null,"abstract":"Canonical methods for analytical placement of VLSI designs rely on solving nonlinear programs to minimize wirelength and cell overlap. We focus on producing initial layouts such that a global analytical placer performs better compared to existing heuristics for initialization. We reduce the problem of initialization to a quadratically constrained quadratic program. Our formulation is aware of fixed macros. We propose an efficient algorithm which can quickly generate initializations for testcases with millions of cells. We show that the our method for parameter initialization results in superior performance with respect to post-detailed placement wirelength.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114646082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jaekang Shin, Seungkyu Choi, Jongwoo Ra, L. Kim
Real-world AI applications, such as augmented reality or autonomous driving, require processing multiple computer-vision (CV) tasks simultaneously. However, the enormous data size and memory footprint have been a crucial hurdle to deploying deep neural networks on resource-constrained devices. To solve this problem, we propose an algorithm/architecture co-design. The proposed algorithmic scheme, named SqueeD, reduces per-task weight and activation size by 21.9x and 2.1x, respectively, by sharing those data between tasks. Moreover, we design an architecture and dataflow that minimize DRAM access by fully exploiting the benefits of SqueeD. As a result, the proposed architecture reduces the per-task increments in DRAM access and energy consumption by 2.2x and 1.3x, respectively.
{"title":"Algorithm/architecture co-design for energy-efficient acceleration of multi-task DNN","authors":"Jaekang Shin, Seungkyu Choi, Jongwoo Ra, L. Kim","doi":"10.1145/3489517.3530455","DOIUrl":"https://doi.org/10.1145/3489517.3530455","url":null,"abstract":"Real-world AI applications, such as augmented reality or autonomous driving, require processing multiple CV tasks simultaneously. However, the enormous data size and the memory footprint have been a crucial hurdle for deep neural networks to be applied in resource-constrained devices. To solve the problem, we propose an algorithm/architecture co-design. The proposed algorithmic scheme, named SqueeD, reduces per-task weight and activation size by 21.9x and 2.1x, respectively, by sharing those data between tasks. Moreover, we design architecture and dataflow to minimize DRAM access by fully utilizing benefits from SqueeD. As a result, the proposed architecture reduces the DRAM access increment and energy consumption increment per task by 2.2x and 1.3x, respectively.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131857077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gobinda Saha, Cheng Wang, A. Raghunathan, K. Roy
Remarkable advances in machine learning and artificial intelligence have been made in various domains, achieving near-human performance in a plethora of cognitive tasks including vision, speech, and natural language processing. However, implementations of such cognitive algorithms on conventional "von Neumann" architectures are orders of magnitude more area- and power-expensive than the biological brain. It is therefore imperative to search for fundamentally new approaches so that improvements in computing performance and efficiency can keep up with the exponential growth of AI computational demand. In this article, we present a cross-layer approach to the exploration of new paradigms in cognitive computing. This effort spans new learning algorithms inspired by biological information-processing principles, network architectures best suited for such algorithms, and neuromorphic hardware substrates such as computing-in-memory fabrics, in order to build intelligent machines that achieve orders-of-magnitude improvements in energy efficiency for cognitive processing. We argue that such cross-layer innovations in cognitive computing are well poised to enable a new wave of autonomous intelligence across the computing spectrum, from resource-constrained IoT devices to the cloud.
{"title":"A cross-layer approach to cognitive computing: invited","authors":"Gobinda Saha, Cheng Wang, A. Raghunathan, K. Roy","doi":"10.1145/3489517.3530642","DOIUrl":"https://doi.org/10.1145/3489517.3530642","url":null,"abstract":"Remarkable advances in machine learning and artificial intelligence have been made in various domains, achieving near-human performance in a plethora of cognitive tasks including vision, speech and natural language processing. However, implementations of such cognitive algorithms in conventional \"von-Neumann\" architectures are orders of magnitude more area and power expensive than the biological brain. Therefore, it is imperative to search for fundamentally new approaches so that the improvement in computing performance and efficiency can keep up with the exponential growth of the AI computational demand. In this article, we present a cross-layer approach to the exploration of new paradigms in cognitive computing. This effort spans new learning algorithms inspired from biological information processing principles, network architectures best suited for such algorithms, and neuromorphic hardware substrates such as computing-in-memory fabrics in order to build intelligent machines that can achieve orders of improvement in energy efficiency at cognitive processing. We argue that such cross-layer innovations in cognitive computing are well-poised to enable a new wave of autonomous intelligence across the computing spectrum, from resource-constrained IoT devices to the cloud.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133665838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pierpaolo Morì, M. Vemparala, Nael Fasfous, Saptarshi Mitra, Sreetama Sarkar, Alexander Frickenstein, Lukas Frickenstein, D. Helms, N. Nagaraja, W. Stechele, C. Passerone
Semantic segmentation is one of the most popular tasks in computer vision, providing pixel-wise annotations for scene understanding. However, convolutional neural networks for segmentation require tremendous computational power. In this work, a fully pipelined hardware accelerator with support for dilated convolution is introduced, which eliminates the redundant zero multiplications. Furthermore, we propose a genetic-algorithm-based automated channel-pruning technique to jointly optimize computational complexity and model accuracy. Finally, hardware heuristics and an accurate model of the custom accelerator design enable a hardware-aware pruning framework. We achieve 2.44x lower latency with minimal degradation in semantic prediction quality (−1.98 pp lower mean intersection over union) compared to the baseline DeepLabV3+ model, evaluated on an Arria-10 FPGA. The binary files of the FPGA design and the baseline and pruned models can be found at github.com/pierpaolomori/SemanticSegmentationFPGA.
{"title":"Accelerating and pruning CNNs for semantic segmentation on FPGA","authors":"Pierpaolo Morì, M. Vemparala, Nael Fasfous, Saptarshi Mitra, Sreetama Sarkar, Alexander Frickenstein, Lukas Frickenstein, D. Helms, N. Nagaraja, W. Stechele, C. Passerone","doi":"10.1145/3489517.3530424","DOIUrl":"https://doi.org/10.1145/3489517.3530424","url":null,"abstract":"Semantic segmentation is one of the popular tasks in computer vision, providing pixel-wise annotations for scene understanding. However, segmentation-based convolutional neural networks require tremendous computational power. In this work, a fully-pipelined hardware accelerator with support for dilated convolution is introduced, which cuts down the redundant zero multiplications. Furthermore, we propose a genetic algorithm based automated channel pruning technique to jointly optimize computational complexity and model accuracy. Finally, hardware heuristics and an accurate model of the custom accelerator design enable a hardware-aware pruning framework. We achieve 2.44X lower latency with minimal degradation in semantic prediction quality (−1.98 pp lower mean intersection over union) compared to the baseline DeepLabV3+ model, evaluated on an Arria-10 FPGA. The binary files of the FPGA design, baseline and pruned models can be found in github.com/pierpaolomori/SemanticSegmentationFPGA","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"192 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116145892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Valentin Poirot, Laura Harms, Hendrica Martens, O. Landsiedel
IoT devices rely on environment detection to trigger specific actions, e.g., for headphones to adapt noise cancellation to the surroundings. While phones feature many sensors, from GNSS to cameras, small wearables must rely on the few energy-efficient components they already incorporate. In this paper, we demonstrate that a Bluetooth radio is the only component required to accurately classify environments and present BlueSeer, an environment-detection system that relies solely on received BLE packets and an embedded neural network. BlueSeer achieves an accuracy of up to 84% when differentiating between 7 environments on resource-constrained devices, and requires only ~12 ms for inference on a 64 MHz microcontroller unit.
{"title":"BlueSeer","authors":"Valentin Poirot, Laura Harms, Hendrica Martens, O. Landsiedel","doi":"10.1145/3489517.3530519","DOIUrl":"https://doi.org/10.1145/3489517.3530519","url":null,"abstract":"IoT devices rely on environment detection to trigger specific actions, e.g., for headphones to adapt noise cancellation to the surroundings. While phones feature many sensors, from GNSS to cameras, small wearables must rely on the few energy-efficient components they already incorporate. In this paper, we demonstrate that a Bluetooth radio is the only component required to accurately classify environments and present BlueSeer, an environment-detection system that solely relies on received BLE packets and an embedded neural network. BlueSeer achieves an accuracy of up to 84% differentiating between 7 environments on resource-constrained devices, and requires only ~ 12 ms for inference on a 64 MHz microcontroller-unit.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"325 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116610964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}