CUGR: Detailed-Routability-Driven 3D Global Routing with Probabilistic Resource Model
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218646
Jinwei Liu, Chak-Wa Pui, Fangzhou Wang, Evangeline F. Y. Young
Many competitive global routers compress the 3D routing space into 2D in order to handle today’s massive circuit scales. This has proven effective at shortening routing time; however, quality is inevitably sacrificed to varying extents. In this paper, we propose two routing techniques that operate directly on the 3D routing space and can fully exploit the 3D structure of the grid graph. The first, 3D pattern routing, combines pattern routing and layer assignment, and produces solutions that are optimal with respect to the patterns under consideration in terms of a cost function of wirelength and routability. The second, multi-level 3D maze routing, uses two levels of maze routing with different cost functions and objectives to maximize routability and to search for the minimum-cost path efficiently. In addition, we design a cost function that is sensitive to resource changes, and a post-processing technique called patching that gives the detailed router more flexibility to escape congested regions. Experimental results show that our global router outperforms all contestants in the ICCAD’19 global routing contest.
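To make 3D maze routing concrete, the sketch below runs a minimum-cost path search on a 3D grid graph, assuming a simple per-node cost, a uniform via cost, and alternating layer directions; the names and the cost model are illustrative assumptions, not CUGR’s actual implementation.

    import heapq

    def maze_route_3d(grid_cost, via_cost, src, dst):
        """Minimum-cost path on a 3D grid graph via Dijkstra.

        grid_cost[l][x][y] is the cost of entering node (l, x, y); odd
        layers route horizontally and even layers vertically (an
        illustrative convention). Assumes dst is reachable from src.
        """
        L, X, Y = len(grid_cost), len(grid_cost[0]), len(grid_cost[0][0])
        dist, parent = {src: 0.0}, {}
        pq = [(0.0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == dst:
                break
            if d > dist.get(u, float("inf")):
                continue                      # stale queue entry
            l, x, y = u
            steps = [(l + 1, x, y), (l - 1, x, y)]          # vias
            if l % 2:
                steps += [(l, x + 1, y), (l, x - 1, y)]     # horizontal layer
            else:
                steps += [(l, x, y + 1), (l, x, y - 1)]     # vertical layer
            for v in steps:
                vl, vx, vy = v
                if not (0 <= vl < L and 0 <= vx < X and 0 <= vy < Y):
                    continue
                nd = d + (via_cost if vl != l else grid_cost[vl][vx][vy])
                if nd < dist.get(v, float("inf")):
                    dist[v], parent[v] = nd, u
                    heapq.heappush(pq, (nd, v))
        path, node = [dst], dst
        while node != src:                    # walk parents back to the source
            node = parent[node]
            path.append(node)
        return dist[dst], path[::-1]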
{"title":"CUGR: Detailed-Routability-Driven 3D Global Routing with Probabilistic Resource Model","authors":"Jinwei Liu, Chak-Wa Pui, Fangzhou Wang, Evangeline F. Y. Young","doi":"10.1109/DAC18072.2020.9218646","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218646","url":null,"abstract":"Many competitive global routers adopt the technique of compressing the 3D routing space into 2D in order to handle today’s massive circuit scales. It has been shown as an effective way to shorten the routing time, however, quality will inevitably be sacrificed to different extents. In this paper, we propose two routing techniques that directly operate on the 3D routing space and can maximally utilize the 3D structure of a grid graph. The first technique is called 3D pattern routing, by which we combine pattern routing and layer assignment, and we are able to produce optimal solutions with respect to the patterns under consideration in terms of a cost function in wire length and routability. The second technique is called multi-level 3D maze routing. Two levels of maze routing with different cost functions and objectives are designed to maximize the routability and to search for the minimum cost path efficiently. Besides, we also designed a cost function that is sensitive to resources changes and a post-processing technique called patching that gives the detailed router more flexibility in escaping congested regions. Finally, the experimental results show that our global router outperforms all the contestants in the ICCAD’19 global routing contest.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"s3-44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130189378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Convergence-Aware Neural Network Training
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218518
Hyungjun Oh, Yongseung Yu, G. Ryu, Gunjoo Ahn, Yuri Jeong, Yongjun Park, Jiwon Seo
Training a deep neural network (DNN) is expensive, requiring a large amount of computation time. While the training overhead is high, not all computation in DNN training is equal. Some parameters converge faster, so their gradient computation may contribute little to the parameter update; near stationary points, a subset of parameters may change very little. In this paper, we exploit parameter convergence to optimize gradient computation in DNN training. We design a lightweight monitoring technique to track parameter convergence, and we stochastically prune the gradient computation for groups of semantically related parameters, exploiting their convergence correlations. These techniques are efficiently implemented in existing GPU kernels. In our evaluation, the optimizations substantially and robustly improve training throughput for four DNN models on three public datasets.
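A minimal sketch of convergence-aware gradient pruning, assuming plain NumPy SGD and an exponential-moving-average convergence monitor; the skip heuristic and the per-group granularity are illustrative assumptions, and the paper prunes the gradient computation itself inside GPU kernels rather than only the update.

    import numpy as np

    rng = np.random.default_rng(0)

    def sgd_step_with_stochastic_pruning(params, grads, ema, lr=0.01,
                                         beta=0.9, tau=1e-3):
        """One SGD step that stochastically skips parameter groups whose
        recent updates have become small (i.e., that look converged)."""
        for key in params:
            # lightweight monitor: EMA of the update magnitude per group
            ema[key] = beta * ema[key] + (1 - beta) * lr * np.linalg.norm(grads[key])
            p_skip = float(np.exp(-ema[key] / tau))   # small updates -> high skip prob.
            if rng.random() < p_skip:
                continue                              # prune this group's work this step
            params[key] -= lr * grads[key]

    # usage with two illustrative parameter groups
    params = {"conv1": rng.normal(size=(3, 3)), "fc": rng.normal(size=(4,))}
    ema = {k: 1.0 for k in params}                    # start as "not converged"
    grads = {k: rng.normal(size=v.shape) for k, v in params.items()}
    sgd_step_with_stochastic_pruning(params, grads, ema)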
{"title":"Convergence-Aware Neural Network Training","authors":"Hyungjun Oh, Yongseung Yu, G. Ryu, Gunjoo Ahn, Yuri Jeong, Yongjun Park, Jiwon Seo","doi":"10.1109/DAC18072.2020.9218518","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218518","url":null,"abstract":"Training a deep neural network(DNN) is expensive, requiring a large amount of computation time. While the training overhead is high, not all computation in DNN training is equal. Some parameters converge faster and thus their gradient computation may contribute little to the parameter update; in nearstationary points a subset of parameters may change very little. In this paper we exploit the parameter convergence to optimize gradient computation in DNN training. We design a light-weight monitoring technique to track the parameter convergence; we prune the gradient computation stochastically for a group of semantically related parameters, exploiting their convergence correlations. These techniques are efficiently implemented in existing GPU kernels. In our evaluation the optimization techniques substantially and robustly improve the training throughput for four DNN models on three public datasets.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134437560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Algorithm/Hardware Co-Design for In-Memory Neural Network Computing with Minimal Peripheral Circuit Overhead
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218657
Hyungjun Kim, Yulhwa Kim, Sungju Ryu, Jae-Joon Kim
We propose an in-memory neural network accelerator architecture called MOSAIC that uses a minimal form of peripheral circuitry: a 1-bit word-line driver replaces the DAC, and a 1-bit sense amplifier replaces the ADC. To map multi-bit neural networks onto the MOSAIC architecture with its 1-bit peripheral circuits, we also propose a bit-splitting method that approximates the original network by separating the bit paths of the multi-bit network so that each bit path can propagate independently throughout the network. Thanks to its minimal peripheral circuitry, MOSAIC achieves an order of magnitude higher energy and area efficiency than previous in-memory neural network accelerators.
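The arithmetic behind bit-splitting can be sketched as follows: a multi-bit matrix-vector product decomposes exactly into 1-bit-by-1-bit products recombined by shift-adds, the kind of work a 1-bit driver and sense-amplifier array can perform. Note that this sketch recombines the bit paths exactly, whereas MOSAIC lets each bit path propagate independently through the network as an approximation.

    import numpy as np

    def bit_planes(v, bits):
        """Decompose a non-negative integer array into binary bit planes."""
        return [((v >> b) & 1).astype(np.int64) for b in range(bits)]

    def binary_split_matvec(x, W, x_bits=4, w_bits=4):
        """Multi-bit matrix-vector product computed only from 1-bit
        operations, recombined by shift-adds (illustrative)."""
        acc = np.zeros(W.shape[1], dtype=np.int64)
        for i, xb in enumerate(bit_planes(x, x_bits)):
            for j, wb in enumerate(bit_planes(W, w_bits)):
                acc += (xb @ wb) << (i + j)   # shift-add recombination
        return acc

    x = np.random.randint(0, 16, size=8)
    W = np.random.randint(0, 16, size=(8, 4))
    assert np.array_equal(binary_split_matvec(x, W), x @ W)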
{"title":"Algorithm/Hardware Co-Design for In-Memory Neural Network Computing with Minimal Peripheral Circuit Overhead","authors":"Hyungjun Kim, Yulhwa Kim, Sungju Ryu, Jae-Joon Kim","doi":"10.1109/DAC18072.2020.9218657","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218657","url":null,"abstract":"We propose an in-memory neural network accelerator architecture called MOSAIC which uses minimal form of peripheral circuits; 1-bit word line driver to replace DAC and 1-bit sense amplifier to replace ADC. To map multi-bit neural networks on MOSAIC architecture which has 1-bit precision peripheral circuits, we also propose a bit-splitting method to approximate the original network by separating each bit path of the multi-bit network so that each bit path can propagate independently throughout the network. Thanks to the minimal form of peripheral circuits, MOSAIC can achieve an order of magnitude higher energy and area efficiency than previous in-memory neural network accelerators.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133949445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

An Efficient EPIST Algorithm for Global Placement with Non-Integer Multiple-Height Cells
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218504
Jianli Chen, Zhipeng Huang, Ye Huang, Wen-xing Zhu, Jun Yu, Yao-Wen Chang
With the increasing design requirements of modern circuits, a standard-cell library often contains cells of different row heights to address various trade-offs among performance, power, and area. However, forcing all standard cells to integer multiples of a single row height can incur area overhead and increase power consumption. In this paper, we present an analytical placer that directly handles circuit designs with non-integer multiple-height standard cells and additional layout constraints. Regions for cells of different heights are generated adaptively from the global placement result. In particular, an exact penalty iterative shrinkage and thresholding (EPIST) algorithm is employed to efficiently optimize the global placement problem. We prove the convergence of the algorithm and propose an acceleration strategy to improve its performance. Experimental results on the 2017 ICCAD CAD Contest benchmarks show that, compared with state-of-the-art works, our algorithm achieves the best wirelength and area on every benchmark. Moreover, the proposed EPIST algorithm opens a new direction for effectively solving the large-scale nonlinear optimization problems with non-smooth terms that often arise in real-world applications.
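For readers unfamiliar with the iterative-shrinkage-and-thresholding family, the sketch below shows generic ISTA on an L1-regularized least-squares problem; it illustrates only the shrinkage-and-thresholding core that EPIST builds on, not the paper’s exact penalty formulation or placement objective.

    import numpy as np

    def soft_threshold(v, t):
        """Proximal operator of t * ||v||_1 (the shrinkage step)."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(A, b, lam=0.1, step=None, iters=500):
        """Generic ISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
        if step is None:
            step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz constant
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            grad = A.T @ (A @ x - b)                 # gradient of the smooth term
            x = soft_threshold(x - step * grad, step * lam)
        return x

    # toy usage: recover a sparse x from underdetermined measurements
    A = np.random.default_rng(0).normal(size=(40, 100))
    x_true = np.zeros(100)
    x_true[:5] = 3.0
    x_hat = ista(A, A @ x_true, lam=0.1)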
{"title":"An Efficient EPIST Algorithm for Global Placement with Non-Integer Multiple-Height Cells *","authors":"Jianli Chen, Zhipeng Huang, Ye Huang, Wen-xing Zhu, Jun Yu, Yao-Wen Chang","doi":"10.1109/DAC18072.2020.9218504","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218504","url":null,"abstract":"With the increasing design requirements of modern circuits, a standard-cell library often contains cells of different row heights to address various trade-offs among performance, power, and area. However, maintaining all standard cells with integer multiples of a single-row height could cause some area overheads and increase power consumption. In this paper, we present an analytical placer to directly consider a circuit design with non-integer multiple-height standard cells and additional layout constraints. The region of different cell heights is adaptively generated by the global placement result. In particular, an exact penalty iterative shrinkage and thresholding (EPIST) algorithm is employed to efficiently optimize the global placement problem. The convergence of the algorithm is proved, and the acceleration strategy is proposed to improve the performance of our algorithm. Compared with the state-of-the-art works, experimental results based on the 2017 CAD Contest at ICCAD benchmarks show that our algorithm achieves the best wirelength and area for every benchmark. In particular, our proposed EPIST algorithm provides a new direction for effectively solving large-scale nonlinear optimization problems with non-smooth terms, which are often seen in real-world applications.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132779115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Learning to Quantize Deep Neural Networks: A Competitive-Collaborative Approach
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218576
Md Fahim Faysal Khan, Mohammad Mahdi Kamani, M. Mahdavi, V. Narayanan
Neural network quantization, which reduces model size and computation cost for dedicated AI accelerator designs, has recently attracted considerable attention. Unfortunately, merely minimizing quantization loss with a constant discretization causes accuracy to deteriorate. In this paper, we propose competitive-collaborative quantization (CCQ), an iterative accuracy-driven learning framework that gradually adapts the bit precision of each individual layer. Unlike prior quantization policies that keep the first and last layers of the network at full precision, CCQ offers layer-wise competition under any target quantization policy, with holistic layer fine-tuning to recover accuracy, so that state-of-the-art networks can be quantized entirely without significant accuracy degradation.
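As a sketch of how layer-wise competition might drive bit precision down, the code below greedily lowers one layer’s precision per round, choosing whichever layer hurts a user-supplied loss the least; the quantizer, schedule, and loss are illustrative assumptions, and the actual CCQ framework interleaves fine-tuning to recover accuracy.

    import numpy as np

    def quantize(w, bits):
        """Uniform symmetric quantization of a weight tensor to `bits`."""
        scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1) or 1.0
        return np.round(w / scale) * scale

    def ccq_style_schedule(layers, eval_loss, target_bits=4, start_bits=8):
        """Greedy layer-wise bit lowering: each round, the layer whose
        precision drop raises eval_loss least wins the competition and is
        quantized one bit further (illustrative, not the exact algorithm)."""
        bits = {name: start_bits for name in layers}
        while any(b > target_bits for b in bits.values()):
            candidates = {}
            for name, w in layers.items():
                if bits[name] <= target_bits:
                    continue
                trial = dict(layers)                  # shallow copy; read-only use
                trial[name] = quantize(w, bits[name] - 1)
                candidates[name] = eval_loss(trial)
            winner = min(candidates, key=candidates.get)
            bits[winner] -= 1
            layers[winner] = quantize(layers[winner], bits[winner])
        return bits

    # toy usage: loss = reconstruction error against the original weights
    rng = np.random.default_rng(0)
    orig = {"l1": rng.normal(size=(8, 8)), "l2": rng.normal(size=(8, 8))}
    layers = {k: v.copy() for k, v in orig.items()}
    loss = lambda trial: sum(np.abs(trial[k] - orig[k]).sum() for k in orig)
    print(ccq_style_schedule(layers, loss))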
{"title":"Learning to Quantize Deep Neural Networks: A Competitive-Collaborative Approach","authors":"Md Fahim Faysal Khan, Mohammad Mahdi Kamani, M. Mahdavi, V. Narayanan","doi":"10.1109/DAC18072.2020.9218576","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218576","url":null,"abstract":"Reducing the model size and computation costs for dedicated AI accelerator designs, neural network quantization methods have attracted momentous attention recently. Unfortunately, merely minimizing quantization loss using constant discretization causes accuracy deterioration. In this paper, we propose an iterative accuracy-driven learning framework of competitive-collaborative quantization (CCQ) to gradually adapt the bit-precision of each individual layer. Orthogonal to prior quantization policies working with full precision for the first and last layers of the network, CCQ offers layer-wise competition for any target quantization policy with holistic layer fine-tuning to recover accuracy, where the state-of-the-art networks can be entirely quantized without any significant accuracy degradation.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130683740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Eliminating Redundant Computation in Noisy Quantum Computing Simulation
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218666
Gushu Li, Yufei Ding, Yuan Xie
Simulating noisy quantum computing (QC) on a classical machine is very time consuming because it requires Monte Carlo simulation with a large number of error-injection trials to model the effect of random noise. Orthogonal to existing QC simulation optimizations, we aim to accelerate simulation by eliminating redundant computation across these Monte Carlo trials. We observe that the intermediate states of many trials are often identical: once such states are computed in one trial, they can be temporarily stored and reused by other trials. However, storing these states can consume significant memory. To leverage shared intermediate states without excessive storage overhead, we statically generate and analyze the Monte Carlo trials before the actual simulation. The trials are reordered to maximize the overlapping computation between consecutive trials, and states that cannot be reused in follow-up trials are dropped, so only a few states need to be stored. Experimental results show that the proposed scheme saves 80% of the computation on average while storing only a small number of state vectors. Moreover, the scheme scales well: more computation can be saved with more simulation trials or on future QC devices with lower error rates.
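The reordering idea can be sketched independently of any quantum simulator: if each trial is encoded as its sequence of per-gate error choices, sorting the trials lexicographically maximizes the prefix shared between consecutive trials, and each shared prefix is computation a cached state vector lets us skip. The trial encoding and single-cached-state policy here are illustrative assumptions.

    import random

    def reorder_trials(trials):
        """Sort error-injection trials so that consecutive trials share
        the longest possible prefix of identical (gate, error) steps."""
        return sorted(trials)

    def common_prefix(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def saved_gate_applications(trials):
        """Gate applications skipped by resuming each trial from the
        deepest state shared with the previous trial."""
        ordered = reorder_trials(trials)
        return sum(common_prefix(p, q) for p, q in zip(ordered, ordered[1:]))

    # trials: each a tuple of per-gate error choices, 0 = no error injected
    random.seed(1)
    trials = [tuple(random.choice([0, 0, 0, 1]) for _ in range(20))
              for _ in range(1000)]
    print(saved_gate_applications(trials), "of", 20 * 1000, "gate steps saved")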
{"title":"Eliminating Redundant Computation in Noisy Quantum Computing Simulation","authors":"Gushu Li, Yufei Ding, Yuan Xie","doi":"10.1109/DAC18072.2020.9218666","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218666","url":null,"abstract":"Noisy Quantum Computing (QC) simulation on a classical machine is very time consuming since it requires Monte Carlo simulation with a large number of error-injection trials to model the effect of random noises. Orthogonal to existing QC simulation optimizations, we aim to accelerate the simulation by eliminating the redundant computation among those Monte Carlo simulation trials. We observe that the intermediate states of many trials can often be the same. Once these states are computed in one trial, they can be temporarily stored and reused in other trials. However, storing such states will consume significant memory space. To leverage the shared intermediate states without introducing too much storage overhead, we propose to statically generate and analyze the Monte Carlo simulation simulation trials before the actual simulation. Those trials are reordered to maximize the overlapped computation between two consecutive trials. The states that cannot be reused in follow-up simulation are dropped, so that we only need to store a few states. Experiment results show that the proposed optimization scheme can save on average 80% computation with only a small number of state vectors stored. In addition, the proposed simulation scheme demonstrates great scalability as more computation can be saved with more simulation trials or on future QC devices with reduced error rates.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130979422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Don’t-Care-Based Node Minimization for Threshold Logic Networks
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218506
Yung-Chih Chen, Hao-Ju Chang, Li-Cheng Zheng
Threshold logic has recently re-attracted researchers’ attention due to advances in hardware realization techniques and its applications to deep learning. In the past decade, several design automation techniques for threshold logic have been proposed, such as logic synthesis and logic optimization. Although they are effective, threshold logic network (TLN) optimization based on don’t cares has not been well studied. In this paper, we propose a don’t-care-based node minimization scheme for TLNs. We first present a sufficient condition for don’t cares to exist and a logic-implication-based method to identify the don’t cares of a threshold logic gate (TLG). We then transform the problem of TLG minimization with don’t cares into an integer linear programming (ILP) problem and present a method to compute the necessary constraints for the ILP formulation. We apply the proposed scheme to two sets of TLNs generated by a state-of-the-art synthesis technique. Experimental results show that, for the two sets, it achieves average area reductions of 11% and 19%, measured as the sum of weights and threshold values, without any overhead in TLG count or logic depth. Additionally, it completes the optimization of most TLNs within one minute.
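The optimization objective can be sketched with a tiny exhaustive search standing in for the paper’s ILP: find integer weights and a threshold of minimum total cost that respect the on-set and off-set, while don’t-care vectors stay unconstrained. The example gate and cost bound are illustrative assumptions.

    from itertools import product

    def minimize_tlg(n, onset, offset, max_w=4):
        """Exhaustive stand-in for the ILP: minimize sum(w) + T such that
          sum(w[i]*x[i]) >= T      for every on-set vector x,
          sum(w[i]*x[i]) <= T - 1  for every off-set vector x,
        with don't-care vectors (in neither set) left unconstrained."""
        best = None
        for w in product(range(max_w + 1), repeat=n):
            for T in range(1, n * max_w + 1):
                ok_on = all(sum(wi * xi for wi, xi in zip(w, x)) >= T
                            for x in onset)
                ok_off = all(sum(wi * xi for wi, xi in zip(w, x)) <= T - 1
                             for x in offset)
                if ok_on and ok_off:
                    cost = sum(w) + T
                    if best is None or cost < best[0]:
                        best = (cost, w, T)
        return best

    # f = x0*x1 + x2, with the remaining input vectors treated as don't cares
    onset = [(1, 1, 0), (0, 0, 1)]
    offset = [(1, 0, 0), (0, 1, 0), (0, 0, 0)]
    print(minimize_tlg(3, onset, offset))   # -> (6, (1, 1, 2), 2)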
{"title":"Don’t-Care-Based Node Minimization for Threshold Logic Networks","authors":"Yung-Chih Chen, Hao-Ju Chang, Li-Cheng Zheng","doi":"10.1109/DAC18072.2020.9218506","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218506","url":null,"abstract":"Threshold logic re-attracts researchers’ attention recently due to the advancement of hardware realization techniques and its applications to deep learning. In the past decade, several design automation techniques for threshold logic have been proposed, such as logic synthesis and logic optimization. Although they are effective, threshold logic network (TLN) optimization based on don’t cares has not been well studied. In this paper, we propose a don’t-care-based node minimization scheme for TLNs. We first present a sufficient condition for don’t cares to exist and a logic-implication-based method to identify the don’t cares of a threshold logic gate (TLG). Then, we transform the problem of TLG minimization with don’t cares to an integer linear programming problem, and present a method to compute the necessary constraints for the ILP formulation. We apply the proposed optimization scheme to two set of TLNs generated by the state-of-the-art synthesis technique. The experimental results show that, for the two sets, it achieves an average of 11% and 19% of area reduction in terms of the sum of the weights and threshold values without overhead on the TLG count and logic depth. Additionally, it completes the optimization of most TLNs within one minute.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"417 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132449905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Improving the Concurrency Performance of Persistent Memory Transactions on Multicores
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218554
Qing Wang, Youyou Lu, Zhongjie Wu, Fan Yang, J. Shu
Persistent memory provides data persistence to in-memory transaction systems, enabling full ACID properties. However, strong data persistence hurts concurrency performance because conflicting transactions on multicores must delay execution. In this paper, we propose SP3 (SPeculative Parallel Persistence) to improve the concurrency performance of persistent memory transactions. SP3 tracks the dependencies between transactions in a DAG (directed acyclic graph) by detecting conflicts in their read/write sets, and speculatively executes conflicting transactions without waiting for data persistence to complete. Evaluation shows that SP3 significantly improves concurrency performance and achieves almost linear scalability on most evaluated workloads.
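A sketch of the dependency tracking, assuming each transaction exposes read and write sets of keys; the DAG construction and ready-set computation below are illustrative, and SP3’s actual mechanism additionally manages speculation state and persistence ordering.

    def build_dependency_dag(txns):
        """Conflict DAG: an edge i -> j means the later txn j conflicts
        with txn i and must observe its effects.
        txns: list of (read_set, write_set) pairs, in arrival order."""
        edges = {i: [] for i in range(len(txns))}
        for j, (rj, wj) in enumerate(txns):
            for i in range(j):
                ri, wi = txns[i]
                # RAW, WAR, or WAW conflict => dependency edge
                if (wi & rj) or (ri & wj) or (wi & wj):
                    edges[i].append(j)
        return edges

    def runnable_now(edges, done):
        """Transactions whose predecessors have all committed can run
        speculatively, without waiting on unrelated persistence."""
        indeg = {j: 0 for j in edges}
        for i, succs in edges.items():
            for j in succs:
                if i not in done:
                    indeg[j] += 1
        return [j for j, d in indeg.items() if d == 0 and j not in done]

    txns = [({"a"}, {"b"}), ({"b"}, {"c"}), ({"x"}, {"y"})]
    dag = build_dependency_dag(txns)
    print(dag, runnable_now(dag, done=set()))   # txns 0 and 2 may run now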
{"title":"Improving the Concurrency Performance of Persistent Memory Transactions on Multicores","authors":"Qing Wang, Youyou Lu, Zhongjie Wu, Fan Yang, J. Shu","doi":"10.1109/DAC18072.2020.9218554","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218554","url":null,"abstract":"Persistent memory provides data persistence to in-memory transaction systems, enabling full ACID properties. However, high data persistence worsens the concurrency performance due to delayed execution of conflicted transactions on multicores. In this paper, we propose SP 3 (SPeculative Parallel Persistence) to improve the concurrency performance of persistent memory transactions. SP3 keeps the dependencies between different transactions in a DAG (direct acyclic graph) by detecting conflicts in the read/write sets, and speculatively executes conflicted transactions without waiting for the completeness of data persistence. Evaluation shows that SP3 significantly improves concurrency performance and achieves almost linear scalability in most evaluated workloads.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127389786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Vehicular and Edge Computing for Emerging Connected and Autonomous Vehicle Applications
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218618
S. Baidya, Yu-Jen Ku, Hengyu Zhao, Jishen Zhao, S. Dey
Emerging connected and autonomous vehicles run complex applications that require not only optimal computing resource allocation but also efficient computing architectures. In this paper, we lay out the critical performance metrics for emerging vehicular computing applications and show, with preliminary experimental results, how optimal choices can be made to satisfy static and dynamic computing requirements in terms of these metrics. We also discuss the feasibility of edge computing architectures for vehicular computing and show the trade-offs among different offloading strategies. The paper points to directions for lightweight, high-performance, and low-power computing paradigms, architectures, and design-space exploration tools that can satisfy the evolving applications and requirements of connected and autonomous vehicles.
{"title":"Vehicular and Edge Computing for Emerging Connected and Autonomous Vehicle Applications","authors":"S. Baidya, Yu-Jen Ku, Hengyu Zhao, Jishen Zhao, S. Dey","doi":"10.1109/DAC18072.2020.9218618","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218618","url":null,"abstract":"Emerging connected and autonomous vehicles involve complex applications requiring not only optimal computing resource allocations but also efficient computing architectures. In this paper, we unfold the critical performance metrics required for emerging vehicular computing applications and show with preliminary experimental results, how optimal choices can be made to satisfy the static and dynamic computing requirements in terms of the performance metrics. We also discuss the feasibility of edge computing architectures for vehicular computing and show tradeoffs for different offloading strategies. The paper shows directions for light weight, high performance and low power computing paradigms, architectures and design-space exploration tools to satisfy evolving applications and requirements for connected and autonomous vehicles.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115456729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A Versatile and Flexible Chiplet-based System Design for Heterogeneous Manycore Architectures
Pub Date: 2020-07-01  DOI: 10.1109/DAC18072.2020.9218654
Hao Zheng, Ke Wang, A. Louri
Heterogeneous manycore architectures are deployed to run multiple diverse applications simultaneously. This requires various computing capabilities (CPUs, GPUs, and accelerators) and an efficient network-on-chip (NoC) architecture that can concurrently handle diverse application communication behavior. However, supporting the concurrent communication requirements of diverse applications is challenging due to dynamic application mapping, the complexity of handling distinct communication patterns, and limited on-chip resources. In this paper, we propose Adapt-NoC, a versatile and flexible NoC architecture for chiplet-based manycore architectures, consisting of adaptable routers and links. Adapt-NoC can dynamically allocate disjoint regions of the NoC, called subNoCs, to concurrently running applications, each of which can be optimized for different communication behavior. The adaptable routers and links can provide various subNoC topologies, satisfying the latency and bandwidth requirements of various traffic patterns (e.g., all-to-all, one-to-many). Full-system simulation shows that Adapt-NoC achieves, on average, a 31% latency reduction, 24% energy saving, and 10% execution time reduction compared to prior designs.
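One way to picture subNoC allocation is as placing disjoint rectangles of routers on the chiplet grid, one per application; the first-fit policy below is purely illustrative and is not the paper’s allocation algorithm.

    import numpy as np

    def allocate_subnoc(occupied, w, h):
        """First-fit allocation of a w x h rectangular subNoC on the
        router grid; returns the top-left corner, or None if it does
        not fit (illustrative policy)."""
        R, C = occupied.shape
        for r in range(R - h + 1):
            for c in range(C - w + 1):
                if not occupied[r:r + h, c:c + w].any():
                    occupied[r:r + h, c:c + w] = True   # claim the region
                    return (r, c)
        return None

    grid = np.zeros((8, 8), dtype=bool)       # 8x8 router grid, all free
    print(allocate_subnoc(grid, 4, 2))        # -> (0, 0)
    print(allocate_subnoc(grid, 3, 3))        # -> (0, 4), disjoint from the first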
{"title":"A Versatile and Flexible Chiplet-based System Design for Heterogeneous Manycore Architectures","authors":"Hao Zheng, Ke Wang, A. Louri","doi":"10.1109/DAC18072.2020.9218654","DOIUrl":"https://doi.org/10.1109/DAC18072.2020.9218654","url":null,"abstract":"Heterogeneous manycore architectures are deployed to simultaneously run multiple and diverse applications. This requires various computing capabilities (CPUs, GPUs, and accelerators), and an efficient network-on-chip (NoC) architecture to concurrently handle diverse application communication behavior. However, supporting the concurrent communication requirements of diverse applications is challenging due to the dynamic application mapping, the complexity of handling distinct communication patterns and limited on-chip resources. In this paper, we propose Adapt-NoC, a versatile and flexible NoC architecture for chiplet-based manycore architectures, consisting of adaptable routers and links. Adapt-NoC can dynamically allocate disjoint regions of the NoC, called subNoCs, for concurrently-running applications, each of which can be optimized for different communication behavior. The adaptable routers and links are capable of providing various subNoC topologies, satisfying different latency and bandwidth requirements of various traffic patterns (e.g. all-to-all, one-to-many). Full system simulation shows that AdaptNoC can achieve 31% latency reduction, 24% energy saving and 10% execution time reduction on average, when compared to prior designs.","PeriodicalId":428807,"journal":{"name":"2020 57th ACM/IEEE Design Automation Conference (DAC)","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115020590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}