Power line communication for hybrid power/signal pin SOC design
Pub Date: 2015-06-06 | DOI: 10.1109/SLIP.2015.7171711
Xiang Zhang, Yang Liu, R. Coutts, Chung-Kuan Cheng
The number of pins available in the ball grid array (BGA) of modern system-on-chips (SOCs) has been identified as one of the major bottlenecks to processor performance, for example in many-core portable devices where the package size and PCB floorplan are tightly constrained. A typical SOC package allocates more than half of its pins to power delivery, greatly reducing the number of IO pins left for off-chip communication. We observe that the required number of power and ground (P/G) pins is driven by the highest performance state and the worst design corners, whereas SOCs spend most of their time in lower performance states to extend battery life. Based on this observation, we propose to reuse some of the power pins as dynamic power/signal pins for off-chip data transmission, increasing off-chip bandwidth during low-performance SOC states. Our proposed method provides 20 Gbps of bandwidth per hybrid pin pair while having minimal impact on the original power delivery network (PDN) design.
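As a rough illustration of the pin-reuse idea, the sketch below (not from the paper; the package parameters and per-pin current limit are assumed placeholders, and only the 20 Gbps-per-pair figure comes from the abstract) estimates how much extra signaling bandwidth becomes available in a given performance state once pins not needed for that state's worst-case supply current are released as hybrid pairs.

```python
"""Hedged sketch, not from the paper: how many power pins could be released as
hybrid signal pins in a given SOC performance state."""

import math
from dataclasses import dataclass

# Assumed package parameters (placeholders).
TOTAL_POWER_PINS = 400        # P/G pins dedicated to one supply rail
MAX_CURRENT_PER_PIN_A = 0.25  # assumed per-pin current limit
GBPS_PER_HYBRID_PAIR = 20.0   # bandwidth per hybrid pin pair (from the abstract)

@dataclass
class PerfState:
    name: str
    max_current_a: float      # worst-case supply current in this state (A)

def extra_bandwidth_gbps(state: PerfState) -> float:
    """Extra off-chip bandwidth available when unused power pins become signal pins."""
    pins_needed = math.ceil(state.max_current_a / MAX_CURRENT_PER_PIN_A)
    pins_free = max(0, TOTAL_POWER_PINS - pins_needed)
    hybrid_pairs = pins_free // 2          # two pins form one signaling pair
    return hybrid_pairs * GBPS_PER_HYBRID_PAIR

if __name__ == "__main__":
    for s in (PerfState("turbo", 95.0), PerfState("nominal", 60.0), PerfState("idle", 12.0)):
        print(f"{s.name:8s}: +{extra_bandwidth_gbps(s):7.1f} Gbps of hybrid-pin bandwidth")
```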
{"title":"Power line communication for hybrid power/signal pin SOC design","authors":"Xiang Zhang, Yang Liu, R. Coutts, Chung-Kuan Cheng","doi":"10.1109/SLIP.2015.7171711","DOIUrl":"https://doi.org/10.1109/SLIP.2015.7171711","url":null,"abstract":"The number of available pins in ball grid array (BGA) of modern system-on-chips (SOCs) has been discussed as one of the major bottlenecks to the performance of the processors, for example many-core enabled portable devices, where the package size and PCB floorplan are tightly constrained. A typical SOC package allocates more than half of the pins for power delivery, resulting in the number of IO pins for off-chip communications is greatly reduced. We observe that the requirement for the number of power and ground (P/G) pins is driven by the highest performance state and the worst design corners, while SOCs are in lower performance state for most of the time for longer battery life. Under this observation, we propose to reuse some of the power pins as dynamic power/signal pins for off-chip data transmissions to increase the off-chip bandwidth during SOC low performance state. Our proposed method provides 20Gbps bandwidth per hybrid pin pair, while providing minimum impact to the original power delivery network (PDN) design.","PeriodicalId":431489,"journal":{"name":"2015 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)","volume":"75 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134426906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On fast timing closure: speeding up incremental path-based timing analysis with MapReduce
Pub Date: 2015-06-06 | DOI: 10.1109/SLIP.2015.7171710
Tsung-Wei Huang, Martin D. F. Wong
Incremental path-based timing analysis (PBA) is a pivotal step in the timing optimization flow. A core building block analyzes the timing path by path subject to a critical amount of incremental changes on the design. However, this process is inherently computationally expensive and has been a major bottleneck in accelerating timing closure. Therefore, we introduce in this paper a fast and scalable algorithm for incremental PBA with MapReduce, a programming paradigm that has become popular in the big-data era. Inspired by the spirit of MapReduce, we formulate our problem as tasks associated with keys and values and perform massively parallel map and reduce operations on a distributed system. Experimental results demonstrate that our approach can not only analyze huge designs in a few minutes but also quickly revalidate timing after incremental changes. Our results are beneficial for speeding up the lengthy design cycle of timing closure.
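The abstract does not spell out the exact key/value formulation, so the following is a hedged, minimal sketch of how a path-based timing query could be cast as map and reduce steps: the (assumed) keys are path endpoints, the values are candidate path slacks, and the reducer keeps the most critical path per endpoint.

```python
"""Hedged sketch (illustrative, not the paper's exact formulation): a path query
expressed as map, shuffle, and reduce over key/value pairs."""

from collections import defaultdict

# Hypothetical input after an incremental change:
# (endpoint, path_id, arrival_time_ps, required_time_ps)
affected_paths = [
    ("FF1/D", "p0", 812.0, 900.0),
    ("FF1/D", "p1", 871.5, 900.0),
    ("FF2/D", "p2", 640.2, 700.0),
]

def map_phase(records):
    """Map: emit (key=endpoint, value=(path_id, slack)) pairs."""
    for endpoint, path_id, at, rat in records:
        yield endpoint, (path_id, rat - at)

def shuffle(pairs):
    """Group values by key, as a MapReduce runtime would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: per endpoint, keep the path with the smallest (most critical) slack."""
    return {ep: min(vals, key=lambda v: v[1]) for ep, vals in grouped.items()}

if __name__ == "__main__":
    critical = reduce_phase(shuffle(map_phase(affected_paths)))
    for endpoint, (path_id, slack) in sorted(critical.items()):
        print(f"{endpoint}: critical path {path_id}, slack {slack:.1f} ps")
```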
{"title":"On fast timing closure: speeding up incremental path-based timing analysis with mapreduce","authors":"Tsung-Wei Huang, Martin D. F. Wong","doi":"10.1109/SLIP.2015.7171710","DOIUrl":"https://doi.org/10.1109/SLIP.2015.7171710","url":null,"abstract":"Incremental path-based timing analysis (PBA) is a pivotal step in the timing optimization flow. A core building block analyzes the timing path-by-path subject to a critical amount of incremental changes on the design. However, this process in nature demands an extremely high computational complexity and has been a major bottleneck in accelerating timing closure. Therefore, we introduce in this paper a fast and scalable algorithm of incremental PBA with MapReduce - a recently popular programming paradigm in big-data era. Inspired by the spirit of MapReduce, we formulate our problem into tasks that are associated with keys and values and perform massively-parallel map and reduce operations on a distributed system. Experimental results demonstrated that our approach can not only easily analyze huge deisgns in a few minutes, but also quickly revalidate the timing after the incremental changes. Our results are beneficial for speeding up the lengthy design cycle of timing closure.","PeriodicalId":431489,"journal":{"name":"2015 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125920750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lynx: a self-organizing wireless sensor network with commodity palmtop computers
Pub Date: 2015-06-06 | DOI: 10.1109/SLIP.2015.7171712
Haifeng Xu, M. Bilec, William O. Collinge, L. Schaefer, A. Landis, A. Jones
While the embedded-class processors found in commodity palmtop computers continue to become increasingly capable, their various wireless connectivity functions provide new opportunities for designing more flexible yet smarter wireless sensor networks (WSNs) and for utilizing their computational power in ways not previously possible. Lynx, our self-organizing wireless sensor network (SOWSN), is a further step toward exploiting the potential of palmtop computers. Fundamental functionalities such as automatic neighbor relation detection, link-state maintenance, sensor integration, and multihop routing together make a real-world, distributively managed WSN implementation work well. By combining Lynx with Ocelot, our mobile distributed computing engine, sensor nodes can collect, record, process, and send data without any central server support. The combined Lynx and Ocelot system achieves significant energy savings compared to traditional power-hungry computing platforms such as BOINC when performing the same tasks.
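As a generic illustration of one of the listed building blocks (not Lynx's actual protocol; the beacon handling and timeout are assumptions), a beacon-driven neighbor table with link-state expiry might look like this:

```python
"""Hedged sketch: beacon-based neighbor relation detection with link-state expiry.
The message handling and NEIGHBOR_TIMEOUT_S value are illustrative assumptions."""

import time

NEIGHBOR_TIMEOUT_S = 30.0      # assumed link-state expiry window

class NeighborTable:
    def __init__(self):
        self._last_seen = {}   # node_id -> timestamp of last beacon heard

    def on_beacon(self, node_id):
        """Record a beacon heard from a neighboring node."""
        self._last_seen[node_id] = time.monotonic()

    def live_neighbors(self):
        """Return neighbors whose link state has not expired."""
        now = time.monotonic()
        return [n for n, t in self._last_seen.items() if now - t < NEIGHBOR_TIMEOUT_S]

if __name__ == "__main__":
    table = NeighborTable()
    table.on_beacon("node-7")
    table.on_beacon("node-12")
    print("live neighbors:", table.live_neighbors())
```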
{"title":"Lynx: a self-organizing wireless sensor network with commodity palmtop computers","authors":"Haifeng Xu, M. Bilec, William O. Collinge, L. Schaefer, A. Landis, A. Jones","doi":"10.1109/SLIP.2015.7171712","DOIUrl":"https://doi.org/10.1109/SLIP.2015.7171712","url":null,"abstract":"While the embedded class processors found in commodity palmtop computers continue to become increasingly capable, various wireless connectivity functions on them provide new opportunities in designing more flexible yet smarter wireless sensor networks (WSNs), and utilizing the computation power in a way we could never imagine before. Designing Lynx, a selforganizing wireless sensor network (SOWSN), is our further step taken in exploiting the potential of palmtop computers. Fundamental functionalities such as automatic neighbor relation detection, link state maintenance, sensor integration, and multihop routing, together make a real world distributively managed WSN system implementation work quite well. And by combining with Ocelot, our mobile distributed computing engine, sensor nodes are now capable of collecting, recording, processing and sending data without any central server support. Significant energy saving is achieved by the Lynx and Ocelot combined system, compare to traditional power-hungry computer platforms such as BOINC when doing same tasks.","PeriodicalId":431489,"journal":{"name":"2015 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130350229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smart I/Os: a data-pattern aware 2.5D interconnect with space-time multiplexing
Pub Date: 2015-06-06 | DOI: 10.1109/SLIP.2015.7171707
Sai Manoj Pudukotai Dinakarrao, Kanwen Wang, Hantao Huang, Hao Yu
A data-pattern aware smart I/O is introduced in this paper for 2.5D through-silicon interposer (TSI) interconnect-based memory-logic integration. To match the huge many-core bandwidth demand with the limited supply of 2.5D I/O channels when accessing one shared memory, a space-time multiplexing based channel utilisation scheme is developed inside the memory controller to reuse 2.5D I/O channels. Cores are adaptively classified into clusters based on their bandwidth demand by space multiplexing to access the shared memory. Time multiplexing is then performed to schedule the cores in one cluster to occupy the supplied 2.5D I/O channels in different time slots according to priority. The proposed smart 2.5D TSI I/O is verified with a system-level simulator on benchmark workloads and shows up to 58.85% bandwidth balancing and 11.90% QoS improvement.
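A minimal sketch of the space-time multiplexing idea is given below; the clustering rule, slot counts, and priority order are illustrative assumptions rather than the paper's scheduler.

```python
"""Hedged sketch: cluster cores by bandwidth demand (space step), then rotate each
cluster's cores over the available 2.5D I/O channels by time slot (time step)."""

def space_multiplex(core_demands_gbps, num_clusters=3):
    """Space step: sort cores by demand and split them into demand-ordered clusters."""
    ranked = sorted(core_demands_gbps.items(), key=lambda kv: kv[1], reverse=True)
    clusters = [[] for _ in range(num_clusters)]
    for i, (core, _) in enumerate(ranked):
        clusters[i * num_clusters // len(ranked)].append(core)
    return clusters

def time_multiplex(cluster, num_channels, num_slots):
    """Time step: round-robin the cluster's cores over channels across time slots,
    serving earlier (higher-priority) cores first."""
    schedule = []  # one entry per slot: list of (channel, core)
    for slot in range(num_slots):
        assignment = []
        for ch in range(num_channels):
            core = cluster[(slot * num_channels + ch) % len(cluster)]
            assignment.append((ch, core))
        schedule.append(assignment)
    return schedule

if __name__ == "__main__":
    demands = {f"core{i}": bw for i, bw in enumerate([9.5, 7.2, 6.8, 3.1, 2.4, 1.0])}
    clusters = space_multiplex(demands)
    print("clusters:", clusters)
    print("slots for cluster 0:", time_multiplex(clusters[0], num_channels=2, num_slots=2))
```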
Multi-product floorplan and uncore design framework for chip multiprocessors
Pub Date: 2015-06-06 | DOI: 10.1109/SLIP.2015.7171713
M. Escalante, A. Kahng, M. Kishinevsky, Ümit Y. Ogras, K. Samadi
Chip multiprocessors (CMPs) for the server and high-performance computing markets are offered in multiple classes to satisfy various power, performance, and cost requirements. As the number of processor cores on a single die grows, resources outside the “core”, such as the distributed last-level cache, on-chip memory controllers, and the network-on-chip (NoC) interconnecting these resources, which together constitute the “uncore”, play an increasingly important role. While it is crucial to optimize the floorplan and uncore of each product class to achieve the best power-performance tradeoff, independent optimization may greatly increase the design effort and undermine the savings ultimately achieved with a given total amount of optimization effort. This paper presents a novel multi-product optimization framework for next-generation CMPs. Unlike traditional chip optimization techniques, we optimize the floorplans of multiple product classes at once and ensure that the smaller floorplans can be obtained from larger ones by optimally removing, i.e., chopping, the unused parts.
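The choppability constraint can be pictured with a toy tile-grid model (an assumption for illustration, not the paper's formulation): a smaller product's floorplan must be reachable from the larger one by deleting whole rows or columns of tiles the smaller product does not use.

```python
"""Hedged sketch: checking that a smaller product's floorplan is a "chop" of the
larger one. The tile labels and grids below are hypothetical."""

def chop(grid, keep_rows, keep_cols):
    """Remove the rows/columns not listed in keep_rows/keep_cols from a tile grid."""
    return [[grid[r][c] for c in keep_cols] for r in keep_rows]

# Hypothetical 3x4 tile grid of a large CMP: cores (C), LLC slices (L),
# memory controllers (M), and uncore/NoC tiles (N).
big = [
    ["C", "L", "C", "M"],
    ["N", "N", "N", "N"],
    ["C", "L", "C", "M"],
]

# A smaller derivative that drops one column of cores.
small_target = [
    ["C", "L", "M"],
    ["N", "N", "N"],
    ["C", "L", "M"],
]

derived = chop(big, keep_rows=[0, 1, 2], keep_cols=[0, 1, 3])
print("small product is a chop of the large one:", derived == small_target)
```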
{"title":"Multi-product floorplan and uncore design framework for chip multiprocessors","authors":"M. Escalante, A. Kahng, M. Kishinevsky, Ümit Y. Ogras, K. Samadi","doi":"10.1109/SLIP.2015.7171713","DOIUrl":"https://doi.org/10.1109/SLIP.2015.7171713","url":null,"abstract":"Chip multiprocessors (CMPs) for server and high-performance computing markets are offered in multiple classes to satisfy various power, performance and cost requirements. As the number of processor cores on a single die grows, resources outside the “core”, such as the distributed last-level cache, on-chip memory controllers and network-on-chip (NoC) interconnecting these resources, which constitute the “uncore”, play an increasingly important role. While it is crucial to optimize the floorplan and uncore of each product class to achieve the best power-performance tradeoff, independent optimization may greatly increase the design effort, and undermine the savings ultimately achieved with a given total amount of optimization effort. This paper presents a novel multi-product optimization framework for next generation CMPs. Unlike traditional chip optimization techniques, we optimize the floorplan of multiple product classes at once, and ensure that the smaller floorplans can be obtained from larger ones by optimally removing, i.e., chopping, the unused parts.","PeriodicalId":431489,"journal":{"name":"2015 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127668244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compact modeling and system implications of microring modulators in nanophotonic interconnects
Pub Date: 2015-06-06 | DOI: 10.1109/SLIP.2015.7171708
Rui Wu, Chin-Hui Chen, J. Fédéli, M. Fournier, R. Beausoleil, K. Cheng
Silicon microring modulators are critical components in on-chip optical communication. In this paper, we develop theoretical compact models for the optical transmission, power consumption, bit-error rate (BER), and electrical tuning of microring modulators. The proposed models have been extensively validated against fabricated devices from a number of designs and fabrication batches. Since the quality factor (Q) and the extinction ratio (ER) of the microring modulator are important in determining the BER and link power budget, we include accurate equations for Q and ER in our models. Based on the proposed models, we identify an extra power penalty for electrical tuning and an energy-efficient swing voltage at which the microring modulator achieves minimum total energy consumption.
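To illustrate the kind of trade-off behind the energy-efficient swing voltage, the sketch below uses a deliberately simplified stand-in for the compact models: dynamic modulation energy grows with the swing, while the laser power penalty of a low extinction ratio shrinks with it. All device constants are placeholders, not values from the paper.

```python
"""Hedged sketch: sweep the modulator swing voltage and find the minimum of a toy
energy-per-bit model (CV^2 driver energy plus a finite-ER laser power penalty)."""

C_MOD_F = 20e-15        # assumed modulator junction capacitance (F)
BIT_RATE = 10e9         # assumed bit rate (b/s)
LASER_BASE_W = 1e-3     # assumed laser power at very high extinction ratio (W)
ER_DB_PER_VOLT = 6.0    # assumed extinction ratio gained per volt of swing (dB/V)

def energy_per_bit(v_swing):
    # Dynamic modulation energy, roughly (1/4) C V^2 per bit for an NRZ driver.
    e_mod = 0.25 * C_MOD_F * v_swing ** 2
    # Laser power penalty for a finite extinction ratio ER: P = P0 * (ER + 1) / (ER - 1).
    er = 10 ** (ER_DB_PER_VOLT * v_swing / 10)
    e_laser = LASER_BASE_W * (er + 1) / (er - 1) / BIT_RATE
    return e_mod + e_laser

if __name__ == "__main__":
    voltages = [0.5 + 0.1 * i for i in range(30)]          # 0.5 V .. 3.4 V
    best = min(voltages, key=energy_per_bit)
    print(f"energy-efficient swing (toy model) ~ {best:.1f} V, "
          f"{energy_per_bit(best) * 1e15:.1f} fJ/bit")
```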
{"title":"Compact modeling and system implications of microring modulators in nanophotonic interconnects","authors":"Rui Wu, Chin-Hui Chen, J. Fédéli, M. Fournier, R. Beausoleil, K. Cheng","doi":"10.1109/SLIP.2015.7171708","DOIUrl":"https://doi.org/10.1109/SLIP.2015.7171708","url":null,"abstract":"Silicon microring modulators are critical components in optical on-chip communications. In this paper, we develop theoretical compact models for optical transmission, power consumption, bit-error-rate (BER), and electrical tuning of microring modulators. The proposed theoretical models have been extensively validated by fabricated devices from a number of designs and fabrication batches. Since the quality factor (Q) and the extinction ratio (ER) of the microring modulator are important to determine the BER and link power budget, we include accurate equations for the Q and the ER in our models. Based on the proposed models, we identify an extra power penalty for the electrical tuning, and an energy-efficient swing voltage for the microring modulator to achieve to minimum total energy consumption.","PeriodicalId":431489,"journal":{"name":"2015 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125239011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SI for free: machine learning of interconnect coupling delay and transition effects
Pub Date: 2015-06-06 | DOI: 10.1109/SLIP.2015.7171706
A. Kahng, Mulong Luo, S. Nath
In advanced technology nodes, incremental delay due to coupling is a serious concern. Design companies spend significant resources on static timing analysis (STA) tool licenses with signal integrity (SI) analysis enabled. The runtime of STA tools in SI mode is typically large due to complex algorithms and the iterative calculation of timing windows needed to accurately determine aggressor and victim alignments, as well as delay and slew estimations. In this work, we develop machine learning-based predictors of SI-mode timing based on timing reports from non-SI mode. Timing analysis in non-SI mode is faster, and the license costs can be several times less than those of SI mode. We determine electrical and logic-structure parameters that affect the incremental arc delay/slew and path delay (i.e., the difference in arrival times at the clock pin of the launch flip-flop and the D pin of the capture flip-flop) in SI mode, and develop models that can predict these SI-aware delays. In 28nm FDSOI technology, our models predict incremental transition time with a worst-case error of 7.0ps and an average error of 0.7ps, incremental delay with a worst-case error of 5.2ps and an average error of 1.2ps, and path delay with a worst-case error of 8.2ps and an average error of 1.7ps. We also demonstrate that our models are robust across designs and signoff constraints at a particular technology node.
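As a hedged stand-in for the modeling flow (the paper's actual feature set and learner are not reproduced here), the sketch below fits a plain least-squares model that maps hypothetical non-SI report features to SI-mode incremental delay.

```python
"""Hedged sketch: learn SI-mode incremental delay from non-SI timing-report features.
The feature names, synthetic data, and linear model are illustrative assumptions."""

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data, one row per timing arc:
# [non-SI incremental delay (ps), coupling-cap/total-cap ratio, victim slew (ps), #aggressors]
X = rng.uniform([0.0, 0.0, 10.0, 0.0], [50.0, 0.8, 200.0, 8.0], size=(500, 4))
# Synthetic "SI-mode" target, only so the example runs end to end.
y = 1.1 * X[:, 0] + 12.0 * X[:, 1] + 0.02 * X[:, 2] + 0.8 * X[:, 3] + rng.normal(0, 0.5, 500)

# Fit a linear model with an intercept via least squares.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_si_delay(features):
    """Predict SI-mode incremental delay (ps) from non-SI report features."""
    return float(np.dot(np.append(features, 1.0), coef))

if __name__ == "__main__":
    print(f"predicted SI incremental delay: {predict_si_delay([20.0, 0.5, 80.0, 3]):.2f} ps")
```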
{"title":"SI for free: machine learning of interconnect coupling delay and transition effects","authors":"A. Kahng, Mulong Luo, S. Nath","doi":"10.1109/SLIP.2015.7171706","DOIUrl":"https://doi.org/10.1109/SLIP.2015.7171706","url":null,"abstract":"In advanced technology nodes, incremental delay due to coupling is a serious concern. Design companies spend significant resources on static timing analysis (STA) tool licenses with signal integrity (SI) enabled. The runtime of the STA tools in SI mode is typically large due to complex algorithms and iterative calculation of timing windows to accurately determine aggressor and victim alignments, as well as delay and slew estimations. In this work, we develop machine learning-based predictors of timing in SI mode based on timing reports from non-SI mode. Timing analysis in non-SI mode is faster and the license costs can be several times less than those of SI mode. We determine electrical and logic structure parameters that affect the incremental arc delay/slew and path delay (i.e., the difference in arrival times at the clock pin of the launch flip-flop and the D pin of the capture flip-flop) in SI mode, and develop models that can predict these SI-aware delays. We report worst-case error of 7.0ps and average error of 0.7ps for our models to predict incremental transition time, worst-case error of 5.2ps and average error of 1.2ps for our models to predict incremental delay, and worst-case error of 8.2ps and average error of 1.7ps for our models to predict path delay, in 28nm FDSOI technology. We also demonstrate that our models are robust across designs and signoff constraints at a particular technology node.","PeriodicalId":431489,"journal":{"name":"2015 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128071679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clock clustering and IO optimization for 3D integration
Pub Date: 2015-06-06 | DOI: 10.1109/SLIP.2015.7171709
Samyoung Bang, Kwangsoo Han, A. Kahng, V. Srinivas
3D interconnect between two dies can span a wide range of bandwidths and region areas, depending on the application, partitioning of the dies, die size, and floorplan. We explore the concept of dividing such an interconnect into local clusters, each with a cluster clock. We combine such clustering with a choice of three clock synchronization schemes (synchronous, source-synchronous, asynchronous) and study impacts on power, area and timing of the clock tree, data path and 3DIO. We build a model for the power, area and timing as a function of key system requirements and constraints: total bandwidth, region area, number of clusters, clock synchronization scheme, and 3DIO frequency. Such a model enables architects to perform pathfinding exploration of clocking and IO power, area and bandwidth optimization for 3D integration.
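A toy parameterization of such a model is sketched below; every coefficient is a placeholder chosen only to show how sweeping the cluster count against the other inputs (bandwidth, region area, synchronization scheme, 3DIO frequency) could expose a power-minimal configuration.

```python
"""Hedged sketch: a placeholder power model for the 3D interface clock + IO as a
function of the system parameters named in the abstract. Not the paper's fitted model."""

# Assumed per-scheme clock-tree overhead factors (placeholders).
SCHEME_CLOCK_FACTOR = {"synchronous": 1.3, "source-synchronous": 1.0, "asynchronous": 0.8}

def interface_power_mw(total_bw_gbps, region_area_mm2, num_clusters, scheme, io_freq_ghz):
    num_io = total_bw_gbps / io_freq_ghz                # one bit per IO per cycle (assumed)
    io_power = 0.05 * num_io * io_freq_ghz              # ~0.05 mW per IO per GHz (placeholder)
    # Per-cluster tree power assumed to grow superlinearly with cluster span, so the
    # total tree power falls as clusters shrink (placeholder exponent and coefficient).
    tree_power = (SCHEME_CLOCK_FACTOR[scheme] * num_clusters
                  * 0.5 * (region_area_mm2 / num_clusters) ** 1.5 * io_freq_ghz)
    cluster_overhead = 0.3 * num_clusters               # per-cluster clock source cost (placeholder)
    return io_power + tree_power + cluster_overhead

if __name__ == "__main__":
    best = min(range(1, 17),
               key=lambda k: interface_power_mw(256, 4.0, k, "source-synchronous", 2.0))
    print("power-minimal cluster count (toy model):", best,
          f"-> {interface_power_mw(256, 4.0, best, 'source-synchronous', 2.0):.1f} mW")
```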
{"title":"Clock clustering and IO optimization for 3D integration","authors":"Samyoung Bang, Kwangsoo Han, A. Kahng, V. Srinivas","doi":"10.1109/SLIP.2015.7171709","DOIUrl":"https://doi.org/10.1109/SLIP.2015.7171709","url":null,"abstract":"3D interconnect between two dies can span a wide range of bandwidths and region areas, depending on the application, partitioning of the dies, die size, and floorplan. We explore the concept of dividing such an interconnect into local clusters, each with a cluster clock. We combine such clustering with a choice of three clock synchronization schemes (synchronous, source-synchronous, asynchronous) and study impacts on power, area and timing of the clock tree, data path and 3DIO. We build a model for the power, area and timing as a function of key system requirements and constraints: total bandwidth, region area, number of clusters, clock synchronization scheme, and 3DIO frequency. Such a model enables architects to perform pathfinding exploration of clocking and IO power, area and bandwidth optimization for 3D integration.","PeriodicalId":431489,"journal":{"name":"2015 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129340008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}