
Latest publications from the 2012 IEEE 30th International Conference on Computer Design (ICCD)

Mamba: A scalable communication centric multi-threaded processor architecture
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378652
Greg Chadwick, S. Moore
In this paper we describe Mamba, an architecture designed for multi-core systems. Mamba has two major aims: (i) make on-chip communication explicit to the programmer so they can optimize for it and (ii) support many threads and supply very lightweight communication and synchronization primitives for them. These aims are based on the observations that: (i) as feature sizes shrink, on-chip communication becomes relatively more expensive than computation and (ii) as we go increasingly multi-core we need highly scalable approaches to inter-thread communication and synchronization. We employ a network of processors where a given memory access will always go to the same cache, removing the need for a coherence protocol and allowing the program explicit control over all communication. A presence bit associated with each word provides a very lightweight, fine-grained synchronization primitive. We demonstrate an FPGA implementation with micro-benchmarks of standard spinlock and FIFO implementations and show that presence-bit-based implementations provide more efficient locking and lower-latency FIFO communication than a conventional shared-memory implementation, whilst also requiring fewer memory accesses. We also show that Mamba performance is insensitive to total thread count, allowing the use of as many threads as desired.
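The presence-bit idea in the abstract can be illustrated with a small software model: each word carries a presence bit, a write fills the word, and a consuming read empties it, which yields a natural FIFO. This is a hypothetical sketch of the concept, not Mamba's hardware design; all class and method names are illustrative.

```python
class PresenceWord:
    """A memory word augmented with a presence bit (illustrative model)."""
    def __init__(self):
        self.present = False
        self.value = None

    def write(self, value):
        # Writing fills the word and marks it present.
        self.value = value
        self.present = True

    def read_consume(self):
        # A consuming read succeeds only when the word is present; hardware
        # would stall the reading thread, here we signal "retry" with None.
        if not self.present:
            return None
        self.present = False  # consuming read empties the word
        return self.value


class PresenceFIFO:
    """Single-producer/single-consumer FIFO built from presence words."""
    def __init__(self, size):
        self.slots = [PresenceWord() for _ in range(size)]
        self.head = 0
        self.tail = 0
        self.size = size

    def push(self, value):
        slot = self.slots[self.tail % self.size]
        if slot.present:
            return False  # slot still full: producer must retry
        slot.write(value)
        self.tail += 1
        return True

    def pop(self):
        slot = self.slots[self.head % self.size]
        v = slot.read_consume()
        if v is not None:
            self.head += 1
        return v
```

Because emptiness is tracked per word, producer and consumer synchronize on individual slots rather than on a shared lock, which is the source of the lower-latency FIFO communication reported above.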
Citations: 5
Design methodology for sample preparation on digital microfluidic biochips
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378639
Yi-Ling Hsieh, Tsung-Yi Ho, K. Chakrabarty
Recent advances in digital microfluidic biochips have led to a promising future for miniaturized laboratories, with the associated advantages of high sensitivity and reconfigurability. As one of the front-end operations on digital microfluidic biochips, sample preparation plays an important role in biochemical assays and applications. For fast and high-throughput biochemical applications, it is critical to develop an automated design methodology for sample preparation. Prior work in this area does not provide solutions to the problem of design automation for sample preparation. Moreover, it is critical to ensure the correctness of droplets and recover from errors efficiently during sample preparation. Published work on error recovery is inefficient and impractical for sample preparation. Therefore, in this paper, we present an automated design methodology for sample preparation, including architectural synthesis, layout synthesis, and dynamic error recovery. The proposed algorithm is evaluated on real-life biochemical applications to demonstrate its effectiveness and efficiency. Compared to prior work, the proposed algorithm can achieve up to 48.39% reduction in sample preparation time.
Citations: 30
Architecture and design flow for a debug event distribution interconnect
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378676
A. Azevedo, B. Vermeulen, K. Goossens
In this paper, we describe and analyze the architecture of the proposed Debug Event Distribution Interconnect (EDI). The EDI transmits debug events, which are 1-bit signals, between debug entities in different areas of a Network-on-Chip based Multi-Processor System-on-Chip. The EDI replicates the NoC topology, with an EDI node instantiated for each underlying NoC data module. Contention in an EDI node is handled by replicating the EDI in layers. EDI generation is automatic and takes as input the cross-triggering patterns, which are not required to follow the communication patterns in the NoC. The generation and routing tool is also presented in this paper. The EDI is evaluated with four implementations that vary in complexity and in how they handle contention. Using the lowest-area implementation, a single EDI layer occupies around 0.9% of the area of the tested NoCs. These results show that the proposed EDI implementation incurs low cost on the overall system.
Citations: 1
Post-layout OPE-predicted redundant wire insertion for clock skew minimization
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378695
Jin-Tai Yan, Zhi-Wei Chen
Based on the equilibrium concept of load insertion in a physical balance, redundant wires can be inserted to minimize the clock skew in an OPE-predicted clock tree. For five tested benchmarks, experimental results show that our proposed algorithm increases the total load by only 2.8% on average for the insertion of OPE-predicted redundant wires and reduces the clock skew by 30.85 ps on average, obtaining a near zero-skew result in reasonable CPU time.
Citations: 1
Track assignment considering crosstalk-induced performance degradation
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378696
Qiong Zhao, Jiang Hu
Track assignment is a critical step between global routing and detailed routing in modern VLSI chip designs. Crosstalk, which is largely decided by wire adjacency, has significant impact on interconnect delay and circuit performance. Therefore, the amount of crosstalk should be restrained in order to satisfy timing constraints. In this work, a novel track assignment algorithm is proposed to reduce crosstalk-induced performance degradation. The problem is formulated as a Traveling Salesman Problem (TSP) and solved by a graph-based heuristic. Experimental results on the ISPD2011 benchmark circuits show that the violations on crosstalk bounds can be reduced by up to 99.56% compared to the conventional non-constraint-based heuristics.
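The TSP formulation above can be illustrated with a generic nearest-neighbour tour heuristic: treating wires as "cities" and edge weights as the crosstalk cost of placing two wires on adjacent tracks, a greedy tour gives an ordering of wires across tracks. The cost matrix and heuristic here are illustrative, not the paper's graph-based algorithm.

```python
def nearest_neighbour_order(cost, start=0):
    """Greedy tour: repeatedly visit the cheapest unvisited neighbour.

    `cost[i][j]` is a hypothetical crosstalk cost of placing wires i and j
    on adjacent tracks; the returned order assigns wires to tracks so that
    adjacent wires have low pairwise cost.
    """
    n = len(cost)
    order = [start]
    visited = {start}
    while len(order) < n:
        cur = order[-1]
        # Pick the unvisited wire cheapest to place next to the current one.
        nxt = min((j for j in range(n) if j not in visited),
                  key=lambda j: cost[cur][j])
        order.append(nxt)
        visited.add(nxt)
    return order
```

For example, with three wires where wires 0 and 2 couple strongly, the tour keeps them apart by routing through wire 1. Nearest-neighbour is a standard TSP starting point; a production flow would refine the tour (e.g. with 2-opt) under timing constraints.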
Citations: 2
Dynamic phase-based tuning for embedded systems using phase distance mapping
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378653
Tosiron Adegbija, A. Gordon-Ross, Arslan Munir
Phase-based tuning specializes a system's tunable parameters to the varying runtime requirements of an application's different phases of execution to meet optimization goals. Since the design space for tunable systems can be very large, one of the major challenges in phase-based tuning is determining the best configuration for each phase without incurring significant tuning overhead (e.g., energy and/or performance) during design space exploration. In this paper, we propose phase distance mapping, which directly determines the best configuration for a phase, thereby eliminating design space exploration. Phase distance mapping applies the correlation between a known phase's characteristics and best configuration to determine a new phase's best configuration based on the new phase's characteristics. Experimental results verify that our phase distance mapping approach determines configurations within 3% of the optimal configurations on average and yields an energy delay product savings of 26% on average.
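The phase-distance-mapping idea can be sketched as follows: given a known phase's characteristics and its best configuration, a new phase's configuration is derived by adjusting each tunable parameter in proportion to the distance between the two phases' characteristic vectors. The distance metric and adjustment rule below are hypothetical placeholders, not the paper's exact mapping.

```python
def phase_distance(known, new):
    """Euclidean distance between two phase characteristic vectors."""
    return sum((a - b) ** 2 for a, b in zip(known, new)) ** 0.5

def map_configuration(known_chars, known_config, new_chars, step=0.1):
    """Derive a new phase's configuration from a known phase's best one.

    Each tunable parameter is scaled by the phase distance (a hypothetical
    rule); identical characteristics reproduce the known configuration,
    so no design space exploration is needed for the new phase.
    """
    d = phase_distance(known_chars, new_chars)
    return {param: max(1, round(value * (1 + step * d)))
            for param, value in known_config.items()}
```

The key property, mirrored by the sketch, is that the new configuration is computed directly from observed characteristics rather than searched for, which is what eliminates per-phase design space exploration.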
Citations: 8
Row buffer locality aware caching policies for hybrid memories
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378661
Hanbin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael A. Harding, O. Mutlu
Phase change memory (PCM) is a promising technology that can offer higher capacity than DRAM. Unfortunately, PCM's access latency and energy are higher than DRAM's and its endurance is lower. Many DRAM-PCM hybrid memory systems use DRAM as a cache to PCM, to achieve the low access latency and energy, and high endurance of DRAM, while taking advantage of PCM's large capacity. A key question is what data to cache in DRAM to best exploit the advantages of each technology while avoiding its disadvantages as much as possible. We propose a new caching policy that improves hybrid memory performance and energy efficiency. Our observation is that both DRAM and PCM banks employ row buffers that act as a cache for the most recently accessed memory row. Accesses that are row buffer hits incur similar latencies (and energy consumption) in DRAM and PCM, whereas accesses that are row buffer misses incur longer latencies (and higher energy consumption) in PCM. To exploit this, we devise a policy that avoids accessing in PCM data that frequently causes row buffer misses, because such accesses are costly in terms of both latency and energy. Our policy tracks the row buffer miss counts of recently used rows in PCM, and caches in DRAM the rows that are predicted to incur frequent row buffer misses. Our proposed caching policy also takes into account the high write latencies of PCM, in addition to row buffer locality. Compared to a conventional DRAM-PCM hybrid memory system, our row buffer locality-aware caching policy improves system performance by 14% and energy efficiency by 10% on data-intensive server and cloud-type workloads. The proposed policy achieves a 31% performance gain over an all-PCM memory system, and comes within 29% of the performance of an all-DRAM memory system (not taking PCM's capacity benefit into account) on evaluated workloads.
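The core policy described above, counting row-buffer misses per PCM row and promoting frequently missing rows into the DRAM cache, can be sketched in a few lines. This is a minimal single-bank model with an illustrative threshold; the paper's mechanism additionally accounts for PCM write latencies and recency.

```python
class RBLACache:
    """Row-buffer-locality-aware promotion policy (illustrative sketch)."""
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.miss_counts = {}   # PCM row -> observed row-buffer misses
        self.in_dram = set()    # rows promoted to the DRAM cache
        self.open_row = None    # row currently open in the PCM row buffer

    def access(self, row):
        if row in self.in_dram:
            return "dram"
        if row == self.open_row:
            # Row-buffer hit: roughly as cheap in PCM as in DRAM,
            # so there is no benefit to caching this row.
            return "pcm_row_hit"
        # Row-buffer miss in PCM: costly, so record it and promote the
        # row to DRAM once it misses often enough.
        self.open_row = row
        self.miss_counts[row] = self.miss_counts.get(row, 0) + 1
        if self.miss_counts[row] >= self.threshold:
            self.in_dram.add(row)
        return "pcm_row_miss"
```

The sketch captures the policy's asymmetry: rows that keep hitting the open row buffer stay in PCM, while rows whose accesses repeatedly miss the row buffer (the expensive case in PCM) are the ones cached in DRAM.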
Citations: 197
Integration of correct-by-construction BIP models into the MetroII design space exploration flow
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378688
Alena Simalatsar, Liangpeng Guo, M. Bozga, R. Passerone
Design correctness and performance are major issues that traditional system design flows usually consider separately, and with different emphasis. In this paper we show that two design frameworks with different design goals can be meaningfully connected so that each benefits from the other's advantages. We consider BIP for high-level rigorous design and correct-by-construction implementation, and MetroII for low-level platform-based design and performance evaluation.
Citations: 3
SOLE: Speculative one-cycle load execution with scalability, high-performance and energy-efficiency
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378654
Zhen-Hao Zhang, Dong Tong, Xiaoyin Wang, Jiangfang Yi, Keyi Wang
Conventional superscalar processors usually contain a large CAM-based LSQ (load/store queue) with poor scalability and high energy consumption. Recent proposals focus only on improving LSQ scalability to increase the in-flight instruction capacity, but offer little performance improvement or energy efficiency. This paper presents a novel speculative store-load forwarding mechanism, named SOLE (speculative one-cycle load execution). Firstly, SOLE uses address identifiers to perform memory disambiguation, rather than the exact memory addresses as the traditional LSQ does. Since the address identifier is just a simple hash of the address base and offset, speculative store-load forwarding can begin earlier, reducing load execution latency and avoiding unnecessary energy consumption by filtering unnecessary accesses to the data cache. Secondly, SOLE enlarges the forwarding communication range by using SSN (store sequence number) to determine the age order between stores, which further improves performance. Finally, SOLE is implemented entirely with set-associative structures, avoiding the scalability problems of CAM-based LSQs. Experiments show that SOLE outperforms the traditional LSQ by 13.57% in performance, with only 75.2% of the load/store execution energy consumption.
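The address-identifier idea can be sketched as follows: instead of comparing full memory addresses, a small hash of (base, offset) is compared to decide whether a load may speculatively forward from an earlier store. The hash function, its width, and the function names below are all hypothetical; matching identifiers may occasionally alias, which is why the forwarding is speculative and must be verified later.

```python
def addr_identifier(base, offset, bits=12):
    """Hypothetical hash of an address's (base, offset) pair.

    A real design would choose the hash so that false conflicts
    (two different addresses mapping to the same identifier) stay rare.
    """
    return (base ^ (offset * 2654435761)) & ((1 << bits) - 1)

def may_forward(store_base, store_off, load_base, load_off):
    """Speculatively forward when identifiers match (may mis-speculate)."""
    return (addr_identifier(store_base, store_off)
            == addr_identifier(load_base, load_off))
```

Because the identifier is available as soon as base and offset are known, well before full address generation completes, the forwarding decision can be made earlier in the pipeline, which is the latency advantage the abstract describes.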
Citations: 0
A polynomial time flow for implementing free-choice Petri-nets
Pub Date : 2012-09-30 DOI: 10.1109/ICCD.2012.6378645
Pavlos M. Mattheakis, C. Sotiriou, P. Beerel
FSM and PTnet control models are pertinent in both software and hardware applications, as both specification and implementation models. The state-based, monolithic FSM model is directly implementable in software or hardware, but cannot model concurrency without state explosion. Interacting FSM models have so far lacked the formal rigor for expressing the synchronising interactions between different FSMs. The event-based PTnet model can express both concurrency and choice within the same model, but lacks a polynomial-time flow to implementation, as current methods of exposing the event state space require a potentially exponential number of states. In this work, we present a polynomial-complexity flow for transforming a Free-Choice PTnet into a new formalism for interacting FSMs, i.e. Multiple, Synchronised FSMs (MSFSMs), a compact interacting-FSMs model that is potentially implementable using any existing monolithic FSM implementation method. We believe that such a flow can in the long term bridge the event- and state-based models. We present execution time and state space results from exercising our flow on 25 large PTnet specifications describing asynchronous control circuits, and contrast our results with the popular Petrify tool for PTnet state space exploration and circuit implementation. Our results indicate a very significant reduction in both state space size and execution time.
Citations: 2
Journal
2012 IEEE 30th International Conference on Computer Design (ICCD)