2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC): Latest Papers


RExCache: Rapid exploration of unified last-level cache

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509661

S. Min, Haris Javaid, S. Parameswaran

In this paper, we propose to explore the design space of a unified last-level cache to improve system performance and energy efficiency. The challenge is to quickly estimate the execution time and energy consumption of the system with distinct cache configurations using a minimal number of slow full-system cycle-accurate simulations. To this end, we propose a novel, simple yet highly accurate execution time estimator and a simple, reasonably accurate energy estimator. Our framework, RExCache, combines a cycle-accurate simulator and a trace-driven cache simulator with our novel execution time estimator and energy estimator to avoid cycle-accurate simulations of all the last-level cache configurations. Once execution time and energy estimates are available from the estimators, RExCache chooses the cache configuration with minimum execution time or minimum energy consumption. Our experiments with nine different applications from MediaBench and 330 last-level cache configurations show that the execution time and energy estimators achieved average absolute accuracies of at least 99.74% and 80.31%, respectively. RExCache took only a few hours (21 hours for H.264enc) to explore last-level cache configurations, compared to several days for the traditional method (36 days for H.264enc) and cycle-accurate simulations (257 days for H.264enc), enabling quick exploration of the last-level cache. When 100 different real-time constraints on execution time and energy were used, all the cache configurations found by RExCache were similar to those from cycle-accurate simulations. On the other hand, the traditional method found correct cache configurations for only 69 out of 100 constraints. Thus, RExCache has better absolute accuracy than the traditional method while reducing the simulation time by at least 97%.
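
The selection step and the reported time savings can be illustrated with a short Python sketch. It is not the authors' implementation: the `Estimate` record, the `pick_min_energy` helper, and the three sample configurations are hypothetical stand-ins for the estimator outputs. Only the H.264enc run times (21 hours, 36 days, 257 days) are taken from the abstract.

```python
from typing import NamedTuple, Optional, Sequence


class Estimate(NamedTuple):
    """Estimated cost of one last-level cache configuration (hypothetical record)."""
    name: str
    time_s: float
    energy_j: float


def pick_min_energy(estimates: Sequence[Estimate],
                    time_limit_s: float) -> Optional[Estimate]:
    """Return the lowest-energy configuration that meets the execution-time limit."""
    feasible = [e for e in estimates if e.time_s <= time_limit_s]
    return min(feasible, key=lambda e: e.energy_j) if feasible else None


def reduction_percent(fast_hours: float, slow_hours: float) -> float:
    """Percentage of exploration time saved by the faster method."""
    return 100.0 * (1.0 - fast_hours / slow_hours)


# Made-up estimates, for illustration only.
SAMPLE = [
    Estimate("A", time_s=1.0, energy_j=5.0),
    Estimate("B", time_s=1.2, energy_j=4.0),
    Estimate("C", time_s=1.5, energy_j=3.0),
]

# H.264enc exploration times quoted in the abstract, converted to hours.
REXCACHE_HOURS = 21.0
TRADITIONAL_HOURS = 36 * 24.0
CYCLE_ACCURATE_HOURS = 257 * 24.0

if __name__ == "__main__":
    print(pick_min_energy(SAMPLE, time_limit_s=1.3))
    print(f"{reduction_percent(REXCACHE_HOURS, TRADITIONAL_HOURS):.2f}%")
    print(f"{reduction_percent(REXCACHE_HOURS, CYCLE_ACCURATE_HOURS):.2f}%")
```

For H.264enc this gives a reduction of roughly 97.6% relative to the traditional method and roughly 99.7% relative to full cycle-accurate simulation, which is consistent with the "at least 97%" figure in the abstract.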

Citations: 8

Curling-PCM: Application-specific wear leveling for phase change memory based embedded systems

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509609

Duo Liu, Tianzheng Wang, Yi Wang, Z. Shao, Qingfeng Zhuge, E. Sha

Phase change memory (PCM) has been used as a NOR flash replacement in embedded systems owing to its attractive features. However, the endurance of PCM keeps drifting down, which greatly limits its adoption in embedded systems. As most embedded systems are application-oriented, we can better utilize PCM by exploring application-specific features such as fixed access patterns and update frequencies to prolong the lifetime of PCM. In this paper, we propose an application-specific wear leveling technique, called Curling-PCM, to evenly distribute write activities across the PCM chip in order to improve the endurance of PCM. The basic idea is to exploit application-specific features in embedded systems and periodically move the hot region across the whole PCM chip. To further reduce the overhead of moving the hot region and improve the performance of PCM-based embedded systems, a fine-grained partial wear leveling policy is proposed in Curling-PCM, by which only part of the hot region is moved during each request handling period. The experimental results show that Curling-PCM distributes write traffic across PCM chips more evenly than previous work. We expect this work to serve as a first step towards the full exploration of application-specific features in PCM-based embedded systems.
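
The basic idea of periodically moving the hot region can be sketched with a toy Python model. This is an illustrative assumption, not the Curling-PCM algorithm itself: the `simulate_wear` function, the single hot logical region, and the fixed shift interval are all hypothetical, and the model ignores both the extra writes caused by migration and the partial-move policy described in the abstract.

```python
from typing import List


def simulate_wear(num_regions: int, writes: int, move_every: int) -> List[int]:
    """Count writes per physical region for a workload with one hot logical region.

    Every `move_every` writes, the logical-to-physical offset advances by one
    region, so the hot region walks across the chip. `move_every=0` disables
    the movement (no wear leveling).
    """
    wear = [0] * num_regions
    offset = 0
    hot_logical = 0  # the application always writes to this logical region
    for i in range(writes):
        if move_every and i > 0 and i % move_every == 0:
            offset = (offset + 1) % num_regions
        wear[(hot_logical + offset) % num_regions] += 1
    return wear


if __name__ == "__main__":
    baseline = simulate_wear(num_regions=8, writes=8000, move_every=0)
    leveled = simulate_wear(num_regions=8, writes=8000, move_every=100)
    print("max wear without movement:", max(baseline))
    print("max wear with movement:   ", max(leveled))
```

In this toy setting the most-worn region drops from 8000 writes to 1000 because the same traffic is spread over all eight regions. A real design must also pay for the data migration itself, which is the overhead the partial wear leveling policy targets.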

Citations: 61

Optimization of overdrive signoff

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509619

T. Chan, A. Kahng, Jiajia Li, S. Nath

In modern SOC implementations, multi-mode design is commonly used to achieve better circuit performance and power across voltage-scaling, “turbo” and other operating modes. Although there are many tools for multi-mode circuit implementation, to our knowledge there is no available systematic analysis or methodology for the selection of associated signoff modes. We observe that the selection of signoff modes has significant impact on circuit area, power and performance. For example, incorrect choice of signoff voltages for required overdrive frequencies can result in a netlist with 15% suboptimality in power or 21% in area. In this paper, we propose a concept of mode dominance which can be used as a guideline for signoff mode selection. Further, we also propose efficient circuit implementation flows to optimize the selection of signoff modes within several distinct use cases. Our results show that our proposed methodology provides 5-7% improvement in performance compared to the traditional “signoff and scale” method. The signoff modes determined by our methods result in only 0.6% overhead in performance and 8% overhead in power after implementation, compared to the optimal signoff modes.

Citations: 6

High-level synthesis of multiple dependent CUDA kernels on FPGA

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509613

S. Gurumani, Hisham Cholakkal, Yun Liang, K. Rupnow, Deming Chen

High-level synthesis (HLS) tools provide automatic generation of hardware at the register transfer level (RTL) from algorithm descriptions written in high-level languages, enabling faster creation of custom accelerators for FPGA architectures. Existing HLS tools support a wide variety of input languages, and assist users in design space exploration through automation and feedback on designs' performance bottlenecks. This design space exploration applies techniques such as pipelining, partitioning and resource sharing in order to improve performance, and resource utilization. However, although automated exploration can find some inherent parallelism, data-parallel input source code is still superior for exposing a greater variety of parallelism. In prior work, we demonstrated automated design space exploration of GPU multi-threaded (CUDA) language source code for efficient RTL generation. In this paper, we examine the challenges in extending this automated design space exploration to multiple dependent CUDA kernels, demonstrate a step-by-step procedure for efficiently performing multi-kernel synthesis, and demonstrate the potential of this approach through a case study of a stereo matching algorithm. This study demonstrates that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation, while consuming over 16X less energy than the GPU. Based on our manual procedure, we identify the key challenges in fully automating the synthesis of multi-kernel CUDA programs.

Citations: 19

Unconditionally stable explicit method for the fast 3-D simulation of on-chip power distribution network with through silicon via

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509550

T. Sekine, H. Asai

The equivalent circuit of an on-chip power distribution network (PDN) has a fine 3-D grid structure due to the vias between equipotential conductors and the vertical couplings between power and ground lines. In addition, a through silicon via is modeled with inductive and capacitive parasitic elements and appended to the PDN. Therefore, the circuit related to the 3-D IC technology tends to be a tightly coupled large network. For the simulation of this type of network, an explicit time marching scheme has an advantage in computational cost over conventional general-purpose circuit simulators such as SPICE. However, the explicit method has a strict numerical stability condition, which may limit the maximum time step size and increase the total computational cost. In this work, we propose a method that is explicit yet stable without any stability condition. Additionally, the proposed unconditionally stable explicit method is further accelerated by combining it with an order reduction technique.
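
The stability condition that restricts conventional explicit schemes can be seen on the simplest possible circuit. The Python sketch below uses forward Euler on a single RC discharge, dv/dt = -v/τ, which is an assumption chosen for illustration: the abstract does not say which explicit scheme is used, and the code does not implement the proposed unconditionally stable method. The function name `forward_euler_rc` and the normalized time constant are hypothetical. For this update, v ← (1 - Δt/τ)·v, the solution decays only when Δt < 2τ.

```python
def forward_euler_rc(tau: float, dt: float, steps: int, v0: float = 1.0) -> float:
    """Integrate dv/dt = -v/tau with forward Euler and return the final voltage."""
    v = v0
    for _ in range(steps):
        v += dt * (-v / tau)  # explicit update: v_next = (1 - dt/tau) * v
    return v


if __name__ == "__main__":
    tau = 1.0  # normalized RC time constant (hypothetical value)
    stable = forward_euler_rc(tau, dt=0.5 * tau, steps=50)    # dt < 2*tau
    unstable = forward_euler_rc(tau, dt=2.5 * tau, steps=50)  # dt > 2*tau
    print(f"dt = 0.5*tau -> |v| = {abs(stable):.3e}")
    print(f"dt = 2.5*tau -> |v| = {abs(unstable):.3e}")
```

With the smaller step the voltage decays towards zero, as the physical circuit does. With the larger step the magnitude grows by a factor of 1.5 per step. In a large PDN the smallest time constant sets this limit for the whole network, which is why a conditionally stable explicit scheme can be forced into very small time steps.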

Citations: 4

MIXSyn: An efficient logic synthesis methodology for mixed XOR-AND/OR dominated circuits

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509585

L. Amarù, P. Gaillardon, G. Micheli

We present a new logic synthesis methodology, called MIXSyn, that produces area-efficient results for mixed XOR-AND/OR dominated logic functions. MIXSyn is a two-step synthesis process. The first step is a hybrid logic optimization that enables selective and distinct optimization of AND/OR and XOR-intensive portions of the logic circuit. The second step is a library-free technology mapping that enhances design flexibility with a tractable computational cost. MIXSyn has been tested on a set of large MCNC benchmarks. Experimental results indicate that MIXSyn produces CMOS circuits with 18.0% and 9.2% fewer devices, on average, than state-of-the-art academic and commercial synthesis tools, respectively. MIXSyn is also capable of exploiting the novel XOR implementations offered by double-gate ambipolar devices. Experimental results show that MIXSyn can reduce the number of ambipolar transistors by 20.9% and 15.3%, on average, with respect to state-of-the-art academic and commercial synthesis tools, respectively.

Citations: 26

Heterogeneous memory management for 3D-DRAM and external DRAM with QoS

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509676

L. Tran, F. Kurdahi, A. Eltawil, H. Homayoun

This paper presents an innovative memory management approach to utilize both 3D-DRAM and external DRAM (ex-DRAM). Our approach dynamically allocates and relocates memory blocks between the 3D-DRAM and the ex-DRAM to exploit the high memory bandwidth and the low memory latency of the 3D-DRAM as well as the high capacity and the low cost of the ex-DRAM. Our simulation shows that in workloads that are not memory intensive, our memory management technique transfers all active memory blocks to the 3D-DRAM which runs faster than the ex-DRAM. In memory intensive workloads, our memory management technique utilizes both the 3D-DRAM and the ex-DRAM to increase the memory bandwidth to alleviate bandwidth congestion. Our approach supports Quality of Service (QoS) for “latency sensitive”, “bandwidth sensitive”, and “insensitive” applications. To improve the performance and satisfy a certain level of QoS, memory blocks of different application types are allocated differently. Compared to the scratchpad memory management mechanism, the average memory access latency of our approach decreases by 19% and 23%, while performance improves by up to 5% and 12% in single threaded benchmarks and multi-threaded benchmarks respectively. Moreover, using our approach, applications do not need to manage memory explicitly like in the scratchpad case. Our memory block relocation comes with negligible performance overhead, particularly for applications which have high spatial memory locality.

Citations: 19

HS3DPG: Hierarchical simulation for 3D P/G network

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509647

Shuai Tao, Xiaoming Chen, Yu Wang, Yuchun Ma, Yiyu Shi, Hui Wang, Huazhong Yang

As different chips are stacked together in 3D ICs, power/ground (P/G) network simulation becomes more challenging than in 2D cases. In this paper, we propose a hierarchical simulation method suitable for 3D P/G networks (HS3DPG), which can ensure full parallelism and good scalability with the number of tiers. In the IR drop analysis, when there are 9 tiers, the hierarchical method can be 6.5 times faster than the direct full network simulation. The accuracy of HS3DPG has been verified on a 3D P/G network from an industrial design. In addition, we introduce the “locality” property into HS3DPG to further simplify the simulation. Finally, HS3DPG is used to analyze the voltage distribution of a 3D P/G network with clustered TSVs.

Citations: 5

Application-specific fault-tolerant architecture synthesis for digital microfluidic biochips

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509697

M. Alistar, P. Pop, J. Madsen

Microfluidic-based biochips are replacing conventional biochemical analyzers, and are able to integrate on-chip all the necessary functions for biochemical analysis using microfluidics. Digital microfluidic biochips are based on the manipulation of liquids not as a continuous flow, but as discrete droplets on an array of electrodes. Microfluidic operations, such as transport, mixing, and splitting, are performed on this array by routing the corresponding droplets on a series of electrodes. Researchers have proposed several approaches for the synthesis of digital microfluidic biochips. All previous work assumes that the biochip architecture is given, and most approaches consider a rectangular shape for the electrode array. However, non-regular application-specific architectures are common in practice. Hence, in this paper, we propose an approach to application-specific architecture synthesis. Our approach can also help the designer to increase the yield by introducing redundant electrodes to tolerate permanent faults. The proposed architecture synthesis algorithm has been evaluated using several benchmarks.

Citations: 13

A sub-harmonic injection-locked frequency synthesizer with frequency calibration scheme for use in 60GHz TDD transceivers

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)

Pub Date : 2013-04-29 DOI: 10.1109/ASPDAC.2013.6509574

T. Siriburanon, W. Deng, Ahmed Musa, K. Okada, A. Matsuzawa

A 58.1-to-65.0 GHz frequency synthesizer using a sub-harmonic injection-locking technique is presented. The synthesizer can generate all 60 GHz channels defined by IEEE 802.15.3c, WirelessHD, IEEE 802.11ad, WiGig, and ECMA-387. A frequency calibration scheme is proposed to monitor frequency shift resulting from environmental variations. Implemented in a 65 nm CMOS process, the synthesizer achieves a typical phase noise of -117 dBc/Hz at a 10 MHz offset from a carrier frequency of 61.56 GHz.
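
The coverage claim can be checked with a few lines of Python. The tuning range and the 61.56 GHz carrier come from the abstract. The four channel center frequencies are not stated there: they are the 2.16 GHz-spaced centers of the 60 GHz channel plan used by IEEE 802.11ad and IEEE 802.15.3c, added here as an assumption for illustration. The helper name `is_covered` is hypothetical.

```python
from typing import Tuple

# Tuning range reported in the abstract, in GHz.
SYNTH_RANGE_GHZ: Tuple[float, float] = (58.1, 65.0)

# Channel centers of the 60 GHz channel plan (assumed here, not from the abstract).
CHANNEL_CENTERS_GHZ = [58.32, 60.48, 62.64, 64.80]

# Carrier frequency at which the phase noise is quoted in the abstract.
MEASURED_CARRIER_GHZ = 61.56


def is_covered(freq_ghz: float,
               tuning_range: Tuple[float, float] = SYNTH_RANGE_GHZ) -> bool:
    """Return True if the frequency lies inside the synthesizer tuning range."""
    low, high = tuning_range
    return low <= freq_ghz <= high


if __name__ == "__main__":
    for center in CHANNEL_CENTERS_GHZ:
        print(f"{center:.2f} GHz covered: {is_covered(center)}")
    print(f"{MEASURED_CARRIER_GHZ:.2f} GHz covered: {is_covered(MEASURED_CARRIER_GHZ)}")
```

All four centers fall inside the 58.1-to-65.0 GHz range, with about 0.2 GHz of margin at each end of the band.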

Citations: 1
