2010 International Workshop on Innovative Architecture for Future Generation High Performance: Latest Papers

[2010] VIX: A Router Architecture for Priority-Aware Networks-on-Chip

2010 International Workshop on Innovative Architecture for Future Generation High Performance

Pub Date : 2010-01-17 DOI: 10.1109/IWIA.2010.15

Takuma Kogo, N. Yamasaki

In future many-core chip multiprocessor (CMP) and system-on-chip (SoC) architectures, networks-on-chip (NoCs) will be among the most critical components. In CMPs and SoCs, multiple applications will execute concurrently and interfere with each other, causing packet conflicts in the NoC. Priority control is required in such environments because each application has different bandwidth requirements and produces different traffic patterns. Unfortunately, priority control degrades network performance and significantly increases the area of a priority-aware on-chip router. This paper proposes a router architecture for priority-aware NoCs that mitigates the performance and area overheads of priority control. We implement the proposed router architecture using a 90 nm process technology. The synthesis results show no critical-path overhead and a drastic reduction in router area. Simulation results on an 8-ary 2-mesh network show that the average latency of higher-priority packets is reduced near the saturation point.

Citations: 1

[2010] OREX - An Optical Ring with Electrical Crossbar Hybrid Photonic Network-on-Chip

2010 International Workshop on Innovative Architecture for Future Generation High Performance

Pub Date : 2010-01-17 DOI: 10.1109/IWIA.2010.13

Cisse Ahmadou Dit Adi, Ping Qiu, H. Irie, T. Miyoshi, T. Yoshinaga

The role of the network-on-chip (NoC) is becoming more important as the number of processing elements (PEs) integrated onto a single chip increases. Lowering power consumption while providing high-performance communication is a challenging problem in the design of future NoCs. In this paper we propose OREX, a hybrid NoC consisting of an optical ring and an electrical crossbar central router. OREX takes advantage of state-of-the-art electrical and optical technology designs to deliver a high-data-rate NoC at an acceptable power cost. We evaluate the proposed hybrid NoC using a cycle-accurate simulator. Simulation experiments show that OREX offers slightly better communication performance, in terms of bandwidth and power consumption, than a conventional hybrid photonic torus network.

Citations: 1

[2009] A Stage-Level Recovery Scheme in Scalable Pipeline Modules for High Dependability

2010 International Workshop on Innovative Architecture for Future Generation High Performance

Pub Date : 2010-01-17 DOI: 10.1109/IWIA.2010.11

Jun Yao, Hajime Shimada, Kazutoshi Kobayashi

In recent years, the increasing error rate has become one of the major impediments to applying new process technologies in electronic devices such as microprocessors. This necessitates research into fault-tolerance mechanisms at the device, micro-architecture, and system levels to maintain correct computation in future microprocessors as process technologies advance. Space redundancy, in the form of dual or triple modular redundancy (DMR or TMR), is widely used to tolerate errors with negligible performance loss. In this paper, at the micro-architecture level, we propose a very fine-grained recovery scheme based on a DMR processor architecture to cover every transient error inside the memory interface boundary. Our recovery method makes full use of the duplicated hardware already present in the DMR processor, and so avoids the large hardware extension that the checkpoint buffers in many fault-tolerant processors require. The hardware-based recovery is achieved by dynamically triggering an instruction re-execution procedure in the cycle after error detection, which implies a near-zero performance impact for error-free execution. A TMR architecture is usually preferred because its majority voting logic provides seamless error correction and therefore incurs no recovery delay. With our fast recovery scheme at a low hardware cost, our results show that even under a relatively high transient error rate, a DMR architecture alone can detect and recover from errors at a negligible performance cost. Our reliable processor is thus constructed to use DMR execution with fast recovery as its major working mode. It saves around 1/3 of the energy consumption of a traditional TMR architecture while maintaining transient error coverage.
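
The contrast the abstract draws between DMR with re-execution and TMR with majority voting can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the function names (`run_copy`, `dmr_execute`, `tmr_execute`) are hypothetical, a transient fault is modeled as an XOR mask applied to one copy's result, and "re-execution" is modeled as simply running both copies again.

```python
from itertools import chain, repeat
from typing import Callable, Iterable, Iterator, Tuple


def run_copy(op: Callable[[int], int], x: int, faults: Iterator[int]) -> int:
    """One redundant copy: compute op(x), then XOR in the next fault mask (0 = no fault)."""
    return op(x) ^ next(faults)


def dmr_execute(op: Callable[[int], int], x: int,
                fault_masks: Iterable[int]) -> Tuple[int, int]:
    """Run two copies and compare; on a mismatch, re-execute both.

    Returns (result, attempts). Once the listed fault masks are used up,
    execution is modeled as fault-free (an assumption of this sketch).
    """
    faults = chain(fault_masks, repeat(0))
    attempts = 0
    while True:
        attempts += 1
        a = run_copy(op, x, faults)
        b = run_copy(op, x, faults)
        if a == b:  # the two copies agree, so accept the result
            return a, attempts


def tmr_execute(op: Callable[[int], int], x: int,
                fault_masks: Iterable[int]) -> int:
    """Run three copies once and return the majority value (no re-execution)."""
    faults = chain(fault_masks, repeat(0))
    a, b, c = (run_copy(op, x, faults) for _ in range(3))
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one copy was corrupted")


def op(v: int) -> int:
    """Hypothetical stand-in for one instruction's computation."""
    return v * 3 + 1


# A single fault hits the first copy; DMR needs a second attempt, TMR votes it away.
print(dmr_execute(op, 5, [0b100, 0]))  # (16, 2)
print(tmr_execute(op, 5, [0b100]))     # 16
```

In this model DMR runs two copies per attempt instead of three and pays for a fault with one extra attempt, which is the trade-off the abstract describes; the model says nothing about the actual pipeline timing or energy figures.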

Citations: 10

[2010] Facing the Exascale Energy Wall

2010 International Workshop on Innovative Architecture for Future Generation High Performance

Pub Date : 2010-01-17 DOI: 10.1109/IWIA.2010.9

P. Kogge, P. Fratta, Megan Vance

A recent report focused on the technical challenges in advancing from today's "petascale" systems to "exascale." Power, or more accurately energy, was a dominant challenge. This paper briefly reviews the energy challenge for exascale-sized systems, with an emphasis on the relatively enormous energy costs of referencing operands from the memory hierarchy. Then, using a key step from the LINPACK benchmark, we investigate two different approaches to reducing such costs: one that migrates computations up from the host to higher levels of the hierarchy, and another that moves the whole computation closer to memory. Both show significant improvements over architecture as usual.

Citations: 10

[2009] Exploring the Possible Past Futures of a Single Part Type Multi-core PIM Chip

2010 International Workshop on Innovative Architecture for Future Generation High Performance

Pub Date : 2010-01-17 DOI: 10.1109/IWIA.2010.8

P. Kogge

Execube, a chip built in 1993, was most probably the world's first true multi-core microprocessor, the world's first Processing-In-Memory chip built on a DRAM process, and one of the earliest attempts to build a single part type chip out of which larger parallel processors could be built. This paper looks back on that chip and explores what would have happened if its development had continued through succeeding generations of technology. Several different scenarios are explored, with discussions as to what the capabilities would have been, and where limitations would have surfaced.

Citations: 2

[2009] An Instruction Decomposition Method for Reconfigurable Decoders

2010 International Workshop on Innovative Architecture for Future Generation High Performance

Pub Date : 2010-01-17 DOI: 10.1109/IWIA.2010.12

Kazuhiro Yoshimura, Takashi Nakada, Y. Nakashima

Embedded multimedia processors are required to execute many kinds of traditional instruction sets. Because software emulators incur more overhead than hardware units when decomposing and translating instructions, the IPC on software emulators is lower than on real processors. In this paper, we propose a new method for executing many kinds of traditional instruction sets. The method decomposes them into internal instructions based on information from memory. The memory-based decoder decomposes target CISC instructions into simple instructions. We evaluate the instruction decomposition method and the memory-based decoders. The average IPC of a memory-based decoder is 0.53, which is six times higher than that of JIT-type software emulators. The total memory size of the decoder is 98 KB. The chip area of a processor whose decoder uses RAM is 1.36 times that of a processor with a hardwired decoder. We therefore conclude that the proposed method provides a good tradeoff between chip area and performance.

Citations: 0

[2010] Energy Efficiency Using Loop Buffer based Instruction Memory Organizations

2010 International Workshop on Innovative Architecture for Future Generation High Performance

Pub Date : 2010-01-17 DOI: 10.1109/IWIA.2010.10

Antonio Artés, F. Duarte, M. Ashouei, J. Huisken, J. Ayala, David Atienza Alonso, F. Catthoor

Energy consumption in embedded systems is strongly dominated by instruction memory organizations. Consequently, any architectural enhancement introduced in this component will significantly reduce the total energy budget of the system. Loop buffering is an effective scheme for reducing the energy consumption of the instruction memory organization. In this paper, a novel classification of architectural enhancements based on the loop buffer concept is presented. Using this classification, an energy design space exploration is performed to show the impact on energy consumption in different application scenarios. Based on gate-level simulations, the energy analysis demonstrates that the instruction-level parallelism of the system improves not only performance but also the energy consumption of the system. The increase in instruction-level parallelism makes it easier to adapt the sizes of the loop buffers to the sizes of the loops that form the application, because it gives more freedom to combine the execution of those loops.

Citations: 1

[2010] Avoiding Side-Channel Attacks in Embedded Systems with Non-deterministic Branches

2010 International Workshop on Innovative Architecture for Future Generation High Performance

Pub Date : 2010-01-17 DOI: 10.1109/IWIA.2010.14

P. Malagón, Juan-Mariano de Goyeneche, Marina Zapater, Jose M. Moya

In this paper, we suggest handling security in embedded systems by introducing a small architectural change. We propose the use of a non-deterministic branch instruction to generate non-determinism in the execution of encryption algorithms. Non-determinism makes side-channel attacks much more difficult. The experimental results show at least three orders of magnitude improvement in resistance to statistical side-channel attacks for a custom AES implementation, while enhancing its performance at the same time. Compared with previous countermeasures, this architecture-level hiding countermeasure is trivial to integrate into current embedded processor designs and offers similar resistance to side-channel attacks, while keeping power consumption similar to that of the unprotected processor.
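
To make the idea of hiding through non-determinism concrete, the following sketch models one way a randomly taken branch could vary the order of operations without changing the result. It is an illustration under stated assumptions, not the paper's instruction or its AES implementation: `nd_branch` is a hypothetical software stand-in for the proposed branch instruction, and the XOR step is merely reminiscent of an AES AddRoundKey operation.

```python
import random
from typing import List, Tuple


def nd_branch(rng: random.Random) -> bool:
    """Hypothetical stand-in for a non-deterministic branch: taken or not at random."""
    return rng.random() < 0.5


def xor_key_hidden(state: bytes, key: bytes,
                   rng: random.Random) -> Tuple[bytes, List[int]]:
    """XOR key into state, choosing each step's byte index with nd_branch.

    Every step takes the next byte from either the front or the back of the
    remaining range, so the order of operations differs from run to run while
    the final result stays the same. Returns (result, order of indices).
    """
    if len(state) != len(key):
        raise ValueError("state and key must have the same length")
    out = bytearray(len(state))
    order: List[int] = []
    lo, hi = 0, len(state) - 1
    while lo <= hi:
        if nd_branch(rng):
            i = lo
            lo += 1
        else:
            i = hi
            hi -= 1
        out[i] = state[i] ^ key[i]
        order.append(i)
    return bytes(out), order


state = bytes(range(16))
key = bytes([0xAA] * 16)
reference = bytes(s ^ k for s, k in zip(state, key))

for seed in range(3):
    result, order = xor_key_hidden(state, key, random.Random(seed))
    # The result never depends on the random choices; only the order does.
    print(result == reference, order)
```

Because the time at which a given byte is processed changes between runs, traces collected by an attacker no longer line up sample by sample, which is the general intuition behind hiding countermeasures. A real design would need a hardware randomness source rather than a seeded software generator.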

Citations: 0

[2010] Combined Dynamic-Static Approach for Thermal-Awareness in Heterogeneous Data Centers

2010 International Workshop on Innovative Architecture for Future Generation High Performance

Pub Date : 2010-01-17 DOI: 10.1109/IWIA.2010.7

Marina Zapater, J. L. Risco-Martín, Z. Bankovic, J. Ayala, Jose M. Moya

The thermal profile of a data center plays a significant role in the cooling cost and power budget of the system. While several dynamic and static approaches have been proposed so far, they have failed to consider the whole picture. This paper proposes a combined static and dynamic approach that shows the benefits of efficient scheduling strategies in producing thermally efficient floorplans. The devised methodology yields a processor placement and a task schedule for a heterogeneous system in which the main thermal metrics (maximum temperature and thermal gradient) have been optimized.

Citations: 0
