
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Latest Publications

Message from the Conference Chairs - ASAP 2020
Dirk Koch, Frank Hannig, J. Navaridas
{"title":"Message from the Conference Chairs - ASAP 2020","authors":"Dirk Koch, Frank Hannig, J. Navaridas","doi":"10.1109/asap49362.2020.00005","DOIUrl":"https://doi.org/10.1109/asap49362.2020.00005","url":null,"abstract":"","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"42 1","pages":"i-ii"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84124038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Message from the ASAP 2016 chairs
David B. Thomas, Suhaib A. Fahmy
We welcome you to the 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2016). This year's event takes place in London, United Kingdom, on the campus of Imperial College London. Prior to this year's visit to London, the conference has been held in many places around the globe, including Oxford (1986), San Diego (1988), Killarney (1989), Princeton (1990), Barcelona (1991), Berkeley (1992), Venice (1993), San Francisco (1994), Strasbourg (1995), Chicago (1996), Zurich (1997), Boston (2000), San Jose (2002), The Hague (2003), Galveston (2004), Samos (2005), Steamboat Springs (2006), Montreal (2007), Leuven (2008), Boston (2009), Rennes (2010), Santa Monica (2011), Delft (2012), Washington, D.C. (2013), Zurich (2014), and Toronto (2015). Though this is the 27th iteration of ASAP, it is actually the 30-year anniversary of the first conference in Oxford.
{"title":"Message from the ASAP 2016 chairs","authors":"David B. Thomas, Suhaib A. Fahmy","doi":"10.1109/ASAP.2016.7760764","DOIUrl":"https://doi.org/10.1109/ASAP.2016.7760764","url":null,"abstract":"We welcome you to the 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2016). This year's event takes place in London, United Kingdom on the campus of the Imperial College London. Prior to this year's visit to London, the conference has been held in many places around the globe including Oxford (1986), San Diego (1988), Killarney (1989), Princeton (1990), Barcelona (1991), Berkeley (1992), Venice (1993), San Francisco (1994), Strasbourg (1995), Chicago (1996), Zurich (1997), Boston (2000), San Jose (2002), The Hague (2003), Galveston (2004), Samos (2005), Steamboat Springs (2006), Montreal (2007), Leuven (2008), Boston (2009), Rennes (2010), Santa Monica (2011), Delft (2012), and Washington, D.C (2013), Zurich (2014), and Toronto (2015). Though this is the 27th iteration of ASAP, it is actually the 30 year anniversary of the first conference in Oxford.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"15 1","pages":"iii-iv"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81850938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
An interpolation-based approach to multi-parameter performance modeling for heterogeneous systems
D. Rudolph, G. Stitt
To effectively optimize applications for emerging heterogeneous architectures, compilers and synthesis tools must perform the challenging task of estimating the performance of different implementations and optimizations for different numbers and types of computational resources. Many performance-prediction techniques exist, but those approaches are specific to particular resources or applications, and are often not capable of prediction for all combinations of inputs. In this paper, we introduce an approach to multi-parameter performance modeling based on sampling and interpolation. This approach can be used in conjunction with execution time data, simulated or observed, to quickly perform performance estimation for any function, on any resource, with any combination of inputs. By evaluating a Kriging-based interpolator on a variety of functions and computational resources, we determine bounds on the accuracy of this approach, and show that an interpolation-based approach utilizing Kriging can effectively model execution time for most applications. We also show that Kriging is a highly effective interpolation technique for execution time, and can be up to four orders of magnitude more accurate than nearest-neighbor interpolation or radial basis function interpolation.
{"title":"An interpolation-based approach to multi-parameter performance modeling for heterogeneous systems","authors":"D. Rudolph, G. Stitt","doi":"10.1109/ASAP.2015.7245731","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245731","url":null,"abstract":"To effectively optimize applications for emerging heterogeneous architectures, compilers and synthesis tools must perform the challenging task of estimating the performance of different implementations and optimizations for different numbers and types of computational resources. Many performance-prediction techniques exist, but those approaches are specific to particular resources or applications, and are often not capable of prediction for all combinations of inputs. In this paper, we introduce an approach to multi-parameter performance modeling based on sampling and interpolation. This approach can be used in conjunction with execution time data, simulated or observed, to quickly perform performance estimation for any function, on any resource, with any combination of inputs. By evaluating a Kriging-based interpolator on a variety of functions and computational resources, we determine bounds on the accuracy of this approach, and show that an interpolation-based approach utilizing Kriging can effectively model execution time for most applications. We also show that Kriging is a highly effective interpolation technique for execution time, and can be up to four orders of magnitude more accurate than nearest-neighbor interpolation or radial basis function interpolation.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"62 1","pages":"174-180"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78404650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
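The modeling idea above (sample a few configurations, then interpolate execution time with Kriging, i.e. Gaussian-process regression) can be sketched in a few lines. The snippet below is an illustration rather than the paper's implementation: the synthetic timing function, the log-time transform, and the scikit-learn kernel choice are assumptions made for this example.

```python
# Illustrative sketch of interpolation-based performance modeling (not the
# paper's implementation): sample execution times at a few (input size,
# core count) points, fit a Kriging model (Gaussian-process regression),
# and predict the time of an unsampled configuration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

def measured_time(n, cores):
    # Stand-in for a simulated or observed execution time; this synthetic
    # model is an assumption made purely for the demonstration.
    return 1e-8 * n * np.log2(n) / cores + 5e-4 * cores

# Sampled parameter combinations (the training set), in log2 coordinates.
sizes = [2 ** k for k in range(10, 21, 2)]
cores = [1, 2, 4, 8, 16]
X = np.array([(np.log2(n), np.log2(c)) for n in sizes for c in cores])
y = np.log([measured_time(n, c) for n in sizes for c in cores])  # model log-time

# Ordinary Kriging is Gaussian-process regression with a constant trend.
kriging = GaussianProcessRegressor(
    kernel=ConstantKernel() * RBF(length_scale=[2.0, 1.0]),
    normalize_y=True)
kriging.fit(X, y)

# Predict an unsampled configuration and compare against the "ground truth".
n_q, c_q = 3.0e5, 6.0
pred = np.exp(kriging.predict(np.array([[np.log2(n_q), np.log2(c_q)]])))[0]
print(f"predicted {pred:.4e} s, actual {measured_time(n_q, c_q):.4e} s")
```

A nearest-neighbor or radial-basis-function interpolator could be substituted at the same point to reproduce the kind of accuracy comparison the abstract reports.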
Reconfigurable acceleration of fitness evaluation in trading strategies
Andreea-Ingrid Funie, Paul Grigoras, P. Burovskiy, W. Luk, Mark Salmon
Over the past few years, examining financial markets has become a crucial part of both the trading and regulatory processes. Recently, genetic programs have been used to identify patterns in financial markets which may lead to more advanced trading strategies. We investigate the use of Field Programmable Gate Arrays to accelerate the evaluation of the fitness function, which is an important kernel in genetic programming. Our pipelined design makes use of the massive amount of parallelism available on chip to evaluate the fitness of multiple genetic programs simultaneously. An evaluation of our designs on both synthetic and historical market data shows that our implementation evaluates the fitness function up to 21.56 times faster than a multi-threaded C++11 implementation running on two six-core Intel Xeon E5-2640 processors using OpenMP.
{"title":"Reconfigurable acceleration of fitness evaluation in trading strategies","authors":"Andreea-Ingrid Funie, Paul Grigoras, P. Burovskiy, W. Luk, Mark Salmon","doi":"10.1109/ASAP.2015.7245736","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245736","url":null,"abstract":"Over the past years, examining financial markets has become a crucial part of both the trading and regulatory processes. Recently, genetic programs have been used to identify patterns in financial markets which may lead to more advanced trading strategies. We investigate the use of Field Programmable Gate Arrays to accelerate the evaluation of the fitness function which is an important kernel in genetic programming. Our pipelined design makes use of the massive amounts of parallelism available on chip to evaluate the fitness of multiple genetic programs simultaneously. An evaluation of our designs on both synthetic and historical market data shows that our implementation evaluates fitness function up to 21.56 times faster than a multi-threaded C++11 implementation running on two six-core Intel Xeon E5-2640 processors using OpenMP.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"3 1","pages":"210-217"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79519353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
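The kernel being accelerated above is, at its heart, the repeated evaluation of candidate trading rules over market data. Below is a minimal software sketch of such a fitness kernel; the rule encoding (nested tuples), the moving-average features, and the cumulative-return fitness metric are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch of the fitness-evaluation kernel in genetic programming
# for trading rules. The rule encoding (nested tuples), the moving-average
# features, and the cumulative-return fitness are assumptions for this
# demonstration; they are not the paper's design.
import numpy as np

def moving_average(prices, window):
    # Trailing moving average; the first (window - 1) entries keep the raw price.
    ma = np.copy(prices)
    for i in range(window, len(prices) + 1):
        ma[i - 1] = prices[i - window:i].mean()
    return ma

def evaluate_rule(rule, prices):
    """Recursively evaluate a rule tree into a +1/-1 position signal per tick."""
    op = rule[0]
    if op == "gt":   # ("gt", fast, slow): long when the fast MA is above the slow MA
        return np.where(moving_average(prices, rule[1]) >
                        moving_average(prices, rule[2]), 1.0, -1.0)
    if op == "and":
        return np.minimum(evaluate_rule(rule[1], prices),
                          evaluate_rule(rule[2], prices))
    if op == "or":
        return np.maximum(evaluate_rule(rule[1], prices),
                          evaluate_rule(rule[2], prices))
    raise ValueError(f"unknown operator {op!r}")

def fitness(rule, prices):
    """Fitness = cumulative return of trading today's signal on tomorrow's move."""
    signal = evaluate_rule(rule, prices)[:-1]
    returns = np.diff(prices) / prices[:-1]
    return float(np.sum(signal * returns))

# Score a small population on a synthetic random-walk price series.
rng = np.random.default_rng(0)
prices = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 2000)))
population = [("gt", 5, 20),
              ("and", ("gt", 5, 20), ("gt", 10, 50)),
              ("gt", 20, 5)]
for rule in population:
    print(rule, round(fitness(rule, prices), 4))
```

The FPGA design described in the abstract pipelines this kind of per-rule evaluation so that many genetic programs are scored on the data stream simultaneously.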
Speeding up graph-based SLAM algorithm: A GPU-based heterogeneous architecture study
Abdelhamid Dine, A. Elouardi, B. Vincke, S. Bouaziz
In this paper we present a study of using a heterogeneous architecture to implement the graph-based SLAM algorithm. The study aims to investigate the performance of an ARM-GPU based architecture by offloading some critical compute-intensive tasks of the algorithm to the integrated GPU.
{"title":"Speeding up graph-based SLAM algorithm: A GPU-based heterogeneous architecture study","authors":"Abdelhamid Dine, A. Elouardi, B. Vincke, S. Bouaziz","doi":"10.1109/ASAP.2015.7245711","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245711","url":null,"abstract":"In this paper we present a study of using an heterogeneous architecture to implement the graph-based SLAM algorithm. The study aims to investigate the performances of an ARM-GPU based architecture by offloading some critical compute-intensive tasks of the algorithm to the integrated GPU.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"24 1","pages":"72-73"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91535158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Does arithmetic logic dominate data movement? a systematic comparison of energy-efficiency for FFT accelerators
T. Hoang, Amirali Shambayati, H. Hoffmann, A. Chien
In this paper, we perform a systematic comparison of the energy cost of varying data formats and data types, with respect to arithmetic logic and data movement, for accelerator-based heterogeneous systems to which both a compute-intensive accelerator (FFT) and a data-intensive accelerator (DLT) are added. We explore a wide range of design processes (e.g., 32nm bulk CMOS and projected 7nm FinFET) and memory systems (e.g., DDR3 and HMC). First, our results show that when varying data formats, the energy costs of using floating point over fixed point in the 32nm process are 5.3% (DDR3) and 6.2% (HMC) for the core, and 0.8% (DDR3) and 1.5% (HMC) for the system. These costs become negligible in the 7nm FinFET process with DDR3 memory, at 0.2% for the core and 0.01% for the system, and increase only slightly with HMC. Second, we find that the core and system energy of a system using a fixed-point, 16-bit FFT accelerator is nearly half that of a 32-bit one if data movement is also accelerated. This evidence implies that system energy is highly proportional to the amount of data moved when varying data types.
{"title":"Does arithmetic logic dominate data movement? a systematic comparison of energy-efficiency for FFT accelerators","authors":"T. Hoang, Amirali Shambayati, H. Hoffmann, A. Chien","doi":"10.1109/ASAP.2015.7245708","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245708","url":null,"abstract":"In this paper, we perform a systematic comparison to study the energy cost of varying data formats and data types w.r.t. arithmetic logic and data movement for accelerator-based heterogeneous systems in which both compute-intensive (FFT accelerator) and data-intensive accelerators (DLT accelerator) are added. We explore evaluation for a wide range of design processes (e.g. 32nm bulk-CMOS and projected 7nm FinFET) and memory systems (e.g. DDR3 and HMC). First, our result shows that when varying data formats, the energy costs of using floating point over fixed point are 5.3% (DDR3), 6.2% (HMC) for core and 0.8% (DDR3), 1.5% (HMC) for system in 32nm process. These energy costs are negligible as 0.2% and 0.01% for core and system in 7nm FinFET process in DDR3 memory and slightly increasing in HMC. Second, we identify that the core and system energy of systems using fixed point, 16-bit, FFT accelerator is nearly half of using 32-bit if data movement is also accelerated. This evidence implies that system energy is highly proportional to the amount of moving data when varying data types.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"66 1","pages":"66-67"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80641890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
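The second finding above, that system energy tracks the volume of data moved once the data type changes, can be made concrete with a back-of-the-envelope estimate. The per-byte DRAM energy and the streaming-FFT traffic model below are assumed round numbers for illustration; they are not the paper's measurements.

```python
# Back-of-the-envelope illustration of why halving the data width roughly
# halves data-movement energy for an FFT. The DRAM energy-per-byte constant
# and the streaming traffic model are assumed round numbers, not measurements.
import math

def fft_bytes_moved(n_points, bytes_per_sample):
    """Bytes streamed to and from memory by a streaming radix-2 FFT:
    each of the log2(n) stages reads and writes n complex samples."""
    stages = int(math.log2(n_points))
    complex_bytes = 2 * bytes_per_sample          # real + imaginary parts
    return 2 * n_points * complex_bytes * stages  # one read + one write per stage

DRAM_PJ_PER_BYTE = 20.0   # assumed DDR3-class access energy, for illustration only

n = 1 << 20  # 1M-point FFT
for label, width in (("32-bit samples", 4), ("16-bit samples", 2)):
    moved = fft_bytes_moved(n, width)
    energy_uj = moved * DRAM_PJ_PER_BYTE * 1e-6   # pJ -> uJ
    print(f"{label}: {moved / 2**20:.0f} MiB moved, ~{energy_uj:.0f} uJ of DRAM traffic")
```

Halving the sample width halves the bytes streamed per stage, which is consistent with the abstract's observation that the 16-bit accelerator's energy approaches half of the 32-bit case once data movement is also accelerated.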
A scheduling and binding heuristic for high-level synthesis of fault-tolerant FPGA applications
Aniruddha Shastri, G. Stitt, Eduardo Riccio
Space computing systems commonly use field-programmable gate arrays to provide fault tolerance by applying triple modular redundancy (TMR) to existing register-transfer-level (RTL) code. Although effective, this approach has a 3× area overhead that can be prohibitive for many designs that often allocate resources before considering effects of redundancy. Although a designer could modify existing RTL code to reduce resource usage, such a process is time consuming and error prone. Integrating redundancy into high-level synthesis is a more attractive approach that enables synthesis to rapidly explore different tradeoffs at no cost to the designer. In this paper, we introduce a scheduling and binding heuristic for high-level synthesis that explores tradeoffs between resource usage, latency, and the amount of redundancy. In many cases, an application will not require 100% error correction, which enables significant flexibility for scheduling and binding to reduce resources. Even for applications that require 100% error correction, our heuristic is able to explore solutions that sacrifice latency for reduced resources, and typically save up to 47% when relaxing the latency up to 2×. When the error constraint is reduced to 70%, our heuristic achieves typical resource savings ranging from 18% to 49% when relaxing the latency up to 2×, with a maximum of 77%. Even when comparing with optimized RTL designs, our heuristic uses up to 61% fewer resources than TMR.
{"title":"A scheduling and binding heuristic for high-level synthesis of fault-tolerant FPGA applications","authors":"Aniruddha Shastri, G. Stitt, Eduardo Riccio","doi":"10.1109/ASAP.2015.7245735","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245735","url":null,"abstract":"Space computing systems commonly use field-programmable gate arrays to provide fault tolerance by applying triple modular redundancy (TMR) to existing register-transfer-level (RTL) code. Although effective, this approach has a 3× area overhead that can be prohibitive for many designs that often allocate resources before considering effects of redundancy. Although a designer could modify existing RTL code to reduce resource usage, such a process is time consuming and error prone. Integrating redundancy into high-level synthesis is a more attractive approach that enables synthesis to rapidly explore different tradeoffs at no cost to the designer. In this paper, we introduce a scheduling and binding heuristic for high-level synthesis that explores tradeoffs between resource usage, latency, and the amount of redundancy. In many cases, an application will not require 100% error correction, which enables significant flexibility for scheduling and binding to reduce resources. Even for applications that require 100% error correction, our heuristic is able to explore solutions that sacrifice latency for reduced resources, and typically save up to 47% when relaxing the latency up to 2×. When the error constraint is reduced to 70%, our heuristic achieves typical resource savings ranging from 18% to 49% when relaxing the latency up to 2×, with a maximum of 77%. Even when comparing with optimized RTL designs, our heuristic uses up to 61% fewer resources than TMR.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"21 1","pages":"202-209"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75888933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
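To make the resource/latency/redundancy tradeoff above concrete, the sketch below shows a toy resource-constrained list scheduler in which only a chosen fraction of operations is triplicated. The dataflow graph, the unit-latency model, the three-slot cost for protected operations, and the naive choice of which operations to protect are all assumptions for illustration; the paper's heuristic is more sophisticated.

```python
# Illustrative sketch (not the paper's algorithm): resource-constrained list
# scheduling of a small dataflow graph in which a chosen fraction of operations
# is triplicated (TMR). A protected operation occupies three functional-unit
# slots in its issue cycle; the coverage targets and the naive choice of which
# operations to protect are assumptions made for this example.
from collections import defaultdict

def list_schedule(ops, deps, num_fus, protected):
    """Greedy list scheduling. ops: operation ids; deps: op -> set of
    predecessors; num_fus: per-cycle functional-unit budget (assumed >= 3 so a
    protected op always fits); protected: ops executed as three redundant copies."""
    finish = {}                          # op -> cycle in which it completes (latency 1)
    schedule = defaultdict(list)
    remaining = set(ops)
    cycle = 0
    while remaining:
        used = 0
        for op in sorted(remaining):
            if any(p in remaining for p in deps.get(op, ())):
                continue                 # some predecessor is not yet scheduled
            if any(finish[p] >= cycle for p in deps.get(op, ())):
                continue                 # predecessor result not ready this cycle
            cost = 3 if op in protected else 1
            if used + cost > num_fus:
                continue                 # not enough functional-unit slots left
            schedule[cycle].append(op)
            finish[op] = cycle
            used += cost
        for op in schedule[cycle]:
            remaining.discard(op)
        cycle += 1
    return schedule, cycle               # cycle == schedule length

# Toy DFG: four multiplies feeding an adder tree (a3 is the final sum).
ops  = ["m1", "m2", "m3", "m4", "a1", "a2", "a3"]
deps = {"a1": {"m1", "m2"}, "a2": {"m3", "m4"}, "a3": {"a1", "a2"}}
num_fus = 4

for coverage in (1.0, 0.7, 0.0):
    protected = set(ops[:round(coverage * len(ops))])   # naive selection, for illustration
    _, latency = list_schedule(ops, deps, num_fus, protected)
    print(f"error coverage {coverage:.0%}: {latency} cycles with {num_fus} FUs")
```

Running the sketch shows the schedule shrinking as the coverage requirement is relaxed (here from 7 cycles at full TMR down to 3 cycles with no protection), which is the kind of resource/latency/coverage tradeoff the heuristic explores.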
MultiExplorer: A tool set for multicore system-on-chip design exploration
Rodrigo Devigo, Liana Duenha, R. Azevedo, R. Santos
This paper proposes MultiExplorer, a new toolset for MPSoC modelling, experimentation, and design space exploration that combines fast high-abstraction simulation with low-level physical estimates (power, area, and timing). The MultiExplorer infrastructure takes a range of high- and low-level parameters to improve accuracy in the design of a multiprocessor system on a chip. Our results show the toolset to be a viable alternative for exploring multiprocessor scalability (1-64 cores) within affordable simulation times.
{"title":"MultiExplorer: A tool set for multicore system-on-chip design exploration","authors":"Rodrigo Devigo, Liana Duenha, R. Azevedo, R. Santos","doi":"10.1109/ASAP.2015.7245727","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245727","url":null,"abstract":"This paper proposes MultiExplorer, a new toolset for MPSoCs modelling, experimentation, and design space exploration, by combining fast high-abstraction simulation and low-level physical estimates (power, area, and timing). The MultiExplorer infrastructure takes a range of high and low-level parameters to improve accuracy in the design of a multiprocessor system on a chip. Our toolset results show a viable alternative to explore multiprocessor scalability (1-64 cores) on affordable simulation times.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"28 1","pages":"160-161"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82166762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
A metamorphotic Network-on-Chip for various types of parallel applications
S. Tade, Hiroki Matsutani, H. Amano, M. Koibuchi
A metamorphotic Network-on-Chip (NoC) architecture is proposed in order to customize for performance or energy consumption on a per-application basis. Adding reconfigurability to conventional topologies has been studied so far, especially for application workloads that can be statically analyzed. In this context, we propose a platform that handles both the static case and the dynamic case, in which application workloads cannot be statically analyzed while performance or energy constraints are given. Our metamorphotic NoC reconfigures its topology, routing, operating frequency, and supply voltage based on the following three modes. 1) Regular mode uses a traditional mesh topology for neighboring communications; as the link length is short and uniform, it can be operated at a higher frequency and higher voltage, while long-range communication increases the path length. 2) Random mode uses a random topology for unknown workloads to reduce the path length by exploiting the small-world effect; as the path length is reduced but the wire delay is increased, it is intended for a lower operating frequency and lower voltage. 3) Custom mode uses an optimized topology for given workloads. To support Random and Custom modes, assembled multiplexers are embedded into the metamorphotic NoC; Random and Regular/Custom modes are generated by reconfiguring these multiplexers randomly or selectively, respectively, based on the performance or energy constraints. This paper explores the design space of assembled multiplexers and provides a reasonable design recommendation through a graph analysis. The approach is evaluated experimentally in terms of area overhead, operating frequency, network performance, and energy consumption. The results show that Regular mode can operate at 1.27GHz, and that Random mode can reduce the average network latency by 19.6% and the energy consumption by 44.2% compared with a traditional mesh-topology NoC, with little overhead. Custom mode achieves reductions comparable to Random mode.
{"title":"A metamorphotic Network-on-Chip for various types of parallel applications","authors":"S. Tade, Hiroki Matsutani, H. Amano, M. Koibuchi","doi":"10.1109/ASAP.2015.7245715","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245715","url":null,"abstract":"A metamorphotic Network-on-Chip (NoC) architecture is proposed in order to customize for performance or energy consumption on a per-application basis. Adding reconfigurability on conventional topologies has been studied so far especially for application workloads that can be statically analyzed. In this context, we propose such a platform to take care of both the static and the dynamic cases where application workloads cannot be statically analyzed while performance or energy constraints are given. Our metamorphotic NoC reconfigures its topology, routing, operating frequency, and supply voltage based on the following three modes. 1) Regular mode uses a traditional mesh topology for neighboring communications. As the link length is short and uniform, it can be operated at a higher frequency and higher voltage, while a long-range communication increases the path length. 2) Random mode uses a random topology for unknown workloads to reduce the path length by exploiting the small-world effect. As the path length is reduced but the wire delay is increased, it is intended for a lower operating frequency and lower voltage. 3) Custom mode uses an optimized topology for given workloads. To support Random and Custom modes, assembled multiplexers are embedded into the metamorphotic NoC. Random and Regular/Custom modes are generated by randomly or selectively reconfiguring these multiplexers, respectively, based on the performance or energy constraints. This paper explores the design space of assembled multiplexers and provides a reasonable design recommendation through a graph analysis. It is demonstrated based on experimental results on the area overhead, operating frequency, network performance, and energy consumption. The results show that Regular mode can operate at 1.27GHz and Random mode can reduce the average network latency by 19.6% and the energy consumption by 44.2% compared with a traditional NoC that has mesh topology with little overhead. Custom mode can reduce them as well as Random mode.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"23 1","pages":"98-105"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83943718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
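Random mode's advantage rests on the small-world effect mentioned above: a handful of random long-range links sharply reduces the average hop count of a mesh. The sketch below illustrates that effect in isolation; the mesh size, the number of shortcut links, and the BFS hop-count metric are arbitrary choices for this example, not the paper's configuration.

```python
# Minimal illustration of the small-world effect behind "Random mode": adding a
# few random long-range links to a mesh sharply lowers the average hop count.
# The mesh size, shortcut count, and seed are arbitrary choices for this example.
import random
from collections import deque

def mesh(n):
    """n x n 2D mesh as an adjacency dict keyed by (x, y) coordinates."""
    adj = {(x, y): set() for x in range(n) for y in range(n)}
    for x in range(n):
        for y in range(n):
            for dx, dy in ((1, 0), (0, 1)):
                if x + dx < n and y + dy < n:
                    adj[(x, y)].add((x + dx, y + dy))
                    adj[(x + dx, y + dy)].add((x, y))
    return adj

def average_hops(adj):
    """Average shortest-path length in hops over all source/destination pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

random.seed(1)
n = 8
regular = mesh(n)
print(f"{n}x{n} mesh: average hops = {average_hops(regular):.2f}")

# "Random mode": overlay a handful of random shortcut links on the same mesh.
shortcut = mesh(n)
nodes = list(shortcut)
for _ in range(16):
    a, b = random.sample(nodes, 2)
    shortcut[a].add(b)
    shortcut[b].add(a)
print(f"{n}x{n} mesh + 16 shortcuts: average hops = {average_hops(shortcut):.2f}")
```

The shortcut links stand in for what the embedded multiplexers enable in hardware; the longer wires they imply are why Random mode targets a lower operating frequency and supply voltage.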
Timing speculation-aware instruction set extension for resource-constrained embedded systems
Tanvir Ahmed, Yuko Hara-Azumi
Performance, area, and power are important concerns for many embedded systems. One area- and power-efficient way to improve performance is instruction set architecture (ISA) extension. Although existing works have introduced application-specific accelerators co-operating with a basic processor, most of them are still not suitable for embedded systems with stringent resource and/or power constraints because of excess, power-hungry resources in the basic processor. In this paper, we propose ISA extension for such stringently constrained embedded systems. In contrast to previous works, ours simplifies the basic processor by replacing its original power-hungry resources with power-efficient alternatives. Then, considering the application's features (not only its input patterns but also its instruction sequences), we extend the software binary with new instructions executable on the simplified processor. These hardware and software extensions jointly work well for timing speculation (TS). To the best of our knowledge, this is the first TS-aware ISA extension applicable to embedded systems with stringent area and/or power constraints. In our evaluation, compared with the traditional worst-case design, we achieved a 29.9% speedup in execution time and 1.5× more aggressive clock scaling, along with reductions of 8.7% in circuit area and 48.3% in power-delay product.
{"title":"Timing speculation-aware instruction set extension for resource-constrained embedded systems","authors":"Tanvir Ahmed, Yuko Hara-Azumi","doi":"10.1109/ASAP.2015.7245701","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245701","url":null,"abstract":"Performance, area, and power are important issues for many embedded systems. One area- and power-efficient way to improve performance is instruction set architecture (ISA) extension. Although existing works have introduced application-specific accelerators co-operating with a basic processor, most of them are still not suitable for embedded systems with stringent resource and/or power constraints because of excess, power-hungry resources in the basic processor. In this paper, we propose ISA extension for such stringently constrained embedded systems. Contrary to previous works, our work rather simplifies the basic processor by replacing original power-hungry resources with power-efficient alternatives. Then, considering the application features (not only input patterns but also instruction sequence), we extend software binary with new instructions executable on the simplified processor. These hardware and software extensions can jointly work well for timing speculation (TS). To the best of our knowledge, this is the first TS-aware ISA extension applicable to embedded systems with stringent area- and/or power-constraints. In our evaluation, we achieved 29.9% speedup in execution time and 1.5× aggressive clock scaling along with 8.7% and 48.3% reduction in circuit area and power-delay product, respectively, compared with the traditional worst-case design.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"39 1","pages":"30-34"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87866414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0