
Proceedings of the ACM International Conference on Computing Frontiers: Latest Publications

Exploring embedded systems virtualization using MIPS virtualization module
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2903179
C. Moratelli, S. J. Filho, Fabiano Hessel
Embedded virtualization has emerged as a valuable way to increase security, reduce costs, improve software quality and decrease design time. The late adoption of hardware-assisted virtualization in embedded processors induced the development of hypervisors primarily based on para-virtualization. Recently, embedded processor designers developed virtualization extensions for their processor architectures similar to those adopted in cloud computing years ago. Now, hypervisors are migrating to a mixed approach, where basic operating system functionalities take advantage of full-virtualization and advanced functionalities such as inter-domain communication remain para-virtualized. In this paper, we discuss the key features for embedded virtualization. We show how our embedded hypervisor was designed to support these features, taking advantage of the hardware-assisted virtualization available in the MIPS family of processors. Different aspects of our hypervisor are evaluated and compared to other similar approaches. A hardware platform was used to run benchmarks on virtualized instances of both Linux and an RTOS for performance analysis. Finally, the results obtained show that our hypervisor can be applied as a sound solution for the IoT.
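As a rough illustration of this mixed approach, the sketch below shows a guest-side stub for the one function that stays para-virtualized, inter-domain communication: ordinary guest code runs unmodified under full virtualization, while messages to other domains are handed to the hypervisor through an explicit hypercall. The hypercall number, register convention and `ipc_send` interface are assumptions made for this example, not the paper's actual API.

```c
/* Minimal sketch, assuming a hypothetical hypercall ABI: call number in a0,
 * arguments in a1-a3, result in v0. The .word below is the usual encoding
 * of the MIPS VZ 'hypcall 0' instruction for toolchains without the
 * virtualization ASE enabled. */
#include <stddef.h>

#define HC_IPC_SEND 1L   /* hypothetical hypercall number */

static inline long hypcall(long num, long arg0, long arg1, long arg2)
{
    register long a0 __asm__("a0") = num;
    register long a1 __asm__("a1") = arg0;
    register long a2 __asm__("a2") = arg1;
    register long a3 __asm__("a3") = arg2;
    register long v0 __asm__("v0");

    __asm__ __volatile__(
        ".word 0x42000028"            /* MIPS VZ hypcall, code 0 (assumed encoding) */
        : "=r"(v0)
        : "r"(a0), "r"(a1), "r"(a2), "r"(a3)
        : "memory");
    return v0;
}

/* Send len bytes from buf to another guest domain via the hypervisor. */
long ipc_send(int dest_domain, const void *buf, size_t len)
{
    return hypcall(HC_IPC_SEND, dest_domain, (long)buf, (long)len);
}
```

Everything else in the guest (exception handling, TLB refills, timer access) would run on the fully virtualized path provided by the VZ module, with no stub at all.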
Citations: 4
First impressions from detailed brain model simulations on a Xeon/Xeon-Phi node
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2903477
G. Chatzikonstantis, D. Rodopoulos, Sofia Nomikou, C. Strydis, C. I. Zeeuw, D. Soudris
The development of physiologically plausible neuron models comes with increased complexity, which poses a challenge for many-core computing. In this work, we have chosen an extension of the demanding Hodgkin-Huxley model for the neurons of the Inferior Olivary Nucleus, an area of vital importance for motor skills. The computing fabric of choice is an Intel Xeon-Xeon Phi system, widely-used in modern computing infrastructure. The target application is parallelized with combinations of MPI and OpenMP. The best configurations are scaled up to human InfOli numbers.
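The hybrid MPI+OpenMP parallelization mentioned above typically follows a simple pattern: neurons are partitioned across MPI ranks, each rank advances its share with an OpenMP loop, and coupling state is exchanged between steps. The sketch below is a minimal, self-contained version of that pattern; the placeholder dynamics and the mean-field coupling term are invented for illustration and are not the paper's extended Hodgkin-Huxley model.

```c
/* Hybrid MPI+OpenMP sketch: distribute neurons over ranks, exchange a
 * reduced coupling value each step, update locally in parallel. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_TOTAL 96000   /* illustrative neuron count (remainder ignored) */
#define N_STEPS 1000
#define DT      0.025   /* illustrative time step, ms */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n_local = N_TOTAL / size;              /* neurons owned by this rank */
    double *v = calloc(n_local, sizeof *v);    /* membrane potentials */

    for (int step = 0; step < N_STEPS; step++) {
        /* Reduce a global mean potential: a crude stand-in for the
         * gap-junction coupling exchanged by real inferior-olive models. */
        double local_sum = 0.0;
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < n_local; i++)
            local_sum += v[i];

        double global_sum = 0.0;
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        double v_mean = global_sum / N_TOTAL;

        /* Each rank integrates its own neurons with an OpenMP loop. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n_local; i++)
            v[i] += DT * (-0.1 * v[i] + 0.01 * (v_mean - v[i]));
    }

    if (rank == 0)
        printf("completed %d steps on %d ranks\n", N_STEPS, size);
    free(v);
    MPI_Finalize();
    return 0;
}
```

Built with something like `mpicc -fopenmp -O2`, the same pattern lets MPI ranks span the Xeon host and the Xeon Phi card (in symmetric mode) while OpenMP threads fill the cores within each rank.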
Citations: 5
Sub-PicoJoule per operation scalable computing: why, when, how?
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2916035
L. Benini
The "internet of everything" envisions trillions of connected objects loaded with high-bandwidth sensors requiring massive amounts of local signal processing, fusion, pattern extraction and classification. From the computational viewpoint, the challenge is formidable and can be addressed only by pushing computing fabrics toward massive parallelism and brain-like energy efficiency levels. CMOS technology can still take us a long way toward this vision. Our recent results with the open-source PULP (parallel ultra-low power) chips demonstrate that pj/OP (GOPS/mW) computational efficiency is within reach in today's 28nm CMOS FDSOI technology. In this talk, I will look at the next 1000x of energy efficiency improvement, which will require heterogeneous 3D integration, mixed-signal, approximate processing and non-Von-Neumann architectures for scalable acceleration.
Citations: 0
Techniques for modulating error resilience in emerging multi-value technologies
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2903154
Magnus Själander, Gustaf Borgström, M. Klymenko, F. Remacle, S. Kaxiras
There exist extensive ongoing research efforts on emerging atomic scale technologies that have the potential to become an alternative to today's CMOS technologies. A common feature among the investigated technologies is that of multi-value devices, in particular, the possibility of implementing quaternary logic and memory. However, multi-value devices tend to be more sensitive to interferences and, thus, have reduced error resilience. We present an architecture based on multi-value devices where we can trade energy efficiency against error resilience. Important data are encoded in a more robust binary format while error tolerant data is encoded in a quaternary format. We show for eight benchmarks an average energy reduction of 14%, 20%, and 32% for the register file, level-one data cache, and main memory, respectively, and for three integer benchmarks, an energy reduction for arithmetic operations of up to 28%. We also show that for a quaternary technology to be viable a raw bit error rate of one error in 100 million or better is required.
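A minimal sketch of the encoding split described above, under the assumption that each multi-value cell can be driven either with two widely spaced levels (robust, one bit per cell) or with four levels (dense, two bits per cell). The functions and cell model are illustrative only and say nothing about the paper's device- or circuit-level design.

```c
/* Encode one byte either robustly (1 bit/cell, 8 cells) or densely
 * (2 bits/cell, 4 cells). Cells are modeled as small integers. */
#include <stdio.h>
#include <stdint.h>

/* Robust/binary mode: each cell holds 0 or 1, maximizing noise margin. */
static void encode_binary(uint8_t byte, uint8_t cells[8])
{
    for (int i = 0; i < 8; i++)
        cells[i] = (byte >> i) & 0x1;
}

/* Dense/quaternary mode: each cell holds one of four levels (0..3). */
static void encode_quaternary(uint8_t byte, uint8_t cells[4])
{
    for (int i = 0; i < 4; i++)
        cells[i] = (byte >> (2 * i)) & 0x3;
}

int main(void)
{
    uint8_t robust[8], dense[4];
    encode_binary(0xB6, robust);       /* "important" data: 8 cells */
    encode_quaternary(0xB6, dense);    /* error-tolerant data: 4 cells */

    printf("binary cells:     ");
    for (int i = 0; i < 8; i++) printf("%u ", (unsigned)robust[i]);
    printf("\nquaternary cells: ");
    for (int i = 0; i < 4; i++) printf("%u ", (unsigned)dense[i]);
    printf("\n");
    return 0;
}
```

The same storage thus holds twice as many error-tolerant bits per cell, at the cost of smaller level spacing and hence lower error resilience, which is exactly the trade-off the paper exposes to the architecture.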
Citations: 3
Scalable betweenness centrality on multi-GPU systems
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2903153
M. Bernaschi, Giancarlo Carbone, Flavio Vella
Betweenness Centrality (BC) is steadily growing in popularity as a metric of the influence of a vertex in a graph. The BC score of a vertex is proportional to the number of all-pairs shortest paths passing through it. However, complete and exact BC computation for a large-scale graph is an extraordinary challenge that requires high-performance computing techniques to provide results in a reasonable amount of time. Our approach combines bi-dimensional (2-D) decomposition of the graph and multi-level parallelism together with a suitable data-thread mapping that overcomes most of the difficulties caused by the irregularity of the computation on GPUs. In order to reduce the time and space requirements of BC computation, a heuristic based on a 1-degree reduction technique is developed as well. Experimental results on synthetic and real-world graphs show that the proposed techniques are well suited to compute BC scores in graphs which are too large to fit in the memory of a single computational node.
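For reference, the quantity being computed is the standard betweenness centrality, where the score of a vertex is the accumulated fraction of shortest paths that pass through it:

```latex
BC(v) \;=\; \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
```

Here \sigma_{st} is the number of shortest paths between s and t, and \sigma_{st}(v) is the number of those that pass through v. Exact evaluation requires a shortest-path exploration from every source vertex, which is what makes large graphs expensive and motivates the 2-D decomposition and multi-GPU parallelism above.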
Citations: 24
vSIP: virtual scheduler for interactive performance
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2903178
Yan Sui, Chun Yang, Ning Jia, Xu Cheng
This paper presents vSIP, a new scheme for virtual desktop disk scheduling on a shared storage system that targets user-interactive performance. The proposed scheme enables requests to be dynamically prioritized based on the interactive nature of the applications issuing them. To enhance user experience on consolidated desktops, our scheme grants interactive applications priority requests, which see lower latency in accessing storage than requests from non-interactive applications sharing the same storage. To this end, we devise a hypervisor extension that distinguishes interactive applications from non-interactive applications. Our framework prioritizes the requests from these applications and limits the request rate. Our evaluation shows that the proposed scheme significantly improves the interactive performance of storage-sensitive tasks such as application launch, Web page loading and cold video playback when other storage-intensive applications heavily disturb the interactive ones. In addition, we introduce a guest OS information transfer method, which further improves the efficiency and accuracy of identifying interactive applications.
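One plausible reading of the dispatch policy described above is a two-queue scheme: requests tagged interactive by the hypervisor-side classifier are served first, while the remaining traffic is throttled by a budget. The sketch below illustrates that reading only; the structures, the token-based throttle and the field names are assumptions for this example, not vSIP's implementation.

```c
/* Toy virtual-disk dispatcher: interactive requests go first,
 * background requests are limited by a token budget. */
#include <stdbool.h>
#include <stddef.h>

struct io_request {
    unsigned long long sector;
    bool interactive;              /* set by the classifier in the hypervisor */
    struct io_request *next;
};

struct vdisk_sched {
    struct io_request *interactive_q;   /* FIFO of interactive requests */
    struct io_request *background_q;    /* FIFO of everything else */
    unsigned tokens;                    /* background budget, refilled by a timer (not shown) */
};

static struct io_request *pop(struct io_request **q)
{
    struct io_request *r = *q;
    if (r)
        *q = r->next;
    return r;
}

/* Pick the next request for the physical disk, or NULL if idle/throttled. */
struct io_request *vdisk_dispatch(struct vdisk_sched *s)
{
    if (s->interactive_q)                    /* interactive latency comes first */
        return pop(&s->interactive_q);
    if (s->background_q && s->tokens > 0) {  /* background only while budget remains */
        s->tokens--;
        return pop(&s->background_q);
    }
    return NULL;
}
```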
Citations: 2
From FLOPS to BYTES: disruptive change in high-performance computing towards the post-moore era
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2906830
S. Matsuoka, H. Amano, K. Nakajima, Koji Inoue, T. Kudoh, N. Maruyama, K. Taura, Takeshi Iwashita, T. Katagiri, T. Hanawa, Toshio Endo
The slowdown and inevitable end of exponential scaling of processor performance, the end of the so-called "Moore's Law", is predicted to occur in the 2025--2030 timeframe. Because CMOS semiconductor voltage is also approaching its limits, logic transistor power will become constant and, as a result, system FLOPS will cease to improve, with serious consequences for IT in general and supercomputing in particular. Existing attempts to overcome the end of Moore's law are rather limited in their future outlook or applicability. We claim that data-oriented parameters such as bandwidth and capacity, or BYTES, are the new parameters that will allow continued performance gains even after computing performance, or FLOPS, ceases to improve, thanks to continued advances in storage device technologies, optics, and manufacturing technologies including 3-D packaging. Such a transition from FLOPS to BYTES will lead to disruptive changes in overall systems, from applications, algorithms and software to architecture, as to what parameter to optimize for in order to achieve continued performance growth over time. We are launching a new set of research efforts to investigate and devise new technologies that enable such disruptive changes from FLOPS to BYTES in the post-Moore era, focusing on HPC, where sensitivity to performance is extreme, and we expect the results to disseminate to the rest of IT.
Citations: 18
Exploring dataflow-based thread level parallelism in cyber-physical systems
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2906829
R. Giorgi
Smart Cyber-Physical Systems (SCPS) aim not only at integrating computational platforms and physical processes, but also at creating larger "systems of systems" capable of satisfying multiple critical constraints such as energy efficiency, high performance, safety, security, size and cost. The AXIOM project aims at designing such systems by focusing on low-cost Single Board Computers (SBCs) based on current Systems-on-Chip (SoCs) that include programmable logic (FPGA), multi-core CPUs, accelerators and peripherals. A dataflow execution model, partially developed in the TERAFLUX project, brings more predictable and reliable execution. The goals of AXIOM include: i) the possibility to easily program the system with a shared-memory model based on OmpSs; ii) the possibility of scaling up the system through a custom but inexpensive interconnect; iii) the possibility of accelerating a specific function on a single FPGA or on multiple FPGAs of the system. The dataflow execution model operates at thread-level granularity. In this paper the AXIOM execution model and the related memory model are further detailed. The memory model is key to the execution of threads while reducing the need for data transfers. The preliminary results confirm the scalability of this model.
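The thread-level dataflow execution mentioned above can be approximated with ordinary task dependences: a task becomes runnable only once the data it consumes has been produced. The sketch below uses standard OpenMP 4.0 `depend` clauses rather than OmpSs-specific pragmas, so it shows the style of the programming model, not AXIOM's actual runtime.

```c
/* Three tasks forming a small dataflow graph: the consumer fires only
 * after both producers have written their outputs. */
#include <stdio.h>

int main(void)
{
    int a = 0, b = 0, c = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1;                          /* producer of a */

        #pragma omp task depend(out: b)
        b = 2;                          /* producer of b */

        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                      /* runs once a and b are available */

        #pragma omp taskwait
        printf("c = %d\n", c);
    }
    return 0;
}
```

In the AXIOM setting the same producer/consumer structure is what makes it natural to offload some tasks to the FPGA logic while others stay on the CPU cores, since task readiness is decided purely by data availability.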
Citations: 10
Curbing the roofline: a scalable and flexible architecture for CNNs on FPGA
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2911715
P. Meloni, Gianfranco Deriu, Francesco Conti, Igor Loi, L. Raffo, L. Benini
Convolutional Neural Networks (CNNs) have reached outstanding results in several complex visual recognition tasks, such as classification and scene parsing. CNNs are composed of multiple filtering layers that perform 2D convolutions over input images. The intrinsic parallelism in such a computation kernel makes it suitable to be effectively accelerated on parallel hardware. In this paper we propose a highly flexible and scalable architectural template for acceleration of CNNs on FPGA devices, based on the cooperation between a set of software cores and a parallel convolution engine that communicate via a tightly coupled L1 shared scratchpad. Our accelerator structure, tested on a Xilinx Zynq XC-Z7045 device, delivers peak performance up to 80 GMAC/s, corresponding to 100 MMAC/s for each DSP slice in the programmable fabric. Thanks to the flexible architecture, convolution operations can be scheduled in order to reduce input/output bandwidth down to 8 bytes per cycle without degrading the performance of the accelerator in most of the meaningful use-cases.
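The computation the engine accelerates is, at its core, the dense 2D convolution loop nest below, written in plain C for a single output feature map. Argument layout and sizes are illustrative; the tiling, scheduling and mapping onto DSP slices that the paper optimizes are deliberately not modeled here.

```c
/* Direct 2D convolution: C input maps of size H x W, one K x K kernel per
 * input map, producing one (H-K+1) x (W-K+1) output map. */
#include <stddef.h>

void conv2d(const float *in,  size_t C, size_t H, size_t W,  /* input:   C x H x W */
            const float *wgt, size_t K,                      /* weights: C x K x K */
            float *out)                                      /* output:  OH x OW   */
{
    size_t OH = H - K + 1, OW = W - K + 1;
    for (size_t oy = 0; oy < OH; oy++)
        for (size_t ox = 0; ox < OW; ox++) {
            float acc = 0.0f;
            for (size_t c = 0; c < C; c++)            /* accumulate over input maps */
                for (size_t ky = 0; ky < K; ky++)
                    for (size_t kx = 0; kx < K; kx++)
                        acc += in[(c * H + oy + ky) * W + (ox + kx)] *
                               wgt[(c * K + ky) * K + kx];
            out[oy * OW + ox] = acc;                  /* one MAC reduction per output pixel */
        }
}
```

Each output pixel is an independent multiply-accumulate reduction, which is the parallelism the convolution engine spreads across its DSP slices.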
Citations: 18
Accelerating the mining of influential nodes in complex networks through community detection
Pub Date: 2016-05-16 DOI: 10.1145/2903150.2903181
M. Halappanavar, A. Sathanur, A. Nandi
Computing the set of influential nodes of a given size, which when activated will ensure maximal spread of influence on a complex network, is a challenging problem impacting multiple applications. A rigorous approach to influence maximization involves utilization of optimization routines that come with a high computational cost. In this work, we propose to exploit the existence of communities in complex networks to accelerate the mining of influential seeds. We provide intuitive reasoning to explain why our approach should be able to provide speedups without significantly degrading the extent of the spread of influence when compared to the case of influence maximization without using the community information. Additionally, we have parallelized the complete workflow by leveraging an existing parallel implementation of the Louvain community detection algorithm. We then conduct a series of experiments on a dataset with three representative graphs to first verify our implementation and then demonstrate the speedups. Our method achieves speedups ranging from 3x to 28x for graphs with a small number of communities while nearly matching or even exceeding the activation performance on the entire graph. Complexity analysis reveals that dramatic speedups are possible for larger graphs that contain a correspondingly larger number of communities. In addition to the speedups obtained from the utilization of the community structure, scalability results show up to 6.3x speedup on 20 cores relative to the baseline run on 2 cores. Finally, current limitations of the approach are outlined along with the planned next steps.
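To make the acceleration idea concrete, the toy sketch below restricts candidate-seed selection to each community and then merges the per-community picks. Choosing the highest-degree vertex is only a stand-in for the influence-maximization routine the paper runs inside each Louvain community, and the graph data is invented for the example.

```c
/* Pick one candidate seed per community; the per-community searches are
 * independent, which is where the speedup (and parallelism) comes from. */
#include <stdio.h>

#define N_VERTS 8
#define N_COMM  3

int main(void)
{
    /* Illustrative community labels and degrees for 8 vertices. */
    int community[N_VERTS] = {0, 0, 0, 1, 1, 1, 2, 2};
    int degree[N_VERTS]    = {3, 1, 2, 5, 2, 1, 4, 4};

    int seed[N_COMM];
    for (int c = 0; c < N_COMM; c++) {       /* independent per-community work */
        int best = -1;
        for (int v = 0; v < N_VERTS; v++)
            if (community[v] == c && (best < 0 || degree[v] > degree[best]))
                best = v;
        seed[c] = best;                      /* crude proxy for the community's top seed */
    }

    printf("candidate seeds:");
    for (int c = 0; c < N_COMM; c++)
        printf(" %d", seed[c]);
    printf("\n");
    return 0;
}
```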
Citations: 11