
Latest publications from the 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Fault Detection and Localization for Network-on-Chips in Mixed-Criticality Systems
Adele Maleki, Hamidreza Ahmadian, R. Obermaisser
The increasing trend towards mixed-criticality systems, in which applications with different levels of criticality coexist and interact on the same platform, calls for fault-tolerant hardware platforms. At the same time, due to the performance demanded in such systems, networks-on-chip are employed to interconnect several computation resources. Consequently, the detection and localization of faults in the communication and computation resources becomes a challenge when a high number of shared resources (e.g., routers, physical links) is used. This paper proposes a new hardware architecture for run-time fault detection and localization in mixed-criticality networks-on-chip. The proposed architecture detects transient and permanent faults in the network and distinguishes between faults of different resources. The fault detection and localization mechanisms have been evaluated using Gem5 simulation and example scenarios.
DOI: 10.1109/MCSoC.2019.00038
Citations: 1
Data-Driven Scenario-Based Application Mapping for Heterogeneous Many-Core Systems
J. Spieck, S. Wildermann, T. Schwarzer, J. Teich, M. Glaß
For applications whose workload and execution behavior vary significantly with the input, a single mapping of application tasks to a given target architecture is insufficient. A single mapping may deliver a high-quality solution for the average case but rarely exploits the specific execution behavior of the concurrent tasks triggered by each input tuple. For example, tasks with higher computational demands under certain inputs should be mapped onto the high-performance resources of a heterogeneous architecture. This necessitates mappings that are specialized for specific input data. Yet, due to the large number of input combinations, determining a separate optimized mapping for each individual input workload is not feasible for most applications. As a remedy, we propose to group input data with similar execution characteristics into a small number of so-called workload scenarios for which we supply optimized mappings. In this paper, we provide a data-driven approach for detecting workload scenarios and exploring scenario-optimized mappings based on a collection of input data. The identification of scenarios and the determination of optimized mappings are interdependent: for the data-driven identification of workload scenarios, we have to measure profiles when executing the application with the given input data under different application mappings; however, to derive scenario-optimized application mappings, the workload scenarios have to be known. We tackle this interdependence by proposing a cyclic design methodology that optimizes both aspects in an iterative fashion. We show that with our approach, the latency of two exemplary applications, a ray tracing and an image stitching application, can be significantly improved compared to methods that ignore workload scenarios or do not perform the proposed iterative refinement. Furthermore, we demonstrate that our proposal can be used in the context of a hybrid application mapping methodology.
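The cyclic interdependence in this abstract (scenarios require profiled mappings; mappings require known scenarios) can be illustrated with a small sketch. This is not the authors' implementation: the cost model, the (speed, overhead) mapping candidates, and all names below are hypothetical stand-ins for real profiling and mapping optimization.

```python
def latency(workload, mapping):
    # Hypothetical cost model: a mapping is a (speed, overhead) pair;
    # a "fast" mapping helps heavy workloads but carries a fixed overhead.
    speed, overhead = mapping
    return workload / speed + overhead

def optimize(inputs, candidates, k=2, iters=5):
    """Iteratively co-optimize scenario assignment and per-scenario mapping."""
    n = len(inputs)
    # Initialize scenarios by workload rank (light inputs together, heavy together).
    rank = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: inputs[i]))}
    assign = [rank[i] * k // n for i in range(n)]
    mappings = list(candidates[:k])
    for _ in range(iters):
        # (1) Scenario-optimized mappings: best candidate per scenario.
        for s in range(k):
            members = [inputs[i] for i in range(n) if assign[i] == s]
            if members:
                mappings[s] = min(candidates,
                                  key=lambda m: sum(latency(w, m) for w in members))
        # (2) Re-assign each input to the scenario whose mapping serves it best.
        assign = [min(range(k), key=lambda s: latency(inputs[i], mappings[s]))
                  for i in range(n)]
    return assign, mappings
```

With inputs `[1, 2, 8, 10]` and candidates `[(1.0, 0.0), (4.0, 3.0)]`, the loop settles on grouping the light inputs under the low-overhead mapping and the heavy inputs under the high-performance one.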
DOI: 10.1109/MCSoC.2019.00054
Citations: 6
A Cloud Based Super-Optimization Method to Parallelize the Sequential Code's Nested Loops
Amin Majd, Mohammad Loni, Golnaz Sahebi, M. Daneshtalab, E. Troubitsyna
Advances in multi-core hardware architecture have made parallel computing ubiquitous. To achieve maximum utilization of multi-core processors, parallel programming techniques are required. However, parallel programming faces several challenges, which fall into three major groups. First, although recent advances in parallel programming languages and frameworks (e.g., MPI, OpenCL) assist developers, parallel programming is still unappealing to most programmers. Second, there is a massive volume of legacy software written in serial mode, and converting millions of lines of serial code to parallel code is highly time-consuming and requires a huge verification effort. Third, producing software and applications in parallel mode is very expensive, since it requires knowledge and expertise. Super-optimization, as provided by super compilers, is the process of automatically identifying dependent and independent instructions to find data dependencies and loop-free instruction sequences; the super compiler then runs these instructions on different processors in parallel where possible. Super-optimization is thus a feasible way to relieve the programmer of the parallel programming workload. Since most of the complexity of sequential code lies in nested loops, we parallelize nested loops using the idea of super-optimization. One of the underlying stages of super-optimization is scheduling the tiled iteration space of nested loops. Since this problem is NP-hard, traditional optimization methods are not feasible. In this paper, we propose a cloud-based super-optimization method offered as Software-as-a-Service (SaaS) to reduce the cost of parallel programming; in addition, it increases the utilization of the processing capacity of multi-core processors. As a result, an intermediate programmer can use the whole processing capacity of his or her system without knowing anything about writing parallel code or about super compiler internals, by sending serial code to a cloud server and receiving the parallel version of the code back. An evolutionary algorithm is leveraged to solve the tile scheduling problem. Our proposed super-optimization method will be served as software under a hybrid (public and private) deployment model.
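As a minimal, hypothetical illustration (not the paper's system) of the tiling step this abstract refers to, the 2-D iteration space of a nested loop can be split into tiles; when the per-tile work is data-independent, as in the element-wise addition below, the tiles could be scheduled in any order or in parallel across cores or cloud workers.

```python
def tiles(n, m, tile):
    """Yield the tile origins covering an n x m iteration space."""
    for ti in range(0, n, tile):
        for tj in range(0, m, tile):
            yield ti, tj

def run_tile(a, b, out, ti, tj, tile):
    # One tile of the nested loop; element-wise addition here, so each
    # tile is an independent unit of work.
    n, m = len(out), len(out[0])
    for i in range(ti, min(ti + tile, n)):
        for j in range(tj, min(tj + tile, m)):
            out[i][j] = a[i][j] + b[i][j]

n = m = 4
a = [[1] * m for _ in range(n)]
b = [[2] * m for _ in range(n)]
out = [[0] * m for _ in range(n)]
# Tiles could be dispatched to separate workers; here they run serially.
for ti, tj in tiles(n, m, tile=2):
    run_tile(a, b, out, ti, tj, 2)
```

Scheduling such tiles on heterogeneous resources subject to dependencies is the NP-hard problem the paper attacks with an evolutionary algorithm.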
DOI: 10.1109/MCSoC.2019.00047
Citations: 2
Real-Time Attitude Estimation of Sigma-Point Kalman Filter via Matrix Operation Accelerator
Zeyang Dai, Lei Jing
Attitude estimation is an important part of mobile robot navigation and unmanned aerial vehicle (UAV) control. Although the Extended Kalman Filter (EKF) is typically used, the trend is to use the Sigma-Point Kalman Filter (SPKF) instead, due to its higher accuracy and robustness in harsh environments. The only drawback of such a system is its higher computation cost. To accelerate the system, most approaches proposed in the past are based on Field Programmable Gate Arrays (FPGAs) but are too application-specific, making them non-reusable and costly in design complexity. Aiming at reusability, we present an IP core called a matrix operation accelerator in this paper. We verified the design on a Zynq-7020; the experimental results show that the proposed scheme reduces computing time by about 50% and saves silicon area as well.
DOI: 10.1109/MCSoC.2019.00055
Citations: 4
Fault-Tolerant Traffic-Aware Routing Algorithm for 3-D Photonic Networks-on-Chip
M. Meyer, Yu Wang, Takahiro Watanabe
As the number of cores on a single chip increased, the inter-core communication system quickly became the performance bottleneck. In order to solve the performance and scalability issues of bus-based systems, Network-on-chip (NoC) was proposed. This eventually met its own bottleneck and several technologies sprouted out from NoC research. The most commonly researched upgrade to NoCs was 3D NoCs, which utilized stacked routers to reduce the maximum hop count. Other researchers have looked at alternative transmission mediums, such as photonics. These technologies can be combined to give great performance and power benefits but can be slowed down by congestion in their path-setup phase. In order to solve this issue, we propose a traffic-aware routing algorithm that can evenly distribute the traffic throughout the chip, all while simultaneously avoiding faulty nodes. The results show that the proposed algorithm was successful in balancing the load across the chip and that the performance costs of the algorithm were mostly offset by the benefits of reducing blocked paths.
DOI: 10.1109/MCSoC.2019.00032
Citations: 3
Exploiting Model-Level Parallelism in Recurrent Neural Network Accelerators
Lu Peng, Wentao Shi, Jian Zhang, Samuel Irving
Recurrent Neural Networks (RNNs) have continued to facilitate rapid progress in a variety of academic and industrial fields, though their complexity continues to make efficient deployment difficult; when the RNN model size is not properly matched to hardware resources, performance can suffer from hardware under-utilization. In this work, we explore model-level parallelism for LSTM-RNN accelerators at different levels of the model using a multi-core design. The proposed multi-core design operates in three computing modes: multi-programming mode, in which independent models are executed; multithreading mode, in which parallelism among the layers of an LSTM model is explored and properly scheduled; and helper-core mode, in which cores collaborate on a single LSTM layer at a lower model level than in multithreading mode. Our design achieves up to a 1.98x speedup in "multi-programming" mode, a 1.91x speedup in "multithreading" mode, and a 1.88x speedup in "helper-core" mode over the single-core design.
DOI: 10.1109/MCSoC.2019.00042
Citations: 7
Design-Time Memory Subsystem Optimization for Low-Power Multi-Core Embedded Systems
Manuel Strobel, M. Radetzki
Embedded multi-core systems are increasingly in use. As established single-core design methodologies are often not applicable out of the box, novel design-time optimization methods are required to manage real-time characteristics, predictability, and tight constraints on energy consumption and system performance. Focusing on the memory subsystem of a multi-core embedded system, this paper proposes an optimization workflow for the application-specific optimal binding of code and data to memory instances, efficient handling and scheduling of available memory low-power modes, and the automated and transparent integration of these optimization results at the software level. The presented optimization algorithms are realized as integer linear programs; code modification and generation are implemented on the basis of LLVM. Experimental results for an ARM-based quad-core platform with an SRAM memory subsystem, consisting of core-local scratchpad memories and global shared memory, prove the efficiency of our method in terms of energy consumption compared to a system using direct-mapped caches, as well as in comparison with a state-of-the-art scratchpad mapping heuristic.
DOI: 10.1109/MCSoC.2019.00056
Citations: 7
Optimization of Numerous Small Dense-Matrix–Vector Multiplications in H-Matrix Arithmetic on GPU
S. Ohshima, I. Yamazaki, Akihiro Ida, Rio Yokota
Dense-matrix–vector multiplication is one of the well-known important matrix calculations. It is provided as the general matrix–vector multiplication (GEMV) function in basic linear algebra subprograms (BLAS) libraries for various computing hardware. Traditionally, studies have focused on one large dense-matrix–vector multiplication (where each side of the dense matrix is long). However, some applications require the acceleration of numerous small dense-matrix–vector multiplications; this feature is provided by batched BLAS libraries, and this calculation is also needed to compute a hierarchical-matrix–vector multiplication. In this study, we implemented numerous small dense-matrix–vector multiplications on a Pascal GPU and evaluated the performance. We considered the impact of optimization parameters and succeeded in obtaining better performance than previous works: the maximum improvement over our previous work is 28.47%, and over the batched GEMV of MAGMA BLAS up to 81.81%. Moreover, we considered the use of two optimization parameters in one GPU kernel, with one parameter applied to some matrices and the second applied to the others. Although the resulting improvement was limited (up to 5%), a performance gain was achieved. Our results will serve as a good reference for users who need to run numerous small dense-matrix–vector multiplications on a GPU and want to optimize matrix–vector multiplication by hand-tuning and auto-tuning.
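The batched pattern this abstract targets, many independent small products y_k = A_k x_k issued as one batched operation rather than one GEMV launch each, can be sketched on the CPU with NumPy. This is only a stand-in for the GPU kernels evaluated in the paper; the actual batched GEMV API of MAGMA BLAS differs.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n = 1000, 8                      # many small n x n matrices
A = rng.standard_normal((batch, n, n))
x = rng.standard_normal((batch, n))

# Naive form: one matrix-vector product per matrix (one "launch" each).
y_loop = np.stack([A[k] @ x[k] for k in range(batch)])

# Batched form: a single einsum over the whole batch, amortizing overhead.
y_batched = np.einsum('kij,kj->ki', A, x)

assert np.allclose(y_loop, y_batched)
```

On a GPU, the analogous win comes from one kernel processing all matrices, with tuning parameters (e.g., threads per matrix) chosen per matrix size, which is what the paper explores.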
DOI: 10.1109/MCSoC.2019.00009
Citations: 4
Algorithm to Determine Extended Edit Distance between Program Codes
Kazuki Anzai, Y. Watanobe
An algorithm to determine the extended edit distance between program codes is presented. In addition to the conventional Levenshtein distance, the extended edit distance considers some operations common to program code in order to find similar programs more accurately. To calculate the distance, the algorithm employs dynamic programming techniques as well as an algorithm for solving the minimum-cost flow problem on a bipartite graph. In this paper, details of the algorithm and experimental results are presented. The experiments were conducted on source code submitted to an online judge system, which stores a number of source codes for each programming problem. The results show that the proposed algorithm can find, with higher probability, source code that cannot be found by the conventional Levenshtein distance.
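The conventional Levenshtein distance that the extended edit distance builds on is computed with a standard dynamic-programming recurrence. A minimal sketch of that baseline follows; the paper's additional program-specific operations and minimum-cost-flow matching are not reproduced here:

```python
def levenshtein(a: str, b: str) -> int:
    """Conventional Levenshtein distance via dynamic programming.

    dp[i][j] is the minimum number of insertions, deletions, and
    substitutions needed to turn a[:i] into b[:j]. The paper's extended
    edit distance adds further operations on top of this baseline.
    """
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[m][n]

assert levenshtein("kitten", "sitting") == 3
```

For program code the same recurrence is typically run over token sequences rather than raw characters, which is one reason extensions beyond plain Levenshtein are needed.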
DOI: 10.1109/MCSoC.2019.00033
Citations: 6
Automatic Generation of Fill-in-the-Blank Programming Problems
Kenta Terada, Y. Watanobe
In solving programming problems, it is difficult for beginners to create program code from scratch. One way to ease this difficulty is to provide them with a programming problem in a fill-in-the-blank format. In this work, we propose a method to automatically generate such programming problems, with two key constituents: selection of exemplary source code and selection of the places to be blanked. For selecting exemplary source code, k-means clustering with silhouette analysis over an Online Judge System (OJ) is proposed. For selecting the places to be blanked, a model based on a bidirectional Long Short-Term Memory network (Bi-LSTM) with a sequential Conditional Random Field (CRF) is proposed. We discuss the evaluation of the proposed approach in the context of how the fill-in-the-blank programming problems are generated.
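To make the second constituent concrete — turning selected positions into blanks — here is a hypothetical rule-based stand-in. The paper selects blank positions with a Bi-LSTM + CRF model; this toy version instead masks every occurrence of a given set of tokens, and the `targets` set and `___N___` blank format are illustrative choices, not the paper's:

```python
import re

def make_fill_in_blank(code, targets):
    """Turn source code into a fill-in-the-blank problem by masking tokens.

    Hypothetical rule-based stand-in for the paper's learned blank
    selection: every whole-word occurrence of a token in `targets` is
    replaced by a numbered blank, and the answers are collected in order.
    """
    answers = []

    def mask(match):
        answers.append(match.group(0))
        return f"___{len(answers)}___"

    pattern = r"\b(" + "|".join(re.escape(t) for t in targets) + r")\b"
    return re.sub(pattern, mask, code), answers

problem, answers = make_fill_in_blank(
    "for i in range(n):\n    total += i", ["range", "total"]
)
# problem: "for i in ___1___(n):\n    ___2___ += i"
# answers: ["range", "total"]
```

A learned selector, unlike this sketch, can pick blanks that exercise the concept a problem is meant to teach rather than fixed token names.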
DOI: 10.1109/MCSoC.2019.00034
Citations: 10