Maximum performance computing for exascale applications
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404150
O. Mencer
Summary form only given. Ever since Fermi, Pasta, and Ulam conducted the first fundamentally important numerical experiments in 1953, science has been driven by the progress of available computational capability. In particular, computational quantum chemistry and computational quantum physics depend on ever-increasing amounts of computation. However, due to power density limitations at the chip level, we have seen the end of single-CPU performance scaling. Now the challenge is to improve compute performance through some form of parallel processing without hitting power limits at the system level. One way to deal with the system "power wall" is to ask: what is the maximum amount of computation that can be achieved within a certain power budget? We argue that such Maximum Performance Computing needs to focus on the end-to-end execution time of complete scientific applications and needs to take a multi-disciplinary approach, bringing together scientists and engineers to optimize the whole process, from mathematics and algorithms all the way down to arithmetic and number representation. We have conducted a number of such multidisciplinary studies with our customers (Chevron, Schlumberger, and JP Morgan). Our current results with Maxeler Dataflow Engines for production PDE solver applications in Earth Sciences and Finance show an improvement of 20-40x in speed and/or power per application run.
{"title":"Maximum performance computing for exascale applications","authors":"O. Mencer","doi":"10.1109/SAMOS.2012.6404150","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404150","url":null,"abstract":"Summary form only given. Ever since Fermi, Pasta and Ulam conducted the first fundamentally important numerical experiments in 1953, science has been driven by the progress of available computational capability. In particular, computational quantum chemistry and computational quantum physics depend on ever increasing amounts of computation. However, due to power density limitations at the chip we have seen the end of single CPU performance scaling. Now the challenge is to improve compute performance through some form of parallel processing without incurring power limits at the system level. One way to deal with the system “power wall” question is to ask “what is the maximum amount of computation that can be achieved within a certain power budget”. We argue that such Maximum Performance Computing needs to focus on end-to-end execution time of complete scientific applications and needs to include a multi-disciplinary approach, bringing together scientists and engineers to optimize the whole process from mathematics and algorithms all the way down to arithmetic and number representation. We have done a number of such multidisciplinary studies with our customers (Chevron, Schlumberger, and JP Morgan). Our current results with Maxeler Dataflow Engines for production PDE solver applications in Earth Sciences and Finance show an improvement of 20-40x in Speed and/or Watts per application run.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"6 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132779564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient asymmetric distributed lock for embedded multiprocessor systems
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404172
J. Rutgers, M. Bekooij, G. Smit
Efficient synchronization is a key concern in an embedded many-core system-on-chip (SoC). The use of atomic read-modify-write instructions combined with cache coherency as a synchronization primitive is not always an option for shared-memory SoCs, due to the lack of suitable IP. Furthermore, there are doubts about the scalability of hardware cache coherency protocols. Existing distributed locks for NUMA multiprocessor systems do not rely on cache coherency and are more scalable, but exchange many messages per lock. This paper introduces an asymmetric distributed lock algorithm for shared-memory embedded multiprocessor systems without hardware cache coherency. Messages are exchanged via a low-cost inter-processor communication ring in combination with a small local memory per processor. Typically, a mutex is used over and over again by the same process, which our algorithm exploits. As a result, the number of messages exchanged per lock is significantly reduced. Experiments with our 32-core system show that when locks are kept in SDRAM, 35% of the memory traffic is lock-related. Our solution eliminates all of this traffic and reduces execution time by up to 89%.
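To make the asymmetry concrete, here is a minimal single-address-space sketch of the idea as described above: the last owner of a mutex can re-acquire it without any communication, and only a competing core pays for an ownership transfer over the ring. All names (asym_lock_t, ring_send) and the message counting are illustrative assumptions, not the paper's actual protocol or API.

```c
#include <stdio.h>

typedef struct {
    int owner;   /* core that currently owns the mutex */
    int locked;  /* 1 while held */
} asym_lock_t;

static int ring_msgs = 0;

/* Stand-in for a message over the inter-processor communication ring. */
static void ring_send(int from, int to) { (void)from; (void)to; ring_msgs++; }

static void lock(asym_lock_t *l, int core) {
    if (l->owner != core) {       /* ownership transfer: costs ring traffic */
        ring_send(core, l->owner);
        l->owner = core;
    }
    l->locked = 1;                /* re-acquisition by the owner is free */
}

static void unlock(asym_lock_t *l) { l->locked = 0; }

int main(void) {
    asym_lock_t m = { .owner = 0, .locked = 0 };
    for (int i = 0; i < 1000; i++) { lock(&m, 0); unlock(&m); }  /* same core */
    lock(&m, 1); unlock(&m);                                     /* competitor */
    printf("ring messages: %d\n", ring_msgs);  /* prints 1: one transfer, not one message per acquisition */
    return 0;
}
```

The point of the asymmetry shows up in the counter: a thousand re-acquisitions by the owning core cost no messages, while the single handover to another core costs one.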
{"title":"An efficient asymmetric distributed lock for embedded multiprocessor systems","authors":"J. Rutgers, M. Bekooij, G. Smit","doi":"10.1109/SAMOS.2012.6404172","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404172","url":null,"abstract":"Efficient synchronization is a key concern in an embedded many-core system-on-chip (SoC). The use of atomic read-modify-write instructions combined with cache coherency as synchronization primitive is not always an option for shared-memory SoCs due to the lack of suitable IP. Furthermore, there are doubts about the scalability of hardware cache coherency protocols. Existing distributed locks for NUMA multiprocessor systems do not rely on cache coherency and are more scalable, but exchange many messages per lock. This paper introduces an asymmetric distributed lock algorithm for shared-memory embedded multiprocessor systems without hardware cache coherency. Messages are exchanged via a low-cost inter-processor communication ring in combination with a small local memory per processor. Typically, a mutex is used over and over again by the same process, which is exploited by our algorithm. As a result, the number of messages exchanged per lock is significantly reduced. Experiments with our 32-core system show that when having locks in SDRAM, 35% of the memory traffic is lock related. In comparison, our solution eliminates all of this traffic and reduces the execution time by up to 89%.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122696336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A quantitative analysis of fixed-point LDPC-decoder implementations using hardware-accelerated HDL emulations
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404189
Matthias Korb, T. Noll
Using hardware-accelerated HDL emulators of fixed-point implementations has several advantages over C-based simulations: the high degree of parallelism of, for example, field-programmable gate-array based hardware accelerators promises increased emulation throughput. Furthermore, the HDL model of the considered circuit can be reused in the subsequent design process, making additional verification dispensable. For a system analysis of different low-density parity-check (LDPC) decoders, such an emulator is practically inevitable from a throughput perspective: the outstanding error correction capability of those decoders, allowing for bit-error rates (BER) well below 10^-10, requires simulating the decoding of billions of blocks. In this work, an HDL-based emulator is used. The designed HDL model is highly parameterizable and includes an LDPC decoder and high-quality Box-Muller-based white Gaussian noise generators to create rare error events. Using this emulator, a comparison of the decoding capability of different fixed-point decoder implementations has been performed. Additionally, accurate cost models are used to estimate the hardware costs of the different decoder implementations, enabling the identification of Pareto-optimal decoder implementations. Finally, the achievable emulator throughput is discussed and compared to the simulation throughput of a speed-optimized C model.
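For reference, the Box-Muller transform behind such noise generators maps two uniform variates to one standard normal sample. The floating-point C sketch below shows the textbook form; the emulator's hardware generator is a fixed-point pipeline, which this sketch deliberately does not model.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Box-Muller: two uniform samples in (0,1] -> one N(0,1) sample. */
static double box_muller(void) {
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 1.0);  /* avoids log(0) */
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 1.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}

int main(void) {
    /* AWGN channel sample: received = transmitted + sigma * noise.
       The BPSK symbol and sigma are illustrative values. */
    double sigma = 0.5, x = 1.0;
    for (int i = 0; i < 5; i++)
        printf("%f\n", x + sigma * box_muller());
    return 0;
}
```

The throughput argument also follows from simple arithmetic: observing errors at a BER of 10^-10 with any statistical confidence requires on the order of 10^11 decoded bits, which is why pure software simulation is impractical at this operating point.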
{"title":"A quantitative analysis of fixed-point LDPC-decoder implementations using hardware-accelerated HDL emulations","authors":"Matthias Korb, T. Noll","doi":"10.1109/SAMOS.2012.6404189","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404189","url":null,"abstract":"Using hardware-accelerated HDL emulators of fixed-point implementations has several advantages in comparison to C-based simulations: The high degree of parallelism for example of field-programmable gate-array based hardware accelerators promise an increased emulation throughput. Furthermore, the HDL model of the considered circuit can be used in the following design process making an additional verification dispensable. For a system analysis of different low-density parity-check (LDPC) decoders such an emulator is practically inevitable from a throughput perspective: the outstanding error correction capability of those decoders allowing for bit-error rates (BER) of well below 10-10 requires a simulative decoding of billions of blocks. In this work, an HDL-based emulator is used. The designed HDL model is highly parameterizable and includes an LDPC decoder and high-quality Box-Muller-based white Gaussian-noise generators to create rare error-events. Using this emulator a comparison of the decoding capability of different fixed-point decoder implementations has been performed. Additionally, accurate cost-models are used for estimating the hardware costs of the different decoder implementations which enable an identification of Pareto-optimal decoder implementations. Finally, the achievable emulator throughput is discussed and compared to the simulation throughput of a speed optimized C-model.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128706442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predictable dynamic embedded data processing
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404194
M. Geilen, S. Stuijk, T. Basten
Cyber-physical systems interact with their physical environment. In this interaction, non-functional aspects, most notably timing, are essential to correct operation. In modern systems, dynamism is introduced in many different ways. The additional complexity threatens timely development and reliable operation. Applications often have different modes of operation with different resource requirements and different levels of required quality-of-service. Moreover, multiple applications in dynamically changing combinations share a platform and its resources. To preserve efficient development of such systems, dynamism needs to be taken into account as a primary concern, not as a verification or tuning effort after the design is done. This requires a model-driven design approach in which timing of interaction with the physical environment is taken into consideration; formal models capture applications and their platforms in the physical environment. Moreover, platforms with resources and resource arbitration are needed that allow for predictable and reliable behavior to be realized. Run-time management is further required to deal with dynamic use-cases and dynamic trade-offs encountered at run-time. In this paper, we present a model-driven approach that combines model-based design and synthesis with development of platforms that support predictable, repeatable, composable realizations and a run-time management approach to deal with dynamic use-cases at run-time. A formal, compositional model is used to exploit Pareto-optimal trade-offs in the system use. The approach is illustrated with dataflow models with dynamic application scenarios, a predictable platform architecture and run-time resource management that determines optimal trade-offs through an efficient knapsack heuristic.
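To give a flavour of the run-time trade-off step, the sketch below treats resource allocation as a multiple-choice knapsack: each application exposes Pareto points (quality versus resource cost), and the manager greedily upgrades whichever application offers the best quality gain per unit of extra cost within a budget. The numbers and names are invented for illustration; this is the general flavour of such a heuristic, not the paper's actual algorithm.

```c
#include <stdio.h>

#define APPS 2
#define PTS  3

typedef struct { double quality; double cost; } point_t;

/* Pareto points per application, cheapest first (illustrative numbers). */
static const point_t pts[APPS][PTS] = {
    { {0.4, 1.0}, {0.7, 2.0}, {0.9, 4.0} },
    { {0.3, 1.0}, {0.6, 3.0}, {0.8, 5.0} },
};

int main(void) {
    double budget = 6.0, used = 0.0;
    int sel[APPS] = {0, 0};            /* start every app at its cheapest point */

    for (int a = 0; a < APPS; a++) used += pts[a][0].cost;

    /* Greedy upgrades: always take the best quality gain per unit cost. */
    for (;;) {
        int best_a = -1; double best_ratio = 0.0;
        for (int a = 0; a < APPS; a++) {
            if (sel[a] + 1 >= PTS) continue;
            double dq = pts[a][sel[a] + 1].quality - pts[a][sel[a]].quality;
            double dc = pts[a][sel[a] + 1].cost    - pts[a][sel[a]].cost;
            if (used + dc <= budget && dq / dc > best_ratio) {
                best_ratio = dq / dc; best_a = a;
            }
        }
        if (best_a < 0) break;         /* no affordable upgrade left */
        used += pts[best_a][sel[best_a] + 1].cost - pts[best_a][sel[best_a]].cost;
        sel[best_a]++;
    }
    for (int a = 0; a < APPS; a++)
        printf("app %d: point %d (quality %.1f)\n", a, sel[a], pts[a][sel[a]].quality);
    return 0;
}
```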
{"title":"Predictable dynamic embedded data processing","authors":"M. Geilen, S. Stuijk, T. Basten","doi":"10.1109/SAMOS.2012.6404194","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404194","url":null,"abstract":"Cyber-physical systems interact with their physical environment. In this interaction, non-functional aspects, most notably timing, are essential to correct operation. In modern systems, dynamism is introduced in many different ways. The additional complexity threatens timely development and reliable operation. Applications often have different modes of operation with different resource requirements and different levels of required quality-of-service. Moreover, multiple applications in dynamically changing combinations share a platform and its resources. To preserve efficient development of such systems, dynamism needs to be taken into account as a primary concern, not as a verification or tuning effort after the design is done. This requires a model-driven design approach in which timing of interaction with the physical environment is taken into consideration; formal models capture applications and their platforms in the physical environment. Moreover, platforms with resources and resource arbitration are needed that allow for predictable and reliable behavior to be realized. Run-time management is further required to deal with dynamic use-cases and dynamic trade-offs encountered at run-time. In this paper, we present a model-driven approach that combines model-based design and synthesis with development of platforms that support predictable, repeatable, composable realizations and a run-time management approach to deal with dynamic use-cases at run-time. A formal, compositional model is used to exploit Pareto-optimal trade-offs in the system use. The approach is illustrated with dataflow models with dynamic application scenarios, a predictable platform architecture and run-time resource management that determines optimal trade-offs through an efficient knapsack heuristic.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"81 1-2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116733115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient system design using the Statistical Analysis of Architectural Bottlenecks methodology
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404177
Manish Arora, Feng Wang, Bob Rychlik, D. Tullsen
CPU design involves a large set of increasingly complex design decisions. Doing full, accurate simulation of all possible designs is typically not feasible. Prior techniques for sensitivity analysis seek to identify the most critical design parameters, but struggle to handle the growing design space well. They can be overly sensitive to the starting fixed point of the design, can still require a large number of simulations, and do not necessarily account for the cost of each design parameter. The Statistical Analysis of Architectural Bottlenecks (SAAB) methodology simultaneously analyzes multiple parameters and requires a small number of experiments. SAAB leverages the Plackett and Burman analysis method, but builds upon the technique in two specific ways: it allows a parameter to take multiple values, and it replaces the unit-less impact factor with a cost-proportional impact value. This paper applies the SAAB methodology to the design of a mobile processor sub-system, considering area and power cost models for the design.
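A Plackett-Burman design estimates each parameter's main effect from one simulation per design row rather than one per parameter combination. The sketch below runs the standard 8-run design over 7 binary parameters against a toy performance model, then divides each effect by a per-parameter cost in the spirit of SAAB's cost-proportional impact value; the perf() model and the cost numbers are made up for illustration.

```c
#include <stdio.h>

#define RUNS    8
#define FACTORS 7

/* Standard PB-8 design: cyclic shifts of this generator row plus an all-minus row. */
static const int gen[FACTORS] = { +1, +1, +1, -1, +1, -1, -1 };

/* Toy performance model: factor 2 matters most, factor 5 a little. */
static double perf(const int x[FACTORS]) {
    return 10.0 + 3.0 * x[2] + 1.0 * x[5];
}

int main(void) {
    static const double cost[FACTORS] = { 1, 1, 4, 1, 2, 1, 1 }; /* e.g. mm^2, illustrative */
    int design[RUNS][FACTORS];
    double y[RUNS];

    /* Build the design matrix: rows 0..6 are cyclic shifts, row 7 is all -1. */
    for (int r = 0; r < RUNS - 1; r++)
        for (int f = 0; f < FACTORS; f++)
            design[r][f] = gen[(f + r) % FACTORS];
    for (int f = 0; f < FACTORS; f++) design[RUNS - 1][f] = -1;

    /* One "simulation" per row. */
    for (int r = 0; r < RUNS; r++) y[r] = perf(design[r]);

    /* Effect of each factor = mean(perf at +1) - mean(perf at -1),
       then scaled by cost to obtain a cost-proportional impact. */
    for (int f = 0; f < FACTORS; f++) {
        double effect = 0.0;
        for (int r = 0; r < RUNS; r++) effect += y[r] * design[r][f];
        effect /= RUNS / 2;
        printf("factor %d: effect %+5.2f, per-cost impact %+5.2f\n",
               f, effect, effect / cost[f]);
    }
    return 0;
}
```

With this toy model, factor 2 has the largest raw effect (6.0), but its high area cost pulls its per-cost impact below that of cheaper factors, which is exactly the re-ranking a cost-proportional impact value is meant to expose.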
{"title":"Efficient system design using the Statistical Analysis of Architectural Bottlenecks methodology","authors":"Manish Arora, Feng Wang, Bob Rychlik, D. Tullsen","doi":"10.1109/SAMOS.2012.6404177","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404177","url":null,"abstract":"CPU processor design involves a large set of increasingly complex design decisions Doing full, accurate simulation of all possible designs is typically not feasible. Prior techniques for sensitivity analysis seek to identify the most critical design parameters, but also struggle to handle the increasing design space well. They can be overly sensitive to the starting fixed point of the design, can still require a large number of simulations, and do not necessarily account for the cost of each design parameter. The Statistical Analysis of Architectural Bottlenecks (SAAB) methodology simultaneously analyzes multiple parameters and requires a small number of experiments. SAAB leverages the Plackett and Burman analysis method, but builds upon the technique in two specific ways. It allows a parameter to take multiple values and replaces the unit-less impact factor with a cost-proportional impact value. This paper applies the SAAB methodology to the design of a mobile processor sub-system. It considers area and power cost models for the design.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115667048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards future adaptive multiprocessor systems-on-chip: An innovative approach for flexible architectures
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404179
F. Lemonnier, P. Millet, G. M. Almeida, M. Hübner, J. Becker, S. Pillement, O. Sentieys, Martijn Koedam, Shubhendu Sinha, K. Goossens, C. Piguet, M. Morgan, R. Lemaire
This paper introduces adaptive techniques targeted at heterogeneous manycore architectures and presents the FlexTiles platform, which consists of general-purpose processors with dedicated accelerators. The components are based on low-power DSP cores and an eFPGA on which dedicated IPs can be configured dynamically at run-time. These features enable a breakthrough in computing performance while improving on-line adaptive capabilities through smart heuristics. We therefore propose a virtualisation layer that provides a higher abstraction level to mask the underlying heterogeneity of such architectures. Given the large variety of use cases these platforms must support and the resulting workload variability, offline approaches are no longer sufficient, because they cannot cope with time-varying workloads. The upcoming generation of applications includes smart cameras, drones, and cognitive radio. To facilitate architecture adaptation under different scenarios, we propose a programming model that considers both static and dynamic behaviors. This is combined with self-adaptive strategies supported by an operating system kernel that provides a set of functions guaranteeing quality of service (QoS) by implementing run-time adaptive policies. Dynamic adaptation will mainly be used to reduce overall power consumption and temperature, and to ease the problem of decreasing yield and reliability that results from submicron CMOS scaling.
{"title":"Towards future adaptive multiprocessor systems-on-chip: An innovative approach for flexible architectures","authors":"F. Lemonnier, P. Millet, G. M. Almeida, M. Hübner, J. Becker, S. Pillement, O. Sentieys, Martijn Koedam, Shubhendu Sinha, K. Goossens, C. Piguet, M. Morgan, R. Lemaire","doi":"10.1109/SAMOS.2012.6404179","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404179","url":null,"abstract":"This paper introduces adaptive techniques targeted for heterogeneous manycore architectures and introduces the FlexTiles platform, which consists of general purpose processors with some dedicated accelerators. The different components are based on low power DSP cores and an eFPGA on which dedicated IPs can be dynamically configured at run-time. These features enable a breakthrough in term of computing performance while improving the on-line adaptive capabilities brought from smart heuristics. Thus, we propose a virtualisation layer which provides a higher abstraction level to mask the underlying heterogeneity present in such architectures. Given the large variety of possible use cases that these platforms must support and the resulting workload variability, offline approaches are no longer sufficient because they do not allow coping with time changing workloads. The upcoming generation of applications include smart cameras, drones, and cognitive radio. In order to facilitate the architecture adaptation under different scenarios, we propose a programming model that considers both static and dynamic behaviors. This is associated with self adaptive strategies endowed by an operating system kernel that provides a set of functions that guarantee quality of service (QoS) by implementing runtime adaptive policies. Dynamic adaptation will be mainly used to reduce both overall power consumption and temperature and to ease the problem of decreasing yield and reliability that results from submicron CMOS scales.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128070062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design space exploration in application-specific hardware synthesis for multiple communicating nested loops
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404166
R. Corvino, A. Gamatie, M. Geilen, L. Józwiak
Application-specific MPSoCs are often used to implement high-performance data-intensive applications. MPSoC design requires rapid and efficient exploration of the hardware architecture possibilities to adequately orchestrate the data distribution and the architecture of parallel MPSoC computing resources. Behavioral specifications of data-intensive applications are usually given in the form of loop-based sequential code, which requires parallelization and task scheduling for an efficient MPSoC implementation. Existing approaches in application-specific hardware synthesis use loop transformations to efficiently parallelize single nested loops, and use Synchronous Data Flow models to statically schedule and balance the data production and consumption of multiple communicating loops. This creates a separation between data- and task-parallelism analyses, which can reduce the opportunities for throughput optimization in high-performance data-intensive applications. This paper proposes a method for the concurrent exploration of data and task parallelism when using loop transformations to optimize data transfer and storage mechanisms for both single and multiple communicating nested loops. The method provides orchestrated application-specific decisions on communication architecture, memory hierarchy, and computing resource parallelism. It is computationally efficient and produces high-performance architectures.
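As a tiny illustration of the kind of loop transformation involved, fusing a producer loop with its consumer balances production and consumption per iteration and shrinks the intermediate buffer from an array to a scalar. The example below is schematic and not drawn from the paper.

```c
#include <stdio.h>

#define N 8

int main(void) {
    int in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = i;

    /* Unfused form: a producer loop filling tmp[N], then a consumer loop
       reading it, requires an N-element intermediate buffer.
       Fused form: producer and consumer advance together, so the buffer
       collapses to a single scalar held in a register. */
    for (int i = 0; i < N; i++) {
        int tmp = in[i] * 2;   /* producer: one element per iteration */
        out[i] = tmp + 1;      /* consumer: immediately uses that element */
    }

    for (int i = 0; i < N; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```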
{"title":"Design space exploration in application-specific hardware synthesis for multiple communicating nested loops","authors":"R. Corvino, A. Gamatie, M. Geilen, L. Józwiak","doi":"10.1109/SAMOS.2012.6404166","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404166","url":null,"abstract":"Application specific MPSoCs are often used to implement high-performance data-intensive applications. MPSoC design requires a rapid and efficient exploration of the hardware architecture possibilities to adequately orchestrate the data distribution and architecture of parallel MPSoC computing resources. Behavioral specifications of data-intensive applications are usually given in the form of a loop-based sequential code, which requires parallelization and task scheduling for an efficient MPSoC implementation. Existing approaches in application specific hardware synthesis, use loop transformations to efficiently parallelize single nested loops and use Synchronous Data Flows to statically schedule and balance the data production and consumption of multiple communicating loops. This creates a separation between data and task parallelism analyses, which can reduce the possibilities for throughput optimization in high-performance data-intensive applications. This paper proposes a method for a concurrent exploration of data and task parallelism when using loop transformations to optimize data transfer and storage mechanisms for both single and multiple communicating nested loops. This method provides orchestrated application specific decisions on communication architecture, memory hierarchy and computing resource parallelism. It is computationally efficient and produces high-performance architectures.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125131416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The homogeneity of architecture in a heterogeneous world
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404148
J. Goodacre
Summary form only given. It has long been accepted within embedded computing that using a heterogeneous core focused on a specific task can deliver improved performance and, in turn, improved power efficiency. The challenge has always been how the software programmer can integrate this hardware diversity when workloads become generalized or are unknown at design time. Library abstraction permits specific tasks to benefit from heterogeneity, but how can general-purpose code benefit? This talk will describe the hardware and software techniques currently being developed in next-generation ARM-based SoCs to address the challenge of maintaining the homogeneity of the software architecture while extending it to exploit the benefits of heterogeneity in hardware.
{"title":"The homogeneity of architecture in a heterogeneous world","authors":"J. Goodacre","doi":"10.1109/SAMOS.2012.6404148","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404148","url":null,"abstract":"Summary form only given. It has long been accepted within embedded computing that using a heterogeneous core focused to a specific task can deliver improved performance and subsequent improved power efficiency. The challenge has always been how can the software programmer integrate this hardware diversity as workloads become generalized or often unknown at design time? Using library abstraction permits specific tasks to benefit from heterogeneity, but how can general purpose code benefit? This talk will describe the hardware and software techniques currently being developed in next generation ARM based SoC to address the challenge of maintaining the homogeneity of the software architecture while extending to the benefits of heterogeneity in hardware.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"353 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125669846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Just-in-Time Verification in ADL-based processor design
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404151
Dominik Auras, Andreas Minwegen, Uwe Deidersen
A novel verification methodology, combining the two new techniques of Live Verification and Processor State Transfer, is introduced to Architecture Description Language (ADL) based processor design. The proposed Just-in-Time Verification significantly accelerates the simulation-based equivalence check of the register-transfer and instruction-set level models, generated from the ADL-based specification. This is accomplished by omitting redundant simulation steps occurring in the conventional architecture debug cycle. The potential speedup is demonstrated with a case study, achieving an acceleration of the debug cycle by 660x.
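Schematically, the combination can be read as: fast-forward on the cheap instruction-set-level model, transfer the architectural state into the register-transfer-level simulation, and compare the two models in lock-step only over the window being debugged, instead of re-simulating from reset. The C sketch below uses stub models and invented names to mimic that control flow; it is not the tool's actual API.

```c
#include <stdio.h>
#include <string.h>

#define NREGS 16

typedef struct { unsigned pc; unsigned regs[NREGS]; } arch_state_t;

/* Stubs standing in for the generated instruction-set and RTL models;
   here both compute the same deterministic update. */
static void iss_step(arch_state_t *s) { s->regs[0] += s->pc; s->pc += 4; }
static void rtl_step(arch_state_t *s) { s->regs[0] += s->pc; s->pc += 4; }

int main(void) {
    arch_state_t iss = {0}, rtl;

    /* 1. Fast-forward on the ISS: cheap, no RTL simulation cycles spent. */
    for (int i = 0; i < 1000000; i++) iss_step(&iss);

    /* 2. Processor state transfer: seed the RTL model with the architectural
          state just before the region of interest. */
    memcpy(&rtl, &iss, sizeof rtl);

    /* 3. Live verification: lock-step equivalence check over the suspect window. */
    for (int i = 0; i < 100; i++) {
        iss_step(&iss);
        rtl_step(&rtl);
        if (memcmp(&iss, &rtl, sizeof iss) != 0) {
            printf("divergence at pc=0x%x\n", rtl.pc);
            return 1;
        }
    }
    printf("window equivalent\n");
    return 0;
}
```

The speedup comes from step 1: the million warm-up steps run only on the fast model, and the expensive lock-step comparison is confined to the short window under debug.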
{"title":"Just-in-Time Verification in ADL-based processor design","authors":"Dominik Auras, Andreas Minwegen, Uwe Deidersen","doi":"10.1109/SAMOS.2012.6404151","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404151","url":null,"abstract":"A novel verification methodology, combining the two new techniques of Live Verification and Processor State Transfer, is introduced to Architecture Description Language (ADL) based processor design. The proposed Just-in-Time Verification significantly accelerates the simulation-based equivalence check of the register-transfer and instruction-set level models, generated from the ADL-based specification. This is accomplished by omitting redundant simulation steps occurring in the conventional architecture debug cycle. The potential speedup is demonstrated with a case study, achieving an acceleration of the debug cycle by 660x.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120937870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From Scilab to multicore embedded systems: Algorithms and methodologies
Pub Date: 2012-07-16 | DOI: 10.1109/SAMOS.2012.6404184
G. Goulas, P. Alefragis, N. Voros, Christos Valouxis, Christos G Gogos, N. Kavvadias, G. Dimitroulakos, K. Masselos, D. Göhringer, Steven Derrien, D. Ménard, O. Sentieys, M. Hübner, Timo Stripf, Oliver Oey, J. Becker, G. Rauwerda, K. Sunesen, D. Kritharidis, N. Mitas
While advances in processor architecture continue to increase hardware parallelism, parallel software creation remains hard. There is an increasing need for tools and methodologies that narrow the entry gap for non-experts in parallel software development and streamline the work of experts. This paper presents the methodology and algorithms for creating parallel software from Scilab source code for multicore embedded processors, in the context of the "Architecture oriented paraLlelization for high performance embedded Multicore systems using scilAb" (ALMA) EU FP7 project. In a nutshell, the ALMA parallelization approach attempts to manage the complexity of the task by alternating focus between very localized and holistic program optimization strategies.