
2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS): Latest Papers

Parallelism extraction in embedded software for Android devices
M. Aguilar, Juan Fernando Eusse Giraldo, Projjol Ray, R. Leupers, G. Ascheid, Weihua Sheng, Prashant Sharma
In recent years, the presence of embedded devices in everyday life has grown exponentially. The market for these devices imposes conflicting requirements on cost, performance, and energy. Multiprocessor Systems on Chip (MPSoCs) are a widely accepted way to trade off these demands. However, programming MPSoCs is still a cumbersome task. Several research efforts have addressed this challenge in two complementary directions: paradigms for parallel programming and tools for parallelism extraction. However, most of these efforts focus on the high-performance domain and do not consider the characteristics of the underlying platform. In this paper, we present an approach to extract multiple forms of parallelism from sequential C code, which is applied to widespread Android mobile devices. We show the effectiveness of our work by parallelizing relevant embedded benchmarks on a quad-core Nexus 7 tablet.
DOI: https://doi.org/10.1109/SAMOS.2015.7363654 (published 2015-07-19)
Citations: 9
GPU implementation of an anisotropic Huber-L1 dense optical flow algorithm using OpenCL
Duygu Buyukaydin, Toygar Akgün
Optical flow estimation aims at inferring a dense pixel-wise correspondence field between two images or video frames. It is commonly used in video processing and computer vision applications, including motion-compensated frame processing, extracting temporal features, computing stereo disparity, understanding scene context/dynamics and understanding behavior. Dense optical flow estimation is a computationally complex problem. Fortunately, a wide range of optical flow estimation algorithms are embarrassingly parallel and can be accelerated efficiently on GPUs. In this work we discuss a massively multi-threaded GPU implementation of the anisotropic Huber-L1 optical flow estimation algorithm using the OpenCL framework, which achieves per-frame execution time speed-up factors of up to almost 300×. The overall algorithm flow, GPU-specific implementation details, and performance results are presented.
DOI: https://doi.org/10.1109/SAMOS.2015.7363693 (published 2015-07-19)
Citations: 9
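As a small illustration for the optical flow entry above: the "Huber" in Huber-L1 usually refers to the Huber penalty, which is quadratic for small residuals and linear for large ones. The sketch below assumes that standard definition and is not code from the paper; the function name `huber`, the threshold `eps`, and the sample residuals are hypothetical. Each penalty depends only on its own residual, which is the kind of per-pixel independence that lets a GPU assign one thread per pixel.

```python
def huber(x: float, eps: float) -> float:
    """Huber penalty: x^2 / (2*eps) when |x| <= eps, otherwise |x| - eps/2."""
    ax = abs(x)
    if ax <= eps:
        return ax * ax / (2.0 * eps)
    return ax - eps / 2.0


# Hypothetical per-pixel residuals; every entry is evaluated independently,
# so on a GPU each one could be handled by its own work-item.
residuals = [0.0, 0.05, -0.5, 2.0]
penalties = [huber(r, eps=0.1) for r in residuals]
```

The two branches meet at `|x| == eps`, where both evaluate to `eps / 2`, so the penalty is continuous.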
Software fault tolerance for FPUs via vectorization
Zhi Chen, R. Inagaki, A. Nicolau, A. Veidenbaum
Future generation processors are expected to have high soft error rates and will require increased fault detection and fault tolerance. This work focuses on errors in execution units. Hardware or software duplication or triplication, parity, or residue codes could be used to detect errors in execution units. However, hardware duplication/triplication has significant area overhead and, in applications with high utilization of floating point units (FPUs), very high energy cost. Software duplication/triplication of instructions also increases both execution time and energy consumption. This paper proposes to reduce the cost of redundant instruction execution in FPUs through vectorization. Duplicated or triplicated instructions and result comparisons can be packed by a compiler into vector instructions, such as SSE or AVX. Experimental results using hand vectorization on a variety of benchmarks show that, compared to error detection through scalar instruction duplication, vector mode redundant execution achieves 1.78× and 2.73× average speedup for SSE and AVX instructions, respectively. It also significantly reduces energy consumption, by an average of 40% and 53% for SSE and AVX, respectively. Thus the proposed technique enables error detection with no hardware cost and reduced time and energy overhead compared to brute-force scalar instruction duplication.
DOI: https://doi.org/10.1109/SAMOS.2015.7363677 (published 2015-07-19)
Citations: 10
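The duplicate-and-compare idea in the entry above can be sketched in a few lines. This is a plain-Python simulation under stated assumptions, not the paper's compiler transformation or real SSE/AVX code: a two-element list stands in for a vector register, `packed_mul` stands in for one packed multiply, and the `faulty_lane` parameter is a hypothetical fault-injection hook added only so the detection path can be exercised.

```python
def packed_mul(va, vb, faulty_lane=None):
    """Stand-in for one packed multiply over all lanes of two 'vectors'."""
    vr = [x * y for x, y in zip(va, vb)]
    if faulty_lane is not None:
        vr[faulty_lane] += 1.0  # hypothetical injected soft error
    return vr


def checked_mul(a, b, faulty_lane=None):
    """Compute a*b twice in two lanes of a single packed operation and compare."""
    vr = packed_mul([a, a], [b, b], faulty_lane)  # operands duplicated per lane
    if vr[0] != vr[1]:
        raise ArithmeticError("lane mismatch: possible soft error")
    return vr[0]


result = checked_mul(3.0, 4.0)
```

The point of the packing is that the redundant copy rides along in an otherwise unused lane, so the duplicate costs one vector instruction instead of a second scalar instruction.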
Pre-simulation elaboration of heterogeneous systems: The SystemC multi-disciplinary virtual prototyping approach
C. Aoun, Liliana Andrade, Torsten Mähne, F. Pêcheux, M. Louërat, A. Vachoux
Designers of the upcoming digital-centric More-than-Moore systems lack a common design and simulation environment able to efficiently manage all the multi-disciplinary aspects of their components, which are of various natures and interact closely with each other. A key to successful design and verification lies in a SystemC-based virtual prototyping environment that is able to simulate a complex heterogeneous system as a whole, in which each component is described and solved using the most appropriate Model of Computation (MoC). In this paper, we present a new generic MoC-independent elaboration scheme that aims at preparing a Virtual Prototype (VP) for simulation. It requires checking the correct composition of the system model through dimensional analysis, exploring the model structure to identify the MoCs involved and the interfaces between them, and detecting the underlying dependencies. Finally, the information extracted from the exploration allows the instantiation of MoC-specific solvers. To soundly handle the global model execution with a Discrete Event (DE) kernel as the main solver, synchronization mechanisms with master-slave semantics within the model structure are implicitly deduced.
DOI: https://doi.org/10.1109/SAMOS.2015.7363686 (published 2015-07-19)
Citations: 9
Chip-independent Error Correction in main memories
Mehrtash Manoochehri, M. Dubois
Main memory reliability is an important concern in today's computer systems. Error Correction Codes (ECCs) improve memory reliability but have high area and energy overheads. Furthermore, ECCs cannot be easily applied to memories with wide chips such as stacked memories. In this paper, we introduce a new low-overhead error correction scheme, which can easily be applied to DRAM memories with wide devices. The scheme is called Chip-Independent Error Correction (CIEC) because it is independent of the memory chip width. Our simulation results in the context of transient faults show that CIEC has only 4.5% energy overhead, 0.5% performance overhead, and 0.7% area overhead on the processor chip as compared to a non-ECC DIMM while its reliability is much higher than the reliability of non-ECC DIMMs.
DOI: https://doi.org/10.1109/SAMOS.2015.7363674 (published 2015-07-19)
Citations: 0
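For readers unfamiliar with how an error correction code repairs a flipped bit, the sketch below shows the classic Hamming(7,4) code. It is a generic textbook illustration and not the CIEC scheme from the entry above, whose construction the abstract does not describe; the function names `encode` and `correct` are hypothetical.

```python
def encode(data):
    """Encode 4 data bits as the 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(codeword):
    """Return (repaired codeword, 1-based error position or 0 if none)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # syndrome read as a binary position
    if pos:
        c[pos - 1] ^= 1  # flip the single faulty bit back
    return c, pos


word = encode([1, 0, 1, 1])
damaged = list(word)
damaged[4] ^= 1  # simulate a transient single-bit fault
repaired, position = correct(damaged)
```

Three check bits for four data bits is a heavy ratio; memory ECC typically uses wider words to amortize the check bits, and that overhead is what the abstract says CIEC aims to lower.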
Imposing coarse-grained reconfiguration to general purpose processors
M. Duric, Milan Stanic, Ivan Ratković, Oscar Palomar, O. Unsal, A. Cristal, M. Valero, Aaron Smith
Mobile devices execute applications with diverse compute and performance demands. This paper proposes a general purpose processor that adapts the underlying hardware to a given workload. Existing mobile processors need to utilize more complex heterogeneous substrates to deliver the demanded performance. They incorporate different cores and specialized accelerators. In contrast, our processor utilizes only modest homogeneous cores and dynamically provides an execution substrate suitable to accelerate a particular workload. Instead of incorporating accelerators, the processor reconfigures one or more cores into accelerators on-the-fly. It improves performance with minimal hardware additions. The accelerators are made of general purpose ALUs reconfigured into a compute fabric and the general purpose pipeline that streams data through the fabric. To enable reconfiguration of ALUs into the fabric, the floorplan of a 4-core processor is changed to place the ALUs in close proximity on the chip. A configurable switched network is added to couple and dynamically reconfigure the ALUs to perform computation of frequently repeated regions, instead of executing general purpose instructions. Through this reconfiguration, the mobile processor specializes its substrate for a given workload and maximizes performance of the existing resources. Our results show that reconfiguration accelerates a set of selected compute-intensive workloads by 1.56×, 2.39×, and 3.51× when the accelerator is configured from 1, 2, or 4 cores, respectively.
DOI: https://doi.org/10.1109/SAMOS.2015.7363658 (published 2015-07-19)
Citations: 2
Reconfigurable computing for future vision-capable devices
Miguel Bordallo López, A. Nieto, O. Silvén, J. Boutellier, D. L. Vilariño
Mobile devices have been identified as promising platforms for interactive vision-based applications. However, this type of application still poses significant challenges in terms of latency, throughput, and energy efficiency. In this context, the integration of reconfigurable architectures on mobile devices allows dynamic reconfiguration to match the computation and data flow of interactive applications, demonstrating significant performance benefits compared to general purpose architectures. This paper presents concepts built on platform-level adaptability, exploring the acceleration of vision-based interactive applications through the utilization of three reconfigurable architectures: a low-power EnCore processor with a Configurable Flow Accelerator co-processor, a hybrid reconfigurable SIMD/MIMD platform, and Transport-Triggered Architecture-based processors. The architectures are evaluated and compared with current processors, analyzing their advantages and weaknesses in terms of performance and energy efficiency when implementing highly interactive vision-based applications. The results show that the inclusion of reconfigurable platforms on mobile devices can enable the computation of several computationally heavy tasks with high performance and low energy consumption while providing enough flexibility.
DOI: https://doi.org/10.1109/SAMOS.2015.7363657 (published 2015-07-19)
Citations: 0
Tervel: A unification of descriptor-based techniques for non-blocking programming
S. Feldman, P. Laborde, D. Dechev
The development of non-blocking code is difficult; developers must ensure the progress of an operation on shared memory despite conflicting operations. Managing this shared memory in a non-blocking fashion is even more problematic. The non-blocking property guarantees that progress is made toward the desired operation in a finite amount of time. We present a framework that implements memory reclamation and progress assurance for code that follows the semantics of our framework. This reduces the effort required to implement non-blocking, and more specifically wait-free, algorithms. We also present a library that demonstrates the ease with which wait-free algorithms can be implemented using our framework.
DOI: https://doi.org/10.1109/SAMOS.2015.7363668 (published 2015-07-19)
Citations: 14
Hardware task migration module for improved fault tolerance and predictability
Shyamsundar Venkataraman, Rui Santos, Akash Kumar, Jasper Kuijsten
Task migration has been applied as an efficient mechanism to handle faulty processing elements (PEs) in Multi-processor Systems-on-Chip (MPSoCs). However, current task migration solutions are either implemented or emulated in software, compromising intrinsically the predictability and degrading the system robustness. Moreover, the initial placement and mapping of the tasks in the MPSoC plays an important role in minimising the task migration overhead and overall system energy. This paper proposes a hardware-based task migration scheme for MPSoC systems, offering better predictability as well as an improved method of fault tolerance. The proposed scheme intelligently generates an initial placement for the tasks with improved fault tolerance and stores these mappings on a hash map, which is looked up at run-time as and when faults occur. Compared with the state-of-the-art, our scheme performs up to 1500× faster task migration without any significant overheads.
DOI: https://doi.org/10.1109/SAMOS.2015.7363676 (published 2015-07-19)
Citations: 9
An FPGA-based systolic array to accelerate the BWA-MEM genomic mapping algorithm
Ernst Houtgast, V. Sima, K. Bertels, Z. Al-Ars
We present the first accelerated implementation of BWA-MEM, a popular genome sequence alignment algorithm widely used in next-generation sequencing genomics pipelines. The Smith-Waterman-like sequence alignment kernel accounts for a significant portion of overall execution time. We propose and evaluate a number of FPGA-based systolic array architectures, presenting optimizations generally applicable to variable-length Smith-Waterman execution. Our kernel implementation is up to 3× faster than software-only execution. This translates into an overall application speedup of up to 45%, which is 96% of the theoretical maximum speedup achievable when accelerating only this kernel.
Citations: 58