Parallel implementation of real-time semi-global matching on embedded multi-core architectures

2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS). Pub Date: 2013-07-15. DOI: 10.1109/SAMOS.2013.6621106

Oliver Jakob Arndt, Daniel Becker, C. Banz, H. Blume


Citations: 23

Abstract

Embedded real-time algorithms are often realized with dedicated hardware, which entails high production costs and offers little programming flexibility once deployed. For instance, semi-global matching for stereo image processing, with its complex data flows, traditionally runs on customized hardware modules. Emerging embedded multi-core technologies address these problems by combining the processing and memory capabilities of multiple individual cores. However, because of concurrency issues (e.g., data races and lock contention), parallel programming requires experienced programmers as well as technology-specific techniques (e.g., synchronization libraries) and tools (e.g., parallel profilers), which are often not available on embedded platforms. In this work, we introduce a parallel version of a semi-global matching algorithm and, using it as a case study, demonstrate the runtime optimizations necessary to meet real-time requirements. We also present the structured steps of the applied parallelization workflow, illustrating an efficient strategy for migrating to multi-core platforms using runtime information (e.g., profiles and hardware counters). Finally, to evaluate the resulting performance characteristics, we compare the runtime behavior of the parallel version running on a Freescale P4080 processor with reference values measured on an Intel i7, a field-programmable logic device, an extended general-purpose processor, and a GPU.
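The abstract does not spell out the algorithm, so what follows is only a minimal sketch, assuming the standard semi-global matching cost-aggregation recurrence along a single left-to-right path over a precomputed cost volume of shape (H, W, D). It is written in Python/NumPy for readability and is not the authors' P4080 implementation. The helper names `aggregate_left_to_right`, `split_rows`, and `aggregate_in_blocks` and the penalty values P1 and P2 are hypothetical. Because a horizontal path only reads pixels from its own row, rows can be split into contiguous blocks and handed to separate cores without locks, which the sketch illustrates with a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def aggregate_left_to_right(cost: np.ndarray, p1: float, p2: float) -> np.ndarray:
    """SGM cost aggregation along the left-to-right path r = (0, 1).

    cost: matching-cost volume of shape (H, W, D).
    L(p, d) = C(p, d) + min(L(p-r, d), L(p-r, d-1) + P1, L(p-r, d+1) + P1,
                            min_k L(p-r, k) + P2) - min_k L(p-r, k)
    All H rows are updated together, one column at a time.
    """
    h, w, _ = cost.shape
    agg = np.empty_like(cost, dtype=np.float64)
    agg[:, 0, :] = cost[:, 0, :]            # path starts at the left border
    inf_col = np.full((h, 1), np.inf)       # padding for d-1 / d+1 at the edges
    for x in range(1, w):
        prev = agg[:, x - 1, :]             # (H, D) path state of the left neighbour
        prev_min = prev.min(axis=1, keepdims=True)
        minus = np.concatenate([inf_col, prev[:, :-1]], axis=1) + p1  # from d-1
        plus = np.concatenate([prev[:, 1:], inf_col], axis=1) + p1    # from d+1
        best = np.minimum(np.minimum(prev, minus), np.minimum(plus, prev_min + p2))
        # subtracting prev_min keeps values bounded along long paths
        agg[:, x, :] = cost[:, x, :] + best - prev_min
    return agg


def split_rows(height: int, n_cores: int) -> list[tuple[int, int]]:
    """Hypothetical static partition of [0, height) into n_cores row blocks."""
    base, extra = divmod(height, n_cores)
    blocks, start = [], 0
    for i in range(n_cores):
        stop = start + base + (1 if i < extra else 0)
        blocks.append((start, stop))
        start = stop
    return blocks


def aggregate_in_blocks(cost: np.ndarray, p1: float, p2: float, n_cores: int) -> np.ndarray:
    """Aggregate each row block independently. No synchronization is needed
    because the left-to-right path never reads data from another row.
    NumPy may release the GIL inside array operations, but the point here is
    the independence of the blocks, not the Python-level speedup."""
    blocks = split_rows(cost.shape[0], n_cores)
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        parts = list(pool.map(
            lambda b: aggregate_left_to_right(cost[b[0]:b[1]], p1, p2), blocks))
    return np.concatenate(parts, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cost = rng.integers(0, 20, size=(6, 8, 4)).astype(np.float64)
    serial = aggregate_left_to_right(cost, 1.0, 8.0)
    parallel = aggregate_in_blocks(cost, 1.0, 8.0, n_cores=3)
    assert np.allclose(serial, parallel)  # row blocks give the same result
```

Full SGM sums such aggregations over several path directions (commonly 8 or 16). Vertical and diagonal paths carry dependencies across rows, so they need a different partitioning (e.g., by columns for top-to-bottom paths), and on an embedded target the loop would more likely be written in C with native threads than in Python.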
