Balanced Block Design Architecture for Parallel Computing in Mobile CPUs/GPUs

2013 Fourth International Conference on Computing for Geospatial Research and Application. Pub Date: 2013-07-22. DOI: 10.1109/COMGEO.2013.27

G. Mani, S. Berkovich, Duoduo Liao


Citations: 4

Abstract

To increase performance, processor manufacturers extract parallelism by shrinking transistors, adding more of them to single-core chips, and creating multi-core systems. Although microprocessor performance continues to grow at an exponential rate, this approach generates too much heat and consumes too much power. These architectures not only introduce several complications but also require tremendous effort to organize special software for parallel processing. In many cases, these difficulties are insurmountable. Programmers have to write complex code to prioritize tasks or to perform them in parallel, for example by extracting parallelism through threads on GPUs. One of the key issues for programmers is how to divide tasks into sub-tasks. A miscalculation here may increase data dependency, which slows the processor. A processor that performs more parallel operations can at the same time suffer longer queuing delays. In both of these scenarios, the cost of communication (also known as data transportation energy) between processing elements in a microprocessor (or between objects in parallel programming) is increasing relative to that of computation. This trend is resulting in larger caches for every new processor generation and in more complex and costly latency-tolerant mechanisms. Here we introduce a combinatorial architecture that has a unique property: a multi-core system running on sequential code. This architecture can be used for both CPUs and GPUs. Some minor adjustments to a regular compiler are needed for loading. In particular, current mobile GPU technologies are still relatively immature and require substantial improvements to enable wireless devices to perform complex graphics-related functions. Our new architecture is better suited to mobile GPUs/CPUs, i.e., mobile heterogeneous computing, where resources are limited, and it offers relatively greater performance.
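The abstract does not spell out which block design the architecture uses or how it maps onto cores. As a rough illustration of what "balanced block design" means in combinatorics, the sketch below checks the balance property on the Fano plane, a standard (7, 3, 1) design. The choice of the Fano plane, the name `FANO_BLOCKS`, and the helper `design_parameters` are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations

# Assumption for illustration only: the Fano plane, a well-known balanced
# incomplete block design with 7 points, 7 blocks, and block size 3.
# The paper's actual design is not given in the abstract.
FANO_BLOCKS = [
    {0, 1, 2}, {0, 3, 4}, {0, 5, 6},
    {1, 3, 5}, {1, 4, 6},
    {2, 3, 6}, {2, 4, 5},
]


def design_parameters(blocks):
    """Return (v, b, r, k, lam) for a balanced block design.

    v   = number of points
    b   = number of blocks
    r   = number of blocks containing each point
    k   = size of each block
    lam = number of blocks containing each pair of points

    Raises ValueError if the blocks are not balanced.
    """
    points = sorted(set().union(*blocks))
    block_sizes = {len(block) for block in blocks}

    # How many blocks each point appears in.
    point_counts = Counter(p for block in blocks for p in block)
    replications = {point_counts[p] for p in points}

    # How many blocks each unordered pair of points shares.
    pair_counts = Counter(
        pair for block in blocks for pair in combinations(sorted(block), 2)
    )
    # A pair that never co-occurs is counted as 0 by Counter.
    lambdas = {pair_counts[pair] for pair in combinations(points, 2)}

    if len(block_sizes) != 1 or len(replications) != 1 or len(lambdas) != 1:
        raise ValueError("blocks do not form a balanced design")

    return (len(points), len(blocks), replications.pop(),
            block_sizes.pop(), lambdas.pop())


if __name__ == "__main__":
    v, b, r, k, lam = design_parameters(FANO_BLOCKS)
    print(f"v={v} b={b} r={r} k={k} lambda={lam}")
```

If points are read as data items and blocks as processing elements (again, an interpretation for this example rather than the authors' stated mapping), balance means that every pair of items meets in the same number of blocks, so no pair is favored or starved.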


