Parallel discrete wavelet transform using the Open Computing Language: a performance and portability study
Bharatkumar Sharma, N. Vydyanathan
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470830
The discrete wavelet transform (DWT) is a powerful signal processing technique used in the JPEG 2000 image compression standard. The multi-resolution sub-band encoding provided by the DWT allows for higher compression ratios, avoids blocking artifacts and enables progressive transmission of images. However, these advantages come at the expense of additional computational complexity. Achieving real-time or interactive compression/decompression speeds therefore requires a fast implementation of the DWT that leverages emerging parallel hardware. In this paper, we develop an optimized parallel implementation of the lifting-based DWT algorithm using the recently proposed Open Computing Language (OpenCL). OpenCL is a standard for cross-platform parallel programming of heterogeneous systems comprising multi-core CPUs, GPUs and other accelerators. We explore the potential of OpenCL in accelerating the DWT computation and analyze the programmability, portability and performance aspects of this language. Our experimental analysis is done using NVIDIA's and AMD's drivers that support OpenCL.
GPU-accelerated multi-scoring functions protein loop structure sampling
Yaohang Li, Weihang Zhu
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470901
Accurate protein loop structure models are important for understanding the functions of many proteins. One of the main problems in correctly modeling protein loop structures is sampling the large loop backbone conformation space, particularly when the loop is long. In this paper, we present a GPU-accelerated loop backbone structure modeling approach that samples multiple scoring functions based on pair-wise atom distance, torsion angles of triplet residues, or a soft-sphere van der Waals potential. The sampling program, implemented on a heterogeneous CPU-GPU platform, achieves a speedup of ∼40 when sampling long loops, which enables the sampling process to use large population sizes. The GPU-accelerated multi-scoring-function loop structure sampling allows fast generation of decoy sets composed of structurally diversified backbone decoys representing various compromises among the scoring functions. On the 53 long-loop benchmark targets we tested, our computational results show that for more than 90% of the targets the generated decoy sets include decoys within 1.5 Å RMSD (root mean square deviation) of the native structure, while for 77% of the targets decoys within 1.0 Å RMSD are reached.
A multi-threaded approach for data-flow analysis
Marcus Edvinsson, Welf Löwe
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470818
Program analysis supporting software development is often part of edit-compile cycles, and precise program analysis is time consuming. With the availability of parallel processing power on desktop computers, parallelization is a way to speed up program analysis. This requires a parallel data-flow analysis with sufficient work for each processing unit. The present paper suggests such an approach for object-oriented programs, analyzing the target methods of polymorphic calls in parallel. With carefully selected thresholds guaranteeing sufficient work for the parallel threads and only little redundancy between them, this approach achieves a maximum speed-up of 5 (average 1.78) on 8 cores for the benchmark programs.
A survey on bee colony algorithms
S. Bitam, M. Batouche, E. Talbi
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470701
This paper presents a survey of current research activities inspired by bee life. This work is intended to provide a broad and comprehensive view of the various principles and applications of these bio-inspired systems. We propose to classify them into two major models: the first is based on the foraging behavior of bees in their daily life, and the second is inspired by the marriage principle. Different original studies are described and classified along with their applications, comparisons against other approaches, and results. We conclude with a review of their derived algorithms and related research efforts.
Support of cross calls between a microprocessor and FPGA in CPU-FPGA coupling architecture
Giang Nguyen Thi Huong, S. Kim
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470741
The coupling architecture containing an FPGA device and a microprocessor has been widely used to accelerate microprocessor execution. Consequently, there has been intensive research in the high-level synthesis community on synthesizing high-level programming languages (HLLs) such as C and C++ into HW, in order to make the work of reconfiguring the FPGA easier. However, the semantic difference in calling methods between HDLs and HLLs makes their interface implementation very difficult. This paper presents a novel communication framework between a microprocessor and an FPGA, which allows the full implementation of cross calls between SW and HW, and even recursive calls in HW, without any limitation. We show that the overhead of our proposed calling mechanism is very small. With our communication framework, hardware components inside the FPGA are no longer isolated accelerators; they can work as master components in a system configuration, like any other.
User level DB: a debugging API for user-level thread libraries
K. Pouget, Marc Pérache, Patrick Carribault, H. Jourdren
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470815
With the advent of the multicore era, parallel programming is becoming ubiquitous. Multithreading is a common approach to benefit from these architectures. Hybrid M:N libraries like MultiProcessor Communication (MPC) or MARCEL reach high performance by expressing fine-grain parallelism, mapping M user-level threads onto N kernel-level threads. However, such implementations impair the debugger's ability to distinguish one thread from another, because only kernel threads can be handled. SUN MICROSYSTEMS' THREAD_DB API is an interface between the debugger and the thread library that allows the debugger to inquire about thread semantics. In this paper, we introduce the USER LEVEL DB (ULDB) library, an implementation of the THREAD_DB interface abstracting the common features of user-level thread libraries. ULDB gathers the generic algorithms required to debug threads and provides the thread library with a small and focused interface. We describe the usage of our library with widely used debuggers (GDB, DBX) and its integration into a user-level thread library (GNU PTH) and two high-performance hybrid libraries (MPC, MARCEL).
High-level synthesis techniques for in-circuit assertion-based verification
J. Curreri, G. Stitt, A. George
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470747
Field-Programmable Gate Arrays (FPGAs) are increasingly employed in both high-performance computing and embedded systems due to performance and power advantages compared to microprocessors. However, widespread usage of FPGAs has been limited by increased design complexity. High-level synthesis has reduced this complexity but often relies on inaccurate software simulation or lengthy register-transfer-level simulations for verification and debugging, which is unattractive to software developers. In this paper, we present high-level synthesis techniques that allow application designers to efficiently synthesize ANSI-C assertions into FPGA circuits, enabling real-time verification and debugging of circuits generated from high-level languages while executing in the actual FPGA environment. Although not appropriate for all systems (e.g., safety-critical systems), the proposed techniques enable software developers to rapidly verify and debug FPGA applications, while reducing frequency by less than 3% and increasing FPGA resource utilization by less than 0.13% for several application case studies on an Altera Stratix-II EP2S180 using Impulse-C. The presented techniques reduced area overhead by as much as 3x and improved assertion performance by as much as 100% compared to unoptimized in-circuit assertions.
Mobile-friendly Peer-to-Peer client routing using out-of-band signaling
Wei Wu, J. Womack, Xinhua Ling
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470936
It is expected that Peer-to-Peer (P2P) services will co-exist with client-server based services such as IMS. Mobile users may subscribe to traditional wireless cellular services while participating in P2P overlay networks. In this paper, a method is proposed to reduce the signaling overhead in a mobile P2P system. With the help of the underlying infrastructure, a mobile device in the P2P overlay can be located using out-of-band non-P2P signaling. This reduces the P2P signaling for location updates while a mobile device changes its point of attachment in the P2P overlay. As the signaling cost depends on both the client's mobility and traffic models, an analytical model has been developed to determine the optimal threshold for the registration update. Analytical results show that the proposed method can save up to 70% of the signaling cost when the Call-to-Mobility Ratio (CMR) is low. On the other hand, it is better to fall back to the base client routing method when the CMR is high, i.e., to perform the registration update whenever the client changes its point of attachment in the P2P overlay.
BlobSeer: Efficient data management for data-intensive applications distributed at large-scale
Bogdan Nicolae, Gabriel Antoniu, L. Bougé
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470802
As the rate, scale and variety of data increase, the need for flexible applications that can crunch huge amounts of heterogeneous data quickly and cost-effectively becomes of utmost importance. Such applications are data-intensive: in a typical scenario, they continuously acquire massive datasets (e.g. by crawling the Web or analyzing access logs) while performing computations over these changing datasets (e.g. building up-to-date search indexes). In order to achieve scalability and performance, data acquisition and computation need to be distributed at large scale over infrastructures comprising hundreds or thousands of machines. As these applications focus on data rather than on computation, a heavy burden is put on the storage service employed to handle data management, because it must efficiently deal with massively parallel data accesses. To achieve this, a series of issues needs to be addressed properly: scalable aggregation of storage space from the participating nodes with minimal overhead, the ability to store huge data objects, efficient fine-grain access to data subsets, high throughput even under heavy access concurrency, versioning, as well as fault tolerance and a high quality of service for access throughput. This paper introduces BlobSeer, an efficient distributed data management service that addresses the issues presented above. In BlobSeer, long sequences of bytes representing unstructured data are called blobs (Binary Large OBjects).
Scheduling complex streaming applications on the Cell processor
M. Gallet, M. Jacquelin, L. Marchal
Pub Date: 2010-04-19 | DOI: 10.1109/IPDPSW.2010.5470684
In this paper, we consider the problem of scheduling streaming applications described by complex task graphs on a heterogeneous multicore processor, the STI Cell BE processor. We first present a theoretical model of the Cell processor. Then, we use this model to express the problem of maximizing the throughput of a streaming application on this processor. Although the problem is proven NP-complete, we present an optimal solution based on mixed linear programming. This allows us to compute the optimal mapping for a number of applications, ranging from a real audio encoder to complex random task graphs. These mappings are then tested on two platforms embedding Cell processors, and compared to simple heuristic solutions. We show that we are able to achieve a good speed-up, whereas the heuristic solutions generally fail to deal with the strong memory and communication constraints.