Fine-Grained Parallel Compacting Garbage Collection through Hardware-Supported Synchronization
O. Horvath and M. Meyer. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.28

Parallel garbage collection seeks to exploit the inherent parallelism of graph tracing by evenly distributing the set of objects in the heap among all available processing resources. Any straightforward implementation, however, suffers from prohibitive overheads since each access to the worklist of objects and to the objects themselves needs to be protected by synchronization, especially so in the case of compacting collectors. For this reason, known parallel collectors sacrifice a great deal of work distribution granularity and scalability to keep the synchronization costs acceptable. In this paper, we present a case study of a different approach. Our parallel compacting collector is based on Cheney's copying algorithm, employs a single worklist and distributes garbage collection work on an object-by-object basis. This way, it achieves well-balanced work distribution and good scalability. To solve the synchronization problem, we introduce a low-cost multi-core garbage collection coprocessor and take advantage of hardware-supported synchronization. We built an FPGA-based prototype with a single-core main processor supported by a multi-core garbage collection coprocessor. Measurement results show that an 8-core garbage collection coprocessor decreases the duration of garbage collection cycles by a factor of up to 7.4, while a 16-core configuration still achieves a factor of up to 12.1.
Effectively Presenting Call Path Profiles of Application Performance
L. Adhianto, J. Mellor-Crummey, and Nathan R. Tallent. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.35

Call path profiling is a scalable measurement technique that has been shown to provide insight into the performance characteristics of complex modular programs. However, poor presentation of accurate and precise call path profiles obscures insight. To enable rapid analysis of an execution's performance bottlenecks, we make the following contributions for effectively presenting call path profiles. First, we combine a relatively small set of complementary presentation techniques to form a coherent synthesis that is greater than the constituent parts. Second, we extend existing presentation techniques to rapidly focus an analyst's attention on performance bottlenecks. In particular, we (1) show how to scalably present three complementary views of calling-context-sensitive metrics; (2) treat a procedure's static structure as first-class information with respect to both performance metrics and constructing views; (3) enable construction of a large variety of user-defined metrics to assess performance inefficiency; and (4) automatically expand hot paths based on arbitrary performance metrics --- through calling contexts and static structure --- to rapidly highlight important program contexts. Our work is implemented within HPCToolkit, which collects call path profiles using low-overhead asynchronous sampling.
Resource Discovery and Scheduling in Unstructured Peer-to-Peer Desktop Grids
S. Kwan and J. Muppala. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.49

In this paper, we explore resource discovery and scheduling issues that arise in unstructured peer-to-peer (P2P) desktop grids. We examine the use of a super-peer based approach to address these issues. The super-peers form a resource information tracking and exchange overlay to enable users to rapidly locate resources for remote execution of jobs. Resource availability information is exchanged among the super-peers using a lightweight threshold-driven gossip protocol with the aim of minimizing the resource discovery overhead. We conduct detailed simulation experiments to illustrate the comparative results. Our results indicate that this approach offers a lightweight and scalable method for managing resources in a desktop grid.
Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers
G. Zheng, Esteban Meneses, A. Bhatele, and L. Kalé. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.65

Large parallel machines with hundreds of thousands of processors are being built. Recent studies have shown that ensuring good load balance is critical for scaling certain classes of parallel applications on even thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed load balancing algorithms, on the other hand, tend to yield poor load balance on very large machines. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the scalability challenges of centralized schemes and the poor solutions of traditional distributed schemes. This is done by creating multiple levels of load balancing domains which form a tree. This hierarchical method is demonstrated within a measurement-based load balancing framework in Charm++. We present techniques to deal with the scalability challenges of load balancing at very large scale. We show performance data of the hierarchical load balancing method on up to 16,384 cores of Ranger (at TACC) for a synthetic benchmark. We also demonstrate the successful deployment of the method in a scientific application, NAMD, with results on the Blue Gene/P machine at ANL.
Efficient Zero-Copy Noncontiguous I/O for Globus on InfiniBand
Weikuan Yu, Yuan Tian, and J. Vetter. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.56

Noncontiguous I/O access is one of the main access patterns in parallel and distributed applications. EXIO is an I/O architecture that enables Globus, a popular run-time environment for distributed computing, on RDMA networks such as InfiniBand. In this paper, we investigate the benefits of InfiniBand zero-copy RDMA for noncontiguous I/O on Globus. Our experimental results demonstrate that, by enabling zero-copy RDMA on InfiniBand, EXIO significantly improves the performance of Globus noncontiguous I/O. Compared to packing and unpacking, zero-copy RDMA improves bandwidth by up to 2.7 times. Compared to both IPoIB and 10GigE, it increases bandwidth by more than three times. While achieving efficient noncontiguous I/O, RDMA-based noncontiguous I/O on InfiniBand also leads to a dramatic reduction in CPU utilization on Globus clients and servers.
A Performance Estimation Technique for the SegBus Distributed Architecture
M. F. Niazi, T. Seceleanu, and H. Tenhunen. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.24

We propose a performance estimation technique for a multi-core segmented bus platform, SegBus. The technique enables us to assess the performance aspects of any specific application on a particular platform configuration, modeled in the Unified Modeling Language (UML). We present methods to transform Packet Synchronous Data Flow (PSDF) and Platform Specific Model (PSM) models of the application into Extensible Markup Language (XML) schemes using a modeling tool, and show how the generated XML schemes can be utilized by the emulator program to obtain execution results. The technique allows us to estimate the performance aspects of an application mapped onto a number of different platform configurations during the early stages of the design process.
A Cooperative Intrusion Detection System Framework for Cloud Computing Networks
Chi-Chun Lo, Chun-Chieh Huang, and Joy Ku. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.46

Cloud computing provides a framework that lets end users easily access powerful services and applications through the Internet. Providing secure and reliable services in a cloud computing environment is an important issue. One of the security issues is how to reduce the impact of denial-of-service (DoS) or distributed denial-of-service (DDoS) attacks in this environment. To counter these kinds of attacks, a framework for a cooperative intrusion detection system (IDS) is proposed. The proposed system reduces the impact of such attacks. To provide this ability, the IDSs in the cloud computing regions exchange their alerts with each other. In the system, each IDS has a cooperative agent used to compute and determine whether to accept the alerts sent from other IDSs. In this way, the IDSs can prevent the same type of attack from happening elsewhere. The implementation results indicate that the proposed system can resist DoS attacks. Moreover, by comparison, the proposed cooperative IDS adds only a small amount of computation compared with a pure Snort-based IDS, while protecting the system from single-point-of-failure attacks.
A Novel RSS-Based Indoor Positioning Algorithm Using Mobility Prediction
Lyu-Han Chen, Gen-Huey Chen, Ming-Hui Jin, and E. Wu. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.80

Severe received signal strength (RSS) fluctuation is one of the crucial problems in indoor positioning systems that use fingerprint-based algorithms. Even at a fixed location, the RSS values received by a mobile device at different times can differ considerably. Using these fluctuating signals for positioning may lead to inaccurate results. To mitigate this problem, any existing fingerprint-based indoor positioning algorithm can be integrated into our positioning system to estimate the location of the mobile device. Then, a mobility prediction algorithm based on the Brownian motion model is presented to assess the plausibility of the estimated location and to correct inaccurate results. To be realistic, experiments in a real WLAN environment, with a multitude of people moving in the testing area, demonstrate the noticeably better accuracy of this approach. The solution ensures a low and stable positioning error. In addition, regions whose training records are out of date can also be identified.
A Region-Based Hierarchical Location Service with Road-Adapted Grids for Vehicular Networks
Guey-Yun Chang, Yun-Yu Chen, and J. Sheu. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.81

In VANETs, communication between two vehicles is very important, but obtaining the correct position of a vehicle is not easy. Because vehicles move fast, the topology of a VANET changes rapidly. As a result, location services are more difficult to provide in VANETs than in MANETs. In this work, we propose a hierarchical location service system that provides low-cost and rapid service. First, we divide the network into grids along the main arteries, which carry more vehicles than ordinary roads, and design a mechanism that determines when vehicles need to send update packets. This mechanism decreases the number of update packets while still obtaining correct vehicle locations. Second, we organize the grids into three levels; the higher the level, the larger the area it covers. Each level stores the update packets sent within its area. Vehicles using our system first search for the destination vehicle, in a distributed manner, within a small area; if the target is not found there, the search expands to a larger area. In addition, we propose a packet collection method whose collection area can be adjusted in size. The simulation results show that our scheme effectively decreases the number of location update packets while keeping a high success rate for the location service.
Jedule: A Tool for Visualizing Schedules of Parallel Applications
S. Hunold, Ralf Hoffmann, and F. Suter. 2010 39th International Conference on Parallel Processing Workshops. DOI: https://doi.org/10.1109/ICPPW.2010.34

Task scheduling is one of the most prominent problems in the era of parallel computing. We find scheduling algorithms in every domain of computer science, e.g., mapping multiprocessor tasks to clusters, mapping jobs to grid resources, or mapping fine-grained tasks to cores of multicore processors. Many tools exist that help understand or debug an application by presenting visual representations of a certain program run, e.g., visualizations of MPI traces. However, developers often want to get a global and abstract view of their schedules first. In this paper we introduce Jedule, a tool dedicated to visualizing schedules of parallel applications. We demonstrate the effectiveness of Jedule by showing how it helped analyze problems in several case studies.