Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210028
V. Talwar, Sujoy Basu, Raj Kumar
Traditional use of grid computing allows a user to submit batch jobs in a grid environment. We believe next-generation grids will extend the application domain to include interactive graphical sessions. We term such grids interactive grids. In this paper, we describe some of the challenges involved in building interactive grids. These include fine-grained access control, QoS guarantees, and dynamic account management. To architect interactive grids, we propose and describe I-GENV, an environment for enabling interactive grids. I-GENV consists of GISH (Grid Interactive Shell), Controlled Desktop, the SAC (Session Admission Control) module, GMMA (Grid Monitoring and Management Agents), System Policies, and Dynamic Account Manager. We also present our testbed implementation of I-GENV, which uses and extends Globus Toolkit 2.0 for the grid middleware infrastructure and VNC as the remote display technology.
Title : An environment for enabling interactive grids
Venue : High Performance Distributed Computing, 2003. Proceedings. 12th IEEE International Symposium on
Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210034
S. Agarwala, C. Poellabauer, J. Kong, K. Schwan, M. Wolf
Monitoring the resources of distributed systems is essential to the successful deployment and execution of grid applications, particularly when such applications have well-defined QoS requirements. The dproc system-level monitoring mechanisms implemented for standard Linux kernels have several key components. First, utilizing the familiar /proc filesystem, dproc extends this interface with resource information collected from both local and remote hosts. Second, to predictably capture and distribute monitoring information, dproc uses a kernel-level group communication facility, termed KECho, which is based on events and event channels. Third, and the focus of this paper, is dproc's run-time customizability for resource monitoring, which includes the generation and deployment of monitoring functionality within remote operating system kernels. Using dproc, we show that: (a) data streams can be customized according to a client's resource availability (dynamic stream management); (b) by dynamically varying distributed monitoring (dynamic filtering of monitoring information), an appropriate balance can be maintained between monitoring overheads and application quality; and (c) by performing monitoring at kernel level, the information captured enables decision making that takes into account the multiple resources used by applications.
Title : Resource-aware stream management with the customizable dproc distributed monitoring mechanisms
Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210017
A. Bucur, D. Epema
In systems consisting of multiple clusters of processors which employ space sharing for scheduling jobs, such as our Distributed ASCI (Advanced School for Computing and Imaging) Supercomputer (DAS), co-allocation, i.e., the simultaneous allocation of processors to single jobs in multiple clusters, may be required. In this paper we study the performance of several scheduling policies for co-allocating unordered requests in multiclusters with a workload derived from the DAS. We find that, besides the choice of policy, limiting the total job size significantly improves performance, and that co-allocation is a viable choice as long as the slowdown of jobs due to global communication is bounded by 1.25.
Title : Trace-based simulations of processor co-allocation policies in multiclusters
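To make the notion of an unordered request concrete, here is a minimal sketch in Python. It assumes that a request lists only component sizes, that each component must go to a different cluster, and that a worst-fit rule picks the cluster; the function name `place_unordered` and the worst-fit choice are illustrative assumptions, not the policies evaluated in the paper.

```python
def place_unordered(components, idle):
    """Try to place each job component on a distinct cluster.

    components: processor counts requested, one per job component.
    idle: idle processor counts, one per cluster.
    Returns {cluster_index: component_size}, or None if the job does not fit.
    Assumption for illustration: worst fit, one component per cluster.
    """
    available = dict(enumerate(idle))  # cluster index -> idle processors
    placement = {}
    # Place the largest components first so they get the emptiest clusters.
    for size in sorted(components, reverse=True):
        if not available:
            return None  # more components than clusters
        cluster = max(available, key=available.get)
        if available[cluster] < size:
            return None  # even the emptiest remaining cluster is too small
        placement[cluster] = size
        del available[cluster]  # each cluster hosts at most one component
    return placement


print(place_unordered([8, 4], [6, 10, 5]))
print(place_unordered([8, 8], [10, 6]))
```

The first call fits (8 processors on the cluster with 10 idle, 4 on the cluster with 6 idle), while the second does not, because no second cluster has 8 idle processors.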
Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210021
Yang Yang, H. Casanova
Divisible workload applications arise in many fields of science and engineering. They can be parallelized in master-worker fashion and relevant scheduling strategies have been proposed to reduce application makespan. Our goal is to develop a practical divisible workload scheduling strategy. This requires that previous work be revisited as several usual assumptions about the computing platform do not hold in practice. We have partially addressed this concern in a previous paper via an algorithm that achieves high performance with realistic resource latency models. In this paper we extend our approach to account for performance prediction errors, which are expected for most real-world performance and applications. In essence, we combine ideas from multiround divisible workload scheduling, for performance, and from factoring-based scheduling, for robustness. We present simulation results to quantify the benefits of our approach compared to our original algorithm and to other previously proposed algorithms.
Title : RUMR: robust scheduling for divisible workloads
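The factoring idea mentioned in the abstract can be sketched as follows. This is a minimal Python illustration of classic factoring only (each round hands out a fixed fraction of the remaining work in equal chunks), not the RUMR algorithm itself; the function name `factoring_chunks` and the default `factor` of 2 are assumptions made for the example.

```python
import math


def factoring_chunks(total_units, workers, factor=2.0):
    """Return chunk sizes produced by a simple factoring schedule.

    In every round, roughly 1/factor of the remaining work is split into
    `workers` equal chunks, so chunks shrink as the computation proceeds.
    """
    chunks = []
    remaining = total_units
    while remaining > 0:
        # Chunk size for this round; never below one unit of work.
        size = max(1, math.ceil(remaining / (factor * workers)))
        for _ in range(workers):
            if remaining == 0:
                break
            take = min(size, remaining)
            chunks.append(take)
            remaining -= take
    return chunks


print(factoring_chunks(100, 2))
```

Large early chunks keep overhead low, and the small late chunks limit how much a mispredicted worker speed can delay the end of the run.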
Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210012
S. Thulasidasan, Wu-chun Feng, M. Gardner
In this paper, we describe the integration of dynamic right-sizing - an automatic and scalable buffer management technique for enhancing TCP (Transmission Control Protocol) performance - into GridFTP, a subsystem of the Globus Toolkit for managing bulk data transfers across computational Grids. Such Grids are often characterized by networks with large bandwidth-delay products. Unfortunately, many of today's Grid applications use only a small fraction of available bandwidth because the default buffer sizes in TCP are tuned for yesterday's WAN (wide area network) speeds. Buffer sizes can be manually tuned to allow TCP flow control to adapt to high-speed WAN environments, but this is a tedious process. Although recent work has shown how to automatically tune system buffers during connection set-up, these values may not be appropriate for the connection's lifetime due to varying network delay and throughput. We show how using the technique of dynamic right-sizing (DRS) in GridFTP helps us optimize memory usage while maintaining high throughput over the lifetime of the connection. We also show how DRS enhances important GridFTP features such as striped and third-party data transfers in a scalable way. The technique is implemented entirely in user space so that end users do not have to modify the kernel.
Title : Optimizing GridFTP through dynamic right-sizing
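The link between buffer size and usable bandwidth described in the abstract comes down to the bandwidth-delay product. The Python sketch below works through the arithmetic for a hypothetical path (1 Gbit/s, 100 ms round-trip time); the figures and function names are assumptions chosen for illustration and do not come from the paper.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return int(bandwidth_bits_per_s * rtt_s / 8)


def max_throughput_bits_per_s(window_bytes, rtt_s):
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


# Hypothetical path: 1 Gbit/s bandwidth, 100 ms round-trip time.
needed = bdp_bytes(1e9, 0.100)
# The same path with a fixed 64 KiB window.
capped = max_throughput_bits_per_s(64 * 1024, 0.100)

print(f"buffer needed to fill the path: {needed} bytes")
print(f"throughput with a 64 KiB window: {capped / 1e6:.2f} Mbit/s")
```

On this example path a 64 KiB window caps throughput at roughly 5 Mbit/s, about half a percent of the link, which is why buffers sized for slower networks leave most of the bandwidth unused.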
Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210035
R. Sundaresan, Mario Lauria, T. Kurç, S. Parthasarathy, J. Saltz
As data and computational grids grow in size and complexity, the crucial task of identifying, monitoring and utilizing available resources in an efficient manner is becoming increasingly difficult. The design of monitoring systems that are scalable both in the number of sources being monitored and in the number of clients served is a challenging issue. In this paper we investigate the trade-offs of different polling strategies that can be used to monitor resource availability on machines in a distributed environment. We show how adaptive polling protocols can substantially increase scalability with a less than proportional loss of precision, and how these protocols can be personalized for different types of resource usage patterns.
Title : Adaptive polling of grid resource monitors using a slacker coherence model
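One simple way to picture an adaptive polling protocol is a rule that polls faster while a resource is changing and slower while it is stable. The Python sketch below is a generic illustration under that assumption; the function name `next_interval`, the halving and 1.5x factors, and the bounds are hypothetical and are not the protocols or the slacker coherence model studied in the paper.

```python
def next_interval(interval, previous, current, tolerance,
                  min_interval=1.0, max_interval=60.0):
    """Pick the next polling interval (seconds) from the last two readings.

    If the monitored value moved by more than `tolerance`, poll twice as
    often; otherwise back off by 50%, within the given bounds.
    """
    if abs(current - previous) > tolerance:
        return max(min_interval, interval / 2.0)
    return min(max_interval, interval * 1.5)


# CPU load jumped from 0.5 to 0.9: poll sooner.
print(next_interval(10.0, 0.5, 0.9, tolerance=0.1))
# CPU load barely moved: poll later.
print(next_interval(10.0, 0.5, 0.52, tolerance=0.1))
```

A larger tolerance trades precision for fewer polls, which is the kind of trade-off between scalability and accuracy that the abstract refers to.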
Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210024
D. Thain, M. Livny
Despite many competitors, Ethernet became the dominant protocol for local area networking due to its simplicity, robustness, and efficiency in a wide variety of conditions and technologies. Reflecting on the current frailty of much software, grid and otherwise, we propose that the Ethernet approach to resource sharing is an effective and reliable technique for combining coarse-grained software when failures are common and poorly detailed. This approach involves placing several simple but important responsibilities on client software: to acquire shared resources conservatively, to back off during periods of failure, and to inform competing clients when resources are in contention. We present a simple scripting language that simplifies and encourages the Ethernet approach, and demonstrate its use in several grid computing scenarios, including job submission, disk allocation, and data replication. We conclude with a discussion of the limitations of this approach, and describe how it is uniquely suited to high-level programming.
Title : The Ethernet approach to grid computing
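The client-side discipline of backing off during periods of failure can be sketched as a retry loop with randomized exponential backoff. The Python below is a minimal illustration of that general technique; it is not the scripting language presented in the paper, and the function name `try_with_backoff` and its parameters are assumptions made for the example.

```python
import random
import time


def try_with_backoff(action, max_attempts=5, base_delay=1.0, max_delay=60.0,
                     sleep=time.sleep, rng=random.random):
    """Call `action` until it succeeds, waiting longer after each failure.

    `sleep` and `rng` are injectable so the behaviour can be tested
    without real delays or randomness.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and report the last failure
            # Randomize the wait so competing clients do not retry in lockstep.
            sleep(delay * (1.0 + rng()))
            delay = min(delay * 2.0, max_delay)


if __name__ == "__main__":
    outcomes = iter([IOError("busy"), IOError("busy"), "submitted"])

    def flaky_submit():
        # Hypothetical job submission that fails twice before succeeding.
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    print(try_with_backoff(flaky_submit, base_delay=0.01))
```

Doubling the delay keeps a struggling service from being flooded with retries, and the random factor plays the same role as the randomized wait after a collision on a shared Ethernet segment.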
Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210011
C. Kurmann, T. Stricker
Many large applications require distributed computing for the sake of better performance, and software systems that facilitate the development of such applications have attracted a great deal of attention. Modeling the application as distributed objects or components promises the benefits of better abstractions and increased software reuse. Using distributed object middleware (DOM) like CORBA (Common Object Request Broker Architecture) looks promising, but most often one cannot afford its notorious inefficiency. We address the bandwidth bottleneck by extending a highly efficient zero-copy communication architecture from the operating system through the middleware layers all the way to the application. In contrast to previous attempts at improving efficiency in CORBA, we preserve the advantages of object-oriented abstraction for the software design process and propose an efficient CORBA system that can handle bulk data transfers within the object request broker (ORB). Our prototype cleanly separates control and data transfers, both within the ORB and for ORB-to-ORB communication, and manages to get rid of all inefficient buffering for certain types while still preserving the standard Internet Inter-ORB Protocol (IIOP). It achieves the full performance that is only available with a strict zero-copy implementation across all layers between the operating system and the application.
Title : Zero-copy for CORBA - efficient communication for distributed object middleware
Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210031
David Spence, T. Harris
We describe the XenoSearch system for performing expressive resource discovery searches in a distributed environment. We represent server meta-data, such as their locations and facilities, as points in a multi-dimensional space and then express queries as predicates over these points. Each XenoSearch node holds a portion of this space and the key goal of XenoSearch is to direct queries to those nodes containing the meta-data of matching XenoServers. Communication between these XenoSearch nodes is based on the self-organizing Pastry peer-to-peer routing substrate. Our initial performance evaluation on a wide-area prototype shows that queries take only 3 to 5 times longer than basic Pastry routing, while supporting multi-dimensional searches of arbitrary shapes.
Title : XenoSearch: distributed resource discovery in the XenoServer open platform
Pub Date : 2003-06-22 DOI: 10.1109/HPDC.2003.1210026
S. Kleban, S. Clearwater
This paper characterizes "queue storms" in supercomputer systems and discusses methods for quelling them. Queue storms are anomalously large queue lengths that depend on the job size mix, the queuing system, the machine size, and correlations and dependencies between job submissions. We use synthetic data generated from actual job log data from the ASCI Blue Mountain supercomputer combined with different long-range dependencies. We show the distribution of times for the first storm to occur, which is in a sense the time when the machine becomes obsolete because it represents the time when the machine first fails to provide satisfactory turnaround. To overcome queue storms, more resources are needed even if they appear superfluous most of the time. We present two methods, including a grid-based solution, for reducing these correlations and their resulting effect on the size and frequency of queue storms.
Title : Quelling queue storms