Parallel AMG solver for three dimensional unstructured grids using GPU
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116899
K. Tej, N. Sivadasan, Vatsalya Sharma, R. Banerjee
Graphics Processing Units (GPUs) have evolved over the years from graphics accelerators to scalable coprocessors. We implement an algebraic multigrid (AMG) solver for three-dimensional unstructured grids on the GPU. Such a solver has extensive applications in Computational Fluid Dynamics (CFD). Using a combination of vertex coloring, optimized memory representations, multigrid, and improved coarsening techniques, we obtain considerable speedup in our parallel implementation. Our solver significantly accelerates the solution of pressure Poisson equations, the most time-consuming part of solving the Navier-Stokes equations. In our experimental study, we solve pressure Poisson equations for flow over a lid-driven cavity and for laminar flow past a square cylinder. Compared to serial non-multigrid implementations, our implementation achieves a 915× speedup for the lid-driven cavity problem on a grid of size 2.6 million and a 1020× speedup for the laminar flow past a square cylinder problem on a grid of size 1.7 million. Our implementation uses NVIDIA's CUDA programming model.
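The abstract does not detail the solver's components, so as a hedged illustration of the multigrid idea it builds on, the sketch below runs a two-grid V-cycle (weighted-Jacobi smoothing, injection restriction, linear-interpolation prolongation) on a 1D Poisson problem. All names and parameters are invented for the demo; the paper's solver is algebraic, handles unstructured 3D grids, and runs in CUDA.

```python
import numpy as np

def jacobi(u, f, h, sweeps=3, w=2/3):
    """Weighted-Jacobi smoothing for the 1D Poisson problem -u'' = f."""
    for _ in range(sweeps):
        u[1:-1] = (1 - w) * u[1:-1] + w * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return u

def v_cycle(u, f, h):
    """One two-grid V-cycle: smooth, restrict residual, correct, smooth."""
    u = jacobi(u, f, h)
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] + (u[:-2] - 2 * u[1:-1] + u[2:]) / (h * h)  # residual of -u'' = f
    rc = r[::2].copy()                      # restriction: injection onto the coarse grid
    ec = np.zeros_like(rc)
    for _ in range(50):                     # "solve" the coarse problem by heavy smoothing
        ec = jacobi(ec, rc, 2 * h, sweeps=1)
    e = np.zeros_like(u)
    e[::2] = ec                             # prolongation: copy coarse points, then
    e[1:-1:2] = 0.5 * (e[:-2:2] + e[2::2])  # linearly interpolate the fine points
    u += e
    return jacobi(u, f, h)

n = 129
h = 1.0 / (n - 1)
x = np.linspace(0.0, 1.0, n)
f = np.pi**2 * np.sin(np.pi * x)            # exact solution: sin(pi x)
u = np.zeros(n)
for _ in range(20):
    u = v_cycle(u, f, h)
print("max error:", np.abs(u - np.sin(np.pi * x)).max())
```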
Optimizing shared data accesses in distributed-memory X10 systems
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116889
Jeeva Paudel, O. Tardieu, J. N. Amaral
Prior studies have established the performance impact of coherence protocols optimized for specific patterns of shared-data accesses in Non-Uniform Memory Access (NUMA) systems. First, this work incorporates a directory-based protocol into the runtime system of X10, a Partitioned Global Address Space (PGAS) programming language, to manage read-mostly, producer-consumer, stencil, and migratory variables. This protocol complements the existing X10Protocol, which keeps a unique copy of each shared variable and relies on message transfers for all remote accesses. The X10Protocol is effective for managing accumulator, write-mostly, and general read-write variables. Second, this work introduces a new shared-variable access-pattern profiler that a new coherence-policy manager uses to decide which protocol should be used for each shared variable. The profiler can run in both offline and online modes. An evaluation on a 128-core distributed-memory machine reveals that coordination between these protocols does not degrade performance on any of the applications studied and achieves speedups in the range of 15% to 40% over the X10Protocol. Performance is also comparable to carefully hand-written versions of the applications.
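The abstract names the access categories but not how the profiler classifies variables. The toy below, with invented thresholds, classification rules, and protocol labels, only illustrates the shape of such a profiler-plus-policy-manager pipeline; it is not the paper's mechanism.

```python
from collections import defaultdict

class AccessProfiler:
    """Count reads/writes per shared variable and track which places write it."""
    def __init__(self):
        self.stats = defaultdict(lambda: {"reads": 0, "writes": 0, "writers": set()})

    def record(self, var, place, is_write):
        s = self.stats[var]
        if is_write:
            s["writes"] += 1
            s["writers"].add(place)
        else:
            s["reads"] += 1

def choose_protocol(s):
    """Coherence-policy manager: pick a protocol per variable (invented rules)."""
    total = s["reads"] + s["writes"]
    if total and s["reads"] / total > 0.9:
        return "directory protocol (read-mostly: replicate)"
    if len(s["writers"]) == 1 and s["reads"] > 0:
        return "directory protocol (producer-consumer: forward updates)"
    return "X10Protocol (single copy, message per remote access)"

prof = AccessProfiler()
for _ in range(95):
    prof.record("lookup_table", place=3, is_write=False)
prof.record("lookup_table", place=0, is_write=True)
for p in range(4):
    prof.record("counter", place=p, is_write=True)   # write-mostly accumulator

for var, s in prof.stats.items():
    print(var, "->", choose_protocol(s))
```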
Fine-grained GPU parallelization of pairwise local sequence alignment
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116912
Chirag Jain, Subodh Kumar
The Smith-Waterman algorithm is used in bioinformatics to perform pairwise local alignment between a query sequence and a subject sequence. We present a GPU-based parallel version of this algorithm that performs pairwise alignment faster than previous algorithms. In particular, it parallelizes each alignment, rather than relying on parallelism across multiple pair alignments as many other proposed GPU algorithms do. As a result, it scales better. We further extend our algorithm to work efficiently on a cluster of GPUs. At a high level, our approach subdivides the iterative computation of matrix elements among blocks of processors such that each block can simply recompute the data it needs instead of waiting for other processors to compute it. Sometimes, however, this may lead to excessive recomputation. We evaluate these cases and employ a hybrid approach, recomputing only limited data and communicating the rest. Our algorithm is also extended to produce not only the best alignment but all K best alignments. Our results on the SSCA#1 benchmark show that our method is 5-24 times faster than the previous method.
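The paper's key trick is block-level recomputation; the serial Python sketch below shows only the underlying dependency structure that makes fine-grained parallelism possible in the first place: all cells on one anti-diagonal of the Smith-Waterman matrix are mutually independent, so each could be assigned to its own GPU thread. Scoring parameters are illustrative.

```python
import numpy as np

def smith_waterman_antidiagonal(q, s, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score, iterating by anti-diagonals.
    Cells with i + j = d depend only on diagonals d-1 and d-2, so the inner
    loop is embarrassingly parallel (one GPU thread per cell)."""
    m, n = len(q), len(s)
    H = np.zeros((m + 1, n + 1), dtype=np.int32)
    for d in range(2, m + n + 1):            # anti-diagonal index i + j = d
        lo, hi = max(1, d - n), min(m, d - 1)
        for i in range(lo, hi + 1):          # independent cells on diagonal d
            j = d - i
            sub = match if q[i - 1] == s[j - 1] else mismatch
            H[i, j] = max(0,
                          H[i - 1, j - 1] + sub,   # match/mismatch
                          H[i - 1, j] + gap,       # deletion
                          H[i, j - 1] + gap)       # insertion
    return int(H.max())

print(smith_waterman_antidiagonal("GGTTGACTA", "TGTTACGG"))
```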
Reducing elimination tree height for parallel LU factorization of sparse unsymmetric matrices
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116880
Enver Kayaaslan, B. Uçar
The elimination tree for unsymmetric matrices is a recent model that plays an important role in sparse LU factorization. This tree captures the dependencies between the tasks of some well-known variants of sparse LU factorization. The height of the elimination tree therefore corresponds to the critical path length of the task dependency graph in the corresponding parallel LU factorization methods. We investigate the problem of finding minimum-height elimination trees, which expose maximum parallelism by minimizing the critical path length. This problem has recently been shown to be NP-complete. We therefore propose heuristics that generalize the most successful approaches used for symmetric matrices to unsymmetric ones. We test the proposed heuristics on a large set of real-world matrices and report a 28% reduction in elimination tree height relative to a common method that exploits state-of-the-art tools used in Cholesky factorization.
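The abstract's premise is that tree height bounds the factorization's critical path. As a hedged, symmetric-case illustration (the paper treats the harder unsymmetric case), the sketch below builds an elimination tree with Liu's classical algorithm and measures its height; the `cols` input format and the example matrix are invented for the demo.

```python
def etree_and_height(n, cols):
    """Liu's elimination-tree algorithm for a symmetric n x n sparse matrix.
    cols[j] lists the row indices i < j with A[i, j] != 0. Returns the parent
    array and the tree height (longest root path), which bounds the critical
    path of a task-parallel factorization."""
    parent = [-1] * n
    anc = [-1] * n                        # path-compressed ancestor links
    for j in range(n):
        for i in cols[j]:
            r = i
            while anc[r] != -1 and anc[r] != j:
                nxt = anc[r]
                anc[r] = j                # compress the path toward j
                r = nxt
            if anc[r] == -1:
                anc[r] = j
                parent[r] = j
    height = [0] * n
    for v in range(n):                    # parent[v] > v, so one upward pass suffices
        if parent[v] != -1:
            height[parent[v]] = max(height[parent[v]], height[v] + 1)
    return parent, max(height)

# Fill from entries like A[0,3] chains all vertices into one path: height 4,
# i.e., no task parallelism. Reordering the matrix could lower the tree.
cols = {0: [], 1: [0], 2: [1], 3: [0], 4: [2, 3]}
print(etree_and_height(5, cols))          # ([1, 2, 3, 4, -1], 4)
```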
Distance threshold similarity searches on spatiotemporal trajectories using GPGPU
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116913
M. Gowanlock, H. Casanova
The processing of moving object trajectories arises in many application domains. We focus on a trajectory similarity search, the distance threshold search, which finds all trajectories within a given distance of a query trajectory over a time interval. A multithreaded CPU implementation that makes use of an in-memory R-tree index can achieve high parallel efficiency. We propose a GPGPU implementation that avoids index-trees altogether and instead features a GPU-friendly indexing scheme. We show that our GPU implementation compares well to the CPU implementation. One interesting question is that of creating efficient query batches (so as to reduce both memory pressure and computation cost on the GPU). We design algorithms for creating such batches, and we find that using fixed-size batches is sufficient in practice. We develop an empirical response time model that can be used to pick a good batch size.
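To make the query concrete, here is a brute-force baseline for the distance threshold search, hedged: it assumes trajectories sampled at the same time steps and 2D points, both simplifications for brevity. The paper's contribution is replacing exactly this kind of linear scan with a GPU-friendly index and batched queries.

```python
import numpy as np

def distance_threshold_search(query, trajectories, d):
    """Return the ids of all trajectories that come within distance d of the
    query trajectory at some shared time step (brute force)."""
    hits = []
    for tid, traj in trajectories.items():
        T = min(len(query), len(traj))                    # overlap of the two series
        dists = np.linalg.norm(query[:T] - traj[:T], axis=1)
        if (dists <= d).any():
            hits.append(tid)
    return hits

rng = np.random.default_rng(0)
query = rng.uniform(0, 100, size=(50, 2))                 # 50 time steps, 2D points
trajs = {i: rng.uniform(0, 100, size=(50, 2)) for i in range(1000)}
print(distance_threshold_search(query, trajs, d=2.0))
```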
Simple parallel biconnectivity algorithms for multicore platforms
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116914
George M. Slota, Kamesh Madduri
We present two new algorithms for finding the biconnected components of a large undirected sparse graph. The first algorithm is based on identifying articulation points and labeling edges using multiple connectivity queries, and the second approach uses the color propagation technique to decompose the graph. Both methods use a breadth-first spanning tree and some auxiliary information computed during Breadth-First Search (BFS). These methods are simpler than the Tarjan-Vishkin PRAM algorithm for biconnectivity and do not require Euler tour computation or any auxiliary graph construction. We identify steps in these algorithms that can be parallelized in a shared-memory environment and develop tuned OpenMP implementations. Using a collection of large-scale real-world graph instances, we show that these methods outperform the state-of-the-art Cong-Bader biconnected components implementation, which is based on the Tarjan-Vishkin algorithm. We achieve up to 7.1× and 4.2× parallel speedup over the serial Hopcroft-Tarjan and parallel Cong-Bader algorithms, respectively, on a 16-core Intel Sandy Bridge system. For some graph instances, due to the fast BFS-based preprocessing step, the single-threaded implementation of our first algorithm is faster than the serial Hopcroft-Tarjan algorithm.
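For context, the classical serial baseline the abstract compares against identifies articulation points via DFS low-links (Hopcroft-Tarjan). The sketch below is that baseline, not the paper's BFS-based parallel method; the example graph is invented.

```python
import sys

def articulation_points(adj, n):
    """Serial Hopcroft-Tarjan articulation-point detection via DFS low-links."""
    sys.setrecursionlimit(max(10_000, 2 * n))
    disc = [0] * n          # discovery times (0 = unvisited)
    low = [0] * n           # lowest discovery time reachable via one back edge
    timer = [1]
    cut = set()

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if disc[v] == 0:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent != -1 and low[v] >= disc[u]:
                    cut.add(u)          # v's subtree has no back edge above u
            elif v != parent:
                low[u] = min(low[u], disc[v])
        if parent == -1 and children > 1:
            cut.add(u)                  # root is a cut vertex iff it has 2+ DFS children

    for u in range(n):
        if disc[u] == 0:
            dfs(u, -1)
    return cut

# Two triangles sharing vertex 2: vertex 2 is the only articulation point.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
print(articulation_points(adj, 5))      # {2}
```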
Scaling graph community detection on the Tilera many-core architecture
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116708
D. Chavarría-Miranda, M. Halappanavar, A. Kalyanaraman
In an era when power constraints and data movement are proving to be significant barriers for high-end computing, the Tilera many-core architecture offers a low-power platform exhibiting many important characteristics of future systems, including a large number of simple cores, a sophisticated network-on-chip, and fine-grained control over memory and caching policies. While this emerging architecture has previously been studied for structured compute-intensive kernels, benchmarking the platform on data-bound, irregular applications presents significant challenges that have remained unexplored. Community detection is a prototypical advanced graph-theoretic operation with applications in numerous scientific domains, including the life sciences, cyber security, and power systems. In this work, we explore multiple design strategies toward developing a scalable tool for community detection on the Tilera platform. Using several memory layout and work scheduling techniques, we demonstrate speedups of up to 47× on 36 cores of the Tilera TileGX36 platform over the best serial implementation, and we also show results comparable in quality and performance to mainstream x86 platforms. To the best of our knowledge, this is the first work addressing graph algorithms on the Tilera platform. This study demonstrates that, through careful design-space exploration, low-power many-core platforms like Tilera can be effectively exploited for graph algorithms that embody all the essential characteristics of an irregular application.
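The abstract does not name the specific detection algorithm. As a neutral stand-in that exhibits the same irregular, neighbor-driven memory accesses, the sketch below implements minimal label-propagation community detection; it is not the paper's kernel, and the example graph is invented.

```python
import random
from collections import Counter

def label_propagation(adj, max_iters=20, seed=0):
    """Each vertex repeatedly adopts the most frequent label among its
    neighbors. The per-vertex gather over adjacency lists is exactly the
    kind of irregular access pattern that is hard to scale on many-cores."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}
    order = list(adj)
    for _ in range(max_iters):
        rng.shuffle(order)                 # randomized sweep order
        changed = False
        for v in order:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:                    # converged: no label moved this sweep
            break
    return labels

# Two triangles joined by a single bridge edge tend to form two communities.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(adj))
```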
Queueing-based storage performance modeling and placement in OpenStack environments
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116887
Yang Song, Rakesh Jain, R. Routray
In enterprise data centers, reliable performance models of storage devices are desirable for efficient storage management and optimization. However, many cloud environments consist of heterogeneous storage devices, e.g., a mixture of commodity disks, for which accurate performance models are particularly challenging to attain. In this paper, we propose a lightweight queueing-based storage performance modeling framework that is able to infer the maximum IO load a storage device can sustain, as well as its IO load vs. response time curve. Our inference framework views the underlying storage resources as black boxes and utilizes only historical measurements of IO load and response time on the devices. In an OpenStack environment, we also develop a new storage volume placement algorithm using our performance inference and modeling framework. Experimental results show that our solution can provide up to an 80% increase in IO throughput together with a 40% reduction in average response time, compared to the performance of the default OpenStack policy.
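The abstract says the model is queueing-based but does not specify it. As the simplest hedged instance, assume an M/M/1 server: mean response time at arrival rate λ is R = 1/(μ - λ), so 1/R is linear in λ and the service capacity μ (the maximum sustainable load) can be fit from historical (load, response time) pairs, which is all the black-box framework observes. The data below is synthetic.

```python
import numpy as np

def fit_mm1(loads, resp_times):
    """Estimate the service rate mu from measured (IO load, response time)
    pairs, using the M/M/1 relation 1/R = mu - lambda."""
    loads = np.asarray(loads, dtype=float)
    inv_r = 1.0 / np.asarray(resp_times, dtype=float)
    return float(np.mean(loads + inv_r))   # mu ~ mean(lambda + 1/R)

def predicted_response(mu, lam):
    """Predicted mean response time at load lam; saturates at lam >= mu."""
    return float("inf") if lam >= mu else 1.0 / (mu - lam)

# Synthetic measurements from a device with true capacity mu = 1000.
rng = np.random.default_rng(1)
lam = rng.uniform(100, 900, size=50)
R = 1.0 / (1000 - lam) * rng.normal(1.0, 0.05, size=50)   # 5% measurement noise

mu_hat = fit_mm1(lam, R)
print(f"inferred max sustainable load: {mu_hat:.0f}")
print(f"predicted response time at load 950: {predicted_response(mu_hat, 950):.4f}")
```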
RADIR: Lock-free and wait-free bandwidth allocation models for solid state drives
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116908
Pooja Aggarwal, G. Yasa, S. Sarangi
Novel applications such as micro-blogging and algorithmic trading typically place a very high load on the underlying storage system. They are characterized by a stream of very short requests and thus require very high I/O throughput. The traditional solution for supporting such applications is to use an array of hard disks. With the advent of solid state drives (SSDs), storage vendors increasingly prefer them because their I/O throughput can scale up to a million IOPS (I/O operations per second). In this paper, we design a family of algorithms, RADIR, to schedule requests for such systems. Our algorithms are lock-free/wait-free and linearizable, and they take request characteristics into account, such as deadlines, sizes, dependences, and the amount of redundancy available in RAID configurations. We perform simulations with workloads derived from traces provided by Microsoft and demonstrate a scheduling throughput of 900K IOPS on a 64-thread Intel server. Our algorithms are 2-3 orders of magnitude faster than versions that use locks. We show detailed results for the effects of deadlines, request sizes, and RAID levels on the quality of the schedule.
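RADIR's contribution is the lock-free/wait-free structure itself, which a short Python snippet cannot reproduce. The hedged sketch below shows only the scheduling inputs the abstract lists (deadlines, request sizes) wired into an ordinary earliest-deadline-first dispatcher; the heap here is NOT lock-free, and all field names and rates are invented.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class IORequest:
    deadline: float                       # ordering key: earliest deadline first
    size_kb: int = field(compare=False)
    op: str = field(compare=False)

def schedule_edf(requests, bandwidth_kb_per_ms):
    """Dispatch requests in earliest-deadline-first order on one device;
    returns (request, finish time, deadline met?) tuples."""
    heap = list(requests)
    heapq.heapify(heap)
    t, out = 0.0, []
    while heap:
        req = heapq.heappop(heap)
        t += req.size_kb / bandwidth_kb_per_ms   # service time grows with size
        out.append((req, t, t <= req.deadline))
    return out

reqs = [IORequest(5.0, 64, "read"),
        IORequest(2.0, 16, "write"),
        IORequest(9.0, 128, "read")]
for req, finish, met in schedule_edf(reqs, bandwidth_kb_per_ms=32):
    print(f"{req.op:5s} deadline={req.deadline} finish={finish:.2f} met={met}")
```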
Algorithms for power-aware resource activation
Pub Date: 2014-12-01 | DOI: 10.1109/HiPC.2014.7116891
Sonika Arora, Archita Agarwal, Venkatesan T. Chakaravarthy, Yogish Sabharwal
We study the problem of minimally activating a resource that is shared by multiple jobs. In a power-aware computing environment, the resource needs to be activated (powered up) so that it can service the jobs. Each job specifies an interval during which it needs the services of the resource and the duration (time length) for which it requires the resource to be active. Our goal is to activate the resource for a minimum amount of time while satisfying all the jobs. We study two variants of this problem: the contiguous and the non-contiguous cases. In the contiguous case, each job requires that its demand for the resource be serviced with a set of contiguous timeslots, whereas in the non-contiguous case, the demand of a job may be serviced with a set of non-contiguous timeslots. For the contiguous case, we present an optimal polynomial-time algorithm; this improves the best known result, a 2-approximation algorithm. For the non-contiguous case, we present efficient algorithms for finding optimal and approximate solutions.
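To make the problem statement concrete, here is a toy brute-force solver for the non-contiguous variant on tiny instances: each job has a window [r, d) and needs p active slots inside it. This is only a checker with exponential running time, not the paper's efficient algorithms; the job data is invented.

```python
from itertools import combinations

def satisfied(active, jobs):
    """A job (r, d, p) is satisfied if at least p active slots lie in [r, d)."""
    return all(sum(r <= t < d for t in active) >= p for (r, d, p) in jobs)

def min_activation(horizon, jobs):
    """Smallest set of active timeslots in [0, horizon) satisfying all jobs,
    found by exhaustive search over activation sets of growing size."""
    slots = range(horizon)
    for k in range(horizon + 1):
        for active in combinations(slots, k):
            if satisfied(active, jobs):
                return set(active)
    return None                            # infeasible within the horizon

jobs = [(0, 4, 2),    # needs 2 active slots somewhere in {0,1,2,3}
        (2, 6, 3),    # needs 3 active slots somewhere in {2,3,4,5}
        (5, 8, 1)]    # needs 1 active slot somewhere in {5,6,7}
print(sorted(min_activation(8, jobs)))     # [2, 3, 5]: three slots suffice
```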