Resource-Management Study in HPC Runtime-Stacking Context
Pub Date: 2017-10-27, DOI: 10.1109/SBAC-PAD.2017.30
Arthur Loussert, Benoit Welterlen, Patrick Carribault, Julien Jaeger, Marc Pérache, R. Namyst
With the advent of multicore and manycore processors as building blocks of HPC supercomputers, many applications are shifting from relying solely on a distributed programming model (e.g., MPI) to mixing distributed and shared-memory models (e.g., MPI+OpenMP), to better exploit shared-memory communication and reduce the overall memory footprint. One side effect of this programming approach is runtime stacking: mixing multiple models requires several runtime libraries to be alive at the same time and to share the underlying computing resources. This paper explores the different configurations in which this stacking may appear and introduces algorithms to detect the misuse of compute resources when running a hybrid parallel application. We have implemented our algorithms inside a dynamic tool that monitors applications and reports resource usage to the user. We validated this tool on applications from the CORAL benchmarks: it produces relevant information that can be used to improve runtime placement, with an average overhead below 1% of total execution time.
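The kind of resource-misuse check the abstract describes can be illustrated with a small, hypothetical sketch (not the paper's tool): on Linux, gather the CPU affinity of every thread in a process and flag cores that more than one pinned thread claims, which is one symptom of runtime stacking gone wrong.

```python
# Hypothetical sketch: flag CPU cores claimed by more than one pinned thread
# of a process on Linux. A simplified stand-in for the oversubscription
# checks a runtime-stacking monitor might perform; not the paper's tool.
import os
from collections import defaultdict

def thread_affinities(pid):
    """Return {tid: set_of_allowed_cpus} for every thread of `pid` (Linux only)."""
    affinities = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        # os.sched_getaffinity also accepts a thread id on Linux
        affinities[int(tid)] = os.sched_getaffinity(int(tid))
    return affinities

def oversubscribed_cores(affinities):
    """Report cores that several single-core-pinned threads would have to share."""
    core_users = defaultdict(list)
    for tid, cpus in affinities.items():
        if len(cpus) == 1:                      # thread is pinned to exactly one core
            core_users[next(iter(cpus))].append(tid)
    return {core: tids for core, tids in core_users.items() if len(tids) > 1}

if __name__ == "__main__":
    aff = thread_affinities(os.getpid())
    print("conflicting cores:", oversubscribed_cores(aff))
```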
{"title":"Resource-Management Study in HPC Runtime-Stacking Context","authors":"Arthur Loussert, Benoit Welterlen, Patrick Carribault, Julien Jaeger, Marc Pérache, R. Namyst","doi":"10.1109/SBAC-PAD.2017.30","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.30","url":null,"abstract":"With the advent of multicore and manycore processors as building blocks of HPC supercomputers, many applications shift from relying solely on a distributed programming model (e.g., MPI) to mixing distributed and shared-memory models (e.g., MPI+OpenMP), to better exploit shared-memory communications and reduce the overall memory footprint. One side effect of this programming approach is runtime stacking: mixing multiple models involve various runtime libraries to be alive at the same time and to share the underlying computing resources. This paper explores different configurations where this stacking may appear and introduces algorithms to detect the misuse of compute resources when running a hybrid parallel application. We have implemented our algorithms inside a dynamic tool that monitors applications and outputs resource usage to the user. We validated this tool on applications from CORAL benchmarks. This leads to relevant information which can be used to improve runtime placement, and to an average overhead lower than 1% of total execution time.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126569861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud Workload Prediction and Generation Models
Pub Date: 2017-10-17, DOI: 10.1109/SBAC-PAD.2017.19
Gilles Madi-Wamba, Yunbo Li, Anne-Cécile Orgerie, Nicolas Beldiceanu, Jean-Marc Menaud
Cloud computing allows for elasticity, as users can dynamically benefit from new virtual resources when their workload increases. Such a feature requires highly reactive resource provisioning mechanisms. In this paper, we propose two new workload prediction models, based on constraint programming and neural networks, that can be used for dynamic resource provisioning in Cloud environments. We also present two workload trace generators that can help extend an experimental dataset in order to test resource optimization heuristics more widely. Our models are validated using real traces from a small Cloud provider. The two approaches are shown to be complementary: neural networks give better prediction results, while constraint programming is more suitable for trace generation.
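As a rough illustration of sliding-window workload prediction, the sketch below fits a linear autoregressive model on past load samples and predicts the next value. It is a minimal baseline standing in for the paper's neural-network model (the window width and synthetic trace are assumptions), not the authors' code.

```python
# Minimal sketch of sliding-window workload prediction: fit a linear
# autoregressive model on past load samples and predict the next one.
# A baseline stand-in for the paper's neural-network predictor.
import numpy as np

def make_windows(series, width):
    """Turn a 1-D load trace into (X, y) pairs: `width` past samples -> next sample."""
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = np.array(series[width:])
    return X, y

def fit_predictor(series, width=12):
    """Least-squares fit of the next load value from the previous `width` values."""
    X, y = make_windows(series, width)
    X = np.hstack([X, np.ones((len(X), 1))])          # bias term
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_next(coef, recent):
    return float(np.dot(np.append(recent, 1.0), coef))

if __name__ == "__main__":
    trace = np.sin(np.linspace(0, 20, 400)) * 40 + 50  # synthetic CPU-load trace (%)
    coef = fit_predictor(trace, width=12)
    print("next predicted load:", predict_next(coef, trace[-12:]))
```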
{"title":"Cloud Workload Prediction and Generation Models","authors":"Gilles Madi-Wamba, Yunbo Li, Anne-Cécile Orgerie, Nicolas Beldiceanu, Jean-Marc Menaud","doi":"10.1109/SBAC-PAD.2017.19","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.19","url":null,"abstract":"Cloud computing allows for elasticity as users can dynamically benefit from new virtual resources when their workload increases. Such a feature requires highly reactive resource provisioning mechanisms. In this paper, we propose two new workload prediction models, based on constraint programming and neural networks, that can be used for dynamic resource provisioning in Cloud environments. We also present two workload trace generators that can help to extend an experimental dataset in order to test more widely resource optimization heuristics. Our models are validated using real traces from a small Cloud provider. Both approaches are shown to be complimentary as neural networks give better prediction results, while constraint programming is more suitable for trace generation.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131582284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beyond the Fog: Bringing Cross-Platform Code Execution to Constrained IoT Devices
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.10
F. Pisani, Jeferson Rech Brunetta, Vanderson Martins do Rosário, E. Borin
Considering the prediction that there will be over 50 billion devices connected to the Internet of Things (IoT) in the near future, the demand for efficient ways to process data streams generated by sensors grows ever larger, highlighting the need to re-evaluate current approaches, such as sending all data to the cloud for processing and analysis. In this paper, we explore one method for improving this scenario: bringing the computation closer to data sources. By executing code on the IoT devices themselves instead of on the network edge or in the cloud, solutions can better meet the latency requirements of several applications, avoid problems with slow and intermittent network connections, prevent network congestion, and potentially save energy by reducing communication. To this end, we propose the LMC framework and compare it with Edgent, an open-source project under development by the Apache Incubator. Using a DragonBoard 410c to execute a simple filter, an outlier detector, and a program that calculates the FFT, we obtained results indicating that LMC outperforms Edgent when dynamic translation is disabled for both, and is more suitable for lightweight, quick queries otherwise. More importantly, LMC also enables cross-platform code execution on small, cheap devices that do not have enough resources to run Edgent, such as the NodeMCU 1.0.
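For readers unfamiliar with the benchmark workloads, the sketch below shows the kind of on-device stream processing the abstract mentions: a threshold filter and a rolling outlier detector. It is purely illustrative, with assumed window sizes and thresholds, and is unrelated to the LMC or Edgent APIs.

```python
# Illustrative on-device stream processing: a threshold filter and a rolling
# z-score outlier detector, similar in spirit to the benchmarks named in the
# abstract. Hypothetical parameters; not LMC or Edgent code.
from collections import deque
import math

class RollingOutlierDetector:
    """Flag samples more than `k` standard deviations from a rolling mean."""
    def __init__(self, window=50, k=3.0):
        self.buf = deque(maxlen=window)
        self.k = k

    def is_outlier(self, x):
        if len(self.buf) >= 10:                       # need a minimal history first
            mean = sum(self.buf) / len(self.buf)
            var = sum((v - mean) ** 2 for v in self.buf) / len(self.buf)
            outlier = var > 0 and abs(x - mean) > self.k * math.sqrt(var)
        else:
            outlier = False
        self.buf.append(x)
        return outlier

def threshold_filter(samples, low, high):
    """Drop readings outside the sensor's plausible range before forwarding them."""
    return [s for s in samples if low <= s <= high]

if __name__ == "__main__":
    det = RollingOutlierDetector()
    readings = [20.0 + 0.1 * (i % 5) for i in range(60)] + [95.0] + [20.5] * 10
    print("outlier indices:", [i for i, r in enumerate(readings) if det.is_outlier(r)])
    print("kept after filter:", len(threshold_filter(readings, 0.0, 50.0)))
```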
{"title":"Beyond the Fog: Bringing Cross-Platform Code Execution to Constrained IoT Devices","authors":"F. Pisani, Jeferson Rech Brunetta, Vanderson Martins do Rosário, E. Borin","doi":"10.1109/SBAC-PAD.2017.10","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.10","url":null,"abstract":"Considering the prediction that there will be over 50 billion devices connected to the Internet of Things (IoT) in the near future, the demand for efficient ways to process data streams generated by sensors grows ever larger, highlighting the necessity to re-evaluate current approaches, such as sending all data to the cloud for processing and analysis.In this paper, we explore one of the methods for improving this scenario: bringing the computation closer to data sources. By executing the code on the IoT devices themselves instead of on the network edge or the cloud, solutions can better meet the latency requirements of several applications, avoid problems with slow and intermittent network connections, prevent network congestion, and potentially save energy by reducing communication.To this end, we propose the LMC framework and compare it with Edgent, an open-source project that is under development by the Apache Incubator. By using a DragonBoard 410c to execute a simple filter, an outlier detector, and a program that calculates the FFT, we obtained results that indicate that LMC outperforms Edgent when dynamic translation is disabled for both of them and is more suitable for lightweight quick queries otherwise. More importantly, the LMC also enables us to perform cross-platform code execution on small, cheap devices that do not have enough resources to run Edgent, like the NodeMCU 1.0.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116869302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GC-CR: A Decentralized Garbage Collector Component for Checkpointing in Clouds
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.20
Thouraya Louati, Heithem Abbes, C. Cérin, M. Jemni
Infrastructure-as-a-Service container-based virtualization technology is gaining significant interest in industry as an alternative platform for running distributed applications. With the increasing scale of Cloud Computing architectures, faults are becoming a frequent occurrence, and Checkpoint-Restart is a key method for surviving failures in this context. However, there is a need to reduce the amount of checkpointing data, as the Cloud is based on the pay-as-you-go model. This paper addresses the issue of garbage collection in LXCloud-CR and contributes a novel decentralized garbage collection component, GC-CR. LXCloud-CR, a decentralized Checkpoint-Restart model, is able to take snapshots of Linux Container instances and uses replication to increase snapshot availability, with a versioning scheme for each replica. The drawback of versioning is that the number of useless snapshot files grows over time. GC-CR is a decentralized garbage collector (checkpoint deletion) component that identifies and eliminates old snapshot versions from the system in order to free storage space. Large-scale experiments on the Grid5000 testbed demonstrate the benefits of our proposal: the obtained results validate our model and show a significant reduction in storage space consumption.
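The core idea of version-based checkpoint garbage collection can be sketched in a few lines: keep only the most recent versions of each snapshot and delete the rest. The directory layout, naming convention, and retention policy below are assumptions for illustration; this is not the GC-CR implementation.

```python
# Minimal sketch of version-based snapshot garbage collection: keep the newest
# `keep` versions of each container snapshot and delete older ones.
# Hypothetical file layout (e.g. "web-ctr.v13"); not GC-CR's code.
import os
import re
from collections import defaultdict

VERSION_RE = re.compile(r"^(?P<name>.+)\.v(?P<ver>\d+)$")

def collect_garbage(snapshot_dir, keep=2, dry_run=True):
    """Return (and optionally delete) snapshot versions older than the newest `keep`."""
    versions = defaultdict(list)
    for fname in os.listdir(snapshot_dir):
        m = VERSION_RE.match(fname)
        if m:
            versions[m["name"]].append((int(m["ver"]), fname))
    removed = []
    for name, vers in versions.items():
        for _, fname in sorted(vers)[:-keep]:      # everything but the newest `keep`
            removed.append(fname)
            if not dry_run:
                os.remove(os.path.join(snapshot_dir, fname))
    return removed
```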
{"title":"GC-CR: A Decentralized Garbage Collector Component for Checkpointing in Clouds","authors":"Thouraya Louati, Heithem Abbes, C. Cérin, M. Jemni","doi":"10.1109/SBAC-PAD.2017.20","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.20","url":null,"abstract":"Infrastructure-as-a-Service container-based virtualization technology is gaining significant interest in industry as an alternative platform for running distributed applications. With increasing scale of Cloud Computing architectures, faults are becoming a frequent occurrence. Checkpoint-Restart is a key method to survive to failures in this context. However, there is a need to reduce the amount of checkpointing data as the Cloud is based on the pay-as-you-go model. This paper addresses the issue of garbage collection in LXCloud-CR and contributes with a novel decentralized garbage collection component GC-CR. LXCloud-CR, a decentralized Checkpoint-Restart model, is able to take snapshots of Linux Container instances and it uses replication to increase snapshots availability. LXCloud-CR contains a versioning scheme for each replica. The disadvantage refers to snapshots availability issues with versioning as the number of useless files grows. GC-CR is a decentralized garbage collector (checkpoint deletion) component that attempts to identify and eliminate old snapshots versions from the system in order to free storage space. Large scale experiments on the Grid5000 testbed demonstrate the benefits of our proposal. Obtained results validate our model and show significant reduction of storage space consumption","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"152 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114378165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalability of CPU and GPU Solutions of the Prime Elliptic Curve Discrete Logarithm Problem
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.12
J. Panetta, P. S. Filho, Luiz A. F. Laranjeira, Carlos A. Teixeira
Elliptic curve asymmetric cryptography has achieved increased popularity due to its capability of providing levels of security comparable to other existing cryptographic systems while requiring less computational work. Pollard Rho and Parallel Collision Search, the fastest known sequential and parallel algorithms for breaking this cryptographic system, have been successfully applied over time to break ever-increasing bit-length system instances using implementations heavily optimized for the available hardware. This work presents portable, general implementations of a Parallel Collision Search based solution for prime elliptic curve asymmetric cryptographic systems that use publicly available big-integer libraries and make no assumptions about prime curve properties. It investigates which bit-length keys can be broken in reasonable time by a user with access to state-of-the-art public HPC equipment with CPUs and GPUs. The final implementation breaks a 79-bit system in about two hours using 80 GPUs, and a 94-bit system in about 15 hours using 256 GPUs. Extensive experimentation investigates the scalability of CPU, GPU, and CPU+GPU runs. The discussed results indicate that speed-up is not a good metric for parallel scalability; this paper proposes and evaluates a new metric better suited for the task.
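To make the Parallel Collision Search idea concrete, the toy sketch below runs a van Oorschot-Wiener style search with distinguished points over a small prime-field subgroup rather than an elliptic curve (which is what the paper targets). The group parameters, walker count, and distinguished-point criterion are made-up illustrations of the algorithmic idea, not the authors' GPU implementation.

```python
# Toy parallel-collision-search sketch with distinguished points, solving a
# discrete log g^x = h in a small prime-field subgroup of order q.
# Illustrative only; requires Python 3.8+ for pow(value, -1, q).
import random

def step(y, a, b, g, h, p, q):
    """Pseudo-random walk on y = g^a * h^b, updating the exponents mod q."""
    if y % 3 == 0:
        return (y * y) % p, (2 * a) % q, (2 * b) % q
    if y % 3 == 1:
        return (y * g) % p, (a + 1) % q, b
    return (y * h) % p, a, (b + 1) % q

def collision_search(g, h, p, q, dist_bits=4, walkers=64, max_steps=200000):
    """Return x with g^x = h (mod p), or None on a degenerate/absent collision."""
    table = {}                                   # distinguished point -> (a, b)
    for _ in range(walkers):
        a, b = random.randrange(q), random.randrange(q)
        y = (pow(g, a, p) * pow(h, b, p)) % p
        for _ in range(max_steps):
            y, a, b = step(y, a, b, g, h, p, q)
            if y % (1 << dist_bits) == 0:        # distinguished point reached
                if y in table and table[y] != (a, b):
                    a0, b0 = table[y]            # g^a0 * h^b0 = g^a * h^b
                    if (b0 - b) % q == 0:
                        return None              # unusable collision
                    return (a - a0) * pow(b0 - b, -1, q) % q
                table[y] = (a, b)
                break                            # walker reports, next one starts
    return None

if __name__ == "__main__":
    p, q, g = 10007, 5003, 4                     # 4 generates the order-q subgroup
    h = pow(g, 1234, p)
    print(collision_search(g, h, p, q))          # expect 1234 (rerun if None)
```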
{"title":"Scalability of CPU and GPU Solutions of the Prime Elliptic Curve Discrete Logarithm Problem","authors":"J. Panetta, P. S. Filho, Luiz A. F. Laranjeira, Carlos A. Teixeira","doi":"10.1109/SBAC-PAD.2017.12","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.12","url":null,"abstract":"Elliptic curve asymmetric cryptography has achieved increased popularity due to its capability of providing comparable levels of security as other existing cryptographic systems while requiring less computational work. Pollard Rho and Parallel Collision Search, the fastest known sequential and parallel algorithms for breaking this cryptographic system, have been successfully applied over time to break ever-increasing bit-length system instances using implementations heavily optimized for the available hardware. This work presents portable, general implementations of a Parallel Collision Search based solution for prime elliptic curve asymmetric cryptographic systems that use publicly available big integer libraries and make no assumption on prime curve properties. It investigates which bit-length keys can be broken in reasonable time by a user that has access to a state of the art, public HPC equipment with CPUs and GPUs. The final implementation breaks a 79-bit system in about two hours using 80 GPUs and 94-bits system in about 15 hours using 256 GPUs. Extensive experimentation investigates scalability of CPU, GPU and CPU+GPU runs. The discussed results indicate that speed-up is not a good metric for parallel scalability. This paper proposes and evaluates a new metric that is better suited for this task.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124843895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending OmpSs for OpenCL Kernel Co-Execution in Heterogeneous Systems
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.8
Borja Pérez, Esteban Stafford, J. L. Bosque, R. Beivide, Sergi Mateo, Xavier Teruel, X. Martorell, E. Ayguadé
Heterogeneous systems have very high potential performance but are difficult to program. OmpSs is a well-known framework for task-based parallel applications and an interesting tool to simplify the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To overcome this limitation, this paper presents an extension of the OmpSs framework that addresses two main objectives: the automatic division of datasets among several devices and the management of their memory address spaces. To adapt to different kinds of applications, the data division can be performed by the novel HGuided load balancing algorithm or by the well-known Static and Dynamic algorithms. All of this is accomplished with negligible impact on programming effort. Experimental results reveal that there is always one load balancing algorithm that improves the performance and energy consumption of the system.
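The flavor of a guided, heterogeneity-aware partitioner can be sketched as a shared work queue that hands out chunks whose size shrinks as the iteration space drains and is scaled by each device's relative speed. The chunk-size formula and device names below are assumptions chosen for illustration; the sketch is not OmpSs code and does not reproduce the paper's exact HGuided policy.

```python
# Simplified guided-style co-execution partitioner: chunks shrink as work
# drains and are scaled by relative device speed. Illustrative only.
import threading

class GuidedPartitioner:
    def __init__(self, total_items, device_speeds, min_chunk=64):
        self.remaining = total_items
        self.speeds = device_speeds               # e.g. {"cpu": 1.0, "gpu": 4.0}
        self.total_speed = sum(device_speeds.values())
        self.min_chunk = min_chunk
        self.offset = 0
        self.lock = threading.Lock()

    def next_chunk(self, device):
        """Return (start, size) for `device`, or None when the range is exhausted."""
        with self.lock:
            if self.remaining == 0:
                return None
            share = self.speeds[device] / self.total_speed
            size = max(self.min_chunk, int(self.remaining * share / 2))
            size = min(size, self.remaining)
            start = self.offset
            self.offset += size
            self.remaining -= size
            return start, size

if __name__ == "__main__":
    part = GuidedPartitioner(1_000_000, {"cpu": 1.0, "gpu": 4.0})
    print(part.next_chunk("gpu"), part.next_chunk("cpu"), part.next_chunk("gpu"))
```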
{"title":"Extending OmpSs for OpenCL Kernel Co-Execution in Heterogeneous Systems","authors":"Borja Pérez, Esteban Stafford, J. L. Bosque, R. Beivide, Sergi Mateo, Xavier Teruel, X. Martorell, E. Ayguadé","doi":"10.1109/SBAC-PAD.2017.8","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.8","url":null,"abstract":"Heterogeneous systems have a very high potential performance but present difficulties in their programming. OmpSs is a well known framework for task based parallel applications, which is an interesting tool to simplify the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To overcome this limitation, this paper presents an extension of the OmpSs framework that solves two main objectives: the automatic division of datasets among several devices and the management of their memory address spaces. To adapt to different kinds of applications, the data division can be performed by the novel HGuided load balancing algorithm or by the well known Static and Dynamic. All this is accomplished with negligible impact on the programming. Experimental results reveal that there is always one load balancing algorithm that improves the performance and energy consumption of the system.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129654589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Data Compression to Mitigate Aging in GPU Register Files
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.15
F. Candel, A. Valero, S. Petit, D. S. Gracia, J. Sahuquillo
Nowadays, GPUs sit at the forefront of high-performance computing thanks to their massive computational capabilities. Internally, thousands of functional units, architected to be fed by large register files, fuel such performance. At nanometer technologies, the SRAM cells that implement register files suffer from the Negative Bias Temperature Instability (NBTI) effect, which degrades the transistor threshold voltage Vth and, in turn, can make cells unreliable when they hold the same logic value for long periods of time. Fortunately, the GPU single-thread multiple-data execution model writes data in recognizable patterns. This work proposes mechanisms to detect those patterns, and to compress and shuffle the data, so that compressed register file entries can be safely powered off, mitigating NBTI aging. Experimental results show that a conventional GPU register file experiences the worst case for NBTI, since it maintains cells with a single logic value during the entire application execution (i.e., 100% duty cycle distributions for both 0 and 1). On average, the proposal reduces these distributions by 61% and 72%, respectively, which translates into Vth degradation savings of 57% and 64%, respectively.
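A simple way to picture the "recognizable patterns" the abstract refers to is to classify a warp's register entry by its lane values: if all lanes hold the same value, or all values fit in a narrow bit-width, part of the entry's cells could be powered off. The sketch below is a hypothetical software simplification of that classification, not the hardware mechanism itself.

```python
# Illustrative classification of a warp register entry's compressibility:
# uniform (all 32 lanes equal) or narrow (values fit in half a word).
# Hypothetical simplification; assumes non-negative integer lane values.
def classify_register_entry(lane_values, word_bits=32):
    """Return (pattern, bits_needed_per_lane) for a list of 32 lane values."""
    if all(v == lane_values[0] for v in lane_values):
        return "uniform", word_bits              # one copy suffices for the warp
    bits = max(v.bit_length() for v in lane_values)
    if bits <= word_bits // 2:
        return "narrow", bits                    # upper half of every cell can sleep
    return "uncompressible", word_bits

if __name__ == "__main__":
    print(classify_register_entry([7] * 32))     # ('uniform', 32)
    print(classify_register_entry(list(range(32))))  # threadIdx-like ramp: ('narrow', 5)
```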
{"title":"Exploiting Data Compression to Mitigate Aging in GPU Register Files","authors":"F. Candel, A. Valero, S. Petit, D. S. Gracia, J. Sahuquillo","doi":"10.1109/SBAC-PAD.2017.15","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.15","url":null,"abstract":"Nowadays, GPUs sit at the forefront of highperformance computing thanks to their massive computational capabilities. Internally, thousands of functional units, architected to be fed by large register files, fuel such a performance.At nanometer technologies, the SRAM cells that implement register files suffer the Negative Bias Temperature Instability (NBTI) effect, which degrades the transistor threshold voltage Vth and, in turn, can make cells faulty unreliable when they hold the same logic value for long periods of time.Fortunately, the GPU single-thread multiple-data execution model writes data in recognizable patterns. This work proposes mechanisms to detect those patterns, and to compress and shuffle the data, so that compressed register file entries can be safely powered off, mitigating NBTI aging.Experimental results show that a conventional GPU register file experiences the worst case for NBTI, since maintains cells with a single logic value during the entire application execution (i.e., a 100% 0 and 1 duty cycle distributions). On average, the proposal reduces these distributions by 61% and 72%, respectively, which translates into Vth degradation savings by 57% and 64%, respectively.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130829937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Addressing Energy Challenges in Filter Caches
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.14
Ricardo Alves, Nikos Nikoleris, S. Kaxiras, D. Black-Schaffer
Filter caches and way-predictors are common approaches to improving the efficiency and/or performance of first-level caches. Filter caches use a small L0 to provide more efficient and faster access to a small subset of the data, and work well for programs with high locality. Way-predictors improve efficiency by accessing only the predicted way, which alleviates the need to read all ways in parallel without increasing latency, but hurts performance due to mispredictions. In this work, we examine how SRAM layout constraints (h-trees and data mapping inside the cache) affect way-predictors and filter caches. We show that accessing the smaller L0 array can be significantly more energy efficient than attempting to read fewer ways from a larger L1 cache, and that the main source of energy inefficiency in filter caches comes from L0 and L1 misses. We propose a filter cache optimization that shares the tag array between the L0 and the L1, which incurs the overhead of reading the larger tag array on every access but in return allows us to directly access the correct L1 way on each L0 miss. This optimization does not add any extra latency and, counter-intuitively, improves the filter cache's overall energy efficiency beyond that of the way-predictor. By combining the low-power benefits of a physically smaller L0 with the reduction in miss energy obtained by reading L1 tags upfront in parallel with L0 data, the optimized filter cache reduces dynamic cache energy by 26% compared to a traditional filter cache while providing the same performance advantage. Compared to a way-predictor, the optimized cache improves performance by 6% and energy by 2%.
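The access flow of the proposed shared-tag organization can be sketched at a very high level: the L1 tag array is probed in parallel with the small L0 data array, so an L0 miss can go straight to the one correct L1 way instead of reading all ways. The structures and the energy figures below are illustrative assumptions, not the paper's model.

```python
# Minimal functional sketch of a shared-tag filter cache access path.
# Energy costs are arbitrary example units, not the paper's numbers.
class SharedTagFilterCache:
    def __init__(self):
        self.l0_data = {}                        # tiny filter cache: addr -> value
        self.l1_ways = [{} for _ in range(4)]    # 4-way L1: addr -> value per way
        self.energy = 0.0

    def access(self, addr):
        way = self._l1_tag_lookup(addr)          # read in parallel with L0 data
        self.energy += 0.4                       # shared L1 tag array read
        if addr in self.l0_data:
            self.energy += 0.2                   # small L0 data read
            return self.l0_data[addr]
        if way is not None:
            self.energy += 1.0                   # single L1 data way, no extra latency
            value = self.l1_ways[way][addr]
            self.l0_data[addr] = value           # fill the filter cache
            return value
        return None                              # L1 miss: go to the next level

    def _l1_tag_lookup(self, addr):
        for w, way in enumerate(self.l1_ways):
            if addr in way:
                return w
        return None

if __name__ == "__main__":
    c = SharedTagFilterCache()
    c.l1_ways[2][0x40] = 123                     # pretend the block is resident in way 2
    print(c.access(0x40), c.access(0x40), round(c.energy, 1))
```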
{"title":"Addressing Energy Challenges in Filter Caches","authors":"Ricardo Alves, Nikos Nikoleris, S. Kaxiras, D. Black-Schaffer","doi":"10.1109/SBAC-PAD.2017.14","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.14","url":null,"abstract":"Filter caches and way-predictors are common approaches to improve the efficiency and/or performance of first-level caches. Filter caches use a small L0 to provide more efficient and faster access to a small subset of the data, and work well for programs with high locality. Way-predictors improve efficiency by accessing only the way predicted, which alleviates the need to read all ways in parallel without increasing latency, but hurts performance due to mispredictions.In this work we examine how SRAM layout constraints (h-trees and data mapping inside the cache) affect way-predictors and filter caches. We show that accessing the smaller L0 array can be significantly more energy efficient than attempting to read fewer ways from a larger L1 cache; and that the main source of energy inefficiency in filter caches comes from L0 and L1 misses. We propose a filter cache optimization that shares the tag array between the L0 and the L1, which incurs the overhead of reading the larger tag array on every access, but in return allows us to directly access the correct L1 way on each L0 miss. This optimization does not add any extra latency and counter-intuitively, improves the filter caches overall energy efficiency beyond that of the way-predictor.By combining the low power benefits of a physically smaller L0 with the reduction in miss energy by reading L1 tags upfront in parallel with L0 data, we show that the optimized filter cache reduces the dynamic cache energy compared to a traditional filter cache by 26% while providing the same performance advantage. Compared to a way-predictor, the optimized cache improves performance by 6% and energy by 2%.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130163401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring Heterogeneous Mobile Architectures with a High-Level Programming Model
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.11
W. D. C. Moreira, Guilherme Andrade, Pedro Caldeira, Renato Utsch Goncalves, R. Ferreira, L. Rocha, Renan de Carvalho Sousa, Millas Nasser Ramsses Avelar
The development of new technologies is ushering in a new era characterized, among other factors, by the rise of sophisticated mobile devices containing both CPUs and GPUs. This emerging scenario of heterogeneous mobile architectures brings challenging issues regarding the use of the available computing resources, mainly related to the intrinsic complexity of coordinating these processors in order to increase application performance. In this sense, this paper presents a high-level programming model to implement parallel patterns that can be executed in a coordinated way by heterogeneous mobile architectures. A comparative analysis of performance and programming complexity is presented, contrasting code generated automatically from the proposed programming model with low-level, manually optimized implementations.
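As a rough idea of what a high-level parallel-pattern interface looks like, the hypothetical sketch below exposes a map pattern whose work is split between two backends (plain thread pools standing in for the CPU and GPU of a mobile SoC). The function name, split parameter, and backends are assumptions; this is not the programming model proposed in the paper.

```python
# Hypothetical parallel-pattern API sketch: a map pattern split across two
# backends, here modeled with two thread pools.
from concurrent.futures import ThreadPoolExecutor

def pattern_map(func, data, split=0.5):
    """Apply `func` to `data`, giving a `split` fraction to the first backend."""
    cut = int(len(data) * split)
    with ThreadPoolExecutor(max_workers=2) as cpu, ThreadPoolExecutor(max_workers=2) as gpu:
        left = cpu.map(func, data[:cut])         # "CPU" portion
        right = gpu.map(func, data[cut:])        # "GPU" portion
        return list(left) + list(right)

if __name__ == "__main__":
    print(pattern_map(lambda x: x * x, list(range(10)), split=0.4))
```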
{"title":"Exploring Heterogeneous Mobile Architectures with a High-Level Programming Model","authors":"W. D. C. Moreira, Guilherme Andrade, Pedro Caldeira, Renato Utsch Goncalves, R. Ferreira, L. Rocha, Renan de Carvalho Sousa, Millas Nasser Ramsses Avelar","doi":"10.1109/SBAC-PAD.2017.11","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.11","url":null,"abstract":"The development of new technologies is setting a new era characterized, among other factors, by the rise of sophisticated mobile devices containing CPUs and GPUs. This emerging scenario of heterogeneous mobile architectures brings challenging issues regarding the use of the available computing resources. Such issues are mainly related to the intrinsic complexity of coordinating these processors in order to increase application performance. In this sense, this paper presents a high-level programming model to implement parallel patterns that can be executed in a coordinate way by heterogeneous mobile architectures. A comparative analysis of performance and programming complexity is presented, contrasting code generated automatically from the proposed programming model with low-level manually-optimized implementations.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126424520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Multimedia Similarity Search with Response Time-Aware Parallelism and Task Granularity Auto-Tuning
Pub Date: 2017-10-01, DOI: 10.1109/SBAC-PAD.2017.27
Guilherme Andrade, George Teodoro, R. Ferreira
This paper presents an efficient parallel implementation of Product Quantization based approximate nearest neighbor multimedia similarity search indexing (PQANNS). The parallel PQANNS efficiently answers nearest neighbor queries by exploiting the ability of the quantization approach to reduce data dimensionality (and memory demand) and by leveraging parallelism to speed up the search capabilities of the application. Our solution is also optimized to minimize query response times under scenarios with fluctuating query rates (load), as observed in online services. To achieve this goal, we have developed strategies to dynamically select the parallelism configuration and task granularity that minimize query response times during execution. The proposed strategies (ADAPT and ADAPT+G) were thoroughly evaluated and shown, for instance, to reduce query response times by 6.4x compared to the best static configuration of parallelism and task granularity.
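For context on why product quantization reduces dimensionality and memory demand, the small numpy sketch below encodes database vectors as one code per subspace and answers a query through per-subspace lookup tables (asymmetric distance computation). The codebooks are random for illustration; a real PQ index trains them, and the paper's contribution is the parallel, load-adaptive system built on top of this kind of search, not this sketch.

```python
# Small numpy sketch of product-quantization encoding and asymmetric-distance
# search, the building block behind PQANNS-style indexing. Illustrative only.
import numpy as np

def pq_encode(vectors, codebooks):
    """codebooks: (n_sub, n_centroids, sub_dim). Returns (n_vectors, n_sub) codes."""
    n_sub, _, sub_dim = codebooks.shape
    codes = np.empty((len(vectors), n_sub), dtype=np.int32)
    for s in range(n_sub):
        sub = vectors[:, s * sub_dim:(s + 1) * sub_dim]
        d = ((sub[:, None, :] - codebooks[s][None, :, :]) ** 2).sum(-1)
        codes[:, s] = d.argmin(1)                 # nearest centroid per subspace
    return codes

def pq_search(query, codes, codebooks, k=5):
    """Build per-subspace lookup tables for the query, then sum distances by code."""
    n_sub, _, sub_dim = codebooks.shape
    tables = np.stack([((codebooks[s] - query[s * sub_dim:(s + 1) * sub_dim]) ** 2).sum(-1)
                       for s in range(n_sub)])    # (n_sub, n_centroids)
    dists = tables[np.arange(n_sub), codes].sum(1)  # (n_vectors,)
    return np.argsort(dists)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(1000, 32)).astype(np.float32)
    books = rng.normal(size=(4, 16, 8)).astype(np.float32)  # 4 subspaces, 16 centroids
    codes = pq_encode(base, books)
    print(pq_search(base[42], codes, books))      # index 42 should rank near the top
```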
{"title":"Online Multimedia Similarity Search with Response Time-Aware Parallelism and Task Granularity Auto-Tuning","authors":"Guilherme Andrade, George Teodoro, R. Ferreira","doi":"10.1109/SBAC-PAD.2017.27","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2017.27","url":null,"abstract":"This paper presents an efficient parallel implementation of the Product Quantization based approximate nearest neighbor multimedia similarity search indexing (PQANNS). The parallel PQANNS efficiently answers nearest neighbor queries by exploiting the ability of the quantization approach to reduce the data dimensionality (and memory demand) and by leveraging parallelism to speed up the search capabilities of the application. Our solution is also optimized to minimize query response times under scenarios with fluctuating query rates (load) as observed in online services. To achieve this goal, we have developed strategies to dynamically select the parallelism configuration and task granularity that minimizes the query response times during the execution. The proposed strategies (ADAPT and ADAPT+G) were thoroughly evaluated and have shown, for instance, to reduce the query response times in 6.4x as compared to the best static configuration of parallelism and task granularity.","PeriodicalId":187204,"journal":{"name":"2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130969413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}