GraphCL: A Framework for Execution of Data-Flow Graphs on Multi-Device Platforms
Konrad Moren, D. Göhringer
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00026
Published in: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)
This article introduces GraphCL, an automated system for seamlessly mapping multi-kernel applications to multiple computing devices. GraphCL consists of a C++ API and a runtime that abstract and simplify the execution of multi-kernel applications on heterogeneous platforms across multiple devices. The GraphCL approach has three steps. First, the application designer provides a kernel graph. Second, GraphCL computes the execution schedule. Finally, the runtime uses that schedule to enqueue work in parallel on all system processors. GraphCL takes kernel dependencies and processor performance differences into account while computing the schedule, and by deciding on the schedule it transparently manages the order of execution and the data transfers for each processor. On two asymmetric workstations, GraphCL achieves an average speedup of 1.8x over the fastest single device. For the set of multi-kernel benchmarks, GraphCL also achieves an average 24.5% energy reduction compared to the lazy partitioning heuristic, which uses all system processors without considering their power usage.
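The scheduling step described above can be sketched as plain list scheduling over a kernel DAG. This is a generic illustration, not the actual GraphCL API: the function name, inputs, and the earliest-finish-time policy are assumptions made for the sketch.

```python
from collections import deque

def schedule_kernels(kernels, deps, work, speed):
    """kernels: names in any order; deps: {kernel: [predecessors]};
    work: {kernel: abstract cost}; speed: {device: relative throughput}.
    Returns (placement per device, finish time per kernel)."""
    indeg = {k: len(deps.get(k, [])) for k in kernels}
    succ = {k: [] for k in kernels}
    for k, preds in deps.items():
        for p in preds:
            succ[p].append(k)
    ready = deque(k for k in kernels if indeg[k] == 0)
    avail = {d: 0.0 for d in speed}            # when each device frees up
    done, placement = {}, {d: [] for d in speed}
    while ready:
        k = ready.popleft()
        dep_done = max((done[p] for p in deps.get(k, [])), default=0.0)
        # earliest-finish device: max(device free, deps done) + scaled cost
        best = min(speed, key=lambda d: max(avail[d], dep_done) + work[k] / speed[d])
        done[k] = max(avail[best], dep_done) + work[k] / speed[best]
        avail[best] = done[k]
        placement[best].append(k)
        for s in succ[k]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return placement, done
```

For a diamond-free DAG a→{b, c} on a twice-faster "gpu" plus a "cpu", the policy keeps the chain on the faster device until the slower one offers an equal finish time.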
Advancing Database System Operators with Near-Data Processing
S. Santos, Francis B. Moreira, T. R. Kepe, M. Alves
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00028
As applications become more data-intensive, issues like the von Neumann bottleneck and the memory wall become more apparent, since data movement is the main source of inefficiency in computer systems. To mitigate this issue, Near-Data Processing (NDP) moves computation from the processor to the memory, reducing the data movement required by many data-intensive workloads. In this paper, we look at database query operators, common targets of NDP research, as database systems often need to deal with large amounts of data. We investigate the migration of the most time-consuming database operators to the Vector-In-Memory Architecture (VIMA), a novel 3D-stacked memory-based NDP architecture. We consider the selection, projection, and bloom join query operators, commonly used by data analytics applications, comparing VIMA to a high-performance x86 baseline. Our results show speedups of up to 8× for selection, 6× for projection, and 16× for join, while consuming up to 99% less energy. To the best of our knowledge, these results outperform the state of the art for these operators on NDP platforms.
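The bloom join operator the authors migrate can be illustrated in miniature: the build side's keys are hashed into a bit array, and probe-side rows that miss any hash are filtered out before the hash-table lookup. This is a toy, CPU-only sketch of the operator's structure, not of VIMA or the paper's implementation; the filter size and hash count are arbitrary.

```python
import hashlib

M, K = 1 << 12, 3                        # filter bits and hash count (arbitrary)

def _hashes(key):
    # derive K independent hash positions from one keyed digest each
    for i in range(K):
        h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
        yield int.from_bytes(h.digest(), "big") % M

def bloom_join(build, probe):
    """build/probe: lists of (key, payload) tuples; returns joined rows."""
    bits = bytearray(M)
    table = {}
    for key, val in build:               # build phase: filter + hash table
        for h in _hashes(key):
            bits[h] = 1
        table.setdefault(key, []).append(val)
    out = []
    for key, val in probe:               # probe phase: cheap filter first
        if all(bits[h] for h in _hashes(key)) and key in table:
            out.extend((key, b, val) for b in table[key])
    return out
```

The early filter is what makes the operator attractive near memory: most non-matching probe rows are rejected with a few bit reads instead of a hash-table access.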
RISCLESS: A Reinforcement Learning Strategy to Guarantee SLA on Cloud Ephemeral and Stable Resources
SidAhmed Yalles, Mohamed Handaoui, Jean-Emile Dartois, Olivier Barais, Laurent d'Orazio, Jalil Boukhobza
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00021
In this paper, we propose RISCLESS, a reinforcement learning strategy to exploit unused Cloud resources. Our approach uses a small proportion of stable on-demand resources alongside the ephemeral ones in order to guarantee customers' SLAs and reduce overall costs. The approach decides when, and how many, stable resources to allocate in order to fulfill customers' demands. RISCLESS improved Cloud Providers' (CPs') profits by an average of 15.9% compared to past strategies. It also reduced SLA violation time by 36.7% while increasing the amount of used ephemeral resources by 19.5%.
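The core decision described above, how many stable units to add on top of unreliable ephemeral capacity, can be caricatured with a tiny tabular learner. Everything here (the state space, the reward weights, the one-step bandit-style update) is invented for illustration and is not the RISCLESS algorithm.

```python
import random

random.seed(0)
ACTIONS = [0, 1, 2, 3]                   # stable units to allocate
Q = {(d, a): 0.0 for d in range(4) for a in ACTIONS}
alpha, eps = 0.3, 0.1                    # learning rate, exploration rate

def step(demand, stable):
    ephemeral = random.randint(0, 2)     # reclaimable, hence unreliable
    violation = max(0, demand - stable - ephemeral)
    return -3.0 * violation - 1.0 * stable   # SLA penalty vs. stable cost

for _ in range(5000):
    d = random.randint(0, 3)             # observed demand level
    a = random.choice(ACTIONS) if random.random() < eps else \
        max(ACTIONS, key=lambda x: Q[(d, x)])
    r = step(d, a)
    Q[(d, a)] += alpha * (r - Q[(d, a)])  # one-step bandit update

policy = {d: max(ACTIONS, key=lambda x: Q[(d, x)]) for d in range(4)}
```

The learned policy tends toward allocating nothing when demand is zero (stable units only cost money there) and more as the SLA penalty starts to dominate.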
Analyzing the performance of hierarchical collective algorithms on ARM-based multicore clusters
G. Utrera, Marisa Gil, X. Martorell
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00043
MPI is the de facto standard communication library for parallel applications on distributed-memory architectures. The performance of collective operations is critical in HPC applications, as they can become the execution bottleneck. The advent of larger node sizes in multicore clusters has motivated the exploration of hierarchical collective algorithms that are aware of process placement in the cluster and of the memory hierarchy. This work analyzes and compares several hierarchical collective algorithms from the literature that are not part of the current MPI standard. We implement the algorithms on top of OpenMPI, using the shared-memory facility provided by MPI-3 at the intra-node level, and evaluate them on ARM-based multicore clusters. From our results, we identify aspects that impact the performance and applicability of the different algorithms. Finally, we propose a model that helps us analyze the scalability of the algorithms.
SECPAT: Security Patterns for Resilient Automotive E/E Architectures
Christian Plappert, Florian Fenzl, R. Rieke, I. Matteucci, Gianpiero Costantino, Marco De Vincenzi
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00047
Automated driving requires increasing networking of vehicles, which in turn broadens their attack surface. In this paper, we describe several security design patterns that target critical steps in automotive attack chains and mitigate their consequences. These patterns enable the detection of firmware anomalies at boot time, detect anomalies in in-vehicle communication, prevent unauthorized control units from successfully transmitting messages, offer a way of transmitting security-related events within a vehicle network and reporting them to units external to the vehicle, and ensure that in-vehicle communication is secure. Using the example of a future high-level Electrical/Electronic (E/E) architecture, we also describe how these security design patterns can be used to become aware of the current attack situation and react to it.
Active learning approach for inappropriate information classification in social networks
D. Levshun, O. Tushkanova, A. Chechulin
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00050
This paper describes an original classification approach with active learning for inappropriate information detection, and its application to text posts from the VKontakte social network. The novelty of the approach lies in the constantly growing dataset: the classifiers are retrained during the operator's work. The approach works with texts of any size and content and is applicable to Russian social networks. The research contribution lies in the original approach for inappropriate information detection, while the practical significance lies in automating routine tasks to reduce the burden on specialists in the area of protection from information. The experimental evaluation of the approach focuses on its iterative retraining part. For the experiment, text posts on different topics were collected from the VKontakte social network and labeled. We then evaluated the F-measure and ROC-AUC metrics for classifiers trained on random subsamples of different sizes and topics. Finally, we discuss the advantages and disadvantages of the approach, as well as directions for future work.
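A pool-based active-learning loop of the kind described — label the most uncertain item, retrain, repeat — can be sketched generically. The nearest-centroid classifier, the uncertainty measure, and all names are stand-ins rather than the authors' pipeline; the oracle function plays the human operator, and vectors stand in for featurized posts.

```python
def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def active_learn(pool, oracle, seed_ids, rounds):
    """pool: {id: vector}; oracle: id -> 0/1 label (the operator).
    Each round labels the most uncertain point and retrains."""
    labeled = {i: oracle(i) for i in seed_ids}
    for _ in range(rounds):
        cents = {y: centroid([pool[i] for i, l in labeled.items() if l == y])
                 for y in (0, 1)}
        # most uncertain = the two class distances are nearly equal
        cand = min((i for i in pool if i not in labeled),
                   key=lambda i: abs(dist(pool[i], cents[0])
                                     - dist(pool[i], cents[1])))
        labeled[cand] = oracle(cand)         # ask the operator
    cents = {y: centroid([pool[i] for i, l in labeled.items() if l == y])
             for y in (0, 1)}
    return (lambda v: min((0, 1), key=lambda y: dist(v, cents[y]))), labeled
```

The dataset thus grows exactly where the current model is least sure, which is the property the abstract's "constantly growing dataset" relies on.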
Some Experiments on High Performance Anomaly Detection
M. Ianni, E. Masciari
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00042
The rise in cyber crime observed in recent years calls for more efficient and effective data exploration and analysis tools. In this respect, the need to support advanced analytics on activity logs and real-time data is driving data scientists' interest in designing and implementing scalable cyber security solutions. However, when data science algorithms are applied to huge amounts of data, their fully scalable deployment faces a number of technical challenges that grow with the complexity of the algorithms involved and of the task to be tackled. Thus, algorithms that were originally designed for classical scenarios need to be redesigned in order to be used effectively for cyber security purposes. In this paper, we explore these problems and then propose a solution that has proven to be very effective in identifying malicious activities.
A Heuristic for Constructing Minimum Average Stretch Spanning Tree Using Betweenness Centrality
Sinchan Sengupta, Sathya Peri, Vipul Aggarwal, Ambey Kumari Gupta
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00019
A parameter crucial for preserving the underlying shortest-path information in spanning tree construction is called stretch. It is the ratio of the distance between two nodes x and y in the spanning tree to the shortest distance between x and y in the graph. In this paper, we present a heuristic, LSTree, that constructs a Minimum Average Stretch Spanning Tree of an n-node undirected and unweighted graph in $\mathcal{O}(n)$ rounds of the CONGEST model. We stress that the LSTree protocol is the first use of betweenness centrality in the construction of low-stretch trees. The heuristic outperforms the current benchmark algorithm of Alon et al., as well as other spanning tree construction techniques presently known, when tested against synthetic as well as real-world graph inputs.
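The stretch definition in the abstract translates directly into code: tree distance divided by graph distance, averaged over all node pairs. Plain BFS suffices for the distances because the graph is unweighted.

```python
from collections import deque
from itertools import combinations

def bfs_dist(adj, src):
    # unweighted single-source shortest paths
    d = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

def average_stretch(graph_adj, tree_adj):
    nodes = list(graph_adj)
    g = {u: bfs_dist(graph_adj, u) for u in nodes}
    t = {u: bfs_dist(tree_adj, u) for u in nodes}
    pairs = list(combinations(nodes, 2))
    return sum(t[u][v] / g[u][v] for u, v in pairs) / len(pairs)
```

For a 4-cycle whose spanning tree drops one edge, only the pair across the dropped edge is stretched (tree distance 3 versus graph distance 1), so the average stretch is (5 + 3) / 6 = 4/3.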
An approach to formal description of the user notification scenarios in privacy policies
Mikhail Kuznetsov, E. Novikova, Igor Kotenko
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00049
Nowadays, the collection and usage of users' personal data have become extremely common. Users actively provide their personal data to customize or improve the quality of various digital services. Privacy policies are the only official way to inform data owners of how their personal data are processed. There are different approaches to increasing the transparency of privacy policies and user agreements. This paper discusses ontology-based approaches and proposes formal descriptions of data processors' obligations relating to policy changes and to user notification in the event of a data breach.
DTM-NUCA: Dynamic Texture Mapping-NUCA for Energy-Efficient Graphics Rendering
David Corbalán-Navarro, Juan L. Aragón, Joan-Manuel Parcerisa, Antonio González
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00030
Modern mobile GPUs integrate an increasing number of shader cores to speed up the execution of graphics workloads. Each core integrates a private Texture Cache, used to apply texturing effects to objects, which is backed by a shared L2 cache. However, as in any other memory hierarchy, this organization replicates data in the upper levels (i.e., the private Texture Caches) to allow faster accesses, at the expense of reducing their overall effective capacity. For example, in a mobile GPU with four shader cores, about 84.6% of the requested texture blocks are replicated in at least one of the other private Texture Caches.
This paper proposes a novel dynamically-mapped Non-Uniform Cache Architecture (NUCA) organization for the private Texture Caches of a mobile GPU, aimed at increasing their effective overall capacity and decreasing the overall access latency by attacking data replication. A block missing in the local Texture Cache may be serviced by a remote one at a cost smaller than a round trip to the shared L2. The proposed Dynamic Texture Mapping-NUCA (DTM-NUCA) features a lightweight mapping table, called the Affinity Table, whose size is independent of the L2 cache size, unlike a traditional NUCA organization. The best owner for a given set of blocks is dynamically determined and stored in the Affinity Table to maximize local accesses. The mechanism also allows a certain amount of replication to favor local accesses where appropriate, without hurting performance, since the capacity loss from the allowed replication is small. DTM-NUCA comes in two flavors: one with a centralized Affinity Table and another with a distributed one. Experimental results show, first, that L2 pressure is effectively reduced, eliminating 41.8% of L2 accesses on average. As for the average latency, DTM-NUCA does a very effective job of maximizing local over remote accesses, achieving 73.8% local accesses on average. As a consequence, our novel DTM-NUCA organization obtains an average speedup of 16.9% and overall energy savings of 7.6% over a conventional organization.
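The lookup order the abstract describes — local Texture Cache first, then the Affinity-Table-designated remote cache, then the shared L2 — can be modelled as a toy. The latencies, the 16-block affinity-set granularity, and the claim-ownership-on-miss policy are invented for illustration and are not taken from the paper.

```python
LOCAL, REMOTE, L2 = 1, 4, 10          # hypothetical access costs

class ToyDTMNuca:
    def __init__(self, n_cores):
        self.caches = [set() for _ in range(n_cores)]
        self.affinity = {}            # affinity-set id -> owning core

    def access(self, core, block):
        """Returns (where the block was found, cost), caching on the way."""
        if block in self.caches[core]:
            return "local", LOCAL
        owner = self.affinity.get(block // 16)   # 16-block affinity sets
        if owner is not None and block in self.caches[owner]:
            return "remote", REMOTE               # cheaper than an L2 trip
        # miss everywhere: fetch from L2; first requester claims the set
        self.affinity.setdefault(block // 16, core)
        self.caches[self.affinity[block // 16]].add(block)
        return "l2", L2
```

Even in this toy, a second core touching an already-fetched block pays the remote cost (4) instead of the L2 cost (10), which is the latency saving the NUCA organization targets.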