GraphCL: A Framework for Execution of Data-Flow Graphs on Multi-Device Platforms
Konrad Moren, D. Göhringer
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00026
Published in: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)
This article introduces GraphCL, an automated system for seamlessly mapping multi-kernel applications to multiple computing devices. GraphCL consists of a C++ API and a runtime that abstract and simplify the execution of multi-kernel applications on heterogeneous platforms across multiple devices. The GraphCL approach has three steps. First, the application designer provides a kernel graph. Second, GraphCL computes the execution schedule. Finally, the runtime uses that schedule to enqueue work in parallel on all system processors. GraphCL takes kernel dependencies and processor performance differences into account while computing the schedule, and by deciding on the schedule it transparently manages the order of execution and the data transfers for each processor. On two asymmetric workstations, GraphCL achieves an average speedup of 1.8x over the fastest single device. For the set of multi-kernel benchmarks, GraphCL also achieves an average 24.5% energy reduction compared to the lazy partitioning heuristic, which uses all system processors without considering their power usage.
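The scheduling step described above can be sketched as plain list scheduling over a kernel DAG. This is a generic illustration, not the actual GraphCL API: the function name, inputs, and the earliest-finish-time policy are assumptions made for the sketch.

```python
from collections import deque

def schedule_kernels(kernels, deps, work, speed):
    """kernels: names in any order; deps: {kernel: [predecessors]};
    work: {kernel: abstract cost}; speed: {device: relative throughput}.
    Returns (placement per device, finish time per kernel)."""
    indeg = {k: len(deps.get(k, [])) for k in kernels}
    succ = {k: [] for k in kernels}
    for k, preds in deps.items():
        for p in preds:
            succ[p].append(k)
    ready = deque(k for k in kernels if indeg[k] == 0)
    avail = {d: 0.0 for d in speed}            # when each device frees up
    done, placement = {}, {d: [] for d in speed}
    while ready:
        k = ready.popleft()
        dep_done = max((done[p] for p in deps.get(k, [])), default=0.0)
        # earliest-finish device: max(device free, deps done) + scaled cost
        best = min(speed, key=lambda d: max(avail[d], dep_done) + work[k] / speed[d])
        done[k] = max(avail[best], dep_done) + work[k] / speed[best]
        avail[best] = done[k]
        placement[best].append(k)
        for s in succ[k]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return placement, done
```

For a diamond-free DAG a→{b, c} on a twice-faster "gpu" plus a "cpu", the policy keeps the chain on the faster device until the slower one offers an equal finish time.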
Advancing Database System Operators with Near-Data Processing
S. Santos, Francis B. Moreira, T. R. Kepe, M. Alves
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00028
As applications become more data-intensive, issues like the von Neumann bottleneck and the memory wall become more apparent, since data movement is the main source of inefficiency in computer systems. To mitigate this issue, Near-Data Processing (NDP) moves computation from the processor to the memory, reducing the data movement required by many data-intensive workloads. In this paper, we look at database query operators, common targets of NDP research, as database systems often need to deal with large amounts of data. We investigate the migration of the most time-consuming database operators to the Vector-In-Memory Architecture (VIMA), a novel 3D-stacked memory-based NDP architecture. We consider the selection, projection, and bloom join query operators, commonly used by data analytics applications, comparing VIMA to a high-performance x86 baseline. Our results show speedups of up to 8× for selection, 6× for projection, and 16× for join, while consuming up to 99% less energy. To the best of our knowledge, these results outperform the state of the art for these operators on NDP platforms.
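The bloom join operator the authors migrate can be illustrated in miniature: the build side's keys are hashed into a bit array, and probe-side rows that miss any hash are filtered out before the hash-table lookup. This is a toy, CPU-only sketch of the operator's structure, not of VIMA or the paper's implementation; the filter size and hash count are arbitrary.

```python
import hashlib

M, K = 1 << 12, 3                        # filter bits and hash count (arbitrary)

def _hashes(key):
    # derive K independent hash positions from one keyed digest each
    for i in range(K):
        h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
        yield int.from_bytes(h.digest(), "big") % M

def bloom_join(build, probe):
    """build/probe: lists of (key, payload) tuples; returns joined rows."""
    bits = bytearray(M)
    table = {}
    for key, val in build:               # build phase: filter + hash table
        for h in _hashes(key):
            bits[h] = 1
        table.setdefault(key, []).append(val)
    out = []
    for key, val in probe:               # probe phase: cheap filter first
        if all(bits[h] for h in _hashes(key)) and key in table:
            out.extend((key, b, val) for b in table[key])
    return out
```

The early filter is what makes the operator attractive near memory: most non-matching probe rows are rejected with a few bit reads instead of a hash-table access.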
RISCLESS: A Reinforcement Learning Strategy to Guarantee SLA on Cloud Ephemeral and Stable Resources
SidAhmed Yalles, Mohamed Handaoui, Jean-Emile Dartois, Olivier Barais, Laurent d'Orazio, Jalil Boukhobza
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00021
In this paper, we propose RISCLESS, a reinforcement learning strategy to exploit unused Cloud resources. Our approach uses a small proportion of stable on-demand resources alongside the ephemeral ones in order to guarantee customers' SLAs and reduce overall costs. The approach decides when, and how many, stable resources to allocate in order to fulfill customers' demands. RISCLESS improved Cloud Providers' (CPs') profits by an average of 15.9% compared to past strategies. It also reduced SLA violation time by 36.7% while increasing the amount of used ephemeral resources by 19.5%.
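The core decision described above, how many stable units to add on top of unreliable ephemeral capacity, can be caricatured with a tiny tabular learner. Everything here (the state space, the reward weights, the one-step bandit-style update) is invented for illustration and is not the RISCLESS algorithm.

```python
import random

random.seed(0)
ACTIONS = [0, 1, 2, 3]                   # stable units to allocate
Q = {(d, a): 0.0 for d in range(4) for a in ACTIONS}
alpha, eps = 0.3, 0.1                    # learning rate, exploration rate

def step(demand, stable):
    ephemeral = random.randint(0, 2)     # reclaimable, hence unreliable
    violation = max(0, demand - stable - ephemeral)
    return -3.0 * violation - 1.0 * stable   # SLA penalty vs. stable cost

for _ in range(5000):
    d = random.randint(0, 3)             # observed demand level
    a = random.choice(ACTIONS) if random.random() < eps else \
        max(ACTIONS, key=lambda x: Q[(d, x)])
    r = step(d, a)
    Q[(d, a)] += alpha * (r - Q[(d, a)])  # one-step bandit update

policy = {d: max(ACTIONS, key=lambda x: Q[(d, x)]) for d in range(4)}
```

The learned policy tends toward allocating nothing when demand is zero (stable units only cost money there) and more as the SLA penalty starts to dominate.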
Analyzing the performance of hierarchical collective algorithms on ARM-based multicore clusters
G. Utrera, Marisa Gil, X. Martorell
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00043
MPI is the de facto standard communication library for parallel applications on distributed-memory architectures. The performance of collective operations is critical in HPC applications, as they can become the execution bottleneck. The advent of larger node sizes in multicore clusters has motivated the exploration of hierarchical collective algorithms that are aware of process placement in the cluster and of the memory hierarchy. This work analyzes and compares several hierarchical collective algorithms from the literature that are not part of the current MPI standard. We implement the algorithms on top of OpenMPI, using the shared-memory facility provided by MPI-3 at the intra-node level, and evaluate them on ARM-based multicore clusters. From our results, we identify aspects that impact the performance and applicability of the different algorithms. Finally, we propose a model that helps us analyze the scalability of the algorithms.
SECPAT: Security Patterns for Resilient Automotive E/E Architectures
Christian Plappert, Florian Fenzl, R. Rieke, I. Matteucci, Gianpiero Costantino, Marco De Vincenzi
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00047
Automated driving requires increasing networking of vehicles, which in turn broadens their attack surface. In this paper, we describe several security design patterns that target critical steps in automotive attack chains and mitigate their consequences. These patterns enable the detection of firmware anomalies at boot time, detect anomalies in in-vehicle communication, prevent unauthorized control units from successfully transmitting messages, offer a way of transmitting security-related events within a vehicle network and reporting them to units external to the vehicle, and ensure that in-vehicle communication is secure. Using the example of a future high-level Electrical/Electronic (E/E) architecture, we also describe how these security design patterns can be used to become aware of the current attack situation and react to it.
Active learning approach for inappropriate information classification in social networks
D. Levshun, O. Tushkanova, A. Chechulin
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00050
This paper describes an original classification approach with active learning for inappropriate information detection, and its application to text posts from the VKontakte social network. The novelty of the approach lies in the constantly growing dataset: the classifiers are retrained during the operator's work. The approach works with texts of any size and content and is applicable to Russian social networks. The research contribution lies in the original approach for inappropriate information detection, while the practical significance lies in automating routine tasks to reduce the burden on specialists in the area of protection from information. The experimental evaluation of the approach focuses on its iterative retraining part. For the experiment, text posts on different topics were collected from the VKontakte social network and labeled. We then evaluated the F-measure and ROC-AUC metrics for classifiers trained on random subsamples of different sizes and topics. Finally, we discuss the advantages and disadvantages of the approach, as well as directions for future work.
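A pool-based active-learning loop of the kind described — label the most uncertain item, retrain, repeat — can be sketched generically. The nearest-centroid classifier, the uncertainty measure, and all names are stand-ins rather than the authors' pipeline; the oracle function plays the human operator, and vectors stand in for featurized posts.

```python
def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def active_learn(pool, oracle, seed_ids, rounds):
    """pool: {id: vector}; oracle: id -> 0/1 label (the operator).
    Each round labels the most uncertain point and retrains."""
    labeled = {i: oracle(i) for i in seed_ids}
    for _ in range(rounds):
        cents = {y: centroid([pool[i] for i, l in labeled.items() if l == y])
                 for y in (0, 1)}
        # most uncertain = the two class distances are nearly equal
        cand = min((i for i in pool if i not in labeled),
                   key=lambda i: abs(dist(pool[i], cents[0])
                                     - dist(pool[i], cents[1])))
        labeled[cand] = oracle(cand)         # ask the operator
    cents = {y: centroid([pool[i] for i, l in labeled.items() if l == y])
             for y in (0, 1)}
    return (lambda v: min((0, 1), key=lambda y: dist(v, cents[y]))), labeled
```

The dataset thus grows exactly where the current model is least sure, which is the property the abstract's "constantly growing dataset" relies on.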
Some Experiments on High Performance Anomaly Detection
M. Ianni, E. Masciari
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00042
The rise in cyber crime observed in recent years calls for more efficient and effective data exploration and analysis tools. In this respect, the need to support advanced analytics on activity logs and real-time data is driving data scientists' interest in designing and implementing scalable cyber security solutions. However, when data science algorithms are applied to huge amounts of data, their fully scalable deployment faces a number of technical challenges that grow with the complexity of the algorithms involved and of the task to be tackled. Thus, algorithms that were originally designed for classical scenarios need to be redesigned in order to be used effectively for cyber security purposes. In this paper, we explore these problems and then propose a solution that has proven to be very effective in identifying malicious activities.
A Heuristic for Constructing Minimum Average Stretch Spanning Tree Using Betweenness Centrality
Sinchan Sengupta, Sathya Peri, Vipul Aggarwal, Ambey Kumari Gupta
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00019
A parameter crucial for preserving the underlying shortest-path information in spanning tree construction is called stretch. It is the ratio of the distance between two nodes x and y in the spanning tree to the shortest distance between x and y in the graph. In this paper, we present a heuristic, LSTree, that constructs a Minimum Average Stretch Spanning Tree of an n-node undirected and unweighted graph in $\mathcal{O}(n)$ rounds of the CONGEST model. We stress that the LSTree protocol is the first use of betweenness centrality in the construction of low-stretch trees. The heuristic outperforms the current benchmark algorithm of Alon et al., as well as other spanning tree construction techniques presently known, when tested against synthetic as well as real-world graph inputs.
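The stretch definition in the abstract translates directly into code: tree distance divided by graph distance, averaged over all node pairs. Plain BFS suffices for the distances because the graph is unweighted.

```python
from collections import deque
from itertools import combinations

def bfs_dist(adj, src):
    # unweighted single-source shortest paths
    d = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

def average_stretch(graph_adj, tree_adj):
    nodes = list(graph_adj)
    g = {u: bfs_dist(graph_adj, u) for u in nodes}
    t = {u: bfs_dist(tree_adj, u) for u in nodes}
    pairs = list(combinations(nodes, 2))
    return sum(t[u][v] / g[u][v] for u, v in pairs) / len(pairs)
```

For a 4-cycle whose spanning tree drops one edge, only the pair across the dropped edge is stretched (tree distance 3 versus graph distance 1), so the average stretch is (5 + 3) / 6 = 4/3.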
An approach to formal description of the user notification scenarios in privacy policies
Mikhail Kuznetsov, E. Novikova, Igor Kotenko
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00049
Nowadays, the collection and usage of users' personal data have become extremely common. Users actively provide their personal data to customize or improve the quality of various digital services. Privacy policies are the only official way to inform data owners of how their personal data are processed. There are different approaches to increasing the transparency of privacy policies and user agreements. This paper discusses ontology-based approaches and proposes formal descriptions of data processors' obligations relating to policy changes and to user notification in the event of a data breach.
DTM-NUCA: Dynamic Texture Mapping-NUCA for Energy-Efficient Graphics Rendering
David Corbalán-Navarro, Juan L. Aragón, Joan-Manuel Parcerisa, Antonio González
Pub Date: 2022-03-01 | DOI: 10.1109/pdp55904.2022.00030
Modern mobile GPUs integrate an increasing number of shader cores to speed up the execution of graphics workloads. Each core integrates a private Texture Cache, used to apply texturing effects to objects, which is backed by a shared L2 cache. However, as in any other memory hierarchy, this organization replicates data in the upper levels (i.e., the private Texture Caches) to allow faster accesses, at the expense of reducing their overall effective capacity. For example, in a mobile GPU with four shader cores, about 84.6% of the requested texture blocks are replicated in at least one of the other private Texture Caches.
This paper proposes a novel dynamically-mapped Non-Uniform Cache Architecture (NUCA) organization for the private Texture Caches of a mobile GPU, aimed at increasing their effective overall capacity and decreasing the overall access latency by attacking data replication. A block missing in the local Texture Cache may be serviced by a remote one at a cost smaller than a round trip to the shared L2. The proposed Dynamic Texture Mapping-NUCA (DTM-NUCA) features a lightweight mapping table, called the Affinity Table, whose size is independent of the L2 cache size, unlike a traditional NUCA organization. The best owner for a given set of blocks is dynamically determined and stored in the Affinity Table to maximize local accesses. The mechanism also allows a certain amount of replication to favor local accesses where appropriate, without hurting performance, since the capacity loss from the allowed replication is small. DTM-NUCA comes in two flavors: one with a centralized Affinity Table and another with a distributed one. Experimental results show, first, that L2 pressure is effectively reduced, eliminating 41.8% of L2 accesses on average. As for the average latency, DTM-NUCA does a very effective job of maximizing local over remote accesses, achieving 73.8% local accesses on average. As a consequence, our novel DTM-NUCA organization obtains an average speedup of 16.9% and overall energy savings of 7.6% over a conventional organization.
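The lookup order the abstract describes — local Texture Cache first, then the Affinity-Table-designated remote cache, then the shared L2 — can be modelled as a toy. The latencies, the 16-block affinity-set granularity, and the claim-ownership-on-miss policy are invented for illustration and are not taken from the paper.

```python
LOCAL, REMOTE, L2 = 1, 4, 10          # hypothetical access costs

class ToyDTMNuca:
    def __init__(self, n_cores):
        self.caches = [set() for _ in range(n_cores)]
        self.affinity = {}            # affinity-set id -> owning core

    def access(self, core, block):
        """Returns (where the block was found, cost), caching on the way."""
        if block in self.caches[core]:
            return "local", LOCAL
        owner = self.affinity.get(block // 16)   # 16-block affinity sets
        if owner is not None and block in self.caches[owner]:
            return "remote", REMOTE               # cheaper than an L2 trip
        # miss everywhere: fetch from L2; first requester claims the set
        self.affinity.setdefault(block // 16, core)
        self.caches[self.affinity[block // 16]].add(block)
        return "l2", L2
```

Even in this toy, a second core touching an already-fetched block pays the remote cost (4) instead of the L2 cost (10), which is the latency saving the NUCA organization targets.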