
Latest publications: Proceedings of the ACM International Conference on Computing Frontiers

An architecture for near-data processing systems
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903478
E. Vermij, C. Hagleitner, Leandro Fiorin, R. Jongerius, J. V. Lunteren, K. Bertels
Near-data processing is a promising paradigm to address the bandwidth, latency, and energy limitations in today's computer systems. In this work, we introduce an architecture that enhances a contemporary multi-core CPU with new features to support seamless integration of near-data processing capabilities. Crucial aspects such as coherency, data placement, communication, address translation, and the programming model are discussed. The essential components, as well as a system simulator, are realized in hardware and software. Results for the important Graph500 benchmark show a 1.5x speedup when using the proposed architecture.
Citations: 9
Heterogeneous chip multiprocessor architectures for big data applications
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2908078
H. Homayoun
Emerging big data analytics applications require a significant amount of server computational power. The costs of building and running a computing server to process big data, and the capacity to which we can scale it, are driven in large part by those computational resources. However, big data applications share many characteristics that are fundamentally different from traditional desktop, parallel, and scale-out applications. Big data analytics applications rely heavily on specific deep machine learning and data mining algorithms, and run a complex and deep software stack with various components (e.g. Hadoop, Spark, MPI, HBase, Impala, MySQL, Hive, Shark, Apache, and MongoDB) that are bound together with a runtime software system and interact significantly with I/O and the OS, exhibiting high computational intensity, memory intensity, I/O intensity, and control intensity. Current server designs, based on commodity homogeneous processors, will not be the most efficient in terms of performance/watt for this emerging class of applications. In other domains, heterogeneous architectures have emerged as a promising solution to enhance energy efficiency by allowing each application to run on a core that matches its resource needs more closely than a one-size-fits-all core. A heterogeneous architecture integrates cores with various micro-architectures and accelerators to provide more opportunity for efficient workload mapping. In this work, through methodical investigation of power and performance measurements, and comprehensive system-level characterization, we demonstrate that a heterogeneous architecture combining high-performance big and low-power little cores is required for efficient big data analytics processing, in particular in the presence of accelerators and near real-time performance constraints.
Citations: 6
Does it sound as it claims: a detailed side-channel security analysis of QuadSeal countermeasure
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2911709
Darshana Jayasinghe, S. Bhasin, S. Parameswaran, A. Ignjatović
VLSI systems often rely on embedded cryptographic cores for security when confidentiality and authorization are a must. Such cores are theoretically sound but often vulnerable to physical attacks such as side-channel analysis (SCA). Several countermeasures have previously been proposed to protect these cryptographic cores. QuadSeal was proposed as an algorithmic balancing technique to thwart power analysis attacks on block cipher algorithms. QuadSeal can be implemented in either hardware or software, and it was previously shown on the Advanced Encryption Standard (AES) (referred to as QuadSeal-AES) to be resistant to power analysis attacks (Correlation Power Analysis and Mutual Information Analysis). In this paper, we analyze QuadSeal against power analysis attacks using leakage detection techniques as well as Correlation Power Analysis with success rates. Our results show that QuadSeal has leakages; however, CPA with a success-rate attack was unable to exploit the leakages efficiently.
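For readers unfamiliar with Correlation Power Analysis, the attack ranks key guesses by correlating a hypothetical leakage model against measured power traces. The following is a minimal, self-contained sketch using a toy XOR/Hamming-weight leakage model on simulated traces; it is an illustration of the generic CPA technique, not the QuadSeal or AES setup evaluated in the paper.

```python
import numpy as np

HW = [bin(x).count("1") for x in range(256)]  # Hamming-weight lookup table

def cpa_rank_keys(traces, plaintexts):
    """Rank 256 key-byte guesses by peak (signed) Pearson correlation between
    a hypothetical leakage model, HW(plaintext XOR guess), and each time
    sample of the measured traces. traces: (n_traces, n_samples) array."""
    tc = traces - traces.mean(axis=0)          # center each sample column
    scores = np.zeros(256)
    for k in range(256):
        hyp = np.array([HW[p ^ k] for p in plaintexts], dtype=float)
        hc = hyp - hyp.mean()
        denom = np.sqrt((hc ** 2).sum() * (tc ** 2).sum(axis=0))
        corr = (hc @ tc) / np.where(denom == 0, 1.0, denom)
        # signed peak: with a plain XOR/HW model the complement key would
        # anti-correlate perfectly, so absolute values would create a tie
        scores[k] = corr.max()
    return np.argsort(scores)[::-1]

# toy demonstration: simulate traces whose first sample leaks HW(p ^ secret)
rng = np.random.default_rng(0)
secret = 0x3C
pts = [int(p) for p in rng.integers(0, 256, size=500)]
leak = np.array([HW[p ^ secret] for p in pts], dtype=float)
traces = np.column_stack([leak + rng.normal(0, 0.5, 500),
                          rng.normal(0, 1.0, 500)])
ranked = cpa_rank_keys(traces, pts)  # ranked[0] should recover the secret
```

A real attack would target a nonlinear intermediate (e.g. the AES S-box output) so that wrong guesses decorrelate sharply; the leakage-detection step mentioned in the abstract instead tests for any data-dependent difference in the traces before attempting key recovery.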
Citations: 1
Area-energy tradeoffs of logic wear-leveling for BTI-induced aging
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903171
R. Ashraf, N. Khoshavi, Ahmad Alzahrani, R. Demara, S. Kiamehr, M. Tahoori
Ensuring operational reliability in the presence of Bias Temperature Instability (BTI) effects often results in a compromise in the form of lower performance and/or higher energy consumption. This is due to the performance degradation over time caused by BTI effects, which needs to be compensated through frequency, voltage, or area margining to meet the circuit's timing specification until the end of the operational lifetime. In this paper, a circuit-level approach referred to as Logic-Wear-Leveling (LWL) utilizes Dark Silicon to mitigate BTI effects in logic datapaths. LWL introduces fine-grained spatial redundancy in timing-vulnerable logic components and leverages it at runtime to enable post-silicon adaptability. The activation interval of redundant datapaths allows for controlled stress and recovery phases. This produces a wear-leveling effect which helps to reduce the BTI-induced performance degradation over time, which in turn helps to reduce the design margins. This approach demonstrates a significant reduction in energy consumption of up to 31.98% at 10 years as compared to a conventional voltage guardbanding approach. The benefit of energy reduction is also assessed against the area overheads of spatial redundancy.
Citations: 8
P-Socket: optimizing a communication library for a PCIe-based intra-rack interconnect
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903168
Liuhang Zhang, Rui Hou, S. Mckee, Jianbo Dong, Lixin Zhang
Data centers require efficient, low-cost, flexible interconnects to manage the rapidly growing internal traffic generated by an increasingly diverse set of applications. To meet these requirements, data center networks are increasingly employing alternatives such as RapidIO, Freedom, and PCIe, which require fewer physical devices and/or have simpler protocols than more traditional interconnects. These networks offer raw high performance communication capabilities, but simply using them for conventional TCP/IP-based communication fails to realize the potential performance of the physical network. Here we analyze causes for this performance loss for the TCP/IP protocol over one such fabric, PCIe, and we explore a hardware/software solution that mitigates overheads and exploits PCIe's advanced features. The result is P-Socket, an efficient library that enables legacy socket applications to run without modification. Our experiments show that P-Socket achieves an end-to-end latency of 1.2μs and effective bandwidth of up to 2.87GB/s (out of a theoretical peak of 3.05GB/s).
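The key property claimed above is that legacy socket applications run without modification. The sketch below is ordinary BSD-style socket code (here over loopback TCP); under a socket-compatible library like P-Socket, the same calls would, per the abstract, be transported over the PCIe fabric with no source changes. The echo server/client is our own illustrative example, not code from the paper.

```python
import socket
import threading

def echo_server(srv):
    """Accept one connection and echo back whatever is received."""
    conn, _ = srv.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)  # echo the payload back unchanged

# Standard socket API: bind, listen, connect, send, recv. A drop-in
# library only has to preserve these calls for legacy apps to work.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

t = threading.Thread(target=echo_server, args=(srv,))
t.start()

cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"ping")
reply = cli.recv(1024)
cli.close()
t.join()
srv.close()
```

Preserving this API is what distinguishes the approach from raw PCIe DMA programming, which would require rewriting the application's communication layer.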
Citations: 2
Boosting performance of directory-based cache coherence protocols with coherence bypass at subpage granularity and a novel on-chip page table
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903175
M. Soltaniyeh, I. Kadayif, Özcan Özturk
Chip multiprocessors (CMPs) require effective cache coherence protocols as well as fast virtual-to-physical address translation mechanisms for high performance. Directory-based cache coherence protocols are the state-of-the-art approach in many-core CMPs to keep data blocks coherent at the last-level private caches. However, the area overhead and high associativity requirement of the directory structures may not scale well with an increasing number of cores. As shown in some prior studies, a significant percentage of data blocks are accessed by only one core; therefore, it is not necessary to track these in the directory structure. This study makes two major contributions. First, we show that, compared to the classification of cache blocks at page granularity as done in some previous studies, data block classification at subpage level helps to detect considerably more private data blocks. Consequently, it significantly reduces the percentage of blocks that must be tracked in the directory compared to similar page-level classification approaches. This, in turn, enables smaller directory caches with lower associativity to be used in CMPs without hurting performance, thereby helping the directory structure to scale gracefully with the increasing number of cores. Memory block classification at subpage level, however, may increase the frequency of the Operating System's (OS) involvement in updating the maintenance bits belonging to subpages stored in page table entries, nullifying some portion of the performance benefits of subpage-level data classification. To overcome this, we propose a distributed on-chip page table as our second contribution.
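Why subpage granularity finds more private data can be seen with a small sketch: a region is "private" if only one core ever accesses it, and two cores touching disjoint halves of the same page make the whole page look shared at page granularity while every subpage remains private. This toy classifier is our own illustration of the idea, not the paper's mechanism.

```python
def classify(accesses, granularity):
    """accesses: list of (core_id, byte_address) pairs.
    A region of `granularity` bytes is 'private' if exactly one core
    ever touches it, else 'shared'. Returns {region_index: label}."""
    owners = {}
    for core, addr in accesses:
        owners.setdefault(addr // granularity, set()).add(core)
    return {r: ("private" if len(cores) == 1 else "shared")
            for r, cores in owners.items()}

# Two cores accessing different 1 KiB subpages of the same 4 KiB page:
accesses = [(0, 0x0000), (0, 0x0100),   # core 0 stays in subpage 0
            (1, 0x0C00)]                # core 1 stays in subpage 3
page_view = classify(accesses, 4096)    # the single page looks shared
subpage_view = classify(accesses, 1024) # both touched subpages are private
```

Blocks classified as private can bypass the coherence directory entirely, which is where the directory-size savings reported above come from.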
Citations: 2
Prototyping real-time tracking systems on mobile devices
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903471
Kyunghun Lee, Haifa Ben Salem, T. Damarla, W. Stechele, S. Bhattacharyya
In this paper, we address the design and implementation of low-power embedded systems for real-time tracking of humans and vehicles. Such systems are important in applications such as activity monitoring and border security. We motivate the utility of mobile devices in prototyping the targeted class of tracking systems, and demonstrate a dataflow-based and cross-platform design methodology that enables efficient experimentation with key aspects of our tracking system design, including real-time operation, experimentation with advanced sensors, and streamlined management of design versions on host and mobile platforms. Our experiments demonstrate the utility of our mobile-device-targeted design methodology in validating tracking algorithm operation; evaluating the real-time performance, energy efficiency, and accuracy of tracking system execution; and quantifying trade-offs involving the use of advanced sensors, which offer improved sensing accuracy at the expense of increased cost and weight. Additionally, through application of a novel, cross-platform, model-based design approach, our design requires no change in source code when migrating from an initial, host-computer-based functional reference to a fully-functional implementation on the targeted mobile device.
Citations: 2
A lightweight user tracking method for app providers
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903484
R. M. Frey, Runhua Xu, A. Ilic
Since 2013, Google and Apple no longer allow app providers to use the persistent device identifiers (Android ID and UDID) for user tracking on mobile devices. Other tracking options either provoke severe privacy concerns, need additional hardware, or are practicable only for a limited number of companies. In this paper, we present a lightweight method that overcomes these weaknesses by using the set of installed apps on a device to create a unique fingerprint. The method was evaluated in a field study with 2410 users and 175,658 installed apps in total. The sets of installed apps are unique for 99.75% of all inspected users. Furthermore, even after reducing the granularity from apps to app categories to lessen users' privacy concerns, the results remain highly unique, with an identification rate of 96.22%. Since the information on installed apps and app categories on each device is freely available to any app developer, the method is a valuable instrument for app providers.
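The fingerprinting idea above can be sketched in a few lines: hash the (order-independent) set of installed app identifiers and measure what fraction of users end up with a unique hash. The hashing scheme and the tiny example population here are our own assumptions for illustration; the paper reports 99.75% uniqueness at app granularity and 96.22% at category granularity on its real dataset.

```python
import hashlib
from collections import Counter

def fingerprint(installed_apps):
    """Order-independent fingerprint: hash the sorted, deduplicated
    set of app identifiers (the exact hash is an assumption here)."""
    canon = "\n".join(sorted(set(installed_apps)))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

def uniqueness_rate(users):
    """Fraction of users whose app-set fingerprint occurs exactly once
    in the population, i.e. the identification rate of the method."""
    counts = Counter(fingerprint(apps) for apps in users)
    return sum(1 for apps in users
               if counts[fingerprint(apps)] == 1) / len(users)

# toy population: the first two users share the same app set
users = [["maps", "chat"], ["chat", "maps"], ["mail"], ["mail", "news"]]
rate = uniqueness_rate(users)  # 2 of 4 users are uniquely identifiable
```

Coarsening to app categories would simply map each identifier to its category before hashing, trading identification rate for privacy as the study describes.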
Citations: 6
IVM: a task-based shared memory programming model and runtime system to enable uniform access to CPU-GPU clusters
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903174
Kittisak Sajjapongse, Ruidong Gu, M. Becchi
GPUs have been widely used to accelerate a variety of applications from different domains and have become part of high-performance computing clusters. Yet, the use of GPUs within distributed applications still faces significant challenges in terms of programmability and performance portability. The use of popular programming models for distributed applications (such as MPI, SHMEM, and Charm++) in combination with GPU programming frameworks (such as CUDA and OpenCL) exposes to the programmer disjoint memory address spaces and provides a non-uniform view of compute resources (i.e., CPUs and GPUs). In addition, these programming models often perform static assignment of tasks to compute resources and require significant programming effort to embed dynamic scheduling and load balancing mechanisms within the application. In this work, we propose a programming framework called Inter-node Virtual Memory (IVM) that provides the programmer with a uniform view of compute resources and memory spaces within a CPU-GPU cluster, and a mechanism to easily incorporate load balancing within the application. We compare MPI, Charm++ and IVM on four distributed GPU applications. Our experimental results show that, while the main goal of IVM is programmer productivity, the use of the load balancing mechanisms offered by this framework can also lead to performance gains over existing frameworks.
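The contrast drawn above between static task assignment and runtime load balancing can be illustrated with a minimal dynamic-scheduling sketch: workers pull tasks from one shared queue, so an idle resource automatically takes the next task instead of being bound to a fixed partition. This toy single-node version is our own illustration of the general technique, not IVM's distributed runtime.

```python
import queue
import threading

def run_tasks(task_costs, n_workers):
    """Dynamic scheduling sketch: n_workers drain a shared task queue.
    Contrast with static assignment, where task i is fixed to worker
    i % n_workers regardless of how busy that worker already is."""
    q = queue.Queue()
    for cost in task_costs:
        q.put(cost)
    done = [0] * n_workers  # total simulated work executed per worker

    def worker(i):
        while True:
            try:
                cost = q.get_nowait()  # pull the next available task
            except queue.Empty:
                return                 # queue drained: worker retires
            done[i] += cost            # "execute" the task

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

per_worker = run_tasks([1, 2, 3, 4], 2)  # all 10 units of work complete
```

In a CPU-GPU cluster the same pull-based idea lets a fast GPU consume many more tasks than a slow CPU core, which is the load-balancing behavior the abstract credits for IVM's performance gains.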
Citations: 1
Accelerating the 3D euler atmospheric solver through heterogeneous CPU-GPU platforms
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903480
Jingheng Xu, H. Fu, L. Gan, Chao Yang, Wei Xue, Guangwen Yang
In climate change studies, the atmospheric model is an essential component for building a high-resolution climate simulation system. While the accuracy of atmospheric simulations has long been limited by the computational capabilities of CPU platforms, heterogeneous platforms equipped with accelerators are becoming promising candidates for achieving high simulation performance. However, due to the complex algorithms and heavy communication involved, atmospheric-model developers face tough challenges from both the algorithmic and architectural aspects. In this paper, we propose a hybrid algorithm to accelerate the solver of the Euler atmospheric equations, the most essential equation set for simulating mesoscale atmospheric dynamics. Based on the heterogeneous CPU-GPU platform, we develop a 3-dimensional domain decomposition mechanism that achieves more efficient utilization of the computing resources. Furthermore, an extensive set of optimization techniques is applied to boost the performance of the solver on both the host and the accelerator side. Compared with a fully optimized version running on two 6-core CPUs, the optimized Euler solver achieves a speedup of 6.64x when running on a hybrid node with two 6-core Intel Xeon E5645 CPUs and one Tesla K20c GPU. In addition, a nearly linear weak scaling result is achieved on a cluster with 12 CPU-GPU nodes. The experimental results demonstrate the promising possibility of applying heterogeneous architectures to atmospheric simulation.
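The 3-dimensional domain decomposition the abstract mentions can be illustrated by partitioning an (nx, ny, nz) mesh across a (px, py, pz) process grid, giving each rank a contiguous block in every dimension. This is a generic block decomposition sketch under that assumption, not the paper's implementation; all function names are illustrative.

```python
def block_range(n, parts, idx):
    """Split n cells into `parts` nearly equal contiguous blocks;
    return the [start, stop) range owned by block `idx`."""
    base, rem = divmod(n, parts)
    start = idx * base + min(idx, rem)       # earlier blocks absorb the remainder
    stop = start + base + (1 if idx < rem else 0)
    return start, stop

def decompose_3d(grid, procs):
    """Map each rank in a procs=(px, py, pz) process grid to its 3-D
    sub-domain of a grid=(nx, ny, nz) mesh."""
    px, py, pz = procs
    domains = {}
    rank = 0
    for i in range(px):
        for j in range(py):
            for k in range(pz):
                domains[rank] = (
                    block_range(grid[0], px, i),
                    block_range(grid[1], py, j),
                    block_range(grid[2], pz, k),
                )
                rank += 1
    return domains

doms = decompose_3d((10, 8, 6), (2, 2, 1))
print(doms[0])  # ((0, 5), (0, 4), (0, 6))
```

Splitting in all three dimensions (rather than 1-D slabs) shrinks each rank's halo surface relative to its volume, which is why such schemes reduce communication cost on CPU-GPU clusters.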
Citations: 1