Enhancing computation-to-core assignment with physical location information

ACM Sigplan Notices (Q1, Computer Science) | Pub Date: 2018-06-11 | DOI: 10.1145/3296979.3192386

Orhan Kislal, Jagadish B. Kotra, Xulong Tang, M. Kandemir, Myoungsoo Jung

Journal Article. ACM Sigplan Notices, volume 33 17, pages 312 - 327.

Citations: 20

Abstract

Going beyond a certain number of cores in modern architectures requires an on-chip network more scalable than conventional buses. However, employing an on-chip network in a manycore system (to improve scalability) makes the latencies of the data accesses issued by a core non-uniform. This non-uniformity can play a significant role in shaping the overall application performance. This work presents a novel compiler strategy that exposes architecture information to the compiler to enable an optimized computation-to-core mapping. Specifically, we propose a compiler-guided scheme that takes into account the relative positions of (and distances between) cores, last-level caches (LLCs) and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing the on-chip network traffic. The experimental data collected using a set of 21 multi-threaded applications reveal that, on average, our approach reduces the on-chip network latency in a 6×6 manycore system by 38.4% in the case of private LLCs, and 43.8% in the case of shared LLCs. These improvements translate to execution time improvements of 10.9% and 12.7% for the private LLC and shared LLC based systems, respectively.
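To make the idea of location-aware mapping concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a 6×6 mesh with XY routing (so distance is a Manhattan hop count), hypothetical memory controller positions at the four corners, and made-up access profiles per computation chunk. The names `hops`, `traffic_cost`, `assign`, `MCS`, and `chunks` are all illustrative. A simple greedy heuristic stands in for whatever optimization the paper actually uses.

```python
from itertools import product

MESH = 6  # 6x6 mesh, matching the system size mentioned in the abstract


def hops(a, b):
    """Manhattan distance between two mesh nodes (assumes XY routing)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def traffic_cost(core, accesses):
    """Weighted hop count; accesses maps a target node to an access count."""
    return sum(count * hops(core, node) for node, count in accesses.items())


def assign(chunks):
    """Greedily place each chunk on the free core with the lowest cost.

    chunks maps a chunk name to its {target_node: access_count} profile.
    Heaviest chunks are placed first; ties break toward the lowest (row, col).
    """
    free = set(product(range(MESH), repeat=2))
    mapping = {}
    for name in sorted(chunks, key=lambda n: -sum(chunks[n].values())):
        best = min(free, key=lambda core: (traffic_cost(core, chunks[name]), core))
        mapping[name] = best
        free.remove(best)
    return mapping


# Hypothetical memory controller positions at the four mesh corners.
MCS = {"mc0": (0, 0), "mc1": (0, 5), "mc2": (5, 0), "mc3": (5, 5)}

# Hypothetical access profiles: how often each chunk touches each controller.
chunks = {
    "chunk_a": {MCS["mc0"]: 100},
    "chunk_b": {MCS["mc3"]: 80, MCS["mc0"]: 20},
    "chunk_c": {MCS["mc0"]: 50},
}
mapping = assign(chunks)
```

In this toy setup, `chunk_a` lands on the node next to `mc0`, `chunk_b` lands near `mc3` because most of its accesses go there, and `chunk_c` takes the closest remaining node to `mc0`. A real compiler pass would also have to model LLC bank locations, data sharing between computations, and load balance, none of which this sketch attempts.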




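Source journal

ACM Sigplan Notices (Engineering and Technology - Computer Science: Software Engineering)

CiteScore

4.90

Self-citation rate

0.00%

Articles published

Review time

2-4 weeks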

Journal description: The ACM Special Interest Group on Programming Languages explores programming language concepts and tools, focusing on design, implementation, practice, and theory. Its members are programming language developers, educators, implementers, researchers, theoreticians, and users. SIGPLAN sponsors several major annual conferences, including the Symposium on Principles of Programming Languages (POPL), the Symposium on Principles and Practice of Parallel Programming (PPoPP), the Conference on Programming Language Design and Implementation (PLDI), the International Conference on Functional Programming (ICFP), the International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), as well as more than a dozen other events that are either smaller or held in cooperation with other SIGs. The monthly "ACM SIGPLAN Notices" publishes proceedings of selected sponsored events and an annual report on SIGPLAN activities. Members receive discounts on conference registrations and free access to ACM SIGPLAN publications in the ACM Digital Library. SIGPLAN recognizes significant research and service contributions of individuals with a variety of awards, supports current members through the Professional Activities Committee, and encourages future programming language enthusiasts with frequent Programming Languages Mentoring Workshops (PLMW).