Proceedings of the 49th Annual International Symposium on Computer Architecture最新文献

英文中文

SoftVN

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527378

Muhammad Umar, Weizhe Hua, Zhiru Zhang, G. Suh

Trusted execution environments (TEEs) in processors protect off-chip memory (DRAM), and ensure its confidentiality and integrity using memory encryption and integrity verification. However, such memory protection can incur significant performance overhead as it requires additional memory accesses for protection metadata such as version numbers (VNs) and MACs. This paper proposes SoftVN, an extension to the current memory protection schemes, which significantly reduces the overhead of today's state-of-the-art by allowing software to provide VNs for memory accesses. For memory-intensive applications with simple memory access patterns for large data structures, the VNs only need to be maintained for data structures instead of individual cache blocks and can be tracked in software with low efforts. Off-chip VN accesses for memory reads can be removed if they are tracked and provided by software. We evaluate SoftVN by simulating a diverse set of memory-intensive applications, including deep learning, graph processing, and bioinformatics algorithms. The experimental results show that SoftVN reduces the memory protection overhead by 82% compared to the baseline similar to Intel SGX, and improves the performance by 33% on average. The maximum performance improvement can be as high as 65%.

{"title":"SoftVN","authors":"Muhammad Umar, Weizhe Hua, Zhiru Zhang, G. Suh","doi":"10.1145/3470496.3527378","DOIUrl":"https://doi.org/10.1145/3470496.3527378","url":null,"abstract":"Trusted execution environments (TEEs) in processors protect off-chip memory (DRAM), and ensure its confidentiality and integrity using memory encryption and integrity verification. However, such memory protection can incur significant performance overhead as it requires additional memory accesses for protection metadata such as version numbers (VNs) and MACs. This paper proposes SoftVN, an extension to the current memory protection schemes, which significantly reduces the overhead of today's state-of-the-art by allowing software to provide VNs for memory accesses. For memory-intensive applications with simple memory access patterns for large data structures, the VNs only need to be maintained for data structures instead of individual cache blocks and can be tracked in software with low efforts. Off-chip VN accesses for memory reads can be removed if they are tracked and provided by software. We evaluate SoftVN by simulating a diverse set of memory-intensive applications, including deep learning, graph processing, and bioinformatics algorithms. The experimental results show that SoftVN reduces the memory protection overhead by 82% compared to the baseline similar to Intel SGX, and improves the performance by 33% on average. The maximum performance improvement can be as high as 65%.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121471928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Graphite 石墨

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1038/scientificamerican07101915-25supp

Zhangxiaowen Gong, Houxiang Ji, Yao Yao, Christopher W. Fletcher, Christopher J. Hughes, J. Torrellas

引用次数: 0

NvMR NvMR

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527413

A. Bhattacharyya, Abhijith Somashekhar, Joshua San Miguel

Intermittent systems on energy-harvesting devices have to frequently back up data because of an unreliable energy supply to make forward progress. These devices come with non-volatile memories like Flash/FRAM on board that are used to back up the system state. However, quite paradoxically, writing to a non-volatile memory consumes a lot of energy that makes backups expensive. Idem-potency violations inherent to intermittent programs are major contributors to the problem, as they render system state inconsistent and force backups to occur even when plenty of energy is available. In this work, we first characterize the complex persist dependencies that are unique to intermittent computing. Based on these insights, we propose NvMR, an intermittent architecture that eliminates idempotency violations in the program by renaming non-volatile memory addresses. This can reduce the number of backups to their theoretical minimum and decouple the decision of when to perform backups from the memory access constraints imposed by the program. Our evaluations show that compared to a state-of-the-art intermittent architecture, NvMR can save about 20% energy on average when running common embedded applications.

{"title":"NvMR","authors":"A. Bhattacharyya, Abhijith Somashekhar, Joshua San Miguel","doi":"10.1145/3470496.3527413","DOIUrl":"https://doi.org/10.1145/3470496.3527413","url":null,"abstract":"Intermittent systems on energy-harvesting devices have to frequently back up data because of an unreliable energy supply to make forward progress. These devices come with non-volatile memories like Flash/FRAM on board that are used to back up the system state. However, quite paradoxically, writing to a non-volatile memory consumes a lot of energy that makes backups expensive. Idem-potency violations inherent to intermittent programs are major contributors to the problem, as they render system state inconsistent and force backups to occur even when plenty of energy is available. In this work, we first characterize the complex persist dependencies that are unique to intermittent computing. Based on these insights, we propose NvMR, an intermittent architecture that eliminates idempotency violations in the program by renaming non-volatile memory addresses. This can reduce the number of backups to their theoretical minimum and decouple the decision of when to perform backups from the memory access constraints imposed by the program. Our evaluations show that compared to a state-of-the-art intermittent architecture, NvMR can save about 20% energy on average when running common embedded applications.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115307584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

A scalable architecture for reprioritizing ordered parallelism 一个可扩展的架构，用于重新确定有序并行的优先级

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527387

Gilead Posluns, Yan Zhu, Guowei Zhang, M. C. Jeffrey

Many algorithms schedule their work, or tasks, according to a priority order for correctness or faster convergence. While priority schedulers commonly implement task enqueue and dequeueMin operations, some algorithms need a priority update operation that alters the scheduling metadata for a task. Prior software and hardware systems that support scheduling with priority updates compromise on either parallelism, work-efficiency, or both, leading to missed performance opportunities. Moreover, incorrectly navigating these compromises violates correctness in those algorithms that are not resilient to relaxing priority order. We present Hive, a task-based execution model and multicore architecture that extracts abundant fine-grain parallelism from algorithms with priority updates, while retaining their strict priority schedules. Like prior hardware systems for ordered parallelism, Hive uses data- and control-dependence speculation and a large speculative window to execute tasks in parallel and out of order. Hive improves on prior work by (i) directly supporting updates in the interface, (ii) identifying the novel scheduler-carried dependence, and (iii) speculating on such dependences with task versioning, distinct from data versioning. Hive enables safe speculative updates to the schedule and avoids spurious conflicts among tasks to better utilize speculation tracking resources and efficiently uncover more parallelism. Across a suite of nine benchmarks, Hive improves performance at 256 cores by up to 2.8× over the next best hardware solution, and even more over software-only parallel schedulers.

许多算法根据正确或更快收敛的优先顺序来安排它们的工作或任务。虽然优先级调度器通常实现任务排队和出队列操作，但有些算法需要优先级更新操作来更改任务的调度元数据。先前支持优先级更新调度的软件和硬件系统会在并行性、工作效率或两者上做出妥协，从而导致错失性能机会。此外，错误地导航这些折衷会破坏那些不适应放松优先顺序的算法的正确性。我们提出了Hive，一个基于任务的执行模型和多核架构，它从具有优先级更新的算法中提取了大量的细粒度并行性，同时保留了严格的优先级调度。与之前的有序并行硬件系统一样，Hive使用数据和控制依赖的推测和一个大的推测窗口来并行和无序地执行任务。Hive通过以下方式改进了之前的工作:(i)直接支持接口更新，(ii)识别新的调度器携带的依赖关系，以及(iii)通过任务版本控制来推测这些依赖关系，而不是数据版本控制。Hive支持对计划进行安全的推测更新，避免任务之间的虚假冲突，从而更好地利用推测跟踪资源，并有效地发现更多的并行性。在一组9个基准测试中，Hive在256核下的性能比第二好的硬件解决方案提高了2.8倍，甚至比纯软件并行调度器更高。

{"title":"A scalable architecture for reprioritizing ordered parallelism","authors":"Gilead Posluns, Yan Zhu, Guowei Zhang, M. C. Jeffrey","doi":"10.1145/3470496.3527387","DOIUrl":"https://doi.org/10.1145/3470496.3527387","url":null,"abstract":"Many algorithms schedule their work, or tasks, according to a priority order for correctness or faster convergence. While priority schedulers commonly implement task enqueue and dequeueMin operations, some algorithms need a priority update operation that alters the scheduling metadata for a task. Prior software and hardware systems that support scheduling with priority updates compromise on either parallelism, work-efficiency, or both, leading to missed performance opportunities. Moreover, incorrectly navigating these compromises violates correctness in those algorithms that are not resilient to relaxing priority order. We present Hive, a task-based execution model and multicore architecture that extracts abundant fine-grain parallelism from algorithms with priority updates, while retaining their strict priority schedules. Like prior hardware systems for ordered parallelism, Hive uses data- and control-dependence speculation and a large speculative window to execute tasks in parallel and out of order. Hive improves on prior work by (i) directly supporting updates in the interface, (ii) identifying the novel scheduler-carried dependence, and (iii) speculating on such dependences with task versioning, distinct from data versioning. Hive enables safe speculative updates to the schedule and avoids spurious conflicts among tasks to better utilize speculation tracking resources and efficiently uncover more parallelism. Across a suite of nine benchmarks, Hive improves performance at 256 cores by up to 2.8× over the next best hardware solution, and even more over software-only parallel schedulers.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115320047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

ASAP: architecture support for asynchronous persistence ASAP:异步持久化的架构支持

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527399

Ahmed H. M. O. Abulila, I. E. Hajj, Myoungsoo Jung, Nam Sung Kim

Supporting atomic durability of updates for persistent memories is typically achieved with Write-Ahead Logging (WAL). WAL flushes log entries to persistent memory before making the actual data persistent to ensure that a consistent state can be recovered if a crash occurs. Performing WAL in hardware is attractive because it makes most aspects of log management transparent to software, and it completes log persist operations (LPOs) and data persist operations (DPOs) in the background, overlapping them with the execution of other instructions. Prior hardware logging solutions commit atomic regions synchronously. That is, once the end of a region is reached, all outstanding persist operations required for the region to commit must complete before instruction execution may proceed. For undo logging, LPOs and DPOs are both performed synchronously to ensure that the region commits synchronously. For redo logging, DPOs can be performed asynchronously, but LPOs are performed synchronously to ensure that the region commits synchronously. In both cases, waiting for synchronous persist operations (LPO or DPO) at the end of an atomic region causes atomic regions to incur high latency. To tackle this limitation, we propose ASAP, a hardware logging solution that allows atomic regions to commit asynchronously. That is, once the end of an atomic region is reached, instruction execution may proceed without waiting for outstanding persist operations to complete. As such, both LPOs and DPOs can be performed asynchronously. The challenge with allowing atomic regions to commit asynchronously is that it can lead to control and data dependence violations in the commit order of the atomic regions, leaving data in an unrecoverable state in case of a crash. To address this issue, ASAP tracks and enforces control and data dependencies between atomic regions in hardware to ensure that the regions commit in the proper order. Our evaluation shows that ASAP outperforms the state-of-the-art hardware undo and redo logging techniques by 1.41X and 1.53X, respectively, while achieving 0.96X the ideal performance when no persistence is enforced, at a small hardware cost (< 3%). ASAP also reduces memory traffic to persistent memory by 38% and 48%, compared with the state-of-the-art hardware undo and redo logging techniques, respectively. ASAP is robust against increasing persistent memory latency, making it suitable for both fast and slow persistent memory technologies.

支持持久内存更新的原子持久性通常是通过预写日志记录(Write-Ahead Logging, WAL)实现的。在将实际数据持久化之前，WAL会将日志条目刷新到持久化内存中，以确保在发生崩溃时可以恢复一致的状态。在硬件中执行WAL很有吸引力，因为它使日志管理的大多数方面对软件透明，并且它在后台完成日志持久化操作(LPOs)和数据持久化操作(DPOs)，使它们与其他指令的执行重叠。以前的硬件日志解决方案以同步方式提交原子区域。也就是说，一旦到达区域的末端，该区域提交所需的所有未完成的持久化操作必须在指令执行之前完成。对于撤销日志记录，lpo和dpo都是同步执行的，以确保区域同步提交。对于重做日志，dpo可以异步执行，但lpo是同步执行的，以确保区域同步提交。在这两种情况下，在原子区域的末尾等待同步持久化操作(LPO或DPO)会导致原子区域产生高延迟。为了解决这个限制，我们提出了ASAP，这是一个硬件日志解决方案，允许原子区域异步提交。也就是说，一旦到达原子区域的末端，指令执行可以继续进行，而不必等待未完成的持久化操作完成。因此，lpo和dpo都可以异步执行。允许原子区域异步提交的挑战在于，它可能导致原子区域的提交顺序违反控制和数据依赖关系，从而在发生崩溃时使数据处于不可恢复状态。为了解决这个问题，ASAP跟踪并加强硬件中原子区域之间的控制和数据依赖关系，以确保这些区域以正确的顺序提交。我们的评估表明，ASAP的性能比最先进的硬件撤销和重做日志记录技术分别高出1.41倍和1.53倍，而在不强制执行持久化的情况下，其性能达到理想性能的0.96倍，而且硬件成本很小(< 3%)。与最先进的硬件撤销和重做日志记录技术相比，ASAP还分别减少了38%和48%的内存流量。ASAP对于不断增加的持久内存延迟是健壮的，这使得它既适用于快速持久内存技术，也适用于慢速持久内存技术。

{"title":"ASAP: architecture support for asynchronous persistence","authors":"Ahmed H. M. O. Abulila, I. E. Hajj, Myoungsoo Jung, Nam Sung Kim","doi":"10.1145/3470496.3527399","DOIUrl":"https://doi.org/10.1145/3470496.3527399","url":null,"abstract":"Supporting atomic durability of updates for persistent memories is typically achieved with Write-Ahead Logging (WAL). WAL flushes log entries to persistent memory before making the actual data persistent to ensure that a consistent state can be recovered if a crash occurs. Performing WAL in hardware is attractive because it makes most aspects of log management transparent to software, and it completes log persist operations (LPOs) and data persist operations (DPOs) in the background, overlapping them with the execution of other instructions. Prior hardware logging solutions commit atomic regions synchronously. That is, once the end of a region is reached, all outstanding persist operations required for the region to commit must complete before instruction execution may proceed. For undo logging, LPOs and DPOs are both performed synchronously to ensure that the region commits synchronously. For redo logging, DPOs can be performed asynchronously, but LPOs are performed synchronously to ensure that the region commits synchronously. In both cases, waiting for synchronous persist operations (LPO or DPO) at the end of an atomic region causes atomic regions to incur high latency. To tackle this limitation, we propose ASAP, a hardware logging solution that allows atomic regions to commit asynchronously. That is, once the end of an atomic region is reached, instruction execution may proceed without waiting for outstanding persist operations to complete. As such, both LPOs and DPOs can be performed asynchronously. The challenge with allowing atomic regions to commit asynchronously is that it can lead to control and data dependence violations in the commit order of the atomic regions, leaving data in an unrecoverable state in case of a crash. To address this issue, ASAP tracks and enforces control and data dependencies between atomic regions in hardware to ensure that the regions commit in the proper order. Our evaluation shows that ASAP outperforms the state-of-the-art hardware undo and redo logging techniques by 1.41X and 1.53X, respectively, while achieving 0.96X the ideal performance when no persistence is enforced, at a small hardware cost (< 3%). ASAP also reduces memory traffic to persistent memory by 38% and 48%, compared with the state-of-the-art hardware undo and redo logging techniques, respectively. ASAP is robust against increasing persistent memory latency, making it suitable for both fast and slow persistent memory technologies.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127176817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

FlexiCores

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527410

Nathaniel Bleier, Calvin Lee, F. Rodríguez, A. Sou, Scott White, Rakesh Kumar

Flexible electronics is a promising approach to target applications whose computational needs are not met by traditional silicon-based electronics due to their conformality, thinness, or cost requirements. A microprocessor is a critical component for many such applications; however, it is unclear whether it is feasible to build flexible processors at scale (i.e., at high yield), since very few flexible microprocessors have been reported and no yield data or data from multiple chips has been reported. Also, prior manufactured flexible systems were not field-reprogrammable and were evaluated either on a simple set of test vectors or a single program. A working flexible microprocessor chip supporting complex or multiple applications has not been demonstrated. Finally, no prior work performs a design space of flexible microprocessors to optimize area, code size, and energy of such microprocessors. In this work, we fabricate and test hundreds of FlexiCores - flexible 0.8 μm IGZO TFT-based field-reprogrammable 4 and 8-bit microprocessor chips optimized for low footprint and yield. We show that these gate count-optimized processors can have high yield (4-bit FlexiCores have 81% yield - sufficient to enable sub-cent cost if produced at volume). We evaluate these chips over a suite of representative kernels - the kernels take 4.28 ms to 12.9 ms and 21.0 μJ to 61.4 μJ for execution (at 360 nJ per instruction). We also present the first characterization of process variation for a flexible processor - we observe significant process variation (relative standard deviation of 15.3% and 21.5% in terms of current draw of 4-bit and 8-bit FlexiCore chips respectively). Finally, we perform a design space exploration and identify design points much better than FlexiCores - the new cores consume only 45--56% the energy of the base design, and have code size less than 30% of the base design, with an area overhead of 9--37%.

{"title":"FlexiCores","authors":"Nathaniel Bleier, Calvin Lee, F. Rodríguez, A. Sou, Scott White, Rakesh Kumar","doi":"10.1145/3470496.3527410","DOIUrl":"https://doi.org/10.1145/3470496.3527410","url":null,"abstract":"Flexible electronics is a promising approach to target applications whose computational needs are not met by traditional silicon-based electronics due to their conformality, thinness, or cost requirements. A microprocessor is a critical component for many such applications; however, it is unclear whether it is feasible to build flexible processors at scale (i.e., at high yield), since very few flexible microprocessors have been reported and no yield data or data from multiple chips has been reported. Also, prior manufactured flexible systems were not field-reprogrammable and were evaluated either on a simple set of test vectors or a single program. A working flexible microprocessor chip supporting complex or multiple applications has not been demonstrated. Finally, no prior work performs a design space of flexible microprocessors to optimize area, code size, and energy of such microprocessors. In this work, we fabricate and test hundreds of FlexiCores - flexible 0.8 μm IGZO TFT-based field-reprogrammable 4 and 8-bit microprocessor chips optimized for low footprint and yield. We show that these gate count-optimized processors can have high yield (4-bit FlexiCores have 81% yield - sufficient to enable sub-cent cost if produced at volume). We evaluate these chips over a suite of representative kernels - the kernels take 4.28 ms to 12.9 ms and 21.0 μJ to 61.4 μJ for execution (at 360 nJ per instruction). We also present the first characterization of process variation for a flexible processor - we observe significant process variation (relative standard deviation of 15.3% and 21.5% in terms of current draw of 4-bit and 8-bit FlexiCore chips respectively). Finally, we perform a design space exploration and identify design points much better than FlexiCores - the new cores consume only 45--56% the energy of the base design, and have code size less than 30% of the base design, with an area overhead of 9--37%.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124809331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

PPMLAC

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527392

Xingni Zhou, Zhilei Xu, Cong Wang, M. Gao

Privacy issue is a main concern restricting data sharing and cross-organization collaborations. While Privacy-Preserving Machine Learning techniques such as Multi-Party Computations (MPC), Homomorphic Encryption, and Federated Learning are proposed to solve this problem, no solution exists with both strong security and high performance to run large-scale, complex machine learning models. This paper presents PPMLAC, a novel chipset architecture to accelerate MPC, which combines MPC's strong security and hardware's high performance, eliminates the communication bottleneck from MPC, and achieves several orders of magnitudes speed up over software-based MPC. It is carefully designed to only rely on a minimum set of simple hardware components in the trusted domain, thus is robust against side-channel attacks and malicious adversaries. Our FPGA prototype can run mainstream large-scale ML models like ResNet in near real-time under a practical network environment with non-negligible latency, which is impossible for existing MPC solutions.

引用次数: 1

SmartSAGE

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527391

Yunjae Lee, Jin-Won Chung, Minsoo Rhu

Graph neural networks (GNNs) can extract features by learning both the representation of each objects (i.e., graph nodes) and the relationship across different objects (i.e., the edges that connect nodes), achieving state-of-the-art performance in various graph-based tasks. Despite its strengths, utilizing these algorithms in a production environment faces several challenges as the number of graph nodes and edges amount to several billions to hundreds of billions scale, requiring substantial storage space for training. Unfortunately, state-of-the-art ML frameworks employ an in-memory processing model which significantly hampers the productivity of ML practitioners as it mandates the overall working set to fit within DRAM capacity. In this work, we first conduct a detailed characterization on a state-of-the-art, large-scale GNN training algorithm, GraphSAGE. Based on the characterization, we then explore the feasibility of utilizing capacity-optimized NVMe SSDs for storing memory-hungry GNN data, which enables large-scale GNN training beyond the limits of main memory size. Given the large performance gap between DRAM and SSD, however, blindly utilizing SSDs as a direct substitute for DRAM leads to significant performance loss. We therefore develop SmartSAGE, our software/hardware co-design based on an in-storage processing (ISP) architecture. Our work demonstrates that an ISP based large-scale GNN training system can achieve both high capacity storage and high performance, opening up opportunities for ML practitioners to train large GNN datasets without being hampered by the physical limitations of main memory size.

{"title":"SmartSAGE","authors":"Yunjae Lee, Jin-Won Chung, Minsoo Rhu","doi":"10.1145/3470496.3527391","DOIUrl":"https://doi.org/10.1145/3470496.3527391","url":null,"abstract":"Graph neural networks (GNNs) can extract features by learning both the representation of each objects (i.e., graph nodes) and the relationship across different objects (i.e., the edges that connect nodes), achieving state-of-the-art performance in various graph-based tasks. Despite its strengths, utilizing these algorithms in a production environment faces several challenges as the number of graph nodes and edges amount to several billions to hundreds of billions scale, requiring substantial storage space for training. Unfortunately, state-of-the-art ML frameworks employ an in-memory processing model which significantly hampers the productivity of ML practitioners as it mandates the overall working set to fit within DRAM capacity. In this work, we first conduct a detailed characterization on a state-of-the-art, large-scale GNN training algorithm, GraphSAGE. Based on the characterization, we then explore the feasibility of utilizing capacity-optimized NVMe SSDs for storing memory-hungry GNN data, which enables large-scale GNN training beyond the limits of main memory size. Given the large performance gap between DRAM and SSD, however, blindly utilizing SSDs as a direct substitute for DRAM leads to significant performance loss. We therefore develop SmartSAGE, our software/hardware co-design based on an in-storage processing (ISP) architecture. Our work demonstrates that an ISP based large-scale GNN training system can achieve both high capacity storage and high performance, opening up opportunities for ML practitioners to train large GNN datasets without being hampered by the physical limitations of main memory size.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130847164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 16

Crescent 新月

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1093/gmo/9781561592630.article.j106100

Yu Feng, Gunnar Hammonds, Yiming Gan, Yuhao Zhu

引用次数: 0

CaSMap CaSMap

Proceedings of the 49th Annual International Symposium on Computer Architecture

Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527426

Xingchen Man, Jianfeng Zhu, Guihuan Song, S. Yin, Shaojun Wei, Leibo Liu

Today, reconfigurable spatial architectures (RSAs) have sprung up as accelerators for compute- and data-intensive domains because they deliver energy and area efficiency close to ASICs and still retain sufficient programmability to keep the development cost low. The mapper, which is responsible for mapping algorithms onto RSAs, favors a systematic backtracking methodology because of high portability for evolving RSA designs. However, exponentially scaling compilation time has become the major obstacle. The key observation of this paper is that the key limiting factor to the systematic backtracking mappers is the waterfall mapping model which resolves all mapping variables and constraints at the same time using single-level intermediate representations (IRs). This work proposes CaSMap, an agile mapper framework independent of software and hardware of RSAs. By clustering the lowest-level software and hardware IRs into multi-level IRs, the original mapping process can be scattered as multi-stage decomposed ones and therefore the mapping problem with exponential complexity is mitigated. This paper introduces (a) strategies for clustering low-level hardware and software IRs with static connectivity and critical path analysis. (b) a multi-level scattered mapping model in which the higher-level model carries out the heuristics from IR clustering, endeavors to promote mapping success rate, and reduces the scale of the lower-level model. Our evaluation shows that CaSMap is able to reduce the problem scale (nonzeros) by 80.5% (23.1%-94.9%) and achieve a mapping time speedup of 83X over the state-of-the-art waterfall mapper across four different RSA topologies: MorphoSys, HReA, HyCUBE, and REVEL.

{"title":"CaSMap","authors":"Xingchen Man, Jianfeng Zhu, Guihuan Song, S. Yin, Shaojun Wei, Leibo Liu","doi":"10.1145/3470496.3527426","DOIUrl":"https://doi.org/10.1145/3470496.3527426","url":null,"abstract":"Today, reconfigurable spatial architectures (RSAs) have sprung up as accelerators for compute- and data-intensive domains because they deliver energy and area efficiency close to ASICs and still retain sufficient programmability to keep the development cost low. The mapper, which is responsible for mapping algorithms onto RSAs, favors a systematic backtracking methodology because of high portability for evolving RSA designs. However, exponentially scaling compilation time has become the major obstacle. The key observation of this paper is that the key limiting factor to the systematic backtracking mappers is the waterfall mapping model which resolves all mapping variables and constraints at the same time using single-level intermediate representations (IRs). This work proposes CaSMap, an agile mapper framework independent of software and hardware of RSAs. By clustering the lowest-level software and hardware IRs into multi-level IRs, the original mapping process can be scattered as multi-stage decomposed ones and therefore the mapping problem with exponential complexity is mitigated. This paper introduces (a) strategies for clustering low-level hardware and software IRs with static connectivity and critical path analysis. (b) a multi-level scattered mapping model in which the higher-level model carries out the heuristics from IR clustering, endeavors to promote mapping success rate, and reduces the scale of the lower-level model. Our evaluation shows that CaSMap is able to reduce the problem scale (nonzeros) by 80.5% (23.1%-94.9%) and achieve a mapping time speedup of 83X over the state-of-the-art waterfall mapper across four different RSA topologies: MorphoSys, HReA, HyCUBE, and REVEL.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122297364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Proceedings of the 49th Annual International Symposium on Computer Architecture

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀