Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M

2019 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2019-07-06 | DOI: 10.1109/HPEC.2019.8916508

J. Kepner, V. Gadepally, Lauren Milechin, S. Samsi, W. Arcand, David Bestor, William Bergeron, C. Byun, M. Hubbell, Michael Houle, Michael Jones, Anna Klein, P. Michaleas, J. Mullen, Andrew Prout, Antonio Rosa, Charles Yee, A. Reuther


Citations: 12

Abstract

The Dynamic Distributed Dimensional Data Model (D4M) library implements associative arrays in a variety of languages (Python, Julia, and Matlab/Octave) and provides a lightweight in-memory database implementation of hypersparse arrays that are ideal for analyzing many types of network data. D4M relies on associative arrays which combine properties of spreadsheets, databases, matrices, graphs, and networks, while providing rigorous mathematical guarantees, such as linearity. Streaming updates of D4M associative arrays put enormous pressure on the memory hierarchy. This work describes the design and performance optimization of an implementation of hierarchical associative arrays that reduces memory pressure and dramatically increases the update rate into an associative array. The parameters of hierarchical associative arrays rely on controlling the number of entries in each level in the hierarchy before an update is cascaded. The parameters are easily tunable to achieve optimal performance for a variety of applications. Hierarchical arrays achieve over 40,000 updates per second in a single instance. Scaling to 34,000 instances of hierarchical D4M associative arrays on 1,100 server nodes on the MIT SuperCloud achieved a sustained update rate of 1,900,000,000 updates per second. This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets.
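The cascading scheme described in the abstract can be sketched in a few lines. This is a minimal illustration in plain Python, not the D4M API: the class name `HierarchicalAssoc`, the `cuts` parameter, and the use of `collections.Counter` as a stand-in for an associative array are all assumptions made for the example. Each level holds at most a fixed number of entries; once a level exceeds its cut, it is added into the next level and cleared, so most updates touch only the smallest array.

```python
from collections import Counter


class HierarchicalAssoc:
    """Toy hierarchical associative array keyed by (row, col) pairs.

    Hypothetical sketch: a Counter stands in for a D4M associative array.
    """

    def __init__(self, cuts):
        # cuts[i] is the largest number of stored entries level i may hold
        # before it is merged into level i + 1. The last level is unbounded.
        self.cuts = list(cuts)
        self.levels = [Counter() for _ in range(len(self.cuts) + 1)]

    def update(self, row, col, value=1):
        # New data always lands in the smallest (fastest) level.
        self.levels[0][(row, col)] += value
        self._cascade()

    def _cascade(self):
        for i, cut in enumerate(self.cuts):
            if len(self.levels[i]) <= cut:
                break  # this level still fits, so deeper levels are unchanged
            # Addition is linear, so merging upward preserves the overall sum.
            self.levels[i + 1].update(self.levels[i])
            self.levels[i].clear()

    def total(self):
        # The full array is the sum of all levels.
        result = Counter()
        for level in self.levels:
            result.update(level)
        return result


if __name__ == "__main__":
    h = HierarchicalAssoc(cuts=[2, 4])
    for i in range(10):
        h.update(f"src{i % 5}", f"dst{i % 3}")
    print([len(level) for level in h.levels])
    print(sum(h.total().values()))
```

In this sketch the cut values play the role of the tunable parameters mentioned in the abstract: smaller cuts keep the first level small at the cost of more frequent merges, while larger cuts do the opposite.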
