Inverted indices for particle tracking in petascale cosmological simulations

Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management
Pub Date: 2013-07-29
DOI: 10.1145/2484838.2484882

D. Crankshaw, R. Burns, B. Falck, T. Budavári, A. Szalay, Jie-Shuang Wang

Pages: 25:1-25:10

Citations: 6

Abstract

We describe the challenges arising from tracking dark matter particles in state-of-the-art cosmological simulations. We are in the process of running the Indra suite of simulations, with an aggregate count of more than 35 trillion particles and 1.1 PB of total raw data volume. It is not enough, however, to store the particle positions and velocities efficiently: analyses must also be able to track individual particles efficiently through the temporal history of the simulation. The required inverted indices can easily have raw sizes comparable to the original simulation.

We explore various strategies for creating an efficient index for such data, using additional insight from the physical properties of the particle motions to obtain a greatly compressed data representation. The basic particle data are stored in a relational database in coarse-grained containers corresponding to the leaves of a fixed-depth octree, labeled by their Peano-Hilbert (PH) index. Within each container the individual objects are sorted by their Lagrangian identifier. Thus each particle has a multi-level address: the PH key of the container and the index of the particle within the sorted array (the slot).

Given the nature of the cosmological simulations and the choice of PH-box sizes, particles can only cross into spatially adjacent boxes between consecutive snapshots. Likewise, the slot number of a particle typically shifts up or down by only a small number between adjacent snapshots. As a result, a special version of delta encoding over the multi-tier address already yields a dramatic reduction in the data that needs to be stored.

We follow with an efficient bit-compression that adapts to the statistical properties of the two-part addresses, achieving a final compression ratio better than a factor of 9. The final size of the full inverted index is projected to be 22.5 TB for a petabyte ensemble of simulations.
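The two-part address and its delta encoding can be illustrated with a small sketch. This is not the authors' implementation: integer box coordinates `(ix, iy, iz)` stand in for the PH key, periodic wrap-around is ignored, and the function names, the 0..26 neighbor code, and the zigzag mapping of slot deltas are assumptions made for illustration only. The sketch shows that when a particle moves at most one box per snapshot, its new address can be stored as a small neighbor code plus a small signed slot shift.

```python
from collections import defaultdict

def build_addresses(box_of):
    """Map particle id -> (box, slot) for one snapshot.

    box_of: dict {particle_id: (ix, iy, iz)}; the box tuple is a
    hypothetical stand-in for the PH key of the container.
    Within each box, particles are sorted by id; the slot is the
    particle's position in that sorted array.
    """
    boxes = defaultdict(list)
    for pid, box in box_of.items():
        boxes[box].append(pid)
    addr = {}
    for box, pids in boxes.items():
        for slot, pid in enumerate(sorted(pids)):
            addr[pid] = (box, slot)
    return addr

def zigzag(n):
    """Map a signed int to an unsigned one so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(z):
    """Inverse of zigzag."""
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)

def encode_step(prev, curr):
    """Delta-encode curr addresses against prev, in particle-id order."""
    out = []
    for pid in sorted(prev):
        (b0, s0), (b1, s1) = prev[pid], curr[pid]
        dx, dy, dz = (b1[i] - b0[i] for i in range(3))
        # Assumption taken from the abstract: moves are to adjacent boxes only.
        assert max(abs(dx), abs(dy), abs(dz)) <= 1, "non-adjacent move"
        code = (dx + 1) * 9 + (dy + 1) * 3 + (dz + 1)  # 0..26, 13 = stayed put
        out.append((code, zigzag(s1 - s0)))
    return out

def decode_step(prev, deltas):
    """Rebuild the next snapshot's addresses from prev and the deltas."""
    curr = {}
    for pid, (code, z) in zip(sorted(prev), deltas):
        (bx, by, bz), s0 = prev[pid]
        dx, dy, dz = code // 9 - 1, (code // 3) % 3 - 1, code % 3 - 1
        curr[pid] = ((bx + dx, by + dy, bz + dz), s0 + unzigzag(z))
    return curr

# Toy data: particle 1 crosses from box (0,0,0) into the adjacent box (1,0,0).
snap0 = {0: (0, 0, 0), 1: (0, 0, 0), 2: (0, 0, 0), 3: (1, 0, 0)}
snap1 = {0: (0, 0, 0), 1: (1, 0, 0), 2: (0, 0, 0), 3: (1, 0, 0)}
a0, a1 = build_addresses(snap0), build_addresses(snap1)
deltas = encode_step(a0, a1)
restored = decode_step(a0, deltas)
```

Each delta is a pair of small non-negative integers, which is the kind of input a subsequent bit-level compressor can pack tightly. The adaptive bit-compression described in the abstract is not reproduced here.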

