{"title":"EMPRESS: Accelerating Scientific Discovery through Descriptive Metadata Management","authors":"Margaret Lawson, William Gropp, Jay Lofstead","doi":"https://dl.acm.org/doi/10.1145/3523698","DOIUrl":null,"url":null,"abstract":"<p>High-performance computing scientists are producing unprecedented volumes of data that take a long time to load for analysis. However, many analyses only require loading in the data containing particular features of interest and scientists have many approaches for identifying these features. Therefore, if scientists store information (descriptive metadata) about these identified features, then for subsequent analyses they can use this information to only read in the data containing these features. This can greatly reduce the amount of data that scientists have to read in, thereby accelerating analysis. Despite the potential benefits of descriptive metadata management, no prior work has created a descriptive metadata system that can help scientists working with a wide range of applications and analyses to restrict their reads to data containing features of interest. In this article, we present EMPRESS, the first such solution. EMPRESS offers all of the features needed to help accelerate discovery: It can accelerate analysis by up to 300 ×, supports a wide range of applications and analyses, is high-performing, is highly scalable, and requires minimal storage space. In addition, EMPRESS offers features required for a production-oriented system: scalable metadata consistency techniques, flexible system configurations, fault tolerance as a service, and portability.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"44 5","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2022-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Storage","FirstCategoryId":"94","ListUrlMain":"https://doi.org/https://dl.acm.org/doi/10.1145/3523698","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
High-performance computing scientists are producing unprecedented volumes of data that take a long time to load for analysis. However, many analyses only require loading in the data containing particular features of interest and scientists have many approaches for identifying these features. Therefore, if scientists store information (descriptive metadata) about these identified features, then for subsequent analyses they can use this information to only read in the data containing these features. This can greatly reduce the amount of data that scientists have to read in, thereby accelerating analysis. Despite the potential benefits of descriptive metadata management, no prior work has created a descriptive metadata system that can help scientists working with a wide range of applications and analyses to restrict their reads to data containing features of interest. In this article, we present EMPRESS, the first such solution. EMPRESS offers all of the features needed to help accelerate discovery: It can accelerate analysis by up to 300 ×, supports a wide range of applications and analyses, is high-performing, is highly scalable, and requires minimal storage space. In addition, EMPRESS offers features required for a production-oriented system: scalable metadata consistency techniques, flexible system configurations, fault tolerance as a service, and portability.
期刊介绍:
The ACM Transactions on Storage (TOS) is a new journal with an intent to publish original archival papers in the area of storage and closely related disciplines. Articles that appear in TOS will tend either to present new techniques and concepts or to report novel experiences and experiments with practical systems. Storage is a broad and multidisciplinary area that comprises of network protocols, resource management, data backup, replication, recovery, devices, security, and theory of data coding, densities, and low-power. Potential synergies among these fields are expected to open up new research directions.