
Proceedings of the 8th Parallel Data Storage Workshop: Latest Publications

Asynchronous object storage with QoS for scientific and commercial big data
Pub Date: 2013-11-17 · DOI: 10.1145/2538542.2538565
Michael J. Brim, D. Dillow, S. Oral, B. Settlemyer, Feiyi Wang
This paper presents our design for an asynchronous object storage system intended for use in scientific and commercial big data workloads. Use cases from the target workload domains are used to motivate the key abstractions used in the application programming interface (API). The architecture of the Scalable Object Store (SOS), a prototype object storage system that supports the API's facilities, is presented. The SOS serves as a vehicle for future research into scalable and resilient big data object storage. We briefly review our research into providing efficient storage servers capable of providing quality of service (QoS) contracts relevant for big data use cases.
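The core ideas in this abstract, asynchronous object puts/gets paired with a per-request QoS contract, can be sketched briefly. The class and field names below (`ObjectStore`, `QoSContract`, `min_mb_per_s`) are hypothetical illustrations, not the actual SOS API:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class QoSContract:
    """Hypothetical QoS contract: a bandwidth floor the server would try to honor."""
    min_mb_per_s: float

class ObjectStore:
    """Toy in-memory stand-in for an asynchronous object store."""
    def __init__(self):
        self._objects = {}

    async def put(self, key: str, data: bytes, qos: QoSContract) -> None:
        # A real server would schedule the write to meet qos.min_mb_per_s;
        # here we just yield to the event loop to model asynchrony.
        await asyncio.sleep(0)
        self._objects[key] = data

    async def get(self, key: str) -> bytes:
        await asyncio.sleep(0)
        return self._objects[key]

async def demo():
    store = ObjectStore()
    # Writes are issued concurrently; the caller is not blocked per object.
    await asyncio.gather(
        store.put("sim/step-001", b"checkpoint-a", QoSContract(100.0)),
        store.put("sim/step-002", b"checkpoint-b", QoSContract(100.0)),
    )
    return await store.get("sim/step-001")

result = asyncio.run(demo())
```

The point of the asynchronous interface is that a scientific application can overlap many checkpoint writes instead of serializing them; the QoS contract travels with each request so the server can prioritize.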
Citations: 16
SDS: a framework for scientific data services
Pub Date: 2013-11-17 · DOI: 10.1145/2538542.2538563
Bin Dong, S. Byna, Kesheng Wu
Large-scale scientific applications typically write their data to parallel file systems with organizations designed to achieve fast write speeds. Analysis tasks frequently read the data in a pattern that is different from the write pattern, and therefore experience poor I/O performance. In this paper, we introduce a prototype framework for bridging the performance gap between write and read stages of data access from parallel file systems. We call this framework Scientific Data Services, or SDS for short. This initial implementation of SDS focuses on reorganizing previously written files into data layouts that benefit read patterns, and transparently directs read calls to the reorganized data. SDS follows a client-server architecture. The SDS Server manages partial or full replicas of reorganized datasets and serves SDS Clients' requests for data. The current version of the SDS client library supports HDF5 programming interface for reading data. The client library intercepts HDF5 calls using the HDF5 Virtual Object Layer (VOL) and transparently redirects them to the reorganized data. The SDS client library also provides a querying interface for reading part of the data based on user-specified selective criteria. We describe the design and implementation of the SDS client-server architecture, and evaluate the response time of the SDS Server and the performance benefits of SDS.
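The transparent-redirection idea can be illustrated with a toy catalog: reads of an original file are silently served from a read-optimized replica when one exists. This is a conceptual stand-in, not the HDF5 VOL mechanism SDS actually uses; all names and paths below are invented for illustration:

```python
class ReorganizedCatalog:
    """Toy stand-in for the SDS Server's catalog, mapping original file
    paths to read-optimized replicas."""
    def __init__(self):
        self._replicas = {}

    def register(self, original, replica):
        self._replicas[original] = replica

    def lookup(self, original):
        return self._replicas.get(original)

def sds_open(path, catalog, opener=lambda p: f"handle:{p}"):
    # Transparent redirection: if a reorganized layout exists for this
    # path, the application-issued open is served from the replica.
    replica = catalog.lookup(path)
    return opener(replica if replica is not None else path)

catalog = ReorganizedCatalog()
catalog.register("/pfs/raw/run42.h5", "/pfs/sds/run42.rowmajor.h5")
redirected = sds_open("/pfs/raw/run42.h5", catalog)
passthrough = sds_open("/pfs/raw/other.h5", catalog)
```

Because the redirection happens below the application's I/O interface, analysis codes keep their read patterns unchanged while benefiting from the reorganized layout.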
Citations: 31
Proceedings of the 8th Parallel Data Storage Workshop
Pub Date: 2013-11-17 · DOI: 10.1145/2538542
Dean Hildebrand, K. Schwan
Citations: 0
Structuring PLFS for extensibility
Pub Date: 2013-11-17 · DOI: 10.1145/2538542.2538564
C. Cranor, Milo Polte, Garth A. Gibson
The Parallel Log Structured Filesystem (PLFS) [5] was designed to transparently transform highly concurrent, massive high-performance computing (HPC) N-to-1 checkpoint workloads into N-to-N workloads to avoid single-file performance bottlenecks in typical HPC distributed filesystems. PLFS has produced speedups of 2-150X for N-1 workloads at Los Alamos National Lab. Having successfully improved N-1 performance, we have restructured PLFS for extensibility so that it can be applied to more workloads and storage systems. In this paper we describe PLFS' evolution from a single-purpose log-structured middleware filesystem into a more general platform for transparently translating application I/O patterns. As an example of this extensibility, we show how PLFS can now be used to enable HPC applications to perform N-1 checkpoints on an HDFS-based cloud storage system.
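The N-to-1 to N-to-N transformation at the heart of PLFS can be modeled in a few lines: each writer appends to its own private log, and an index maps logical file offsets back to (log, offset) pairs so reads still see one logical file. This is a simplified single-process sketch, not PLFS's on-disk format:

```python
class PLFSLikeFile:
    """Toy model of the N-to-1 -> N-to-N transformation: each writer
    appends to a private log and records index entries of the form
    (logical_offset, length, log_id, log_offset); reads resolve
    through the index."""
    def __init__(self, num_writers):
        self.logs = [bytearray() for _ in range(num_writers)]
        self.index = []

    def write(self, writer_id, logical_offset, data):
        # Append-only per-writer log: no two writers ever contend
        # for the same file region.
        log = self.logs[writer_id]
        self.index.append((logical_offset, len(data), writer_id, len(log)))
        log.extend(data)

    def read(self, logical_offset, length):
        out = bytearray(length)
        for lo, ln, wid, loff in self.index:
            # Copy the overlap of this extent with the requested range.
            start = max(lo, logical_offset)
            end = min(lo + ln, logical_offset + length)
            if start < end:
                src = self.logs[wid][loff + (start - lo):loff + (end - lo)]
                out[start - logical_offset:end - logical_offset] = src
        return bytes(out)

f = PLFSLikeFile(2)
f.write(0, 0, b"AAAA")   # writer 0 owns logical bytes 0-3
f.write(1, 4, b"BBBB")   # writer 1 owns logical bytes 4-7
```

Turning every strided write into a local append is why a shared N-to-1 checkpoint stops being a bottleneck: the underlying filesystem only ever sees N independent files.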
Citations: 8
Performance and scalability evaluation of the Ceph parallel file system
Pub Date: 2013-11-17 · DOI: 10.1145/2538542.2538562
Feiyi Wang, M. Nelson, S. Oral, S. Atchley, S. Weil, B. Settlemyer, Blake Caldwell, Jason Hill
Ceph is an emerging open-source parallel distributed file and storage system. By design, Ceph leverages unreliable commodity storage and network hardware, and provides reliability and fault-tolerance via controlled object placement and data replication. This paper presents our file and block I/O performance and scalability evaluation of Ceph for scientific high-performance computing (HPC) environments. Our work makes two unique contributions. First, our evaluation is performed under a realistic setup for a large-scale capability HPC environment using a commercial high-end storage system. Second, our path of investigation, tuning efforts, and findings made direct contributions to Ceph's development and improved code quality, scalability, and performance. These changes should benefit both Ceph and the HPC community at large.
Citations: 25
Active data: a data-centric approach to data life-cycle management
Pub Date: 2013-11-17 · DOI: 10.1145/2538542.2538566
Anthony Simonet, G. Fedak, M. Ripeanu, S. Al-Kiswany
Data-intensive science offers new opportunities for innovation and discoveries, provided that large datasets can be handled efficiently. Data management for data-intensive science applications is challenging: it requires support for complex data life cycles, coordination across multiple sites, fault tolerance, and scalability to tens of sites and petabytes of data. In this paper, we argue that data management for data-intensive science applications requires a fundamentally different approach than the current ad hoc, task-centric one. We propose Active Data, a novel paradigm for data life cycle management. Active Data follows two principles: it is data-centric and event-driven. We report on the Active Data programming model and its preliminary implementation, and discuss the benefits and limitations of the approach on recognized challenging data-intensive science use cases.
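A data-centric, event-driven life cycle can be pictured as a small state machine: a data item moves through states, and handlers fire on each transition. The states and handler below are invented for illustration and are not the Active Data model itself:

```python
# Life cycle as a tiny state machine: a set of legal transitions, and
# handlers fired when a data item enters a new state.
TRANSITIONS = {
    ("created", "transferred"),
    ("transferred", "replicated"),
    ("replicated", "archived"),
    ("replicated", "deleted"),
}

class DataItem:
    def __init__(self, name):
        self.name = name
        self.state = "created"
        self.log = []

    def fire(self, new_state, handlers):
        # Event-driven: progress is expressed as transitions, and any
        # site can attach logic to a transition without knowing the
        # tasks that caused it.
        if (self.state, new_state) not in TRANSITIONS:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        for handler in handlers.get(new_state, []):
            handler(self)

handlers = {"replicated": [lambda item: item.log.append(f"verify {item.name}")]}
item = DataItem("dataset-7")
item.fire("transferred", handlers)
item.fire("replicated", handlers)
```

The contrast with a task-centric design is that the coordination logic hangs off the data's state transitions rather than off any particular job's completion.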
Citations: 6
Efficient transactions for parallel data movement
Pub Date: 2013-11-17 · DOI: 10.1145/2538542.2538567
J. Lofstead, Jai Dayal, I. Jimenez, C. Maltzahn
The rise of Integrated Application Workflows (IAWs) for processing data prior to storage on persistent media prompts the need to incorporate features that reproduce many of the semantics of persistent storage devices. One such feature is the ability to manage data sets as chunks with natural barriers between different data sets. Towards that end, we need a mechanism to ensure that data moved to an intermediate storage area is both complete and correct before allowing access by other processing components. The Doubly Distributed Transactions (D2T) protocol offers such a mechanism. The initial development [9] suffered from scalability limitations and undue requirements on server processes. The current version has addressed these limitations and has demonstrated scalability with low overhead.
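The "complete and correct before visible" guarantee can be sketched as a voting rule: a data set becomes visible in the intermediate store only once every writer and every storage server has reported success. This is a minimal sketch of the idea, not the D2T wire protocol; the class and method names are hypothetical:

```python
class Transaction:
    """Sketch of a doubly distributed transaction: many writers AND many
    servers must all report success before the data set becomes visible."""
    def __init__(self, writers, servers):
        self.expected = {("w", w) for w in writers} | {("s", s) for s in servers}
        self.votes = {}
        self.visible = False

    def vote(self, kind, participant, ok):
        self.votes[(kind, participant)] = ok

    def try_commit(self):
        # Commit only when every expected participant has voted yes;
        # a missing or negative vote keeps the data set invisible.
        if set(self.votes) == self.expected and all(self.votes.values()):
            self.visible = True
        return self.visible

txn = Transaction(writers=[0, 1], servers=["s0"])
txn.vote("w", 0, True)
partial = txn.try_commit()      # not all votes in yet
txn.vote("w", 1, True)
txn.vote("s", "s0", True)
committed = txn.try_commit()    # unanimous: data set becomes visible
```

"Doubly distributed" refers to exactly this shape: both the client side and the server side of the transfer are parallel, so neither side alone can declare the data set complete.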
Citations: 7
Fourier-assisted machine learning of hard disk drive access time models
Pub Date: 2013-11-17 · DOI: 10.1145/2538542.2538561
A. Crume, C. Maltzahn, L. Ward, Thomas M. Kroeger, M. Curry, R. Oldfield
Predicting access times is a crucial part of predicting hard disk drive performance. Existing approaches use white-box modeling and require intimate knowledge of the internal layout of the drive, which can take months to extract. Automatically learning this behavior is a much more desirable approach, requiring less expert knowledge, fewer assumptions, and less time. Others have created behavioral models of hard disk drive performance, but none have shown low per-request errors. A barrier to machine learning of access times has been the existence of periodic behavior with high, unknown frequencies. We show how hard disk drive access times can be predicted to within 0.83 ms using a neural net after these frequencies are found using Fourier analysis.
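The Fourier step the abstract describes, recovering a hidden periodicity (such as rotational position) from a trace before handing features to a learner, can be illustrated with a naive DFT peak-pick on a synthetic trace. The trace and its 5 Hz component are fabricated for the example; the paper's actual pipeline and frequencies differ:

```python
import math

def dominant_frequency(samples, sample_rate):
    """Naive DFT: return the frequency of the bin with the largest
    magnitude, excluding the DC bin. Stands in for the Fourier step
    that uncovers a drive's hidden periodic behavior."""
    n = len(samples)
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2):
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = sum(-s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k * sample_rate / n

# Synthetic "access time" trace: a 3 ms baseline plus a 5 Hz
# periodic component, sampled at 100 Hz for one second.
rate = 100
trace = [3.0 + math.sin(2 * math.pi * 5 * i / rate) for i in range(rate)]
freq = dominant_frequency(trace, rate)
```

Once such a frequency is known, the phase within each period becomes an input feature, which is what lets a neural net fit behavior that otherwise looks like high-frequency noise.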
Citations: 4
Predicting intermediate storage performance for workflow applications
Pub Date: 2013-02-19 · DOI: 10.1145/2538542.2538560
L. Costa, S. Al-Kiswany, A. Barros, Hao Yang, M. Ripeanu
System configuration decisions for I/O-intensive workflow applications can be complex even for expert users. Users face decisions about how to configure several parameters optimally (e.g., replication level, chunk size, number of storage nodes), each of which affects overall application performance. This paper presents our progress on supporting storage-system configuration decisions for workflow applications. Our approach accelerates exploration of the configuration space using a low-cost performance predictor that estimates the turn-around time of a workflow application in a given setup. Our evaluation shows that the predictor is effective at identifying the desired system configuration, and it is lightweight, using 2000-5000× fewer resources (machines × time) than running the actual benchmarks.
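The overall pattern, sweeping a configuration space with a cheap analytical predictor instead of real benchmark runs, looks like the sketch below. The cost model and its constants are entirely made up for illustration; the paper's predictor is more sophisticated:

```python
def predicted_runtime(config, data_mb=1024.0):
    """Hypothetical analytical predictor: turn-around time falls with
    more storage nodes, while replication multiplies write traffic and
    small chunks add per-chunk overhead. Constants are illustrative."""
    nodes, replication, chunk_mb = config
    write_mb = data_mb * replication          # replication amplifies writes
    bandwidth = 100.0 * nodes                 # MB/s, scales with node count
    overhead = 0.01 * (data_mb / chunk_mb)    # fixed cost per chunk
    return write_mb / bandwidth + overhead

def best_config(candidates):
    # Exhaustive search over the (small) configuration space using the
    # cheap predictor instead of running the real benchmark each time.
    return min(candidates, key=predicted_runtime)

candidates = [(nodes, repl, chunk)
              for nodes in (2, 4, 8)
              for repl in (1, 2)
              for chunk in (16, 64)]
choice = best_config(candidates)
```

Because each prediction is a closed-form evaluation rather than a benchmark run, the whole space can be swept in milliseconds, which is the source of the 2000-5000× resource saving the abstract reports.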
Citations: 5