Optimize Parallel Data Access in Big Data Processing

Jiangling Yin, Jun Wang
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 721-724. Published 2015-05-04. DOI: 10.1109/CCGrid.2015.168 (https://doi.org/10.1109/CCGrid.2015.168)
Citations: 3

Abstract

In recent years, the Hadoop Distributed File System (HDFS) has been deployed as the bedrock for many parallel big data processing systems, such as graph processing systems, MPI-based parallel programs, and Scala/Java-based Spark frameworks, which can efficiently support iterative and interactive data analysis in memory. The first part of my dissertation mainly focuses on studying parallel data access on distributed file systems, e.g., HDFS. Since the distributed I/O resources and the global data distribution are often not taken into consideration, the data requests from parallel processes/executors tend to be served in a remote or imbalanced fashion on the storage servers. To address these problems, we develop I/O middleware systems and matching-based algorithms that map parallel data requests to storage servers such that local and balanced data access can be achieved. The last part of my dissertation presents our plans to improve the performance of interactive data access in big data analysis. Specifically, most interactive analysis programs scan through the entire data set regardless of which data is actually required. We plan to develop a content-aware method to quickly access the required data without this laborious scanning process.
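To make the idea of matching-based request mapping concrete, the following is a minimal sketch and not the authors' middleware or algorithm. It assumes a simplified model in which each chunk has a known set of replica nodes and each process runs on one node; the function name `assign_chunks` and both input dictionaries are hypothetical. It uses augmenting-path bipartite matching with a per-process quota, so that as many chunks as possible are read locally while no process receives more than its balanced share.

```python
from math import ceil


def assign_chunks(chunk_replicas, proc_node):
    """Map chunks to processes, preferring local reads under a balanced quota.

    chunk_replicas: dict chunk -> set of nodes holding a replica (assumed input).
    proc_node:      dict process -> node the process runs on (assumed input).
    Returns a dict chunk -> process.
    """
    procs = list(proc_node)
    # Balanced share: no process should receive more than this many chunks.
    quota = ceil(len(chunk_replicas) / len(procs))
    # For each chunk, the processes that could read it from their own node.
    local = {
        c: [p for p in procs if proc_node[p] in nodes]
        for c, nodes in chunk_replicas.items()
    }
    held = {p: [] for p in procs}  # chunks currently matched to each process

    def try_place(c, seen):
        """Try to give chunk c a local reader, relocating other chunks if needed."""
        for p in local[c]:
            if p in seen:
                continue
            seen.add(p)
            if len(held[p]) < quota:
                held[p].append(c)
                return True
            # p is full: see whether one of its chunks can move to another
            # local reader, which frees a slot for c (an augmenting path).
            for i, other in enumerate(held[p]):
                if try_place(other, seen):
                    held[p][i] = c
                    return True
        return False

    unplaced = []
    for c in chunk_replicas:
        if not try_place(c, set()):
            unplaced.append(c)

    # Chunks with no available local reader are read remotely by the
    # currently least-loaded process, which keeps the load within the quota.
    for c in unplaced:
        p = min(procs, key=lambda q: len(held[q]))
        held[p].append(c)

    return {c: p for p, cs in held.items() for c in cs}
```

For example, with chunks `{"b": {"n1", "n2"}, "a": {"n1"}, "c": {"n3"}}` and processes `p1`, `p2`, `p3` on nodes `n1`, `n2`, `n3`, chunk `b` is first matched to `p1` and then moved to `p2` so that `a`, which has only one replica, can still be read locally. A real system would also have to account for server-side I/O contention, which this sketch ignores.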