Method for accelerating the joining of distributed datasets by a given criterion

Yevgeniya Tyryshkina, S. Tumkovskiy
Journal: Informatsionno-Upravliaiushchie Sistemy
Publication date: 2022-10-28
Publication type: Journal Article
DOI: 10.31799/1684-8853-2022-5-2-11 (https://doi.org/10.31799/1684-8853-2022-5-2-11)
Citations: 0

Abstract

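Introduction: Rapidly growing volumes of information pose new challenges for modern data analysis technologies. For reasons of cost and performance, data processing is now usually performed on cluster systems. One of the most common operations in analytics is the join of datasets. A join is an extremely expensive operation that is difficult to scale and to make efficient in distributed databases and in systems based on the MapReduce paradigm. Although considerable effort has gone into improving its performance, the proposed methods often either require fundamental changes to the MapReduce structure or only reduce the overhead of the operation, for example by balancing the load on the network.

Objective: To develop an algorithm that accelerates the joining of datasets in distributed systems.

Results: The Apache Spark architecture and the features of MapReduce-based distributed computing are reviewed, typical methods for joining datasets are analyzed, and the main recommendations for optimizing the join operation are presented. An algorithm is presented that speeds up a special case of the join as implemented in Apache Spark. The algorithm partitions the datasets and transfers them in part to the computing nodes of the cluster, so as to combine the advantages of the merge join and the broadcast join. The experimental data show that the larger the volume of input data, the more effective the method is: for 2 Tb of compressed data, a speedup of up to ~37% was obtained in comparison with standard Spark SQL.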
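The abstract names two standard join strategies, the broadcast join and the merge join, without giving the algorithm itself. The sketch below only illustrates those two strategies on small in-memory lists so that the difference is concrete; it is not the authors' algorithm and does not use Apache Spark. All names (`broadcast_hash_join`, `sort_merge_join`, the sample rows) are hypothetical and chosen for illustration, and each inner list of `large_partitions` merely stands in for the data held by one cluster node.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]          # (join key, payload)
Joined = Tuple[int, str, str]  # (join key, left payload, right payload)


def broadcast_hash_join(large_partitions: List[List[Row]],
                        small: List[Row]) -> List[Joined]:
    """Inner join where every partition of the large side probes a
    full copy of the small side (no movement of the large side)."""
    # Build the hash table once; in a cluster it would be copied to each node.
    lookup: Dict[int, List[str]] = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)

    out: List[Joined] = []
    for partition in large_partitions:  # one iteration stands in for one node
        for key, left_value in partition:
            for right_value in lookup.get(key, []):
                out.append((key, left_value, right_value))
    return out


def sort_merge_join(left: List[Row], right: List[Row]) -> List[Joined]:
    """Inner join that sorts both sides by key and merges them in one pass."""
    left_sorted = sorted(left, key=lambda row: row[0])
    right_sorted = sorted(right, key=lambda row: row[0])

    out: List[Joined] = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        left_key, right_key = left_sorted[i][0], right_sorted[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left_sorted) and left_sorted[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right_sorted) and right_sorted[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left_sorted[i:i_end]:
                for _, right_value in right_sorted[j:j_end]:
                    out.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return out


# Hypothetical sample data: a "large" side split over two nodes, a "small" side.
orders_partitions = [[(1, "o1"), (2, "o2")], [(2, "o3"), (4, "o4")]]
customers = [(1, "alice"), (2, "bob"), (3, "carol")]

all_orders = [row for partition in orders_partitions for row in partition]
broadcast_result = broadcast_hash_join(orders_partitions, customers)
merge_result = sort_merge_join(all_orders, customers)
```

Both functions return the same set of joined rows; they differ in what has to move. The broadcast variant copies the small side to every node and leaves the large side in place, while the merge variant needs both sides ordered by key, which in a cluster generally means repartitioning them first.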

Source journal: Informatsionno-Upravliaiushchie Sistemy (Mathematics: Control and Optimization)
CiteScore: 1.40
Self-citation rate: 0.00%
Articles published: 35