{"title":"通过递归分区实现近最优分布式带状连接","authors":"Rundong Li, Wolfgang Gatterbauer, Mirek Riedewald","doi":"10.1145/3318464.3389750","DOIUrl":null,"url":null,"abstract":"<p><p>We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between <i>maximum load per worker</i> and <i>input duplication</i> for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that <i>recursive partitioning of the join-attribute space</i> with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the <i>lower bound</i> for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.</p>","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7872589/pdf/nihms-1666242.pdf","citationCount":"0","resultStr":"{\"title\":\"Near-Optimal Distributed Band-Joins through Recursive Partitioning.\",\"authors\":\"Rundong Li, Wolfgang Gatterbauer, Mirek Riedewald\",\"doi\":\"10.1145/3318464.3389750\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between <i>maximum load per worker</i> and <i>input duplication</i> for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that <i>recursive partitioning of the join-attribute space</i> with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the <i>lower bound</i> for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.</p>\",\"PeriodicalId\":87344,\"journal\":{\"name\":\"Proceedings. ACM-SIGMOD International Conference on Management of Data\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7872589/pdf/nihms-1666242.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. ACM-SIGMOD International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3318464.3389750\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. ACM-SIGMOD International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3318464.3389750","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Near-Optimal Distributed Band-Joins through Recursive Partitioning.
We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.