Computing shortest distances is a central task in many graph applications. Since it is impractical to recompute shortest distances from scratch every time the graph changes, many algorithms have been proposed to incrementally maintain shortest distances after edge deletions or insertions. In this paper, we address the problem of maintaining all-pairs shortest distances in dynamic graphs and propose novel, efficient incremental algorithms that work both in main memory and on disk. We prove their correctness and provide complexity analyses. Experimental results on real-world datasets show that current main-memory algorithms soon become impractical, that disk-based algorithms are needed for larger graphs, and that our approach significantly outperforms state-of-the-art algorithms.
{"title":"Efficient Maintenance of All-Pairs Shortest Distances","authors":"S. Greco, Cristian Molinaro, Chiara Pulice","doi":"10.1145/2949689.2949713","DOIUrl":"https://doi.org/10.1145/2949689.2949713","url":null,"abstract":"Computing shortest distances is a central task in many graph applications. Since it is impractical to recompute shortest distances from scratch every time the graph changes, many algorithms have been proposed to incrementally maintain shortest distances after edge deletions or insertions. In this paper, we address the problem of maintaining all-pairs shortest distances in dynamic graphs and propose novel efficient incremental algorithms, working both in main memory and on disk. We prove their correctness and provide complexity analyses. Experimental results on real-world datasets show that current main-memory algorithms become soon impractical, disk-based ones are needed for larger graphs, and our approach significantly outperforms state-of-the-art algorithms.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117158716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The analysis of social media data poses several challenges: first, the data sets are very large; second, they change constantly; and third, they are heterogeneous, consisting of text, images, geographic locations, and social connections. In this article, we focus on detecting events consisting of text and location information, and introduce an analysis method that is scalable with respect to both volume and velocity. Through efficient normalization, we also address the problems that differences in social media adoption across cultures, languages, and countries pose for event detection. We introduce an algorithm capable of processing vast amounts of data using a scalable online approach based on the SigniTrend event detection system, which identifies unusual geo-textual patterns in the data stream without requiring the user to specify any constraints in advance, such as hashtags to track. In contrast to earlier work, we are able to monitor every word at every location with only a fixed amount of memory, compare the values to statistics from earlier data, and immediately report significant deviations with minimal delay. The algorithm is thus capable of reporting "Breaking News" in real time. Location is modeled using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at city, regional, and global levels simultaneously. The usefulness of the approach is demonstrated on several real-world use cases with Twitter data.
{"title":"SPOTHOT: Scalable Detection of Geo-spatial Events in Large Textual Streams","authors":"Erich Schubert, Michael Weiler, H. Kriegel","doi":"10.1145/2949689.2949699","DOIUrl":"https://doi.org/10.1145/2949689.2949699","url":null,"abstract":"The analysis of social media data poses several challenges: first of all, the data sets are very large, secondly they change constantly, and third they are heterogeneous, consisting of text, images, geographic locations and social connections. In this article, we focus on detecting events consisting of text and location information, and introduce an analysis method that is scalable both with respect to volume and velocity. We also address the problems arising from differences in adoption of social media across cultures, languages, and countries in our event detection by efficient normalization. We introduce an algorithm capable of processing vast amounts of data using a scalable online approach based on the SigniTrend event detection system, which is able to identify unusual geo-textual patterns in the data stream without requiring the user to specify any constraints in advance, such as hashtags to track: In contrast to earlier work, we are able to monitor every word at every location with just a fixed amount of memory, compare the values to statistics from earlier data and immediately report significant deviations with minimal delay. Thus, this algorithm is capable of reporting \"Breaking News\" in real-time. Location is modeled using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at city, regional, and global levels at the same time. The usefulness of the approach is demonstrated using several real-world example use cases using Twitter data.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127376516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-match is a central operation in astronomic databases for integrating multiple catalogs of celestial objects. With the rapid development of new astronomy projects, large numbers of astronomic catalogs are generated and require fast cross-matching with existing databases. In this paper, we propose to adopt a Multi-Assignment Single Join (MASJ) method for cross-match on heterogeneous clusters that consist of both CPUs and GPUs. We chose MASJ for cross-match because (1) cross-matching records from astronomic catalogs is essentially a spatial distance join on two sets of points, and (2) each reference point is mapped to only a small number of search intervals. As a result, the MASJ cross-match algorithm (MASJ-CM) is feasible and highly efficient in a heterogeneous cluster environment. We have implemented MASJ-CM in two packages: one is an MPI-CUDA implementation, which fully utilizes the multi-core CPUs, GPUs, and InfiniBand communication; the other runs on top of the popular distributed computing platform Spark, which greatly simplifies the programming. Our results on a six-node CPU-GPU cluster show that the MPI-CUDA implementation achieved a speedup of 2.69 times over a previous indexed nested-loop join algorithm. The Spark-based implementation was an order of magnitude slower than the MPI-CUDA one; nevertheless, it is widely applicable and its source code is much simpler.
{"title":"Multi-Assignment Single Joins for Parallel Cross-Match of Astronomic Catalogs on Heterogeneous Clusters","authors":"Xiaoying Jia, Qiong Luo","doi":"10.1145/2949689.2949705","DOIUrl":"https://doi.org/10.1145/2949689.2949705","url":null,"abstract":"Cross-match is a central operation in astronomic databases to integrate multiple catalogs of celestial objects. With the rapid development of new astronomy projects, large amounts of astronomic catalogs are generated and require fast cross-match with existing databases. In this paper, we propose to adopt a Multi-Assignment Single Join (MASJ) method for cross-match on heterogeneous clusters that consist of both CPUs and GPUs. We chose MASJ for cross-match, because (1) cross-matching records from astronomic catalogs is essentially a spatial distance join on two sets of points, and (2) each reference point is mapped to only a small number of search intervals. As a result, the MASJ cross-match, or MASJ-CM algorithm is feasible and highly efficient in a heterogeneous cluster environment. We have implemented MASJ-CM in two packages: one is an MPI-CUDA implementation, which fully utilizes the multi-core CPUs, GPUs, and InfiniBand communications; the other is on top of the popular distributed computing platform Spark, which greatly simplifies the programming. Our results on a six-node CPU-GPU cluster show that the MPI-CUDA implementation achieved a speedup of 2.69 times over a previous indexed nested-loop join algorithm. The Spark-based implementation was an order of magnitude slower than the MPI-CUDA; nevertheless, it is widely applicable and its source code much simpler.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116619436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we show how skylines can be used to improve the stable matching algorithm with asymmetric preference sets for men and women. The skyline set of men (or women) in a dataset comprises those who are not worse off in all qualities in comparison to any other man (or woman). We prove that if a man in the skyline set is matched with a woman in the skyline set, the resulting pair is stable. Based on this observation, we design our algorithm, SMS, which runs the matching algorithm in phases considering only the skyline sets. In addition to being efficient, SMS provides two important properties. The first is progressiveness: stable pairs are output without waiting for the entire algorithm to finish. The second is balance in quality between men and women, since the proposers are switched automatically between the sets. Empirical results show that SMS runs orders of magnitude faster than the original Gale-Shapley algorithm and produces better-quality matchings.
{"title":"SMS: Stable Matching Algorithm using Skylines","authors":"Rohit Anurag, Arnab Bhattacharya","doi":"10.1145/2949689.2949712","DOIUrl":"https://doi.org/10.1145/2949689.2949712","url":null,"abstract":"In this paper we show how skylines can be used to improve the stable matching algorithm with asymmetric preference sets for men and women. The skyline set of men (or women) in a dataset comprises of those who are not worse off in all the qualities in comparison to another man (or woman). We prove that if a man in the skyline set is matched with a woman in the skyline set, the resulting pair is stable. We design our algorithm, SMS, based on the above observation by running the matching algorithm in phases considering only the skyline sets. In addition to being efficient, SMS provides two important additional properties. The first is progressiveness where stable pairs are output without waiting for the entire algorithm to finish. The second is balance in quality between men versus women since the proposers are switched automatically between the sets. Empirical results show that SMS runs orders of magnitude faster than the original Gale-Shapley algorithm and produces better quality matchings.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114478817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The big data paradigm is currently the leading paradigm for data production and management. Indeed, new information is generated at high rates in specialized fields (e.g., the cybersecurity scenario). As a consequence, the events to be studied may occur at rates too fast to be effectively analyzed in real time. For example, in order to detect possible security threats, millions of records in a high-speed flow stream must be screened. A viable solution to this problem is the use of data compression to reduce the amount of data to be analyzed. In this paper we propose the use of privacy-preserving histograms, which provide approximate answers to 'safe' queries, for analyzing data in the cybersecurity scenario without compromising individuals' privacy, and we describe our system, which has been used in a real-life scenario.
{"title":"Privacy or Security?: Take A Look And Then Decide","authors":"Bettina Fazzinga, F. Furfaro, E. Masciari, G. Mazzeo","doi":"10.1145/2949689.2949706","DOIUrl":"https://doi.org/10.1145/2949689.2949706","url":null,"abstract":"Big data paradigm is currently the leading paradigm for data production and management. As a matter of fact, new information are generated at high rates in specialized fields (e.g., cybersecurity scenario). This may cause that the events to be studied occur at rates that are too fast to be effectively analyzed in real time. For example, in order to detect possible security threats, millions of records in a high-speed flow stream must be screened. To ameliorate this problem, a viable solution is the use of data compression for reducing the amount of data to be analyzed. In this paper we propose the use of privacy-preserving histograms, that provide approximate answers to 'safe' queries, for analyzing data in the cybersecurity scenario without compromising individuals' privacy, and we describe our system that has been used in a real life scenario.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122482624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compact data structures combine, in a single data structure, a compressed representation of the data and the structures needed to access it. The goal is to manage data directly in compressed form and, in this way, keep data compressed at all times, even in main memory. This yields two benefits: we can manage larger datasets in main memory, and we make better use of the memory hierarchy. In this work, we present a compact data structure to represent raster data, which is commonly used in geographical information systems to represent attributes of the space (e.g., temperatures or elevation measures). The proposed data structure not only represents a dataset in compressed form and accesses an individual datum without decompressing the dataset from the beginning, but also indexes its content, and is thus capable of speeding up queries. There have been previous attempts to represent raster data using compact data structures, which work well when the raster dataset has few distinct values. However, when the range of possible values increases, performance degrades in both space and time. Our new method competes with previous approaches in the first scenario, but scales much better when the number of distinct values and the size of the dataset increase, which is critical when applied to real datasets.
{"title":"Compact and queryable representation of raster datasets","authors":"Susana Ladra, J. Paramá, Fernando Silva-Coira","doi":"10.1145/2949689.2949710","DOIUrl":"https://doi.org/10.1145/2949689.2949710","url":null,"abstract":"Compact data structures combine in a unique data structure a compressed representation of the data and the structures to access such data. The target is to be able to manage data directly in compressed form, and in this way, to keep data always compressed, even in main memory. With this, we obtain two benefits: we can manage larger datasets in main memory and we take advantage of a better usage of the memory hierarchy. In this work, we present a compact data structure to represent raster data, which is commonly used in geographical information systems to represent attributes of the space (i.e., temperatures, elevation measures, etc.). The proposed data structure is not only able to represent a dataset in compressed form and access to an individual datum without decompressing the dataset from the beginning, but it also indexes its content, and thus it is capable of speeding up queries. There have been previous attempts to represent raster data using compact data structures, which work well when the raster dataset has few different values. However, when the range of possible values increases, performance in both space and time degrades. Our new method competes with previous approaches in that first scenario, but scales much better when the number of different values and the size of the dataset increase, which is critical when applying over real datasets.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"155 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116626593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile participatory sensing could be used in many applications, such as vehicular traffic monitoring, pollution tracking, or even health surveying. However, its success depends on finding a solution for querying large numbers of users that protects user location privacy and works in real time. This paper presents PAMPAS, a privacy-aware mobile distributed system for efficient data aggregation in mobile participatory sensing. In PAMPAS, mobile devices enhanced with secure hardware, called secure probes (SPs), perform distributed query processing while preventing users from accessing other users' data. A supporting server infrastructure (SSI) coordinates the inter-SP communication and the computation tasks executed on SPs. PAMPAS ensures that the SSI cannot link the locations reported by SPs to user identities, even if the SSI has additional background information. Beyond its novel system architecture, PAMPAS also contributes two new protocols, for privacy-aware location-based aggregation and for adaptive spatial partitioning of SPs, that work efficiently on resource-constrained SPs. Our experimental results and security analysis demonstrate that these protocols can collect the data, aggregate them, and share statistics or derived models in real time, without any location privacy leakage.
{"title":"PAMPAS: Privacy-Aware Mobile Participatory Sensing Using Secure Probes","authors":"Dai Hai Ton That, I. S. Popa, K. Zeitouni, C. Borcea","doi":"10.1145/2949689.2949704","DOIUrl":"https://doi.org/10.1145/2949689.2949704","url":null,"abstract":"Mobile participatory sensing could be used in many applications such as vehicular traffic monitoring, pollution tracking, or even health surveying. However, its success depends on finding a solution for querying large numbers of users which protects user location privacy and works in real-time. This paper presents PAMPAS, a privacy-aware mobile distributed system for efficient data aggregation in mobile participatory sensing. In PAMPAS, mobile devices enhanced with secure hardware, called secure probes (SPs), perform distributed query processing, while preventing users from accessing other users' data. A supporting server infrastructure (SSI) coordinates the inter-SP communication and the computation tasks executed on SPs. PAMPAS ensures that SSI cannot link the location reported by SPs to the user identities even if SSI has additional background information. In addition to its novel system architecture, PAMPAS also proposes two new protocols for privacy-aware location-based aggregation and adaptive spatial partitioning of SPs that work efficiently on resource-constrained SPs. Our experimental results and security analysis demonstrate that these protocols are able to collect the data, aggregate them, and share statistics or derived models in real-time, without any location privacy leakage.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134186366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, there has been increasing interest in analyzing scientific data generated by observations and experiments. To manage these data efficiently, SciDB, a multi-dimensional array-based DBMS, has been proposed. When SciDB processes a query with WHERE predicates, it uses the filter operator internally to produce a result array that matches the predicates. Most queries for scientific data analysis utilize spatial information. However, SciDB's filter operator reads all data, taking into account neither the features of array-based DBMSs nor spatial information. In this demo, we present an efficient query processing scheme that exploits the characteristics of array-based data, implemented by employing coordinates. It uses a selective scan that retrieves only the data in ranges satisfying the given conditions. In our experiments, the selective scan is up to 30x faster than the original scan. We demonstrate that our implementation of the filter operator reduces the processing time of a selection query significantly and enables SciDB to handle massive amounts of scientific data in a more scalable manner.
{"title":"Selective Scan for Filter Operator of SciDB","authors":"Sangchul Kim, Seoung Gook Sohn, Taehoon Kim, Jinseon Yu, Bogyeong Kim, Bongki Moon","doi":"10.1145/2949689.2949707","DOIUrl":"https://doi.org/10.1145/2949689.2949707","url":null,"abstract":"Recently there has been an increasing interest in analyzing scientific data generated by observations and scientific experiments. For managing these data efficiently, SciDB, a multi-dimensional array-based DBMS, is suggested. When SciDB processes a query with where predicates, it uses filter operator internally to produce a result array that matches the predicates. Most queries for scientific data analysis utilize spatial information. However, filter operator of SciDB reads all data without considering features of array-based DBMSs and spatial information. In this demo, we present an efficient query processing scheme utilizing characteristics of array-based data, implemented by employing coordinates. It uses a selective scan that retrieves data corresponding to a range that satisfies specific conditions. In our experiments, the selective scan is up to 30x faster than the original scan. We demonstrate that our implementation of the filter operator will reduce the processing time of a selection query significantly and enable SciDB to handle a massive amount of scientific data in more scalable manner.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126734672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data scientists rely on vector-based scripting languages such as R, Python, and MATLAB to perform ad-hoc data analysis on potentially large data sets. These languages are only efficient when data is processed using vectorized or bulk operations. At the same time, the overwhelming volume and variety of data, as well as parsing overhead, suggest that using specialized analytical data management systems would be beneficial. Data might also already be stored in a database. Efficiently executing data analysis programs, such as data mining, directly inside a database greatly improves analysis efficiency. We investigate how these vector-based languages can be efficiently integrated into the processing model of operator-at-a-time databases. We present MonetDB/Python, a new system that combines the open-source database MonetDB with the vector-based language Python. In our evaluation, we demonstrate efficiency gains of orders of magnitude.
{"title":"Vectorized UDFs in Column-Stores","authors":"Mark Raasveldt, H. Mühleisen","doi":"10.1145/2949689.2949703","DOIUrl":"https://doi.org/10.1145/2949689.2949703","url":null,"abstract":"Data Scientists rely on vector-based scripting languages such as R, Python and MATLAB to perform ad-hoc data analysis on potentially large data sets. When facing large data sets, they are only efficient when data is processed using vectorized or bulk operations. At the same time, overwhelming volume and variety of data as well as parsing overhead suggests that the use of specialized analytical data management systems would be beneficial. Data might also already be stored in a database. Efficient execution of data analysis programs such as data mining directly inside a database greatly improves analysis efficiency. We investigate how these vector-based languages can be efficiently integrated in the processing model of operator--at--a--time databases. We present MonetDB/Python, a new system that combines the open-source database MonetDB with the vector-based language Python. In our evaluation, we demonstrate efficiency gains of orders of magnitude.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126500811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data exchange is one of the oldest database problems, of both practical and theoretical interest. Given the pace at which heterogeneous data are published on the web, thanks to initiatives such as Linked Data and Open Science, the scalability of data exchange becomes crucial. Pivotal to data exchange is the chase, a fixpoint algorithm that evaluates both source-to-target constraints and target constraints in the data exchange process. In this paper, we investigate how new programming models such as MapReduce can be used to implement the chase on large-scale data sources. To the best of our knowledge, how to exchange data at scale has not been investigated so far. We present an initial solution for chasing source-to-target tuple-generating dependencies and target tuple-generating dependencies, and discuss open issues that need to be addressed to leverage MapReduce for the data exchange problem.
{"title":"Data Exchange with MapReduce: A First Cut","authors":"Khalid Belhajjame, A. Bonifati","doi":"10.1145/2949689.2949702","DOIUrl":"https://doi.org/10.1145/2949689.2949702","url":null,"abstract":"Data exchange is one of the oldest database problems, being of both practical and theoretical interest. Given the pace at which heterogeneous data are published on the web, thanks to initiatives such as Linked Data and Open Science, scalability of data exchange becomes crucial. Pivotal to data exchange is the chase algorithm, which is a fixpoint algorithm to evaluate both source-to-target constraints and target constraints in the data exchange process. In this paper, we investigate how new programming models such as MapReduce can be used to implement the chase on large-scale data sources. To the best of our knowledge, how to exchange data at scale has not been investigated so far. We present an initial solution for chasing source-to-target tuple generating dependencies and target tuple-generating dependencies, and discuss open issues that need to be addressed to leverage MapReduce for the data exchange problem.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133339056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}