{"title":"Session details: Research 4: Query Processing","authors":"H. Pirk","doi":"10.1145/3258008","DOIUrl":"https://doi.org/10.1145/3258008","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87595786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Processing Algorithms for Influence Maximization","authors":"Jing Tang, Xueyan Tang, Xiaokui Xiao, Junsong Yuan","doi":"10.1145/3183713.3183749","DOIUrl":"https://doi.org/10.1145/3183713.3183749","url":null,"abstract":"Influence maximization is a classic and extensively studied problem with important applications in viral marketing. Existing algorithms for influence maximization, however, mostly focus on offline processing, in the sense that they do not provide any output to the user until the final answer is derived, and that the user is not allowed to terminate the algorithm early to trade the quality of solution for efficiency. Such lack of interactiveness and flexibility leads to poor user experience, especially when the algorithm incurs long running time. To address the above problem, this paper studies algorithms for online processing of influence maximization (OPIM), where the user can pause the algorithm at any time and ask for a solution (to the influence maximization problem) and its approximation guarantee, and can resume the algorithm to let it improve the quality of solution by giving it more time to run. (This interactive paradigm is similar in spirit to online query processing in database systems.) We show that the only existing algorithm for OPIM is vastly ineffective in practice, and that adopting existing influence maximization methods for OPIM yields unsatisfactory results. Motivated by this, we propose a new algorithm for OPIM with both superior empirical effectiveness and strong theoretical guarantees, and we show that it can also be extended to handle conventional influence maximization. Extensive experiments on real data demonstrate that our solutions outperform the state of the art for both OPIM and conventional influence maximization.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86170758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Optimization of Very Large Join Queries","authors":"Thomas Neumann, Bernhard Radke","doi":"10.1145/3183713.3183733","DOIUrl":"https://doi.org/10.1145/3183713.3183733","url":null,"abstract":"The use of business intelligence tools and other means to generate queries has led to great variety in the size of join queries. While most queries are reasonably small, join queries with up to a hundred relations are not that exotic anymore, and the distribution of query sizes has an incredibly long tail. The largest real-world query that we are aware of accesses more than 4,000 relations. This large spread makes query optimization very challenging. Join ordering is known to be NP-hard, which means that we cannot hope to solve such large problems exactly. On the other hand, most queries are much smaller, and there is no reason to sacrifice optimality there. This paper introduces an adaptive optimization framework that is able to solve most common join queries exactly, while simultaneously scaling to queries with thousands of joins. A key component there is a novel search space linearization technique that leads to near-optimal execution plans for large classes of queries. In addition, we describe implementation techniques that are necessary to scale join ordering algorithms to these extremely large queries. Extensive experiments with over 10 different approaches show that the new adaptive approach proposed here performs excellently over a huge spectrum of query sizes, and produces optimal or near-optimal solutions for most common queries.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88569765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate Triangle Count and Clustering Coefficient","authors":"Siddharth Bhatia","doi":"10.1145/3183713.3183715","DOIUrl":"https://doi.org/10.1145/3183713.3183715","url":null,"abstract":"Two important metrics used to characterise a graph are its triangle count and clustering coefficient. In this paper, we present methods to approximate these metrics for graphs.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89382566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Big Data Linkage for Product Specification Pages","authors":"Disheng Qiu, Luciano Barbosa, Valter Crescenzi, P. Merialdo, D. Srivastava","doi":"10.1145/3183713.3183757","DOIUrl":"https://doi.org/10.1145/3183713.3183757","url":null,"abstract":"An increasing number of product pages are available from thousands of web sources, each page associated with a product, containing its attributes and one or more product identifiers. The sources provide overlapping information about the products, using diverse schemas, making web-scale integration extremely challenging. In this paper, we take advantage of the opportunity that sources publish product identifiers to perform big data linkage across sources at the beginning of the data integration pipeline, before schema alignment. To realize this opportunity, several challenges need to be addressed: identifiers need to be discovered on product pages, made difficult by the diversity of identifiers; the main product identifier on the page needs to be identified, made difficult by the many related products presented on the page; and identifiers across pages need to be resolved, made difficult by the ambiguity between identifiers across product categories. We present our RaF (Redundancy as Friend) solution to the problem of big data linkage for product specification pages, which takes advantage of the redundancy of identifiers at a global level, and the homogeneity of structure and semantics at the local source level, to effectively and efficiently link millions of pages of head and tail products across thousands of head and tail sources. We perform a thorough empirical evaluation of our RaF approach using the publicly available Dexter dataset consisting of 1.9M product pages from 7.1k sources of 3.5k websites, and demonstrate its effectiveness in practice.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86953641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transform-Data-by-Example (TDE): Extensible Data Transformation in Excel","authors":"Yeye He, Kris Ganjam, Kukjin Lee, Yue Wang, Vivek R. Narasayya, S. Chaudhuri, Xu Chu, Yudian Zheng","doi":"10.1145/3183713.3193539","DOIUrl":"https://doi.org/10.1145/3183713.3193539","url":null,"abstract":"Business analysts and data scientists today increasingly need to clean, standardize and transform diverse data sets, such as name, address, date time, phone number, etc., before they can perform analysis. These ad-hoc transformation problems are typically solved by one-off scripts, which is both difficult and time-consuming. Our observation is that these domain-specific transformation problems have long been solved by developers with code libraries, which are often shared in places like GitHub. We thus develop an extensible data transformation system called Transform-Data-by-Example (TDE) that can leverage rich transformation logic in source code, DLLs, web services and mapping tables, so that end-users only need to provide a few (typically 3) input/output examples, and TDE can synthesize desired programs using relevant transformation logic from these sources. The beta version of TDE was released in Office Store for Excel.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90281917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Integration and Machine Learning: A Natural Synergy","authors":"X. Dong, Theodoros Rekatsinas","doi":"10.14778/3229863.3229876","DOIUrl":"https://doi.org/10.14778/3229863.3229876","url":null,"abstract":"There is now more data to analyze than ever before. As data volume and variety have increased, so have the ties between machine learning and data integration become stronger. For machine learning to be effective, one must utilize data from the greatest possible variety of sources; and this is why data integration plays a key role. At the same time machine learning is driving automation in data integration, resulting in overall reduction of integration costs and improved accuracy. This tutorial focuses on three aspects of the synergistic relationship between data integration and machine learning: (1) we survey how state-of-the-art data integration solutions rely on machine learning-based approaches for accurate results and effective human-in-the-loop pipelines, (2) we review how end-to-end machine learning applications rely on data integration to identify accurate, clean, and relevant data for their analytics exercises, and (3) we discuss open research challenges and opportunities that span across data integration and machine learning.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90449046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPaxos: Managing Data Closer to Users for Low-Latency and Mobile Applications","authors":"Faisal Nawab, D. Agrawal, A. E. Abbadi","doi":"10.1145/3183713.3196928","DOIUrl":"https://doi.org/10.1145/3183713.3196928","url":null,"abstract":"In this paper, we propose Dynamic Paxos (DPaxos), a Paxos-based consensus protocol to manage access to partitioned data across globally-distributed datacenters and edge nodes. DPaxos is intended to implement a State Machine Replication component in data management systems for the edge. DPaxos targets the unique opportunities of utilizing edge computing resources to support emerging applications with stringent mobility and real-time requirements such as Augmented and Virtual Reality and vehicular applications. The main objective of DPaxos is to reduce the latency of serving user requests, recovering from failures, and reacting to mobility. DPaxos achieves these objectives by a few proposed changes to the traditional Paxos protocol. Most notably, DPaxos proposes a dynamic allocation of quorums (i.e., groups of nodes) that are needed for Paxos Leader Election. Leader Election quorums in DPaxos are smaller than traditional Paxos and expand only in the presence of conflicts.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76561711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Special Session: A Technical Research Agenda in Data Ethics and Responsible Data Management","authors":"Julia Stoyanovich, Bill Howe, H. Jagadish","doi":"10.1145/3183713.3205185","DOIUrl":"https://doi.org/10.1145/3183713.3205185","url":null,"abstract":"SESSION DESCRIPTION Recently, there has begun a movement towards fairness, accountability, and transparency (FAT) in algorithmic decision making, and in data science more broadly [1–4]. The database community has not been significantly involved in this movement, despite “owning” the models, languages, and systems that produce the input to the machine learning applications that are often the focus in data science. If training data are biased, or have errors, it stands to reason that the algorithmic result will also be unfair or erroneous. Similarly, transparency of just the algorithm is usually insufficient to understand why certain results were obtained: one needs also to know the data used. In short, FAT depend not just on the algorithm, but also on the data. This observation raises several important questions: What are the core data management issues to which the objectives of fairness, accountability and transparency give rise? What role should the database community play in this movement? Will emphasis on these topics dilute our core competency in techniques and technologies for data, or can it reinforce our central role in technology stacks ranging from startups to the enterprise, and from local non-profits to the federal government? This special session features leading researchers from machine learning, software engineering, security and privacy, and natural language processing, who are doing exciting technical work in FAT. The goal of this session is to outline a technical research agenda in data management foundations and systems around data ethics.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78227358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Industry 1: Adaptive Query Processing","authors":"J. Dittrich","doi":"10.1145/3258006","DOIUrl":"https://doi.org/10.1145/3258006","url":null,"abstract":"","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78394440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}