The vast majority of phylogenetic databases do not support a declarative query platform through which their contents can be flexibly and conveniently accessed. The template-based query interfaces they offer do not allow arbitrary speculative queries. While a small number of graph query languages such as XQuery, Cypher and GraphQL exist for computer-savvy users, most are too general and complex to be useful for biologists, and too inefficient for querying large phylogenies. In this paper, we discuss a recently introduced visual query language, called PhyQL, that leverages phylogeny-specific properties to support essential and powerful constructs for a large class of phylogenetic queries. Its deductive-reasoner-based implementation opens opportunities for a wide range of pruning strategies that speed up processing through query-specific optimization, making it suitable for querying large phylogenies. A hybrid optimization technique that exploits a set of indices and "graphlet" partitioning is discussed. A "fail soonest" strategy is used to avoid hopeless processing and is shown to pay dividends.
{"title":"Pruning Forests to Find the Trees","authors":"H. Jamil","doi":"10.1145/2949689.2949697","DOIUrl":"https://doi.org/10.1145/2949689.2949697","url":null,"abstract":"The vast majority of phylogenetic databases do not support a declarative querying platform using which their contents can be flexibly and conveniently accessed. The template based query interfaces they support do not allow arbitrary speculative queries. While a small number of graph query languages such as XQuery, Cypher and GraphQL exist for computer savvy users, most are too general and complex to be useful for biologists, and too inefficient for large phylogeny querying. In this paper, we discuss a recently introduced visual query language, called PhyQL, that leverages phylogeny specific properties to support essential and powerful constructs for a large class of phylogentic queries. Its deductive reasoner based implementation offers opportunities for a wide range of pruning strategies to speed up processing using query specific optimization and thus making it suitable for large phylogeny querying. A hybrid optimization technique that exploits a set of indices and \"graphlet\" partitioning is discussed. A \"fail soonest\" strategy is used to avoid hopeless processing and is shown to produce dividends.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125113336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many real-world decision problems involve solving optimization problems based on data in an SQL database. Traditionally, solving such problems requires combining a DBMS with optimization software packages for each required class of problems (e.g., linear and constraint programming), leading to workflows that are cumbersome, complex, inefficient, and error-prone. In this paper, we present SolveDB, a DBMS for optimization applications. SolveDB supports solvers for different problem classes and offers seamless data management and optimization problem solving in a pure SQL-based setting, allowing much simpler and more effective solutions to database-based optimization problems. SolveDB is based on the 3-level ANSI/SPARC architecture and allows formulating, solving, and analysing solutions of optimization problems using a single so-called solve query. SolveDB provides (1) an SQL-based syntax for optimization problems, (2) an extensible infrastructure for integrating different solvers, and (3) query optimization techniques to achieve the best execution performance and/or result quality. Extensive experiments with the PostgreSQL-based implementation show that SolveDB is a versatile tool offering much higher developer productivity and an order of magnitude better performance for specification-complex and data-intensive problems.
{"title":"SolveDB: Integrating Optimization Problem Solvers Into SQL Databases","authors":"Laurynas Siksnys, T. Pedersen","doi":"10.1145/2949689.2949693","DOIUrl":"https://doi.org/10.1145/2949689.2949693","url":null,"abstract":"Many real-world decision problems involve solving optimization problems based on data in an SQL database. Traditionally, solving such problems requires combining a DBMS with optimization software packages for each required class of problems (e.g. linear and constraint programming) -- leading to workflows that are cumbersome, complex, inefficient, and error-prone. In this paper, we present SolveDB - a DBMS for optimization applications. SolveDB supports solvers for different problem classes and offers seamless data management and optimization problem solving in a pure SQL-based setting. This allows for much simpler and more effective solutions of database-based optimization problems. SolveDB is based on the 3-level ANSI/SPARC architecture and allows formulating, solving, and analysing solutions of optimization problems using a single so-called solve query. SolveDB provides (1) an SQL-based syntax for optimization problems, (2) an extensible infrastructure for integrating different solvers, and (3) query optimization techniques to achieve the best execution performance and/or result quality. Extensive experiments with the PostgreSQL-based implementation show that SolveDB is a versatile tool offering much higher developer productivity and order of magnitude better performance for specification-complex and data-intensive problems.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130118839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the problem of similarity search in a set of top-k lists under the generalized Kendall's Tau distance. This distance describes how related two rankings are in terms of discordantly ordered items. We consider pair- and triplet-based indices to counter the shortcomings of naive inverted indices and derive efficient query schemes by relating the proposed index structures to the concept of locality-sensitive hashing (LSH). Specifically, we devise four different LSH schemes for Kendall's Tau using two generic hash families over individual elements or pairs of them. We show that each of these functions has the desired property of being locality sensitive. Further, we discuss the selection of hash functions for the proposed LSH schemes for a given query ranking, called query-driven LSH, and derive bounds on the number of hash functions required to achieve a predefined recall goal. Experimental results, using two real-world datasets, show that the devised methods outperform the SimJoin method, the state-of-the-art method for querying similar sets, and are far superior to a plain inverted-index-based approach.
{"title":"Efficient Similarity Search across Top-k Lists under the Kendall's Tau Distance","authors":"K. Pal, S. Michel","doi":"10.1145/2949689.2949709","DOIUrl":"https://doi.org/10.1145/2949689.2949709","url":null,"abstract":"We consider the problem of similarity search in a set of top-k lists under the generalized Kendall's Tau distance. This distance describes how related two rankings are in terms of discordantly ordered items. We consider pair- and triplets-based indices to counter the shortcomings of naive inverted indices and derive efficient query schemes by relating the proposed index structures to the concept of locality sensitive hashing (LSH). Specifically, we devise four different LSH schemes for Kendall's Tau using two generic hash families over individual elements or pairs of them. We show that each of these functions has the desired property of being locality sensitive. Further, we discuss the selection of hash functions for the proposed LSH schemes for a given query ranking, called query-driven LSH and derive bounds for the required number of hash functions to use in order to achieve a predefined recall goal. Experimental results, using two real-world datasets, show that the devised methods outperform the SimJoin method---the state of the art method to query for similar sets---and are far superior to a plain inverted-index--based approach.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129061340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential privacy has gained attention from the community as a mechanism for privacy protection. Significant effort has focused on its application to data analysis, where statistical queries are submitted in batch and answers to these queries are perturbed with noise. The magnitude of this noise depends on the privacy parameter ϵ and the sensitivity of the query set. However, computing the sensitivity is known to be NP-hard. In this study, we propose a method that approximates the sensitivity of a query set. Our solution builds a query-region-intersection graph. We prove that computing the maximum clique size of this graph is equivalent to bounding the sensitivity from above. Our bounds, to the best of our knowledge, are the tightest known in the literature. Our solution currently supports a limited but expressive subset of SQL queries (i.e., range queries) and almost all popular aggregate functions directly (except AVERAGE). Experimental results show the efficiency of our approach: even for large query sets (e.g., more than 2K queries over 5 attributes), by utilizing a state-of-the-art solution for the maximum clique problem, we can approximate sensitivity in under a minute.
{"title":"Graph-based modelling of query sets for differential privacy","authors":"Ali Inan, M. E. Gursoy, Emir Esmerdag, Y. Saygin","doi":"10.1145/2949689.2949695","DOIUrl":"https://doi.org/10.1145/2949689.2949695","url":null,"abstract":"Differential privacy has gained attention from the community as the mechanism for privacy protection. Significant effort has focused on its application to data analysis, where statistical queries are submitted in batch and answers to these queries are perturbed with noise. The magnitude of this noise depends on the privacy parameter ϵ and the sensitivity of the query set. However, computing the sensitivity is known to be NP-hard. In this study, we propose a method that approximates the sensitivity of a query set. Our solution builds a query-region-intersection graph. We prove that computing the maximum clique size of this graph is equivalent to bounding the sensitivity from above. Our bounds, to the best of our knowledge, are the tightest known in the literature. Our solution currently supports a limited but expressive subset of SQL queries (i.e., range queries), and almost all popular aggregate functions directly (except AVERAGE). Experimental results show the efficiency of our approach: even for large query sets (e.g., more than 2K queries over 5 attributes), by utilizing a state-of-the-art solution for the maximum clique problem, we can approximate sensitivity in under a minute.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115016939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The aim of data exploration is to get acquainted with an unfamiliar database. Typically, explorers operate by trial and error: they submit a query, study the result, and refine their query subsequently. In this paper, we investigate how to help them understand their query results. In particular, we focus on medium- to high-dimensional spaces: if the database contains dozens or hundreds of columns, which variables should they inspect? We propose to detect subspaces in which the users' selection differs from the rest of the database. From this idea, we built Ziggy, a tuple description engine. Ziggy can detect informative subspaces, and it can explain why it recommends them, with visualizations and natural language. It can cope with mixed data and missing values, and it penalizes redundancy. Our experiments reveal that it is up to an order of magnitude faster than state-of-the-art feature selection algorithms, at minimal accuracy cost.
{"title":"Fast, Explainable View Detection to Characterize Exploration Queries","authors":"Thibault Sellam, M. Kersten","doi":"10.1145/2949689.2949692","DOIUrl":"https://doi.org/10.1145/2949689.2949692","url":null,"abstract":"The aim of data exploration is to get acquainted with an unfamiliar database. Typically, explorers operate by trial and error: they submit a query, study the result, and refine their query subsequently. In this paper, we investigate how to help them understand their query results. In particular, we focus on medium to high dimension spaces: if the database contains dozens or hundreds of columns, which variables should they inspect? We propose to detect subspaces in which the users' selection is different from the rest of the database. From this idea, we built Ziggy, a tuple description engine. Ziggy can detect informative subspaces, and it can explain why it recommends them, with visualizations and natural language. It can cope with mixed data, missing values, and it penalizes redundancy. Our experiments reveal that it is up to an order of magnitude faster than state-of-the-art feature selection algorithms, at minimal accuracy costs.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128270248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Searching a database for similar graphs is a critical task in many scientific applications, such as drug discovery, geoinformatics, and pattern recognition. Typically, graph edit distance is used to estimate the similarity of non-identical graphs, which is a very hard task. Several indexing structures and lower-bound distances have been proposed to prune the search space. Most of them utilize the number of edit operations and assume graphs with a discrete label alphabet that has a certain canonical order. Unfortunately, such assumptions cannot be guaranteed for geometric graphs, where vertices have coordinates in some two-dimensional space. In this paper, we study similarity range queries for geometric graphs with edit distance constraints. First, we propose an efficient index structure to discover similar vertices. For this, we embed the vertices of different graphs into a higher-dimensional space and index them using the well-known R-tree. Second, we propose three lower-bound distances, with different pruning power and complexity, to filter non-similar graphs. Using representative geometric graphs extracted from a variety of application domains, namely chemoinformatics, character recognition, and image analysis, our framework achieved an average pruning rate of 94% with a 77% reduction in response time.
{"title":"Geometric Graph Indexing for Similarity Search in Scientific Databases","authors":"Ayser Armiti, Michael Gertz","doi":"10.1145/2949689.2949691","DOIUrl":"https://doi.org/10.1145/2949689.2949691","url":null,"abstract":"Searching a database for similar graphs is a critical task in many scientific applications, such as in drug discovery, geoinformatics, or pattern recognition. Typically, graph edit distance is used to estimate the similarity of non-identical graphs, which is a very hard task. Several indexing structures and lower bound distances have been proposed to prune the search space. Most of them utilize the number of edit operations and assume graphs with a discrete label alphabet that has a certain canonical order. Unfortunately, such assumptions cannot be guaranteed for geometric graphs where vertices have coordinates in some two dimensional space. In this paper, we study similarity range queries for geometric graphs with edit distance constraints. First, we propose an efficient index structure to discover similar vertices. For this, we embed the vertices of different graphs in a higher dimensional space, which are then indexed using the well-known R-tree. Second, we propose three lower bound distances to filter non-similar graphs with different pruning power and complexity. Using representative geometric graphs extracted from a variety of application domains, namely chemoinformatics, character recognition, and image analysis, our framework achieved on average a pruning performance of 94% with 77% reduction in the response time.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132986611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Triangle listing plays an important role in graph analysis and has numerous graph mining applications. With the rapid growth of graph data, distributed methods for listing triangles over massive graphs are urgently needed. Therefore, the triangle listing problem has been studied in several distributed infrastructures, including MapReduce. However, existing algorithms suffer from generating and shuffling huge amounts of intermediate data, and interestingly, a large percentage of this data is redundant. Inspired by this observation, we present the "Bermuda" method, an efficient MapReduce-based triangle listing technique for massive graphs. Different from existing approaches, Bermuda effectively reduces the size of the intermediate data via redundancy elimination and sharing of messages whenever possible. As a result, Bermuda achieves orders-of-magnitude speedups and enables processing of larger graphs that other techniques fail to process under the same resources. Bermuda exploits the locality of processing, i.e., in which reduce instance each graph vertex will be processed, to avoid redundancy when generating messages from mappers to reducers. Bermuda also introduces novel message sharing techniques within each reduce instance to increase the reusability of the received messages. We present and analyze several reduce-side caching strategies that dynamically learn the expected access patterns of the shared messages and adaptively deploy the appropriate technique for better sharing. Extensive experiments conducted on real-world large-scale graphs show that Bermuda speeds up triangle listing computations by factors of up to 10x. Moreover, with a relatively small cluster, Bermuda can scale up to large datasets, e.g., the ClueWeb graph dataset (688 GB), while other techniques fail to finish.
{"title":"Bermuda: An Efficient MapReduce Triangle Listing Algorithm for Web-Scale Graphs","authors":"Dongqing Xiao, M. Eltabakh, Xiangnan Kong","doi":"10.1145/2949689.2949715","DOIUrl":"https://doi.org/10.1145/2949689.2949715","url":null,"abstract":"Triangle listing plays an important role in graph analysis and has numerous graph mining applications. With the rapid growth of graph data, distributed methods for listing triangles over massive graphs are urgently needed. Therefore, the triangle listing problem has been studied in several distributed infrastructures including MapReduce. However, existing algorithms suffer from generating and shuffling huge amounts of intermediate data, where interestingly, a large percentage of this data is redundant. Inspired by this observation, we present the \"Bermuda\" method, an efficient MapReducebased triangle listing technique for massive graphs. Different from existing approaches, Bermuda effectively reduces the size of the intermediate data via redundancy elimination and sharing of messages whenever possible. As a result, Bermuda achieves orders-of-magnitudes of speedup and enables processing larger graphs that other techniques fail to process under the same resources. Bermuda exploits the locality of processing, i.e., in which reduce instance each graph vertex will be processed, to avoid the redundancy of generating messages from mappers to reducers. Bermuda also proposes novel message sharing techniques within each reduce instance to increase the usability of the received messages. We present and analyze several reduce-side caching strategies that dynamically learn the expected access patterns of the shared messages, and adaptively deploy the appropriate technique for better sharing. Extensive experiments conducted on real-world large-scale graphs show that Bermuda speeds up the triangle listing computations by factors up to 10x. Moreover, with a relatively small cluster, Bermuda can scale up to large datasets, e.g., ClueWeb graph dataset (688GB), while other techniques fail to finish.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"26 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134040556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The efficient computation of set containment joins (SCJ) over set-valued attributes is a well-studied problem with many applications in commercial and scientific fields. Nevertheless, a number of open questions remain: an extensive comparative evaluation is still missing, the two most recent algorithms have not yet been compared to each other, and the exact impact of item sort order and of data properties on algorithm performance is still largely unknown. Furthermore, all previous work considered only sequential join algorithms, although modern servers offer ample opportunities for parallelization. We present PIEJoin, a novel algorithm for computing SCJ based on intersecting prefix trees built at runtime over the to-be-joined attributes. We also present a highly optimized implementation of PIEJoin that uses tree signatures to save space and interval labeling to improve the runtime of the basic method. Most importantly, PIEJoin can be parallelized easily by partitioning the tree intersection. A comprehensive evaluation on eight datasets shows that PIEJoin, already in its sequential form, clearly outperforms two of the three most important competitors (PRETTI and PRETTI+). It is mostly, yet not always, slower than the third, LIMIT+(opj), but requires significantly less space. The parallel version of PIEJoin we present here achieves significant further speed-ups, yet our evaluation also shows that further research is needed, as finding the best way to partition the join turns out to be non-trivial.
{"title":"PIEJoin: Towards Parallel Set Containment Joins","authors":"Anja Kunkel, Astrid Rheinländer, C. Schiefer, S. Helmer, Panagiotis Bouros, U. Leser","doi":"10.1145/2949689.2949694","DOIUrl":"https://doi.org/10.1145/2949689.2949694","url":null,"abstract":"The efficient computation of set containment joins (SCJ) over set-valued attributes is a well-studied problem with many applications in commercial and scientific fields. Nevertheless, there still exists a number of open questions: An extensive comparative evaluation is still missing, the two most recent algorithms have not yet been compared to each other, and the exact impact of item sort order and properties of the data on algorithms performance still is largely unknown. Furthermore, all previous works only considered sequential join algorithms, although modern servers offer ample opportunities for parallelization. We present PIEJoin, a novel algorithm for computing SCJ based on intersecting prefix trees built at runtime over the to-be-joined attributes. We also present a highly optimized implementation of PIEJoin which uses tree signatures for saving space and interval labeling for improving runtime of the basic method. Most importantly, PIEJoin can be parallelized easily by partitioning the tree intersection. A comprehensive evaluation on eight data sets shows that PIEJoin already in its sequential form clearly outperforms two of the three most important competitors (PRETTI and PRETTI+). It is mostly yet not always slower than the third, LIMIT+(opj) but requires significantly less space. The parallel version of PIEJoin we present here achieves significant further speed-ups, yet our evaluation also shows that further research is needed as finding the best way of partitioning the join turns out to be non-trivial.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130593239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most messages posted on Twitter discuss an ongoing event, triggering a series of tweets that together may constitute a trending topic (e.g., #election2012, #jesuischarlie, #oscars2016). Sometimes, such a topic may be trending only locally, provided that related posts have a geographical reference, either directly through geotags with exact coordinates or indirectly by mentioning a well-known landmark (e.g., #bataclan). In this paper, we study how trending topics evolve both in space and time, by monitoring the Twitter stream and detecting online the varying spatial coverage of related geotagged posts over time. Observing the evolving spatial coverage of such posts may reveal the intensity of a phenomenon and its impact on local communities, and can further assist in improving user awareness of facts and situations with a strong local footprint. We propose a technique that maintains trending topics and readily recognizes their locality by subdividing the area of interest into elementary cells. Thus, instead of costly spatial clustering of incoming messages by topic, we can approximately, but almost instantly, identify such areas of coverage as groups of contiguous cells, as well as their mutability over time. We conducted a comprehensive empirical study to evaluate the performance of the proposed methodology, as well as the quality of the detected areas of coverage. The results confirm that our technique can efficiently cope with scalable volumes of messages, offering incremental, real-time responses regarding coverage updates for trending topics.
{"title":"Monitoring Spatial Coverage of Trending Topics in Twitter","authors":"Kostas Patroumpas, M. Loukadakis","doi":"10.1145/2949689.2949716","DOIUrl":"https://doi.org/10.1145/2949689.2949716","url":null,"abstract":"Most messages posted in Twitter usually discuss an ongoing event, triggering a series of tweets that together may constitute a trending topic (e.g., #election2012, #jesuischarlie, #oscars2016). Sometimes, such a topic may be trending only locally, assuming that related posts have a geographical reference, either directly geotagging them with exact coordinates or indirectly by mentioning a well-known landmark (e.g., #bataclan). In this paper, we study how trending topics evolve both in space and time, by monitoring the Twitter stream and detecting online the varying spatial coverage of related geotagged posts across time. Observing the evolving spatial coverage of such posts may reveal the intensity of a phenomenon and its impact on local communities, and can further assist in improving user awareness on facts and situations with strong local footprint. We propose a technique that can maintain trending topics and readily recognize their locality by subdividing the area of interest into elementary cells. Thus, instead of costly spatial clustering of incoming messages by topic, we can approximately, but almost instantly, identify such areas of coverage as groups of contiguous cells, as well as their mutability with time. We conducted a comprehensive empirical study to evaluate the performance of the proposed methodology, as well as the quality of detected areas of coverage. Results confirm that our technique can efficiently cope with scalable volumes of messages, offering incremental response in real-time regarding coverage updates for trending topics.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134573761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a new approach for the incremental evaluation of RDF graph streams over sliding windows. Our system, called "SPECTRA", combines a novel form of RDF graph summarisation, a new incremental evaluation method, and adaptive indexing techniques. We materialise the summarised graph from each event using vertically partitioned views to facilitate fast hash joins for all types of queries. Our incremental and adaptive indexing is a byproduct of query processing, and thus provides considerable advantages over offline and online indexing. Furthermore, contrary to existing approaches, we employ incremental evaluation of triples within a window. This considerably reduces response time and cuts the unnecessary cost that recomputation models impose for each triple insertion and eviction within a defined window. We show that the resulting system is able to cope with complex queries and datasets, with clear benefits. Our experimental results on both synthetic and real-world datasets show up to an order of magnitude of performance improvement compared to state-of-the-art systems.
{"title":"SPECTRA: Continuous Query Processing for RDF Graph Streams Over Sliding Windows","authors":"Syed Gillani, Gauthier Picard, F. Laforest","doi":"10.1145/2949689.2949701","DOIUrl":"https://doi.org/10.1145/2949689.2949701","url":null,"abstract":"This paper proposes a new approach for the the incremental evaluation of RDF graph streams over sliding windows. Our system, called \"SPECTRA\", combines a novel formof RDF graph summarisation, a new incremental evaluation method and adaptive indexing techniques. We materialise the summarised graph from each event using vertically partitioned views to facilitate the fast hash-joins for all types of queries. Our incremental and adaptive indexing is a byproduct of query processing, and thus provides considerable advantages over offline and online indexing. Furthermore, contrary to the existing approaches, we employ incremental evaluation of triples within a window. This results in considerable reduction in response time, while cutting the unnecessary cost imposed by recomputation models for each triple insertion and eviction within a defined window. We show that our resulting system is able to cope with complex queries and datasets with clear benefits. Our experimental results on both synthetic and real-world datasets show up to an order of magnitude of performance improvements as compared to state-of-the-art systems.","PeriodicalId":254803,"journal":{"name":"Proceedings of the 28th International Conference on Scientific and Statistical Database Management","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129190894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}