Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498366
Title: Dark Data: Are we solving the right problems?
Michael J. Cafarella, I. Ilyas, Marcel Kornacker, Tim Kraska, C. Ré
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1444-1445
As enterprises rush to ingest as much data as they can into what is commonly referred to as "Data Lakes", the new environment poses serious challenges to traditional ETL models and to building analytic layers on top of a well-understood global schema. With the recent development of multiple technologies supporting this "load-first" paradigm, even traditional enterprises now have fairly large HDFS-based data lakes. They have had them long enough that their first-generation IT projects delivered on some, but not all, of the promise of integrating their enterprises' data assets. In short, we moved from no data to dark data. Dark data is data an enterprise possesses without the ability to access it, or with limited awareness of what it represents. In particular, business-critical information may still remain out of reach. This panel is about dark data and whether we have been focusing on the right data management challenges in dealing with it.
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498270
Title: Link prediction in graph streams
Peixiang Zhao, C. Aggarwal, Gewen He
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 553-564
Link prediction is a fundamental problem that aims to estimate the likelihood of the existence of edges (links) given the currently observed structure of a graph, and it has found numerous applications in social networks, bioinformatics, e-commerce, and the Web. In many real-world scenarios, however, graphs are massive and evolve dynamically at a fast rate, and are therefore often modeled and interpreted as graph streams. Existing link prediction methods fail to generalize to the graph stream setting because the graph snapshots on which link prediction is performed are no longer readily available in memory, or even on disk, for effective graph computation and analysis. It is therefore highly desirable, albeit challenging, to support link prediction online and dynamically, which we refer to in this paper as the streaming link prediction problem. We consider three fundamental neighborhood-based link prediction measures (Jaccard coefficient, common neighbors, and Adamic-Adar) and provide accurate estimates of them in graph streams. Our main idea is to design cost-effective graph sketches (constant space per vertex) based on MinHash and vertex-biased sampling techniques, and to propose efficient sketch-based algorithms (constant time per edge) with both theoretical accuracy guarantees and robust estimation results. We carry out experimental studies on a series of real-world graph streams. The results demonstrate that our sketch-based methods are accurate, efficient, and cost-effective, and can thus be practically employed for link prediction in real-world graph streams.
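The constant-space MinHash sketches described in the abstract can be illustrated with a minimal Python sketch (a generic illustration of MinHash-based Jaccard estimation over an edge stream, not the authors' exact algorithm; the toy stream, hash family, and sketch size are hypothetical):

```python
import random

K = 64  # sketch size: constant space per vertex

# K independent hash functions h_i(x) = (a_i * x + b_i) mod P
P = (1 << 61) - 1
random.seed(42)
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(K)]

def _h(i, x):
    a, b = HASHES[i]
    return (a * x + b) % P

class MinHashSketch:
    """Constant-space summary of a vertex's neighbor set."""
    def __init__(self):
        self.mins = [P] * K

    def add(self, neighbor):
        # Streaming update: one pass, O(K) time per edge endpoint.
        for i in range(K):
            hv = _h(i, neighbor)
            if hv < self.mins[i]:
                self.mins[i] = hv

def jaccard_estimate(s1, s2):
    # Pr[min-hash values agree] = |A intersect B| / |A union B|
    return sum(a == b for a, b in zip(s1.mins, s2.mins)) / K

# Process a small edge stream and compare against the exact Jaccard.
stream = [(1, x) for x in range(100)] + [(2, x) for x in range(50, 150)]
sketch, neighbors = {}, {}
for u, v in stream:
    sketch.setdefault(u, MinHashSketch()).add(v)
    neighbors.setdefault(u, set()).add(v)

exact = len(neighbors[1] & neighbors[2]) / len(neighbors[1] | neighbors[2])
est = jaccard_estimate(sketch[1], sketch[2])
```

Here the exact neighborhood sets are kept only to check the estimate; a streaming system would retain just the K-slot sketches.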
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498304
Title: SPORE: A sequential personalized spatial item recommender system
Weiqing Wang, Hongzhi Yin, S. Sadiq, Ling Chen, M. Xie, Xiaofang Zhou
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 954-965
With the rapid development of location-based social networks (LBSNs), spatial item recommendation has become an important way of helping users discover interesting locations and of increasing their engagement with location-based services. Although human movement in LBSNs exhibits sequential patterns, most current studies on spatial item recommendation do not consider the sequential influence of locations. Leveraging sequential patterns in spatial item recommendation is, however, very challenging, because 1) users' check-in data in LBSNs have a low sampling rate in both space and time, which renders existing prediction techniques for GPS trajectories ineffective; 2) the prediction space is extremely large, with millions of distinct locations as the next prediction target, which impedes the application of classical Markov chain models; and 3) no existing framework unifies users' personal interests and sequential influence in a principled manner. In light of these challenges, we propose a sequential personalized spatial item recommendation framework (SPORE) that introduces a novel latent variable, the topic-region, to model and fuse sequential influence with personal interests in the latent and exponential space. The advantages of modeling the sequential effect at the topic-region level include a significantly reduced prediction space, effective alleviation of data sparsity, and a direct expression of the semantic meaning of users' spatial activities. Furthermore, we design an asymmetric Locality Sensitive Hashing (ALSH) technique, extending traditional LSH, to speed up online top-k recommendation. We evaluate the performance of SPORE on two real datasets and one large-scale synthetic dataset. The results demonstrate a significant improvement in SPORE's ability to recommend spatial items, in terms of both effectiveness and efficiency, compared with state-of-the-art methods.
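The asymmetric hashing idea mentioned in the abstract can be sketched as follows (a hedged illustration using the common norm-completing transform plus random-hyperplane signatures; the paper's ALSH construction may differ, and all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess_items(X):
    """Asymmetric transform for items: scale so ||x|| <= 1,
    then append sqrt(1 - ||x||^2) so every item becomes unit-norm."""
    X = X / np.linalg.norm(X, axis=1).max()
    extra = np.sqrt(np.maximum(0.0, 1.0 - np.linalg.norm(X, axis=1) ** 2))
    return np.hstack([X, extra[:, None]])

def preprocess_query(q):
    """Queries get a zero in the extra coordinate, so the transformed
    inner product <q', x'> is proportional to the original <q, x>."""
    q = q / np.linalg.norm(q)
    return np.append(q, 0.0)

def signatures(V, planes):
    # Random-hyperplane (sign) LSH on the transformed vectors.
    return (V @ planes.T) > 0

# Toy index: rank items by inner product with q via signature similarity.
X = rng.normal(size=(200, 16))
q = rng.normal(size=16)

Xt = preprocess_items(X)
qt = preprocess_query(q)
planes = rng.normal(size=(256, 17))

sx = signatures(Xt, planes)           # (200, 256) bit matrix
sq = signatures(qt[None, :], planes)  # (1, 256)
# Fraction of agreeing bits estimates 1 - angle/pi, which, after the
# transform, ranks items by their inner product with q.
scores = (sx == sq).mean(axis=1)
best_true = int(np.argmax(X @ q))
```

Because the transform makes all item vectors unit-norm, maximum inner product search reduces to an angular similarity search, which sign LSH handles.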
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498227
Title: Revenue maximization by viral marketing: A social network host's perspective
Arijit Khan, Benjamin Zehnder, Donald Kossmann
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 37-48
We study the novel problem of revenue maximization for a social network host that sells viral marketing campaigns to multiple competing campaigners. Each client campaigner informs the social network host about her target users in the network, as well as how much she is willing to pay the host if one of her target users buys her product. The social network host, in turn, assigns a set of seed users to each of her client campaigners. The seed set for a campaigner is a limited number of users to whom the campaigner provides free samples, discounted prices, etc., with the expectation that these seed users will buy her product and will also influence many of her target users in the network towards buying it. Because of various product-adoption costs, it is very unlikely that an average user will purchase more than one of the competing products. Therefore, from the host's perspective, it is important to assign seed users to client campaigners in such a way that the assignment maximizes the aggregated revenue over all client campaigners. We formulate our problem under two well-established influence cascade models: the independent cascade model and the linear threshold model. Although the problem is NP-hard under both models, and neither monotonic nor submodular, we develop approximation algorithms with theoretical performance guarantees. Since our approximation algorithms often incur high running times, we also design efficient heuristic methods that empirically perform as well as them. Our detailed experimental evaluation attests that the proposed techniques are effective and scalable on real-world datasets.
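For reference, expected spread under the independent cascade model named in the abstract is commonly estimated by Monte-Carlo simulation. A minimal, generic sketch (not the authors' revenue-allocation algorithm; the toy graph and parameters are hypothetical):

```python
import random

def ic_spread(graph, seeds, p=0.1, trials=1000, rng=None):
    """Estimate the expected number of activated users under the
    independent cascade model: each newly active node gets one
    chance to activate each inactive out-neighbor with probability p."""
    rng = rng or random.Random(7)
    total = 0
    for _ in range(trials):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / trials

# Toy directed network; expected spread from seed 0 with p=0.5
# works out to 2.65625 analytically, which the estimate approaches.
g = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
spread = ic_spread(g, seeds={0}, p=0.5, trials=2000)
```

A host-side seed-assignment algorithm would call such an estimator many times, which is why the paper's heuristics matter in practice.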
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498328
Title: SPDO: High-throughput road distance computations on Spark using Distance Oracles
Shangfu Peng, Jagan Sankaranarayanan, H. Samet
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1239-1250
Over the past decades, shortest-distance methods for road networks have focused on reducing the latency of a single source-target distance query. Large analytical applications on road networks, including simulations (e.g., evacuation planning), logistics, and transportation planning, instead require methods that provide high throughput (i.e., distance computations per second) and the ability to "scale out" on large distributed computing clusters. We present a framework called SPDO, which implements an extremely fast distributed algorithm for road network distance queries on Apache Spark. The approach extends our previous work on the ε-distance oracle, which we have adapted to use Spark's resilient distributed datasets (RDDs). Compared with state-of-the-art methods that focus on reducing latency, the framework improves throughput by at least an order of magnitude, making it suitable for applications that need to compute thousands to millions of network distances per second.
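The precompute-then-lookup pattern that makes distance oracles high-throughput can be illustrated with a simple landmark scheme (a generic sketch, not the paper's ε-distance oracle or its Spark implementation; the toy network is hypothetical):

```python
import heapq

def dijkstra(graph, src):
    """Single-source shortest paths on a weighted adjacency dict."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

class LandmarkOracle:
    """Precompute distances from a few landmarks; answer any (s, t)
    query in O(#landmarks) table lookups with the upper bound
    d(s, t) <= min_L d(L, s) + d(L, t), by the triangle inequality
    on an undirected network."""
    def __init__(self, graph, landmarks):
        self.tables = [dijkstra(graph, L) for L in landmarks]

    def estimate(self, s, t):
        return min(tab.get(s, float("inf")) + tab.get(t, float("inf"))
                   for tab in self.tables)

# Toy undirected road network: path 0-1-2-3 plus a slow shortcut 0-3.
g = {0: [(1, 1.0), (3, 5.0)], 1: [(0, 1.0), (2, 1.0)],
     2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0), (0, 5.0)]}
oracle = LandmarkOracle(g, landmarks=[0])
est = oracle.estimate(1, 2)
```

Once the tables are built, each query is pure lookup work, which is what makes batching millions of queries per second feasible on a cluster.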
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498337
Title: Beat the DIVa - decentralized identity validation for online social networks
Leila Bahri, Amira Soliman, Jacopo Squillaci, B. Carminati, E. Ferrari, Sarunas Girdzijauskas
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1330-1333
Fake accounts in online social networks (OSNs) have grown considerably in sophistication and now attempt to gain network trust by infiltrating honest communities. Honest users have limited means of judging the truthfulness of new online identities requesting their friendship, which makes it easier for fake accounts to deceive them into accepting. To address this, we have proposed a model that learns hidden correlations between profile attributes within OSN communities and exploits them to help users estimate the trustworthiness of new profiles. To demonstrate the method, this demo presents a game in which players try to cheat the system and convince nodes in a simulated OSN to befriend them. The game deploys different strategies to challenge the players and to meet the demo's objectives: to make participants aware of how fake accounts can infiltrate their OSN communities, to demonstrate how our method could help mitigate this threat, and, eventually, to strengthen our model with the data collected from the players' moves.
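The flavor of learning attribute correlations within a community can be shown with a toy sketch (a hypothetical simplification, not DIVa's actual model): count attribute-value co-occurrences among community members, then score a new profile by how well its attribute combinations fit.

```python
from collections import Counter
from itertools import combinations

def learn_pair_counts(profiles):
    """Count how often each pair of (attribute, value) items
    co-occurs across community members' profiles."""
    pair_counts, item_counts = Counter(), Counter()
    for prof in profiles:
        items = sorted(prof.items())
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return pair_counts, item_counts

def fit_score(profile, pair_counts, item_counts):
    """Average conditional support of b given a over the profile's
    attribute pairs; low scores flag improbable combinations."""
    items = sorted(profile.items())
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    score = 0.0
    for a, b in pairs:
        if item_counts[a]:
            score += pair_counts[(a, b)] / item_counts[a]
    return score / len(pairs)

# Hypothetical community with strongly correlated attributes.
community = [
    {"city": "Milan", "lang": "it", "team": "Inter"},
    {"city": "Milan", "lang": "it", "team": "AC Milan"},
    {"city": "Milan", "lang": "it", "team": "Inter"},
]
pc, ic = learn_pair_counts(community)
plausible = fit_score({"city": "Milan", "lang": "it"}, pc, ic)
odd = fit_score({"city": "Milan", "lang": "sv"}, pc, ic)
```

A profile claiming a never-before-seen attribute combination scores low, which is the kind of signal the demo's simulated nodes could use when deciding whether to accept a friendship request.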
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498382
Title: Efficient answering of why-not questions in similar graph matching
Md. Saiful Islam, Chengfei Liu, Jianxin Li
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1476-1477
Graph data management and similar graph matching are very important for many applications, including bioinformatics, computer vision, VLSI design, bug localization, road networks, and social and communication networking. Many graph indexing and similarity matching techniques have been proposed for managing and querying graph data. In similar graph matching, the user receives the database graphs whose distances from the query graph are below a threshold. In such query settings, a user may miss database graphs that are very similar to her intent if the initial query graph is inappropriate or imperfect for the expected answer set. As an example, consider a drug designer looking for chemical compounds that could be targets of her hypothetical drug before realizing it. In response to her query, a traditional search system returns the database structures most similar to the query graph. She may be surprised, however, if some of the expected targets are missing from the answer set, and may then seek assistance from the system by asking: "Is there another query graph that matches my expected answer set?" The system may then modify her initial query graph so that the new answer set includes the missing answers. In this paper, we study this problem of answering why-not questions in similar graph matching for graph databases.
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498225
Title: Topical influence modeling via topic-level interests and interactions on social curation services
Daehoon Kim, Jae-Gil Lee, B. Lee
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 13-24
Social curation services are emerging social media platforms that enable users to curate their contents by topic and to express their interests at the topic level by following curated collections of other users' contents rather than the users themselves. The topic-level information revealed through this new feature far exceeds what existing methods can solicit from traditional social networking services, and can greatly enhance the quality of topic-sensitive influence modeling. In this paper, we propose a novel model, topical influence with social curation (TISC), to find influential users in social curation services. The model, formulated as a continuous conditional random field, takes full advantage of the explicitly available topic-level information reflected in both contents and interactions. To validate its merits, we comprehensively compare TISC with state-of-the-art models on two real-world data sets collected from Pinterest and Scoop.it. The results show that TISC achieves up to around 80% higher accuracy and finds more convincing results in case studies than the other models. Moreover, we develop a distributed learning algorithm on Spark and demonstrate its excellent scalability on a cluster of 48 cores.
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498289
Title: Analyzing data-centric applications: Why, what-if, and how-to
P. Bourhis, Daniel Deutch, Y. Moskovitch
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 779-790
In this paper, we consider the analysis of complex applications that query and update an underlying database in their operation. We focus on three classes of analytical questions that are important for application owners and users alike: Why was a result generated? What would the result be if the application logic or the database were modified in a particular way? How can one interact with the application to achieve a particular goal? Answering these questions efficiently is a fundamental step towards optimizing the application and its use. Noting that provenance has been a key component in answering similar questions for database queries, we develop a provenance-based model and efficient algorithms for these problems in the context of data-centric applications. Novel challenges here include the dynamic update of data combined with the possibly complex workflows that applications allow. We nevertheless achieve theoretical guarantees on the algorithms' performance, and experimentally show their efficiency and usefulness even in the presence of complex applications and large-scale data.
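The provenance idea the abstract builds on can be illustrated for a plain database query (a standard why-provenance sketch, not the paper's model for data-centric applications; the tables and tuple ids are hypothetical): track, for each output tuple, the set of input tuples that produced it.

```python
# Why-provenance for a simple join-then-select query: each output
# row carries the set of input tuple ids that produced it.

orders = [  # (tuple_id, customer, item)
    ("o1", "alice", "book"),
    ("o2", "bob", "pen"),
]
customers = [  # (tuple_id, name, country)
    ("c1", "alice", "FR"),
    ("c2", "bob", "DE"),
]

def query_with_provenance():
    """SELECT item, country FROM orders JOIN customers
       ON customer = name WHERE country = 'FR'."""
    out = []
    for oid, cust, item in orders:
        for cid, name, country in customers:
            if cust == name and country == "FR":
                # Why-provenance: the witness set of input tuples.
                out.append(((item, country), {oid, cid}))
    return out

result = query_with_provenance()
```

With witness sets in hand, "why was ('book', 'FR') produced?" becomes a lookup, and a what-if deletion of tuple c1 simply removes every answer whose witness set mentions c1; the paper's challenge is extending this kind of reasoning to applications that also update the database.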
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498365
A. Magdy, M. Mokbel
Microblogs data, e.g., tweets, reviews, news comments, and social media comments, has gained considerable attention in recent years due to its popularity and rich contents. Nowadays, microblogs applications span a wide spectrum of interests, including detecting and analyzing events, user analysis for geo-targeted ads and political elections, and critical applications like discovering health issues and rescue services. Consequently, major research efforts are spent on analyzing and managing microblogs data to support different applications. In this tutorial, we give a 1.5-hour overview of microblogs data analysis, management, and systems. The tutorial gives a comprehensive review of research efforts that analyze microblogs contents to build new functionality and use cases on top of them. In addition, the tutorial reviews existing research that proposes core data management components to support microblogs queries at scale. Finally, the tutorial reviews system-level issues and ongoing work on supporting microblogs data through the rising big data systems. Through its different parts, the tutorial highlights the challenges and opportunities in microblogs data research.
{"title":"Microblogs data management and analysis","authors":"A. Magdy, M. Mokbel","doi":"10.1109/ICDE.2016.7498365","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498365","url":null,"abstract":"Microblogs data, e.g., tweets, reviews, news comments, and social media comments, has gained considerable attention in recent years due to its popularity and rich contents. Nowadays, microblogs applications span a wide spectrum of interests, including detecting and analyzing events, user analysis for geo-targeted ads and political elections, and critical applications like discovering health issues and rescue services. Consequently, major research efforts are spent on analyzing and managing microblogs data to support different applications. In this tutorial, we give a 1.5-hour overview of microblogs data analysis, management, and systems. The tutorial gives a comprehensive review of research efforts that analyze microblogs contents to build new functionality and use cases on top of them. In addition, the tutorial reviews existing research that proposes core data management components to support microblogs queries at scale. Finally, the tutorial reviews system-level issues and ongoing work on supporting microblogs data through the rising big data systems. Through its different parts, the tutorial highlights the challenges and opportunities in microblogs data research.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"19 1","pages":"1440-1443"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78623556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}