Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498366
Title: Dark Data: Are we solving the right problems?
Michael J. Cafarella, I. Ilyas, Marcel Kornacker, Tim Kraska, C. Ré
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1444-1445
As enterprises rush to ingest as much data as they can into what is commonly referred to as "Data Lakes", the new environment poses serious challenges to traditional ETL models and to building analytic layers on top of a well-understood global schema. With the recent development of multiple technologies supporting this "load-first" paradigm, even traditional enterprises now have fairly large HDFS-based data lakes. They have had them long enough that their first-generation IT projects delivered on some, but not all, of the promise of integrating their enterprises' data assets. In short, we moved from no data to dark data. Dark data is data an enterprise possesses without the ability to access it, or with limited awareness of what it represents. In particular, business-critical information may still remain out of reach. This panel is about dark data and whether we have been focusing on the right data management challenges in dealing with it.
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498270
Title: Link prediction in graph streams
Peixiang Zhao, C. Aggarwal, Gewen He
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 553-564
Link prediction is a fundamental problem that aims to estimate the likelihood of the existence of edges (links) given the currently observed structure of a graph, and it has found numerous applications in social networks, bioinformatics, e-commerce, and the Web. In many real-world scenarios, however, graphs are massive and evolve dynamically at a fast rate, and are therefore often modeled and interpreted as graph streams. Existing link prediction methods fail to generalize to the graph stream setting because the graph snapshots on which link prediction is performed are no longer readily available in memory, or even on disk, for effective graph computation and analysis. It is therefore highly desirable, albeit challenging, to support link prediction online and dynamically, which we refer to in this paper as the streaming link prediction problem. We consider three fundamental neighborhood-based link prediction measures (Jaccard coefficient, common neighbors, and Adamic-Adar) and provide accurate estimates of them in graph streams. Our main idea is to design cost-effective graph sketches (constant space per vertex) based on MinHash and vertex-biased sampling techniques, and to propose efficient sketch-based algorithms (constant time per edge) with both theoretical accuracy guarantees and robust estimation results. We carry out experimental studies on a series of real-world graph streams. The results demonstrate that our sketch-based methods are accurate, efficient, and cost-effective, and can thus be practically employed for link prediction in real-world graph streams.
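The constant-space MinHash sketches described in the abstract can be illustrated with a minimal Python sketch (a generic illustration of MinHash-based Jaccard estimation over an edge stream, not the authors' exact algorithm; the toy stream, hash family, and sketch size are hypothetical):

```python
import random

K = 64  # sketch size: constant space per vertex

# K independent hash functions h_i(x) = (a_i * x + b_i) mod P
P = (1 << 61) - 1
random.seed(42)
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(K)]

def _h(i, x):
    a, b = HASHES[i]
    return (a * x + b) % P

class MinHashSketch:
    """Constant-space summary of a vertex's neighbor set."""
    def __init__(self):
        self.mins = [P] * K

    def add(self, neighbor):
        # Streaming update: one pass, O(K) time per edge endpoint.
        for i in range(K):
            hv = _h(i, neighbor)
            if hv < self.mins[i]:
                self.mins[i] = hv

def jaccard_estimate(s1, s2):
    # Pr[min-hash values agree] = |A intersect B| / |A union B|
    return sum(a == b for a, b in zip(s1.mins, s2.mins)) / K

# Process a small edge stream and compare against the exact Jaccard.
stream = [(1, x) for x in range(100)] + [(2, x) for x in range(50, 150)]
sketch, neighbors = {}, {}
for u, v in stream:
    sketch.setdefault(u, MinHashSketch()).add(v)
    neighbors.setdefault(u, set()).add(v)

exact = len(neighbors[1] & neighbors[2]) / len(neighbors[1] | neighbors[2])
est = jaccard_estimate(sketch[1], sketch[2])
```

Here the exact neighborhood sets are kept only to check the estimate; a streaming system would retain just the K-slot sketches.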
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498304
Title: SPORE: A sequential personalized spatial item recommender system
Weiqing Wang, Hongzhi Yin, S. Sadiq, Ling Chen, M. Xie, Xiaofang Zhou
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 954-965
With the rapid development of location-based social networks (LBSNs), spatial item recommendation has become an important way of helping users discover interesting locations and of increasing their engagement with location-based services. Although human movement in LBSNs exhibits sequential patterns, most current studies on spatial item recommendation do not consider the sequential influence of locations. Leveraging sequential patterns in spatial item recommendation is, however, very challenging, because 1) users' check-in data in LBSNs have a low sampling rate in both space and time, which renders existing prediction techniques for GPS trajectories ineffective; 2) the prediction space is extremely large, with millions of distinct locations as the next prediction target, which impedes the application of classical Markov chain models; and 3) no existing framework unifies users' personal interests and sequential influence in a principled manner. In light of these challenges, we propose a sequential personalized spatial item recommendation framework (SPORE) that introduces a novel latent variable, the topic-region, to model and fuse sequential influence with personal interests in the latent and exponential space. The advantages of modeling the sequential effect at the topic-region level include a significantly reduced prediction space, effective alleviation of data sparsity, and a direct expression of the semantic meaning of users' spatial activities. Furthermore, we design an asymmetric Locality Sensitive Hashing (ALSH) technique, extending traditional LSH, to speed up online top-k recommendation. We evaluate the performance of SPORE on two real datasets and one large-scale synthetic dataset. The results demonstrate a significant improvement in SPORE's ability to recommend spatial items, in terms of both effectiveness and efficiency, compared with state-of-the-art methods.
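The asymmetric hashing idea mentioned in the abstract can be sketched as follows (a hedged illustration using the common norm-completing transform plus random-hyperplane signatures; the paper's ALSH construction may differ, and all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess_items(X):
    """Asymmetric transform for items: scale so ||x|| <= 1,
    then append sqrt(1 - ||x||^2) so every item becomes unit-norm."""
    X = X / np.linalg.norm(X, axis=1).max()
    extra = np.sqrt(np.maximum(0.0, 1.0 - np.linalg.norm(X, axis=1) ** 2))
    return np.hstack([X, extra[:, None]])

def preprocess_query(q):
    """Queries get a zero in the extra coordinate, so the transformed
    inner product <q', x'> is proportional to the original <q, x>."""
    q = q / np.linalg.norm(q)
    return np.append(q, 0.0)

def signatures(V, planes):
    # Random-hyperplane (sign) LSH on the transformed vectors.
    return (V @ planes.T) > 0

# Toy index: rank items by inner product with q via signature similarity.
X = rng.normal(size=(200, 16))
q = rng.normal(size=16)

Xt = preprocess_items(X)
qt = preprocess_query(q)
planes = rng.normal(size=(256, 17))

sx = signatures(Xt, planes)           # (200, 256) bit matrix
sq = signatures(qt[None, :], planes)  # (1, 256)
# Fraction of agreeing bits estimates 1 - angle/pi, which, after the
# transform, ranks items by their inner product with q.
scores = (sx == sq).mean(axis=1)
best_true = int(np.argmax(X @ q))
```

Because the transform makes all item vectors unit-norm, maximum inner product search reduces to an angular similarity search, which sign LSH handles.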
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498227
Title: Revenue maximization by viral marketing: A social network host's perspective
Arijit Khan, Benjamin Zehnder, Donald Kossmann
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 37-48
We study the novel problem of revenue maximization for a social network host that sells viral marketing campaigns to multiple competing campaigners. Each client campaigner informs the social network host about her target users in the network, as well as how much she is willing to pay the host if one of her target users buys her product. The social network host, in turn, assigns a set of seed users to each of her client campaigners. The seed set for a campaigner is a limited number of users to whom the campaigner provides free samples, discounted prices, etc., with the expectation that these seed users will buy her product and will also influence many of her target users in the network towards buying it. Because of various product-adoption costs, it is very unlikely that an average user will purchase more than one of the competing products. Therefore, from the host's perspective, it is important to assign seed users to client campaigners in such a way that the assignment maximizes the aggregated revenue over all client campaigners. We formulate our problem under two well-established influence cascade models: the independent cascade model and the linear threshold model. Although the problem is NP-hard under both models, and neither monotonic nor submodular, we develop approximation algorithms with theoretical performance guarantees. Since our approximation algorithms often incur high running times, we also design efficient heuristic methods that empirically perform as well as them. Our detailed experimental evaluation attests that the proposed techniques are effective and scalable on real-world datasets.
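For reference, expected spread under the independent cascade model named in the abstract is commonly estimated by Monte-Carlo simulation. A minimal, generic sketch (not the authors' revenue-allocation algorithm; the toy graph and parameters are hypothetical):

```python
import random

def ic_spread(graph, seeds, p=0.1, trials=1000, rng=None):
    """Estimate the expected number of activated users under the
    independent cascade model: each newly active node gets one
    chance to activate each inactive out-neighbor with probability p."""
    rng = rng or random.Random(7)
    total = 0
    for _ in range(trials):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / trials

# Toy directed network; expected spread from seed 0 with p=0.5
# works out to 2.65625 analytically, which the estimate approaches.
g = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
spread = ic_spread(g, seeds={0}, p=0.5, trials=2000)
```

A host-side seed-assignment algorithm would call such an estimator many times, which is why the paper's heuristics matter in practice.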
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498328
Title: SPDO: High-throughput road distance computations on Spark using Distance Oracles
Shangfu Peng, Jagan Sankaranarayanan, H. Samet
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1239-1250
Over the past decades, shortest-distance methods for road networks have focused on reducing the latency of a single source-target distance query. Large analytical applications on road networks, including simulations (e.g., evacuation planning), logistics, and transportation planning, instead require methods that provide high throughput (i.e., distance computations per second) and the ability to "scale out" on large distributed computing clusters. We present a framework called SPDO, which implements an extremely fast distributed algorithm for road network distance queries on Apache Spark. The approach extends our previous work on the ε-distance oracle, which we have adapted to use Spark's resilient distributed datasets (RDDs). Compared with state-of-the-art methods that focus on reducing latency, the framework improves throughput by at least an order of magnitude, making it suitable for applications that need to compute thousands to millions of network distances per second.
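The precompute-then-lookup pattern that makes distance oracles high-throughput can be illustrated with a simple landmark scheme (a generic sketch, not the paper's ε-distance oracle or its Spark implementation; the toy network is hypothetical):

```python
import heapq

def dijkstra(graph, src):
    """Single-source shortest paths on a weighted adjacency dict."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

class LandmarkOracle:
    """Precompute distances from a few landmarks; answer any (s, t)
    query in O(#landmarks) table lookups with the upper bound
    d(s, t) <= min_L d(L, s) + d(L, t), by the triangle inequality
    on an undirected network."""
    def __init__(self, graph, landmarks):
        self.tables = [dijkstra(graph, L) for L in landmarks]

    def estimate(self, s, t):
        return min(tab.get(s, float("inf")) + tab.get(t, float("inf"))
                   for tab in self.tables)

# Toy undirected road network: path 0-1-2-3 plus a slow shortcut 0-3.
g = {0: [(1, 1.0), (3, 5.0)], 1: [(0, 1.0), (2, 1.0)],
     2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0), (0, 5.0)]}
oracle = LandmarkOracle(g, landmarks=[0])
est = oracle.estimate(1, 2)
```

Once the tables are built, each query is pure lookup work, which is what makes batching millions of queries per second feasible on a cluster.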
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498337
Title: Beat the DIVa - decentralized identity validation for online social networks
Leila Bahri, Amira Soliman, Jacopo Squillaci, B. Carminati, E. Ferrari, Sarunas Girdzijauskas
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1330-1333
Fake accounts in online social networks (OSNs) have grown considerably in sophistication and now attempt to gain network trust by infiltrating honest communities. Honest users have limited means of judging the truthfulness of new online identities requesting their friendship, which makes it easier for fake accounts to deceive them into accepting. To address this, we have proposed a model that learns hidden correlations between profile attributes within OSN communities and exploits them to help users estimate the trustworthiness of new profiles. To demonstrate the method, this demo presents a game in which players try to cheat the system and convince nodes in a simulated OSN to befriend them. The game deploys different strategies to challenge the players and to meet the demo's objectives: to make participants aware of how fake accounts can infiltrate their OSN communities, to demonstrate how our method could help mitigate this threat, and, eventually, to strengthen our model with the data collected from the players' moves.
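The flavor of learning attribute correlations within a community can be shown with a toy sketch (a hypothetical simplification, not DIVa's actual model): count attribute-value co-occurrences among community members, then score a new profile by how well its attribute combinations fit.

```python
from collections import Counter
from itertools import combinations

def learn_pair_counts(profiles):
    """Count how often each pair of (attribute, value) items
    co-occurs across community members' profiles."""
    pair_counts, item_counts = Counter(), Counter()
    for prof in profiles:
        items = sorted(prof.items())
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return pair_counts, item_counts

def fit_score(profile, pair_counts, item_counts):
    """Average conditional support of b given a over the profile's
    attribute pairs; low scores flag improbable combinations."""
    items = sorted(profile.items())
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    score = 0.0
    for a, b in pairs:
        if item_counts[a]:
            score += pair_counts[(a, b)] / item_counts[a]
    return score / len(pairs)

# Hypothetical community with strongly correlated attributes.
community = [
    {"city": "Milan", "lang": "it", "team": "Inter"},
    {"city": "Milan", "lang": "it", "team": "AC Milan"},
    {"city": "Milan", "lang": "it", "team": "Inter"},
]
pc, ic = learn_pair_counts(community)
plausible = fit_score({"city": "Milan", "lang": "it"}, pc, ic)
odd = fit_score({"city": "Milan", "lang": "sv"}, pc, ic)
```

A profile claiming a never-before-seen attribute combination scores low, which is the kind of signal the demo's simulated nodes could use when deciding whether to accept a friendship request.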
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498382
Title: Efficient answering of why-not questions in similar graph matching
Md. Saiful Islam, Chengfei Liu, Jianxin Li
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1476-1477
Graph data management and similar graph matching are very important for many applications, including bioinformatics, computer vision, VLSI design, bug localization, road networks, and social and communication networking. Many graph indexing and similarity matching techniques have been proposed for managing and querying graph data. In similar graph matching, the user receives the database graphs whose distances from the query graph are below a threshold. In such query settings, a user may miss database graphs that are very similar to her intent if the initial query graph is inappropriate or imperfect for the expected answer set. As an example, consider a drug designer looking for chemical compounds that could be targets of her hypothetical drug before realizing it. In response to her query, a traditional search system returns the database structures most similar to the query graph. She may be surprised, however, if some of the expected targets are missing from the answer set, and may then seek assistance from the system by asking: "Is there another query graph that matches my expected answer set?" The system may then modify her initial query graph so that the new answer set includes the missing answers. In this paper, we study this problem of answering why-not questions in similar graph matching for graph databases.
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498225
Title: Topical influence modeling via topic-level interests and interactions on social curation services
Daehoon Kim, Jae-Gil Lee, B. Lee
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 13-24
Social curation services are emerging social media platforms that enable users to curate their contents by topic and to express their interests at the topic level by following curated collections of other users' contents rather than the users themselves. The topic-level information revealed through this new feature far exceeds what existing methods can solicit from traditional social networking services, and can greatly enhance the quality of topic-sensitive influence modeling. In this paper, we propose a novel model, topical influence with social curation (TISC), to find influential users in social curation services. The model, formulated as a continuous conditional random field, takes full advantage of the explicitly available topic-level information reflected in both contents and interactions. To validate its merits, we comprehensively compare TISC with state-of-the-art models on two real-world data sets collected from Pinterest and Scoop.it. The results show that TISC achieves up to around 80% higher accuracy and finds more convincing results in case studies than the other models. Moreover, we develop a distributed learning algorithm on Spark and demonstrate its excellent scalability on a cluster of 48 cores.
Pub Date: 2016-05-16. DOI: 10.1109/ICDE.2016.7498289
Title: Analyzing data-centric applications: Why, what-if, and how-to
P. Bourhis, Daniel Deutch, Y. Moskovitch
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 779-790
In this paper, we consider the analysis of complex applications that query and update an underlying database in their operation. We focus on three classes of analytical questions that are important for application owners and users alike: Why was a result generated? What would the result be if the application logic or the database were modified in a particular way? How can one interact with the application to achieve a particular goal? Answering these questions efficiently is a fundamental step towards optimizing the application and its use. Noting that provenance has been a key component in answering similar questions for database queries, we develop a provenance-based model and efficient algorithms for these problems in the context of data-centric applications. Novel challenges here include the dynamic update of data combined with the possibly complex workflows that applications allow. We nevertheless achieve theoretical guarantees on the algorithms' performance, and experimentally show their efficiency and usefulness even in the presence of complex applications and large-scale data.
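The provenance idea the abstract builds on can be illustrated for a plain database query (a standard why-provenance sketch, not the paper's model for data-centric applications; the tables and tuple ids are hypothetical): track, for each output tuple, the set of input tuples that produced it.

```python
# Why-provenance for a simple join-then-select query: each output
# row carries the set of input tuple ids that produced it.

orders = [  # (tuple_id, customer, item)
    ("o1", "alice", "book"),
    ("o2", "bob", "pen"),
]
customers = [  # (tuple_id, name, country)
    ("c1", "alice", "FR"),
    ("c2", "bob", "DE"),
]

def query_with_provenance():
    """SELECT item, country FROM orders JOIN customers
       ON customer = name WHERE country = 'FR'."""
    out = []
    for oid, cust, item in orders:
        for cid, name, country in customers:
            if cust == name and country == "FR":
                # Why-provenance: the witness set of input tuples.
                out.append(((item, country), {oid, cid}))
    return out

result = query_with_provenance()
```

With witness sets in hand, "why was ('book', 'FR') produced?" becomes a lookup, and a what-if deletion of tuple c1 simply removes every answer whose witness set mentions c1; the paper's challenge is extending this kind of reasoning to applications that also update the database.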
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498365
A. Magdy, M. Mokbel
Microblogs data, e.g., tweets, reviews, news comments, and social media comments, has gained considerable attention in recent years due to its popularity and rich contents. Nowadays, microblogs applications span a wide spectrum of interests, including detecting and analyzing events, user analysis for geo-targeted ads and political elections, and critical applications like discovering health issues and rescue services. Consequently, major research efforts are spent on analyzing and managing microblogs data to support different applications. In this tutorial, we give a 1.5-hour overview of microblogs data analysis, management, and systems. The tutorial gives a comprehensive review of research efforts that analyze microblogs contents to build new functionality and use cases on top of them. In addition, the tutorial reviews existing research that proposes core data management components to support microblogs queries at scale. Finally, the tutorial reviews system-level issues and ongoing work on supporting microblogs data through the rising big data systems. Through its different parts, the tutorial highlights the challenges and opportunities in microblogs data research.
{"title":"Microblogs data management and analysis","authors":"A. Magdy, M. Mokbel","doi":"10.1109/ICDE.2016.7498365","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498365","url":null,"abstract":"Microblogs data, e.g., tweets, reviews, news comments, and social media comments, has gained considerable attention in recent years due to its popularity and rich contents. Nowadays, microblogs applications span a wide spectrum of interests, including detecting and analyzing events, user analysis for geo-targeted ads and political elections, and critical applications like discovering health issues and rescue services. Consequently, major research efforts are spent on analyzing and managing microblogs data to support different applications. In this tutorial, we give a 1.5-hour overview of microblogs data analysis, management, and systems. The tutorial gives a comprehensive review of research efforts that analyze microblogs contents to build new functionality and use cases on top of them. In addition, the tutorial reviews existing research that proposes core data management components to support microblogs queries at scale. Finally, the tutorial reviews system-level issues and ongoing work on supporting microblogs data through the rising big data systems. Through its different parts, the tutorial highlights the challenges and opportunities in microblogs data research.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"19 1","pages":"1440-1443"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78623556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}