
2020 IEEE 36th International Conference on Data Engineering (ICDE): Latest Papers

Efficient Structural Clustering in Large Uncertain Graphs
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00215
Yongjiang Liang, Tingting Hu, Peixiang Zhao
Clustering uncertain graphs based on the probabilistic graph model has sparked extensive research and widely varying applications. Existing structural clustering methods rely heavily on the computation of pairwise reliable structural similarity between vertices, which has proven to be extremely costly, especially in large uncertain graphs. In this paper, we develop a new, decomposition-based method, ProbSCAN, for efficient reliable structural similarity computation with theoretically improved complexity. We further design a cost-effective index structure, UCNO-Index, and a series of powerful pruning strategies to expedite reliable structural similarity computation in uncertain graphs. Experimental studies on eight real-world uncertain graphs demonstrate the effectiveness of our proposed solutions, which achieve orders-of-magnitude improvements in clustering efficiency compared with the state-of-the-art structural clustering methods in large uncertain graphs.
Pages: 1966-1969
Citations: 7
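The abstract above hinges on pairwise "reliable structural similarity" in an uncertain graph. The sketch below is only an illustration of why the naive computation is costly, not ProbSCAN itself: it assumes the SCAN-style similarity on closed neighborhoods, assumes independent edge probabilities, and estimates the probability that the similarity reaches a threshold `eps` by sampling possible worlds. All function names are hypothetical.

```python
import math
import random

def structural_similarity(adj, u, v):
    """SCAN-style similarity on a deterministic graph, using closed neighborhoods."""
    nu = adj.get(u, set()) | {u}
    nv = adj.get(v, set()) | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def sample_world(prob_edges, rng):
    """Draw one possible world: keep each edge independently with its probability."""
    adj = {}
    for (a, b), p in prob_edges.items():
        if rng.random() < p:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def estimate_reliable_similarity(prob_edges, u, v, eps, samples=2000, seed=0):
    """Estimate Pr[similarity(u, v) >= eps] by naive sampling over possible worlds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        world = sample_world(prob_edges, rng)
        if structural_similarity(world, u, v) >= eps:
            hits += 1
    return hits / samples

# Hypothetical toy graph: a triangle whose edges each exist with probability 0.8.
edges = {("a", "b"): 0.8, ("b", "c"): 0.8, ("a", "c"): 0.8}
estimate = estimate_reliable_similarity(edges, "a", "b", eps=0.7)
```

Every vertex pair needs many sampled worlds here, which is the kind of cost that the decomposition, index, and pruning ideas in the abstract aim to avoid.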
SUDAF: Sharing User-Defined Aggregate Functions
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00161
Chao Zhang, F. Toumani, B. Doreau
We present SUDAF (Sharing User-Defined Aggregate Functions), a declarative framework that allows users to formulate UDAFs as mathematical expressions and use them in SQL statements. SUDAF rewrites partial aggregates of UDAFs using built-in aggregate functions and supports efficient dynamic caching and reuse of partial aggregates. Our evaluation shows that using SUDAF on top of Spark SQL can yield one to two orders of magnitude improvement in query execution times compared to the original Spark SQL.
Citations: 2
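The idea in the SUDAF abstract of expressing UDAFs through shareable partial aggregates can be illustrated with a small sketch. This is not SUDAF's rewriting engine or API; it only assumes that a geometric mean and a population variance can both be derived from four built-in-style partial aggregates (count, sum, sum of squares, sum of logarithms), so one cached set of partials serves both functions. All names are hypothetical.

```python
import math

def partial_aggregates(values):
    """Built-in-style partial aggregates computed in one pass over a partition."""
    return {
        "count": len(values),
        "sum": sum(values),
        "sum_sq": sum(x * x for x in values),
        "sum_ln": sum(math.log(x) for x in values),  # assumes positive values
    }

def merge(p, q):
    """Partial aggregates of two partitions combine by plain addition."""
    return {key: p[key] + q[key] for key in p}

def geometric_mean(p):
    """Geometric mean expressed over the shared partials."""
    return math.exp(p["sum_ln"] / p["count"])

def variance(p):
    """Population variance expressed over the same shared partials."""
    mean = p["sum"] / p["count"]
    return p["sum_sq"] / p["count"] - mean * mean

# Partials are computed once per partition, merged, then reused by both UDAFs.
cached = merge(partial_aggregates([1.0, 2.0]), partial_aggregates([4.0, 8.0]))
gm = geometric_mean(cached)
var = variance(cached)
```

Because the merged dictionary is all either function needs, a later query asking for a different UDAF over the same data would not have to rescan it.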
Outdated Fact Detection in Knowledge Bases
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00196
Shuang Hao, Chengliang Chai, Guoliang Li, N. Tang, Ning Wang, Xiang Yu
Knowledge bases (KBs), which store high-quality information, are crucial for many applications, such as enhancing search results and serving as external sources for data cleaning. Not surprisingly, most KBs contain outdated facts because information changes rapidly, so it is important to keep KBs up-to-date. Prior work has investigated the problem of using reference data (such as new facts extracted from the news) to detect outdated facts in KBs. However, existing approaches can only cover a small percentage of facts in KBs. In this paper, we propose a novel human-in-the-loop approach for outdated fact detection in KBs. It trains a binary classifier using features such as historical update frequency and existence time of a fact to compute the likelihood that a fact in a KB is outdated. Then, it interacts with humans to verify whether a fact with high likelihood is indeed outdated. In addition, it uses logical rules to detect more outdated facts based on human feedback. The outdated facts detected by the logical rules are also fed back to further train the ML model as data augmentation. Extensive experiments on real-world KBs, such as Yago and DBpedia, show the effectiveness of our solution.
Pages: 1890-1893
Citations: 10
A Class of R*-tree Indexes for Spatial-Visual Search of Geo-tagged Street Images
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00221
Abdullah Alfarrarjeh, S. H. Kim, V. Hegde, Akshansh, C. Shahabi, Q. Xie, S. Ravada
Due to the prevalence of GPS-equipped cameras (e.g., smartphones and surveillance cameras), massive amounts of geo-tagged images capturing urban streets are increasingly being collected. Consequently, many smart city applications have emerged, relying on efficient image search. Such searches include spatial-visual queries, in which spatial and visual properties are used in tandem to retrieve images similar to a given query image within a given geographical region. To this end, new index structures that organize images based on both spatial and visual properties are needed to execute such queries efficiently. Based on our observation that street images are typically similar in the same spatial locality, index structures for spatial-visual queries can be effectively built on a spatial index (i.e., R*-tree). Therefore, we propose a class of R*-tree indexes that associate each node with two separate minimum bounding rectangles (MBRs), one for the spatial and the other for the (dimension-reduced) visual properties of the contained images, and that adapt the R*-tree optimization criteria to both property types.
Pages: 1990-1993
Citations: 8
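The two-MBR node described in the abstract above can be pictured with a minimal sketch. It is not the paper's index: it only assumes that a spatial-visual range query carries a spatial rectangle, a (dimension-reduced) feature vector, and a Euclidean similarity radius, and that a node can be skipped when either of its bounding rectangles rules out a match. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class MBR:
    low: Sequence[float]
    high: Sequence[float]

    def intersects(self, other):
        """True when the two boxes overlap in every dimension."""
        return all(
            l <= oh and ol <= h
            for l, h, ol, oh in zip(self.low, self.high, other.low, other.high)
        )

    def min_dist(self, point):
        """Smallest Euclidean distance from point to any location inside the box."""
        return sum(
            max(l - x, 0.0, x - h) ** 2
            for l, h, x in zip(self.low, self.high, point)
        ) ** 0.5

@dataclass
class HybridNode:
    spatial: MBR  # bounds the locations of the images under this node
    visual: MBR   # bounds their dimension-reduced visual features

def can_prune(node, query_region, query_feature, radius):
    """Skip the node if no contained image can satisfy both query predicates."""
    if not node.spatial.intersects(query_region):
        return True
    return node.visual.min_dist(query_feature) > radius

node = HybridNode(spatial=MBR([0.0, 0.0], [1.0, 1.0]), visual=MBR([0.0, 0.0], [1.0, 1.0]))
region = MBR([0.5, 0.5], [2.0, 2.0])
keep = not can_prune(node, region, query_feature=[1.5, 0.5], radius=0.6)
```

Holding both rectangles in one node is what lets a single tree traversal apply spatial and visual pruning together.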
Graph Embeddings for One-pass Processing of Heterogeneous Queries
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00222
Chi Thang Duong, Hongzhi Yin, Dung Hoang, Minn Hung Nguyen, M. Weidlich, Quoc Viet Hung Nguyen, K. Aberer
Effective information retrieval (IR) relies on the ability to comprehensively capture a user's information needs. Traditional IR systems are limited to homogeneous queries that define the information to retrieve by a single modality. Support for heterogeneous queries that combine different modalities has been proposed recently. Yet, existing approaches for heterogeneous querying are computationally expensive, as they require several passes over the data to construct a query answer. In this paper, we propose an IR system that overcomes the computational challenges imposed by heterogeneous queries by adopting graph embeddings. Specifically, we propose graph-based models in which both data and queries incorporate information of different modalities. Then, we show how either representation is transformed into a graph embedding in the same space, capturing relations between information of different modalities. By grounding query processing in graph embeddings, we enable processing of heterogeneous queries with a single pass over the data representation. Our experiments on several real-world and synthetic datasets illustrate that our technique is able to return twice the amount of relevant information in comparison with several baselines, while being scalable to large-scale data.
Pages: 1994-1997
Citations: 6
Answering Skyline Queries over Incomplete Data with Crowdsourcing (Extended Abstract)
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00235
Xiaoye Miao, Yunjun Gao, Su Guo, Lu Chen, Jianwei Yin, Qing Li
Due to the pervasiveness of incomplete data, incomplete data queries are vital in a large number of real-life scenarios. Current models and approaches for incomplete data queries rely mainly on machine power. In this paper, we study the problem of skyline queries over incomplete data with crowdsourcing. We propose a novel query framework, termed BayesCrowd, built on top of a Bayesian network and the typical c-table model for incomplete data. Considering budget and latency constraints, we present a suite of effective task selection strategies. In particular, since computing the probability of each object being an answer object is at least as hard as the #SAT problem, we propose an adaptive DPLL (i.e., Davis-Putnam-Logemann-Loveland) algorithm to speed up the computation. Extensive experiments using both real and synthetic data sets confirm the superiority of BayesCrowd over the state-of-the-art method.
Pages: 2032-2033
Citations: 0
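To make concrete what "the probability of each object being an answer object" means in the abstract above, here is a brute-force sketch. It is not BayesCrowd and uses neither a Bayesian network nor DPLL: it assumes smaller values are better, assumes each missing cell has an independent discrete distribution, and simply enumerates every completion. The exponential enumeration is what makes the exact computation hard in general. All names are hypothetical.

```python
from itertools import product

def dominates(p, q):
    """p dominates q when p is no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline_probability(objects, target, distributions):
    """Probability that objects[target] belongs to the skyline.

    objects: list of tuples, with None marking a missing value.
    distributions: {(object index, dimension): {value: probability}} per missing cell.
    """
    cells = list(distributions)
    total = 0.0
    # Enumerate every completion of the missing cells (exponential; illustration only).
    for values in product(*(distributions[c].items() for c in cells)):
        world = [list(o) for o in objects]
        weight = 1.0
        for (idx, dim), (value, prob) in zip(cells, values):
            world[idx][dim] = value
            weight *= prob
        t = world[target]
        if not any(dominates(o, t) for i, o in enumerate(world) if i != target):
            total += weight
    return total

# Hypothetical data: object 0 is missing its second attribute.
objects = [(1, None), (2, 2)]
dists = {(0, 1): {1: 0.5, 3: 0.5}}
p_second = skyline_probability(objects, 1, dists)
```

In this toy case object 1 survives only in the world where the missing value is 3, which is where asking a crowd worker to fill in that single cell would settle the answer.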
PrefixFPM: A Parallel Framework for General-Purpose Frequent Pattern Mining
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00208
Da Yan, Wenwen Qu, Guimu Guo, Xiaoling Wang
Frequent pattern mining (FPM) has been a focused theme in data mining research for decades, but there is no general programming framework that can be easily customized to mine different kinds of frequent patterns, and existing solutions to FPM over big transaction databases are IO-bound, leaving CPU cores underutilized even though FPM is NP-hard. This paper presents PrefixFPM, a general-purpose framework for FPM that is able to fully utilize the CPU cores in a multicore machine. PrefixFPM follows the idea of prefix projection to partition the workloads of FPM into independent tasks by divide and conquer. PrefixFPM exposes a unified programming interface to users, who can customize it to mine their desired patterns, and the parallel execution engine is transparent to end-users and can be reused for mining all kinds of patterns. We have adapted the state-of-the-art serial algorithms for mining frequent patterns, including subsequences, subtrees, and subgraphs, on top of PrefixFPM, and extensive experiments demonstrate an excellent speedup ratio of PrefixFPM with the number of cores. A demo is available at https://youtu.be/PfioC0GDpsw; the code is available at https://github.com/yanlab19870714/PrefixFPM.
Pages: 1938-1941
Citations: 16
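The prefix-projection idea mentioned in the PrefixFPM abstract can be sketched for the simplest pattern type, subsequences of single items. This is a serial, PrefixSpan-style illustration and not PrefixFPM's programming interface or its parallel engine; the function name and data layout are assumptions. The point it shows is that each recursive call depends only on its own prefix and projected database, so the calls could be scheduled as independent tasks.

```python
def prefix_span(db, min_support):
    """Return {pattern tuple: support} for frequent subsequences of single items."""
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start offset) pairs: the suffixes
        # that remain after matching the current prefix.
        counts = {}
        for sid, start in projected:
            for item in set(db[sid][start:]):
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project on the first occurrence of item in each remaining suffix.
            child = []
            for sid, start in projected:
                seq = db[sid]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        child.append((sid, pos + 1))
                        break
            # This call touches only its own prefix and projection, so it is
            # the unit of work that a parallel engine could run as a task.
            mine(pattern, child)

    mine((), [(sid, 0) for sid in range(len(db))])
    return results

patterns = prefix_span(["abc", "ac", "bc"], min_support=2)
```

Here `patterns` maps each frequent subsequence to the number of input sequences containing it.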
Automated Anomaly Detection in Large Sequences
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00182
Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas
Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, current approaches have severe limitations: they either require prior domain knowledge, or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In this work, we address these problems and propose NorM, a novel approach suitable for domain-agnostic anomaly detection. NorM is based on a new data series primitive, which makes it possible to detect anomalies based on their (dis)similarity to a model that represents normal behavior. The experimental results on several real datasets demonstrate that the proposed approach outperforms the current state-of-the-art algorithms by a large margin in terms of accuracy, while being orders of magnitude faster.
Pages: 1834-1837
Citations: 42
Computing Mutual Information of Big Categorical Data and Its Application to Feature Grouping
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00210
Junli Li, Chaowei Zhang, Jifu Zhang, X. Qin
This paper develops a parallel computing system, MiCS, for computing the mutual information of big categorical data on the Spark computing platform. By applying a column-wise transformation scheme, the MiCS algorithm is conducive to handling the large volume of highly repetitive mutual-information calculations among feature pairs. To improve the efficiency of MiCS and the utilization of Spark cluster resources, we adopt a virtual partitioning scheme that balances the load while mitigating the data skewness problem in the Spark shuffle process.
Pages: 1946-1949
Citations: 2
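The quantity that the MiCS abstract above computes for every feature pair is the standard mutual information of two categorical variables, I(X;Y) = Σ p(x,y) log( p(x,y) / (p(x) p(y)) ). The single-machine sketch below only shows that per-pair calculation from value counts; it does not reproduce the column-wise transformation or the virtual partitioning on Spark, and the function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits for two equally long categorical columns."""
    n = len(xs)
    px = Counter(xs)             # marginal counts of X
    py = Counter(ys)             # marginal counts of Y
    pxy = Counter(zip(xs, ys))   # joint counts of (X, Y)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

# Y is fully determined by X here, so the columns share one full bit.
dependent = mutual_information(["a", "a", "b", "b"], ["u", "u", "v", "v"])
# Every (x, y) combination is equally likely here, so nothing is shared.
independent = mutual_information(["a", "a", "b", "b"], ["u", "v", "u", "v"])
```

With d features there are d(d-1)/2 such pairs, each needing its own joint counts, which is the repetitive workload the abstract refers to.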
Adaptive Network Alignment with Unsupervised and Multi-order Convolutional Networks
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00015
T. T. Huynh, Vinh Tong, T. Nguyen, Hongzhi Yin, M. Weidlich, Nguyen Quoc Viet Hung
Network alignment is the problem of pairing nodes between two graphs such that the paired nodes are structurally and semantically similar. A well-known application of network alignment is to identify which accounts in different social networks belong to the same person. Existing alignment techniques, however, lack scalability, cannot incorporate multi-dimensional information without training data, and are limited in the consistency constraints enforced by an alignment. In this paper, we propose a fully unsupervised network alignment framework based on a multi-order embedding model. The model learns the embeddings of each node using a graph convolutional neural representation, which we prove to satisfy consistency constraints. We further design a data augmentation method and a refinement mechanism to make the model adaptive to consistency violations and noise. Extensive experiments on real and synthetic datasets show that our model outperforms state-of-the-art alignment techniques. We also demonstrate the robustness of our model against adversarial conditions, such as structural noise, attribute noise, and graph size imbalance, as well as its low sensitivity to hyper-parameters.
Pages: 85-96
Citations: 52
Journal
2020 IEEE 36th International Conference on Data Engineering (ICDE)