
Latest publications: 2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Optimizing secure classification performance with privacy-aware feature selection
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498242
Erman Pattuk, Murat Kantarcioglu, Huseyin Ulusoy, B. Malin
Recent advances in personalized medicine point towards a future where clinical decision making will depend upon the individual characteristics of the patient, e.g., their age, race, genomic variation, and lifestyle. Already, numerous commercial entities are working towards providing software to support such decisions as cloud-based services. However, deploying such services in these settings raises important privacy challenges. A recent attack shows that disclosing personalized drug dosage recommendations, combined with several pieces of demographic knowledge, can be leveraged to infer single nucleotide polymorphism variants of a patient. One way to prevent such inference is to apply secure multi-party computation (SMC) techniques that hide all patient data, so that no information, including the clinical recommendation, is disclosed during the decision making process. Yet, SMC is a computationally cumbersome process, and disclosing some information may be necessary for various compliance purposes. Additionally, certain information (e.g., demographic information) may already be publicly available. In this work, we provide a novel approach to selectively disclose certain information before the SMC process to significantly improve personalized decision making performance while preserving desired levels of privacy. To achieve this goal, we introduce mechanisms to quickly compute the loss in privacy due to information disclosure while considering its performance impact on the SMC execution phase. Our empirical analysis shows that we can achieve up to three orders of magnitude improvement compared to pure SMC solutions with only a slight increase in privacy risks.
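The selective-disclosure trade-off can be pictured with a toy greedy sketch (not the paper's mechanism; the feature names, costs, and privacy-loss numbers below are all hypothetical): reveal the features that buy the most SMC speed-up per unit of privacy risk, until a fixed privacy budget is exhausted.

```python
# Illustrative greedy sketch: pick features to disclose in the clear,
# maximizing the SMC cost saved while staying under a privacy budget.
def select_disclosures(features, budget):
    """features: dict name -> (smc_cost_saved, privacy_loss)."""
    chosen, spent = [], 0.0
    # Rank candidates by benefit per unit of privacy risk.
    ranked = sorted(features.items(),
                    key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    for name, (saved, loss) in ranked:
        if spent + loss <= budget:
            chosen.append(name)
            spent += loss
    return chosen, spent

# Hypothetical candidates: (SMC cost saved, privacy loss if disclosed).
candidates = {
    "age":      (30.0, 0.10),   # cheap to reveal, decent savings
    "zip_code": (20.0, 0.50),
    "genotype": (80.0, 5.00),   # largest savings, but far too sensitive
}
plan, risk = select_disclosures(candidates, budget=0.7)
```

Under this budget the sketch discloses the low-risk demographics and keeps the genotype inside the SMC phase, which mirrors the intuition that already-public demographic attributes are the natural disclosure candidates.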
Pages: 217-228.
Citations: 5
Efficient handling of concept drift and concept evolution over Stream Data
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498264
Ahsanul Haque, L. Khan, M. Baron, B. Thuraisingham, C. Aggarwal
To decide if an update to a data stream classifier is necessary, existing sliding-window-based techniques monitor classifier performance on recent instances. If there is a significant change in classifier performance, these approaches determine a chunk boundary and update the classifier. However, monitoring classifier performance is costly due to the scarcity of labeled data. In our previous work, we presented a semi-supervised framework, SAND, which uses change detection on classifier confidence to detect a concept drift. Unlike most approaches, it requires only a limited amount of labeled data to detect chunk boundaries and to update the classifier. However, SAND is expensive in terms of execution time due to exhaustive invocation of the change detection module. In this paper, we present an efficient framework, which is based on the same principle as SAND but exploits dynamic programming and executes the change detection module selectively. Moreover, we provide theoretical justification of the confidence calculation and show the effect of a concept drift on subsequent confidence scores. Experimental results show the efficiency of the proposed framework in terms of both accuracy and execution time.
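The core idea of drift detection on classifier confidence can be sketched in a few lines (a minimal illustration with an assumed interface, not SAND's actual change-detection statistic): track the mean confidence over a sliding window and flag a drift when it falls well below the baseline established earlier.

```python
from collections import deque

class ConfidenceDriftDetector:
    """Toy drift detector: flags a drop in mean classifier confidence."""

    def __init__(self, window=50, threshold=0.22):
        self.maxlen = window
        self.window = deque(maxlen=window)   # last `window` confidence scores
        self.baseline = None
        self.threshold = threshold

    def update(self, confidence):
        """Feed one confidence score; return True if a drift is flagged."""
        self.window.append(confidence)
        if len(self.window) < self.maxlen:
            return False
        mean = sum(self.window) / len(self.window)
        if self.baseline is None:
            self.baseline = mean     # first full window sets the baseline
            return False
        if self.baseline - mean > self.threshold:
            self.window.clear()      # restart estimation on the new concept
            self.baseline = None
            return True
        return False

# Confidence stays near 0.9, then drops to 0.4 when the concept changes.
det = ConfidenceDriftDetector(window=10, threshold=0.22)
drifts = [i for i, c in enumerate([0.9] * 20 + [0.4] * 20) if det.update(c)]
```

Because only confidence scores are consumed, no labels are needed at monitoring time; labels are requested only after a drift is flagged, which is the labeled-data economy the abstract describes.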
Pages: 481-492.
Citations: 82
Collaborative analytics for data silos
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498286
Jinkyu Kim, Heonseok Ha, Byung-Gon Chun, Sungroh Yoon, S. Cha
As a great deal of data has been accumulated in various disciplines, the need for the integrative analysis of separate but relevant data sources is becoming more important. Combining data sources can provide global insight that is otherwise difficult to obtain from individual sources. Because of privacy, regulations, and other issues, many large-scale data repositories remain closed off from the outside, raising what has been termed the data silo issue. The huge volume of today's big data often leads to computational challenges, adding another layer of complexity to the solution. In this paper, we propose a novel method called collaborative analytics by ensemble learning (CABEL), which attempts to resolve the main hurdles regarding the silo issue: accuracy, privacy, and computational efficiency. CABEL represents the data stored in each silo as a compact aggregate of samples called the silo signature. The compact representation provides computational efficiency and privacy preservation but makes it challenging to produce accurate analytics. To resolve this challenge, we formulate the problem of attribute domain sampling and reconstruction, and propose a solution called the Chebyshev subset. To model collaborative efforts to analyze semantically linked but structurally disconnected databases, CABEL utilizes a new ensemble learning technique termed the weighted bagging of base classifiers. We demonstrate the effectiveness of CABEL by testing with a nationwide health-insurance data set containing approximately 4,182,000,000 records collected from the entire population of an Organisation for Economic Co-operation and Development (OECD) country in 2012. In our binary classification tests, CABEL achieved median recall, precision, and F-measure values of 89%, 64%, and 76%, respectively, although only 0.001-0.00001% of the original data was used for model construction, while maintaining data privacy and computational efficiency.
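The ensemble side of the approach can be illustrated with a generic weighted-voting sketch (plain weighted bagging over per-silo classifiers; the classifiers and weights below are hypothetical, and CABEL's actual weighting scheme is more involved):

```python
# Toy weighted majority vote over base classifiers, one per silo.
def weighted_vote(classifiers, weights, x):
    scores = {}
    for clf, w in zip(classifiers, weights):
        label = clf(x)
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Three hypothetical base classifiers for a binary task.
clfs = [lambda x: 1 if x > 0 else 0,   # trained on silo A's sample
        lambda x: 1 if x > 5 else 0,   # silo B: stricter decision boundary
        lambda x: 0]                   # silo C: degenerate "always 0"
weights = [0.9, 0.6, 0.2]              # e.g. per-silo validation accuracy
pred_pos = weighted_vote(clfs, weights, 3)
pred_neg = weighted_vote(clfs, weights, -1)
```

Each silo only ships a trained model and a weight, never its raw records, which is how the compact-signature idea sidesteps the silo issue.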
Pages: 743-754.
Citations: 12
Anonymizing collections of tree-structured data
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498404
Olga Gkountouna, Manolis Terrovitis
Collections of real-world data usually have implicit or explicit structural relations. For example, databases link records through foreign keys, and XML documents express associations between different values through syntax. Privacy preservation, until now, has focused either on data with a very simple structure, e.g., relational tables, or on data with a very complex structure, e.g., social network graphs, but has ignored intermediate cases, which are the most frequent in practice. In this work, we focus on tree-structured data. The paper defines k(m;n)-anonymity, which provides protection against identity disclosure, and proposes a greedy anonymization heuristic that is able to sanitize large datasets. The algorithm and the quality of the anonymization are evaluated experimentally.
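The counting idea behind such guarantees can be sketched with a simplified k^m-style check on the *labels* of each record (this ignores tree structure entirely, whereas the paper's k(m;n)-anonymity also constrains structure; it only illustrates what "every attacker-sized combination must be common" means): every combination of at most m labels must occur in at least k records.

```python
from itertools import combinations

def is_km_anonymous(records, k, m):
    """records: iterable of label sets; check the simplified k^m property."""
    for size in range(1, m + 1):
        counts = {}
        for labels in records:
            for combo in combinations(sorted(set(labels)), size):
                counts[combo] = counts.get(combo, 0) + 1
        # Any combination seen fewer than k times identifies its records.
        if any(c < k for c in counts.values()):
            return False
    return True

records = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
ok_strict = is_km_anonymous(records, k=2, m=2)   # label "c" is unique
ok_loose  = is_km_anonymous(records, k=1, m=2)
```

An anonymizer would then generalize or suppress labels (here, merging "c" into a broader label) until the check passes, which is the role of the paper's greedy heuristic.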
Pages: 1520-1521.
Citations: 17
DataXFormer: A robust transformation discovery system
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498319
Ziawasch Abedjan, J. Morcos, I. Ilyas, M. Ouzzani, Paolo Papotti, M. Stonebraker
In data integration, data curation, and other data analysis tasks, users spend a considerable amount of time converting data from one representation to another, for example from US dates to European dates or from airport codes to city names. In a previous vision paper, we presented the initial design of DataXFormer, a system that uses web resources to assist in transformation discovery. Specifically, DataXFormer discovers possible transformations from web tables and web forms and involves human feedback where appropriate. In this paper, we present the full-fledged system along with several extensions. In particular, we present algorithms to find (i) transformations that entail multiple columns of input data, (ii) indirect transformations that are compositions of other transformations, (iii) transformations that are not functions but rather relationships, and (iv) transformations from a knowledge base of public data. We report on experiments with a collection of 120 transformation tasks, and show that our enhanced system automatically covers 101 of them by using openly available resources.
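The basic discovery step can be sketched as example-driven table scoring (an assumed simplification of the web-table case; the example pairs and candidate tables are made up): given a few (input, output) examples, rank candidate two-column tables by how many examples they cover, then use the winner to transform the remaining inputs.

```python
# Score candidate two-column tables by coverage of the example pairs.
def best_table(examples, tables):
    def coverage(table):
        mapping = dict(table)
        return sum(1 for x, y in examples if mapping.get(x) == y)
    return max(tables, key=coverage)

examples = [("JFK", "New York"), ("LHR", "London")]
tables = [
    # Candidate 1: airport code -> city (covers both examples).
    [("JFK", "New York"), ("LHR", "London"), ("CDG", "Paris")],
    # Candidate 2: airport code -> airport name (wrong transformation sense).
    [("JFK", "John F. Kennedy"), ("LHR", "Heathrow")],
]
lookup = dict(best_table(examples, tables))
city = lookup.get("CDG")   # transform an unseen input with the chosen table
```

Human feedback fits naturally here: when several tables tie on coverage, a user confirms which transformation sense was intended.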
Pages: 1134-1145.
Citations: 59
Load balancing and skew resilience for parallel joins
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498250
Aleksandar Vitorovic, Mohammed Elseidy, Christoph E. Koch
We address the problem of load balancing for parallel joins. We show that the distribution of the input data received and the output data produced by worker machines are both important for performance. As a result, previous work, which optimizes either for input or for output, is ineffective for load balancing. To that end, we propose a multi-stage load-balancing algorithm which considers the properties of both input and output data through sampling of the original join matrix. To do this efficiently, we propose a novel category of equi-weight histograms. To build them, we exploit state-of-the-art computational geometry algorithms for rectangle tiling. To our knowledge, we are the first to employ tiling algorithms for join load balancing. In addition, we propose a novel, join-specialized tiling algorithm that has drastically lower time and space complexity than existing algorithms. Experiments show that our scheme outperforms state-of-the-art techniques by up to a factor of 15.
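The defining property of an equi-weight (equi-depth) histogram is easy to demonstrate (the construction below is the naive sort-and-split over a sample, not the paper's tiling algorithm): bucket boundaries are placed so each bucket holds roughly the same number of sampled join keys, so a heavy-hitter key simply spans several buckets instead of overloading one.

```python
# Naive equi-weight histogram: split a sorted sample into equal-count buckets.
def equi_weight_boundaries(sample, buckets):
    data = sorted(sample)
    per = len(data) / buckets
    # Boundary i is the last value of bucket i (buckets - 1 boundaries).
    return [data[int(round(per * i)) - 1] for i in range(1, buckets)]

# Skewed key sample: key 1 is a heavy hitter (half of all occurrences).
skewed = [1] * 50 + list(range(2, 52))
bounds = equi_weight_boundaries(skewed, buckets=4)
```

Note how the heavy key 1 appears as the boundary of two consecutive buckets: an equi-*width* histogram would instead dump all of its weight into a single bucket, which is exactly the imbalance the join matrix tiling is designed to avoid.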
Pages: 313-324.
Citations: 25
Flow-Join: Adaptive skew handling for distributed joins over high-speed networks
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498324
Wolf Rödiger, S. Idicula, A. Kemper, Thomas Neumann
Modern InfiniBand interconnects offer link speeds of several gigabytes per second and a remote direct memory access (RDMA) paradigm for zero-copy network communication. Both are crucial for parallel database systems to achieve scalable distributed query processing where adding a server to the cluster increases performance. However, the scalability of distributed joins is threatened by unexpected data characteristics: Skew can cause a severe load imbalance such that a single server has to process a much larger part of the input than its fair share and thereby slows down the entire distributed query. We introduce Flow-Join, a novel distributed join algorithm that handles attribute value skew with minimal overhead. Flow-Join detects heavy hitters at runtime using small approximate histograms and adapts the redistribution scheme to resolve load imbalances before they impact the join performance. Previous approaches often involve expensive analysis phases, which slow down distributed join processing for non-skewed workloads. This is especially the case for modern high-speed interconnects, which are too fast to hide the extra computation. Other skew handling approaches require detailed statistics, which are often not available or overly inaccurate for intermediate results. In contrast, Flow-Join uses our novel lightweight skew handling scheme to execute at the full network speed of more than 6 GB/s for InfiniBand 4×FDR, joining a skewed input at 11.5 billion tuples/s with 32 servers. This is 6.8× faster than a standard distributed hash join using the same hardware. 
现代InfiniBand互连提供每秒数gb的链路速度和用于零复制网络通信的远程直接内存访问(RDMA)范例。这两者对于并行数据库系统实现可伸缩的分布式查询处理至关重要,在这种情况下,向集群添加服务器可以提高性能。然而,分布式连接的可伸缩性受到意外数据特征的威胁:Skew可能导致严重的负载不平衡,这样单个服务器必须处理比其公平份额大得多的输入,从而减慢整个分布式查询的速度。我们介绍了Flow-Join,一种新的分布式连接算法,以最小的开销处理属性值倾斜。Flow-Join在运行时使用小的近似直方图检测严重的攻击,并在负载不平衡影响连接性能之前调整重新分配方案来解决负载不平衡。以前的方法通常涉及昂贵的分析阶段,这会减慢非倾斜工作负载的分布式连接处理速度。对于现代高速互连来说尤其如此,因为其速度太快而无法隐藏额外的计算。其他歪斜处理方法需要详细的统计数据,这些数据通常无法获得,或者对于中间结果来说过于不准确。相比之下,Flow-Join使用我们新颖的轻量级倾斜处理方案,在InfiniBand上以超过6 GB/s的全网络速度执行4×FDR,以115亿个元组/s的速度连接32台服务器的倾斜输入。这比使用相同硬件的标准分布式散列连接快6.8倍。同时,对于非倾斜的工作负载,Flow-Join不会影响连接性能。
{"title":"Flow-Join: Adaptive skew handling for distributed joins over high-speed networks","authors":"Wolf Rödiger, S. Idicula, A. Kemper, Thomas Neumann","doi":"10.1109/ICDE.2016.7498324","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498324","url":null,"abstract":"Modern InfiniBand interconnects offer link speeds of several gigabytes per second and a remote direct memory access (RDMA) paradigm for zero-copy network communication. Both are crucial for parallel database systems to achieve scalable distributed query processing where adding a server to the cluster increases performance. However, the scalability of distributed joins is threatened by unexpected data characteristics: Skew can cause a severe load imbalance such that a single server has to process a much larger part of the input than its fair share and by this slows down the entire distributed query. We introduce Flow-Join, a novel distributed join algorithm that handles attribute value skew with minimal overhead. Flow-Join detects heavy hitters at runtime using small approximate histograms and adapts the redistribution scheme to resolve load imbalances before they impact the join performance. Previous approaches often involve expensive analysis phases, which slow down distributed join processing for non-skewed workloads. This is especially the case for modern high-speed interconnects, which are too fast to hide the extra computation. Other skew handling approaches require detailed statistics, which are often not available or overly inaccurate for intermediate results. In contrast, Flow-Join uses our novel lightweight skew handling scheme to execute at the full network speed of more than 6 GB/s for InfiniBand 4×FDR, joining a skewed input at 11.5 billion tuples/s with 32 servers. This is 6.8× faster than a standard distributed hash join using the same hardware. 
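Runtime heavy-hitter detection with a small, fixed-size summary can be sketched with the Space-Saving algorithm (one standard small-memory option; the abstract only says "small approximate histograms", so treat this as an illustrative stand-in). Keys whose estimated frequency crosses a threshold would then be redistributed with a skew-resilient scheme instead of plain hash partitioning.

```python
# Space-Saving summary: at most `capacity` counters, frequencies may be
# overestimated but heavy hitters are never missed.
def space_saving(stream, capacity):
    counters = {}
    for key in stream:
        if key in counters:
            counters[key] += 1
        elif len(counters) < capacity:
            counters[key] = 1
        else:
            # Evict the smallest counter; the newcomer inherits its count + 1.
            victim = min(counters, key=counters.get)
            counters[key] = counters.pop(victim) + 1
    return counters

# Hypothetical join-key stream: "us" and "uk" are the skewed keys.
stream = ["us"] * 60 + ["uk"] * 25 + ["de", "fr", "nl"] * 5
summary = space_saving(stream, capacity=3)
heavy = {k for k, c in summary.items() if c > len(stream) * 0.2}
```

With only three counters the summary still isolates the two genuinely heavy keys, which is the memory/accuracy trade-off that makes runtime detection cheap enough not to slow down non-skewed workloads.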
Pages: 1194-1205.
Citations: 65
Discovering interpretable geo-social communities for user behavior prediction
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498303
Hongzhi Yin, Zhiting Hu, Xiaofang Zhou, Hao Wang, Kai Zheng, Nguyen Quoc Viet Hung, S. Sadiq
Social community detection is a growing field of interest in the area of social network applications, and many approaches have been developed, including graph partitioning, latent space models, block models, and spectral clustering. Most existing work focuses purely on network structure information, which is, however, often sparse, noisy, and lacking in interpretability. To improve the accuracy and interpretability of community discovery, we propose to infer users' social communities by incorporating their spatiotemporal data and semantic information. Technically, we propose a unified probabilistic generative model, User-Community-Geo-Topic (UCGT), to simulate the generative process of communities as a result of network proximities, spatiotemporal co-occurrences and semantic similarity. With a well-designed multi-component model structure and a parallel inference implementation to leverage the power of multicores and clusters, our UCGT model is expressive while remaining efficient and scalable to growing large-scale geo-social networking data. We deploy UCGT in two user behavior prediction scenarios: check-in prediction and social interaction prediction. Extensive experiments on two large-scale geo-social networking datasets show that UCGT achieves better performance than existing state-of-the-art comparison methods.
Pages: 942-953.
Citations: 117
A novel, low-latency algorithm for multiple Group-By query optimization
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498249
Duy-Hung Phan, P. Michiardi
Data summarization is essential for users to interact with data. Current state-of-the-art algorithms for optimizing its most general form, multiple Group-By queries, have limited scalability. In this paper, we propose a novel algorithm, Top-Down Splitting, that scales to hundreds or even thousands of attributes and queries, and that quickly and efficiently produces optimized query execution plans. We analyze the complexity of our algorithm and empirically evaluate its scalability and effectiveness through an experimental campaign. Results show that our algorithm is remarkably faster than alternatives from prior work, while generally producing better solutions. Ultimately, our algorithm reduces query execution time by up to 34% compared to un-optimized plans.
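The abstract does not spell out Top-Down Splitting itself, but the optimization it targets rests on a standard observation: a coarser Group-By can be re-aggregated from a finer one instead of rescanning the base data, so a good execution plan shares work across the whole query set. A minimal sketch of that reuse, with invented data and columns:

```python
# Illustrative sketch, not the paper's algorithm: GROUP BY (city) is derived
# from the already-computed GROUP BY (city, year) rather than from base rows.
from collections import defaultdict

rows = [("NYC", 2015, 3), ("NYC", 2016, 5), ("SF", 2016, 2)]  # (city, year, sales)

def group_sum(data, key_len):
    """Sum the last field, grouping on the first `key_len` fields."""
    out = defaultdict(int)
    for row in data:
        out[row[:key_len]] += row[-1]
    return dict(out)

fine = group_sum(rows, 2)                              # GROUP BY city, year
coarse = group_sum([k + (v,) for k, v in fine.items()], 1)  # GROUP BY city, reusing `fine`
print(coarse)  # {('NYC',): 8, ('SF',): 2}
```

An optimizer for many Group-By queries chooses which groupings to materialize and which to derive from them; the paper's contribution is doing that plan search quickly at the scale of hundreds or thousands of attributes.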
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 301-312.
Citations: 5
ORLF: A flexible framework for online record linkage and fusion
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498349
E. Rezig, Eduard Constantin Dragut, M. Ouzzani, A. Elmagarmid, Walid G. Aref
With the exponential growth of data on the Web comes the opportunity to integrate multiple sources to give more accurate answers to user queries. Upon retrieving records from multiple Web databases, a key task is to merge records that refer to the same real-world entity. We demonstrate ORLF (Online Record Linkage and Fusion), a flexible query-time record linkage and fusion framework. ORLF deduplicates newly arriving query results jointly with previously processed query results. We use an iterative caching solution that leverages query locality to efficiently deduplicate incoming records against cached records. ORLF aims to deliver timely query answers that are duplicate-free and reflect knowledge collected from previous queries.
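ORLF's actual matching and fusion logic is not detailed in the abstract. As a hedged sketch of the caching idea it describes, the toy below deduplicates each incoming batch of query results against records cached from earlier queries, fusing duplicates by filling in missing fields. The normalized-name key and the record schema are assumptions for illustration.

```python
# Toy query-time dedup cache, loosely inspired by the idea in the abstract.
class DedupCache:
    def __init__(self):
        self.cache = {}  # normalized key -> fused record

    def _key(self, rec):
        # Assumed blocking key: case- and whitespace-normalized name.
        return rec["name"].strip().lower()

    def add_results(self, records):
        """Merge a batch of query results; return only previously unseen records."""
        fresh = []
        for rec in records:
            k = self._key(rec)
            if k in self.cache:
                # Fuse: fill in fields the cached record is missing.
                for field, value in rec.items():
                    self.cache[k].setdefault(field, value)
            else:
                self.cache[k] = dict(rec)
                fresh.append(self.cache[k])
        return fresh

cache = DedupCache()
cache.add_results([{"name": "Ada Lovelace"}])
out = cache.add_results([{"name": "ada lovelace ", "born": 1815},
                         {"name": "Alan Turing"}])
print([r["name"] for r in out])  # ['Alan Turing'] -- the Lovelace record was fused
```

A real system would use approximate matching and a principled fusion policy rather than an exact normalized key, but the query-locality benefit is the same: records repeated across successive queries hit the cache instead of being re-linked.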
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1378-1381.
Citations: 6