
Latest publications: 2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Input selection for fast feature engineering
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498272
Michael R. Anderson, Michael J. Cafarella
The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features: signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system's success depends substantially on the quality of its features. Unfortunately, feature engineering, the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm, is a tedious, time-consuming experience. Because “big data” inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We introduce Zombie, a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the “inner loop” of the feature engineering process. Our system yields feature evaluation speedups of up to 8× in some cases and reduces engineer wait times from 8 to 5 hours in others.
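Optimizing the "inner loop" amounts to spending a fixed evaluation budget on the most promising inputs first. A minimal sketch of that idea (not the actual Zombie system: the epsilon-greedy policy, the input-group structure, and the `evaluate` callback are illustrative assumptions):

```python
import random
from collections import defaultdict

def select_inputs(groups, evaluate, budget, epsilon=0.1):
    """Epsilon-greedy input selection: spend the evaluation budget on the
    input groups whose items have historically been most useful.
    `groups` maps a group id to a list of raw inputs; `evaluate(item)`
    returns a usefulness score (e.g. loss reduction from the new feature)."""
    rewards = defaultdict(list)
    processed = []
    for _ in range(budget):
        if rewards and random.random() > epsilon:
            # Exploit: pick the group with the best mean reward seen so far.
            gid = max(rewards, key=lambda g: sum(rewards[g]) / len(rewards[g]))
        else:
            # Explore: pick a group uniformly at random.
            gid = random.choice(list(groups))
        if not groups[gid]:
            continue  # this group is exhausted; budget unit is wasted
        item = groups[gid].pop()
        rewards[gid].append(evaluate(item))
        processed.append(item)
    return processed
```

Instead of scanning the whole corpus, the feature code is run only on the `processed` subset, which is what produces the wait-time reduction the abstract reports.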
Citations: 39
Authentication of function queries
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498252
Guolei Yang, Ying Cai, Zhenbi Hu
Consider a database where each record represents a math function. A third party is in charge of processing queries over this database and we want to provide a mechanism for users to verify the correctness of their query results. Here each query, referred to as a Function Query (FQ), retrieves the functions whose computation results with user-supplied arguments satisfy certain conditions (e.g., within a certain range). We present authentication solutions that work on a variety of functions, including univariate linear functions, multivariate linear functions, and multivariate high-degree functions. Our solutions are based on the fact that the functions can be sorted in the subdomains defined by their intersections and thus can be chained to produce a signature mesh for query result verification. We study the performance of the proposed techniques through theoretical analysis, simulation, and empirical study, and report the results in this paper.
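The chaining idea can be illustrated with a toy construction (not the paper's actual scheme): within a subdomain containing no intersections, the functions have a fixed order by value, so hashing each function together with its successor lets a client check that a range result is a contiguous, complete run. A real system would sign these digests; here they are bare hashes.

```python
import hashlib

def h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def build_chain(funcs, x):
    """Within a subdomain containing x (no intersections), the order of the
    functions by value is fixed; chain each function to its successor so a
    contiguous run of the chain proves completeness of a range result."""
    order = sorted(funcs, key=lambda f: f(x))
    sigs = {}
    for lower, upper in zip(order, order[1:]):
        # A real scheme would sign these digests; we only hash the pair.
        sigs[id(lower)] = h(f"{lower(x)}|{upper(x)}")
    return order, sigs

def verify_range(result, sigs, x, lo, hi):
    """Check the returned functions form a chained, sorted, in-range run."""
    vals = [f(x) for f in result]
    if vals != sorted(vals) or any(v < lo or v > hi for v in vals):
        return False
    return all(
        sigs.get(id(a)) == h(f"{a(x)}|{b(x)}")
        for a, b in zip(result, result[1:])
    )
```

Omitting a function from the middle of the result breaks the hash chain, which is the completeness property the signature mesh provides.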
Citations: 11
Recommendations meet web browsing: enhancing collaborative filtering using internet browsing logs
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498327
Royi Ronen, E. Yom-Tov, G. Lavee
Collaborative filtering (CF) recommendation systems are one of the most popular and successful methods for recommending products to people. CF systems work by finding similarities between different people according to their past purchases, and using these similarities to suggest possible items of interest. In this work we show that CF systems can be enhanced using Internet browsing data and search engine query logs, both of which represent a rich profile of individuals' interests.
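As a rough illustration of the enhancement (the blending weight `alpha` and the sparse-vector representation are assumptions for this sketch, not the paper's model), purchase-based user similarity can be blended with a browsing-log similarity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {key: weight} vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def user_similarity(purchases, browsing, user_a, user_b, alpha=0.5):
    """Blend purchase-based similarity with browsing-log similarity.
    `purchases[u]` and `browsing[u]` are sparse {item_or_url: weight}
    dicts; `alpha` is a hypothetical knob weighting the two signals."""
    return (alpha * cosine(purchases[user_a], purchases[user_b])
            + (1 - alpha) * cosine(browsing[user_a], browsing[user_b]))
```

The browsing term lets the system find neighbors for users with sparse purchase histories, which is where the extra signal helps most.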
Citations: 16
Moolle: Fan-out control for scalable distributed data stores
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498325
Sun-Yeong Cho, A. Carter, J. Ehrlich, J. A. Jan
Many Online Social Networks horizontally partition data across data stores. This allows the addition of server nodes to increase capacity and throughput. For single-key lookup queries such as computing a member's 1st degree connections, clients need to generate only one request to one data store. However, for multi-key lookup queries such as computing a 2nd degree network, clients need to generate multiple requests to multiple data stores. The number of requests to fulfill the multi-key lookup queries grows in relation to the number of partitions. Increasing the number of server nodes in order to increase capacity also increases the number of requests between the client and data stores. This may increase query response latency because of network congestion, tail latency, and CPU-bound processing. Replication-based partitioning strategies can reduce the number of requests in the multi-key lookup queries. However, reducing the number of requests in a query can degrade the performance of certain queries where processing, computing, and filtering can be done by the data stores. A better system would provide the capability of controlling the number of requests in a query. This paper presents Moolle, a system for controlling the number of requests in queries to scalable distributed data stores. Moolle has been implemented in the LinkedIn distributed graph service that serves hundreds of thousands of social graph traversal queries per second. We believe that Moolle can be applied to other distributed systems that handle distributed data processing with a high volume of variable-sized requests.
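Fan-out control with replication can be sketched as a greedy cover: among the replicas holding the requested keys, repeatedly contact the node that serves the most still-missing keys. This is an illustrative sketch under that assumption, not Moolle's actual routing algorithm:

```python
from collections import defaultdict

def plan_requests(keys, replicas, max_fanout=None):
    """Greedy fan-out control: given `replicas[key] -> set of nodes
    holding that key`, repeatedly pick the node covering the most
    uncovered keys, so a multi-key query touches as few nodes as possible."""
    remaining = set(keys)
    plan = defaultdict(set)  # node -> keys fetched from that node
    while remaining:
        holders = defaultdict(set)
        for k in remaining:
            for node in replicas[k]:
                holders[node].add(k)
        node, covered = max(holders.items(), key=lambda kv: len(kv[1]))
        plan[node] |= covered
        remaining -= covered
        if max_fanout and len(plan) >= max_fanout and remaining:
            raise ValueError("cannot satisfy query within the fan-out budget")
    return dict(plan)
```

With more replicas per key the greedy cover needs fewer nodes, which is exactly the trade-off between fan-out and server-side filtering that the abstract describes.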
Citations: 4
TRANSFORMERS: Robust spatial joins on non-uniform data distributions
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498280
Mirjana Pavlovic, T. Heinis, F. Tauheed, Panagiotis Karras, A. Ailamaki
Spatial joins are becoming increasingly ubiquitous in many applications, particularly in the scientific domain. While several approaches have been proposed for joining spatial datasets, each of them is strongest for a particular density ratio between the joined datasets. More generally, no single proposed method can efficiently join two spatial datasets in a manner that is robust to their data distributions. Some approaches do well for datasets with contrasting densities while others do better with similar densities. None of them does well when the datasets have locally divergent data distributions. In this paper we develop TRANSFORMERS, an efficient and robust spatial join approach that is indifferent to such variations of distribution among the joined data. TRANSFORMERS achieves this feat by departing from the state of the art through adapting the join strategy and data layout to local density variations among the joined data. It employs a join method based on data-oriented partitioning when joining areas of substantially different local densities, whereas it uses big partitions (as in space-oriented partitioning) when the densities are similar, while seamlessly switching between these two strategies at runtime. We experimentally demonstrate that TRANSFORMERS outperforms state-of-the-art approaches by a factor of between 2 and 8.
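The runtime strategy switch can be caricatured as a per-region decision on local cardinalities (the threshold value is an illustrative knob, not a number from the paper):

```python
def choose_join_strategy(count_r, count_s, ratio_threshold=4.0):
    """Pick a per-region join strategy from the local cardinalities of
    the two inputs, in the spirit of adapting to local density
    variations: strongly skewed regions use data-oriented partitioning
    (index the sparser side, probe with the denser one); balanced
    regions use a space-oriented grid join."""
    if count_r == 0 or count_s == 0:
        return "skip"  # one side is empty in this region: no join work
    ratio = max(count_r, count_s) / min(count_r, count_s)
    return "data-oriented" if ratio >= ratio_threshold else "space-oriented"
```

Making this decision per region, rather than once per dataset, is what lets the join stay robust when the two inputs diverge only locally.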
Citations: 8
TOPIC: Toward perfect Influence Graph Summarization
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498314
Lei Shi, Sibai Sun, Yuan Xuan, Yue Su, Hanghang Tong, Shuai Ma, Yang Chen
Summarizing large influence graphs is crucial for many graph visualization and mining tasks. Classical graph clustering and compression algorithms focus on summarizing the nodes by their structural-level or attribute-level similarities, but usually are not designed to characterize the flow-level pattern which is the centerpiece of influence graphs. On the other hand, social influence analysis has been intensively studied, but little has been done on the summarization problem without an explicit focus on social networks. Building on the recent study of Influence Graph Summarization (IGS), this paper presents a new perspective on the underlying flow-based heuristic. It establishes a direct linkage between the optimal summarization and the classic eigenvector centrality of the graph nodes. Such a theoretic linkage has important implications on numerous aspects in the pursuit of a perfect influence graph summarization. In particular, it enables us to develop a suite of algorithms that can: 1) achieve a near-optimal IGS objective, 2) support dynamic summarizations balancing the IGS objective and the stability of transition in navigating the summarization, and 3) scale to million-node graphs with a near-linear computational complexity. Both quantitative experiments on real-world citation networks and the user studies on the task analysis experience demonstrate the effectiveness of the proposed summarization algorithms.
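The eigenvector centrality that the paper links to the optimal summarization can be computed by standard power iteration; a minimal sketch for a small undirected graph (assuming a connected, non-bipartite graph so the iteration converges):

```python
def eigenvector_centrality(adj, iters=100, tol=1e-9):
    """Power iteration for the principal eigenvector of the adjacency
    structure `adj` ({node: list of neighbors}), normalized so the
    largest score is 1.0."""
    nodes = list(adj)
    score = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Each node's new score is the sum of its neighbors' scores.
        new = {n: sum(score[m] for m in adj[n]) for n in nodes}
        norm = max(new.values()) or 1.0
        new = {n: v / norm for n, v in new.items()}
        delta = max(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < tol:
            break
    return score
```

In the paper's framing, ranking nodes by this score identifies the nodes a flow-preserving summary should keep.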
Citations: 7
Self-Adaptive Linear Hashing for solid state drives
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498260
Chengcheng Yang, Peiquan Jin, Lihua Yue, Dezhi Zhang
Flash memory based solid state drives (SSDs) have emerged as a new alternative to replace magnetic disks due to their high performance and low power consumption. However, random writes on SSDs are much slower than SSD reads. Therefore, traditional index structures, which are designed based on the symmetrical I/O property of magnetic disks, cannot fully exploit the high performance of SSDs. In this paper, we propose an SSD-optimized linear hashing index called Self-Adaptive Linear Hashing (SAL-Hashing) to reduce small random writes to SSDs that are caused by index operations. The contributions of our work are manifold. First, we propose to organize buckets into groups and sets to facilitate coarse-grained writes and lazy-split so as to avoid intermediate writes on the hash structure. A group consists of a fixed number of buckets and a set consists of a number of groups. Second, we attach a log region to each set, and amortize the cost of reads and writes by committing updates to the log region in batch. Third, in order to reduce search cost, each log region is equipped with Bloom filters to index update logs. We devise a cost-based online algorithm to adaptively merge the log region with the corresponding set when the set becomes search-intensive. Finally, in order to exploit the internal package-level parallelism of SSDs, we apply coarse-grained writes for merging or split operations to achieve a high bandwidth. Our experimental results suggest that our proposal is self-adaptive according to the change of access patterns, and outperforms several competitors under various workloads on two commodity SSDs.
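A toy model of one set with its log region and Bloom filter, greatly simplified relative to SAL-Hashing (the flush threshold, in-memory structures, and hash choices are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter used to screen lookups against the log region."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0
    def _positions(self, key):
        d = hashlib.md5(str(key).encode()).digest()
        return [int.from_bytes(d[i * 4:(i + 1) * 4], 'big') % self.size
                for i in range(self.hashes)]
    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p
    def may_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

class SetWithLog:
    """A set of hash buckets plus a log region: updates are buffered and
    flushed in one coarse write; the Bloom filter screens log lookups."""
    def __init__(self, flush_threshold=4):
        self.buckets = {}
        self.log = []
        self.bloom = BloomFilter()
        self.flush_threshold = flush_threshold
    def put(self, key, value):
        self.log.append((key, value))
        self.bloom.add(key)
        if len(self.log) >= self.flush_threshold:
            self.flush()
    def get(self, key):
        if self.bloom.may_contain(key):
            for k, v in reversed(self.log):  # newest buffered update wins
                if k == key:
                    return v
        return self.buckets.get(key)
    def flush(self):
        self.buckets.update(dict(self.log))  # one batched (coarse) write
        self.log.clear()
        self.bloom = BloomFilter()
```

Batching many small updates into one flush is what converts the index's small random writes into the coarse-grained writes that SSDs handle well.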
Citations: 8
Cruncher: Distributed in-memory processing for location-based services
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498356
A. S. Abdelhamid, Mingjie Tang, Ahmed M. Aly, Ahmed R. Mahmood, Thamir M. Qadah, Walid G. Aref, Saleh M. Basalamah
Advances in location-based services (LBS) demand high-throughput processing of both static and streaming data. Recently, many systems have been introduced to support distributed main-memory processing to maximize the query throughput. However, these systems are not optimized for spatial data processing. In this demonstration, we showcase Cruncher, a distributed main-memory spatial data warehouse and streaming system. Cruncher extends Spark with adaptive query processing techniques for spatial data. Cruncher uses dynamic batch processing to distribute the queries and the data streams over commodity hardware according to an adaptive partitioning scheme. The batching technique also groups and orders the overlapping spatial queries to enable inter-query optimization. Both the data streams and the offline data share the same partitioning strategy that allows for data co-locality optimization. Furthermore, Cruncher uses an adaptive caching strategy to maintain the frequently-used location data in main memory. Cruncher maintains operational statistics to optimize query processing, data partitioning, and caching at runtime. We demonstrate two LBS applications over Cruncher using real datasets from OpenStreetMap and two synthetic data streams. We demonstrate that Cruncher achieves order(s) of magnitude throughput improvement over Spark when processing spatial data.
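The shared partitioning strategy behind the co-locality optimization can be illustrated by a simple grid function applied to both the offline data and the stream (the grid size and world bounds are assumed for this sketch, not Cruncher's actual scheme):

```python
def grid_partition(lon, lat, grid=8,
                   bounds=(-180.0, -90.0, 180.0, 90.0)):
    """Map a point to a grid cell id. Using the same function for both
    the offline data and the stream keeps co-located records on the
    same worker, so spatial queries avoid cross-partition shuffles."""
    min_x, min_y, max_x, max_y = bounds
    col = min(int((lon - min_x) / (max_x - min_x) * grid), grid - 1)
    row = min(int((lat - min_y) / (max_y - min_y) * grid), grid - 1)
    return row * grid + col
```

An adaptive scheme like Cruncher's would additionally re-balance cell boundaries as the observed point density shifts.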
Citations: 10
SEED: A system for entity exploration and debugging in large-scale knowledge graphs
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498342
Jun Chen, Yueguo Chen, Xiaoyong Du, Xiangling Zhang, Xuan Zhou
Large-scale knowledge graphs (KGs) contain massive numbers of entities and abundant relations among them. Data exploration over KGs allows users to browse the attributes of entities as well as the relations among entities, and therefore provides a good way of learning the structure and coverage of a KG. In this paper, we introduce a system called SEED that supports entity-oriented exploration in large-scale KGs by retrieving entities similar to given seed entities, together with the semantic relations that show how the entities resemble each other. A by-product of entity exploration in SEED is that it facilitates discovering deficiencies in KGs, so that detected bugs can easily be fixed by users as they explore the KGs.
大规模知识图包含大量的实体和丰富的实体之间的关系。通过KGs进行数据探索,用户可以浏览实体的属性以及实体之间的关系。在本文中,我们介绍了一个名为SEED的系统,该系统旨在支持大规模KGs中面向实体的探索,该系统基于检索一些种子实体的相似实体以及它们之间的语义关系,这些关系表明实体之间是如何相似的。SEED中实体探索的一个副产品是方便用户发现KGs的不足之处,这样用户在探索KGs时就可以很容易地修复发现的bug。
{"title":"SEED: A system for entity exploration and debugging in large-scale knowledge graphs","authors":"Jun Chen, Yueguo Chen, Xiaoyong Du, Xiangling Zhang, Xuan Zhou","doi":"10.1109/ICDE.2016.7498342","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498342","url":null,"abstract":"Large-scale knowledge graphs (KGs) contain massive entities and abundant relations among the entities. Data exploration over KGs allows users to browse the attributes of entities as well as the relations among entities. It therefore provides a good way of learning the structure and coverage of KGs. In this paper, we introduce a system called SEED that is designed to support entity-oriented exploration in large-scale KGs, based on retrieving similar entities of some seed entities as well as their semantic relations that show how entities are similar to each other. A by-product of entity exploration in SEED is to facilitate discovering the deficiency of KGs, so that the detected bugs can be easily fixed by users as they explore the KGs.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"20 1","pages":"1350-1353"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72662389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
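SEED's core operation, per the abstract, is retrieving entities similar to a seed entity together with the semantic relations that explain the similarity. A minimal sketch of that idea follows; it is a hypothetical simplification (the triple representation, the shared-fact overlap score, and all function names are assumptions, and the paper's actual similarity model is certainly richer):

```python
from collections import defaultdict


def features(triples):
    """Map each subject entity to its set of (relation, value) facts."""
    feats = defaultdict(set)
    for s, r, o in triples:
        feats[s].add((r, o))
    return feats


def similar_entities(triples, seed, k=3):
    """Rank entities by the facts they share with the seed entity.
    The shared facts returned alongside each entity play the role of
    the 'semantic relations' that explain why the entities are similar."""
    feats = features(triples)
    scored = []
    for e, f in feats.items():
        if e == seed:
            continue
        shared = f & feats[seed]
        if shared:
            scored.append((len(shared), e, sorted(shared)))
    scored.sort(reverse=True)
    return [(e, shared) for _, e, shared in scored[:k]]
```

Returning the shared facts, not just a score, mirrors SEED's emphasis on showing *how* entities are similar; an entity whose expected facts are missing from the result is also a natural starting point for the debugging use case the abstract mentions.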
Mining social ties beyond homophily 挖掘超越同质性的社会关系
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498259
Hongwei Liang, Ke Wang, Feida Zhu
Summarizing patterns of connections or social ties in a social network, in terms of attribute information on nodes and edges, holds a key to understanding how the actors interact and form relationships. We formalize this problem as mining top-k group relationships (GRs), which capture strong social ties between groups of actors. While existing works focus on patterns that follow from the well-known homophily principle, we are interested in social ties that do not follow from homophily and thus provide new insights. Finding top-k GRs faces new challenges: it requires a novel ranking metric, because traditional metrics favor patterns expected under the homophily principle; it requires an innovative search strategy, since such GRs have no obvious anti-monotonicity; and it requires a novel data structure to avoid the data explosion caused by multidimensional nodes and edges and many-to-many relationships in a social network. We address these issues by presenting an efficient algorithm, GRMiner, for mining top-k GRs, and we evaluate its effectiveness and efficiency on real data.
根据节点和边缘的属性信息来总结社交网络中连接或社会关系的模式,是理解参与者如何互动和形成关系的关键。我们将这个问题形式化为挖掘top-k组关系(GRs),它捕获了参与者组之间的强社会联系。虽然现有的作品关注的是遵循众所周知的同质性原则的模式,但我们对不遵循同质性的社会关系感兴趣,从而提供了新的见解。寻找top-k GRs面临着新的挑战:它需要一个新的排名指标,因为传统的指标倾向于同质性原则所期望的模式;这类GRs没有明显的反单调性,需要创新的搜索策略;它需要一种新颖的数据结构,以避免社交网络中多维节点和边缘以及多对多关系造成的数据爆炸。我们通过提出一种高效的算法GRMiner来解决这些问题,该算法用于挖掘top-k gr,并使用实际数据评估其有效性和效率。
{"title":"Mining social ties beyond homophily","authors":"Hongwei Liang, Ke Wang, Feida Zhu","doi":"10.1109/ICDE.2016.7498259","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498259","url":null,"abstract":"Summarizing patterns of connections or social ties in a social network, in terms of attributes information on nodes and edges, holds a key to the understanding of how the actors interact and form relationships. We formalize this problem as mining top-k group relationships (GRs), which captures strong social ties between groups of actors. While existing works focus on patterns that follow from the well known homophily principle, we are interested in social ties that do not follow from homophily, thus, provide new insights. Finding top-k GRs faces new challenges: it requires a novel ranking metric because traditional metrics favor patterns that are expected from the homophily principle; it requires an innovative search strategy since there is no obvious anti-monotonicity for such GRs; it requires a novel data structure to avoid data explosion caused by multidimensional nodes and edges and many-to-many relationships in a social network. We address these issues through presenting an efficient algorithm, GRMiner, for mining top-k GRs and we evaluate its effectiveness and efficiency using real data.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"228 1","pages":"421-432"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76099821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
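The ranking issue this abstract raises — traditional metrics favor patterns already expected under homophily — can be illustrated with a lift-style score that compares the observed number of ties between two attribute groups against an independence baseline. This is an illustrative stand-in, not GRMiner's actual metric, search strategy, or data structure:

```python
from collections import Counter


def top_k_group_ties(nodes, edges, k=3):
    """nodes: {node_id: attribute}; edges: iterable of (u, v) pairs.
    Score each attribute pair by lift = observed / expected edges,
    where 'expected' assumes edge endpoints mix independently of
    attribute.  Group pairs far above expectation stand out even when
    the groups differ, i.e. beyond what homophily alone predicts."""
    edges = list(edges)
    m = len(edges)
    # Endpoint frequency per attribute group.
    ends = Counter()
    for u, v in edges:
        ends[nodes[u]] += 1
        ends[nodes[v]] += 1
    total_ends = 2 * m
    observed = Counter()
    for u, v in edges:
        a, b = sorted((nodes[u], nodes[v]))
        observed[(a, b)] += 1
    scored = []
    for (a, b), obs in observed.items():
        pa, pb = ends[a] / total_ends, ends[b] / total_ends
        # Unordered pair (a, b) with a != b can occur two ways.
        expected = m * (2 * pa * pb if a != b else pa * pa)
        scored.append((obs / expected, (a, b), obs))
    scored.sort(reverse=True)
    return scored[:k]
```

A plain frequency count would rank dense within-group (homophilous) ties first; dividing by the independence baseline lets unexpectedly strong cross-group ties surface, which is the kind of "beyond homophily" signal the paper targets. The baseline here has no anti-monotone structure, consistent with the abstract's remark that such GRs lack obvious anti-monotonicity.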