
2016 IEEE 32nd International Conference on Data Engineering (ICDE): latest publications

Leveraging non-volatile memory for instant restarts of in-memory database systems
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498351
David Schwalb, Martin Faust, Markus Dreseler, Pedro Flemming, H. Plattner
Emerging non-volatile memory (NVM) technologies offer fast, byte-addressable access, allowing us to rethink the durability mechanisms of in-memory databases. Hyrise-NV is a database storage engine that maintains table and index structures on NVM. Our architecture updates the database state and index structures on NVM in a transactionally consistent manner using multi-version data structures, allowing databases to be recovered instantly, independent of their size. In this paper, we demonstrate the instant restart capabilities of Hyrise-NV, storing all data on non-volatile memory. Recovering a dataset of 92.2 GB takes about 53 seconds using our log-based approach, whereas Hyrise-NV recovers in under one second.
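To make the size-independence claim concrete, here is a minimal, hypothetical Python sketch (not the Hyrise-NV implementation) contrasting the two recovery paths: log replay must re-apply every logged write, while an NVM restart merely re-attaches to structures that are already persistent.

```python
# Illustrative only: recover_from_log() re-applies a simulated WAL,
# while recover_from_nvm() stands in for re-mapping structures that
# already live on byte-addressable NVM. Names are ours, not Hyrise-NV's.
import time

def recover_from_log(log):
    """Log replay: cost grows linearly with the number of logged writes."""
    table = {}
    for key, value in log:
        table[key] = value          # every write is re-applied
    return table

def recover_from_nvm(nvm_image):
    """NVM restart: persistent structures are usable as-is, so recovery
    is just re-attaching to them, independent of database size."""
    return nvm_image

if __name__ == "__main__":
    wal = [(i, i * 2) for i in range(1_000_000)]   # simulated write log
    t0 = time.perf_counter()
    db = recover_from_log(wal)
    t1 = time.perf_counter()
    db2 = recover_from_nvm(db)      # pretend db already lives on NVM
    t2 = time.perf_counter()
    print(f"log replay: {t1 - t0:.3f}s, NVM restart: {t2 - t1:.6f}s")
```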
{"title":"Leveraging non-volatile memory for instant restarts of in-memory database systems","authors":"David Schwalb, Martin Faust, Markus Dreseler, Pedro Flemming, H. Plattner","doi":"10.1109/ICDE.2016.7498351","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498351","url":null,"abstract":"Emerging non-volatile memory technologies (NVM) offer fast and byte-addressable access, allowing to rethink the durability mechanisms of in-memory databases. Hyrise-NV is a database storage engine that maintains table and index structures on NVM. Our architecture updates the database state and index structures transactionally consistent on NVM using multi-version data structures, allowing to instantly recover data-bases independent of their size. In this paper, we demonstrate the instant restart capabilities of Hyrise-NV, storing all data on non-volatile memory. Recovering a dataset of size 92.2 GB takes about 53 seconds using our log-based approach, whereas Hyrise-NV recovers in under one second.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"1386-1389"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88599417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Joint repairs for web wrappers
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498320
Stefano Ortona, G. Orsi, Tim Furche, Marcello Buoncristiano
Automated web scraping is a popular means of acquiring data from the web. Scrapers (or wrappers) are derived from manually or automatically annotated examples, often resulting in under- or over-segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort, so jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge but requires no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites across a wide variety of application domains shows that joint repairs increase the quality of wrappers by 15% to 60% independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.
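As a hedged illustration of what a data repair can look like, the sketch below uses one piece of encoded domain knowledge (a price regex, our assumption) to fix under-segmented values and drop spurious ones; the paper's joint-repair algorithm additionally refines the wrapper's XPath rules, which this sketch does not attempt.

```python
# Hypothetical repair pass over one extracted column; not the paper's
# actual algorithm. The PRICE pattern is the "encoded domain knowledge".
import re

PRICE = re.compile(r"\$\d+(?:\.\d{2})?")

def repair_price_column(values):
    repaired = []
    for v in values:
        found = PRICE.findall(v)
        if found == [v]:
            repaired.append(v)       # correctly segmented value
        elif found:
            repaired.extend(found)   # under-segmented: split out the prices
        # no match: spurious content from over-extraction, drop it
    return repaired

print(repair_price_column(["$9.99", "Sale! $5.00 $7.50", "free shipping"]))
# -> ['$9.99', '$5.00', '$7.50']
```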
{"title":"Joint repairs for web wrappers","authors":"Stefano Ortona, G. Orsi, Tim Furche, Marcello Buoncristiano","doi":"10.1109/ICDE.2016.7498320","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498320","url":null,"abstract":"Automated web scraping is a popular means for acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under/over segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort and thus jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge, but require no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites of a wide variety of application domains, shows that joint repairs are able to increase the quality of wrappers between 15% and 60% independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"191 1","pages":"1146-1157"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74461790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
pSCAN: Fast and exact structural graph clustering
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498245
Lijun Chang, Wei Li, Xuemin Lin, Lu Qin, W. Zhang
In this paper, we study structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering assigns the vertices in V to clusters and also identifies the sets of hub vertices and outlier vertices, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected. Firstly, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it is still not scalable to large graphs because it exhaustively computes structural similarity for every pair of adjacent vertices. Secondly, we make three observations about structural graph clustering that present opportunities for further optimization. Based on these observations, we develop a new two-step paradigm for scalable structural graph clustering. Thirdly, following this paradigm, we present a new approach that reduces the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structurally similar. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over an order of magnitude. Notably, for the Twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.
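For readers unfamiliar with the underlying measure, the following is a minimal SCAN-style baseline (the algorithm pSCAN speeds up, not pSCAN itself). It uses the standard structural similarity σ(u, v) = |N[u] ∩ N[v]| / √(|N[u]| · |N[v]|) over closed neighborhoods and expands clusters from core vertices.

```python
# Minimal SCAN-style clustering; pSCAN's two-step pruning and its
# avoidance of exhaustive similarity computation are NOT reproduced.
import math
from collections import defaultdict

def sigma(adj, u, v):
    """Structural similarity over closed neighborhoods N[u], N[v]."""
    Nu, Nv = adj[u] | {u}, adj[v] | {v}
    return len(Nu & Nv) / math.sqrt(len(Nu) * len(Nv))

def scan(adj, eps, mu):
    # exhaustive similarity computation: exactly what pSCAN avoids
    eps_nbrs = {u: {v for v in adj[u] if sigma(adj, u, v) >= eps}
                for u in adj}
    cores = {u for u, nb in eps_nbrs.items() if len(nb) + 1 >= mu}
    cluster, cid = {}, 0
    for seed in cores:
        if seed in cluster:
            continue
        cid, queue = cid + 1, [seed]
        while queue:                         # expand via eps-similar edges
            u = queue.pop()
            if u in cluster:
                continue
            cluster[u] = cid
            if u in cores:                   # only core vertices expand
                queue.extend(eps_nbrs[u] - cluster.keys())
    return cluster                           # unassigned = hubs / outliers

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
print(scan(adj, eps=0.6, mu=3))   # two clusters: {0, 1, 2} and {3, 4, 5}
```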
{"title":"pSCAN: Fast and exact structural graph clustering","authors":"Lijun Chang, Wei Li, Xuemin Lin, Lu Qin, W. Zhang","doi":"10.1109/ICDE.2016.7498245","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498245","url":null,"abstract":"In this paper, we study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering is to assign vertices in V to clusters and to identify the sets of hub vertices and outlier vertices as well, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected to each other. Firstly, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it is still not scalable to large graphs due to exhaustively computing structural similarity for every pair of adjacent vertices. Secondly, we make three observations about structural graph clustering, which present opportunities for further optimization. Based on these observations, in this paper we develop a new two-step paradigm for scalable structural graph clustering. Thirdly, following this paradigm, we present a new approach aiming to reduce the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structure-similar to each other. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over one order of magnitude. Noticeably, for the twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"12 1","pages":"253-264"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77053800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 72
A column store engine for real-time streaming analytics
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498332
Alex Skidanov, Anders J. Papito, A. Prout
This paper describes novel aspects of the column store implemented in the MemSQL database engine and the design choices made to support real-time streaming workloads. Column stores have traditionally been restricted to data warehouse scenarios where low-latency queries are a secondary goal, and where restricting data ingestion to be offline, batched, append-only, or some combination thereof is acceptable. In contrast, the MemSQL column store implementation treats low-latency queries and ongoing writes as first-class citizens, with a focus on avoiding interference between read, ingest, update, and storage optimization workloads through the use of fragmented snapshot transactions and optimistic storage reordering. This implementation broadens the range of serviceable column store workloads to include those with more stringent demands on query and data latency, such as those backing operational systems used by adtech, financial services, fraud detection, and other real-time or data streaming applications.
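A hypothetical miniature of the snapshot design point (ours, not MemSQL's engine): readers take an O(1) reference to an immutable segment list, so concurrent appends and background merging never block or corrupt an in-flight read.

```python
# Copy-on-write segment list: writers swap in a new list under a lock,
# readers keep whatever list they captured. Purely illustrative.
import threading

class TinyColumnStore:
    def __init__(self):
        self._segments = []             # list of immutable tuples
        self._lock = threading.Lock()   # serializes writers only

    def append_segment(self, rows):
        with self._lock:
            self._segments = self._segments + [tuple(rows)]

    def snapshot(self):
        return self._segments           # O(1): grab the current list

    def merge_segments(self):
        """Background storage optimization: merge-sort all segments."""
        with self._lock:
            merged = tuple(sorted(r for seg in self._segments for r in seg))
            self._segments = [merged]

store = TinyColumnStore()
store.append_segment([3, 1])
snap = store.snapshot()                 # reader's consistent view
store.append_segment([2])
store.merge_segments()
print(list(snap), store.snapshot())     # [(3, 1)] vs [(1, 2, 3)]
```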
{"title":"A column store engine for real-time streaming analytics","authors":"Alex Skidanov, Anders J. Papito, A. Prout","doi":"10.1109/ICDE.2016.7498332","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498332","url":null,"abstract":"This paper describes novel aspects of the column store implemented in the MemSQL database engine and describes the design choices made to support real-time streaming workloads. Column stores have traditionally been restricted to data warehouse scenarios where low latency queries are a secondary goal, and where restricting data ingestion to be offline, batched, append-only, or some combination thereof is acceptable. In contrast, the MemSQL column store implementation treats low latency queries and ongoing writes as first class citizens, with a focus on avoiding interference between read, ingest, update, and storage optimization workloads through the use of fragmented snapshot transactions and optimistic storage reordering. This implementation broadens the range of serviceable column store workloads to include those with more stringent demands on query and data latency, such as those backing operational systems used by adtech, financial services, fraud detection and other real-time or data streaming applications.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"139 1","pages":"1287-1297"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79913183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498238
Jongik Kim, Chen Li, Xiaohui Xie
Recent advances in DNA sequencing have enabled a flood of sequencing-based applications for studying biology and medicine. A key requirement of these applications is to rapidly and accurately map DNA subsequences to a reference genome. This DNA subsequence mapping problem shares core technical challenges with the similarity query processing problem studied in the database research literature. To solve this problem, existing techniques first extract signatures from a query, then retrieve candidate mapping positions from an index using the extracted signatures, and finally verify the candidate positions. The efficiency of these techniques depends critically on the signatures selected from queries, while signature selection relies on an indexing scheme for the reference genome. The q-gram inverted index, one of the most widely used indexing schemes, can discover candidate positions quickly, but has the limitation that query signatures are restricted to fixed-length q-grams. To address this problem, we propose a flexible way to generate variable-length signatures using a fixed-length q-gram index. The proposed technique groups a few q-grams into a variable-length signature and generates candidate positions for the variable-length signature using the inverted lists of those q-grams. We also propose a novel dynamic programming algorithm to balance the filtering power of signatures against the overhead of generating candidate positions for them. Through extensive experiments on both simulated and real genomic data, we show that our technique substantially improves read mapping performance in terms of both mapping speed and accuracy.
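The core index trick can be sketched as follows (illustrative Python, identifiers ours): a fixed-length q-gram index answers a longer, variable-length signature by intersecting the position lists of its constituent q-grams, each shifted by its offset within the signature.

```python
# Sketch only: the signature is covered by q-grams at stride q (so its
# length is assumed to be a multiple of q here); Hobbes3's dynamic
# programming for choosing signatures is not reproduced.
from collections import defaultdict

def build_index(text, q):
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index

def candidates(index, signature, q):
    """Candidate positions where the whole signature may occur."""
    positions = None
    for off in range(0, len(signature) - q + 1, q):
        gram = signature[off:off + q]
        shifted = {p - off for p in index.get(gram, [])}
        positions = shifted if positions is None else positions & shifted
    return sorted(p for p in (positions or set()) if p >= 0)

ref = "ACGTACGTTACG"
idx = build_index(ref, q=3)
print(candidates(idx, "ACGTTA", q=3))   # [4]: ref[4:10] == 'ACGTTA'
```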
{"title":"Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings","authors":"Jongik Kim, Chen Li, Xiaohui Xie","doi":"10.1109/ICDE.2016.7498238","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498238","url":null,"abstract":"Recent advances in DNA sequencing have enabled a flood of sequencing-based applications for studying biology and medicine. A key requirement of these applications is to rapidly and accurately map DNA subsequences to a reference genome. This DNA subsequence mapping problem shares core technical challenges with the similarity query processing problem studied in the database research literature. To solve this problem, existing techniques first extract signatures from a query, then retrieve candidate mapping positions from an index using the extracted signatures, and finally verify the candidate positions. The efficiency of these techniques depends critically on signatures selected from queries, while signature selection relies on an indexing scheme of a reference genome. The q-gram inverted indexing, one of the most widely used indexing schemes, can discover candidate positions quickly, but has the limitation that signatures of queries are restricted to fixed-length q-grams. To address the problem, we propose a flexible way to generate variable-length signatures using a fixed-length q-gram index. The proposed technique groups a few q-grams into a variable-length signature, and generates candidate positions for the variable-length signature using the inverted lists of the q-grams. We also propose a novel dynamic programming algorithm to balance between the filtering power of signatures and the overhead of generating candidate positions for the signatures. Through extensive experiments on both simulated and real genomic data, we show that our technique substantially improves the performance of read mapping in terms of both mapping speed and accuracy.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"26 1","pages":"169-180"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81746037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Context-aware advertisement recommendation for high-speed social news feeding
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498266
Yuchen Li, Dongxiang Zhang, Ziquan Lan, K. Tan
Social media advertising is a multi-billion dollar market and has become the major revenue source for Facebook and Twitter. To deliver ads to potentially interested users, these social network platforms learn a prediction model for each user based on their personal interests. However, as user interests often evolve slowly, a user may end up receiving repetitive ads. In this paper, we propose a context-aware advertising framework that takes into account both relatively static personal interests and the dynamic news feed from friends to drive growth in the ad click-through rate. To meet the real-time requirement, we first propose an online retrieval strategy that finds the k most relevant ads matching the dynamic context when a read operation is triggered. To avoid frequent retrieval when the context varies little, we propose a safe region method to quickly determine whether a user's top-k ads have changed. Finally, we propose a hybrid model that combines the merits of both methods by analyzing the dynamism of the news feed to determine an appropriate retrieval strategy. Extensive experiments conducted on multiple real social networks and ad datasets verify the efficiency and robustness of our hybrid model.
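A hedged sketch of the safe-region check (the scoring model and drift bound are our simplifying assumptions, not the paper's formulation): if every ad score can move by at most d since the last retrieval and 2d is below the score margin between the k-th and (k+1)-th ad, the cached top-k set provably cannot have changed, so the read is served without retrieval.

```python
# Illustrative safe-region cache; ad scores and the drift bound are
# assumed inputs rather than the paper's learned prediction model.
import heapq

class SafeRegionCache:
    def __init__(self, scores, k):
        self.k = k
        self._rebuild(scores)

    def _rebuild(self, scores):
        ranked = heapq.nlargest(self.k + 1, scores.items(),
                                key=lambda kv: kv[1])
        self.topk = [ad for ad, _ in ranked[:self.k]]
        # safe region: the gap between the k-th and (k+1)-th score
        self.margin = (ranked[self.k - 1][1] - ranked[self.k][1]
                       if len(ranked) > self.k else float("inf"))

    def read(self, scores, drift_bound):
        if 2 * drift_bound < self.margin:   # still inside the safe region
            return self.topk                # cheap path: no retrieval
        self._rebuild(scores)               # region violated: re-retrieve
        return self.topk

scores = {"ad1": 0.9, "ad2": 0.7, "ad3": 0.4, "ad4": 0.1}
cache = SafeRegionCache(scores, k=2)
print(cache.read(scores, drift_bound=0.05))   # ['ad1', 'ad2'], no retrieval
```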
{"title":"Context-aware advertisement recommendation for high-speed social news feeding","authors":"Yuchen Li, Dongxiang Zhang, Ziquan Lan, K. Tan","doi":"10.1109/ICDE.2016.7498266","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498266","url":null,"abstract":"Social media advertising is a multi-billion dollar market and has become the major revenue source for Facebook and Twitter. To deliver ads to potentially interested users, these social network platforms learn a prediction model for each user based on their personal interests. However, as user interests often evolve slowly, the user may end up receiving repetitive ads. In this paper, we propose a context-aware advertising framework that takes into account the relatively static personal interests as well as the dynamic news feed from friends to drive growth in the ad click-through rate. To meet the real-time requirement, we first propose an online retrieval strategy that finds k most relevant ads matching the dynamic context when a read operation is triggered. To avoid frequent retrieval when the context varies little, we propose a safe region method to quickly determine whether the top-k ads of a user are changed. Finally, we propose a hybrid model to combine the merits of both methods by analyzing the dynamism of news feed to determine an appropriate retrieval strategy. Extensive experiments conducted on multiple real social networks and ad datasets verified the efficiency and robustness of our hybrid model.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"43 1","pages":"505-516"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86511555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 35
Accelerating database workloads by software-hardware-system co-design
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498362
R. Bordawekar, Mohammad Sadoghi
The key objective of this tutorial is to provide a broad yet in-depth survey of the emerging field of co-designing software, hardware, and system components for accelerating enterprise data management workloads. The overall goal of this tutorial is two-fold. First, we provide a concise system-level characterization of different types of data management technologies, namely relational and NoSQL databases and data stream management systems, from the perspective of analytical workloads. Using this characterization, we discuss opportunities for accelerating key data management workloads using software and hardware approaches. Second, we dive deeper into hardware acceleration opportunities using Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs) for the query execution pipeline. Furthermore, we explore other hardware acceleration mechanisms such as single-instruction multiple-data (SIMD) execution, which enables short-vector data parallelism.
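To ground the SIMD part of the tutorial, the toy comparison below uses NumPy's vectorized kernels (which are backed by SIMD instructions on most platforms) as a stand-in for hand-written intrinsics: one vectorized predicate evaluation replaces a per-value Python loop.

```python
# Selection predicate "value < 50" evaluated one value at a time versus
# over whole vectors; timings are machine-dependent and indicative only.
import time
import numpy as np

values = np.random.randint(0, 100, size=10_000_000, dtype=np.int32)

t0 = time.perf_counter()
scalar_count = sum(1 for v in values.tolist() if v < 50)  # one value per step
t1 = time.perf_counter()
vector_count = int((values < 50).sum())                   # many values per step
t2 = time.perf_counter()

assert scalar_count == vector_count
print(f"scalar: {t1 - t0:.2f}s, vectorized: {t2 - t1:.3f}s")
```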
{"title":"Accelerating database workloads by software-hardware-system co-design","authors":"R. Bordawekar, Mohammad Sadoghi","doi":"10.1109/ICDE.2016.7498362","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498362","url":null,"abstract":"The key objective of this tutorial is to provide a broad, yet an in-depth survey of the emerging field of co-designing software, hardware, and systems components for accelerating enterprise data management workloads. The overall goal of this tutorial is two-fold. First, we provide a concise system-level characterization of different types of data management technologies, namely, the relational and NoSQL databases and data stream management systems from the perspective of analytical workloads. Using the characterization, we discuss opportunities for accelerating key data management workloads using software and hardware approaches. Second, we dive deeper into the hardware acceleration opportunities using Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs) for the query execution pipeline. Furthermore, we explore other hardware acceleration mechanisms such as single-instruction multiple-data (SIMD) that enables short-vector data parallelism.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"17 1","pages":"1428-1431"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87190999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Answering why-not questions on metric probabilistic range queries
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498288
Lu Chen, Yunjun Gao, Kai Wang, Christian S. Jensen, Gang Chen
Metric probabilistic range queries (MPRQ) have received substantial attention due to their utility in multimedia and text retrieval, decision making, and other areas. Existing MPRQ studies generally aim to improve query efficiency and resource usage. In contrast, we define and offer solutions to why-not questions on MPRQ. Given an original metric probabilistic range query and a why-not set W of uncertain objects that are absent from the query result, answering a why-not question on MPRQ explains why the uncertain objects in W do not appear in the query result and provides refinements of the original query and/or W at minimal penalty, so that the uncertain objects in W appear in the result of the refined query. Specifically, we propose a framework with three efficient solutions: one that modifies the original query, one that modifies the why-not set, and one that modifies both. Extensive experiments using both real and synthetic data sets offer insights into the properties of the proposed algorithms and show that they are effective and efficient.
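As a hedged sketch of the query-modification strategy (representing uncertain objects as equally weighted samples is our simplifying assumption), the code below computes the smallest radius enlargement under which every why-not object reaches the probability threshold, with the enlargement itself serving as the penalty.

```python
# Illustrative only: an uncertain object matches query (q, r, theta)
# when at least a theta fraction of its sample points lies within
# distance r of q. Names and the penalty measure are ours.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def min_radius(q, samples, theta):
    """Smallest r at which >= theta of the object's mass is in range."""
    ds = sorted(dist(q, s) for s in samples)
    need = math.ceil(theta * len(ds))     # equal-weight samples
    return ds[need - 1]

def refine_radius(q, r, theta, why_not):
    r_new = max([r] + [min_radius(q, obj, theta) for obj in why_not])
    return r_new, r_new - r               # refined radius, penalty

q, r, theta = (0.0, 0.0), 1.0, 0.5
missing = [[(1.2, 0.0), (1.4, 0.1), (0.9, 0.0), (2.0, 0.0)]]
print(refine_radius(q, r, theta, missing))   # -> (1.2, 0.2) approximately
```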
{"title":"Answering why-not questions on metric probabilistic range queries","authors":"Lu Chen, Yunjun Gao, Kai Wang, Christian S. Jensen, Gang Chen","doi":"10.1109/ICDE.2016.7498288","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498288","url":null,"abstract":"Metric probabilistic range queries (MPRQ) have received substantial attention due to their utility in multimedia and text retrieval, decision making, etc. Existing MPRQ studies generally aim to improve query efficiency and resource usage. In contrast, we define and offer solutions to why-not questions on MPRQ. Given an original metric probabilistic range query and a why-not set W of uncertain objects that are absent from the query result, a why-not question on MPRQ explains why the uncertain objects in W do not appear in the query result, and provides refinements of the original query and/or W with the minimal penalty, so that the uncertain objects in W appear in the result of the refined query. Specifically, we propose a framework that consists of three efficient solutions, one that modifies the original query, one that modifies the why-not set, and one that modifies both the original query and the why-not set. Extensive experiments using both real and synthetic data sets offer insights into the properties of the proposed algorithms, and show that they are effective and efficient.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"57 1","pages":"767-778"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88035175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
Edge classification in networks
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498311
C. Aggarwal, Gewen He, Peixiang Zhao
We consider in this paper the edge classification problem in networks, defined as follows. Given a graph-structured network G(N, A), where N is a set of vertices and A ⊆ N × N is a set of edges, in which a subset Al ⊆ A of the edges is properly labeled a priori, determine the unknown labels of the edges in Au = A \ Al. The edge classification problem has numerous applications in graph mining and social network analysis, such as relationship discovery, categorization, and recommendation. Although the vertex classification problem is well known and has been extensively explored in networks, edge classification is relatively unexplored and in urgent need of careful study. In this paper, we present a series of efficient, neighborhood-based algorithms for edge classification in networks. To make the proposed algorithms scalable to large-scale networks, which can be either disk-resident or stream-like, we further devise efficient, cost-effective probabilistic edge classification methods without significantly compromising classification accuracy. We carry out experimental studies on a series of real-world networks, and the results demonstrate both the effectiveness and the efficiency of the proposed methods for edge classification in large networks.
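A minimal neighborhood-based baseline in the spirit of the approach (not the paper's exact algorithm): predict the label of an unlabeled edge (u, v) by majority vote over the labeled edges incident to either endpoint.

```python
# Illustrative edge classifier; large networks would need the streaming
# and probabilistic variants the paper develops.
from collections import Counter, defaultdict

def classify_edge(u, v, labeled_edges):
    """labeled_edges: {(a, b): label}. Vote with the labels of edges
    that share an endpoint with (u, v)."""
    incident = defaultdict(list)
    for (a, b), lab in labeled_edges.items():
        incident[a].append(lab)
        incident[b].append(lab)
    votes = Counter(incident[u]) + Counter(incident[v])
    return votes.most_common(1)[0][0] if votes else None

labeled = {(1, 2): "friend", (2, 3): "friend", (3, 4): "colleague"}
print(classify_edge(2, 4, labeled))   # 'friend' (two votes to one)
```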
{"title":"Edge classification in networks","authors":"C. Aggarwal, Gewen He, Peixiang Zhao","doi":"10.1109/ICDE.2016.7498311","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498311","url":null,"abstract":"We consider in this paper the edge classification problem in networks, which is defined as follows. Given a graph-structured network G(N, A), where N is a set of vertices and A ⊆ N ×N is a set of edges, in which a subset Al ⊆ A of edges are properly labeled a priori, determine for those edges in Au = AAl the edge labels which are unknown. The edge classification problem has numerous applications in graph mining and social network analysis, such as relationship discovery, categorization, and recommendation. Although the vertex classification problem has been well known and extensively explored in networks, edge classification is relatively unknown and in an urgent need for careful studies. In this paper, we present a series of efficient, neighborhood-based algorithms to perform edge classification in networks. To make the proposed algorithms scalable in large-scale networks, which can be either disk-resident or streamlike, we further devise efficient, cost-effective probabilistic edge classification methods without a significant compromise to the classification accuracy. We carry out experimental studies in a series of real-world networks, and the experimental results demonstrate both the effectiveness and efficiency of the proposed methods for edge classification in large networks.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"5 1","pages":"1038-1049"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85067625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
An interval join optimized for modern hardware
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498316
Danila Piatov, S. Helmer, Anton Dignös
We develop an algorithm for efficiently joining relations on interval-based attributes with overlap predicates, which are commonly found, for example, in temporal databases. Using a new data structure and a lazy evaluation technique, we achieve impressive performance gains by optimizing memory accesses to exploit features of modern CPU architectures. In an experimental evaluation with real-world datasets, our algorithm outperforms the state of the art by an order of magnitude.
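For context, a plain sort-merge sweep for the overlap join is sketched below; the paper's contribution is a cache-friendly data structure and lazy evaluation layered on this style of algorithm, which the sketch does not attempt to reproduce.

```python
# Textbook sweep over two interval relations with closed intervals
# (start, end); output-sensitive apart from the initial sorts.
def overlap_join(R, S):
    R, S = sorted(R), sorted(S)
    result, i, j = [], 0, 0
    active_r, active_s = [], []
    while i < len(R) or j < len(S):        # merge streams by start point
        if j >= len(S) or (i < len(R) and R[i][0] <= S[j][0]):
            r = R[i]; i += 1
            active_s = [s for s in active_s if s[1] >= r[0]]  # evict ended
            result.extend((r, s) for s in active_s)
            active_r.append(r)
        else:
            s = S[j]; j += 1
            active_r = [r for r in active_r if r[1] >= s[0]]  # evict ended
            result.extend((r, s) for r in active_r)
            active_s.append(s)
    return result

R = [(1, 5), (6, 9)]
S = [(4, 7), (10, 12)]
print(overlap_join(R, S))   # [((1, 5), (4, 7)), ((6, 9), (4, 7))]
```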
{"title":"An interval join optimized for modern hardware","authors":"Danila Piatov, S. Helmer, Anton Dignös","doi":"10.1109/ICDE.2016.7498316","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498316","url":null,"abstract":"We develop an algorithm for efficiently joining relations on interval-based attributes with overlap predicates, which, for example, are commonly found in temporal databases. Using a new data structure and a lazy evaluation technique, we are able to achieve impressive performance gains by optimizing memory accesses exploiting features of modern CPU architectures. In an experimental evaluation with real-world datasets our algorithm is able to outperform the state-of-the-art by an order of magnitude.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"140 1","pages":"1098-1109"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85191273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 51