
Latest publications: 2020 IEEE 36th International Conference on Data Engineering (ICDE)

PocketView: A Concise and Informative Data Summarizer
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00159
Yihai Xi, Ning Wang, Shuang Hao, Wenyang Yang, Li Li
A data summarization of a large table can be of great help: it provides a concise and informative overview and helps the user quickly figure out the subject of the data. However, a high-quality summarization needs two desirable properties: presenting notable entities and achieving broad domain coverage. In this demonstration, we propose a summarizer system called PocketView that creates a data summarization through a pocket view of the table. The attendees will experience the following features of our system: (1) time-sensitive notability evaluation - PocketView can automatically identify notable entities according to their significance and popularity in a user-defined time period; (2) broad-coverage pocket view - our system provides a pocket view of the table without losing any domain, which makes it much simpler and clearer for attendees to figure out the subject than the original table.
Pages: 1742-1745
Citations: 3
Distributed Streaming Set Similarity Join
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00055
Jianye Yang, W. Zhang, Xiang Wang, Ying Zhang, Xuemin Lin
With the prevalence of Internet access and user-generated content, a large number of documents/records, such as news articles and web pages, are continuously generated in an unprecedented manner. In this paper, we study the problem of efficient streaming set similarity join over distributed systems, which has broad applications in data cleaning and data integration tasks, such as online near-duplicate detection. In contrast to the prefix-based distribution strategy widely adopted in offline distributed processing, we propose a simple yet efficient length-based distribution framework that dispatches incoming records by their length. A load-aware length partition method is developed to find a balanced partition by effectively estimating the local join cost, achieving good load balance. Our length-based scheme is surprisingly superior to its competitors since it incurs no replication, has small communication cost, and achieves high throughput. We further observe that the join results of the current incoming record can be utilized to guide index construction, which in turn can facilitate the join processing of future records. Inspired by this observation, we propose a novel bundle-based join algorithm that groups similar records on the fly to reduce filtering cost. A by-product of this algorithm is an efficient verification technique that verifies a batch of records by utilizing their token differences to share verification costs, rather than verifying them individually. Extensive experiments conducted on Storm, a popular distributed stream processing system, suggest that our methods can achieve up to one order of magnitude throughput improvement over baselines.
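Length-based distribution rests on the standard Jaccard length filter: a record of length n can only join records of length between ⌈t·n⌉ and ⌊n/t⌋. A minimal sketch of routing by this filter (the worker length ranges are hypothetical, and the paper's actual partitioner is load-aware rather than fixed):

```python
import math

# Hypothetical workers, each owning a contiguous range of record lengths.
WORKERS = [(1, 3), (4, 6), (7, 12)]

def dispatch(record, threshold):
    """Return the workers whose length range overlaps the lengths that
    could pass the Jaccard length filter for this record."""
    n = len(record)
    lo, hi = math.ceil(threshold * n), math.floor(n / threshold)
    return [i for i, (a, b) in enumerate(WORKERS) if not (b < lo or a > hi)]

# A record of length 5 with Jaccard threshold 0.8 can only join records
# of length 4..6, so only the middle worker needs to see it.
print(dispatch({"a", "b", "c", "d", "e"}, 0.8))
```

Tighter thresholds narrow the candidate length range, which is what keeps communication cost low in a length-partitioned scheme.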
Pages: 565-576
Citations: 10
Efficient Query Processing with Optimistically Compressed Hash Tables & Strings in the USSR
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00033
Tim Gubner, Viktor Leis, P. Boncz
Modern query engines rely heavily on hash tables for query processing. Overall query performance and memory footprint are often determined by how hash tables and the tuples within them are represented. In this work, we propose three complementary techniques to improve this representation: Domain-Guided Prefix Suppression bit-packs keys and values tightly to reduce hash table record width. Optimistic Splitting decomposes values (and operations on them) into (operations on) frequently-accessed and infrequently-accessed value slices. By removing the infrequently-accessed value slices from the hash table record, it improves cache locality. The Unique Strings Self-aligned Region (USSR) accelerates handling of frequently-occurring strings, which are very common in real-world data sets, by creating an on-the-fly dictionary of the most frequent strings. This allows executing many string operations with integer logic and reduces memory pressure. We integrated these techniques into Vectorwise. On the TPC-H benchmark, our approach reduces peak memory consumption by 2-4× and improves performance by up to 1.5×. On a real-world BI workload, we measured a 2× improvement in performance, and in micro-benchmarks we observed speedups of up to 25×.
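A rough sketch of the bit-packing idea behind Domain-Guided Prefix Suppression: once a key's domain bounds are known, each value can be stored as an offset from the domain minimum in just enough bits, shrinking the hash-table record width. The helper names are illustrative, not Vectorwise's implementation:

```python
def pack(values, lo, hi):
    """Bit-pack values from the known domain [lo, hi] into one integer,
    using only enough bits per value to cover the domain range."""
    width = max(1, (hi - lo).bit_length())
    packed = 0
    for i, v in enumerate(values):
        packed |= (v - lo) << (i * width)
    return packed, width

def unpack(packed, width, count, lo):
    """Recover the original values by re-adding the domain minimum."""
    mask = (1 << width) - 1
    return [((packed >> (i * width)) & mask) + lo for i in range(count)]

# A domain of [1000, 1015] needs only 4 bits per value instead of 64.
vals = [1000, 1005, 1013]
p, w = pack(vals, 1000, 1015)
print(w, unpack(p, w, 3, 1000))
```

The narrower the observed domain, the tighter the packing, which is why domain statistics guide the layout.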
Pages: 301-312
Citations: 2
DynaMast: Adaptive Dynamic Mastering for Replicated Systems
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00123
Michael Abebe, Brad Glasbergen, Khuzaima S. Daudjee
Single-master replicated database systems strive to be scalable by offloading reads to replica nodes. However, single-master systems suffer from the performance bottleneck of all updates executing at a single site. Multi-master replicated systems distribute updates among sites but incur costly coordination for multi-site transactions. We present DynaMast, a lazily replicated, multi-master database system that guarantees one-site transaction execution while effectively distributing both reads and updates among multiple sites. DynaMast delivers these advantages by dynamically transferring the mastership of data, or remastering, among sites using a lightweight metadata-based protocol. DynaMast leverages remastering to adaptively place master copies to balance load and minimize future remastering. Using benchmark workloads, we demonstrate that DynaMast delivers superior performance over existing replicated database system architectures.
Pages: 1381-1392
Citations: 13
Kronos: Lightweight Knowledge-based Event Analysis in Cyber-Physical Data Streams
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00165
M. Namaki, Xin Zhang, Sukhjinder Singh, Arman Ahmed, Armina Foroutan, Yinghui Wu, A. Srivastava, Anton Kocheturov
We demonstrate Kronos, a framework and system that automatically extracts highly dynamic knowledge for complex event analysis in cyber-physical systems. Kronos captures events with an anomaly-based event model and integrates various events by correlating their temporal associations in real time, from heterogeneous, continuous cyber-physical measurement data streams. It maintains a lightweight, highly dynamic knowledge base, enabled by online window-based ensemble learning and incremental association analysis for event detection and linkage, respectively. These algorithms incur time costs determined by the available memory, independent of the size of the streams. Exploiting this highly dynamic knowledge, Kronos supports a rich set of stream event analytical queries, including event search (keywords and query-by-example), provenance queries ("which measurements or features are responsible for detected events?"), and root cause analysis. We demonstrate how the GUI of Kronos interacts with users to support both continuous and ad-hoc queries online and enables situational awareness in cyber-power systems, communication, and traffic networks.
Pages: 1766-1769
Citations: 2
Differentially Private Online Task Assignment in Spatial Crowdsourcing: A Tree-based Approach
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00051
Qian Tao, Yongxin Tong, Zimu Zhou, Yexuan Shi, Lei Chen, Ke Xu
With spatial crowdsourcing applications such as Uber and Waze deeply penetrating everyday life, there is a growing concern about protecting user privacy in spatial crowdsourcing. In particular, the locations of workers and tasks should be properly processed via a privacy mechanism before being reported to the untrusted spatial crowdsourcing server for task assignment. Privacy mechanisms typically perturb the location information, which tends to make task assignment ineffective. Prior studies only provide guarantees on privacy protection without assuring the effectiveness of task assignment. In this paper, we investigate privacy protection for online task assignment with the objective of minimizing the total distance, an important task assignment formulation in spatial crowdsourcing. We design a novel privacy mechanism based on Hierarchically Well-Separated Trees (HSTs). We prove that the mechanism is ε-Geo-Indistinguishable and show that there is a task assignment algorithm with a competitive ratio of $O\left(\frac{1}{\varepsilon^4}\log N \log^2 k\right)$, where ε is the privacy budget, N is the number of predefined points on the HST, and k is the matching size. Extensive experiments on synthetic and real datasets show that online task assignment under our privacy mechanism is notably more effective in terms of total distance than under prior differentially private mechanisms.
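For flavor, a minimal sketch of a geo-indistinguishability-style mechanism: report one of a set of predefined points, sampled with probability proportional to exp(-ε·d/2) so that points nearer the true location are exponentially more likely. The point set and the planar sampling below are illustrative and omit the paper's HST construction:

```python
import math
import random

# Hypothetical predefined reporting points.
POINTS = [(0, 0), (1, 0), (0, 1), (3, 4)]

def perturb(true_loc, eps, rng):
    """Sample a reported point with weight exp(-eps * distance / 2):
    the larger eps (the privacy budget), the closer the report stays."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    weights = [math.exp(-eps * dist(true_loc, p) / 2) for p in POINTS]
    r = rng.random() * sum(weights)
    for p, w in zip(POINTS, weights):
        r -= w
        if r <= 0:
            return p
    return POINTS[-1]

rng = random.Random(7)
print(perturb((0, 0), eps=1.0, rng=rng))
```

With a very large ε the mechanism almost always reports the nearest point; with a small ε the reports spread out, trading assignment quality for privacy.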
Pages: 517-528
Citations: 41
Array-based Data Management for Genomics
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00017
Olha Horlova, Abdulrahman Kaitoua, S. Ceri
With the huge growth of genomic data, exposing multiple heterogeneous features of genomic regions for millions of individuals, we increasingly need to support domain-specific query languages and knowledge extraction operations capable of aggregating and comparing trillions of regions arbitrarily positioned on the human genome. While row-based models for regions can be effectively used as a basis for cloud-based implementations, in previous work we have shown that the array-based model is effective in supporting the class of region-preserving operations, i.e., operations which do not create any new region but rather compose existing ones. In this paper, we remove the above constraint and describe an array-based implementation which applies to unrestricted region operations, as required by the GenoMetric Query Language. Specifically, we define a wide spectrum of operations over datasets represented as arrays, and we show that the array-based implementation scales well on Spark, thanks also to a data representation which is effectively used for supporting machine learning. Our benchmark, which uses an independent, pre-existing collection of queries, shows that in many cases the novel array-based implementation significantly outperforms the row-based implementation.
Pages: 109-120
Citations: 3
Efficient Entity Resolution on Heterogeneous Records (Extended Abstract)
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.9238348
Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao
Entity resolution (ER) is the problem of identifying and merging records that refer to the same real-world entity. In many scenarios, raw records are stored in heterogeneous environments. To better leverage such records, most existing work assumes that schema matching and data exchange have been performed to convert records under different schemas to a predefined schema. However, we observe that schema matching can lose information in some cases, which could be useful or even crucial to ER. To leverage sufficient information from heterogeneous sources, in this paper we address several challenges of ER on heterogeneous records and show that none of the existing similarity metrics or their transformations can be applied to find similar records under heterogeneous settings. Motivated by this, we propose a novel framework to iteratively find records which refer to the same entity, as well as an index to generate candidates and accelerate similarity computation. Evaluations on real-world datasets show the effectiveness and efficiency of our methods.
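Candidate generation with an index can be sketched as an inverted index over tokens that proposes pairs, with only the proposed pairs verified by a similarity metric. The records, threshold, and the choice of Jaccard below are illustrative, not the paper's framework:

```python
# Hypothetical product records with no shared schema, treated as token sets.
RECORDS = {
    1: "ipad 2 16gb wifi white",
    2: "apple ipad 2 16 gb wifi white tablet",
    3: "canon eos 60d camera body",
}

def candidates(records):
    """Inverted index over tokens: any two records sharing a token
    become a candidate pair, avoiding an all-pairs comparison."""
    index = {}
    for rid, text in records.items():
        for tok in set(text.split()):
            index.setdefault(tok, set()).add(rid)
    pairs = set()
    for rids in index.values():
        for a in rids:
            for b in rids:
                if a < b:
                    pairs.add((a, b))
    return pairs

def jaccard(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

# Verify only the candidate pairs against an illustrative threshold.
matches = [(a, b) for a, b in sorted(candidates(RECORDS))
           if jaccard(RECORDS[a], RECORDS[b]) >= 0.4]
print(matches)
```

Record 3 shares no token with the others, so it never enters verification; that pruning is what the index buys.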
Pages: 2074-2075
Citations: 1
Enabling Efficient Random Access to Hierarchically-Compressed Data
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00097
Feng Zhang, Jidong Zhai, Xipeng Shen, O. Mutlu, Xiaoyong Du
Recent studies have shown the promise of direct data processing on hierarchically-compressed text documents. By removing the need for decompressing data, the direct data processing technique brings large savings in both time and space. However, its benefits have been limited to data traversal operations; for random accesses, direct data processing is several times slower than the state-of-the-art baselines. This paper presents a set of techniques that successfully eliminate the limitation, and for the first time, establishes the feasibility of effectively handling both data traversal operations and random data accesses on hierarchically-compressed data. The work yields a new library, which achieves 3.1 × speedup over the state-of-the-art on random data accesses to compressed data, while preserving the capability of supporting traversal operations efficiently and providing large (3.9 ×) space savings.
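The core trick behind random access on grammar- (hierarchically-) compressed data can be sketched as follows: precompute the expanded length of every rule once, then resolve the i-th character by descending the rule tree instead of decompressing the whole string. The grammar below is illustrative, not the paper's format:

```python
# A toy grammar: S expands to "acdacdb" without ever being materialized.
RULES = {
    "S": ["A", "A", "b"],   # S -> A A b
    "A": ["a", "B"],        # A -> a B
    "B": ["c", "d"],        # B -> c d
}

LEN = {}
def rule_len(sym):
    """Memoized expanded length of a symbol (terminals have length 1)."""
    if sym not in RULES:
        return 1
    if sym not in LEN:
        LEN[sym] = sum(rule_len(s) for s in RULES[sym])
    return LEN[sym]

def char_at(sym, i):
    """Random access: descend into the child whose expansion covers
    position i, skipping siblings by their precomputed lengths."""
    if sym not in RULES:
        return sym
    for s in RULES[sym]:
        n = rule_len(s)
        if i < n:
            return char_at(s, i)
        i -= n
    raise IndexError(i)

print("".join(char_at("S", i) for i in range(rule_len("S"))))
```

Each access costs time proportional to the grammar depth, not the decompressed size, which is what makes random access on compressed data practical.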
Published in 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1069-1080.
Citations: 13
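The core idea behind random access on hierarchically-compressed text can be sketched with a tiny grammar-compressed string: precomputing each rule's expansion length lets us find the character at any position by descending the grammar, without decompressing anything. The grammar below is invented for illustration and this is not the paper's library, which handles far larger documents and also supports efficient traversal.

```python
from functools import lru_cache

# A straight-line grammar: each nonterminal expands to exactly two symbols.
rules = {
    "S": ("A", "A"),   # S -> AA
    "A": ("B", "a"),   # A -> Ba
    "B": ("a", "b"),   # B -> ab
}
# S expands to "abaaba" (length 6), but we never materialize that string.


@lru_cache(maxsize=None)
def length(sym):
    """Expanded length of a symbol: 1 for a terminal, sum of parts for a rule."""
    if sym not in rules:
        return 1
    left, right = rules[sym]
    return length(left) + length(right)


def char_at(sym, i):
    """Return character i of sym's expansion in O(grammar depth) steps."""
    while sym in rules:
        left, right = rules[sym]
        if i < length(left):
            sym = left
        else:
            i -= length(left)
            sym = right
    return sym


print("".join(char_at("S", i) for i in range(length("S"))))  # abaaba
```

Each lookup touches only one root-to-leaf path of the grammar, which is why random access can stay fast while the data remains compressed.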
HomoPAI: A Secure Collaborative Machine Learning Platform based on Homomorphic Encryption
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00152
Qifei Li, Zhicong Huang, Wen-jie Lu, Cheng Hong, Hunter Qu, Hui He, Weizhe Zhang
Homomorphic Encryption (HE) allows encrypted data to be processed without decryption, which could maximize the protection of user privacy without affecting the data utility. Thanks to strides made by cryptographers in the past few years, the efficiency of HE has been drastically improved, and machine learning on homomorphically encrypted data has become possible. Several works have explored machine learning based on HE, but most of them are restricted to the outsourced scenario, where all the data comes from a single data owner. We propose HomoPAI, an HE-based secure collaborative machine learning system, enabling a more promising scenario, where data from multiple data owners could be securely processed. Moreover, we integrate our system with the popular MPI framework to achieve parallel HE computations. Experiments show that our system can train a logistic regression model on millions of homomorphically encrypted data in less than two minutes.
Published in 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1713-1717.
Citations: 11
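The property that makes systems like this possible (computing on data that stays encrypted) can be shown with a toy Paillier-style scheme, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The primes below are tiny and completely insecure, chosen only so the arithmetic is easy to follow; this is a textbook sketch, not HomoPAI's implementation, which relies on hardened HE libraries and large keys.

```python
import math
import random

p, q = 11, 13                     # tiny, insecure demo primes
n, n2 = p * q, (p * q) ** 2       # public modulus n and n^2
g = n + 1                         # standard generator choice for Paillier
lam = math.lcm(p - 1, q - 1)      # Carmichael function lambda(n)


def L(x):
    """The Paillier L function: L(x) = (x - 1) / n."""
    return (x - 1) // n


mu = pow(L(pow(g, lam, n2)), -1, n)  # precomputed decryption constant


def encrypt(m):
    """Encrypt m < n with fresh randomness r coprime to n."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n


c1, c2 = encrypt(20), encrypt(22)
c_sum = (c1 * c2) % n2            # multiply ciphertexts = add plaintexts
print(decrypt(c_sum))             # 42, recovered without ever adding in the clear
```

A server holding only `c1` and `c2` can produce `c_sum` without learning 20, 22, or 42, which is the building block that lets encrypted training data from multiple owners be combined safely.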