
Latest publications from the 2020 IEEE 36th International Conference on Data Engineering (ICDE)

Speed Kit: A Polyglot & GDPR-Compliant Approach For Caching Personalized Content
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00142
Wolfram Wingerath, Felix Gessert, Erik Witt, Hannes Kuhlmann, Florian Bücklers, Benjamin Wollmer, N. Ritter
Users leave when page loads take too long. This simple fact has complex implications for virtually all modern businesses, because accelerating content delivery through caching is not as simple as it used to be. As a fundamental technical challenge, the high degree of personalization in today’s Web has seemingly outgrown the capabilities of traditional content delivery networks (CDNs) which have been designed for distributing static assets under fixed caching times. As an additional legal challenge for services with personalized content, an increasing number of regional data protection laws constrain the ways in which CDNs can be used in the first place. In this paper, we present Speed Kit as a radically different approach for content distribution that combines (1) a polyglot architecture for efficiently caching personalized content with (2) a natively GDPR-compliant client proxy that handles all sensitive information within the user device. We describe the system design and implementation, explain the custom cache coherence protocol to avoid data staleness and achieve Δ-atomicity, and we share field experiences from over a year of productive use in the e-commerce industry.
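The Δ-atomicity guarantee mentioned above can be illustrated with a minimal sketch: a cached read is acceptable only if the value it returns is at most Δ seconds stale, otherwise the origin must be consulted. The class and method names below are illustrative, not Speed Kit's actual API or coherence protocol.

```python
import time

class DeltaAtomicCache:
    """Serve cached reads only while they are at most `delta` seconds stale.

    Toy illustration of Delta-atomicity: every read reflects a write
    that happened no more than `delta` seconds ago.
    """
    def __init__(self, delta, fetch):
        self.delta = delta          # staleness bound in seconds
        self.fetch = fetch          # origin fetch function: key -> value
        self.store = {}             # key -> (value, cached_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.store.get(key)
        if entry is not None and now - entry[1] <= self.delta:
            return entry[0]         # fresh enough: serve from cache
        value = self.fetch(key)     # stale or missing: go to origin
        self.store[key] = (value, now)
        return value
```

A real deployment would additionally invalidate entries on writes; this sketch only bounds staleness by age.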
Pages: 1603-1608
Citations: 14
Automatic View Generation with Deep Learning and Reinforcement Learning
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00133
Haitao Yuan, Guoliang Li, Ling Feng, Ji Sun, Yue Han
Materializing views is an important method to reduce redundant computations in DBMS, especially for processing large scale analytical queries. However, many existing methods still need DBAs to manually generate materialized views, which does not scale to a large number of database instances, especially on cloud databases. To address this problem, we propose an automatic view generation method which judiciously selects "highly beneficial" subqueries to generate materialized views. However, there are two challenges. (1) How to estimate the benefit of using a materialized view for a query? (2) How to select optimal subqueries to generate materialized views? To address the first challenge, we propose a neural network based method to estimate the benefit of using a materialized view to answer a query. In particular, we extract significant features from different perspectives and design effective encoding models to transform these features into hidden representations. To address the second challenge, we model this problem as an ILP (Integer Linear Programming) problem, which aims to maximize the utility by selecting optimal subqueries to materialize. We design an iterative optimization method to select subqueries to materialize. However, this method cannot guarantee the convergence of the solution. To address this issue, we model the iterative optimization process as an MDP (Markov Decision Process) and use a deep reinforcement learning model to solve the problem. Extensive experiments show that our method outperforms existing solutions by 28.4%, 8.8% and 31.7% on three real-world datasets.
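The view-selection objective described above can be made concrete with a toy exhaustive solver: pick the subset of candidate subqueries that maximizes total estimated benefit under a storage budget. The paper solves the full-size problem with iterative optimization and deep RL; the function and tuple layout here are illustrative only.

```python
from itertools import combinations

def select_views(candidates, budget):
    """Exhaustive solver for a tiny instance of the view-selection ILP:
    choose a subset of candidate subqueries maximizing total estimated
    benefit, subject to a storage budget.

    candidates: list of (name, benefit, size) tuples.
    Returns (selected names, total benefit).
    """
    best, best_benefit = (), 0.0
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            size = sum(s for _, _, s in subset)
            benefit = sum(b for _, b, _ in subset)
            if size <= budget and benefit > best_benefit:
                best, best_benefit = subset, benefit
    return [name for name, _, _ in best], best_benefit
```

Enumeration is exponential, which is exactly why the paper needs heuristics and learning for realistic workloads.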
Pages: 1501-1512
Citations: 44
UniKV: Toward High-Performance and Scalable KV Storage in Mixed Workloads via Unified Indexing
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00034
Qiang Zhang, Yongkun Li, P. Lee, Yinlong Xu, Qiu Cui, L. Tang
Persistent key-value (KV) stores are mainly designed based on the Log-Structured Merge-tree (LSM-tree), which suffer from large read and write amplifications, especially when KV stores grow in size. Existing design optimizations for LSM-tree-based KV stores often make certain trade-offs and fail to simultaneously improve both the read and write performance on large KV stores without sacrificing scan performance. We design UniKV, which unifies the key design ideas of hash indexing and the LSM-tree in a single system. Specifically, UniKV leverages data locality to differentiate the indexing management of KV pairs. It also develops multiple techniques to tackle the issues caused by unifying the indexing techniques, so as to simultaneously improve the performance in reads, writes, and scans. Experiments show that UniKV significantly outperforms several state-of-the-art KV stores (e.g., LevelDB, RocksDB, HyperLevelDB, and PebblesDB) in overall throughput under read-write mixed workloads.
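The idea of unifying hash indexing with an LSM-style sorted layer can be sketched with a toy two-tier store: recent writes live in a hash-indexed hot tier for O(1) point access, and overflow is merged into a sorted cold tier that supports range scans. This is a stand-in for the concept, not UniKV's actual design.

```python
import bisect

class TieredKV:
    """Toy two-tier KV store: a hash-indexed hot tier plus a sorted
    (LSM-like) cold tier, illustrating the unified-indexing idea."""
    def __init__(self, hot_capacity=4):
        self.hot = {}                    # hash-indexed tier: recent writes
        self.hot_capacity = hot_capacity
        self.cold_keys = []              # sorted key list for range scans
        self.cold = {}                   # key -> value in the sorted tier

    def put(self, key, value):
        self.hot[key] = value
        if len(self.hot) > self.hot_capacity:
            self._flush()

    def _flush(self):
        # Merge the hot tier into the sorted tier (a miniature compaction).
        for k, v in self.hot.items():
            if k not in self.cold:
                bisect.insort(self.cold_keys, k)
            self.cold[k] = v
        self.hot.clear()

    def get(self, key):
        if key in self.hot:              # O(1) hit for hot keys
            return self.hot[key]
        return self.cold.get(key)

    def scan(self, lo, hi):
        """Range scan over both tiers, in key order."""
        keys = set(self.cold_keys) | set(self.hot.keys())
        return [(k, self.get(k)) for k in sorted(keys) if lo <= k <= hi]
```

The real system's hard part, which this sketch omits, is keeping reads, writes, and scans all fast while data migrates between the differently-indexed tiers.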
Pages: 313-324
Citations: 11
Task Allocation in Dependency-aware Spatial Crowdsourcing
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00090
Wangze Ni, Peng Cheng, Lei Chen, Xuemin Lin
Ubiquitous smart devices and high-quality wireless networks enable people to participate in spatial crowdsourcing tasks easily, which require workers to physically move to specific locations to conduct their assigned tasks. Spatial crowdsourcing has attracted much attention from both academia and industry. In this paper, we consider a spatial crowdsourcing scenario where the tasks may have dependencies among them. Specifically, a task can only be dispatched when its dependent tasks have already been assigned. In fact, task dependencies are quite common in many real-life applications, such as house repairing and holding sports games. We formally define dependency-aware spatial crowdsourcing (DA-SC), which focuses on finding an optimal worker-and-task assignment under the constraints of dependencies, worker skills, moving distances and deadlines to maximize the number of successfully assigned tasks. We prove that the DA-SC problem is NP-hard and thus intractable. Therefore, we propose two approximation algorithms, including a greedy approach and a game-theoretic approach, which can guarantee the approximate bounds of the results in each batch process.
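The dependency constraint above can be shown with a minimal greedy sketch: a task is dispatched only once all of its dependencies are assigned, to any worker with the required skill and remaining capacity. Distance and deadline constraints from the paper are omitted, and all names are illustrative.

```python
def greedy_assign(tasks, workers):
    """Greedy dependency-aware assignment sketch.

    tasks:   {task: {"deps": [task, ...], "skill": str}}
    workers: {worker: {"skills": set, "capacity": int}}
    Returns {task: worker} for every task that could be dispatched.
    """
    assigned = {}
    progress = True
    while progress:                      # keep sweeping until nothing changes
        progress = False
        for t, spec in tasks.items():
            if t in assigned:
                continue
            if not all(d in assigned for d in spec["deps"]):
                continue                 # dependencies not yet dispatched
            for w, wspec in workers.items():
                if spec["skill"] in wspec["skills"] and wspec["capacity"] > 0:
                    assigned[t] = w
                    wspec["capacity"] -= 1
                    progress = True
                    break
    return assigned
```

The repeated sweep mirrors batch processing: each pass can only dispatch tasks whose prerequisites were assigned in earlier passes.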
Pages: 985-996
Citations: 24
Online Indices for Predictive Top-k Entity and Aggregate Queries on Knowledge Graphs
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00096
Yan Li, Tingjian Ge, Cindy X. Chen
Knowledge graphs have seen increasingly broad applications. However, they are known to be incomplete. We define the notion of a virtual knowledge graph which extends a knowledge graph with predicted edges and their probabilities. We focus on two important types of queries: top-k entity queries and aggregate queries. To improve query processing efficiency, we propose an incremental index on top of low dimensional entity vectors transformed from network embedding vectors. We also devise query processing algorithms with the index. Moreover, we provide theoretical guarantees of accuracy, and conduct a systematic experimental evaluation. The experiments show that our approach is very efficient and effective. In particular, with the same or better accuracy guarantees, it is one to two orders of magnitude faster in query processing than the closest previous work which can only handle one relationship type.
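A top-k entity query over the "virtual" knowledge graph can be stated very simply without any index: among predicted edges carrying probabilities, return the k objects most likely related to a source entity under a given relation. The sketch below is a brute-force stand-in for the paper's index-based processing; the tuple layout is assumed, not taken from the paper.

```python
import heapq

def topk_entities(edges, source, relation, k):
    """Brute-force predictive top-k entity query.

    edges: iterable of (subject, relation, object, probability),
           i.e. a virtual knowledge graph with predicted edges.
    Returns the k objects with the highest probability of being
    related to `source` via `relation`.
    """
    candidates = [(p, o) for s, r, o, p in edges
                  if s == source and r == relation]
    return [o for p, o in heapq.nlargest(k, candidates)]
```

The brute-force scan is linear in the number of predicted edges per query, which is precisely the cost the paper's incremental index over embedding-derived vectors avoids.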
Pages: 1057-1068
Citations: 5
HBP: Hotness Balanced Partition for Prioritized Iterative Graph Computations
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00209
Shufeng Gong, Yanfeng Zhang, Ge Yu
Existing graph partition methods are designed for round-robin synchronous distributed frameworks. They balance workload without distinguishing vertex importance and fail to consider the characteristics of priority-based scheduling, which may limit the benefit of prioritized graph computation. To accelerate prioritized iterative graph computations, we propose Hotness Balanced Partition (HBP) and a stream-based partition algorithm, Pb-HBP. Pb-HBP partitions the graph by distributing vertices according to their hotness rather than blindly treating all vertices as equally weighted, which aims to spread the hot vertices evenly among workers. Our results show that our proposed partition method outperforms the state-of-the-art partition methods Fennel and HotGraph. Specifically, Pb-HBP reduces runtime by 40–90% relative to hash partitioning, by 5–75% relative to Fennel, and by 22–50% relative to HotGraph.
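The core balancing idea can be sketched greedily: place vertices, hottest first, on the partition with the smallest accumulated hotness, so hot vertices end up spread evenly across workers. The real Pb-HBP is streaming and also accounts for edge locality; this is only the load-balancing kernel, with illustrative names.

```python
def hotness_balanced_partition(hotness, k):
    """Greedy hotness-balanced partitioning sketch.

    hotness: {vertex: hotness score}
    k:       number of partitions (workers)
    Returns {vertex: partition id}.
    """
    loads = [0.0] * k
    assignment = {}
    # Hottest vertices first, each to the currently coolest partition.
    for v, h in sorted(hotness.items(), key=lambda kv: -kv[1]):
        target = min(range(k), key=lambda i: loads[i])
        assignment[v] = target
        loads[target] += h
    return assignment
```

This is the classic longest-processing-time heuristic applied to vertex hotness instead of vertex count.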
Pages: 1942-1945
Citations: 5
Efficiently Answering Span-Reachability Queries in Large Temporal Graphs
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00104
Dong Wen, Yilun Huang, Ying Zhang, Lu Qin, W. Zhang, Xuemin Lin
Reachability is a fundamental problem in graph analysis. In applications such as social networks and collaboration networks, edges are always associated with timestamps. Most existing works on reachability queries in temporal graphs assume that two vertices are related if they are connected by a path with non-decreasing timestamps (time-respecting) of edges. This assumption fails to capture the relationship between entities involved in the same group or activity with no time-respecting path connecting them. In this paper, we define a new reachability model, called span-reachability, designed to relax the time order dependency and identify the relationship between entities in a given time period. We adopt the idea of two-hop cover and propose an index-based method to answer span-reachability queries. Several optimizations are also given to improve the efficiency of index construction and query processing. We conduct extensive experiments on 17 real-world datasets to show the efficiency of our proposed solution.
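The span-reachability model described above, which drops the time-order requirement, can be checked naively with a BFS restricted to edges whose timestamps fall inside the query window. The paper answers such queries with a two-hop-cover index instead; the BFS below is only a specification of the semantics.

```python
from collections import deque

def span_reachable(edges, u, v, t1, t2):
    """Is v reachable from u using only edges timestamped in [t1, t2]?

    Unlike time-respecting reachability, the order of edge timestamps
    along the path does not matter, only window membership.
    edges: iterable of (src, dst, timestamp).
    """
    adj = {}
    for s, d, t in edges:
        if t1 <= t <= t2:                # keep only in-window edges
            adj.setdefault(s, []).append(d)
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in adj.get(x, []):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False
```

Note how a path with decreasing timestamps still qualifies, which is exactly what the time-respecting model would reject.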
Pages: 1153-1164
Citations: 13
Cool, a COhort OnLine analytical processing system
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00056
Zhongle Xie, Hongbin Ying, Cong Yue, Meihui Zhang, Gang Chen, B. Ooi
With a huge volume and variety of data accumulated over the years, OnLine Analytical Processing (OLAP) systems are facing challenges in query efficiency. Furthermore, the design of OLAP systems cannot serve modern applications well due to their inefficiency in processing complex queries such as cohort queries with low query latency. In this paper, we present Cool, a cohort online analytical processing system. As an integrated system with the support of several newly proposed operators on top of a sophisticated storage layer, it processes both cohort queries and conventional OLAP queries with superb performance. Its distributed design contains minimal load balancing and fault tolerance support and is scalable. Our evaluation results show that Cool outperforms two state-of-the-art systems, MonetDB and Druid, by a wide margin in single-node setting. The multi-node version of Cool can also beat the distributed Druid, as well as SparkSQL, by one order of magnitude in terms of query latency.
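The cohort queries Cool accelerates can be illustrated with a minimal pure-Python aggregation: group users into cohorts by the week of their first event, then count distinct active users per (cohort, weeks-since-birth) cell. This is a stand-in for the query semantics, not Cool's storage layer or operators.

```python
from collections import defaultdict

def cohort_retention(events):
    """Minimal cohort aggregation.

    events: iterable of (user, week).
    Returns {(birth_week, age_in_weeks): distinct active users}.
    """
    # Birth week = week of each user's earliest event.
    birth = {}
    for user, week in sorted(events, key=lambda e: e[1]):
        birth.setdefault(user, week)
    # Count distinct users active in each (cohort, age) cell.
    table = defaultdict(set)
    for user, week in events:
        table[(birth[user], week - birth[user])].add(user)
    return {key: len(users) for key, users in table.items()}
```

A cohort OLAP system materializes and indexes exactly this kind of birth-event grouping so the aggregation does not require rescanning all events per query.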
Pages: 577-588
Citations: 2
Deciding When to Trade Data Freshness for Performance in MongoDB-as-a-Service
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00207
Chenhao Huang, Michael J. Cahill, A. Fekete, Uwe Röhm
MongoDB is a popular document store that is also available as a cloud-hosted service. MongoDB internally deploys primary-copy asynchronous replication, and it allows clients to vary the Read Preference, so reads can deliberately be directed to secondaries rather than the primary site. Doing this can sometimes improve performance, but the returned data might be stale, whereas the primary always returns the freshest data value. While the state of practice is for programmers to decide where to direct reads at application development time, at that point they do not have a full understanding of the workload or hardware capacity. It is better to choose the appropriate Read Preference setting at runtime, as we describe in this paper. We show how a system can detect when the primary copy is saturated in MongoDB-as-a-Service, and use this to choose where reads should be done to improve overall performance. Our approach is aimed at a cloud consumer; it assumes access to only the limited diagnostic data provided to clients of the hosted service.
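The runtime decision the paper argues for can be reduced to a small policy function: read from a secondary only when the primary looks saturated and the secondary's replication lag keeps staleness within the application's bound. The threshold and metric names below are assumptions for illustration, not MongoDB's actual diagnostics.

```python
def choose_read_preference(primary_util, secondary_lag, max_lag):
    """Toy runtime Read Preference policy.

    primary_util:  observed primary utilization in [0, 1] (assumed metric)
    secondary_lag: replication lag of the secondary, in seconds
    max_lag:       application's tolerated staleness bound, in seconds
    """
    SATURATION_THRESHOLD = 0.8          # assumed utilization cutoff
    if primary_util > SATURATION_THRESHOLD and secondary_lag <= max_lag:
        return "secondary"              # trade freshness for throughput
    return "primary"                    # freshest data, default choice
```

In a real client this decision would be re-evaluated periodically and fed into the driver's read-preference setting; detecting saturation from client-visible signals is the hard part the paper addresses.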
Pages: 1934-1937
Citations: 5
Crowdsourcing-based Data Extraction from Visualization Charts
Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00177
Chengliang Chai, Guoliang Li, Ju Fan, Yuyu Luo
Visualization charts are widely utilized for presenting structured data. Under many circumstances, people want to explore the data in charts collected from various sources, such as papers and websites, so as to further analyze the data or create new charts. However, the existing automatic and semi-automatic approaches are not always effective due to the variety of charts. In this paper, we introduce a crowdsourcing approach that leverages human ability to extract data from visualization charts. There are several challenges. The first is how to avoid tedious human interaction with charts and design simple crowdsourcing tasks. Second, it is challenging to evaluate a worker’s quality for truth inference, because workers may not only provide inaccurate values but also misalign values to the wrong data series. To address these challenges, we design an effective crowdsourcing task scheme that splits a chart into simple micro-tasks. We introduce a novel worker quality model that considers both worker accuracy and task difficulty. We also devise an effective early-stopping mechanism to save cost. We have conducted experiments on a real crowdsourcing platform, and the results show that our framework outperforms state-of-the-art approaches on both cost and quality.
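Truth inference over noisy worker answers, as described in the abstract, is often approached with accuracy-weighted voting: each worker's answer to a micro-task is weighted by that worker's estimated quality, and the highest-scoring value wins. The sketch below illustrates this general idea only; the paper's actual model, which also incorporates task difficulty, is more sophisticated, and all names here are hypothetical.

```python
from collections import defaultdict


def infer_truth(answers, worker_accuracy, default_accuracy=0.5):
    """Aggregate crowdsourced answers into a single inferred value.

    answers         -- list of (worker_id, value) pairs for one micro-task
    worker_accuracy -- dict mapping worker_id to estimated accuracy in [0, 1]
    Unknown workers get a neutral default weight.
    """
    scores = defaultdict(float)
    for worker, value in answers:
        # Weight each vote by the worker's estimated accuracy.
        scores[value] += worker_accuracy.get(worker, default_accuracy)
    # Return the value with the highest total weight.
    return max(scores, key=scores.get)


# Two reliable workers reading 42 outweigh one very reliable worker reading 40.
answers = [("w1", 42), ("w2", 42), ("w3", 40)]
accuracy = {"w1": 0.9, "w2": 0.6, "w3": 0.95}
print(infer_truth(answers, accuracy))  # 42
```

In a full pipeline, the accuracy estimates themselves would be iteratively re-estimated from agreement with the inferred truths, and an early-stopping rule could stop collecting answers for a task once one value's weight dominates.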
{"title":"Crowdsourcing-based Data Extraction from Visualization Charts","authors":"Chengliang Chai, Guoliang Li, Ju Fan, Yuyu Luo","doi":"10.1109/ICDE48307.2020.00177","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00177","url":null,"abstract":"Visualization charts are widely utilized for presenting structured data. Under many circumstances, people want to explore the data in the charts collected from various sources, such as papers and websites, so as to further analyzing the data or creating new charts. However, the existing automatic and semi-automatic approaches are not always effective due to the variety of charts. In this paper, we introduce a crowdsourcing approach that leverages human ability to extract data from visualization charts. There are several challenges. The first one is how to avoid tedious human interaction with charts and design simple crowdsourcing tasks. Second, it is challenging to evaluate worker’s quality for truth inference, because workers may not only provide inaccurate values but also misalign values to wrong data series. To address the challenges, we design an effective crowdsourcing task scheme that splits a chart into simple micro-tasks. We introduce a novel worker quality model by considering worker’s accuracy and task difficulty. We also devise an effective early-stopping mechanisms to save the cost. 
We have conducted experiments on a real crowdsourcing platform, and the results show that our framework outperforms state-of-the-art approaches on both cost and quality.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"1814-1817"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82983461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7