
Latest publications from the 2014 IEEE 30th International Conference on Data Engineering

Head, modifier, and constraint detection in short texts
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816658
Zhongyuan Wang, Haixun Wang, Zhirui Hu
Head and modifier detection is an important problem for applications that handle short texts such as search queries, ads keywords, titles, and captions. In many cases, short texts such as search queries do not follow grammar rules, and existing approaches for head and modifier detection are coarse-grained, domain-specific, and/or require labeling large amounts of training data. In this paper, we introduce a semantic approach for head and modifier detection. We first obtain a large number of instance-level head-modifier pairs from search logs. Then, we develop a conceptualization mechanism to generalize the instance-level pairs to the concept level. Finally, we derive weighted concept patterns that are concise, accurate, and have strong generalization power in head and modifier detection. Furthermore, we identify a subset of modifiers that we call constraints. Constraints are usually specific and not negligible as far as the intent of the short text is concerned, while non-constraint modifiers are more subjective. The mechanism we developed has been used in production for search relevance and ads matching. We use extensive experimental results to demonstrate the effectiveness of our approach.
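The abstract is terse about how instance-level pairs become concept-level patterns, so here is a minimal sketch of the conceptualization step. The toy concept dictionary, the pair data, and the support-count weighting are our illustrative assumptions, not the authors' implementation (which draws on a large knowledge base and real search logs):

```python
from collections import defaultdict

# Toy instance-to-concept dictionary standing in for a large knowledge
# base; entirely hypothetical data for illustration.
CONCEPTS = {
    "hotel": "accommodation", "hostel": "accommodation",
    "paris": "city", "london": "city",
    "cheap": "price modifier",
}

# Instance-level (head, modifier) pairs, e.g. mined from search logs.
instance_pairs = [
    ("hotel", "paris"), ("hotel", "london"),
    ("hostel", "paris"), ("hotel", "cheap"),
]

def conceptualize(pairs):
    """Lift instance-level pairs to concept-level patterns and weight
    each pattern by the number of distinct supporting instances."""
    support = defaultdict(set)
    for head, mod in pairs:
        h_concept = CONCEPTS.get(head)
        m_concept = CONCEPTS.get(mod)
        if h_concept and m_concept:
            support[(h_concept, m_concept)].add((head, mod))
    # Weight = instance support; more support => stronger generalization.
    return {pattern: len(insts) for pattern, insts in support.items()}

if __name__ == "__main__":
    for pattern, weight in conceptualize(instance_pairs).items():
        print(pattern, "weight:", weight)
```

Here the pattern (accommodation, city) earns weight 3 while (accommodation, price modifier) earns 1, mirroring how generalized patterns with broad instance support become the reliable detectors.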
Citations: 23
Adaptive parallel compressed event matching
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816665
Mohammad Sadoghi, H. Jacobsen
The efficient processing of large collections of patterns expressed as Boolean expressions over event streams plays a central role in major data-intensive applications ranging from user-centric processing and personalization to real-time data analysis. On the one hand, emerging user-centric applications, including computational advertising and selective information dissemination, demand determining and presenting to an end-user the relevant content as it is published. On the other hand, applications in real-time data analysis, including push-based multi-query optimization, computational finance and intrusion detection, demand meeting stringent subsecond processing requirements and providing high-frequency event processing. We meet these event processing requirements by exploiting the shift towards multi-core architectures, proposing a novel adaptive parallel compressed event matching algorithm (A-PCM) and an online event stream re-ordering technique (OSR) that unleash an unprecedented degree of parallelism amenable to highly parallel event processing. In our comprehensive evaluation, we demonstrate the efficiency of our proposed techniques. We show that the adaptive parallel compressed event matching algorithm can sustain an event rate of up to 233,863 events/second while state-of-the-art sequential event matching algorithms sustain only 36 events/second when processing up to five million Boolean expressions.
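As a rough illustration of the underlying matching problem, the sketch below evaluates conjunctive subscriptions against an event, with the subscription set partitioned across worker processes. It is a simplified baseline under our own assumptions (equality-only predicates, no compression or stream re-ordering), not the authors' A-PCM or OSR algorithms:

```python
from concurrent.futures import ProcessPoolExecutor

# A subscription is a conjunction of equality predicates, e.g.
# {"symbol": "IBM", "side": "buy"}. Real systems support richer
# Boolean expressions; equality conjunctions keep the sketch short.
SUBSCRIPTIONS = [
    {"symbol": "IBM", "side": "buy"},
    {"symbol": "MSFT"},
    {"side": "sell", "qty": 100},
]

def matches(event, sub):
    """True iff every predicate in the subscription holds on the event."""
    return all(event.get(attr) == val for attr, val in sub.items())

def match_chunk(args):
    event, chunk = args
    return [i for i, sub in chunk if matches(event, sub)]

def parallel_match(event, subs, workers=2):
    """Partition the subscription set and match the partitions in
    parallel, loosely mirroring a data-parallel matching layout."""
    indexed = list(enumerate(subs))
    size = max(1, len(indexed) // workers)
    chunks = [indexed[i:i + size] for i in range(0, len(indexed), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(match_chunk, [(event, c) for c in chunks])
    return [i for part in results for i in part]

if __name__ == "__main__":
    # Prints the indices of matching subscriptions: [0]
    print(parallel_match({"symbol": "IBM", "side": "buy"}, SUBSCRIPTIONS))
```

A-PCM additionally compresses similar Boolean expressions so that one comparison covers many subscriptions at once; the partition-per-worker layout above only hints at the parallelism it exploits.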
Citations: 7
Engine independence for logical analytic flows
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816723
P. Jovanovic, A. Simitsis, K. Wilkinson
A complex analytic flow in a modern enterprise may perform multiple, logically independent tasks, where each task uses a different processing engine. We term these multi-engine flows hybrid flows. Using multiple processing engines has advantages such as rapid deployment, better performance, and lower cost. However, as the number and variety of these engines grows, developing and maintaining hybrid flows is a significant challenge because they are specified at a physical level, and so are hard to design and may break as the infrastructure evolves. We address this problem by enabling flow design at a logical level and automatic translation to physical flows. There are three main challenges. First, we describe how flows can be represented at a logical level, abstracting away details of any underlying processing engine. Second, we show how a physical flow, expressed in a programming language or some design GUI, can be imported and converted to a logical flow. In particular, we show how a hybrid flow comprising subflows in different languages can be imported and composed as a single, logical flow for subsequent manipulation. Third, we describe how a logical flow is translated into one or more physical flows for execution by the processing engines. The paper concludes with experimental results and example transformations that demonstrate the correctness and utility of our system.
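To make the logical/physical split concrete, here is a minimal sketch in which a logical flow is an engine-agnostic list of operators and two toy translators emit engine-specific artifacts. The operator names, the SQL dialect, and both "engines" are hypothetical illustrations, not the paper's actual flow language:

```python
# A logical flow is an engine-agnostic sequence of operators; each
# translator walks the flow and emits engine-specific code.
LOGICAL_FLOW = [
    ("extract", {"source": "sales.csv"}),
    ("filter", {"predicate": "amount > 100"}),
    ("aggregate", {"group_by": "region", "func": "sum(amount)"}),
]

def to_sql(flow):
    """Translate the logical flow into a single SQL statement."""
    src = pred = agg = None
    for op, args in flow:
        if op == "extract":
            src = args["source"]
        elif op == "filter":
            pred = args["predicate"]
        elif op == "aggregate":
            agg = (args["group_by"], args["func"])
    return (f"SELECT {agg[0]}, {agg[1]} FROM '{src}' "
            f"WHERE {pred} GROUP BY {agg[0]}")

def to_script(flow):
    """Translate the same flow into pseudo-dataflow script calls."""
    return "\n".join(f"{op}({args})" for op, args in flow)

print(to_sql(LOGICAL_FLOW))
print(to_script(LOGICAL_FLOW))
```

Because both translators consume the same logical representation, the flow survives an engine swap unchanged, which is the portability property the paper argues for.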
Citations: 30
Efficient support of XQuery Full Text in SQL/XML enabled RDBMS
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816729
Z. Liu, Ying Lu, Hui J. Chang
There has been more than a decade of effort supporting storage, query, and update of XML documents in RDBMSs. An XML-enabled RDBMS supports the SQL/XML standard, which defines XMLType as a SQL data type and allows XQuery/XPath to be embedded in XMLQuery(), XMLExists() and XMLTABLE() in SQL. In an XML-enabled RDBMS, both relational data and XML documents can be managed in one system and queried using the SQL/XML language. However, the use case of managing document-centric XML is not well addressed due to the lack of full text query constructs in XQuery. Recently, XQuery Full Text (XQFT) has become a W3C recommendation. In this paper, we show how XQFT can be supported efficiently in SQL/XML for full text search of XML documents managed by an XML-enabled RDBMS, such as Oracle XMLDB. We present the architecture of a new XML Full Text Index and XQuery compile-time and run-time enhancements to efficiently support XQFT in SQL/XML. We present our design rationale on how to exploit Information Retrieval (IR) techniques for XQFT support in an RDBMS. The new XML Full Text Index can index common XML physical storage forms, such as text XML, binary XML, and relational decompositions of XML. Although our work is built within Oracle XMLDB, all of the principles and techniques presented in this paper are valuable to the RDBMS industry, which needs to effectively and efficiently support XQFT over persisted XML documents.
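For concreteness, the snippet below builds (and merely prints) a SQL/XML query that pushes an XQFT predicate into XMLExists(). The documents table, content column, and id column are assumptions; the contains text predicate follows the W3C XQFT recommendation, though vendor dialects such as Oracle XMLDB may differ in detail:

```python
def xqft_query(keyword: str) -> str:
    """Construct a SQL/XML query with an XQuery Full Text predicate.
    Table/column names are hypothetical; production code should use
    bind variables rather than this naive quote-stripping."""
    safe = keyword.replace('"', "")
    return f"""
        SELECT d.id
        FROM documents d
        WHERE XMLExists(
            '$doc/article/body[. contains text "{safe}"]'
            PASSING d.content AS "doc"
        )
    """

print(xqft_query("full text search"))
```

The point of the paper's XML Full Text Index is to answer such predicates efficiently from an IR-style index rather than by scanning every persisted document at query time.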
Citations: 6
Large-scale frequent subgraph mining in MapReduce
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816705
Wenqing Lin, Xiaokui Xiao, Gabriel Ghinita
Mining frequent subgraphs from a large collection of graph objects is an important problem in several application domains such as bio-informatics, social networks, and computer vision. The main challenge in subgraph mining is efficiency, as (i) testing for graph isomorphisms is computationally intensive, and (ii) the cardinality of the graph collection to be mined may be very large. We propose a two-step filter-and-refinement approach that is suitable for massive parallelization within the scalable MapReduce computing model. We partition the collection of graphs among worker nodes, and each worker applies the filter step to determine a set of candidate subgraphs that are locally frequent in its partition. The union of all such graphs is the input to the refinement step, where each candidate is checked against all partitions and only the globally frequent graphs are retained. We devise a statistical threshold mechanism that allows us to predict which subgraphs have a high chance of becoming globally frequent, and thus reduce the computational overhead in the refinement step. We also propose effective strategies to avoid redundant computation in each round when searching for candidate graphs, as well as a lightweight graph compression mechanism to reduce the communication cost between machines. Extensive experimental evaluation on several real-world large graph datasets shows that the proposed approach clearly outperforms the existing state of the art and provides a practical solution to the problem of frequent subgraph mining over massive collections of graphs.
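The two-step structure is easy to see in miniature. The sketch below simulates the filter and refinement steps over in-memory "partitions", mining only single-edge subgraphs so that no isomorphism machinery is needed; the data, the local threshold, and this simplification are our assumptions rather than the paper's algorithm:

```python
from collections import Counter
from itertools import chain

# Each graph is a frozenset of labeled edges; mining single-edge
# subgraphs keeps the sketch tiny (real miners grow multi-edge
# candidates with isomorphism checks). All data is illustrative.
PARTITIONS = [
    [frozenset({("A", "B")}), frozenset({("A", "B"), ("B", "C")})],
    [frozenset({("A", "B")}), frozenset({("C", "D")})],
]
MIN_SUPPORT = 2  # global support threshold

def local_frequent(partition, threshold):
    """Filter step (one worker): subgraphs locally frequent in a partition."""
    counts = Counter(edge for g in partition for edge in g)
    return {e for e, c in counts.items() if c >= threshold}

def refine(candidates, partitions, min_support):
    """Refinement step: count every candidate against all partitions
    and keep only the globally frequent ones."""
    counts = Counter()
    for part in partitions:
        for g in part:
            for e in candidates:
                if e in g:
                    counts[e] += 1
    return {e for e in candidates if counts[e] >= min_support}

# Candidates = union of locally frequent subgraphs (local threshold 1
# here; the paper tunes this statistically to prune unlikely winners).
candidates = set(chain.from_iterable(
    local_frequent(p, 1) for p in PARTITIONS))
print(refine(candidates, PARTITIONS, MIN_SUPPORT))  # {('A', 'B')}
```

The paper's statistical threshold plays the role of the hard-coded local threshold above: it raises the bar in the filter step so fewer hopeless candidates reach the expensive global count.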
Citations: 85
Stock trade volume prediction with Yahoo Finance user browsing behavior
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816733
Ilaria Bordino, N. Kourtellis, N. Laptev, Youssef Billawala
Web traffic represents a powerful mirror for various real-world phenomena. For example, it was shown that web search volumes have a positive correlation with stock trading volumes and with the sentiment of investors. Our hypothesis is that user browsing behavior on a domain-specific portal is a better predictor of user intent than web searches.
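As a toy illustration of the kind of signal the hypothesis implies, the sketch below correlates a synthetic daily browsing series with trading volume and fits a one-day-lag linear predictor. The data and the lag-1 least-squares model are purely illustrative assumptions, not the paper's methodology:

```python
import numpy as np

# Toy daily series: page views of a ticker's finance page and its
# trading volume in millions of shares. Entirely synthetic data.
browsing = np.array([120, 150, 90, 200, 170, 210, 130], dtype=float)
volume = np.array([1.1, 1.4, 0.8, 1.9, 1.6, 2.0, 1.2])

# Correlation of same-day browsing and trading volume.
r = np.corrcoef(browsing, volume)[0, 1]
print(f"same-day correlation: {r:.2f}")

# Predict tomorrow's volume from today's browsing with a one-day lag:
# fit volume[t+1] ~ a * browsing[t] + b by least squares.
a, b = np.polyfit(browsing[:-1], volume[1:], 1)
print(f"next-day forecast: {a * browsing[-1] + b:.2f}M shares")
```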
Citations: 26
HOPE: Iterative and interactive database partitioning for OLTP workloads
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816759
Yu Cao, X. Guo, Baoyao Zhou, S. Todd
This paper demonstrates HOPE, an efficient and effective database partitioning system designed for OLTP workloads. HOPE is built on top of a novel tuple-group-based database partitioning model, which minimizes the number of distributed transactions as well as the extent of partition and workload skew during workload execution. HOPE conducts the partitioning in an iterative manner in order to achieve better partitioning quality, save the user's time spent on partitioning design, and broaden its application scenarios. HOPE is also highly interactive, as it provides rich opportunities for the user to help it further improve the partitioning quality by contributing expertise and indirect domain knowledge.
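To illustrate the objective (few distributed transactions, balanced partitions), here is a small greedy sketch that co-locates tuple groups which transactions access together, under a per-partition capacity cap. The affinity heuristic, the cap, and the data are our assumptions; HOPE's actual iterative, interactive model is more sophisticated:

```python
from collections import defaultdict

# Transactions as the sets of tuple groups they touch; a transaction
# is "distributed" iff its groups span more than one partition.
TRANSACTIONS = [{"g1", "g2"}, {"g1", "g2"}, {"g3", "g4"}, {"g2", "g3"}]
NUM_PARTITIONS = 2

def greedy_partition(transactions, k):
    """Greedily co-locate tuple groups that transactions access
    together, under a per-partition capacity cap for balance."""
    affinity = defaultdict(int)
    groups = set()
    for txn in transactions:
        groups |= txn
        members = sorted(txn)
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                affinity[(members[i], members[j])] += 1
    cap = -(-len(groups) // k)  # ceiling division
    assignment, loads = {}, [0] * k
    for g in sorted(groups):
        eligible = [p for p in range(k) if loads[p] < cap]
        scores = {p: 0 for p in eligible}
        # Prefer the partition holding this group's strongest partners,
        # breaking ties toward the lightest partition.
        for other, p in assignment.items():
            pair = tuple(sorted((g, other)))
            if p in scores:
                scores[p] += affinity.get(pair, 0)
        best = max(eligible, key=lambda p: (scores[p], -loads[p]))
        assignment[g] = best
        loads[best] += 1
    return assignment

def count_distributed(transactions, assignment):
    return sum(len({assignment[g] for g in t}) > 1 for t in transactions)

plan = greedy_partition(TRANSACTIONS, NUM_PARTITIONS)
print(plan, "| distributed txns:", count_distributed(TRANSACTIONS, plan))
```

On this toy workload the plan co-locates g1 with g2 and g3 with g4, leaving only the {g2, g3} transaction distributed; an iterative system would then solicit feedback and re-partition to improve further.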
Citations: 1
GQBE: Querying knowledge graphs by example entity tuples
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816753
Nandish Jayaram, Mahesh Gupta, Arijit Khan, Chengkai Li, Xifeng Yan, R. Elmasri
We present GQBE, a system that provides a simple and intuitive mechanism for querying large knowledge graphs. Answers to tasks such as “list university professors who have designed some programming languages and also won an award in Computer Science” are best found in knowledge graphs that record entities and their relationships. Real-world knowledge graphs are difficult to use due to their sheer size and complexity and the challenging task of writing complex structured graph queries. Toward better usability of query systems over knowledge graphs, GQBE allows users to query knowledge graphs by example entity tuples without writing complex queries. In this demo we present: 1) a detailed description of the various features and user-friendly GUI of GQBE, 2) a brief description of the system architecture, and 3) a demonstration scenario that we intend to show the audience.
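The query-by-example idea can be shown in miniature: infer the relation connecting an example entity tuple, then return other tuples matching the same pattern. The toy triples and the one-hop restriction below are simplifying assumptions; GQBE discovers richer multi-edge query graphs:

```python
# A toy knowledge graph as (subject, relation, object) triples and a
# query-by-example lookup. All data is illustrative.
TRIPLES = [
    ("dennis_ritchie", "designed", "c"),
    ("dennis_ritchie", "won", "turing_award"),
    ("ken_thompson", "designed", "b"),
    ("ken_thompson", "won", "turing_award"),
    ("james_gosling", "designed", "java"),
]

def query_by_example(example, triples):
    """Infer the relation(s) connecting the example tuple, then return
    every other tuple connected by the same relation(s)."""
    subj, obj = example
    relations = {r for s, r, o in triples if s == subj and o == obj}
    return [(s, o) for s, r, o in triples
            if r in relations and (s, o) != example]

print(query_by_example(("dennis_ritchie", "c"), TRIPLES))
# -> [('ken_thompson', 'b'), ('james_gosling', 'java')]
```

The user never writes the "designed" pattern explicitly; the example tuple implies it, which is the usability argument the demo makes.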
Citations: 32
A tunable compression framework for bitmap indices
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816675
Gheorghi Guzun, G. Canahuate, David Chiu, Jason Sawin
Bitmap indices are widely used for large read-only repositories in data warehouses and scientific databases. Their binary representation allows for the use of bitwise operations and specialized run-length compression techniques. Due to a trade-off between compression and query efficiency, bitmap compression schemes are aligned using a fixed encoding length (typically the word length) to avoid explicit decompression during query time. In general, smaller encoding lengths provide better compression but require more decoding during query execution. However, when the difference in size is considerable, it is possible for smaller encodings to also provide better execution time. We posit that a tailored encoding length for each bit vector will provide better performance than a one-size-fits-all approach. We present a framework that optimizes compression and query efficiency by allowing bitmaps to be compressed using variable encoding lengths while still maintaining alignment to avoid explicit decompression. Efficient algorithms are introduced to process queries over bitmaps compressed using different encoding lengths. An input parameter controls the aggressiveness of the compression, providing the user with the ability to tune the trade-off between space and query time. Our empirical study shows this approach achieves significant improvements in terms of both query time and compression ratio for synthetic and real data sets. Compared to 32-bit WAH, VAL-WAH produces up to 1.8× smaller bitmaps and achieves query times that are 30% faster.
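A stripped-down WAH-style encoder makes the alignment idea concrete. The sketch below parameterizes the segment length, which is the knob the paper tunes per bit vector; representing runs and literals as tagged tuples instead of packed machine words is our simplification:

```python
def encode(bits, seg_len=7):
    """Encode a bit string into ('fill', bit, n_segments) runs and
    ('literal', segment) entries, aligned to seg_len-bit segments.
    A simplified word-aligned hybrid (WAH-style) scheme."""
    # Pad to a whole number of segments.
    bits = bits + "0" * (-len(bits) % seg_len)
    segs = [bits[i:i + seg_len] for i in range(0, len(bits), seg_len)]
    out = []
    for seg in segs:
        if seg in ("0" * seg_len, "1" * seg_len):
            bit = seg[0]
            # Extend the previous fill run if it has the same bit value.
            if out and out[-1][0] == "fill" and out[-1][1] == bit:
                out[-1] = ("fill", bit, out[-1][2] + 1)
            else:
                out.append(("fill", bit, 1))
        else:
            out.append(("literal", seg))
    return out

bitmap = "0" * 30 + "1011001" + "1" * 21
for word in encode(bitmap, seg_len=7):
    print(word)
```

Shrinking seg_len turns more segments into compressible fills (better compression, more decoding work); growing it does the reverse, which is exactly the per-vector trade-off the framework tunes.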
Citations: 59
A hybrid machine-crowdsourcing system for matching web tables
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816716
Ju Fan, Meiyu Lu, B. Ooi, W. Tan, Meihui Zhang
The Web is teeming with rich structured information in the form of HTML tables, which provides us with the opportunity to build a knowledge repository by integrating these tables. An essential problem of web data integration is to discover semantic correspondences between web table columns, and schema matching is a popular means to determine such correspondences. However, conventional schema matching techniques are not always effective for web table matching due to the incompleteness of web tables. In this paper, we propose a two-pronged approach to web table matching that effectively addresses the above difficulties. First, we propose a concept-based approach that maps each column of a web table to the best concept, in a well-developed knowledge base, that represents it. This approach overcomes the problem that the values of two web table columns may sometimes be disjoint, even though the columns are related, due to incompleteness in the column values. Second, we develop a hybrid machine-crowdsourcing framework that leverages human intelligence to discern the concepts for “difficult” columns. Our overall framework assigns the most “beneficial” column-to-concept matching tasks to the crowd under a given budget and utilizes the crowdsourcing result to help our algorithm infer the best matches for the rest of the columns. We validate the effectiveness of our framework through an extensive experimental study over two real-world web table data sets. The results show that our two-pronged approach outperforms existing schema matching techniques at only a low cost for crowdsourcing.
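A small sketch of the machine side of the pipeline: score a column against candidate concepts by value overlap, and route columns whose top two scores are too close to the crowd. The concept catalog, the margin threshold, and the overlap score are illustrative assumptions, not the paper's benefit model:

```python
# Toy knowledge base mapping concepts to known instance values.
CONCEPT_VALUES = {
    "country": {"france", "japan", "brazil", "canada"},
    "city": {"paris", "tokyo", "toronto", "rio"},
    "language": {"french", "japanese", "portuguese"},
}

def best_concept(column, margin=0.2):
    """Map a column to its best concept by value overlap; if the top
    two concepts score within `margin`, flag the column for the crowd."""
    vals = {v.lower() for v in column}
    scores = sorted(
        ((len(vals & cvs) / len(vals), c)
         for c, cvs in CONCEPT_VALUES.items()),
        reverse=True)
    (s1, c1), (s2, _) = scores[0], scores[1]
    if s1 - s2 < margin:
        return ("ask crowd", scores[:2])  # ambiguous: crowdsource it
    return ("machine", c1)

print(best_concept(["France", "Japan", "Brazil"]))  # clear-cut
print(best_concept(["Paris", "Japan"]))  # overlaps two concepts
```

The second column is ambiguous precisely because its values straddle two concepts, which is the kind of "difficult" column the framework spends its crowdsourcing budget on.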
Citations: 110