
2016 IEEE 32nd International Conference on Data Engineering (ICDE): latest publications

Ranking support for matched patterns over complex event streams: The CEPR system
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498343
Jiaqi Gu, Jin Wang, C. Zaniolo
There is growing interest in pattern matching over complex event streams. While many techniques have been proposed to search for complex patterns and to enhance the expressive power of query languages, no previous work has focused on supporting a well-defined ranking mechanism over answers based on semantic ordering. To satisfy this need, we propose CEPR, a CEP system capable of ranking matches and emitting ordered results according to users' intentions, expressed via a novel query language. In this demo, we will (i) demonstrate language features, system architecture and functionality, (ii) show examples of CEPR in various application domains, and (iii) present a user-friendly interface to monitor query results and interact with the system in real time.
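The core of the demo, ordering matched patterns by a user-defined score, can be sketched in a few lines (a hypothetical illustration, not CEPR's actual engine; the match representation and scoring function here are invented):

```python
import heapq

# Hypothetical sketch: rank pattern matches from an event stream by a
# user-supplied scoring function and emit the top-k matches in order.
def top_k_matches(matches, score, k):
    """matches: iterable of matched-pattern records; score: match -> float."""
    # nlargest keeps only k candidates in memory at any time,
    # which matters when the stream produces many matches.
    return heapq.nlargest(k, matches, key=score)

# Toy example: matches are (symbol, price_rise) pairs, ranked by rise.
matches = [("A", 1.2), ("B", 3.4), ("C", 0.5), ("D", 2.8)]
ranked = top_k_matches(matches, score=lambda m: m[1], k=2)
# ranked == [("B", 3.4), ("D", 2.8)]
```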
Pages: 1354-1357
Citations: 7
Virtual lightweight snapshots for consistent analytics in NoSQL stores
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498334
F. Chirigati, Jérôme Siméon, Martin Hirzel, J. Freire
Increasingly, applications that deal with big data need to run analytics concurrently with updates. Bridging the gap between big and fast data is challenging: most of these applications require analytics results that are fresh and consistent, yet without impact on system latency and throughput. We propose virtual lightweight snapshots (VLS), a mechanism that enables consistent analytics without blocking incoming updates in NoSQL stores. VLS requires neither native support for database versioning nor a transaction manager. It is also storage-efficient, keeping additional versions of records only when needed to guarantee consistency and sharing versions across multiple concurrent snapshots. We describe an implementation of VLS in MongoDB and present a detailed experimental evaluation showing that it supports consistent analytics with small impact on query evaluation time, update throughput, and latency.
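A toy sketch of the snapshot idea, under the simplifying assumption (made here for illustration; this is not the paper's MongoDB implementation) that an update preserves a record's old version only while some open snapshot still needs it:

```python
class VLSStore:
    """Toy virtual-lightweight-snapshot store (illustration only)."""

    def __init__(self):
        self.current = {}    # key -> latest committed value
        self.snapshots = {}  # snapshot id -> {key: value as of snapshot time}
        self.next_sid = 0

    def put(self, key, value):
        # Before overwriting, preserve the old state of this key for every
        # open snapshot that has not yet recorded it; extra versions exist
        # only while a snapshot needs them, and are shared per snapshot.
        for saved in self.snapshots.values():
            if key not in saved:
                saved[key] = self.current.get(key)  # None = absent at snapshot time
        self.current[key] = value  # the writer never blocks

    def open_snapshot(self):
        sid = self.next_sid
        self.next_sid += 1
        self.snapshots[sid] = {}
        return sid

    def read(self, sid, key):
        saved = self.snapshots[sid]
        if key in saved:
            return saved[key]
        return self.current.get(key)  # unchanged since snapshot time

    def close_snapshot(self, sid):
        del self.snapshots[sid]  # preserved versions become reclaimable

store = VLSStore()
store.put("x", 1)
s = store.open_snapshot()
store.put("x", 2)  # concurrent update, not blocked
# store.read(s, "x") == 1 while fresh reads see 2
```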
Pages: 1310-1321
Citations: 7
Towards Virtual Private NoSQL datastores
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498240
Pietro Colombo, E. Ferrari
Many modern applications use context-related information to provide highly personalized services, and use NoSQL databases for data management, as these systems offer outstanding performance and support high volumes of data. However, NoSQL databases provide poor data protection: only basic, coarse-grained access control and no support for context-aware policies. We therefore believe a general approach is required to enhance NoSQL datastores with fine-grained, context-aware access control. In this paper, we start to fill this void by targeting MongoDB, a very popular datastore. The contribution is twofold: we enhance MongoDB's access control model with advanced features, and we define an enforcement monitor for the proposed model that can be used straightforwardly in any MongoDB deployment. Technological limitations of MongoDB do not allow implementing the same efficient enforcement mechanism for all query types; as a consequence, experimental results show an enforcement overhead that is significant for aggregate queries, in contrast with the low overhead measured for find and map-reduce queries.
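One common way to build such an enforcement monitor is query rewriting: a context-derived policy clause is AND-ed into every find-style filter before it reaches the datastore, so forbidden documents are never returned. The sketch below is hypothetical (the policy shape and names are invented, not the paper's monitor); `$and` is MongoDB's standard conjunction operator:

```python
def rewrite_find(user_filter, policies, context):
    """AND context-derived policy clauses into a user's find filter."""
    clauses = [user_filter]
    for policy in policies:
        clause = policy(context)  # None means: no restriction in this context
        if clause is not None:
            clauses.append(clause)
    return {"$and": clauses} if len(clauses) > 1 else user_filter

# Example policy: outside working hours, only public documents are visible.
def working_hours_policy(ctx):
    if not 9 <= ctx["hour"] < 17:
        return {"visibility": "public"}
    return None

q = rewrite_find({"dept": "sales"}, [working_hours_policy], {"hour": 20})
# q == {"$and": [{"dept": "sales"}, {"visibility": "public"}]}
```

During working hours the policy contributes nothing and the user's filter passes through unchanged, which keeps the fast path free of overhead.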
Pages: 193-204
Citations: 19
Keyword-aware continuous kNN query on road networks
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498297
Bolong Zheng, Kai Zheng, Xiaokui Xiao, Han Su, Hongzhi Yin, Xiaofang Zhou, Guohui Li
It is nowadays quite common for road networks to have textual content on the vertices, describing auxiliary information (e.g., business, traffic) associated with each vertex. In such road networks, modelled as weighted undirected graphs, each vertex is associated with one or more keywords, and each edge is assigned a weight, which can be its physical length or travelling time. In this paper, we study the problem of keyword-aware continuous k nearest neighbour (KCkNN) search on road networks, which computes the k nearest vertices that contain the query keywords issued by a moving object and maintains the results continuously as the object moves on the road network. Reducing the query processing costs, in terms of both computation and communication, has attracted considerable attention in the database community, with interesting techniques proposed. This paper proposes a framework, called a Labelling AppRoach for Continuous kNN query (LARC), on road networks to cope with KCkNN queries efficiently. First, we build a pivot-based reverse label index and a keyword-based pivot tree index to improve the efficiency of keyword-aware k nearest neighbour (KkNN) search by avoiding massive network traversals and sequential probing of keywords. To reduce the frequency of unnecessary result updates, we develop the concepts of dominance interval and dominance region on road networks, which share a similar intuition with the safe region used for processing continuous queries in Euclidean space but are more complicated and thus require a more dedicated design. For high-frequency keywords, we resolve the dominance interval when the query results change. In addition, a path-based dominance updating approach is proposed to compute the dominance region efficiently when the query keywords are of low frequency. We conduct extensive experiments comparing our algorithms with the state-of-the-art methods on real data sets. The empirical observations verify the superiority of our solution in all aspects: index size, communication cost, and computation time.
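As a point of reference for what LARC's label indexes avoid, the naive baseline for keyword-aware kNN on a road network is a Dijkstra expansion from the query object that stops after k keyword-matching vertices have been found; a minimal sketch (illustrative, not the paper's algorithm):

```python
import heapq

# Naive keyword-aware kNN by network expansion (the baseline LARC improves on).
# graph: vertex -> list of (neighbor, edge_weight); keywords: vertex -> set of keywords.
def keyword_knn(graph, keywords, source, query_kw, k):
    dist = {source: 0.0}
    heap = [(0.0, source)]
    result = []
    while heap and len(result) < k:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        if query_kw in keywords.get(v, set()):
            result.append((v, d))  # vertices pop in distance order
        for u, w in graph.get(v, []):
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return result

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
kw = {"b": {"cafe"}, "c": {"cafe"}}
# two nearest "cafe" vertices from "a": b at distance 1, then c at 2 (via b)
```

The weakness the paper targets is visible here: every movement of the query object repeats the whole expansion, which is exactly what the dominance interval/region machinery avoids.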
Pages: 871-882
Citations: 60
Fault-tolerant real-time analytics with distributed Oracle Database In-memory
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498333
Niloy J. Mukherjee, S. Chavan, Maria Colgan, M. Gleeson, Xiaoming He, Allison L. Holloway, J. Kamp, Kartik Kulkarni, T. Lahiri, Juan R. Loaiza, N. MacNaughton, Atrayee Mullick, S. Muthulingam, V. Raja, Raunak Rungta
Modern data management systems are required to address a new breed of OLTAP applications. These applications demand real-time analytical insights over massive data volumes, not only on dedicated data warehouses but also on live mainstream production environments where data is continuously ingested and modified. Oracle introduced the Database In-memory Option (DBIM) in 2014, a unique dual-format (row and column) architecture aimed at the emerging space of mixed OLTAP applications alongside traditional OLAP workloads. The architecture allows both the row format and the column format to be maintained simultaneously with strict transactional consistency. While the row format is persisted in underlying storage, the column format is maintained purely in memory, without incurring additional logging overhead for OLTP. Maintaining columnar data purely in memory creates the need for distributed data management architectures: in a single-server architecture, analytics performance regresses severely during server failures, since recovering and rebuilding terabytes of in-memory columnar data takes non-trivial time. A distributed, distribution-aware architecture therefore becomes necessary to provide real-time high availability of the columnar format, so that in-memory analytic queries execute glitch-free across server failures and additions, in addition to scaling out capacity and compute to meet real-time throughput requirements over large volumes of in-memory data. In this paper, we present the high-availability aspects of the distributed architecture of Oracle DBIM, including its massively scaled-out, application-transparent column-format duplication mechanism, distributed query execution on the duplicated in-memory columnar format, and several scenarios of fault-tolerant analytic query execution across the in-memory column format at various stages of the redistribution of columnar data during cluster topology changes.
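The dual-format idea can be illustrated with a toy table that applies every insert to both the row format and the in-memory columnar copy as one logical operation (an illustration of the concept only, not Oracle's implementation, which does this under the transaction's locks with redo logging for the row format alone):

```python
# Toy dual-format table: row format is the "persisted" copy, the column
# format is a purely in-memory mirror kept transactionally in step.
class DualFormatTable:
    def __init__(self, columns):
        self.columns = columns
        self.rows = []                        # row format (OLTP writes)
        self.cols = {c: [] for c in columns}  # in-memory column format (analytics)

    def insert(self, row):
        # Both formats are updated together, so a scan of the column
        # format always sees the same committed state as the row format.
        self.rows.append(row)
        for c in self.columns:
            self.cols[c].append(row[c])

    def column_sum(self, column):
        # Analytic scan touches only the one column it needs.
        return sum(self.cols[column])

t = DualFormatTable(["id", "amount"])
t.insert({"id": 1, "amount": 10})
t.insert({"id": 2, "amount": 32})
# t.column_sum("amount") == 42
```

The distributed-architecture problem the paper addresses is what happens when `cols` (which is never persisted) is lost on a server failure and must be rebuilt or served from a duplicate elsewhere.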
Pages: 1298-1309
Citations: 8
A model-based approach for text clustering with outlier detection
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498276
Jianhua Yin, Jianyong Wang
Text clustering is a challenging problem due to the high-dimensional and large-volume characteristics of text datasets. In this paper, we propose a collapsed Gibbs sampling algorithm for the Dirichlet process multinomial mixture model for text clustering (abbreviated GSDPMM), which does not need the number of clusters to be specified in advance and can cope with the high dimensionality of text clustering. Our extensive experimental study shows that GSDPMM achieves significantly better performance than three other clustering methods and achieves high consistency on both long and short text datasets. We found that GSDPMM has low time and space complexity and scales well to huge text datasets. We also propose novel and effective methods to detect outliers in the dataset and to obtain the representative words of each cluster.
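A compact sketch of one collapsed-Gibbs sweep for a Dirichlet process multinomial mixture, in the spirit of GSDPMM (simplified, not the authors' code; repeated words in a document are treated independently in the likelihood for brevity):

```python
import random
from collections import defaultdict

def gibbs_sweep(docs, assign, clusters, alpha, beta, vocab_size, rng):
    """One collapsed-Gibbs pass: reassign every document in turn."""
    for d, doc in enumerate(docs):
        z = assign[d]
        # Remove the document from its current cluster.
        clusters[z]["n"] -= 1
        for w in doc:
            clusters[z]["words"][w] -= 1
            clusters[z]["total"] -= 1
        if clusters[z]["n"] == 0:
            del clusters[z]  # empty clusters vanish: no fixed cluster count
        # Candidates: every existing cluster, plus one brand-new cluster.
        new_id = max(clusters, default=-1) + 1
        empty = {"n": None, "words": {}, "total": 0}
        cands, weights = [], []
        for cid, c in list(clusters.items()) + [(new_id, empty)]:
            # CRP prior: cluster size for existing clusters, alpha for a new one.
            weight = alpha if c["n"] is None else c["n"]
            for w in doc:  # multinomial likelihood of the document's words
                weight *= (c["words"].get(w, 0) + beta) / (c["total"] + vocab_size * beta)
            cands.append(cid)
            weights.append(weight)
        z = rng.choices(cands, weights)[0]
        # Add the document to the sampled cluster.
        c = clusters.setdefault(z, {"n": 0, "words": defaultdict(int), "total": 0})
        c["n"] += 1
        for w in doc:
            c["words"][w] += 1
            c["total"] += 1
        assign[d] = z

# Toy corpus: all documents start in one cluster, then we sweep.
rng = random.Random(7)
docs = [["cat", "cat"], ["cat"], ["dog", "dog"], ["dog"]]
clusters = {0: {"n": 0, "words": defaultdict(int), "total": 0}}
assign = [0] * len(docs)
for doc in docs:
    clusters[0]["n"] += 1
    for w in doc:
        clusters[0]["words"][w] += 1
        clusters[0]["total"] += 1
for _ in range(20):
    gibbs_sweep(docs, assign, clusters, alpha=0.5, beta=0.1, vocab_size=2, rng=rng)
# cat-documents and dog-documents typically separate into two clusters
```

The deletion of emptied clusters and the always-present "new cluster" candidate are what let the number of clusters grow and shrink during sampling instead of being fixed in advance.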
Pages: 625-636
Citations: 59
SQL-SA for big data discovery: polymorphic and parallelizable SQL user-defined scalar and aggregate infrastructure in Teradata Aster 6.20
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498323
Xin Tang, R. Wehrmeister, J. Shau, Abhirup Chakraborty, Daley Alex, A. A. Omari, Feven Atnafu, Jeff Davis, Litao Deng, Deepak Jaiswal, C. Keswani, Yafeng Lu, Chao Ren, T. Reyes, Kashif Siddiqui, David E. Simmen, D. Vidhani, Ling Wang, Shuai Yang, Daniel Yu
There is increasing demand to integrate big data analytic systems using SQL. Given the vast ecosystem of SQL applications, enabling SQL capabilities allows big data platforms to expose their analytic potential to a wide variety of end users, accelerating discovery processes and providing significant business value. Most existing big data frameworks are based on one particular programming model, such as MapReduce or Graph. Data scientists are therefore often forced to manually create ad-hoc data pipelines to connect various big data tools and platforms to serve their analytic needs; when the analytic tasks change, these pipelines may be costly to modify and maintain. In this paper we present SQL-SA, a polymorphic and parallelizable SQL scalar and aggregate infrastructure in Aster 6.20. This infrastructure extends Aster 6's MapReduce and Graph capabilities to support polymorphic user-defined scalar and aggregate functions using flexible SQL syntax. The implementation extensively enhances the main Aster components, including query syntax, API, planning, and execution. By integrating these new user-defined scalar and aggregate functions with Aster MapReduce and Graph functions, Aster 6.20 enables data scientists to combine diverse programming models in a single SQL statement, which is automatically converted to an optimal data pipeline and executed in parallel. On a real-world business problem and data, Aster 6.20 demonstrates a significant performance advantage (25%+) over Hadoop Pig and Hive.
Pages: 1182-1193
Citations: 4
“Told you i didn't like it”: Exploiting uninteresting items for effective collaborative filtering
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498253
Won-Seok Hwang, J. Parc, Sang-Wook Kim, Jongwuk Lee, Dongwon Lee
We study how to improve the accuracy and running time of top-N recommendation with collaborative filtering (CF). Unlike existing work that mostly uses rated items (which are only a small fraction of a rating matrix), we propose the notion of pre-use preferences of users toward the vast number of unrated items. Using this notion, we effectively identify uninteresting items that have not been rated yet but are likely to receive very low ratings from users, and impute their ratings as zero. This simple yet novel zero-injection method, applied to a set of carefully chosen uninteresting items, not only addresses the sparsity problem by enriching the rating matrix but also completely prevents uninteresting items from being recommended as top-N items, thereby greatly improving accuracy. As the proposed idea is method-agnostic, it can easily be applied to a wide variety of popular CF methods. Through comprehensive experiments using the Movielens dataset and the MyMediaLite implementation, we demonstrate that our solution consistently and universally improves the accuracy of popular CF methods (e.g., item-based CF, SVD-based CF, and SVD++) by two to five orders of magnitude on average. Furthermore, our approach reduces the running time of those CF methods by 1.2 to 2.3 times when its setting produces the best accuracy. The datasets and code used in our experiments are available at: https://goo.gl/KUrmip.
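The zero-injection step can be sketched as follows (a minimal illustration: the pre-use-preference proxy and the injection fraction here are invented choices, not the paper's model):

```python
# Zero-injection sketch: score every unrated user-item pair with a
# pre-use-preference estimate, mark the lowest-scoring pairs as
# "uninteresting", and inject 0-ratings for them; any standard CF
# method can then run on the enriched matrix.
def zero_inject(R, pre_use_score, fraction):
    """R: list of rows with None for unrated; returns an enriched copy."""
    R = [row[:] for row in R]
    unrated = [(u, i) for u, row in enumerate(R)
               for i, v in enumerate(row) if v is None]
    ranked = sorted(unrated, key=lambda ui: pre_use_score(*ui))
    for u, i in ranked[:int(len(unrated) * fraction)]:
        R[u][i] = 0.0  # "told you I didn't like it"
    return R

R = [[5.0, None, None],
     [4.0, None, 1.0],
     [None, 2.0, None]]
# Crude pre-use-preference proxy: item popularity (sum of known ratings).
popularity = [sum(row[i] for row in R if row[i] is not None) for i in range(3)]
enriched = zero_inject(R, lambda u, i: popularity[i], fraction=0.4)
# with 5 unrated cells, the 2 least-popular pairs get injected zeros
```

The injected zeros both densify the matrix and guarantee those items can never surface in a top-N list, which is the mechanism behind the accuracy gains the abstract reports.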
Pages: 349-360
Cited: 55
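The zero-injection step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: item popularity (rating frequency) stands in for the paper's learned pre-use preference scores, and the toy matrix and threshold are hypothetical.

```python
# Toy rating matrix: rows = users, columns = items, None = unrated.
R = [
    [5, None, 3, None, 1],
    [4, None, None, 1, None],
    [None, 2, 4, None, 5],
    [1, None, 5, 4, None],
]

def zero_inject(R, theta=0.5):
    """Impute the theta fraction of unrated cells whose items have the
    lowest pre-use preference as explicit zeros (uninteresting items)."""
    n_users, n_items = len(R), len(R[0])
    # Pre-use preference proxy: fraction of users who rated each item.
    popularity = [sum(R[u][i] is not None for u in range(n_users)) / n_users
                  for i in range(n_items)]
    unrated = [(u, i) for u in range(n_users) for i in range(n_items)
               if R[u][i] is None]
    # Rank unrated cells by their item's popularity, lowest first.
    unrated.sort(key=lambda cell: popularity[cell[1]])
    k = int(theta * len(unrated))
    injected = [row[:] for row in R]       # leave the input matrix intact
    for u, i in unrated[:k]:
        injected[u][i] = 0                 # "told you I didn't like it"
    return injected

R_enriched = zero_inject(R)
```

Any downstream CF method then treats the injected zeros as observed low ratings, which both densifies the matrix and keeps those items out of the top-N list.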
FastFunction: Replacing a herd of lemmings with a cheetah: a Ruby framework for interaction with PostgreSQL databases
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498331
Henrietta Dombrovskaya, Srivathsava Rangarajan, Jonathan Marks
The ability of web applications to respond quickly is critical to the success of any web-based business, and time spent interacting with the database is often the most time-consuming portion of the overall response time. Although recent research suggests many proven algorithms to improve this interaction, application developers often consider the technicalities of implementation too time-consuming. In this paper we present a tool that eliminates a significant portion of these technical difficulties and allows almost seamless incorporation of complex database queries and functions into object-oriented applications.
Pages: 1275-1286
Cited: 1
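The "herd of lemmings vs. cheetah" contrast is about database round trips: one query per row versus a single set-oriented call. The sketch below illustrates only this general principle, not the Ruby framework itself; `sqlite3` stands in for PostgreSQL so the example is self-contained, and the table and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'ada', 10.0), (2, 'bob', 20.0), (3, 'ada', 5.0);
""")

def totals_lemming_style(customers):
    """One round trip per customer: the N+1-query anti-pattern."""
    out = {}
    for c in customers:
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer = ?",
            (c,),
        ).fetchone()
        out[c] = row[0]
    return out

def totals_cheetah_style(customers):
    """A single set-oriented round trip, as a server-side function would do."""
    placeholders = ",".join("?" * len(customers))
    rows = conn.execute(
        f"SELECT customer, SUM(total) FROM orders "
        f"WHERE customer IN ({placeholders}) GROUP BY customer",
        customers,
    ).fetchall()
    return dict(rows)
```

Both functions return the same totals; the second issues one query regardless of how many customers are requested, which is the kind of interaction a database-side function makes natural.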
Finding the minimum spatial keyword cover
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498281
Dong-Wan Choi, J. Pei, Xuemin Lin
Existing work on spatial keyword search focuses on finding a group of spatial objects covering all the query keywords while minimizing the diameter of the group. However, we observe that such a formulation may not address what users need in some application scenarios. In this paper, we introduce a novel spatial keyword cover problem (SK-COVER for short), which aims to identify the group of spatio-textual objects covering all keywords in a query while minimizing a distance cost function that leads to fewer proximate objects in the answer set. We prove that SK-COVER is not only NP-hard but also does not admit an approximation better than O(log m) in polynomial time, where m is the number of query keywords. We establish an O(log m)-approximation algorithm, which is asymptotically optimal in terms of the approximability of SK-COVER. Furthermore, we devise effective accessing strategies and pruning rules to improve the overall efficiency and scalability. In addition to our algorithmic results, we empirically show that our approximation algorithm always achieves the best accuracy, and its efficiency is comparable to that of a state-of-the-art algorithm intended for mCK, a problem similar to yet theoretically easier than SK-COVER.
Pages: 685-696
Cited: 33
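O(log m) guarantees of this kind typically come from greedy set-cover-style selection. The sketch below is the classic greedy scheme applied to keyword covering, not the paper's exact SK-COVER algorithm: at each step it picks the object with the smallest added distance cost per newly covered keyword. Objects, positions, and the query are hypothetical.

```python
import math

objects = [
    {"pos": (0.0, 0.0), "kw": {"cafe"}},
    {"pos": (1.0, 0.0), "kw": {"wifi", "cafe"}},
    {"pos": (0.5, 1.0), "kw": {"parking"}},
    {"pos": (9.0, 9.0), "kw": {"wifi", "parking"}},   # far away
]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def greedy_keyword_cover(objects, query_kw):
    """Return a list of objects covering query_kw, greedily minimizing
    the added distance cost (sum of distances to already chosen objects)
    per newly covered keyword."""
    uncovered = set(query_kw)
    chosen = []
    while uncovered:
        best, best_ratio = None, math.inf
        for obj in objects:
            gain = len(obj["kw"] & uncovered)
            if gain == 0 or obj in chosen:
                continue
            # First pick has no distance cost yet; use a unit cost so the
            # ratio rewards objects covering more keywords.
            cost = sum(dist(obj["pos"], c["pos"]) for c in chosen) or 1.0
            ratio = cost / gain
            if ratio < best_ratio:
                best, best_ratio = obj, ratio
        if best is None:
            raise ValueError("query keywords cannot be covered")
        chosen.append(best)
        uncovered -= best["kw"]
    return chosen

cover = greedy_keyword_cover(objects, {"cafe", "wifi", "parking"})
```

On this toy input the greedy pass covers all three keywords with the two nearby objects and avoids the distant one, which is the behavior the distance cost function is meant to enforce.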
Journal
2016 IEEE 32nd International Conference on Data Engineering (ICDE)