
Latest publications from the 2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Ranking support for matched patterns over complex event streams: The CEPR system
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498343
Jiaqi Gu, Jin Wang, C. Zaniolo
There is growing interest in pattern matching over complex event streams. While many techniques have been proposed to search for complex patterns and to enhance the expressive power of query languages, no previous work has focused on supporting a well-defined ranking mechanism over answers using semantic ordering. To satisfy this need, we propose CEPR, a CEP system capable of ranking matches and emitting ordered results based on users' intentions via a novel query language. In this demo, we will (i) demonstrate language features, system architecture, and functionality, (ii) show examples of CEPR in various application domains, and (iii) present a user-friendly interface to monitor query results and interact with the system in real time.
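The abstract does not reveal CEPR's query language. As a rough, hypothetical sketch of the underlying idea (matching a pattern over an event stream and emitting matches ranked by a semantic score), one might write the following; the event schema, pattern syntax, and scoring function are all invented for exposition:

```python
import heapq

def match_pattern(stream, pattern):
    """Yield contiguous subsequences whose event types match `pattern`.
    (A real CEP engine supports far richer patterns; this is illustrative.)"""
    n, m = len(stream), len(pattern)
    for i in range(n - m + 1):
        window = stream[i:i + m]
        if [e["type"] for e in window] == pattern:
            yield window

def ranked_matches(stream, pattern, score, k):
    """Emit the top-k matches ordered by a semantic score (higher is better)."""
    return heapq.nlargest(k, match_pattern(stream, pattern), key=score)

# Hypothetical stock-tick stream: rank RISE,RISE matches by total gain.
stream = [
    {"type": "RISE", "gain": 1.0}, {"type": "RISE", "gain": 0.5},
    {"type": "FALL", "gain": -2.0}, {"type": "RISE", "gain": 3.0},
    {"type": "RISE", "gain": 2.0},
]
top = ranked_matches(stream, ["RISE", "RISE"],
                     lambda m: sum(e["gain"] for e in m), 1)
print(top[0])  # the RISE,RISE pair with the largest total gain
```

The point of the demo system is that this ordering is expressed declaratively in the query language rather than in application code.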
Citations: 7
Practical privacy-preserving user profile matching in social networks
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498255
X. Yi, E. Bertino, Fang-Yu Rao, A. Bouguettaya
In this paper, we consider a scenario in which a user queries a user-profile database, maintained by a social networking service provider, to find users whose profiles are similar to a profile specified by the querying user. A typical example of this application is online dating. Recently, the online dating site Ashley Madison was hacked, resulting in the disclosure of a large number of dating user profiles. This serious data breach has urged researchers to explore practical privacy protection for user profiles in online dating. In this paper, we give a privacy-preserving solution for user profile matching in social networks by using multiple servers. Our solution is built on homomorphic encryption and allows a user to find matching users with the help of the multiple servers without revealing to anyone the query or the queried user profiles. Our solution achieves user profile privacy and user query privacy as long as at least one of the servers is honest. Our implementation and experiments demonstrate that our solution is practical.
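The multi-server protocol itself is beyond the abstract, but the additively homomorphic building block it relies on can be illustrated with a toy Paillier cryptosystem. The parameter sizes below are wildly insecure and chosen only so the arithmetic is readable; all function names are ours:

```python
import math
import random

def keygen(p=251, q=257):
    """Toy Paillier key pair; real deployments use >=2048-bit primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:        # r must be a unit mod n
        r = random.randrange(1, n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

pub, priv = keygen()
# Multiplying ciphertexts adds the plaintexts: the server can aggregate
# similarity scores without ever seeing them.
c = encrypt(pub, 12) * encrypt(pub, 30) % (pub[0] ** 2)
print(decrypt(pub, priv, c))  # 42
```

In a profile-matching setting, per-attribute similarity contributions could be accumulated this way on untrusted servers and only the final score decrypted.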
Citations: 26
“Told you i didn't like it”: Exploiting uninteresting items for effective collaborative filtering
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498253
Won-Seok Hwang, J. Parc, Sang-Wook Kim, Jongwuk Lee, Dongwon Lee
We study how to improve the accuracy and running time of top-N recommendation with collaborative filtering (CF). Unlike existing work that uses mostly rated items (only a small fraction of a rating matrix), we propose the notion of pre-use preferences of users toward the vast number of unrated items. Using this novel notion, we effectively identify uninteresting items that have not been rated yet but are likely to receive very low ratings from users, and impute them as zero. This simple-yet-novel zero-injection method, applied to a set of carefully chosen uninteresting items, not only addresses the sparsity problem by enriching the rating matrix but also completely prevents uninteresting items from being recommended as top-N items, thereby greatly improving accuracy. As our proposed idea is method-agnostic, it can easily be applied to a wide variety of popular CF methods. Through comprehensive experiments using the Movielens dataset and the MyMediaLite implementation, we demonstrate that our solution consistently and universally improves the accuracy of popular CF methods (e.g., item-based CF, SVD-based CF, and SVD++) by two to five orders of magnitude on average. Furthermore, our approach reduces the running time of those CF methods by 1.2 to 2.3 times when its setting produces the best accuracy. The datasets and code used in our experiments are available at: https://goo.gl/KUrmip.
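The pre-use preference model is detailed in the paper; the zero-injection step itself (imputing chosen "uninteresting" unrated cells as zero before running item-based CF) can be sketched on toy data. The ratings, the uninteresting set, and the similarity measure below are all illustrative, not the paper's:

```python
import math

R = {  # hypothetical user -> item ratings; None = unrated
    "u1": {"i1": 5, "i2": 4, "i3": None, "i4": None},
    "u2": {"i1": 4, "i2": 5, "i3": 1, "i4": None},
    "u3": {"i1": 1, "i2": None, "i3": 5, "i4": 4},
}

# Zero-injection: cells judged uninteresting by a pre-use preference model
# (here simply given by hand) are imputed as 0 instead of left missing.
uninteresting = {("u1", "i3"), ("u3", "i2")}
for (u, i) in uninteresting:
    R[u][i] = 0

def item_vector(i):
    return {u: r[i] for u, r in R.items() if r[i] is not None}

def cosine(a, b):
    num = sum(a[u] * b[u] for u in set(a) & set(b))
    da = math.sqrt(sum(v * v for v in a.values()))
    db = math.sqrt(sum(v * v for v in b.values()))
    return num / (da * db) if da and db else 0.0

def predict(u, i):
    """Item-based CF: similarity-weighted average of u's known ratings."""
    num = den = 0.0
    for j, r in R[u].items():
        if j != i and r is not None:
            s = cosine(item_vector(i), item_vector(j))
            num, den = num + s * r, den + abs(s)
    return num / den if den else 0.0

# The prediction for (u1, i4) is dragged down because i4 resembles i3,
# which zero-injection marked as uninteresting for u1.
print(round(predict("u1", "i4"), 2))
```

Without the injected zeros, i3 would simply be missing and the low pre-use preference signal would be lost.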
Citations: 55
Incremental updates on compressed XML
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498310
S. Böttcher, Rita Hartel, T. Jacobs, S. Maneth
XML tree structures can be effectively compressed using straight-line grammars. It has been an open problem how to update straight-line grammars while keeping them compressed. As a result, the best previously known methods resort to periodic decompression followed by compression from scratch. The decompression step is expensive, with potentially exponential running time. We present a method that avoids this expensive step: it recompresses the updated grammar directly, without prior decompression, and thus greatly outperforms the decompress-compress approach in terms of both space and time. Our experiments show that the obtained grammars are of similar size to, or even smaller than, those produced by the decompress-compress method.
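The recompression machinery is the paper's contribution; the data structure being updated, a straight-line grammar, can be sketched as follows. For readability this example derives a string, whereas the paper compresses XML tree structures:

```python
# A straight-line grammar: every nonterminal has exactly one production
# and the derivation is acyclic, so the grammar derives a single string.
# Uppercase symbols are nonterminals; anything else is a terminal.
grammar = {
    "S": ["A", "A", "B"],
    "A": ["B", "B", "c"],
    "B": ["a", "b"],
}

def expand(grammar, symbol, memo=None):
    """Decompress: expand one nonterminal into the string it derives."""
    memo = {} if memo is None else memo
    if symbol not in grammar:
        return symbol
    if symbol not in memo:
        memo[symbol] = "".join(expand(grammar, s, memo)
                               for s in grammar[symbol])
    return memo[symbol]

# 8 right-hand-side symbols derive a 12-character string; on repetitive
# documents the gap grows exponentially, which is why avoiding full
# decompression on update matters.
print(expand(grammar, "S"))  # "ababcababcab"
```

Updating such a grammar in place, without this expansion, is exactly the problem the paper addresses.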
Citations: 4
Keyword-aware continuous kNN query on road networks
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498297
Bolong Zheng, Kai Zheng, Xiaokui Xiao, Han Su, Hongzhi Yin, Xiaofang Zhou, Guohui Li
It is nowadays quite common for road networks to have textual content on their vertices, describing auxiliary information (e.g., businesses, traffic) associated with each vertex. In such road networks, modelled as weighted undirected graphs, each vertex is associated with one or more keywords, and each edge is assigned a weight, which can be its physical length or travelling time. In this paper, we study the problem of keyword-aware continuous k nearest neighbour (KCkNN) search on road networks, which computes the k nearest vertices that contain the query keywords issued by a moving object and maintains the results continuously as the object moves along the road network. Reducing query processing costs in terms of computation and communication has attracted considerable attention in the database community, with interesting techniques proposed. This paper proposes a framework, called a Labelling AppRoach for Continuous kNN query (LARC), on road networks to cope with KCkNN queries efficiently. First, we build a pivot-based reverse label index and a keyword-based pivot tree index to improve the efficiency of keyword-aware k nearest neighbour (KkNN) search by avoiding massive network traversals and sequential probing of keywords. To reduce the frequency of unnecessary result updates, we develop the concepts of dominance interval and dominance region on road networks, which share a similar intuition with the safe region used for processing continuous queries in Euclidean space but are more complicated and thus require a more dedicated design. For high-frequency keywords, we resolve the dominance interval when the query results change. In addition, a path-based dominance updating approach is proposed to compute the dominance region efficiently when the query keywords are of low frequency. We conduct extensive experiments comparing our algorithms with state-of-the-art methods on real data sets. The empirical observations verify the superiority of our proposed solution in index size, communication cost, and computation time.
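LARC's indexes are the paper's contribution and are not reproduced here, but the underlying query the framework accelerates, a keyword-aware kNN search on a weighted graph, can be sketched with plain Dijkstra. The graph, labels, and keywords below are invented, and the continuous (moving-object) aspect is omitted:

```python
import heapq

def keyword_knn(graph, labels, source, keyword, k):
    """Dijkstra from `source`, collecting the k nearest vertices whose
    label set contains `keyword`. `graph[u]` is a list of (v, weight)."""
    dist, heap, seen, out = {source: 0}, [(0, source)], set(), []
    while heap and len(out) < k:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if keyword in labels.get(u, set()):
            out.append((u, d))
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return out

# Hypothetical road network: vertices with attached keywords.
graph = {
    "q": [("a", 2), ("b", 5)],
    "a": [("q", 2), ("c", 2)],
    "b": [("q", 5), ("c", 1)],
    "c": [("a", 2), ("b", 1)],
}
labels = {"a": {"cafe"}, "b": {"cafe", "fuel"}, "c": {"fuel"}}
print(keyword_knn(graph, labels, "q", "cafe", 2))  # [('a', 2), ('b', 5)]
```

The cost the paper attacks is exactly the repeated network traversal this naive version performs every time the query point moves.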
Citations: 60
SQL-SA for big data discovery: polymorphic and parallelizable SQL user-defined scalar and aggregate infrastructure in Teradata Aster 6.20
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498323
Xin Tang, R. Wehrmeister, J. Shau, Abhirup Chakraborty, Daley Alex, A. A. Omari, Feven Atnafu, Jeff Davis, Litao Deng, Deepak Jaiswal, C. Keswani, Yafeng Lu, Chao Ren, T. Reyes, Kashif Siddiqui, David E. Simmen, D. Vidhani, Ling Wang, Shuai Yang, Daniel Yu
There is increasing demand to integrate big data analytic systems using SQL. Given the vast ecosystem of SQL applications, enabling SQL capabilities allows big data platforms to expose their analytic potential to a wide variety of end users, accelerating discovery processes and providing significant business value. Most existing big data frameworks are based on one particular programming model, such as MapReduce or Graph. As a result, data scientists are often forced to manually create ad-hoc data pipelines to connect various big data tools and platforms to serve their analytic needs. When the analytic tasks change, these data pipelines may be costly to modify and maintain. In this paper we present SQL-SA, a polymorphic and parallelizable SQL scalar and aggregate infrastructure in Aster 6.20. This infrastructure extends Aster 6's MapReduce and Graph capabilities to support polymorphic user-defined scalar and aggregate functions using flexible SQL syntax. The implementation extensively enhances the main Aster components, including query syntax, API, planning, and execution. Integrating these new user-defined scalar and aggregate functions with Aster MapReduce and Graph functions, Aster 6.20 enables data scientists to combine diverse programming models in a single SQL statement, which is automatically converted to an optimal data pipeline and executed in parallel. Using a real-world business problem and data, Aster 6.20 demonstrates a significant performance advantage (25%+) over Hadoop Pig and Hive.
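Aster's UDF infrastructure is proprietary, but the general shape of SQL user-defined scalar and aggregate functions can be illustrated with Python's stdlib sqlite3, a different engine used here purely for exposition:

```python
import math
import sqlite3

class GeoMean:
    """User-defined aggregate: geometric mean of positive values."""
    def __init__(self):
        self.log_sum, self.n = 0.0, 0
    def step(self, value):          # called once per input row
        self.log_sum += math.log(value)
        self.n += 1
    def finalize(self):             # called once to produce the result
        return math.exp(self.log_sum / self.n) if self.n else None

conn = sqlite3.connect(":memory:")
conn.create_function("double_it", 1, lambda x: 2 * x)  # scalar UDF
conn.create_aggregate("geomean", 1, GeoMean)           # aggregate UDF
conn.execute("CREATE TABLE t(x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(2.0,), (8.0,)])
print(conn.execute("SELECT double_it(21)").fetchone()[0])      # 42
print(conn.execute("SELECT geomean(x) FROM t").fetchone()[0])  # ~4.0
```

What SQL-SA adds beyond this basic shape, per the abstract, is polymorphism (input-dependent output schemas) and automatic parallel execution across the cluster.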
Citations: 4
Fault-tolerant real-time analytics with distributed Oracle Database In-Memory
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498333
Niloy J. Mukherjee, S. Chavan, Maria Colgan, M. Gleeson, Xiaoming He, Allison L. Holloway, J. Kamp, Kartik Kulkarni, T. Lahiri, Juan R. Loaiza, N. MacNaughton, Atrayee Mullick, S. Muthulingam, V. Raja, Raunak Rungta
Modern data management systems are required to address new breeds of OLTAP applications. These applications demand real-time analytical insights over massive data volumes, not only on dedicated data warehouses but also on live mainstream production environments where data is continuously ingested and modified. Oracle introduced the Database In-Memory Option (DBIM) in 2014 as a unique dual row and column format architecture aimed at addressing the emerging space of mixed OLTAP applications along with traditional OLAP workloads. The architecture allows the row format and the column format to be maintained simultaneously with strict transactional consistency. While the row format is persisted in underlying storage, the column format is maintained purely in memory, without incurring additional logging overheads in OLTP. Maintaining columnar data purely in memory creates the need for distributed data management architectures: analytics performance suffers severe regressions in single-server architectures during server failures, since it takes non-trivial time to recover and rebuild terabytes of in-memory columnar format. A distributed and distribution-aware architecture therefore becomes necessary to provide real-time high availability of the columnar format for glitch-free in-memory analytic query execution across server failures and additions, besides providing scale-out of capacity and compute to address real-time throughput requirements over large volumes of in-memory data. In this paper, we present the high-availability aspects of the distributed architecture of Oracle DBIM, including an extremely scaled-out, application-transparent column format duplication mechanism, distributed query execution on the duplicated in-memory columnar format, and several scenarios of fault-tolerant analytic query execution across the in-memory column format at various stages of redistribution of columnar data during cluster topology changes.
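The abstract describes a dual row and column format maintained in lockstep. A toy sketch of that idea (our own names; nothing here reflects Oracle internals) keeps both representations consistent under inserts, serving point lookups from the row format and scans from the column format:

```python
class DualFormatTable:
    """Toy dual row/column store: every update touches both formats."""
    def __init__(self, columns):
        self.columns = columns
        self.rows = []                        # row format: list of tuples
        self.cols = {c: [] for c in columns}  # column format: per-column arrays

    def insert(self, row):
        # In a real system both writes happen in one transaction.
        self.rows.append(tuple(row))
        for c, v in zip(self.columns, row):
            self.cols[c].append(v)

    def get_row(self, i):
        """OLTP-style point lookup, served from the row format."""
        return self.rows[i]

    def column_sum(self, c):
        """OLAP-style scan, served from the column format."""
        return sum(self.cols[c])

t = DualFormatTable(["id", "amount"])
t.insert([1, 10])
t.insert([2, 32])
print(t.get_row(0), t.column_sum("amount"))  # (1, 10) 42
```

The paper's concern is what happens when the in-memory columnar half of such a structure must be rebuilt or redistributed after a server failure.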
Citations: 8
Virtual lightweight snapshots for consistent analytics in NoSQL stores
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498334
F. Chirigati, Jérôme Siméon, Martin Hirzel, J. Freire
Increasingly, applications that deal with big data need to run analytics concurrently with updates. But bridging the gap between big and fast data is challenging: most of these applications require analytics results that are fresh and consistent, without impacting system latency and throughput. We propose virtual lightweight snapshots (VLS), a mechanism that enables consistent analytics without blocking incoming updates in NoSQL stores. VLS requires neither native support for database versioning nor a transaction manager. Moreover, it is storage-efficient, keeping additional versions of records only when needed to guarantee consistency, and sharing versions across multiple concurrent snapshots. We describe an implementation of VLS in MongoDB and present a detailed experimental evaluation showing that it supports consistent analytics with small impact on query evaluation time, update throughput, and latency.
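VLS is described only at a high level here; a minimal copy-on-write sketch (hypothetical API, ours) captures the core idea: an old version of a record is kept only while an open snapshot still needs it, so analytics read a consistent state while writers proceed:

```python
class SnapshotStore:
    """Toy copy-on-write snapshots: old versions kept only while needed."""
    def __init__(self):
        self.data = {}        # live records
        self.snapshots = {}   # snap_id -> {key: pre-image saved on first write}
        self.next_id = 0

    def begin_snapshot(self):
        self.next_id += 1
        self.snapshots[self.next_id] = {}
        return self.next_id

    def put(self, key, value):
        # Save the pre-image once per open snapshot, then update in place.
        for saved in self.snapshots.values():
            if key not in saved:
                saved[key] = self.data.get(key)
        self.data[key] = value

    def get(self, key, snap_id=None):
        if snap_id is not None and key in self.snapshots[snap_id]:
            return self.snapshots[snap_id][key]
        return self.data.get(key)

    def end_snapshot(self, snap_id):
        del self.snapshots[snap_id]   # saved versions become reclaimable

s = SnapshotStore()
s.put("x", 1)
snap = s.begin_snapshot()
s.put("x", 2)                        # concurrent update, not blocked
print(s.get("x", snap), s.get("x"))  # 1 2
```

VLS additionally shares such saved versions across multiple concurrent snapshots, which this sketch does only trivially (each snapshot stores its own pre-images).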
pages 1310-1321
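A much-simplified, hedged sketch of the snapshot idea behind VLS (the API names are illustrative, not the paper's, and this sketch keeps a single snapshot epoch where the paper supports several): save the pre-update value of a record only while a snapshot is open, let concurrent snapshots share that saved version, and drop it once the last snapshot closes.

```python
# Copy-on-write snapshot sketch: an extra version of a record is kept
# only when it is overwritten while a snapshot is open, and all open
# snapshots share it. Not storage-per-snapshot, not a full version store.

class SnapshotStore:
    def __init__(self):
        self.data = {}           # key -> current value
        self.old = {}            # key -> value as of the open snapshots
        self.open_snapshots = 0

    def put(self, key, value):
        # Preserve the pre-update value only if a snapshot may need it.
        if self.open_snapshots and key in self.data and key not in self.old:
            self.old[key] = self.data[key]
        self.data[key] = value

    def snapshot(self):
        self.open_snapshots += 1
        store = self

        class Snapshot:
            def get(self, key):
                # Prefer the saved version; fall back to current data.
                return store.old.get(key, store.data.get(key))

            def close(self):
                store.open_snapshots -= 1
                if store.open_snapshots == 0:
                    store.old.clear()  # no reader left: free saved versions

        return Snapshot()
```

The design point this sketch mirrors is that updates never block: `put` runs whether or not analytics are in flight, paying only one extra copy per overwritten record per snapshot epoch.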
Citations: 7
HAWK: Hardware support for unstructured log processing
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498263
Prateek Tandon, Faissal M. Sleiman, Michael J. Cafarella, T. Wenisch
Rapidly processing high-velocity text data is critical for many technical and business applications. Widely used software solutions for processing these large text corpora target disk-resident data and rely on pre-computed indexes and large clusters to achieve high performance. However, greater capacity and falling costs are enabling a shift to RAM-resident data sets. The enormous bandwidth of RAM can support scan operations that are competitive with pre-computed indexes for interactive, ad-hoc queries. However, software approaches for processing these large text corpora fall far short of saturating available bandwidth and meeting the peak scan rates possible on modern memory systems. In this paper, we present HAWK, a hardware accelerator for ad-hoc queries against large in-memory logs. HAWK comprises a stall-free hardware pipeline that scans input data at a constant rate, examining multiple input characters in parallel during a single accelerator clock cycle. We describe a 1GHz, 32-character-wide HAWK design targeting ASIC implementation, designed to process data at 32GB/s (up to two orders of magnitude faster than software solutions), and demonstrate a scaled-down FPGA prototype that operates at 100MHz with 4-wide parallelism and processes data at 400MB/s (13× faster than software grep for large multi-pattern scans).
pages 469-480
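HAWK's contribution is hardware, but the wide-scanning idea can be mimicked in software as a hedged analogy (the function and its parameters are ours, not the paper's): advance through the log W characters per "cycle" and test every pattern at every alignment inside the stride, just as the W-wide match units would in parallel within one clock.

```python
# Software analogy of a W-wide scanning pipeline: each stride of `width`
# characters is examined at all `width` alignments, which in hardware
# would happen in parallel in a single cycle.

def wide_scan(text, patterns, width=4):
    """Return (offset, pattern) pairs for all matches, scanning in
    width-character strides as a W-wide pipeline would."""
    matches = []
    for base in range(0, len(text), width):
        # In hardware these `width` alignments are checked concurrently;
        # here we simply loop over the lanes.
        for lane in range(width):
            pos = base + lane
            for pat in patterns:
                if pat and text.startswith(pat, pos):
                    matches.append((pos, pat))
    return matches
```

Note the contrast with the paper's point: software pays for every lane sequentially, while the accelerator evaluates all lanes per cycle, which is where the two-orders-of-magnitude gap comes from.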
Citations: 22
A model-based approach for text clustering with outlier detection
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498276
Jianhua Yin, Jianyong Wang
Text clustering is a challenging problem due to the high-dimensional and large-volume characteristics of text datasets. In this paper, we propose a collapsed Gibbs sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbreviated GSDPMM), which does not need to specify the number of clusters in advance and can cope with the high-dimensional nature of text clustering. Our extensive experimental study shows that GSDPMM achieves significantly better performance than three other clustering methods and high consistency on both long and short text datasets. We found that GSDPMM has low time and space complexity and can scale well to huge text datasets. We also propose novel and effective methods to detect outliers in the dataset and to obtain the representative words of each cluster.
pages 625-636
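A minimal, hedged sketch of collapsed Gibbs sampling for a Dirichlet Process Multinomial Mixture in the spirit of GSDPMM (the function name, hyperparameter defaults, and update order are illustrative, not the paper's): each document is repeatedly reassigned either to an existing cluster, with weight proportional to how well the cluster's word counts explain it, or to a brand-new cluster, so the number of clusters emerges from the data rather than being fixed in advance.

```python
import random
from collections import Counter, defaultdict

def gsdpmm(docs, alpha=1.0, beta=0.1, iters=15, seed=0):
    """Collapsed Gibbs sampler for a DP multinomial mixture over
    tokenized docs; returns one cluster label per document.
    Probabilities are computed directly (no logs), which is fine for
    the short texts GSDPMM targets."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})   # vocabulary size
    z = [-1] * len(docs)                    # cluster of each doc
    m, n = Counter(), Counter()             # docs / words per cluster
    nw = defaultdict(Counter)               # per-cluster word counts
    fresh = 0                               # next unused cluster id

    def weight(cnt, Nd, mz, wz, sz):
        # mz * prod_w prod_j (wz[w]+beta+j) / prod_i (sz+V*beta+i)
        p = float(mz)
        for w, c in cnt.items():
            for j in range(c):
                p *= wz[w] + beta + j
        for i in range(Nd):
            p /= sz + V * beta + i
        return p

    for _ in range(iters):
        for d, doc in enumerate(docs):
            cnt, Nd = Counter(doc), len(doc)
            if z[d] != -1:                  # remove doc from its cluster
                k = z[d]
                m[k] -= 1
                n[k] -= Nd
                for w, c in cnt.items():
                    nw[k][w] -= c
                if m[k] == 0:               # cluster died out
                    del m[k], n[k], nw[k]
            clusters = list(m)
            ws = [weight(cnt, Nd, m[k], nw[k], n[k]) for k in clusters]
            ws.append(weight(cnt, Nd, alpha, Counter(), 0))  # new cluster
            pick = rng.choices(range(len(ws)), weights=ws)[0]
            if pick == len(clusters):
                k, fresh = fresh, fresh + 1
            else:
                k = clusters[pick]
            z[d] = k
            m[k] += 1
            n[k] += Nd
            for w, c in cnt.items():
                nw[k][w] += c
    return z
```

The new-cluster weight uses the concentration parameter `alpha` in place of a cluster's document count, which is exactly what lets the sampler grow or shrink the number of clusters as it runs.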
Citations: 59
Journal
2016 IEEE 32nd International Conference on Data Engineering (ICDE)