2012 IEEE 28th International Conference on Data Engineering: Latest Papers

Incorporating Duration Information for Trajectory Classification
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.72
D. Patel, Chang Sheng, W. Hsu, M. Lee
Trajectory classification has many useful applications. Existing work on trajectory classification does not consider the duration information of trajectories. In this paper, we extract duration-aware features from trajectories to build a classifier. Our method uses information theory to obtain regions where the trajectories have similar speeds and directions. Further, trajectories are summarized into a network based on the minimum description length (MDL) principle, taking into account the duration differences among trajectories of different classes. A graph traversal is performed on this trajectory network to obtain the top-k covering path rules for each trajectory. Based on the discovered regions and top-k path rules, we build a classifier to predict the class labels of new trajectories. Experimental results on real-world datasets show that the proposed duration-aware classifier achieves higher classification accuracy than the state-of-the-art trajectory classifier.
Citations: 26
Interactive User Feedback in Ontology Matching Using Signature Vectors
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.137
I. Cruz, Cosmin Stroe, M. Palmonari
When compared to a gold standard, the set of mappings generated by an automatic ontology matching process is not complete, nor are the individual mappings always correct. However, given the explosion in the number, size, and complexity of available ontologies, domain experts can no longer create ontology mappings without considerable effort. We present a solution to this problem that makes the ontology matching process interactive so as to incorporate user feedback in the loop. Our approach clusters mappings to identify where user feedback will be most beneficial in reducing the number of user interactions and system iterations. This feedback process has been implemented in the AgreementMaker system and is supported by visual analytic techniques that help users better understand the matching process. Experimental results using the OAEI benchmarks show the effectiveness of our approach. We will demonstrate how users can interact with the ontology matching process through the AgreementMaker user interface to match real-world ontologies.
Citations: 43
TEDAS: A Twitter-based Event Detection and Analysis System
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.125
Rui Li, Kin Hou Lei, Ravi V. Khadiwala, K. Chang
Witnessing the emergence of Twitter, we propose a Twitter-based Event Detection and Analysis System (TEDAS), which helps to (1) detect new events, (2) analyze the spatial and temporal pattern of an event, and (3) identify the importance of events. In this demonstration, we show the overall system architecture, explain in detail the implementation of the components that crawl, classify, and rank tweets and extract locations from tweets, and present some interesting results of our system.
Citations: 462
Emerging Graph Queries in Linked Data
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.143
Arijit Khan, Yinghui Wu, Xifeng Yan
In a wide array of disciplines, data can be modeled as an interconnected network of entities, where various attributes can be associated with both the entities and the relations among them. Knowledge is often hidden in the complex structure and attributes inside these networks. While querying and mining these linked datasets is essential for various applications, traditional graph queries may not be able to capture the rich semantics in these networks. With the advent of complex information networks, new graph queries are emerging, including graph pattern matching and mining, similarity search, ranking and expert finding, graph aggregation, and OLAP. These queries require both the topology and the content information of the network data, and hence differ from classical graph algorithms such as shortest path, reachability, and minimum cut, which depend only on the structure of the network. In this tutorial, we give an introduction to the emerging graph queries, their indexing and resolution techniques, the current challenges, and future research directions.
Citations: 19
Detecting Outliers in Sensor Networks Using the Geometric Approach
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.85
Sabbas Burdakis, Antonios Deligiannakis
The topic of outlier detection in sensor networks has received significant attention in recent years. Detecting when the measurements of a node become "abnormal" is interesting, because this event may help detect either a malfunctioning node or a node that starts observing an interesting local phenomenon (e.g., a fire). In this paper we present a new algorithm for detecting outliers in sensor networks, based on the geometric approach. Unlike prior work, our algorithms perform distributed monitoring of outlier readings, exhibit 100% accuracy in their monitoring (assuming no message losses), and require the transmission of messages in only a fraction of the epochs, thus allowing nodes to safely refrain from transmitting in many epochs. Our approach is based on transforming common similarity metrics in a way that admits the application of the recently proposed geometric approach. We then propose a general framework and suggest multiple modes of operation, which allow each sensor node to accurately monitor its similarity to other nodes. Our experiments demonstrate that our algorithms can accurately detect outliers at a fraction of the communication cost that a centralized approach would require (even in the case where the central node lies just one hop away from all sensor nodes). Moreover, we demonstrate that these bandwidth savings become even larger as we incorporate further optimizations in our proposed modes of operation.
Citations: 59
DRAGOON: An Information Accountability System for High-Performance Databases
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.139
Kyriacos E. Pavlou, R. Snodgrass
Regulations and societal expectations have recently emphasized the need to mediate access to valuable databases, even access by insiders. Fraud occurs when a person, often an insider, tries to hide illegal activity. Companies would like to be assured that such tampering has not occurred or, if it has, that it will be quickly discovered and used to identify the perpetrator. At one end of the compliance spectrum lies the approach of restricting access to information, and at the other that of information accountability. We focus on effecting information accountability of data stored in high-performance databases. The demonstrated work ensures appropriate use, and thus end-to-end accountability, of database information via a continuous assurance technology based on cryptographic hashing techniques. A prototype tamper detection and forensic analysis system named DRAGOON was designed and implemented to determine when tampering occurred and what data were tampered with. DRAGOON is scalable, customizable, and intuitive. This work will show that information accountability is a viable alternative to information restriction for ensuring the correct storage, use, and maintenance of databases on extant DBMSes.
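The abstract does not describe DRAGOON's hashing scheme in detail, so the following is only a generic illustration of how cryptographic hashing can reveal tampering. It is a minimal sketch, assuming rows can be serialized with `repr` and that a trusted digest is kept somewhere an insider cannot modify; the function name `chain_digest` and the sample rows are hypothetical and are not taken from DRAGOON.

```python
import hashlib


def chain_digest(rows, seed=b""):
    """Fold every row into a running SHA-256 digest (a simple hash chain).

    Each step hashes the previous digest together with the next row, so
    changing, removing, or reordering any row changes the final value.
    """
    digest = hashlib.sha256(seed).digest()
    for row in rows:
        digest = hashlib.sha256(digest + repr(row).encode("utf-8")).digest()
    return digest.hex()


# Hypothetical table contents at the time the trusted digest was recorded.
rows = [(1, "alice", 100), (2, "bob", 250)]
trusted = chain_digest(rows)  # assumed to be stored out of the insider's reach

# Later validation: recompute and compare against the trusted value.
tampered = [(1, "alice", 100), (2, "bob", 999)]
print(chain_digest(rows) == trusted)      # True
print(chain_digest(tampered) == trusted)  # False
```

A single digest like this only shows that something changed. Determining when the change happened and which data were affected, as the abstract describes, would need additional structure that this sketch does not attempt.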
Citations: 9
MXQuery with Hardware Acceleration
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.130
Peter M. Fischer, J. Teubner
We demonstrate MXQuery/H, a modified version of MXQuery that uses hardware acceleration to speed up XML processing. The main goal of this demonstration is to give an interactive example of hardware/software co-design and show how system performance and energy efficiency can be improved by off-loading tasks to FPGA hardware. To this end, we equipped MXQuery/H with various hooks to inspect the different parts of the system. In addition, our system can finally take full advantage of XML projection. Although the idea of projection has been around for a while, its effectiveness has always been limited by the unavoidable and high parsing overhead. By performing the task in hardware, we relieve the software part of this overhead and speed up processing severalfold.
Citations: 5
Data Infrastructure at LinkedIn
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.147
Aditya Auradkar, C. Botev, Shirshanka Das, Dave De Maagd, Alex Feinberg, Phanindra Ganti, L. Gao, B. Ghosh, K. Gopalakrishna, B. Harris, J. Koshy, Kevin Krawez, J. Kreps, Shih-Hui Lu, S. Nagaraj, N. Narkhede, S. Pachev, I. Perisic, Lin Qiao, Tom Quiggle, Jun Rao, Bob Schulman, Abraham Sebastian, Oliver Seeliger, Adam Silberstein, Boris Shkolnik, Chinmay Soman, Roshan Sumbaly, Kapil Surlaker, Sajid Topiwala, C. Tran, B. Varadarajan, Jemiah Westerman, Zach White, David Zhang, Jason Zhang
LinkedIn is among the largest social networking sites in the world. As the company has grown, our core data sets and request processing requirements have grown as well. In this paper, we describe a few selected data infrastructure projects at LinkedIn that have helped us accommodate this increasing scale. Most of those projects build on existing open source projects and are themselves available as open source. The projects covered in this paper include: (1) Voldemort: a scalable and fault-tolerant key-value store, (2) Databus: a framework for delivering database changes to downstream applications, (3) Espresso: a distributed data store that supports flexible schemas and secondary indexing, (4) Kafka: a scalable and efficient messaging system for collecting various user activity events and log data.
Citations: 64
GSLPI: A Cost-Based Query Progress Indicator
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.74
Jiexing Li, Rimma V. Nehme, J. Naughton
Progress indicators for SQL queries were first published in 2004 with the simultaneous and independent proposals from Chaudhuri et al. and Luo et al. In this paper, we implement both progress indicators in the same commercial RDBMS to investigate their performance. We summarize common cases in which they are both accurate and cases in which they fail to provide reliable estimates. Although there are differences in their performance, much more striking is the similarity in the errors they make due to a common simplifying assumption of uniform future speed. While the developers of these progress indicators were aware that this assumption could cause errors, they neither explored how large the errors might be nor investigated the feasibility of removing the assumption. To rectify this, we propose a new query progress indicator, similar to these early progress indicators but without the uniform speed assumption. Experiments show that on the TPC-H benchmark, on queries for which the original progress indicators have errors of up to 30X the query running time, the new progress indicator is accurate to within 10 percent. We also discuss the sources of the errors that still remain and shed some light on what would need to be done to eliminate them.
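To make the uniform future speed assumption concrete, here is a small sketch of how such an estimate can go wrong. It is an illustration under stated assumptions, not GSLPI or either of the 2004 indicators: work is measured in abstract "units", the per-segment speeds are assumed known, and the function names and numbers are hypothetical.

```python
def remaining_uniform(done_units, total_units, elapsed_s):
    """Remaining time if all future work runs at the average speed seen so far."""
    if done_units <= 0 or elapsed_s <= 0:
        raise ValueError("need some completed work and elapsed time")
    speed = done_units / elapsed_s  # units per second observed so far
    return (total_units - done_units) / speed


def remaining_per_segment(segments):
    """Remaining time when each remaining segment has its own speed.

    segments: list of (remaining_units, units_per_second) pairs.
    """
    return sum(units / speed for units, speed in segments)


# Hypothetical query: 1000 units of fast scanning, then 1000 units of slow work.
# After 10 s the scan is done (100 units/s); the slow phase runs at 10 units/s.
uniform_estimate = remaining_uniform(done_units=1000, total_units=2000, elapsed_s=10.0)
segment_estimate = remaining_per_segment([(1000, 10.0)])
print(uniform_estimate)  # 10.0
print(segment_estimate)  # 100.0
```

In this made-up case the uniform-speed estimate is off by a factor of ten, simply because the remaining work is slower than the work already done.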
Citations: 40
DPCube: Releasing Differentially Private Data Cubes for Health Information
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.135
Yonghui Xiao, James J. Gardner, Li Xiong
We demonstrate DPCube, a component in our Health Information De-identification (HIDE) framework, for releasing differentially private data cubes (or multi-dimensional histograms) for sensitive data. HIDE is a framework we developed for integrating heterogeneous structured and unstructured health information; it provides methods for privacy-preserving data publishing. The DPCube component uses differentially private access mechanisms and an innovative 2-phase multidimensional partitioning strategy to publish a multi-dimensional data cube or histogram that achieves good utility while satisfying differential privacy. We demonstrate that the released data cubes can serve as a sanitized synopsis of the raw database and, together with an optional synthesized dataset based on the data cubes, can support various Online Analytical Processing (OLAP) queries and learning tasks.
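As background for what a differentially private histogram looks like, the sketch below applies the standard Laplace mechanism to per-cell counts. It does not reproduce DPCube's 2-phase partitioning strategy, which the abstract does not spell out. It assumes that neighboring databases differ by adding or removing one record (so each count has sensitivity 1) and that the cell domain is fixed in advance; the function name, cell labels, and records are hypothetical.

```python
import random
from collections import Counter


def noisy_histogram(records, cells, epsilon, rng=None):
    """Return {cell: count + Laplace(1/epsilon) noise} for every cell in a fixed domain."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    counts = Counter(records)
    scale = 1.0 / epsilon  # sensitivity 1 divided by the privacy budget
    noisy = {}
    for cell in cells:
        # The difference of two i.i.d. exponentials with mean `scale`
        # is Laplace-distributed with scale `scale`.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[cell] = counts[cell] + noise
    return noisy


# Hypothetical two-dimensional cube: age group x diagnosis.
cells = [(a, d) for a in ("<40", "40+") for d in ("flu", "asthma")]
records = [("<40", "flu"), ("<40", "flu"), ("40+", "asthma")]
hist = noisy_histogram(records, cells, epsilon=0.5, rng=random.Random(0))
for cell in cells:
    print(cell, round(hist[cell], 2))
```

Noise is added to every cell of the fixed domain, including empty ones, so that the set of published cells does not itself reveal which combinations occur in the data. Smaller values of `epsilon` give stronger privacy and noisier counts.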
Citations: 50
Journal
2012 IEEE 28th International Conference on Data Engineering