
2011 IEEE 27th International Conference on Data Engineering: Latest Publications

Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767864
Laure Berti-Équille, T. Dasu, D. Srivastava
Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.
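As a rough illustration of the co-occurrence idea above, the sketch below (Python, with hypothetical detectors and thresholds, not the DEC framework's actual ones) flags two simple glitch types per record and counts how often each combination occurs; such pattern counts are the kind of signal a data-driven cleaning strategy could prioritize over blind, one-glitch-at-a-time cleaning.

```python
from collections import Counter

# Hypothetical glitch detectors; the DEC framework's actual detectors and
# scoring methodology are not specified here, so these are illustrative stand-ins.
def detect_glitches(record, numeric_field, mean, std):
    """Return the set of glitch types found in one record."""
    glitches = set()
    if any(v is None or v == "" for v in record.values()):
        glitches.add("missing")
    v = record.get(numeric_field)
    if v is not None and abs(v - mean) > 3 * std:      # crude z-score outlier test
        glitches.add("outlier")
    return glitches

def glitch_pattern_counts(records, numeric_field, mean, std):
    """Count how often each combination of glitch types co-occurs in a record."""
    counts = Counter()
    for rec in records:
        pattern = frozenset(detect_glitches(rec, numeric_field, mean, std))
        if pattern:
            counts[pattern] += 1
    return counts

if __name__ == "__main__":
    data = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None},
            {"id": 3, "amount": 900.0}, {"id": 4, "amount": 11.0}]
    print(glitch_pattern_counts(data, "amount", mean=12.0, std=2.0))
```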
Citations: 59
Similarity measures for multidimensional data
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767869
Eftychia Baikousi, Georgios Rogkakos, Panos Vassiliadis
How similar are two data-cubes? In other words, the question under consideration is: given two sets of points in a multidimensional hierarchical space, what is the distance value between them? In this paper we explore various distance functions that can be used over multidimensional hierarchical spaces. We organize the discussed functions with respect to the properties of the dimension hierarchies, levels and values. In order to discover which distance functions are more suitable and meaningful to users, we conducted two user studies. The first user study concerns the most preferred distance function between two values of a dimension. Its findings indicate that the functions that best fit user needs tend to treat as closest to a point in a multidimensional space those points with the smallest shortest-path distance with respect to the same dimension hierarchy. The second user study aimed at discovering which distance function between two data cubes is most preferred by users. The two functions that drew the attention of users were (a) the summation of distances between every cell of one cube and the most similar cell of the other cube, and (b) the Hausdorff distance function. Overall, users preferred the former function to the latter; however, the individual test scores indicate that this advantage is rather narrow.
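The two cube-level functions compared in the second user study can be written down directly. The sketch below is a minimal Python illustration with a placeholder Euclidean cell distance; the paper's distances are defined over dimension hierarchies (for example, shortest-path length between hierarchy values), so the `euclid` function merely stands in for whichever cell distance is chosen.

```python
def hausdorff(cube_a, cube_b, dist):
    """Symmetric Hausdorff distance between two sets of cells."""
    def directed(xs, ys):
        return max(min(dist(x, y) for y in ys) for x in xs)
    return max(directed(cube_a, cube_b), directed(cube_b, cube_a))

def sum_of_closest(cube_a, cube_b, dist):
    """Sum, over cells of cube_a, of the distance to the most similar cell of cube_b."""
    return sum(min(dist(x, y) for y in cube_b) for x in cube_a)

# Placeholder cell distance: Euclidean over numeric coordinates. The paper's
# functions operate on dimension hierarchies rather than raw coordinates.
euclid = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

a = [(1, 2), (3, 4)]
b = [(1, 2), (5, 6)]
print(hausdorff(a, b, euclid), sum_of_closest(a, b, euclid))
```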
Citations: 31
CubeLSI: An effective and efficient method for searching resources in social tagging systems
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767863
Bin Bi, Sau-dan. Lee, B. Kao, Reynold Cheng
In a social tagging system, resources (such as photos, video and web pages) are associated with tags. These tags allow the resources to be effectively searched through tag-based keyword matching using traditional IR techniques. We note that in many such systems, tags of a resource are often assigned by a diverse audience of casual users (taggers). This leads to two issues that gravely affect the effectiveness of resource retrieval: (1) Noise: tags are picked from an uncontrolled vocabulary and are assigned by untrained taggers. The tags are thus noisy features in resource retrieval. (2) A multitude of aspects: different taggers focus on different aspects of a resource. Representing a resource using a flattened bag of tags ignores this important diversity of taggers. To improve the effectiveness of resource retrieval in social tagging systems, we propose CubeLSI — a technique that extends traditional LSI to include taggers as another dimension of feature space of resources. We compare CubeLSI against a number of other tag-based retrieval models and show that CubeLSI significantly outperforms the other models in terms of retrieval accuracy. We also prove two interesting theorems that allow CubeLSI to be very efficiently computed despite the much enlarged feature space it employs.
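To make the "taggers as a third dimension" idea concrete, the sketch below builds a toy resource x tag x tagger count cube, unfolds it into a matrix, and applies the truncated SVD at the heart of LSI. This is a generic tensor-unfolding illustration under made-up data and dimensions; it is not necessarily the exact decomposition, nor the efficiency results, used by CubeLSI.

```python
import numpy as np

# Toy (resource x tag x tagger) count cube of tag assignments. A generic
# LSI-style sketch over the unfolded cube, not CubeLSI's exact decomposition.
cube = np.random.poisson(0.3, size=(50, 20, 5)).astype(float)

n_res, n_tag, n_tagger = cube.shape
unfolded = cube.reshape(n_res, n_tag * n_tagger)   # matricise along the resource mode

# Truncated SVD (the core of LSI): keep k latent concepts.
k = 8
U, s, Vt = np.linalg.svd(unfolded, full_matrices=False)
resource_embeddings = U[:, :k] * s[:k]             # latent representation of each resource

# Rank resources against a query expressed in the same latent space.
query = unfolded[0]                                 # pretend the first resource is the query
q_emb = query @ Vt[:k].T
scores = resource_embeddings @ q_emb
print(np.argsort(-scores)[:5])                      # top-5 most similar resources
```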
Citations: 13
Hyracks: A flexible and extensible foundation for data-intensive computing
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767921
V. Borkar, M. Carey, Raman Grover, Nicola Onose, R. Vernica
Hyracks is a new partitioned-parallel software platform designed to run data-intensive computations on large shared-nothing clusters of computers. Hyracks allows users to express a computation as a DAG of data operators and connectors. Operators operate on partitions of input data and produce partitions of output data, while connectors repartition operators' outputs to make the newly produced partitions available at the consuming operators. We describe the Hyracks end user model, for authors of dataflow jobs, and the extension model for users who wish to augment Hyracks' built-in library with new operator and/or connector types. We also describe our initial Hyracks implementation. Since Hyracks is in roughly the same space as the open source Hadoop platform, we compare Hyracks with Hadoop experimentally for several different kinds of use cases. The initial results demonstrate that Hyracks has significant promise as a next-generation platform for data-intensive applications.
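The operator/connector job model can be pictured with a few lines of Python. Hyracks itself exposes a Java API, so every class and field name below is an illustrative stand-in for the DAG structure described above, not the real interface.

```python
from dataclasses import dataclass, field

# Minimal sketch of a partitioned-parallel job: nodes are operators over data
# partitions, edges are connectors that decide how output partitions are
# redistributed to the consuming operators. Names are illustrative only.

@dataclass
class Operator:
    name: str
    partitions: int            # degree of data parallelism for this operator

@dataclass
class Connector:
    src: Operator
    dst: Operator
    strategy: str              # e.g. "one-to-one" or "hash-partition(key)"

@dataclass
class JobSpec:
    operators: list = field(default_factory=list)
    connectors: list = field(default_factory=list)

    def connect(self, src, dst, strategy):
        self.connectors.append(Connector(src, dst, strategy))

job = JobSpec()
scan = Operator("scan-customers", partitions=4)
join = Operator("hash-join", partitions=8)
write = Operator("write-results", partitions=8)
job.operators += [scan, join, write]
job.connect(scan, join, "hash-partition(custkey)")
job.connect(join, write, "one-to-one")
print([(c.src.name, c.strategy, c.dst.name) for c in job.connectors])
```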
Citations: 290
Interactive itinerary planning
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767920
Senjuti Basu Roy, Gautam Das, S. Amer-Yahia, Cong Yu
Planning an itinerary when traveling to a city involves substantial effort in choosing Points-of-Interest (POIs), deciding in which order to visit them, and accounting for the time it takes to visit each POI and transit between them. Several online services address different aspects of itinerary planning but none of them provides an interactive interface where users give feedback and iteratively construct their itineraries based on personal interests and time budget. In this paper, we formalize interactive itinerary planning as an iterative process where, at each step: (1) the user provides feedback on POIs selected by the system, (2) the system recommends the best itineraries based on all feedback so far, and (3) the system further selects a new set of POIs, with optimal utility, to solicit feedback for, at the next step. This iterative process stops when the user is satisfied with the recommended itinerary. We show that computing an itinerary is NP-complete even for simple itinerary scoring functions, and that POI selection is NP-complete. We develop heuristics and optimizations for a specific case where the score of an itinerary is proportional to the number of desired POIs it contains. Our extensive experiments show that our algorithms are efficient and return high quality itineraries.
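A minimal sketch of the time-budgeted selection problem, in Python: the score is simply the number of desired POIs included, and the loop greedily adds the cheapest desired POI (visit plus transit time) that still fits the budget. This illustrates the optimization problem rather than the paper's actual heuristics; the POIs, times, and the constant `flat_transit` function are made up.

```python
def greedy_itinerary(pois, desired, budget, transit):
    """Greedily build an itinerary under a time budget.

    pois:    {name: visit_time}
    desired: set of POI names the user cares about (score = how many are included)
    transit: function(a, b) -> travel time between POIs (a may be None at the start)
    """
    itinerary, current, remaining = [], None, budget
    candidates = set(desired)
    while candidates:
        # Pick the desired POI with the smallest marginal time cost that still fits.
        best, best_cost = None, None
        for p in candidates:
            cost = transit(current, p) + pois[p]
            if cost <= remaining and (best_cost is None or cost < best_cost):
                best, best_cost = p, cost
        if best is None:
            break
        itinerary.append(best)
        remaining -= best_cost
        current = best
        candidates.remove(best)
    return itinerary, budget - remaining

pois = {"museum": 90, "park": 45, "tower": 60, "market": 30}
desired = {"museum", "tower", "market"}
flat_transit = lambda a, b: 0 if a is None else 20   # constant transit time for the toy example
print(greedy_itinerary(pois, desired, budget=180, transit=flat_transit))
```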
Citations: 76
Extensibility and Data Sharing in evolving multi-tenant databases
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767872
Stefan Aulbach, Michael Seibold, D. Jacobs, A. Kemper
Software-as-a-Service applications commonly consolidate multiple businesses into the same database to reduce costs. This practice makes it harder to implement several essential features of enterprise applications. The first is support for master data, which should be shared rather than replicated for each tenant. The second is application modification and extension, which applies both to the database schema and master data it contains. The third is evolution of the schema and master data, which occurs as the application and its extensions are upgraded. These features cannot be easily implemented in a traditional DBMS and, to the extent that they are currently offered at all, they are generally implemented within the application layer. This approach reduces the DBMS to a ‘dumb data repository’ that only stores data rather than managing it. In addition, it complicates development of the application since many DBMS features have to be re-implemented. Instead, a next-generation multi-tenant DBMS should provide explicit support for Extensibility, Data Sharing and Evolution. As these three features are strongly related, they cannot be implemented independently from each other. Therefore, we propose FLEXSCHEME which captures all three aspects in one integrated model. In this paper, we focus on efficient storage mechanisms for this model and present a novel versioning mechanism, called XOR Delta, which is based on XOR encoding and is optimized for main-memory DBMSs.
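The XOR-encoding idea behind XOR Delta can be shown at byte level: because XOR is its own inverse, one function both derives a delta from a base version and reconstructs a version from the base plus the delta, which is what makes sharing a base copy across tenants cheap. The snippet below only illustrates that property; FLEXSCHEME's actual mechanism operates on tuples inside a main-memory DBMS and handles lengths and schema versions explicitly.

```python
def xor_delta(base: bytes, version: bytes) -> bytes:
    """XOR-encode one byte string against another; since XOR is its own inverse,
    the same function creates a delta and reconstructs a version from it."""
    n = max(len(base), len(version))
    a = base.ljust(n, b"\x00")
    b = version.ljust(n, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))

base = b"shared master record v1"
tenant_version = b"shared master record v2 (extended)"

delta = xor_delta(base, tenant_version)     # store only the delta per tenant/version
restored = xor_delta(base, delta)           # applying the delta to the base restores the version
assert restored.rstrip(b"\x00") == tenant_version
print(len(base), len(tenant_version), len(delta))
```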
Citations: 39
Web-scale information extraction with Vertex
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767842
P. Gulhane, Amit Madaan, Rupesh R. Mehta, J. Ramamirtham, R. Rastogi, Sandeepkumar Satpal, Srinivasan H. Sengamedu, Ashwin Tengli, Charu Tiwari
Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
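An XPath-based extraction rule of the kind Vertex learns can be applied with an off-the-shelf HTML parser. In the sketch below (Python with lxml), both the toy template page and the hand-written XPath expressions are purely illustrative; Vertex infers such rules automatically from sample pages and hardens them against variations in site structure.

```python
from lxml import html

# A hand-written XPath rule applied to a toy template page. The page and the
# expressions are made up for illustration; they are not rules learned by Vertex.
page = """
<html><body>
  <div class="product">
    <h1 class="title">Acme Noise-Cancelling Headphones</h1>
    <span class="price">$129.99</span>
  </div>
</body></html>
"""

tree = html.fromstring(page)
record = {
    "title": tree.xpath('string(//div[@class="product"]/h1[@class="title"])'),
    "price": tree.xpath('string(//div[@class="product"]/span[@class="price"])'),
}
print(record)
```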
Citations: 89
DiRec: Diversified recommendations for semantic-less Collaborative Filtering
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767942
Rubi Boim, T. Milo, Slava Novgorodov
In this demo we present DiRec, a plug-in that allows Collaborative Filtering (CF) Recommender systems to diversify the recommendations that they present to users. DiRec estimates items diversity by comparing the rankings that different users gave to the items, thereby enabling diversification even in common scenarios where no semantic information on the items is available. Items are clustered based on a novel notion of priority-medoids that provides a natural balance between the need to present highly ranked items vs. highly diverse ones. We demonstrate the operation of DiRec in the context of a movie recommendation system. We show the advantage of recommendation diversification and its feasibility even in the absence of semantic information.
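The rank-versus-diversity balance can be illustrated with a small greedy re-ranking: item distance is computed only from how differently users rated item pairs (no semantic features), and each pick trades predicted score against distance to items already chosen. This MMR-style sketch mimics the trade-off but is not the paper's priority-medoid clustering; the ratings matrix and the `lam` weight are made up.

```python
import numpy as np

# Ratings matrix: rows = users, columns = items (0 = unrated). Item distance is
# derived from how differently users rated the items, so no semantic item
# information is required.
ratings = np.array([
    [5, 4, 0, 1, 5],
    [4, 5, 1, 0, 4],
    [1, 0, 5, 4, 5],
    [0, 1, 4, 5, 4],
], dtype=float)

def item_distance(i, j):
    both = (ratings[:, i] > 0) & (ratings[:, j] > 0)
    if not both.any():
        return 1.0
    diff = np.abs(ratings[both, i] - ratings[both, j]).mean()
    return diff / 4.0                        # normalise to [0, 1] for a 1-5 rating scale

predicted = ratings.mean(axis=0)             # stand-in for CF-predicted scores
chosen, lam = [], 0.5
while len(chosen) < 3:
    def gain(i):
        div = min((item_distance(i, j) for j in chosen), default=1.0)
        return (1 - lam) * predicted[i] + lam * div
    best = max((i for i in range(ratings.shape[1]) if i not in chosen), key=gain)
    chosen.append(best)
print(chosen)                                # highly rated yet mutually dissimilar items
```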
Citations: 11
Massively parallel XML twig filtering using dynamic programming on FPGAs
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767899
R. Moussalli, Mariam Salloum, W. Najjar, V. Tsotras
In recent years, XML-based Publish-Subscribe Systems have become popular due to the increased demand for timely event notification. Users (or subscribers) pose complex profiles on the structure and content of the published messages. If a profile matches the message, the message is forwarded to the interested subscriber. As the amount of published content continues to grow, current software-based systems will not scale. We thus propose a novel architecture to exploit parallelism of twig matching on FPGAs. This approach yields up to three orders of magnitude higher throughput when compared to conventional approaches bound by the sequential aspect of software computing. This paper presents a novel method for performing unordered holistic twig matching on FPGAs without any false positives, and whose throughput is independent of the complexity of the user queries or the characteristics of the input XML stream. Furthermore, we present an experimental comparison of different granularities of twig matching, namely path-based (root-to-leaf) and pair-based (parent-child or ancestor-descendant). We provide comprehensive experiments that compare the throughput, area utilization and the accuracy of matching (percent of false positives) of our holistic, path-based and pair-based FPGA approaches.
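A software analogue of the streaming path matching helps fix ideas: keep the current root-to-leaf tag path on a stack while parsing, and test it against a query whose steps use parent-child ('/') or ancestor-descendant ('//') axes. The FPGA design evaluates many profiles in parallel in hardware; the Python sketch below shows only the matching logic for a single path query over a made-up document.

```python
import io
import xml.etree.ElementTree as ET

def path_matches(path, steps):
    """steps: list of (axis, tag); axis '/' = child, '//' = descendant."""
    def rec(p, s):
        if s == len(steps):
            return p == len(path)              # query fully consumed exactly at this leaf
        axis, tag = steps[s]
        if axis == '/':
            return p < len(path) and path[p] == tag and rec(p + 1, s + 1)
        # '//': the tag may appear at this depth or any deeper one
        return any(path[i] == tag and rec(i + 1, s + 1)
                   for i in range(p, len(path)))
    return rec(0, 0)

query = [('/', 'catalog'), ('//', 'book'), ('/', 'title')]
doc = "<catalog><shelf><book><title>DBs</title></book></shelf><cd><title>x</title></cd></catalog>"

stack, matched = [], []
for event, elem in ET.iterparse(io.BytesIO(doc.encode()), events=("start", "end")):
    if event == "start":
        stack.append(elem.tag)
        if path_matches(stack, query):
            matched.append("/".join(stack))
    else:
        stack.pop()
print(matched)                                 # ['catalog/shelf/book/title']
```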
Citations: 33
Robust query processing
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767961
G. Graefe
In the context of data management, robustness is usually associated with resilience against failure, recovery, redundancy, disaster preparedness, etc. Robust query processing, on the other hand, is about robustness of performance and of scalability. It is more than progress reporting or predictability. A system that fails predictably or obviously performs poorly may be better than an unpredictable one, but it is not robust.
Citations: 22