
Proceedings of the 5th International Workshop on Exploratory Search in Databases and the Web. International Workshop on Exploratory Search in Databases and the Web (5th : 2018 : Houston, Tex.): Latest Articles

Discovery and Creation of Rich Entities for Knowledge Bases
A. Quamar, Fatma Özcan, Konstantinos Xirogiannopoulos
Businesses and professional organizations from a variety of domains, such as finance, weather, healthcare, and social networks, produce massive amounts of unstructured, semi-structured, and structured data. Knowledge bases enable querying and analysis of integrated content derived from such data, which is available as open, third-party, and proprietary data sets. Many knowledge bases today provide an entity-centric view over the integrated content by using domain-specific ontologies. These entity-centric views enable querying individual real-world entities, as well as exploring exact information (such as the address or net revenue of a company) through explicit querying using languages such as SQL or SPARQL. Although very useful for many business and commercial applications, this may not be sufficient for the exploration of relevant and context-specific information associated with real-world entities stored in these knowledge bases. Users often need to resort to a manual and tedious process of exploration using ad-hoc queries to gather the required information. To enhance user experience and ameliorate the problem of relevant data exploration, we propose the concept of Rich Entities. These rich entities comprise all the relevant and context-specific information grouped together around real-world entities and served as efficient and meaningful responses to user queries against these entities in a knowledge base. These rich entities are created by grouping together information not only from a single entity represented as an ontology concept, but also from related concepts and properties as specified by the domain ontology. In this paper, we propose several novel techniques and algorithms to automatically detect, learn, and create domain-specific rich entities. We use inputs from query patterns in existing query workloads against knowledge bases, and leverage the structure and relationships between entities defined in the domain ontology. Our techniques are very effective and can be applied to a wide variety of application domains, thus adding great value to data exploration and information extraction from entity-centric real-world knowledge bases.
DOI: https://doi.org/10.1145/3214708.3214712; published 2018-06-15
Citations: 3
Exploring Genomic Datasets: from Batch to Interactive and Back
Luca Nanni, Pietro Pinoli, Arif Canakoglu, S. Ceri
Genomic data management is focused on achieving high performance over big datasets using batch, cloud-based architectures; this enables the execution of massive pipelines, but hampers the capability of exploring the solution space when it is not well-defined, by choosing different experimental samples or query extraction parameters. We present PyGMQL, a Python-based interoperability software layer that enables testing of experimental pipelines; PyGMQL solves the impedance mismatch between a batch execution environment and the agile programming style of Python, and provides transparency of access when exploration requires integrating local and remote resources. Wrapping PyGMQL and Python primitives within Jupyter notebooks guarantees reproducibility of the pipeline when used in different contexts or by different scientists. The software is freely available at https://github.com/DEIB-GECO/PyGMQL.
DOI: https://doi.org/10.1145/3214708.3214710; published 2018-06-15
Citations: 6
Exploring Pros and Cons of Ranked Entities with COMPETE
Kiril Panev, S. Michel
We present COMPETE, a novel approach to explore data using rankings. Utilizing rankings, which succinctly summarize the relative performance of entities, is a very intuitive way to inspect underlying data. In this work we present an approach where users can understand the dominance of entities and interactively explore data. The approach harnesses diverse precomputed rankings, capturing many different aspects of possible user interest. For a given set of input entities, COMPETE identifies entities that dominate or are dominated by the input, thus expanding on the relative performance of the input and focusing on other related entities that are always better (or worse) than the input. We consider several aspects of the dominance relationship and provide different approaches that reflect these nuances. We report on the results of an experimental evaluation over data obtained from the Internet Movie Database (IMDb).
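As a rough illustration of the dominance idea in this abstract (not the COMPETE algorithm itself, which the abstract does not spell out), the sketch below treats each precomputed ranking as a best-first list of entity names and says that one entity dominates another when it is placed higher in every ranking that contains both. The function names, the shared-ranking rule, and the toy data are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Set


def positions(ranking: Sequence[str]) -> Dict[str, int]:
    """Map each entity to its 0-based position (0 = best)."""
    return {entity: i for i, entity in enumerate(ranking)}


def dominates(a: str, b: str, rankings: List[Sequence[str]]) -> bool:
    """True if `a` is ranked above `b` in every ranking containing both.

    Assumption: at least one shared ranking is required; otherwise the
    two entities are treated as incomparable.
    """
    shared = 0
    for ranking in rankings:
        pos = positions(ranking)
        if a in pos and b in pos:
            shared += 1
            if pos[a] >= pos[b]:
                return False
    return shared > 0


def dominated_by_input(inputs: Set[str], rankings: List[Sequence[str]]) -> Set[str]:
    """Entities that are dominated by every entity in `inputs`."""
    candidates = {e for r in rankings for e in r} - inputs
    return {c for c in candidates
            if all(dominates(i, c, rankings) for i in inputs)}


# Toy rankings (hypothetical data), best entity first.
rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["A", "C", "D"],
]
print(dominates("A", "D", rankings))
print(sorted(dominated_by_input({"A"}, rankings)))
```

Requiring agreement across every shared ranking mirrors the abstract's "always better (or worse)" wording; a softer variant could instead count the fraction of rankings in which one entity beats the other.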
DOI: https://doi.org/10.1145/3214708.3214709; published 2018-06-15
Citations: 2
Recommendations for Explorations based on Graphs
Marialena Kyriakidi, G. Koutrika, Y. Ioannidis
Recommendations are an integral part of data exploration. Existing approaches, however, consider a limited model of recommendations. In this vision paper, we lay the groundwork for a graph-based approach to recommendations that allows significant flexibility in capturing both data and recommendations and in processing them efficiently. We determine the requirements of a desired solution and illustrate the overall idea with an example based on the Yelp dataset.
DOI: https://doi.org/10.1145/3214708.3214713; published 2018-06-15
Citations: 0
Strategies for Detection of Correlated Data Streams
Rakan Alseghayer, Daniel Petrov, Panos K. Chrysanthis
There is an increasing demand for real-time analysis of large volumes of data streams that are produced at high velocity. The most recent data needs to be processed within a specified delay target in order for the analysis to lead to actionable results. In this paper we present an effective solution for the analysis of such data streams that is based upon a three-fold approach that combines (1) incremental sliding-window computation of aggregates, to avoid unnecessary recomputations, (2) intelligent scheduling of computation steps and operations, driven by a utility function within a micro-batch, and (3) an exploration strategy that tunes the utility function. Specifically, we propose eight strategies that explore correlated pairs of live data streams across consecutive micro-batches. Our experimental evaluation on a real dataset shows that some strategies are better suited to identifying high numbers of correlated pairs of live data streams that are already known from previous micro-batches, while others are better suited to identifying previously unseen pairs of live data streams across consecutive micro-batches.
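Point (1) of the abstract, incremental sliding-window computation of aggregates, can be illustrated with a generic sketch: keeping five running sums lets the Pearson correlation of two streams be refreshed in constant time per new sample instead of rescanning the window. This is a textbook technique shown under assumed names (`SlidingCorrelation`, `push`, `correlation`); it is not claimed to be the authors' implementation, and it does not cover the scheduling or exploration strategies.

```python
from collections import deque
from math import sqrt
from typing import Deque, Optional, Tuple


class SlidingCorrelation:
    """Pearson correlation over the last `window` (x, y) pairs."""

    def __init__(self, window: int) -> None:
        if window < 2:
            raise ValueError("window must be at least 2")
        self.window = window
        self.pairs: Deque[Tuple[float, float]] = deque()
        # Running sums that allow O(1) updates when a pair enters or leaves.
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def push(self, x: float, y: float) -> None:
        """Add the newest pair and evict the oldest one if the window is full."""
        self.pairs.append((x, y))
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y
        if len(self.pairs) > self.window:
            ox, oy = self.pairs.popleft()
            self.sx -= ox
            self.sy -= oy
            self.sxx -= ox * ox
            self.syy -= oy * oy
            self.sxy -= ox * oy

    def correlation(self) -> Optional[float]:
        """Return Pearson's r, or None if it is undefined for the window."""
        n = len(self.pairs)
        if n < 2:
            return None
        cov = self.sxy - self.sx * self.sy / n
        var_x = self.sxx - self.sx * self.sx / n
        var_y = self.syy - self.sy * self.sy / n
        if var_x <= 0 or var_y <= 0:
            return None  # a constant stream has no defined correlation
        return cov / sqrt(var_x * var_y)


sc = SlidingCorrelation(window=3)
for x, y in [(1, 2), (2, 4), (3, 6)]:
    sc.push(x, y)
print(sc.correlation())
```

Running sums can accumulate floating-point error over very long streams, so a production version would likely recompute them from the window periodically or use a numerically stabler update.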
DOI: https://doi.org/10.1145/3214708.3214714; published 2018-06-15
Citations: 3
Any-k Algorithms for Exploratory Analysis with Conjunctive Queries
Xiaofeng Yang, Mirek Riedewald, Rundong Li, Wolfgang Gatterbauer

We recently proposed the notion of any-k queries, together with the KARPET algorithm, for tree-pattern search in labeled graphs. Any-k extends top-k by not requiring a pre-specified value of k. Instead, an any-k algorithm returns as many of the top-ranked results as possible for a given time budget. Given additional time, it produces the next-highest ranked results quickly as well. It can be stopped at any time, but may have to continue until all results are returned. In the latter case, any-k takes time similar to that of an algorithm that first produces all results and then sorts them. We summarize KARPET and argue that it can be extended to support any-k exploratory search for arbitrary conjunctive queries.
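The any-k behaviour described in this abstract (ranked results delivered incrementally until a time budget runs out) can be sketched with a lazy generator. The example below is not KARPET: it uses a much simpler stand-in problem, enumerating pairs from two ascending lists in order of their summed weight with a heap, and assumes that a lower weight means a better rank. The names `ranked_pairs` and `any_k` are hypothetical.

```python
import heapq
import time
from typing import Iterator, List, Sequence, Tuple


def ranked_pairs(a: Sequence[float],
                 b: Sequence[float]) -> Iterator[Tuple[float, int, int]]:
    """Yield (a[i] + b[j], i, j) in non-decreasing order of the sum.

    Both inputs must already be sorted in ascending order.
    """
    if not a or not b:
        return
    heap = [(a[0] + b[0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        total, i, j = heapq.heappop(heap)
        yield total, i, j
        # Only the two successors of the popped pair can be the next best.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (a[ni] + b[nj], ni, nj))


def any_k(results: Iterator, budget_seconds: float) -> List:
    """Collect ranked results until the budget expires or results run out.

    At least the first result is returned if one exists.
    """
    deadline = time.monotonic() + budget_seconds
    out = []
    for item in results:
        out.append(item)
        if time.monotonic() >= deadline:
            break
    return out


best = any_k(ranked_pairs([1, 4, 7], [2, 3, 10]), budget_seconds=0.5)
print([total for total, _, _ in best])
```

Because the generator keeps its heap between calls, a caller that is granted more time can keep iterating over the same generator and receive the next-best results without restarting, which is the property that distinguishes any-k from a fixed top-k query.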

DOI: https://doi.org/10.1145/3214708.3214711; published 2018-06-01
Citations: 7
Proceedings of the 5th International Workshop on Exploratory Search in Databases and the Web
Senjuti Basu Roy, K. Stefanidis, G. Koutrika, Mirek Riedewald, L. Lakshmanan
The purpose of the ExploreDB workshop is to bring together researchers and practitioners who approach data exploration from different angles, ranging from data management and information retrieval to data visualization and human-computer interaction, in order to study the emerging needs and objectives for data exploration, as well as the challenges and problems that need to be tackled.
DOI: https://doi.org/10.1145/3214708; published 2016-06-26
Citations: 0
Supporting Range Queries on Web Data Using k-Nearest Neighbor Search
Wan D. Bae, Shayma Alkobaisi, S. H. Kim, Sada Narayanappa, C. Shahabi
DOI: https://doi.org/10.1007/978-3-540-76925-5_5; published 2007-11-28
Citations: 11
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
Dennis Fetterly, M. Manasse, Marc Najork
The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index. We propose that some spam web pages can be identified through statistical analysis: certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam. This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
DOI: https://doi.org/10.1145/1017074.1017077; published 2004-06-17
Citations: 350
Twig query processing over graph-structured XML data
Zografoula Vagena, Mirella M. Moro, V. Tsotras
XML and semi-structured data are usually modeled using graph structures. Structural summaries, which have been proposed to speed up XML query processing, have graph forms as well. The existing approaches for evaluating queries over tree-structured data (i.e., data whose underlying structure is a tree) are not directly applicable when the data is modeled as a random graph. Moreover, they cannot be applied when structural summaries are employed and, to the best of our knowledge, no analogous techniques have been reported for this case either. As a result, the potential of structural summaries is not fully exploited. In this paper, we investigate query evaluation techniques applicable to graph-structured data. We propose efficient algorithms for the case of directed acyclic graphs, which appear in many real-world situations. We then tailor our approaches to handle other directed graphs as well. Our experimental evaluation reveals the advantages of our solutions over existing methods for graph-structured data.
DOI: https://doi.org/10.1145/1017074.1017087; published 2004-06-17
Citations: 36