
Proceedings of the 18th International Workshop on Web and Databases: Latest Publications

Truth Finding with Attribute Partitioning
Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767118
M. Ba, Roxana Horincar, P. Senellart, Huayu Wu
Truth finding is the problem of determining which of the statements made by contradictory sources is correct, in the absence of prior information on the trustworthiness of the sources. A number of approaches to truth finding have been proposed, from simple majority voting to elaborate iterative algorithms that estimate the quality of sources by corroborating their statements. In this paper, we consider the case where there is an inherent structure in the statements made by sources about real-world objects, which implies that a given source may have different quality levels on different groups of attributes of an object. We do not assume this structure is given, but instead find it automatically, by exploring and weighting the partitions of the sets of attributes of an object, and applying a reference truth finding algorithm on each subset of the optimal partition. Our experimental results on synthetic and real-world datasets show that we obtain better precision at truth finding than baselines in cases where data has an inherent structure.
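The spectrum of approaches the abstract describes starts from majority voting; as a point of reference, here is a minimal, self-contained sketch of that baseline (the claim format and data are invented for illustration, and the paper's partition-weighting step is not shown):

```python
from collections import Counter, defaultdict

def majority_vote(claims):
    """Pick, for each (object, attribute), the value asserted by most sources.

    claims: iterable of (source, obj, attribute, value) tuples.
    Returns {(obj, attribute): winning value}.
    """
    votes = defaultdict(Counter)
    for source, obj, attr, value in claims:
        votes[(obj, attr)][value] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}

claims = [
    ("s1", "Paris", "country", "France"),
    ("s2", "Paris", "country", "France"),
    ("s3", "Paris", "country", "USA"),
    ("s1", "Paris", "population", "2.1M"),
    ("s2", "Paris", "population", "2.1M"),
    ("s3", "Paris", "population", "25k"),
]
truth = majority_vote(claims)
```

The paper's contribution is to run a reference truth finder separately on each attribute group of the best-scoring partition, so a source can be trusted on one group of attributes (e.g. "country") but not on another.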
Citations: 8
Discovering Subsumption Relationships for Web-Based Ontologies
Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767111
Dana Movshovitz-Attias, Steven Euijong Whang, Natasha Noy, A. Halevy
As search engines are becoming smarter at interpreting user queries and providing meaningful responses, they rely on ontologies to understand the meaning of entities. Creating ontologies manually is a laborious process, and resulting ontologies may not reflect the way users think about the world, as many concepts used in queries are noisy, and not easily amenable to formal modeling. There has been considerable effort in generating ontologies from Web text and query streams, which may be more reflective of how users query and write content. In this paper, we describe the LATTE system that automatically generates a subconcept--superconcept hierarchy, which is critical for using ontologies to answer queries. LATTE combines signals based on word-vector representations of concepts and dependency parse trees; however, LATTE derives most of its power from an ontology of attributes extracted from the Web that indicates the aspects of concepts that users find important. LATTE achieves an F1 score of 74%, which is comparable to expert agreement on a similar task. We additionally demonstrate the usefulness of LATTE in detecting high quality concepts from an existing resource of IsA links.
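LATTE's actual signals (word vectors, dependency parse trees, and a Web-derived attribute ontology) cannot be reproduced in a few lines, but the intuition behind the attribute signal can: a subconcept tends to inherit the attributes of its superconcept. A toy sketch, with the threshold and attribute sets invented for the example:

```python
def is_subconcept(attrs_sub, attrs_super, threshold=0.8):
    """Toy attribute-inheritance test: treat A as a subconcept of B when
    most attributes that apply to B also apply to A."""
    if not attrs_super:
        return False
    overlap = len(attrs_sub & attrs_super) / len(attrs_super)
    return overlap >= threshold

# Hypothetical attribute sets, as one might extract them from Web text.
animal = {"habitat", "diet", "lifespan"}
tiger = {"habitat", "diet", "lifespan", "stripe pattern"}
```

The asymmetry is what makes this usable as a subsumption signal: the tiger's attributes cover all of the animal's, but not the other way around.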
Citations: 11
Addressing Instance Ambiguity in Web Harvesting
Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767114
Zhixu Li, Xiangliang Zhang, Hai Huang, Qing Xie, Jia Zhu, Xiaofang Zhou
Web Harvesting enables the enrichment of incomplete data sets by retrieving required information from the Web. However, the ambiguity of instances may greatly decrease the quality of the harvested data, given that any instance in the local data set may become ambiguous when attempting to identify it on the Web. Although plenty of disambiguation methods have been proposed to deal with ambiguity problems in various settings, none of them is able to handle the instance ambiguity problem in Web Harvesting. In this paper, we propose to perform instance disambiguation in Web Harvesting with a novel disambiguation method inspired by the idea of collaborative identity recognition. In particular, we expect to find common properties in the form of latent shared attribute values among instances in the list, such that these shared attribute values can differentiate instances within the list from ambiguous ones on the Web. Our extensive experimental evaluation illustrates the utility of collaborative disambiguation for a popular Web Harvesting application, and shows that it substantially improves the accuracy of the harvested data.
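A minimal sketch of the shared-attribute intuition (the data shapes and the greedy scoring below are our illustrative assumptions, not the paper's algorithm): candidates whose attribute values are shared by other instances of the same list get boosted, which rules out ambiguous Web matches.

```python
from collections import Counter

def disambiguate(candidates):
    """candidates: {instance: [candidate, ...]} where each candidate is an
    {attribute: value} dict describing one possible Web match.

    Score each (attribute, value) pair by how many instances in the list
    could share it, then keep the best-supported candidate per instance."""
    support = Counter()
    for cands in candidates.values():
        # count each (attribute, value) pair at most once per instance
        support.update({pair for c in cands for pair in c.items()})
    return {
        inst: max(cands, key=lambda c: sum(support[p] for p in c.items()))
        for inst, cands in candidates.items()
    }

# Hypothetical list of colleagues: the correct Web matches share an affiliation.
members = {
    "J. Smith": [{"affiliation": "NTU", "field": "databases"},
                 {"affiliation": "Acme Corp", "field": "marketing"}],
    "A. Lee":   [{"affiliation": "NTU", "field": "data mining"},
                 {"affiliation": "ZooShop", "field": "retail"}],
}
resolved = disambiguate(members)
```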
Citations: 1
Long-term Optimization of Update Frequencies for Decaying Information
Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767113
Simon Razniewski, W. Nutt
Many kinds of information, such as addresses, crawls of webpages, or academic affiliations, are prone to becoming outdated over time. Therefore, in some applications, updates are performed periodically in order to keep the correctness and usefulness of such information high. As refreshing information usually has a cost, e.g. computation time, network bandwidth or human work time, a problem is to find the right update frequency depending on the benefit gained from the information and on the speed with which the information is expected to get outdated. This is especially important since entities often exhibit different speeds of getting outdated: for example, addresses of students change more frequently than addresses of pensioners, and news portals change more frequently than personal homepages. Thus, there is no uniform best update frequency for all entities. Previous work [5] on data freshness has focused on the question of how to best distribute a fixed budget for updates among various entities, which is of interest in the short term, when resources are fixed and cannot be adjusted. In the long term, many businesses are able to adjust their resources in order to optimize their gain. Then, the problem is not one of distributing a fixed number of updates but one of determining the frequency of updates that optimizes the overall gain from the information. In this paper, we investigate how the optimal update frequency for decaying information can be determined. We show that the optimal update frequency can be determined independently for each entity, and how simple iteration can be used to find it. An implementation of our solution for exponential decay is available online.
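The optimization can be illustrated with a toy model (our assumption for the example, not necessarily the paper's exact formulation): freshness decays as exp(-decay * t) between updates, each update has a fixed cost, and we iterate to find the frequency that maximizes net gain per unit time.

```python
import math

def net_gain(freq, benefit, cost, decay):
    """Average freshness benefit minus update cost, per unit time,
    when updating every 1/freq time units under exponential decay."""
    interval = 1.0 / freq
    # mean of exp(-decay * t) over one update interval
    avg_freshness = (1 - math.exp(-decay * interval)) / (decay * interval)
    return benefit * avg_freshness - cost * freq

def optimal_frequency(benefit, cost, decay, lo=1e-4, hi=1e3, iters=200):
    """Simple iterative (ternary) search over the net-gain curve,
    which is unimodal in the update frequency for this model."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if net_gain(m1, benefit, cost, decay) < net_gain(m2, benefit, cost, decay):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

As one would expect, faster-decaying information (a larger `decay`) warrants a higher optimal update frequency at the same cost.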
Citations: 1
FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths
Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767112
Marilena Oita, P. Senellart
Content-intensive websites, e.g., of blogs or news, present pages that contain Web articles automatically generated by content management systems. Identification and extraction of their main content is critical in many applications, such as indexing or classification. We present a novel unsupervised approach for the extraction of Web articles from dynamically-generated Web pages. Our system, called Forest, combines structural and information-based features to target the main content generated by a Web source, and published in associated Web pages. We extensively evaluate Forest with respect to various baselines and datasets, and report improved results over state-of-the-art techniques in content extraction.
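A crude stdlib illustration of the tag-path idea from the title (not Forest itself, which also combines information-based features): profile how much text each tag path carries and pick the heaviest path as the likely main content.

```python
from collections import defaultdict
from html.parser import HTMLParser

class TagPathProfiler(HTMLParser):
    """Accumulate the amount of text found under each tag path."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.weight = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching open tag (tolerates unclosed tags)
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.weight["/".join(self.stack)] += len(text)

def main_content_path(html):
    profiler = TagPathProfiler()
    profiler.feed(html)
    return max(profiler.weight, key=profiler.weight.get)

# Invented toy page: a navigation bar and an article body.
doc = ("<html><body>"
       "<div><a>Home</a><a>About</a></div>"
       "<div><p>A long article body with many words of actual content.</p>"
       "<p>A second paragraph continuing the article text.</p></div>"
       "</body></html>")
```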
Citations: 1
TriAL-QL: Distributed Processing of Navigational Queries
Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767115
Martin Przyjaciel-Zablocki, A. Schätzle, Adriano Lange
Navigational queries are among the most natural query patterns for RDF data, yet most existing RDF query languages fail to cover all the varieties inherent to its triple-based model, including SPARQL 1.1 and its derivatives. As a consequence, the development of more expressive RDF languages is of general interest. With TriAL* [14], there exists an expressive algebra which subsumes many previous approaches, while adding novel features that are not expressible in most other RDF query languages based on the standard graph model. However, its algebraic notation is inappropriate for practical usage and it is not supported by any existing RDF triple store. In this paper, we propose TriAL-QL, a language for TriAL* that is easy to write and grasp, preserving its compositional algebraic structure. We present an implementation based on Impala, a massively parallel SQL query engine on Hadoop, using an optimized semi-naive evaluation for the recursive fragments of TriAL*. This way, we support both data-intensive ETL-like workloads and explorative ad-hoc style queries. To demonstrate the scalability and expressiveness of our approach, we conducted experiments on generated social networks with up to 1.8 billion triples and compared different execution strategies to a Hive-based solution.
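The recursive fragment of TriAL* generalizes reachability; as a minimal single-machine sketch (the actual system evaluates this distributed over Impala), semi-naive evaluation of transitive closure over a plain binary edge relation looks like this:

```python
def transitive_closure(edges):
    """Semi-naive evaluation of T = E ∪ (T ⋈ E): each round joins only
    the newly derived pairs (the delta) with the base edge relation,
    instead of re-joining everything derived so far."""
    edges = set(edges)
    total = set(edges)
    delta = set(edges)
    while delta:
        derived = {(a, d) for (a, b) in delta for (c, d) in edges if b == c}
        delta = derived - total   # keep only genuinely new facts
        total |= delta
    return total
```

On a chain 1 -> 2 -> 3 -> 4 this terminates after two rounds, having derived (1,3), (2,4) and then (1,4).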
Citations: 6
Person-Name Parsing for Linking User Web Profiles
Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767117
G. Das, Xiang Li, Ang Sun, Hakan Kardes, Xin Wang
A person-name parser involves the identification of constituent parts of a person's name. Due to multiple writing styles ("John Smith" versus "Smith, John"), extra information ("John Smith, PhD", "Rev. John Smith"), and country-specific last-name prefixes ("Jean van de Velde"), parsing fullname strings from user profiles on Web 2.0 applications is not straightforward. To the best of our knowledge, we are the first to address this problem systematically by proposing machine learning approaches for parsing noisy fullname strings. In this paper, we propose several types of features based on token statistics, surface-patterns, and specialized dictionaries and apply them within a sequence modeling framework to learn a fullname parser. In particular, we propose the use of "bucket" features based on (name-token, label) distributions in lieu of "term" features frequently used in various Natural Language Processing applications to prevent the growth of learning parameters as a function of the training data size. We experimentally illustrate the generalizability, effectiveness, and efficiency aspects of our proposed features for noisy fullname parsing on fullname strings from the popular, professional networking website LinkedIn and commonly-used person names in the United States. On these datasets, our fullname parser significantly outperforms both the parser trained using classification approaches and a commercially-available name parsing solution.
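The abstract's examples can be handled by a toy rule-based parser; the sketch below (dictionaries and rules invented for illustration; the paper instead learns a sequence model with bucket features) shows why the task is non-trivial:

```python
SUFFIXES = {"phd", "md", "jr", "sr", "ii", "iii", "esq"}
HONORIFICS = {"mr", "mrs", "ms", "dr", "rev", "prof"}
PARTICLES = {"van", "von", "de", "der", "den", "la", "del"}

def parse_fullname(raw):
    """Return (first, last) from a noisy fullname string (toy rules only)."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    # drop comma-separated parts that are pure suffixes ("PhD", "Jr.")
    parts = [p for p in parts
             if not all(t.strip(".").lower() in SUFFIXES for t in p.split())]
    if len(parts) == 2:                      # "Smith, John" style
        last, first = parts
        return first, last
    tokens = parts[0].split()
    if tokens and tokens[0].strip(".").lower() in HONORIFICS:
        tokens = tokens[1:]                  # drop "Rev.", "Dr.", ...
    i = len(tokens) - 1
    while i > 1 and tokens[i - 1].lower() in PARTICLES:
        i -= 1                               # glue "van de" onto the last name
    return " ".join(tokens[:i]), " ".join(tokens[i:])
```

Hand-written dictionaries like these are exactly what breaks down at Web scale, which is the motivation for the learned approach.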
Citations: 5
Analyzing Crowd Rankings
Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767110
Julia Stoyanovich, Marie Jacob, Xuemei Gong
Ranked data is ubiquitous in real-world applications, arising naturally when users express preferences about products and services, when voters cast ballots in elections, and when funding proposals are evaluated based on their merits or university departments based on their reputation. This paper focuses on crowdsourcing and novel analysis of ranked data. We describe the design of a data collection task in which Amazon MT workers were asked to rank movies. We present results of data analysis, correlating our ranked dataset with IMDb, where movies are rated on a discrete scale rather than ranked. We develop an intuitive measure of worker quality appropriate for this task, where no gold standard answer exists. We propose a model of local structure in ranked datasets, reflecting that subsets of the workers agree in their ranking over subsets of the items, develop a data mining algorithm that identifies such structure, and evaluate it on our dataset. Our dataset is publicly available at https://github.com/stoyanovich/CrowdRank.
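The abstract does not spell out its worker-quality measure; as a plausible stand-in when no gold standard exists, one can score each worker by average Kendall-tau agreement with the other workers (our assumption for the sketch, not necessarily the paper's measure):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two total orders over the same items."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    concordant = sum(
        1 for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
    )
    # no ties: discordant = n_pairs - concordant
    return (2 * concordant - n_pairs) / n_pairs

def worker_quality(rankings):
    """Score each worker by average agreement with every other worker."""
    return {
        w: sum(kendall_tau(r, rankings[o]) for o in rankings if o != w)
           / (len(rankings) - 1)
        for w, r in rankings.items()
    }

scores = worker_quality({
    "w1": ["a", "b", "c"],
    "w2": ["a", "b", "c"],
    "w3": ["c", "b", "a"],   # disagrees with the majority
})
```

Note this global agreement score is exactly what the paper's local-structure model refines: a worker who disagrees globally may still agree with a subgroup on a subset of the items.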
Citations: 14
The elephant in the room: getting value from Big Data
Pub Date : 2015-05-31 DOI: 10.1145/2767109.2770014
S. Abiteboul, X. Dong, Oren Etzioni, D. Srivastava, G. Weikum, Julia Stoyanovich, Fabian M. Suchanek
Big Data, and its 4 Vs – volume, velocity, variety, and veracity – have been at the forefront of societal, scientific and engineering discourse. Arguably the most important 5th V, value, is not talked about as much. How can we make sure that our data is not just big, but also valuable? WebDB 2015 has as its theme “Freshness, Correctness, Quality of Information and Knowledge on the Web”. The workshop attracted 31 submissions, of which the best 9 were selected for presentation at the workshop, and for publication in the proceedings. To set the stage, we have interviewed several prominent members of the data management community, soliciting their opinions on how we can ensure that data is not just available in quantity, but also in quality. In this interview Serge Abiteboul, Oren Etzioni, Divesh Srivastava with Luna Dong, and Gerhard Weikum shared with us their motivation for doing research in the area of data quality, and discussed their current work and their view on the future of the field. This interview appeared as a SIGMOD Blog article.
大数据及其4v(体积、速度、多样性和准确性)一直处于社会、科学和工程话语的前沿。可以说,最重要的第5个V——价值——并没有被谈论得那么多。我们如何确保我们的数据不仅大,而且有价值?WebDB 2015的主题是“网络上信息和知识的新鲜度、正确性和质量”。研讨会共收到31份意见书,其中最好的9份被选出在研讨会上发表,并在会议记录上发表。为了做好准备,我们采访了几位数据管理界的知名人士,就如何确保数据不仅在数量上可用,而且在质量上可用征求他们的意见。在这次采访中,Serge Abiteboul, Oren Etzioni, Divesh Srivastava和Luna Dong以及Gerhard Weikum与我们分享了他们在数据质量领域进行研究的动机,并讨论了他们目前的工作以及他们对该领域未来的看法。这次采访发表在SIGMOD博客上。
Citations: 12
IBEX: Harvesting Entities from the Web Using Unique Identifiers
Pub Date : 2015-05-04 DOI: 10.1145/2767109.2767116
Aliaksandr Talaika, J. Biega, Antoine Amarilli, Fabian M. Suchanek
In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73--96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
Citations: 16
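The noise-filtering idea in the abstract — exploiting structural properties of the identifiers themselves — can be illustrated with check-digit validation. The sketch below is a minimal illustration of that principle, not the paper's actual pipeline: it extracts 13-digit candidates from raw text and keeps only those whose EAN-13/ISBN-13/GTIN-13 check digit is consistent, which rejects most random digit strings.

```python
import re

def valid_gtin13(code):
    """Check-digit test shared by EAN-13, ISBN-13, and GTIN-13.

    Digits are weighted 1, 3, 1, 3, ... from the left; a valid code's
    weighted sum is a multiple of 10.
    """
    if not re.fullmatch(r"\d{13}", code):
        return False
    digits = [int(c) for c in code]
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0

def harvest_gtins(text):
    """Extract 13-digit candidates and keep only those with a valid check digit."""
    candidates = re.findall(r"(?<!\d)\d{13}(?!\d)", text)
    return [c for c in candidates if valid_gtin13(c)]
```

A random 13-digit string passes this test only about 10% of the time, so even this one property discards most false positives; combining it with the paper's other signals (co-occurrence with names, cross-page uniqueness) tightens the filter further.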