
Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management: latest articles

Efficient methods for finding influential locations with adaptive grids
D. Yan, R. C. Wong, Wilfred Ng
Given a set S of servers and a set C of clients, an optimal-location query returns a location where a new server can attract the greatest number of clients. Optimal-location queries are important in many real-life applications, such as mobile service planning or resource distribution in an area. Previous studies assume that a client always visits its nearest server, which is too strict to hold in reality. In this paper, we relax this assumption and propose a new model to tackle this problem. We further generalize the problem to finding top-k optimal locations. The main challenge is that even the fastest approach in existing studies takes hours to answer an optimal-location query on a typical real-world dataset, which significantly limits the applications of the query. Using our relaxed model, we design an efficient grid-based approximation algorithm called FILM (Fast Influential Location Miner) to answer the queries. FILM is orders of magnitude faster than the best-known previous work, and the number of clients attracted by a new server at the result location often exceeds 98% of the optimal. The algorithm is extended to finding k influential locations. Extensive experiments on both real and synthetic datasets show the efficiency and effectiveness of FILM.
DOI: https://doi.org/10.1145/2063576.2063788. Pages 1475-1484. Published 2011-10-24.
Citations: 36
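To make the optimal-location query in the abstract above concrete, here is a minimal sketch. It is not FILM: the abstract does not specify the relaxed model, so the code uses the classical nearest-server assumption that the paper relaxes, and it scans a uniform grid of candidate points, not an adaptive one. The function names and the toy data are hypothetical.

```python
import math

def nearest_server_distance(client, servers):
    """Distance from a client to its closest existing server."""
    return min(math.dist(client, s) for s in servers)

def best_grid_location(clients, servers, bounds, cells_per_side):
    """Return (point, count) for the grid-cell centre that attracts the most clients.

    Assumption (classical model, not the paper's relaxed one): a client is
    attracted when the candidate is strictly closer than its nearest server.
    """
    x_min, y_min, x_max, y_max = bounds
    thresholds = [nearest_server_distance(c, servers) for c in clients]
    step_x = (x_max - x_min) / cells_per_side
    step_y = (y_max - y_min) / cells_per_side
    best_point, best_count = None, -1
    for i in range(cells_per_side):
        for j in range(cells_per_side):
            # Candidate location: the centre of grid cell (i, j).
            p = (x_min + (i + 0.5) * step_x, y_min + (j + 0.5) * step_y)
            count = sum(
                1 for c, t in zip(clients, thresholds) if math.dist(c, p) < t
            )
            if count > best_count:
                best_point, best_count = p, count
    return best_point, best_count

# Toy data: one existing server, three far clients, one client next to the server.
servers = [(0.0, 0.0)]
clients = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0), (0.2, 0.0)]
point, count = best_grid_location(clients, servers, (0.0, 0.0, 10.0, 10.0), 10)
print(point, count)
```

The scan costs one distance computation per client per cell, which hints at why exact methods are slow on large datasets and why a coarser or adaptive grid is attractive as an approximation.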
Semi-indexing semi-structured data in tiny space
G. Ottaviano, R. Grossi
Semi-structured textual formats are gaining increasing popularity for the storage of document collections and rich logs. Their flexibility comes at the cost of having to load and parse a document entirely even if just a small part of it needs to be accessed. For instance, in data analytics massive collections are usually scanned sequentially, selecting a small number of attributes from each document. We propose a technique to attach to a raw, unparsed document (even in compressed form) a "semi-index": a succinct data structure that supports operations on the document tree at a speed comparable with an in-memory deserialized object, thus bridging textual formats with binary formats. After describing the general technique, we focus on the JSON format: our experiments show that avoiding the full loading and parsing step can give speedups of up to 12 times for on-disk documents with a small space overhead.
DOI: https://doi.org/10.1145/2063576.2063790. Pages 1485-1494. Published 2011-10-24.
Citations: 12
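The idea of attaching a side index to an unparsed document, as described in the abstract above, can be illustrated with a toy sketch. This is not the authors' succinct structure: it stores plain character offsets in a Python dict, covers only the top-level keys of a JSON object, and the helper names `build_offsets` and `get_field` are hypothetical. One scan records where each value lives, so later accesses parse only that slice.

```python
import json

def build_offsets(text):
    """Map each top-level key of a JSON object to the (start, end) span of its value."""
    spans = {}
    depth = 0
    in_string = False
    escaped = False
    string_start = 0
    last_string = None   # most recent top-level string seen in key position
    key = None
    value_start = None
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                if depth == 1 and value_start is None:
                    last_string = json.loads(text[string_start:i + 1])
            continue
        if ch == '"':
            in_string = True
            string_start = i
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            if depth == 1 and key is not None:
                spans[key] = (value_start, i)   # last value before the closing brace
                key, value_start = None, None
            depth -= 1
        elif ch == ":" and depth == 1:
            key, value_start = last_string, i + 1
        elif ch == "," and depth == 1:
            spans[key] = (value_start, i)
            key, value_start = None, None
    return spans

def get_field(text, spans, name):
    """Parse only the slice holding the requested top-level value."""
    start, end = spans[name]
    return json.loads(text[start:end])

doc = '{"id": 7, "user": {"name": "a,b}", "tags": [1, 2]}, "msg": "x:y"}'
index = build_offsets(doc)
print(get_field(doc, index, "msg"))
```

The index is built once and can be stored next to the raw text; the raw document itself is left untouched, which is the property the abstract emphasises.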
Local computation of PageRank: the ranking side
M. Bressan, Luca Pretto
Imagine you are a social network user who wants to search, in a list of potential candidates, for the best candidate for a job on the basis of their PageRank-induced importance ranking. Is it possible to compute this ranking at a low cost, by visiting only small subnetworks around the nodes that represent each candidate? The fundamental problem underpinning this question, i.e. computing locally the PageRank ranking of $k$ nodes in an $n$-node graph, was first raised by Chen et al. (CIKM 2004) and then restated by Bar-Yossef and Mashiach (CIKM 2008). In this paper we formalize and provide the first analysis of the problem, proving that any local algorithm that computes a correct ranking must take into consideration $\Omega(\sqrt{kn})$ nodes -- even when ranking the top $k$ nodes of the graph, even if their PageRank scores are "well separated", and even if the algorithm is randomized (and we prove a stronger $\Omega(n)$ bound for deterministic algorithms). Experiments carried out on large, publicly available crawls of the web and of a social network show that, in practice too, the fraction of the graph that must be visited to compute the ranking may be considerable, both for algorithms that are always correct and for algorithms that employ (efficient) local score approximations.
DOI: https://doi.org/10.1145/2063576.2063670. Pages 631-640. Published 2011-10-24.
Citations: 17
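For reference, the non-local baseline that the abstract above contrasts with can be sketched as follows: compute PageRank on the whole graph by power iteration, then sort the k candidates by score. This is standard PageRank, not an algorithm from the paper; the damping factor 0.85, the iteration count, and the toy graph are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank over an adjacency dict {node: [out-neighbours]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not graph[u])
        new_rank = {
            u: (1.0 - damping) / n + damping * dangling / n for u in nodes
        }
        for u in nodes:
            out = graph[u]
            if out:
                share = damping * rank[u] / len(out)
                for v in out:
                    new_rank[v] += share
        rank = new_rank
    return rank

def rank_candidates(graph, candidates):
    """Order the candidate nodes by decreasing global PageRank score."""
    scores = pagerank(graph)
    return sorted(candidates, key=lambda u: scores[u], reverse=True)

# Toy directed graph: c is linked from a, b and d; d has no in-links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(rank_candidates(graph, ["d", "a", "c"]))  # ['c', 'a', 'd']
```

Every iteration touches all n nodes and all edges. The lower bound quoted in the abstract says that a local algorithm cannot avoid visiting a large part of the graph either, which is why this global computation remains a relevant point of comparison.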
Fast fully dynamic landmark-based estimation of shortest path distances in very large graphs
Konstantin Tretyakov, Abel Armas-Cervantes, L. García-Bañuelos, J. Vilo, M. Dumas
Computing the shortest path between a pair of vertices in a graph is a fundamental primitive in graph algorithmics. Classical exact methods for this problem do not scale up to contemporary, rapidly evolving social networks with hundreds of millions of users and billions of connections. A number of approximate methods have been proposed, including several landmark-based methods that have been shown to scale up to very large graphs with acceptable accuracy. This paper presents two improvements to existing landmark-based shortest path estimation methods. The first improvement relates to the use of shortest-path trees (SPTs). Together with appropriate short-cutting heuristics, the use of SPTs makes it possible to achieve higher accuracy with acceptable time and memory overhead. Furthermore, SPTs can be maintained incrementally under edge insertions and deletions, which allows for a fully dynamic algorithm. The second improvement is a new landmark selection strategy that seeks to maximize the coverage of all shortest paths by the selected landmarks. The improved method is evaluated on the DBLP, Orkut, Twitter and Skype social networks.
DOI: https://doi.org/10.1145/2063576.2063834. Pages 1785-1794. Published 2011-10-24.
Citations: 85
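The basic landmark scheme that the abstract above builds on can be sketched in a few lines. This is the plain distance-only variant, without the shortest-path trees, short-cutting heuristics, or dynamic updates the paper adds: precompute BFS distances from each landmark, then bound d(u, v) from above by the smallest d(u, l) + d(l, v) over landmarks l. The graph, the landmark choice, and the function names are hypothetical.

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop distances from source to every reachable vertex of an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def build_landmark_index(graph, landmarks):
    """One BFS per landmark; the index maps landmark -> distance table."""
    return {l: bfs_distances(graph, l) for l in landmarks}

def estimate_distance(index, u, v):
    """Upper bound on d(u, v) from the triangle inequality through each landmark."""
    best = float("inf")
    for dist in index.values():
        if u in dist and v in dist:
            best = min(best, dist[u] + dist[v])
    return best

# Toy undirected graph: path a-b-c-d-e plus the shortcut b-f-e.
graph = {
    "a": ["b"],
    "b": ["a", "c", "f"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f"],
    "f": ["b", "e"],
}
exact = bfs_distances(graph, "a")["e"]                                     # 3
one = estimate_distance(build_landmark_index(graph, ["c"]), "a", "e")      # 4
two = estimate_distance(build_landmark_index(graph, ["c", "f"]), "a", "e") # 3
print(exact, one, two)
```

With landmark c alone the estimate overshoots because c is not on a shortest a-e path; adding f, which is on one, makes the estimate exact. This is the intuition behind selecting landmarks that cover many shortest paths.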
Retrieving and ranking unannotated images through collaboratively mining online search results
Songhua Xu, Hao Jiang, F. Lau
We present a new image search and ranking algorithm for retrieving unannotated images by collaboratively mining online search results, which consist of online image and text search results. The online image search results are leveraged as reference examples to perform content-based image search over unannotated images. The online text search results are utilized to estimate the reference images' relevance to the search query. The key feature of our method is its capability to deal with unreliable online image search results through jointly mining visual and textual aspects of online search results. Through such collaborative mining, our algorithm infers the relevance of an online search result image to a text query. Once we obtain the estimate of the query relevance score for each online image search result, we can selectively use query-specific online search result images as reference examples for retrieving and ranking unannotated images. We tested our algorithm on both standard public image datasets and several modestly sized personal photo collections. We also compared our method with two well-known peer methods. The results indicate that our algorithm is superior to existing content-based image search algorithms for retrieving and ranking unannotated images.
DOI: https://doi.org/10.1145/2063576.2063650. Pages 485-494. Published 2011-10-24.
Citations: 3
Recency ranking by diversification of result set
Andrey Styskin, Fedor Romanenko, F. Vorobyev, P. Serdyukov
In this paper, we propose a web search retrieval approach which automatically detects recency-sensitive queries and increases the freshness of the ordinary document ranking by a degree proportional to the probability of the need for recent content. We propose to solve the recency ranking problem by using result diversification principles, and we deal with the query's non-topical ambiguity that appears when the need for recent content can be detected only with uncertainty. Our offline and online experiments with millions of queries from real search engine users demonstrate a significant increase in the satisfaction of users presented with a search result generated by our approach.
DOI: https://doi.org/10.1145/2063576.2063862. Pages 1949-1952. Published 2011-10-24.
Citations: 20
Managing interoperability and complexity in health systems: MIXHS'11 workshop summary
M. Bouamrane, C. Tao
Managing Interoperability and Complexity in Health Systems, MIXHS'11, aims to be a forum focussing on recent research and technical results in knowledge management and information systems in bio-medical and electronic health systems. The workshop will provide an opportunity for sharing practical experiences and best practices in e-Health information infrastructure development and management. Of particular interest to the workshop themes are technical solutions to recurring practical systems deployment issues, including harnessing the complexity of bio-medical domain knowledge and the interoperability of heterogeneous health systems. The workshop will gather experts, researchers, system developers, practitioners and policymakers designing and implementing solutions for managing clinical data and integrating existing and future electronic health systems infrastructures.
DOI: https://doi.org/10.1145/2063576.2064050. Pages 2635-2636. Published 2011-10-24.
Citations: 3
Duplicate detection through structure optimization
Luís Leitão, P. Calado
Detecting and eliminating duplicates in databases is a task of critical importance in many applications. Although solutions for traditional models, such as relational data, have been widely studied, recently there has been some focus on solutions for more complex hierarchical structures such as XML data. Such data presents many different challenges, among which is the issue of how to exploit the schema structure to determine if two objects are duplicates. In this paper, we argue that structure can indeed have a significant impact on the process of duplicate detection. We propose a novel method that automatically restructures database objects in order to take full advantage of the relations between their attributes. This new structure reflects the relative importance of the attributes in the database and avoids the need to perform a manual selection. To test our approach we applied it to an existing duplicate detection system. Experiments performed on several datasets show that, using the new learned structure, we consistently outperform both the results obtained with the original database structure and those obtained by letting a knowledgeable user manually choose the attributes to compare.
DOI: https://doi.org/10.1145/2063576.2063644. Pages 443-452. Published 2011-10-24.
Citations: 19
A pairwise ranking based approach to learning with positive and unlabeled examples
Sundararajan Sellamanickam, Priyanka Garg, S. Keerthi
A large fraction of binary classification problems arising in web applications are of the type where the positive class is well defined and compact while the negative class comprises everything else in the distribution for which the classifier is developed; it is hard to represent and sample from such a broad negative class. Classifiers based only on positive and unlabeled examples reduce human annotation effort significantly by removing the burden of choosing a representative set of negative examples. Various methods have been proposed in the literature for building such classifiers. Of these, the state-of-the-art methods are Biased SVM and Elkan & Noto's methods. While these methods often work well in practice, they are computationally expensive since hyperparameter tuning is very important, particularly when the set of labeled positive examples is small and class imbalance is high. In this paper we propose a pairwise ranking based approach to learning from positive and unlabeled examples (LPU) and we give a theoretical justification for it. We present a pairwise RankSVM (RSVM) based method for our approach. The method is simple and efficient, and its hyperparameters are easy to tune. A detailed experimental study using several benchmark datasets shows that the proposed method gives competitive classification performance compared to the mentioned state-of-the-art methods, while training 3-10 times faster. We also propose an efficient AUC-based feature selection technique in the LPU setting and demonstrate its usefulness on the datasets. To get an idea of the goodness of the LPU methods we compare them against supervised learning (SL) methods that also make use of negative examples in training. SL methods give a slightly better performance than LPU methods when there is a rich set of negative examples; however, they are inferior when the number of negative training examples is not large enough.
DOI: 10.1145/2063576.2063675 (https://doi.org/10.1145/2063576.2063675). Pages: 663-672. Published: 2011-10-24.
Citations: 28
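The pairwise ranking idea summarized for "A pairwise ranking based approach to learning with positive and unlabeled examples" can be illustrated with a small sketch. This is not the authors' RankSVM implementation: it assumes a linear scorer trained by plain stochastic subgradient steps on a pairwise hinge loss, so that each labeled positive is pushed to score above a sampled unlabeled example. The function names, hyperparameter values, and the synthetic two-cluster data are all hypothetical and chosen only for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_pairwise_ranker(positives, unlabeled, epochs=50, lr=0.01, reg=0.01, seed=0):
    """Learn w so that w.x_pos exceeds w.x_unlabeled by a margin of 1.

    Assumption: stochastic subgradient descent on an L2-regularized
    pairwise hinge loss, standing in for a real RankSVM solver.
    """
    rng = random.Random(seed)
    w = [0.0] * len(positives[0])
    for _ in range(epochs):
        for _ in range(len(positives)):
            p = rng.choice(positives)
            u = rng.choice(unlabeled)
            diff = [a - b for a, b in zip(p, u)]
            violated = dot(w, diff) < 1.0  # margin check before the update
            w = [wi * (1.0 - lr * reg) for wi in w]  # L2 shrinkage
            if violated:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def make_blob(rng, center, n, spread=0.5):
    """Draw n Gaussian points around `center` (synthetic data only)."""
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]


def demo(seed=42):
    """Train on labeled positives vs. an unlabeled mix; return mean scores."""
    rng = random.Random(seed)
    labeled_pos = make_blob(rng, (2.0, 2.0), 50)
    hidden_pos = make_blob(rng, (2.0, 2.0), 60)    # positives hidden in the unlabeled set
    hidden_neg = make_blob(rng, (-2.0, -2.0), 140)  # negatives, never labeled as such
    w = train_pairwise_ranker(labeled_pos, hidden_pos + hidden_neg)

    def mean_score(rows):
        return sum(dot(w, x) for x in rows) / len(rows)

    return mean_score(hidden_pos), mean_score(hidden_neg)
```

Calling `demo()` returns the mean score of the hidden positives and of the hidden negatives. The scorer never sees a negative label, yet pairs made of a labeled positive and an unlabeled negative dominate the updates, which is why ranking positives above unlabeled examples can still separate the classes.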
Ranking-based processing of SQL queries
H. Azzam, T. Roelleke, Sirvan Yahyaei
A growing number of applications are built on top of search engines and issue complex structured queries. This paper contributes a customisable ranking-based processing of such queries, specifically SQL. Similar to how term-based statistics are exploited by term-based retrieval models, ranking-aware processing of SQL queries exploits tuple-based statistics that are derived from sources or, more precisely, derived from the relations specified in the SQL query. To implement this ranking-based processing, we leverage PSQL, a probabilistic variant of SQL, to facilitate probability estimation and the generalisation of document retrieval models to be used for tuple retrieval. The result is a general-purpose framework that can interpret any SQL query and then assign a probabilistic retrieval model to rank the results of that query. The evaluation on the IMDB and Monster benchmarks proves that the PSQL-based approach is applicable to (semi-)structured and unstructured data and structured queries.
DOI: 10.1145/2063576.2063614 (https://doi.org/10.1145/2063576.2063614). Pages: 231-236. Published: 2011-10-24.
Citations: 0
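The abstract of "Ranking-based processing of SQL queries" describes ranking query results with statistics derived from the tuples of the queried relations, by analogy with term statistics in document retrieval. The sketch below illustrates that analogy only. It is not PSQL and does not reproduce the paper's probability estimation: it assumes a TF-IDF-like score computed from tuple counts, and the relation, the column names, and the `rank_tuples` helper are hypothetical.

```python
import math
from collections import Counter, defaultdict


def rank_tuples(rows, key, attr, query_values):
    """Rank distinct `key` values by a TF-IDF-like score over tuples.

    rows: a relation given as a list of dicts.
    key: column identifying the entity to rank (plays the role of a document).
    attr: column whose values play the role of terms.
    query_values: set of attribute values the query asks for.
    """
    keys = {row[key] for row in rows}
    keys_per_value = defaultdict(set)  # value -> entities containing it ("document frequency")
    tuple_count = Counter()            # (entity, value) -> number of tuples ("term frequency")
    for row in rows:
        keys_per_value[row[attr]].add(row[key])
        tuple_count[(row[key], row[attr])] += 1

    scores = Counter()
    for (entity, value), count in tuple_count.items():
        if value in query_values:
            # Values shared by fewer entities receive a larger weight.
            idf = math.log(1.0 + len(keys) / len(keys_per_value[value]))
            scores[entity] += count * idf
    # Highest score first; ties are broken by entity name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy relation, loosely in the spirit of a movie database (invented data).
ROWS = [
    {"movie": "m1", "keyword": "heist"},
    {"movie": "m1", "keyword": "drama"},
    {"movie": "m2", "keyword": "drama"},
    {"movie": "m3", "keyword": "drama"},
    {"movie": "m3", "keyword": "comedy"},
]

ranked = rank_tuples(ROWS, "movie", "keyword", {"heist", "drama"})
```

Here `ranked` lists `m1` first because it matches both query values and "heist" is rare in the relation, while `m2` and `m3` tie on the common value "drama". A SQL engine would return these three movies as an unordered set; the tuple statistics supply the ordering.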