
Proceedings of the 13th International Conference on Web Search and Data Mining: Latest Publications

Web-scale Knowledge Collection
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3371878
Colin Lockard, Prashant Shiralkar, Xin Dong, Hannaneh Hajishirzi
How do we surface the large amount of information present in HTML documents on the Web, from news articles to scientific papers to Rotten Tomatoes pages to tables of sports scores? Such information can enable a variety of applications including knowledge base construction, question answering, recommendation, and more. In this tutorial, we present approaches for Information Extraction (IE) from Web data that can be differentiated along two key dimensions: 1) the diversity in data modality that is leveraged, e.g. text, visual, XML/HTML, and 2) the thrust to develop scalable approaches with zero to limited human supervision. We cover the key ideas and intuition behind existing approaches to emphasize their applicability and potential in various settings.
Citations: 3
A Structural Graph Representation Learning Framework
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3371843
Ryan A. Rossi, Nesreen K. Ahmed, Eunyee Koh, Sungchul Kim, Anup B. Rao, Yasin Abbasi-Yadkori
The success of many graph-based machine learning tasks highly depends on an appropriate representation learned from the graph data. Most work has focused on learning node embeddings that preserve proximity, as opposed to structural role-based embeddings that preserve the structural similarity among nodes. These methods fail to capture higher-order structural dependencies and connectivity patterns that are crucial for structural role-based applications such as visitor stitching from web logs. In this work, we formulate higher-order network representation learning and describe a general framework called HONE for learning such structural node embeddings from networks via the subgraph patterns (network motifs, graphlet orbits/positions) in a node's neighborhood. A general diffusion mechanism is introduced in HONE along with a space-efficient approach that avoids explicit construction of the k-step motif-based matrices by using a k-step linear operator. Furthermore, HONE is shown to be fast and efficient, with a worst-case time complexity that is nearly linear in the number of edges. The experiments demonstrate the effectiveness of HONE for a number of important tasks including link prediction and visitor stitching from large web log data.
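As an illustration of the motif-based idea sketched in this abstract, the snippet below weights each edge by the triangles it participates in, row-normalizes the result, takes a k-step power, and factorizes it into low-dimensional node embeddings. Unlike HONE's space-efficient linear operator, this sketch materializes the k-step matrix explicitly; the triangle motif, the SVD-based factorization, and the helper names (`motif_adjacency`, `hone_like_embedding`) are illustrative assumptions, not the authors' implementation.

```python
# A minimal motif-weighted, k-step embedding sketch in the spirit of HONE.
import numpy as np
import networkx as nx
from scipy.sparse.linalg import svds

def motif_adjacency(G):
    """Weight each edge by 1 plus the number of triangles it participates in."""
    n = G.number_of_nodes()
    idx = {v: i for i, v in enumerate(G.nodes())}
    W = np.zeros((n, n))
    for u, v in G.edges():
        tri = len(set(G[u]) & set(G[v]))      # common neighbors close a triangle
        W[idx[u], idx[v]] = W[idx[v], idx[u]] = 1 + tri
    return W

def hone_like_embedding(G, k=2, dim=8):
    """Embed nodes from the k-step motif-weighted transition matrix."""
    W = motif_adjacency(G)
    deg = W.sum(axis=1, keepdims=True)
    P = W / np.maximum(deg, 1e-12)            # row-normalized 1-step operator
    Pk = np.linalg.matrix_power(P, k)         # explicit k-step matrix (HONE avoids this)
    U, s, _ = svds(Pk, k=dim)                 # low-rank factorization as the embedding
    return U * np.sqrt(s)

emb = hone_like_embedding(nx.karate_club_graph())
print(emb.shape)   # (34, 8)
```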
Citations: 31
Why Do People Buy Seemingly Irrelevant Items in Voice Product Search?: On the Relation between Product Relevance and Customer Satisfaction in eCommerce
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3371780
David Carmel, Elad Haramaty, Arnon Lazerson, L. Lewin-Eytan, Y. Maarek
One emerging benefit of voice assistants is facilitating the product search experience: users can express orally which products they seek and take actions on retrieved results, such as adding them to their cart or sending the product details to their mobile phone for further examination. Looking at users' behavior in product search supported by a digital voice assistant, we have observed an interesting phenomenon: users purchase or engage with search results that are objectively judged irrelevant to their queries. In this work, we analyze and characterize this phenomenon. We provide several hypotheses as to the reasons behind it, including users' personalized preferences, the product's popularity, the product's indirect relation with the query, the user's tolerance level, the query intent, and the product price. We address each hypothesis by conducting thorough data analyses and offer some insights into users' purchase and engagement behavior with seemingly irrelevant results. We conclude with a discussion of how this analysis can be used to improve voice product search services.
Citations: 21
Intelligible Machine Learning and Knowledge Discovery Boosted by Visual Means
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3371872
Boris Kovalerchuk
Intelligible machine learning and knowledge discovery are important for modeling individual and social behavior, user activity, link prediction, community detection, crowd-generated data, and more. Interpretable methods also play a significant role in web search and mining activities, enhancing clustering, classification, data summarization, knowledge acquisition, opinion and sentiment mining, web traffic analysis, and web recommender systems. Deep learning's success in predictive accuracy, together with its failure to explain the produced models without special interpretation effort, has motivated a surge of efforts to make Machine Learning (ML) models more intelligible and understandable. The prominence of visual methods in producing appealing explanations of ML models has motivated the growth of deep visualization and visual knowledge discovery. This tutorial covers state-of-the-art research, development, and applications in the area of Intelligible Knowledge Discovery and Machine Learning boosted by Visual Means.
Citations: 1
PERQ
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3371782
Zhiyong Wu, B. Kao, Tien-Hsuan Wu, Pengcheng Yin, Qun Liu
A knowledge-based question-answering (KB-QA) system is one that answers natural-language questions by accessing information stored in a knowledge base (KB). Existing KB-QA systems generally register an accuracy of 70-80% for simple questions and less for more complex ones. We observe that certain questions are intrinsically difficult to answer correctly with existing systems. We propose the PERQ framework to address this issue. Given a question q, we perform three steps to boost answer accuracy: (1) (Prediction) We predict whether q can be answered correctly by a KB-QA system S. (2) (Explanation) If S is predicted to fail on q, we analyze the question and the system to determine the most likely reasons for the failure. (3) (Rectification) We use the prediction and explanation results to rectify the answer. We put forward tools to achieve the three steps and analyze their effectiveness. Our experiments show that the PERQ framework can significantly improve KB-QA systems' accuracy on simple questions.
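To make the three-step loop concrete, here is a schematic sketch of a predict-explain-rectify wrapper around a base KB-QA system. The stub components, the failure-reason label, the threshold, and the helper name `perq_answer` are hypothetical placeholders, not the tools proposed in the paper.

```python
# A schematic predict-explain-rectify loop in the spirit of PERQ.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PerqResult:
    answer: Optional[str]
    predicted_failure: bool
    failure_reason: Optional[str]

def perq_answer(question: str,
                kbqa: Callable[[str], str],
                will_fail: Callable[[str], float],
                explain: Callable[[str], str],
                rectify: Callable[[str, str], Optional[str]],
                threshold: float = 0.5) -> PerqResult:
    """Run the base KB-QA system and only intervene when failure is predicted."""
    answer = kbqa(question)
    p_fail = will_fail(question)                 # (1) prediction step
    if p_fail < threshold:
        return PerqResult(answer, False, None)
    reason = explain(question)                   # (2) explanation step
    fixed = rectify(question, reason)            # (3) rectification step
    return PerqResult(fixed if fixed is not None else answer, True, reason)

# Toy usage with stub components.
result = perq_answer(
    "who founded acme corp",
    kbqa=lambda q: "unknown",
    will_fail=lambda q: 0.9,
    explain=lambda q: "entity-linking error",
    rectify=lambda q, r: "Jane Doe" if r == "entity-linking error" else None,
)
print(result)
```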
{"title":"PERQ","authors":"Zhiyong Wu, B. Kao, Tien-Hsuan Wu, Pengcheng Yin, Qun Liu","doi":"10.1145/3336191.3371782","DOIUrl":"https://doi.org/10.1145/3336191.3371782","url":null,"abstract":"A knowledge-based question-answering (KB-QA) system is one that answers natural-language questions by accessing information stored in a knowledge base (KB). Existing KB-QA systems generally register an accuracy of 70-80% for simple questions and less for more complex ones. We observe that certain questions are intrinsically difficult to answer correctly with existing systems. We propose the PERQ framework to address this issue. Given a question q, we perform three steps to boost answer accuracy: (1) (Prediction) We predict if q can be answered correctly by a KB-QA system S. (2) (Explanation) If S is predicted to fail q, we analyze them to determine the most likely reasons of the failure. (3) (Rectification) We use the prediction and explanation results to rectify the answer. We put forward tools to achieve the three steps and analyze their effectiveness. Our experiments show that the PERQ framework can significantly improve KB-QA systems' accuracies over simple questions.","PeriodicalId":319008,"journal":{"name":"Proceedings of the 13th International Conference on Web Search and Data Mining","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122844585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3371845
Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, Hao Yang
Learning node representations in graphs is important for many applications such as link prediction, node classification, and community detection. Existing graph representation learning methods primarily target static graphs while many real-world graphs evolve over time. Complex time-varying graph structures make it challenging to learn informative node representations over time. We present Dynamic Self-Attention Network (DySAT), a novel neural architecture that learns node representations to capture dynamic graph structural evolution. Specifically, DySAT computes node representations through joint self-attention along the two dimensions of structural neighborhood and temporal dynamics. Compared with state-of-the-art recurrent methods modeling graph evolution, dynamic self-attention is efficient, while achieving consistently superior performance. We conduct link prediction experiments on two graph types: communication networks and bipartite rating networks. Experimental results demonstrate significant performance gains for DySAT over several state-of-the-art graph embedding baselines, in both single and multi-step link prediction tasks. Furthermore, our ablation study validates the effectiveness of jointly modeling structural and temporal self-attention.
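The sketch below illustrates the two attention dimensions described in the abstract: structural self-attention within each snapshot (masked to a node's neighbors) followed by causal temporal self-attention across snapshots. It is a toy stand-in built on `torch.nn.MultiheadAttention`, not the authors' architecture; the dense adjacency masks, the tiny dimensions, and the class name `TinyDySAT` are simplifying assumptions.

```python
# A compact structural-then-temporal self-attention sketch for dynamic graphs.
import torch
import torch.nn as nn

class TinyDySAT(nn.Module):
    def __init__(self, in_dim, hid_dim, heads=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)
        self.struct_attn = nn.MultiheadAttention(hid_dim, heads, batch_first=True)
        self.temp_attn = nn.MultiheadAttention(hid_dim, heads, batch_first=True)

    def forward(self, feats, adjs):
        # feats: (T, N, in_dim) node features per snapshot
        # adjs:  (T, N, N) boolean adjacency per snapshot (True = edge)
        T, N, _ = feats.shape
        h = self.proj(feats)                                   # (T, N, hid)
        # Structural self-attention within each snapshot: a node attends only
        # to itself and its neighbors (True in the mask = blocked position).
        blocked = ~(adjs | torch.eye(N, dtype=torch.bool).expand(T, N, N))
        nh = self.struct_attn.num_heads
        h, _ = self.struct_attn(h, h, h,
                                attn_mask=blocked.repeat_interleave(nh, dim=0))
        # Temporal self-attention per node across snapshots, causally masked.
        h = h.transpose(0, 1)                                  # (N, T, hid)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h, _ = self.temp_attn(h, h, h, attn_mask=causal)
        return h                                               # (N, T, hid)

model = TinyDySAT(in_dim=4, hid_dim=8)
feats = torch.randn(3, 5, 4)                 # 3 snapshots, 5 nodes, 4 features
adjs = torch.rand(3, 5, 5) > 0.5
adjs = adjs | adjs.transpose(1, 2)           # symmetrize the random adjacency
print(model(feats, adjs).shape)              # torch.Size([5, 3, 8])
```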
Citations: 317
Hybrid Utility Function for Unexpected Recommendations
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3372183
P. Li
Unexpectedness constitutes an important factor for recommender systems to improve user satisfaction and avoid filter-bubble issues. In this proposal, we propose to provide unexpected recommendations using a hybrid utility function that mixes estimated ratings, unexpectedness, relevance, and annoyance. We plan to conduct extensive experiments to validate the superiority of the proposed method.
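As a toy illustration of such a mixture, the sketch below combines the four signals named in the abstract into a single score and ranks candidate items by it. The linear form, the weights, the example values, and the helper name `hybrid_utility` are assumptions; the proposal's actual utility function is not specified in this abstract.

```python
# A toy hybrid utility score over the four signals from the abstract.
def hybrid_utility(est_rating, unexpectedness, relevance, annoyance,
                   w_rating=0.4, w_unexp=0.3, w_rel=0.3, w_annoy=0.2):
    """Higher is better; annoyance enters as a penalty."""
    return (w_rating * est_rating
            + w_unexp * unexpectedness
            + w_rel * relevance
            - w_annoy * annoyance)

# Rank two candidate items for a user (all signals normalized to [0, 1]).
items = {"item_a": (0.9, 0.1, 0.8, 0.0),   # relevant but unsurprising
         "item_b": (0.7, 0.8, 0.6, 0.1)}   # less relevant, more unexpected
ranked = sorted(items, key=lambda i: hybrid_utility(*items[i]), reverse=True)
print(ranked)
```

In a real system the weights would presumably be learned or tuned per user rather than fixed constants.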
Citations: 1
Sampling Subgraphs with Guaranteed Treewidth for Accurate and Efficient Graphical Inference
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3371815
Jaemin Yoo, U. Kang, Mauro Scanagatta, Giorgio Corani, Marco Zaffalon
How can we run graphical inference on large graphs efficiently and accurately? Many real-world networks are modeled as graphical models, and graphical inference is fundamental to understanding the properties of those networks. In this work, we propose a novel approach for fast and accurate inference, which first samples a small subgraph and then runs inference over the subgraph instead of the given graph. This is done by bounded treewidth (BTW) sampling, our novel algorithm that generates a subgraph with guaranteed bounded treewidth while retaining as many edges as possible. We first analyze the properties of BTW theoretically. Then, we evaluate our approach on node classification and compare it with the baseline, which runs loopy belief propagation (LBP) on the original graph. Our approach can be coupled with various inference algorithms: it achieves up to 13.7% higher accuracy with the junction tree algorithm and allows up to 23.8 times faster inference with LBP. We further compare BTW with previous graph sampling algorithms and show that it gives the best accuracy.
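The following sketch conveys only the general idea of "keep as many edges as possible subject to a treewidth bound": it greedily adds edges and checks the bound with networkx's treewidth approximation, which is an upper-bound heuristic, so a passing check soundly (if conservatively) guarantees the bound. It is not the BTW algorithm from the paper, and the helper name `bounded_treewidth_subgraph` is illustrative.

```python
# Greedy edge addition under an (approximate) treewidth bound.
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

def bounded_treewidth_subgraph(G, k):
    """Return a spanning subgraph of G whose treewidth is at most k."""
    H = nx.Graph()
    H.add_nodes_from(G.nodes())
    # Start from a spanning forest (trees have treewidth 1), then greedily add
    # remaining edges whenever the approximate treewidth stays within k.
    H.add_edges_from(nx.minimum_spanning_edges(G, data=False))
    for u, v in G.edges():
        if H.has_edge(u, v):
            continue
        H.add_edge(u, v)
        width, _ = treewidth_min_degree(H)     # upper bound on treewidth
        if width > k:
            H.remove_edge(u, v)                # adding this edge may break the bound
    return H

G = nx.karate_club_graph()
H = bounded_treewidth_subgraph(G, k=2)
print(G.number_of_edges(), H.number_of_edges(), treewidth_min_degree(H)[0])
```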
Citations: 10
Text Recognition Using Anonymous CAPTCHA Answers
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3371795
Alexander Shishkin, Anastasya A. Bezzubtseva, Valentina Fedorova, Alexey Drutsa, Gleb Gusev
Internet companies use crowdsourcing to collect the large amounts of data needed for creating products based on machine learning techniques. A significant source of such labels for OCR data sets is (re)CAPTCHA, which distinguishes humans from automated bots by asking them to recognize text and, at the same time, receives new labeled data in this way. An important component of this approach to data collection is reducing the noisy labels produced by bots and non-qualified users. In this paper, we address the problem of labeling text images via CAPTCHA, where user identification is generally impossible. We propose a new algorithm to aggregate multiple guesses collected through CAPTCHA. We employ incremental relabeling to minimize the number of guesses needed to obtain recognized text of good accuracy. The aggregation model and the stopping rule for our incremental relabeling are based on novel machine learning techniques and use meta-features of CAPTCHA tasks and accumulated guesses. Our experiments show that our approach can provide a large amount of accurately recognized text using a minimal number of user guesses. Finally, we report significant improvements to an optical character recognition model after implementing our approach at Yandex.
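A minimal sketch of incremental aggregation with a stopping rule is shown below: plain majority voting that stops once the leading answer reaches a confidence threshold. The learned aggregation model and meta-features from the paper are replaced here by simple counting; the thresholds and the helper name `aggregate_guesses` are illustrative assumptions.

```python
# Incremental aggregation of CAPTCHA guesses with a simple stopping rule.
from collections import Counter

def aggregate_guesses(guess_stream, min_guesses=3, confidence=0.75, max_guesses=10):
    """Consume guesses one by one; stop once the leading answer is confident enough."""
    counts = Counter()
    for n, guess in enumerate(guess_stream, start=1):
        counts[guess.strip().lower()] += 1       # light normalization
        top, top_count = counts.most_common(1)[0]
        if n >= min_guesses and top_count / n >= confidence:
            return top, n                        # recognized text, guesses used
        if n >= max_guesses:
            return top, n                        # give up and return the best so far
    if not counts:
        return None, 0
    top, _ = counts.most_common(1)[0]
    return top, sum(counts.values())

text, used = aggregate_guesses(iter(["morning", "morninq", "morning", "morning"]))
print(text, used)   # morning 4  (3/4 = 0.75 reaches the threshold at the 4th guess)
```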
Citations: 0
Learning with Small Data
Pub Date : 2020-01-20 DOI: 10.1145/3336191.3371874
Z. Li, Huaxiu Yao, Fenglong Ma
In the era of big data, it is easy to collect huge amounts of image and text data. However, we frequently face real-world problems with only small (labeled) data in some domains, such as healthcare and urban computing. The challenge is how to make machine learning algorithms still work well with small data. To address this challenge, in this tutorial we cover state-of-the-art machine learning techniques for handling the small-data issue. In particular, we focus on the following three aspects: (1) providing a comprehensive review of recent advances in exploring the power of knowledge transfer, especially focusing on meta-learning; (2) introducing cutting-edge techniques for incorporating human/expert knowledge into machine learning models; and (3) identifying the open challenges to data augmentation techniques, such as generative adversarial networks. We believe this is an emerging and potentially high-impact topic in computational data science that will attract both researchers and practitioners from academia and industry.
Citations: 12