
Proceedings of the 30th ACM International Conference on Information & Knowledge Management: Latest Articles

Jura
Zhengqi Xu, Yixuan Cao, Rongyu Cao, Guoxiang Li, Xuanqiang Liu, Yan Pang, Yangbin Wang, Jianfei Zhang, Allie Cheung, Matthew Tam, Lukas Petrikas, Ping Luo
The initial public offering (IPO) market in Hong Kong is consistently one of the largest in the world. As part of its regulatory responsibilities, Hong Kong Exchanges and Clearing Limited (HKEX) reviews annual reports published by listed companies (issuers). The number of issuers has grown at a fast pace, reaching 2,538 as of the end of 2020. This poses a challenge for manually reviewing these annual reports against the many diverse regulatory obligations (listing rules). We propose a system named Jura to improve the efficiency of annual report reviewing with the help of machine learning methods. This system checks the compliance of an issuer's published information against listing rules in four steps: panoptic document recognition, relevant passage location, fine-grained information extraction, and compliance assessment. This paper introduces in detail the passage location step, how it is critical for speeding up compliance assessment, and the various challenges faced. We argue that although a passage is a relatively independent unit, it needs to be combined with document structure and contextual information to accurately locate the relevant passages. With the help of Jura, HKEX reports saving 80% of the time on reviewing issuers' annual reports.
DOI: https://doi.org/10.1145/3459637.3481929 (published 2021-10-26)
Citations: 2
Scalable Contrast Pattern Mining over Data Streams
E. Chavary, S. Erfani, C. Leckie
Incremental contrast pattern mining (CPM) is an important task in various fields such as network traffic analysis, medical diagnosis, and customer behavior analysis. Due to increases in the speed and dimension of data streams, a major challenge for CPM is to deal with the huge number of generated candidate patterns. While there are some works on incremental CPM, their approaches are not scalable in dense and high dimensional data streams, and the problem of CPM over an evolving dataset is an open challenge. In this work we focus on extracting the most specific set of contrast patterns (CPs) to discover significant changes between two data streams. We devise a novel algorithm to extract CPs using previously mined patterns instead of generating all patterns in each window from scratch. Our experimental results on a wide variety of datasets demonstrate the advantages of our approach over the state of the art in terms of efficiency.
DOI: https://doi.org/10.1145/3459637.3482174 (published 2021-10-26)
Citations: 0
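As a rough illustration of what a contrast pattern is in the Chavary et al. abstract above, the sketch below scores itemsets by growth rate, i.e. the ratio of an itemset's support in one window to its support in another. It is a brute-force baseline for intuition only, not the incremental algorithm the authors propose; the function names, the `min_growth` threshold, and the toy transactions are all assumptions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def contrast_patterns(window_a, window_b, max_size=2, min_growth=2.0):
    """Return itemsets whose support in window_b is at least `min_growth`
    times their support in window_a (brute force, for illustration only)."""
    items = sorted(set().union(*window_a, *window_b))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup_a = support(itemset, window_a)
            sup_b = support(itemset, window_b)
            if sup_b == 0:
                continue
            # An itemset absent from window_a has an infinite growth rate.
            growth = float("inf") if sup_a == 0 else sup_b / sup_a
            if growth >= min_growth:
                found[itemset] = growth
    return found


# Hypothetical toy windows: each transaction is a set of observed items.
old_window = [{"tcp", "http"}, {"tcp", "ssh"}, {"udp", "dns"}, {"tcp", "http"}]
new_window = [{"udp", "dns"}, {"udp", "dns"}, {"udp", "ntp"}, {"tcp", "http"}]
patterns = contrast_patterns(old_window, new_window)
```

Enumerating every candidate in every window is exactly the cost that grows out of control on dense, high-dimensional streams, which is the motivation the abstract gives for reusing previously mined patterns.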
A Study of Explainability Features to Scrutinize Faceted Filtering Results
Jiaming Qu, Jaime Arguello, Yue Wang
Faceted search systems enable users to filter results by selecting values along different dimensions or facets. Traditionally, facets have corresponded to properties of information items that are part of the document metadata. Recently, faceted search systems have begun to use machine learning to automatically associate documents with facet-values that are more subjective and abstract. Examples include search systems that support topic-based filtering of research articles, concept-based filtering of medical documents, and tag-based filtering of images. While machine learning can be used to infer facet-values when the collection is too large for manual annotation, machine-learned classifiers make mistakes. In such cases, it is desirable to have a scrutable system that explains why a filtered result is relevant to a facet-value. Such explanations are missing from current systems. In this paper, we investigate how explainability features can help users interpret results filtered using machine-learned facets. We consider two explainability features: (1) showing prediction confidence values and (2) highlighting rationale sentences that played an influential role in predicting a facet-value. We report on a crowdsourced study involving 200 participants. Participants were asked to scrutinize movie plot summaries predicted to satisfy multiple genres and indicate their agreement or disagreement with the system. Participants were exposed to four interface conditions. We found that both explainability features had a positive impact on participants' perceptions and performance. While both features helped, the sentence-highlighting feature played a more instrumental role in enabling participants to reject false positive cases. We discuss implications for designing tools to help users scrutinize automatically assigned facet-values.
DOI: https://doi.org/10.1145/3459637.3482409 (published 2021-10-26)
Citations: 1
VidLife: A Dataset for Life Event Extraction from Videos
Tai-Te Chu, An-Zi Yen, Wei-Hong Ang, Hen-Hsen Huang, Hsin-Hsi Chen
Filming video blogs (vlogs) has become a popular way for people to record their life experiences in recent years. In this work, we present a novel task that is aimed at extracting life events from videos and constructing personal knowledge bases of individuals. In contrast to most existing research in the field of computer vision, which focuses on identifying low-level script-like activities such as moving boxes, our goal is to extract life events where high-level activities like moving into a new house are recorded. The challenges to be tackled include: (1) identifying which objects in a given scene relate to the life events of the protagonist of interest, and (2) determining the association between an extracted visual concept and a higher-level description of a video clip. To address these research issues, we construct a video life event extraction dataset, VidLife, by exploiting videos from the TV series The Big Bang Theory, whose plot revolves around the daily lives of several characters. A pilot multitask learning model is proposed to extract life events from given video clips and subtitles for storage in the personal knowledge base.
DOI: https://doi.org/10.1145/3459637.3482022 (published 2021-10-26)
Citations: 2
Seq2Bubbles
Qitian Wu, Chenxiao Yang, Shuodian Yu, Xiaofeng Gao, Guihai Chen
User behavior sequences contain rich information about user interests and are exploited to predict users' future clicks in sequential recommendation. Existing approaches, especially recently proposed deep learning models, often embed a sequence of clicked items into a single vector, i.e., a point in vector space, which suffers from limited expressiveness for complex distributions of user interests with multi-modality and heterogeneous concentration. In this paper, we propose a new representation model, named Seq2Bubbles, for sequential user behaviors via embedding an input sequence into a set of bubbles, each of which is represented by a center vector and a radius vector in embedding space. The bubble embedding can effectively identify and accommodate multi-modal user interests and diverse concentration levels. Furthermore, we design an efficient scheme to compute the distance between a target item and the bubble embedding of a user sequence to achieve next-item recommendation. We also develop a self-supervised contrastive loss based on our bubble embeddings as an effective regularization approach. Extensive experiments on four benchmark datasets demonstrate that our bubble embedding can consistently outperform state-of-the-art sequential recommendation models.
DOI: https://doi.org/10.1145/3459637.3482296 (published 2021-10-26)
Citations: 10
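The Seq2Bubbles abstract above describes each bubble by a center vector and a radius vector, and scores a target item by its distance to the user's set of bubbles. One plausible reading can be sketched as follows. The per-dimension clipping rule and the "nearest bubble" aggregation are assumptions made for illustration; the abstract does not specify the authors' actual distance scheme, and the toy bubbles are hypothetical.

```python
import math


def bubble_distance(item, center, radius):
    """Distance from `item` to the surface of one bubble.

    Assumption: each dimension has its own radius, and a coordinate that
    falls inside [center - radius, center + radius] contributes zero.
    """
    excess = [max(abs(x - c) - r, 0.0) for x, c, r in zip(item, center, radius)]
    return math.sqrt(sum(e * e for e in excess))


def distance_to_bubbles(item, bubbles):
    """Distance from `item` to the nearest bubble in a user's bubble set."""
    return min(bubble_distance(item, c, r) for c, r in bubbles)


# Hypothetical user with two interests: a tight one and a diffuse one.
user_bubbles = [
    ([0.0, 0.0], [0.5, 0.5]),  # concentrated interest
    ([5.0, 5.0], [2.0, 1.0]),  # broader interest
]
inside = distance_to_bubbles([4.0, 5.5], user_bubbles)
outside = distance_to_bubbles([0.0, 3.5], user_bubbles)
```

Under these assumptions, an item inside any bubble has distance zero, so a user with several distinct interests is not forced onto a single averaged point, which is the limitation of single-vector embeddings that the abstract points out.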
Adversarial Kernel Sampling on Class-imbalanced Data Streams
Peng Yang, Ping Li
This paper investigates online active learning in the setting of class-imbalanced data streams, where labels can be queried only within a limited budget. In this setup, conventional learning would be biased towards majority classes and consequently harm the performance. To address this issue, imbalance learning techniques adopt both asymmetric losses and asymmetric queries to tackle the imbalance. Although this approach is effective, it may not guarantee the performance in an adversarial setting where the actual labels are unknown and may be chosen by the adversary. To learn a promising hypothesis in a class-imbalanced and adversarial environment, we propose an asymmetric min-max optimization framework for online classification. The derived algorithm can track the imbalance and bound the choices of an adversary simultaneously. Despite the promising result, this algorithm assumes that the label is provided for every input, while labels are scarce and labeling is expensive in real-world applications. To this end, we design a confidence-based sampling strategy to query the informative labels within a budget. We theoretically analyze this algorithm in terms of its mistake bound and two asymmetric measures. Empirically, we evaluate the algorithms on multiple real-world imbalanced tasks. Promising results are obtained on various application domains.
DOI: https://doi.org/10.1145/3459637.3482227 (published 2021-10-26)
Citations: 2
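The confidence-based, budgeted label querying mentioned in the Yang and Li abstract above can be illustrated with a generic margin-based selective sampling rule, where the chance of requesting a label shrinks as the prediction margin grows. This is a standard textbook-style rule used here as a stand-in, not the paper's algorithm; the constants `b_pos` and `b_neg`, the choice to query predicted positives more readily, and the sample margins are assumptions.

```python
import random


def query_probability(margin, b_pos=1.0, b_neg=0.2):
    """Probability of asking for a label given the signed prediction margin.

    Low-confidence predictions (small |margin|) are queried more often.
    Assumption: predicted positives (taken to be the minority class) get a
    larger smoothing constant, so they are queried more readily.
    """
    b = b_pos if margin >= 0 else b_neg
    return b / (b + abs(margin))


def select_queries(margins, budget, seed=0):
    """Return indices of stream items whose labels are queried, stopping
    once `budget` labels have been spent."""
    rng = random.Random(seed)
    queried = []
    for i, margin in enumerate(margins):
        if len(queried) >= budget:
            break
        if rng.random() < query_probability(margin):
            queried.append(i)
    return queried


# Hypothetical margins produced by some online classifier on a stream.
stream_margins = [0.05, -2.0, 1.5, -0.1, 0.0, -3.0, 0.4]
picked = select_queries(stream_margins, budget=3)
```

With these constants, a prediction at margin 1.0 is queried with probability 0.5, while one at margin -1.0 is queried with probability of roughly 0.17, which shows how asymmetric queries can spend more of the budget on the rarer class.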
Learning to Augment Imbalanced Data for Re-ranking Models
Zimeng Qiu, Yingchun Jian, Qingguo Chen, Lijun Zhang
The conventional solution to learning-to-rank problems greedily ranks individual documents by their prediction scores. Recently emerged re-ranking models, which take initial lists as input, aim to capture document interdependencies and directly generate the optimal ordered lists. Typically, a re-ranking model is learned from a set of labeled data, which can achieve favorable performance on average. However, it can be suboptimal for individual queries because the available training data is usually highly imbalanced. This problem is challenging due to the absence of informative data for some queries and, furthermore, the lack of a good data augmentation policy. In this paper, we propose a novel method named Learning to Augment (LTA), which mitigates the imbalance issue through learning to augment the initial lists for re-ranking models. Specifically, we first design a data generation model based on a Gaussian Mixture Variational Autoencoder (GMVAE) for generating informative data. GMVAE imposes a mixture of Gaussians on the latent space, which allows it to cluster queries in an unsupervised manner and then generate new data with different query types using the learned components. Then, to obtain a good augmentation strategy (instead of heuristics), we design a teacher model that consists of two intelligent agents to determine how to generate new data for a given list and how to rank both the raw data and generated data to produce augmented lists, respectively. The teacher model leverages the feedback from the re-ranking model to optimize its augmentation policy by means of reinforcement learning. Our method offers a general learning paradigm that is applicable to both supervised and reinforced re-ranking models. Experimental results on benchmark learning-to-rank datasets show that our proposed method can significantly improve the performance of re-ranking models.
DOI: https://doi.org/10.1145/3459637.3482364 (published 2021-10-26)
Citations: 1
FairER
Vasilis Efthymiou, K. Stefanidis, E. Pitoura, V. Christophides
There is an urgent call to detect and prevent "biased data" at the earliest possible stage of the data pipelines used to build automated decision-making systems. In this paper, we focus on controlling data bias in entity resolution (ER) tasks, which aim to discover and unify records/descriptions from different data sources that refer to the same real-world entity. We formally define the ER problem with fairness constraints ensuring that all groups of entities have similar chances to be resolved. Then, we introduce FairER, a greedy algorithm for solving this problem for fairness criteria based on equal matching decisions. Our experiments show that FairER achieves similar or higher accuracy compared with two baseline methods over 7 datasets, while guaranteeing minimal bias.
DOI: https://doi.org/10.1145/3459637.3482105 (published 2021-10-26)
Citations: 6
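To make the idea of a greedy matcher with a fairness constraint in the FairER abstract above more concrete, the sketch below selects the top-k candidate matches while alternating between a protected and a non-protected group. It is only one simple way to balance matching decisions and should not be read as the FairER algorithm itself; the tuple layout, the alternation rule, and the example scores are assumptions.

```python
def fair_top_k(candidates, k):
    """Greedily pick k candidate matches, alternating between the protected
    and non-protected groups so neither dominates the result.

    `candidates` is a list of (pair_id, score, is_protected) tuples.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    protected = [c for c in ranked if c[2]]
    others = [c for c in ranked if not c[2]]
    chosen = []
    take_protected = True  # assumption: start with the protected group
    while len(chosen) < k and (protected or others):
        # Fall back to the other queue when the preferred one is empty.
        use_protected = (take_protected and protected) or not others
        queue = protected if use_protected else others
        chosen.append(queue.pop(0))
        take_protected = not take_protected
    return chosen


# Hypothetical scored candidate pairs from some matching function.
scored_pairs = [
    ("a1-b1", 0.95, False),
    ("a2-b2", 0.90, False),
    ("a3-b3", 0.85, False),
    ("a4-b4", 0.70, True),
    ("a5-b5", 0.60, True),
]
result = fair_top_k(scored_pairs, 4)
```

A purely score-ordered top-4 on this toy input would contain only one protected pair; the alternating rule yields two, at the cost of skipping one higher-scoring non-protected pair.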
Will Sorafenib Help?: Treatment-aware Reranking in Precision Medicine Search
Maciej Rybiński, Sarvnaz Karimi
High-quality evidence from the biomedical literature is crucial for the decision making of oncologists who treat cancer patients. Searching for evidence on a specific treatment for a patient is the challenge set by the precision medicine track of TREC in 2020. To address this challenge, we propose a two-step method to incorporate treatment into the query formulation and ranking. Training such a ranking function uses a zero-shot setup to incorporate the novel focus on treatments, which did not exist in any of the previous TREC tracks. Our treatment-aware neural reranking approach, FAT, achieves state-of-the-art effectiveness for TREC Precision Medicine 2020. Our analysis indicates that the BERT-based rerankers automatically learn to score documents through identifying concepts relevant to precision medicine, similar to the hand-crafted heuristics that were successful in earlier studies.
DOI: https://doi.org/10.1145/3459637.3482220 (published 2021-10-26)
Citations: 3
Misbeliefs and Biases in Health-Related Searches
Alexander Bondarenko, Ekaterina Shirshakova, M. Driker, Matthias Hagen, Pavel Braslavski
The quality of search engine results returned for health-related questions is critical, since a searcher may directly trust any suggestion in the top results. We analyze search questions that mention diseases or symptoms together with remedies and that reflect potential health-related misbeliefs. Using lists of medical and alternative medicine terms, we extract health-related search questions from 1.5 billion questions submitted to Yandex. As an initial study, we sample 30 frequent questions that contain a disease-remedy pair, such as "Can hepatitis be cured with milk thistle?". For each question, we carefully identify a ground truth answer in the medical literature and annotate the top-10 Yandex search result snippets as confirming the belief, rejecting it, or giving no answer. Our analysis shows that about 44% of the snippets (which users may simply interpret as definitive answers!) confirm untrue beliefs and are thus wrong, and only a few include health risk warnings about using toxic plants.
DOI: https://doi.org/10.1145/3459637.3482141 (published 2021-10-26)
Citations: 5