
Proceedings of the 30th ACM International Conference on Information & Knowledge Management: Latest Publications

Representation Learning via Variational Bayesian Networks
Oren Barkan, Avi Caciularu, Idan Rejwan, Ori Katz, Jonathan Weill, Itzik Malkiel, Noam Koenigstein
We present Variational Bayesian Network (VBN) - a novel Bayesian entity representation learning model that utilizes hierarchical and relational side information and is particularly useful for modeling entities in the "long-tail", where the data is scarce. VBN provides better modeling for long-tail entities via two complementary mechanisms: First, VBN employs informative hierarchical priors that enable information propagation between entities sharing common ancestors. Additionally, VBN models explicit relations between entities that enforce complementary structure and consistency, guiding the learned representations towards a more meaningful arrangement in space. Second, VBN represents entities by densities (rather than vectors), hence modeling uncertainty that plays a complementary role in coping with data scarcity. Finally, we propose a scalable Variational Bayes optimization algorithm that enables fast approximate Bayesian inference. We evaluate the effectiveness of VBN on linguistic, recommendations, and medical inference tasks. Our findings show that VBN outperforms other existing methods across multiple datasets, and especially in the long-tail.
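A minimal numerical sketch of the density-plus-hierarchical-prior idea (not the authors' model; the toy item hierarchy, dimensions, and variances are made up): entities are diagonal Gaussians, and the KL term a variational objective would add pulls each entity's posterior toward its ancestor's prior, which matters most for long-tail entities with little data.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in closed form."""
    return float(np.sum(0.5 * (np.log(var_p / var_q)
                               + (var_q + (mu_q - mu_p) ** 2) / var_p
                               - 1.0)))

rng = np.random.default_rng(0)
dim = 8
# Hypothetical hierarchy: two long-tail items share a "category" ancestor prior.
category_mu, category_var = rng.normal(size=dim), np.ones(dim)
items = {
    "item_a": (category_mu + 0.1 * rng.normal(size=dim), 0.5 * np.ones(dim)),
    "item_b": (category_mu + 2.0 * rng.normal(size=dim), 0.5 * np.ones(dim)),
}

# The hierarchical-prior term penalizes posteriors that drift far from their
# ancestor; item_b pays a much larger KL than item_a.
for name, (mu, var) in items.items():
    print(name, round(kl_diag_gaussians(mu, var, category_mu, category_var), 3))
```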
{"title":"Representation Learning via Variational Bayesian Networks","authors":"Oren Barkan, Avi Caciularu, Idan Rejwan, Ori Katz, Jonathan Weill, Itzik Malkiel, Noam Koenigstein","doi":"10.1145/3459637.3482363","DOIUrl":"https://doi.org/10.1145/3459637.3482363","url":null,"abstract":"We present Variational Bayesian Network (VBN) - a novel Bayesian entity representation learning model that utilizes hierarchical and relational side information and is particularly useful for modeling entities in the \"long-tail'', where the data is scarce. VBN provides better modeling for long-tail entities via two complementary mechanisms: First, VBN employs informative hierarchical priors that enable information propagation between entities sharing common ancestors. Additionally, VBN models explicit relations between entities that enforce complementary structure and consistency, guiding the learned representations towards a more meaningful arrangement in space. Second, VBN represents entities by densities (rather than vectors), hence modeling uncertainty that plays a complementary role in coping with data scarcity. Finally, we propose a scalable Variational Bayes optimization algorithm that enables fast approximate Bayesian inference. We evaluate the effectiveness of VBN on linguistic, recommendations, and medical inference tasks. Our findings show that VBN outperforms other existing methods across multiple datasets, and especially in the long-tail.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129745227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
Distilling Knowledge from BERT into Simple Fully Connected Neural Networks for Efficient Vertical Retrieval
Peiyang Liu, Xi Wang, Lin Wang, Wei Ye, Xiangyu Xi, Shikun Zhang
Distilled BERT models are more suitable for efficient vertical retrieval in online sponsored vertical search with low-latency requirements than BERT due to fewer parameters and faster inference. Unfortunately, most of these models are still far from ideal inference speed. This paper presents a novel and effective method to distill knowledge from BERT into simple fully connected neural networks (FNN). Results of extensive experiments on English and Chinese datasets demonstrate that our method achieves comparable results with existing distilled BERT models while the inference is accelerated by more than ten times. We have successfully applied our method on our online sponsored vertical search engine and get remarkable improvements.
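Knowledge distillation of this kind is usually framed as a blend of soft (temperature-scaled teacher) and hard (gold label) targets. The sketch below is a generic formulation, not the paper's exact loss, with made-up logits standing in for BERT-teacher and FNN-student relevance scores.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """Blend of hard-label cross-entropy and soft-label (temperature T) cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    soft = -np.sum(p_teacher * np.log(p_student_T + 1e-12)) * (T ** 2)
    hard = -np.log(softmax(student_logits)[label] + 1e-12)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical 3-class relevance scores from a BERT teacher and an FNN student.
print(distillation_loss(student_logits=[1.2, 0.3, -0.5],
                        teacher_logits=[2.0, 0.1, -1.0],
                        label=0))
```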
{"title":"Distilling Knowledge from BERT into Simple Fully Connected Neural Networks for Efficient Vertical Retrieval","authors":"Peiyang Liu, Xi Wang, Lin Wang, Wei Ye, Xiangyu Xi, Shikun Zhang","doi":"10.1145/3459637.3481909","DOIUrl":"https://doi.org/10.1145/3459637.3481909","url":null,"abstract":"Distilled BERT models are more suitable for efficient vertical retrieval in online sponsored vertical search with low-latency requirements than BERT due to fewer parameters and faster inference. Unfortunately, most of these models are still far from ideal inference speed. This paper presents a novel and effective method to distill knowledge from BERT into simple fully connected neural networks (FNN). Results of extensive experiments on English and Chinese datasets demonstrate that our method achieves comparable results with existing distilled BERT models while the inference is accelerated by more than ten times. We have successfully applied our method on our online sponsored vertical search engine and get remarkable improvements.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128440569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
An RDF Data Management System for Conflict Casualties
Yad Fatah, Mark Nourallah, Lynn Wahab, Fatima K. Abu Salem, Shady Elbassuoni
In a world embroiled in armed conflicts, documenting conflict casualties is an important goal for many NGOs. Most of such documented records of casualties are however managed through internal databases, spreadsheets or Web forms. As such, exploring and querying such data becomes extremely chaotic. In this paper, we demonstrate CasualtIS, an RDF data management system for conflict casualties. Our system models conflict casualties data as RDF graphs and allows users to query such data using a SPARQL endpoint. Our system also includes a template-based natural-language querying interface to support non-expert users. Our system can be used for various purposes by end users, such as fact-checking certain claims about conflict casualties, aggregating casualties over time and location, and finding contextual information about casualties, such as the cause of death, actors involved, and other similar critical information. We demonstrate our system using two case studies, one related to casualties in the Iraqi war and the other related to casualties in the Syrian war.
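To make the RDF-graph-plus-SPARQL-endpoint idea concrete, here is a toy rdflib sketch with an invented ex: vocabulary (the real CasualtIS schema is not described in the abstract): a single casualty record is loaded and then aggregated by location, the kind of query the system's endpoint would answer.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Hypothetical namespace and schema, purely for illustration.
EX = Namespace("http://example.org/casualties#")
g = Graph()

record = URIRef("http://example.org/casualties/record/1")
g.add((record, RDF.type, EX.Casualty))
g.add((record, EX.location, Literal("Mosul")))
g.add((record, EX.causeOfDeath, Literal("shelling")))
g.add((record, EX.year, Literal(2017)))

# SPARQL aggregation: number of documented casualties per location.
q = """
PREFIX ex: <http://example.org/casualties#>
SELECT ?location (COUNT(?r) AS ?n)
WHERE { ?r a ex:Casualty ; ex:location ?location . }
GROUP BY ?location
"""
for row in g.query(q):
    print(row.location, row.n)
```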
{"title":"An RDF Data Management System for Conflict Casualties","authors":"Yad Fatah, Mark Nourallah, Lynn Wahab, Fatima K. Abu Salem, Shady Elbassuoni","doi":"10.1145/3459637.3481976","DOIUrl":"https://doi.org/10.1145/3459637.3481976","url":null,"abstract":"In a world embroiled in armed conflicts, documenting conflict casualties is an important goal for many NGOs. Most of such documented records of casualties are however managed through internal databases, spreadsheets or Web forms. As such, exploring and querying such data becomes extremely chaotic. In this paper, we demonstrate CasualtIS, an RDF data management system for conflict casualties. Our system models conflict casualties data as RDF graphs and allows users to query such data using a SPARQL endpoint. Our system also includes a template-based natural-language querying interface to support non-expert users. Our system can be used for various purposes by end users, such as fact-checking certain claims about conflict casualties, aggregating casualties over time and location, and finding contextual information about casualties, such as the cause of death, actors involved, and other similar critical information. We demonstrate our system using two case studies, one related to casualties in the Iraqi war and the other related to casualties in the Syrian war.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124546501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Influence Maximization With Co-Existing Seeds
R. Becker, Gianlorenzo D'angelo, Hugo Gilbert
In the classical influence maximization problem we aim to select a set of nodes, called seeds, to start an efficient information diffusion process. More precisely, the goal is to select seeds such that the expected number of nodes reached by the diffusion process is maximized. In this work we study a variant of this problem where an unknown (up to a probability distribution) set of nodes, referred to as co-existing seeds, joins in starting the diffusion process even if not selected. This setting allows to model that, in certain situations, some nodes are willing to act as "voluntary seeds" even if not chosen by the campaign organizer. This may for example be due to the positive nature of the information campaign (e.g., public health awareness programs, HIV prevention, financial aid programs), or due to external social driving effects (e.g., nodes are friends of selected seeds in real life or in other social media). In this setting, we study two types of optimization problems. While the first one aims to maximize the expected number of reached nodes, the second one endeavors to maximize the expected increment in the number of reached nodes in comparison to a non-intervention strategy. The problems (particularly the second one) are motivated by cooperative game theory. For various probability distributions on co-existing seeds, we obtain several algorithms with approximation guarantees as well as hardness and hardness of approximation results. We conclude with experiments that demonstrate the usefulness of our approach when co-existing seeds exist.
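A rough Monte Carlo simulation of the setting, assuming an independent-cascade diffusion model and a uniform per-node co-existing-seed probability (both assumptions of this sketch, not details from the paper; the approximation algorithms themselves are not reproduced):

```python
import random
import networkx as nx

def expected_spread(G, chosen_seeds, coexist_prob=0.05, edge_prob=0.1, runs=200, rng=None):
    """Estimate independent-cascade spread when, besides the chosen seeds, every other
    node independently joins as a 'co-existing' seed with probability coexist_prob."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(runs):
        active = set(chosen_seeds)
        active |= {v for v in G if v not in chosen_seeds and rng.random() < coexist_prob}
        frontier = list(active)
        while frontier:
            nxt = []
            for u in frontier:
                for v in G.neighbors(u):
                    if v not in active and rng.random() < edge_prob:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / runs

G = nx.erdos_renyi_graph(200, 0.03, seed=1)
print("expected reach with seeds {0,1,2}:", expected_spread(G, {0, 1, 2}))
```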
{"title":"Influence Maximization With Co-Existing Seeds","authors":"R. Becker, Gianlorenzo D'angelo, Hugo Gilbert","doi":"10.1145/3459637.3482439","DOIUrl":"https://doi.org/10.1145/3459637.3482439","url":null,"abstract":"In the classical influence maximization problem we aim to select a set of nodes, called seeds, to start an efficient information diffusion process. More precisely, the goal is to select seeds such that the expected number of nodes reached by the diffusion process is maximized. In this work we study a variant of this problem where an unknown (up to a probability distribution) set of nodes, referred to as co-existing seeds, joins in starting the diffusion process even if not selected. This setting allows to model that, in certain situations, some nodes are willing to act as \"voluntary seeds'' even if not chosen by the campaign organizer. This may for example be due to the positive nature of the information campaign (e.g., public health awareness programs, HIV prevention, financial aid programs), or due to external social driving effects (e.g., nodes are friends of selected seeds in real life or in other social media). In this setting, we study two types of optimization problems. While the first one aims to maximize the expected number of reached nodes, the second one endeavors to maximize the expected increment in the number of reached nodes in comparison to a non-intervention strategy. The problems (particularly the second one) are motivated by cooperative game theory. For various probability distributions on co-existing seeds, we obtain several algorithms with approximation guarantees as well as hardness and hardness of approximation results. We conclude with experiments that demonstrate the usefulness of our approach when co-existing seeds exist.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130364470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Spatio-Temporal-Social Multi-Feature-based Fine-Grained Hot Spots Prediction for Content Delivery Services in 5G Era
Shaoyuan Huang, Hengda Zhang, Xiaofei Wang, Min Chen, Jianxin Li, Victor C. M. Leung
The arrival of 5G networks has extensively promoted the growth of content delivery services (CDSs). Understanding and predicting the spatio-temporal distribution of CDSs are beneficial to mobile users, Internet Content Providers and carriers. Conventional methods for predicting the spatio-temporal distribution of CDSs are mostly base-stations (BSs) centric, leading to weak generalization and spatio coarse-grained. To improve the spatio accuracy and generalization of modeling, we propose user-centric methods for CDSs spatio-temporal analysis. With geocoding and spatio-temporal graphs modeling algorithms, CDSs records collected from mobile devices are modeled as dynamic graphs with spatio-temporal attributes. Moreover, we propose a spatio-temporal-social multi-feature extraction framework for spatio fine-grained CDSs hot spots prediction. Specifically, an edge-enhanced graph convolutional block is designed to encode CDSs information based on the social relations and the spatio dependence features. Besides, we introduce the Long Short Term Memory (LSTM) to further capture the temporal dependence. Experiments on two real-world CDSs datasets verified the effectiveness of the proposed framework, and ablation studies are taken to evaluate the importance of each feature.
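A toy PyTorch sketch of the overall architecture shape described above, one graph convolution per time step followed by an LSTM over the time axis; the plain A·X·W convolution, layer sizes, and identity adjacency are illustrative stand-ins, not the paper's edge-enhanced block.

```python
import torch
import torch.nn as nn

class HotSpotPredictor(nn.Module):
    """Toy spatio-temporal model: per-step graph convolution, then an LSTM per node."""
    def __init__(self, num_nodes, in_dim, hid_dim):
        super().__init__()
        self.gc = nn.Linear(in_dim, hid_dim)
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, 1)

    def forward(self, x, adj):
        # x: (batch, time, nodes, in_dim); adj: (nodes, nodes) row-normalized adjacency
        h = torch.relu(torch.einsum("nm,btmf->btnf", adj, self.gc(x)))
        b, t, n, f = h.shape
        h, _ = self.lstm(h.permute(0, 2, 1, 3).reshape(b * n, t, f))
        return self.out(h[:, -1]).reshape(b, n)   # one hot-spot score per node

model = HotSpotPredictor(num_nodes=5, in_dim=3, hid_dim=16)
x = torch.randn(2, 7, 5, 3)                       # 2 samples, 7 time steps, 5 nodes
adj = torch.eye(5)
print(model(x, adj).shape)                        # torch.Size([2, 5])
```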
{"title":"Spatio-Temporal-Social Multi-Feature-based Fine-Grained Hot Spots Prediction for Content Delivery Services in 5G Era","authors":"Shaoyuan Huang, Hengda Zhang, Xiaofei Wang, Min Chen, Jianxin Li, Victor C. M. Leung","doi":"10.1145/3459637.3482298","DOIUrl":"https://doi.org/10.1145/3459637.3482298","url":null,"abstract":"The arrival of 5G networks has extensively promoted the growth of content delivery services (CDSs). Understanding and predicting the spatio-temporal distribution of CDSs are beneficial to mobile users, Internet Content Providers and carriers. Conventional methods for predicting the spatio-temporal distribution of CDSs are mostly base-stations (BSs) centric, leading to weak generalization and spatio coarse-grained. To improve the spatio accuracy and generalization of modeling, we propose user-centric methods for CDSs spatio-temporal analysis. With geocoding and spatio-temporal graphs modeling algorithms, CDSs records collected from mobile devices are modeled as dynamic graphs with spatio-temporal attributes. Moreover, we propose a spatio-temporal-social multi-feature extraction framework for spatio fine-grained CDSs hot spots prediction. Specifically, an edge-enhanced graph convolutional block is designed to encode CDSs information based on the social relations and the spatio dependence features. Besides, we introduce the Long Short Term Memory (LSTM) to further capture the temporal dependence. Experiments on two real-world CDSs datasets verified the effectiveness of the proposed framework, and ablation studies are taken to evaluate the importance of each feature.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123981133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Disentangling Preference Representations for Recommendation Critiquing with β-VAE
Preksha Nema, Alexandros Karatzoglou, Filip Radlinski
Modern recommender systems usually embed users and items into a learned vector space representation. Similarity in this space is used to generate recommendations, and recommendation methods are agnostic to the structure of the embedding space. Motivated by the need for recommendation systems to be more transparent and controllable, we postulate that it is beneficial to assign meaning to some of the dimensions of user and item representations. Disentanglement is one technique commonly used for this purpose. We present a novel supervised disentangling approach for recommendation tasks. Our model learns embeddings where attributes of interest are disentangled, while requiring only a very small number of labeled items at training time. The model can then generate interactive and critiquable recommendations for all users, without requiring any labels at recommendation time, and without sacrificing any recommendation performance. Our approach thus provides users with levers to manipulate, critique and fine-tune recommendations, and gives insight into why particular recommendations are made. Given only user-item interactions at recommendation time, we show that it identifies user tastes with respect to the attributes that have been disentangled, allowing for users to manipulate recommendations across these attributes.
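For reference, the generic β-VAE objective the title alludes to is reconstruction error plus a β-weighted KL to a standard normal prior; a larger β pressures latent dimensions toward disentanglement. The sketch below computes that objective on synthetic values and does not include the paper's supervised disentangling term.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Per-example beta-VAE objective: reconstruction error + beta * KL(q(z|x) || N(0, I))."""
    recon = np.sum((x - x_recon) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + beta * kl

rng = np.random.default_rng(0)
x = rng.normal(size=16)
print(beta_vae_loss(x, x_recon=x * 0.9, mu=rng.normal(size=4), log_var=np.zeros(4)))
```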
{"title":"Disentangling Preference Representations for Recommendation Critiquing with ß-VAE","authors":"Preksha Nema, Alexandros Karatzoglou, Filip Radlinski","doi":"10.1145/3459637.3482425","DOIUrl":"https://doi.org/10.1145/3459637.3482425","url":null,"abstract":"Modern recommender systems usually embed users and items into a learned vector space representation. Similarity in this space is used to generate recommendations, and recommendation methods are agnostic to the structure of the embedding space. Motivated by the need for recommendation systems to be more transparent and controllable, we postulate that it is beneficial to assign meaning to some of the dimensions of user and item representations. Disentanglement is one technique commonly used for this purpose. We presenta novel supervised disentangling approach for recommendation tasks. Our model learns embeddings where attributes of interest are disentangled, while requiring only a very small number of labeled items at training time. The model can then generate interactive and critiquable recommendations for all users, without requiring any labels at recommendation time, and without sacrificing any recommendation performance. Our approach thus provides users with levers to manipulate, critique and fine-tune recommendations, and gives insight into why particular recommendations are made. Given only user-item interactions at recommendation time, we show that it identifies user tastes with respect to the attributes that have been disentangled, allowing for users to manipulate recommendations across these attributes.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124218305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 24
DSDD
Haoxiang Zhang, Aécio S. R. Santos, Juliana Freire
With the push for transparency and open data, many datasets and data repositories are becoming available on the Web. This opens new opportunities for data-driven exploration, from empowering analysts to answer new questions and obtain insights to improving predictive models through data augmentation. But as datasets are spread over a plethora of Web sites, finding data that are relevant for a given task is difficult. In this paper, we take a first step towards the construction of domain-specific data lakes. We propose an end-to-end dataset discovery system, targeted at domain experts, which given a small set of keywords, automatically finds potentially relevant datasets on the Web. The system makes use of search engines to hop across Web sites, uses online learning to incrementally build a model to recognize sites that contain datasets, utilizes a set of discovery actions to broaden the search, and applies a multi-armed bandit based algorithm to balance the trade-offs of different discovery actions. We report the results of an extensive experimental evaluation over multiple domains, and demonstrate that our strategy is effective and outperforms state-of-the-art content discovery methods.
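The multi-armed-bandit component of such a discovery loop can be pictured with a standard UCB1 policy over discovery actions; the action names and reward rates below are invented for illustration and are not the system's actual action set.

```python
import math
import random

class UCB1:
    """UCB1 bandit over a set of discovery actions."""
    def __init__(self, actions):
        self.actions = list(actions)
        self.counts = {a: 0 for a in self.actions}
        self.rewards = {a: 0.0 for a in self.actions}
        self.t = 0

    def select(self):
        self.t += 1
        for a in self.actions:                  # play each arm once first
            if self.counts[a] == 0:
                return a
        return max(self.actions,
                   key=lambda a: self.rewards[a] / self.counts[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, action, reward):
        self.counts[action] += 1
        self.rewards[action] += reward

bandit = UCB1(["keyword_search", "forward_crawl", "backward_search"])
rng = random.Random(0)
true_rate = {"keyword_search": 0.3, "forward_crawl": 0.1, "backward_search": 0.5}
for _ in range(500):
    a = bandit.select()
    bandit.update(a, 1.0 if rng.random() < true_rate[a] else 0.0)  # reward = found a relevant dataset
print(bandit.counts)   # the bandit concentrates on the most productive action
```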
{"title":"DSDD","authors":"Haoxiang Zhang, Aécio S. R. Santos, Juliana Freire","doi":"10.1145/3459637.3482427","DOIUrl":"https://doi.org/10.1145/3459637.3482427","url":null,"abstract":"With the push for transparency and open data, many datasets and data repositories are becoming available on the Web. This opens new opportunities for data-driven exploration, from empowering analysts to answer new questions and obtain insights to improving predictive models through data augmentation. But as datasets are spread over a plethora of Web sites, finding data that are relevant for a given task is difficult. In this paper, we take a first step towards the construction of domain-specific data lakes. We propose an end-to-end dataset discovery system, targeted at domain experts, which given a small set of keywords, automatically finds potentially relevant datasets on the Web. The system makes use of search engines to hop across Web sites, uses online learning to incrementally build a model to recognize sites that contain datasets, utilizes a set of discovery actions to broaden the search, and applies a multi-armed bandit based algorithm to balance the trade-offs of different discovery actions. We report the results of an extensive experimental evaluation over multiple domains, and demonstrate that our strategy is effective and outperforms state-of-the-art content discovery methods.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114363184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Large-Scale Information Extraction under Privacy-Aware Constraints
Rajeev Gupta, Ranganath Kondapally
In this digital age, people spend a significant portion of their lives online and this has led to an explosion of personal data from users and their activities. Typically, this data is private and nobody else, except the user, is allowed to look at it. This poses interesting and complex challenges from scalable information extraction point of view: extracting information under privacy aware constraints where there is little data to learn from but need highly accurate models to run on large amount of data across different users. Anonymization of data is typically used to convert private data into publicly accessible data. But this may not always be feasible and may require complex differential privacy guarantees in order to be safe from any potential negative consequences. Other techniques involve building models on a small amount of seen (eyes-on) data and a large amount of unseen (eyes-off) data. In this tutorial, we use emails as representative private data to explain the concepts of scalable IE under privacy-aware constraints.
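One of the tools the tutorial alludes to, differential privacy, can be illustrated with the Laplace mechanism for a counting query. This is a generic sketch under textbook assumptions, not material from the tutorial itself.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng=None):
    """Release a count under the Laplace mechanism: a counting query has sensitivity 1,
    so adding Laplace(1/epsilon) noise gives epsilon-differential privacy."""
    rng = rng or np.random.default_rng(0)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# e.g. a privately released count of emails matching some extraction pattern
print(laplace_count(true_count=1280, epsilon=0.5))
```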
{"title":"Large-Scale Information Extraction under Privacy-Aware Constraints","authors":"Rajeev Gupta, Ranganath Kondapally","doi":"10.1145/3459637.3482027","DOIUrl":"https://doi.org/10.1145/3459637.3482027","url":null,"abstract":"In this digital age, people spend a significant portion of their lives online and this has led to an explosion of personal data from users and their activities. Typically, this data is private and nobody else, except the user, is allowed to look at it. This poses interesting and complex challenges from scalable information extraction point of view: extracting information under privacy aware constraints where there is little data to learn from but need highly accurate models to run on large amount of data across different users. Anonymization of data is typically used to convert private data into publicly accessible data. But this may not always be feasible and may require complex differential privacy guarantees in order to be safe from any potential negative consequences. Other techniques involve building models on a small amount of seen (eyes-on) data and a large amount of unseen (eyes-off) data. In this tutorial, we use emails as representative private data to explain the concepts of scalable IE under privacy-aware constraints.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114535819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Learning Discriminative and Unbiased Representations for Few-Shot Relation Extraction
Jiale Han, Bo Cheng, Guoshun Nan
Few-shot relation extraction (FSRE) aims to predict the relation for a pair of entities in a sentence by exploring a few labeled instances for each relation type. Current methods mainly rely on meta-learning to learn generalized representations by optimizing the network parameters based on various collections of tasks sampled from training data. However, these methods may suffer from two main issues. 1) Insufficient supervision of meta-learning to learn discriminative representations on very few training instances, which are sampled from a large amount of base class data. 2) Spurious correlations between entities and relation types due to the biased training procedure that focuses more on entity pair rather than context. To learn more discriminative and unbiased representations for FSRE, this paper proposes a two-stage approach via supervised contrastive learning and sentence- and entity-level prototypical networks. In the first (pre-training) stage, we introduce a supervised contrastive pre-training method, which is able to yield more discriminative representations by learning from the entire training instances, such that the semantically related representations are close to each other, and far away otherwise. In the second (meta-learning) stage, we propose a novel sentence- and entity-level prototypical network equipped with fine-grained feature-wise fusion strategy to learn unbiased representations, where the networks are initialized with the parameters trained in the first stage. Specifically, the proposed network consists of a sentence branch and an entity branch, taking entire sentences and entity mentions as inputs, respectively. The entity branch explicitly captures the correlation between entity pairs and relations, and then dynamically adjusts the sentence branch's prediction distributions. By doing so, the spurious correlations issue caused by biased training samples can be properly mitigated. Extensive experiments on two FSRE benchmarks demonstrate the effectiveness of our approach.
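A bare-bones view of the prototypical-network step, nearest-prototype classification over class-mean embeddings; the relation labels and embeddings are synthetic, and the contrastive pre-training, entity branch, and feature-wise fusion are omitted.

```python
import numpy as np

def prototypical_predict(support_embs, support_labels, query_emb):
    """Assign the query to the relation whose prototype (mean support embedding) is closest."""
    labels = sorted(set(support_labels))
    protos = {c: np.mean([e for e, y in zip(support_embs, support_labels) if y == c], axis=0)
              for c in labels}
    return min(labels, key=lambda c: np.linalg.norm(query_emb - protos[c]))

rng = np.random.default_rng(0)
support = [rng.normal(loc=0, size=4), rng.normal(loc=0, size=4),
           rng.normal(loc=3, size=4), rng.normal(loc=3, size=4)]
labels = ["founder_of", "founder_of", "born_in", "born_in"]
print(prototypical_predict(support, labels, rng.normal(loc=3, size=4)))  # likely "born_in"
```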
{"title":"Learning Discriminative and Unbiased Representations for Few-Shot Relation Extraction","authors":"Jiale Han, Bo Cheng, Guoshun Nan","doi":"10.1145/3459637.3482268","DOIUrl":"https://doi.org/10.1145/3459637.3482268","url":null,"abstract":"Few-shot relation extraction (FSRE) aims to predict the relation for a pair of entities in a sentence by exploring a few labeled instances for each relation type. Current methods mainly rely on meta-learning to learn generalized representations by optimizing the network parameters based on various collections of tasks sampled from training data. However, these methods may suffer from two main issues. 1) Insufficient supervision of meta-learning to learn discriminative representations on very few training instances, which are sampled from a large amount of base class data. 2) Spurious correlations between entities and relation types due to the biased training procedure that focuses more on entity pair rather than context. To learn more discriminative and unbiased representations for FSRE, this paper proposes a two-stage approach via supervised contrastive learning and sentence- and entity-level prototypical networks. In the first (pre-training) stage, we introduce a supervised contrastive pre-training method, which is able to yield more discriminative representations by learning from the entire training instances, such that the semantically related representations are close to each other, and far away otherwise. In the second (meta-learning) stage, we propose a novel sentence- and entity-level prototypical network equipped with fine-grained feature-wise fusion strategy to learn unbiased representations, where the networks are initialized with the parameters trained in the first stage. Specifically, the proposed network consists of a sentence branch and an entity branch, taking entire sentences and entity mentions as inputs, respectively. The entity branch explicitly captures the correlation between entity pairs and relations, and then dynamically adjusts the sentence branch's prediction distributions. By doing so, the spurious correlations issue caused by biased training samples can be properly mitigated. Extensive experiments on two FSRE benchmarks demonstrate the effectiveness of our approach.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"34 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116491084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Constructing Noise Free Economic Policy Uncertainty Index
Chung-Chi Chen, Hen-Hsen Huang, Yu-Lieh Huang, Hsin-Hsi Chen
The economic policy uncertainty (EPU) index is one of the important text-based indexes in finance and economics fields. The EPU indexes of more than 26 countries have been constructed to reflect the policy uncertainty on country-level economic environments and serve as an important economic leading indicator. The EPU indexes are calculated based on the number of news articles with some manually-selected keywords related to economic, uncertainty, and policy. We find that the keyword-based EPU indexes contain noise, which will influence their explainability and predictability. In our experimental dataset, over 40% of news articles with the selected keywords are not related to the EPU. Instead of using keywords only, our proposed models take contextual information into account and get good performance on identifying the articles unrelated to EPU. The noise free EPU index performs better than the keyword-based EPU index in both explainability and predictability.
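The keyword screening that the original EPU construction relies on, and whose false positives this paper targets, can be sketched as follows; the keyword lists and articles are made up for illustration.

```python
# An article counts toward the raw index only if it mentions at least one term from
# each of the economy, policy, and uncertainty categories.
KEYWORDS = {
    "economy": {"economy", "economic"},
    "policy": {"policy", "regulation", "legislation"},
    "uncertainty": {"uncertain", "uncertainty"},
}

def is_epu_candidate(text):
    tokens = set(text.lower().split())
    return all(tokens & terms for terms in KEYWORDS.values())

articles = [
    "New policy raises economic uncertainty for exporters",
    "Local team wins championship amid uncertainty over coach",
]
print([is_epu_candidate(a) for a in articles])  # [True, False] -- the second lacks economy/policy terms
```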
{"title":"Constructing Noise Free Economic Policy Uncertainty Index","authors":"Chung-Chi Chen, Hen-Hsen Huang, Yu-Lieh Huang, Hsin-Hsi Chen","doi":"10.1145/3459637.3482075","DOIUrl":"https://doi.org/10.1145/3459637.3482075","url":null,"abstract":"The economic policy uncertainty (EPU) index is one of the important text-based indexes in finance and economics fields. The EPU indexes of more than 26 countries have been constructed to reflect the policy uncertainty on country-level economic environments and serve as an important economic leading indicator. The EPU indexes are calculated based on the number of news articles with some manually-selected keywords related to economic, uncertainty, and policy. We find that the keyword-based EPU indexes contain noise, which will influence their explainability and predictability. In our experimental dataset, over 40% of news articles with the selected keywords are not related to the EPU. Instead of using keywords only, our proposed models take contextual information into account and get good performance on identifying the articles unrelated to EPU. The noise free EPU index performs better than the keyword-based EPU index in both explainability and predictability.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124474985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0