
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management: Latest Publications

Sequence Modeling with Hierarchical Deep Generative Models with Dual Memory
Yanan Zheng, L. Wen, Jianmin Wang, Jun Yan, Lei Ji
Deep Generative Models (DGMs) can extract high-level representations from massive unlabeled data and are explainable from a probabilistic perspective. These characteristics favor sequence modeling tasks. However, modeling sequences with DGMs remains a major challenge. Unlike real-valued data that can be fed directly into models, sequence data consist of discrete elements and must first be transformed into suitable representations. This leads to two challenges. First, high-level features are sensitive to small variations of the inputs as well as to the way the data are represented. Second, the models are more likely to lose long-term information across multiple transformations. In this paper, we propose a Hierarchical Deep Generative Model with Dual Memory to address these two challenges, and we provide a method to perform inference and learning on the model efficiently. The proposed model extends basic DGMs with an improved, hierarchically organized multi-layer architecture. In addition, it incorporates memories along two directions, denoted broad memory and deep memory. The model is trained end-to-end by optimizing a variational lower bound on the data log-likelihood using an improved stochastic variational method. We perform experiments on several tasks with various datasets and obtain excellent results. For language modeling, our method significantly outperforms state-of-the-art results in terms of generative performance. Extended experiments, including document modeling and sentiment analysis, demonstrate the effectiveness of the dual memory mechanism and the latent representations. Random text generation provides a straightforward illustration of the advantages of our model.
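The training objective mentioned above, a variational lower bound (ELBO) on the data log-likelihood, can be illustrated with a minimal sketch. This is not the authors' dual-memory architecture: it is a toy diagonal-Gaussian latent model with hypothetical linear encoder/decoder weights and a reparameterized Monte-Carlo estimate of the reconstruction term.

```python
# Minimal sketch of a variational lower bound (ELBO), assuming a diagonal-Gaussian
# posterior, a standard-normal prior, and a Bernoulli decoder over binary inputs.
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, enc_mu, enc_logvar, decode, n_samples=8):
    """Monte-Carlo estimate of E_q[log p(x|z)] - KL(q(z|x) || N(0, I))."""
    mu, logvar = enc_mu(x), enc_logvar(x)
    std = np.exp(0.5 * logvar)
    # Analytic KL divergence between N(mu, diag(std^2)) and the standard normal prior.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    recon = 0.0
    for _ in range(n_samples):
        z = mu + std * rng.standard_normal(mu.shape)   # reparameterized sample
        p = decode(z)                                   # Bernoulli means for each dim of x
        recon += np.sum(x * np.log(p + 1e-9) + (1 - x) * np.log(1 - p + 1e-9))
    return recon / n_samples - kl

# Toy usage with hypothetical linear encoder/decoder weights.
d_x, d_z = 20, 4
W_mu = rng.standard_normal((d_z, d_x)) * 0.1
W_lv = rng.standard_normal((d_z, d_x)) * 0.1
W_dec = rng.standard_normal((d_x, d_z)) * 0.1
x = rng.integers(0, 2, size=d_x).astype(float)
print(elbo_estimate(x,
                    lambda v: W_mu @ v,
                    lambda v: W_lv @ v,
                    lambda z: 1.0 / (1.0 + np.exp(-(W_dec @ z)))))
```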
DOI: 10.1145/3132847.3132952 | Published: 2017-11-06
Citations: 0
Joint Topic-Semantic-aware Social Recommendation for Online Voting
Hongwei Wang, Jia Wang, Miao Zhao, Jiannong Cao, M. Guo
Online voting is an emerging feature in social networks, through which users can express their attitudes toward various issues and show their unique interests. Online voting imposes new challenges on recommendation, because the propagation of votings depends heavily on the structure of the social network as well as the content of the votings. In this paper, we investigate how to utilize these two factors in a comprehensive manner when making voting recommendations. First, because existing text mining methods such as topic models and semantic models cannot adequately process the content of votings, which is typically short and ambiguous, we propose a novel Topic-Enhanced Word Embedding (TEWE) method to learn word and document representations by jointly considering their topics and semantics. Then we propose our Joint Topic-Semantic-aware social Matrix Factorization (JTS-MF) model for voting recommendation. The JTS-MF model calculates similarity among users and votings by combining their TEWE representations with the structural information of the social network, and preserves this topic-semantic-social similarity during matrix factorization. To evaluate the performance of the TEWE representation and the JTS-MF model, we conduct extensive experiments on a real online voting dataset. The results demonstrate the efficacy of our approach against several state-of-the-art baselines.
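The general flavor of similarity-regularized matrix factorization can be sketched as follows. This is not the full JTS-MF objective: a plain squared loss and a single similarity-regularization term stand in for it, and the user-similarity matrix here is synthetic rather than derived from TEWE representations.

```python
# Sketch of social/similarity-regularized matrix factorization: factorize the
# user-voting interaction matrix while pulling each user's latent factors toward
# a similarity-weighted average of other users' factors.
import numpy as np

rng = np.random.default_rng(1)
n_users, n_votings, k = 30, 40, 8
R = (rng.random((n_users, n_votings)) < 0.05).astype(float)   # observed participations
S = rng.random((n_users, n_users))                            # hypothetical user similarities
S = (S + S.T) / 2
U = rng.normal(0, 0.1, (n_users, k))
V = rng.normal(0, 0.1, (n_votings, k))

lam, beta, lr = 0.02, 0.1, 0.01
for _ in range(200):
    E = R - U @ V.T                                            # reconstruction error
    # Similarity-weighted neighborhood average of user factors.
    U_nbr = (S @ U) / (S.sum(axis=1, keepdims=True) + 1e-9)
    U += lr * (E @ V - lam * U - beta * (U - U_nbr))
    V += lr * (E.T @ U - lam * V)
print("final squared error:", float(np.sum((R - U @ V.T) ** 2)))
```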
DOI: 10.1145/3132847.3132889 | Published: 2017-11-06
Citations: 28
Social Media for Opioid Addiction Epidemiology: Automatic Detection of Opioid Addicts from Twitter and Case Studies
Yujie Fan, Yiming Zhang, Yanfang Ye, Xin Li, W. Zheng
Opioid (e.g., heroin and morphine) addiction has become one of the largest and deadliest epidemics in the United States. To combat this deadly epidemic, there is an urgent need for novel tools and methodologies to gain new insights into the behavioral processes of opioid abuse and addiction. The role of social media in biomedical knowledge mining has become increasingly significant in recent years. In this paper, we propose a novel framework named AutoDOA to automatically detect opioid addicts from Twitter, which can potentially assist in sharpening our understanding of the behavioral process of opioid abuse and addiction. In AutoDOA, to model the users and their posted tweets as well as their rich relationships, a structured heterogeneous information network (HIN) is first constructed. Then a meta-path based approach is used to formulate similarity measures over users, and the different similarities are aggregated using Laplacian scores. Based on the HIN and the combined meta-paths, to reduce the cost of acquiring labeled examples for supervised learning, a transductive classification model is built for automatic opioid addict detection. To the best of our knowledge, this is the first work to apply transductive classification over an HIN to the drug-addiction domain. Comprehensive experiments on real sample collections from Twitter are conducted to validate the effectiveness of our developed system, AutoDOA, in opioid addict detection by comparison with alternative methods. The results and case studies also demonstrate that knowledge mined from daily-life social media data could support better practice in opioid addiction prevention and treatment.
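A minimal sketch of meta-path based similarity over a small heterogeneous network follows. The meta-path (User - Tweet - Keyword - Tweet - User), the random adjacency matrices, and the PathSim-style normalization are illustrative assumptions, not the paper's exact construction or its Laplacian-score aggregation.

```python
# Sketch of meta-path similarity: count path instances along a hypothetical
# User->Tweet->Keyword->Tweet->User meta-path and normalize PathSim-style.
import numpy as np

rng = np.random.default_rng(4)
UT = (rng.random((6, 10)) < 0.3).astype(float)   # user posts tweet
TK = (rng.random((10, 8)) < 0.3).astype(float)   # tweet mentions keyword

M = UT @ TK @ TK.T @ UT.T                         # commuting matrix along the meta-path
diag = np.diag(M)
pathsim = 2 * M / (diag[:, None] + diag[None, :] + 1e-9)
print(np.round(pathsim, 2))                       # user-user similarity under this meta-path
```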
DOI: 10.1145/3132847.3132857 | Published: 2017-11-06
Citations: 43
An Euclidean Distance based on the Weighted Self-information Related Data Transformation for Nominal Data Clustering
Lei Gu, Liying Zhang, Yang Zhao
Numerical data clustering is a tractable task since well-defined numerical measures such as the traditional Euclidean distance can be used for it directly, but nominal data clustering is a very difficult problem because there exists no natural relative ordering among nominal attribute values. This paper mainly aims to make the Euclidean distance measure appropriate for nominal data clustering, and the core idea is to transform each nominal attribute value into a numerical value. The transformation method consists of three steps. In the first step, the weighted self-information, which quantifies the amount of information in attribute values, is calculated for each value of each nominal attribute. In the second step, we find the k nearest neighbors of each object, because an object's k nearest neighbors are highly similar to it. In the last step, the weighted self-information of each attribute value in each nominal object is modified according to the object's k nearest neighbors. To evaluate the effectiveness of the proposed method, experiments are conducted on 10 data sets. Experimental results demonstrate that our method not only enables the Euclidean distance to be used for nominal data clustering, but also achieves better clustering performance than several existing state-of-the-art approaches.
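The three-step transformation can be sketched under simplifying assumptions: plain self-information -log p(v) stands in for the paper's weighted variant, neighbors are found with simple attribute-overlap similarity, and the final step averages each object's numeric representation with its neighbors' so that an ordinary Euclidean distance becomes meaningful.

```python
# Sketch of the three-step nominal-to-numeric transformation (simplified).
import numpy as np
from collections import Counter

def transform_nominal(X, k=3):
    X = np.asarray(X, dtype=object)
    n, d = X.shape
    # Step 1: per-attribute self-information of each value.
    num = np.zeros((n, d))
    for j in range(d):
        freq = Counter(X[:, j])
        for i in range(n):
            num[i, j] = -np.log(freq[X[i, j]] / n)
    # Step 2: k nearest neighbors by attribute-overlap similarity.
    overlap = np.array([[np.sum(X[i] == X[m]) for m in range(n)] for i in range(n)])
    np.fill_diagonal(overlap, -1)
    nbrs = np.argsort(-overlap, axis=1)[:, :k]
    # Step 3: smooth each object's representation with its neighbors'.
    return np.array([(num[i] + num[nbrs[i]].sum(axis=0)) / (k + 1) for i in range(n)])

X = [["red", "small"], ["red", "large"], ["blue", "small"], ["blue", "large"], ["red", "small"]]
Z = transform_nominal(X, k=2)
print(np.linalg.norm(Z[0] - Z[1]))   # ordinary Euclidean distance now applies
```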
DOI: 10.1145/3132847.3133062 | Published: 2017-11-06
Citations: 2
Exploiting User Consuming Behavior for Effective Item Tagging
Shen Liu, Hongyan Liu
Automatic tagging techniques are important for many applications such as search and recommendation, and they have attracted many researchers' attention in recent years. Existing methods mainly rely on users' tagging behavior or items' content information for tagging, while users' consuming behavior is ignored. In this paper, we propose to leverage such information and introduce a probabilistic model called joint-tagging LDA to improve tagging accuracy. An effective algorithm based on Zero-Order Collapsed Variational Bayes is developed. Experiments conducted on a real dataset demonstrate that joint-tagging LDA outperforms existing competing methods.
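A minimal sketch of zero-order collapsed variational Bayes (CVB0) updates for plain LDA is shown below. It omits the joint-tagging extension and uses toy word ids, so it only illustrates the style of inference the paper builds on.

```python
# CVB0 for plain LDA: each token keeps a soft topic assignment gamma,
# and expected counts are updated in place until convergence.
import numpy as np

def cvb0_lda(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    gamma = [rng.dirichlet(np.ones(K), size=len(d)) for d in docs]   # soft assignments
    n_dk = np.array([g.sum(axis=0) for g in gamma])                   # doc-topic counts
    n_kw = np.zeros((K, V))
    n_k = np.zeros(K)
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            n_kw[:, w] += gamma[d][n]
            n_k += gamma[d][n]
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                g = gamma[d][n]
                n_dk[d] -= g; n_kw[:, w] -= g; n_k -= g               # remove own count
                g_new = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                g_new /= g_new.sum()
                gamma[d][n] = g_new
                n_dk[d] += g_new; n_kw[:, w] += g_new; n_k += g_new
    return n_dk, n_kw

docs = [[0, 1, 2, 1], [3, 4, 3, 4], [0, 2, 4, 3]]                     # toy word ids
print(cvb0_lda(docs, K=2, V=5)[0])                                    # doc-topic expected counts
```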
DOI: 10.1145/3132847.3133071 | Published: 2017-11-06
Citations: 2
Words are Malleable: Computing Semantic Shifts in Political and Media Discourse
H. Azarbonyad, Mostafa Dehghani, K. Beelen, Alexandra Arkut, Maarten Marx, J. Kamps
Recently, researchers have started to pay attention to detecting temporal shifts in the meaning of words. However, most (if not all) of these approaches restricted their efforts to uncovering change over time, thus neglecting other valuable dimensions such as social or political variability. We propose an approach for detecting semantic shifts between different viewpoints, broadly defined as sets of texts that share a specific metadata feature, which can be a time period but also a social entity such as a political party. For each viewpoint, we learn a semantic space in which each word is represented as a low-dimensional neural embedding vector. The challenge is to compare the meaning of a word in one space to its meaning in another space and measure the size of the semantic shift. We compare the effectiveness of a measure based on optimal transformations between the two spaces with a measure based on the similarity of the word's neighbors in the respective spaces. Our experiments demonstrate that the combination of the two performs best. We show that semantic shifts occur not only over time but also across different viewpoints within a short period of time. For evaluation, we demonstrate how this approach captures meaningful semantic shifts and can help improve other tasks such as contrastive viewpoint summarization and ideology detection (measured as classification accuracy) in political texts. We also show that the two laws of semantic change that were empirically shown to hold for temporal shifts also hold for shifts across viewpoints. These laws state that frequent words are less likely to shift meaning, while words with many senses are more likely to do so.
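The two shift measures being compared can be sketched as follows, assuming two viewpoint-specific embedding matrices over a shared vocabulary: an orthogonal Procrustes alignment followed by cosine distance, and a nearest-neighbor overlap measure. Both are generic stand-ins rather than the exact formulations in the paper.

```python
# Two sketch measures of semantic shift between embedding spaces A and B.
import numpy as np

def procrustes_shift(A, B, idx):
    """Cosine distance of word `idx` after optimally rotating space B onto space A."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt                                  # orthogonal map minimizing ||B R - A||_F
    a, b = A[idx], B[idx] @ R
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def neighbor_shift(A, B, idx, k=5):
    """1 - Jaccard overlap of the word's k nearest neighbors in the two spaces."""
    def knn(M):
        sims = (M @ M[idx]) / (np.linalg.norm(M, axis=1) * np.linalg.norm(M[idx]) + 1e-9)
        sims[idx] = -np.inf
        return set(np.argsort(-sims)[:k])
    na, nb = knn(A), knn(B)
    return 1.0 - len(na & nb) / len(na | nb)

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 16))
B = A @ np.linalg.qr(rng.normal(size=(16, 16)))[0] + 0.01 * rng.normal(size=(100, 16))
print(procrustes_shift(A, B, 0), neighbor_shift(A, B, 0))   # both near zero for word 0
```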
DOI: 10.1145/3132847.3132878 | Published: 2017-11-06
Citations: 46
Building a Dossier on the Cheap: Integrating Distributed Personal Data Resources Under Cost Constraints
Imrul Chowdhury Anindya, Harichandan Roy, Murat Kantarcioglu, B. Malin
A wide variety of personal data is routinely collected by numerous organizations that, in turn, share and sell their collections for analytic investigations (e.g., market research). To preserve privacy, certain identifiers are often redacted, perturbed, or even removed. A substantial number of attacks have shown that, if care is not taken, such data can be linked to external resources to determine the explicit identifiers (e.g., personal names) or infer sensitive attributes (e.g., income) of the individuals from whom the data were collected. As such, organizations increasingly rely upon record linkage methods to assess the risk such attacks pose and adopt countermeasures accordingly. Traditional linkage methods assume only two datasets would be linked (e.g., linking de-identified hospital discharge records to identified voter registration lists), but with the advent of a multi-billion dollar data broker industry, modern adversaries have access to a massive stash of multiple datasets that can be leveraged. Still, realistic adversaries have budget constraints that prevent them from obtaining and integrating all relevant datasets. Thus, in this work, we investigate a novel privacy risk assessment framework based on adversaries who plan an integration of datasets that yields the most accurate estimate of targeted sensitive attributes under a certain budget. To solve this problem, we introduce a graph-based formulation of the problem and predictive modeling methods to prioritize data resources for linkage. We perform an empirical analysis using real-world voter registration data from two different U.S. states and show that the methods can be used efficiently to accurately estimate potentially sensitive information disclosure risks, even under a non-trivial amount of noise.
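One way to picture the budgeted-integration setting is the greedy selection sketch below; the dataset names, costs, and attribute sets are hypothetical, and the gain-per-cost rule is only an illustration of selecting resources under a budget, not the paper's graph-based formulation or predictive models.

```python
# Sketch: greedily pick the dataset with the best marginal gain in covered
# target attributes per unit cost until the budget is exhausted.
def greedy_budgeted_selection(datasets, budget):
    """datasets: {name: (cost, set_of_attributes_it_reveals)}"""
    covered, chosen, spent = set(), [], 0.0
    while True:
        best, best_ratio = None, 0.0
        for name, (cost, attrs) in datasets.items():
            if name in chosen or spent + cost > budget:
                continue
            gain = len(attrs - covered)
            if cost > 0 and gain / cost > best_ratio:
                best, best_ratio = name, gain / cost
        if best is None:
            return chosen, covered
        chosen.append(best)
        spent += datasets[best][0]
        covered |= datasets[best][1]

# Hypothetical data resources with costs and revealed attributes.
datasets = {"voter_reg": (3.0, {"name", "address", "dob"}),
            "discharge": (5.0, {"dob", "diagnosis", "zip"}),
            "marketing": (2.0, {"income", "zip"})}
print(greedy_budgeted_selection(datasets, budget=6.0))
```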
DOI: 10.1145/3132847.3132951 | Published: 2017-11-06
Citations: 4
Predicting Startup Crowdfunding Success through Longitudinal Social Engagement Analysis
Qizhen Zhang, Tengyuan Ye, Meryem Essaidi, S. Agarwal, Vincent Liu, B. T. Loo
A key ingredient of a startup's success is its ability to raise funding at an early stage. Crowdfunding has emerged as an exciting new mechanism for connecting startups with potentially thousands of investors. Nonetheless, little is known about its effectiveness or about the strategies that entrepreneurs should adopt in order to maximize their rate of success. In this paper, we perform a longitudinal data collection and analysis of AngelList, a popular crowdfunding social platform for connecting investors and entrepreneurs. Over a 7-10 month period, we track companies that are actively fund-raising on AngelList and record their level of social engagement on AngelList, Twitter, and Facebook. Through a series of measures of social engagement (e.g., number of tweets, posts, and new followers), our analysis shows that active engagement on social media is highly correlated with crowdfunding success. In some cases, the engagement level is an order of magnitude higher for successful companies. We further apply a range of machine learning techniques (e.g., decision tree, SVM, KNN) to predict the ability of a company to successfully raise funding based on its social engagement and other metrics. Since fund-raising is a rare event, we explore various techniques to deal with class imbalance issues. We observe that some metrics (e.g., AngelList followers and Facebook posts) are more significant than others in predicting fund-raising success. Furthermore, despite the class imbalance, we are able to predict crowdfunding success with 84% accuracy.
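One common way to handle the class imbalance noted above can be sketched with synthetic stand-in features (not the AngelList/Twitter data): reweight the rare positive class during training and evaluate with imbalance-aware metrics rather than raw accuracy.

```python
# Sketch: class-weighted training and imbalance-aware evaluation on synthetic features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, balanced_accuracy_score

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 5))                      # stand-ins for follower growth, post counts, ...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=n) > 2.5).astype(int)  # rare positives

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("positive rate:", y.mean())
print("F1:", f1_score(y_te, pred), "balanced acc:", balanced_accuracy_score(y_te, pred))
```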
DOI: 10.1145/3132847.3132908 | Published: 2017-11-06
Citations: 29
Efficient Fault-Tolerant Group Recommendation Using alpha-beta-core
Danhao Ding, Hui Li, Zhipeng Huang, N. Mamoulis
Fault-tolerant group recommendation systems based on subspace clustering successfully alleviate high-dimensionality and sparsity problems. However, the cost of recommendation grows exponentially with the size of the dataset. To address this issue, we model the fault-tolerant subspace clustering problem as a search problem on graphs and present an algorithm, GraphRec, based on the concept of the α-β-core. Moreover, we propose two variants of our approach that use indexes to improve query latency. Our experiments on different datasets demonstrate that our methods are extremely fast compared to the state-of-the-art.
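The α-β-core concept that the algorithm builds on can be sketched by iterative peeling of a bipartite graph: keep the maximal subgraph in which every left vertex has degree at least α and every right vertex has degree at least β. The user-item interactions below are hypothetical, and this sketch does not include the GraphRec search or its indexes.

```python
# Sketch: compute the alpha-beta-core of a bipartite graph by iterative peeling.
from collections import defaultdict

def alpha_beta_core(edges, alpha, beta):
    left, right = defaultdict(set), defaultdict(set)
    for u, v in edges:
        left[u].add(v)
        right[v].add(u)
    changed = True
    while changed:
        changed = False
        for u in [u for u, vs in left.items() if len(vs) < alpha]:   # peel left side
            for v in left.pop(u):
                right[v].discard(u)
            changed = True
        for v in [v for v, us in right.items() if len(us) < beta]:   # peel right side
            for u in right.pop(v):
                left[u].discard(v)
            changed = True
    return set(left), set(right)

# Hypothetical user-item interactions.
edges = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i2"), ("u3", "i3")]
print(alpha_beta_core(edges, alpha=2, beta=2))   # -> ({'u1', 'u2'}, {'i1', 'i2'})
```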
DOI: 10.1145/3132847.3133130 | Published: 2017-11-06
Citations: 53
Selective Value Coupling Learning for Detecting Outliers in High-Dimensional Categorical Data
Guansong Pang, Hongzuo Xu, Longbing Cao, Wentao Zhao
This paper introduces a novel framework, SelectVC, and its instance POP for learning selective value couplings (i.e., interactions between the full value set and a set of outlying values) to identify outliers in high-dimensional categorical data. Existing outlier detection methods work on a full data space or on feature subspaces that are identified independently of subsequent outlier scoring. As a result, they are significantly challenged by overwhelming irrelevant features in high-dimensional data, due to the noise introduced by the irrelevant features and the huge search space. In contrast, SelectVC works on a clean and condensed data space spanned by selective value couplings, obtained by jointly optimizing outlying value selection and value outlierness scoring. Its instance POP defines a value outlierness scoring function by modeling a partial outlierness propagation process to capture the selective value couplings. POP further defines a top-k outlying value selection method to ensure its scalability to the huge search space. We show that POP (i) significantly outperforms five state-of-the-art full-space- or subspace-based outlier detectors and their combinations with three feature selection methods on 12 real-world high-dimensional data sets with different levels of irrelevant features, and (ii) obtains good scalability, stable performance w.r.t. k, and a fast convergence rate.
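A rough illustrative sketch of the ingredients POP combines (not its actual scoring function) is shown below: each categorical value's outlierness is initialized from its rarity and then propagated over a value co-occurrence graph, so that values coupled with outlying values are themselves pushed up.

```python
# Sketch: rarity-initialized value outlierness propagated over value co-occurrences.
import numpy as np
from itertools import combinations

def value_outlierness(X, rounds=3, damping=0.5):
    values = sorted({(j, v) for row in X for j, v in enumerate(row)})
    idx = {val: i for i, val in enumerate(values)}
    n, m = len(X), len(values)
    counts = np.zeros(m)
    C = np.zeros((m, m))
    for row in X:
        ids = [idx[(j, v)] for j, v in enumerate(row)]
        for i in ids:
            counts[i] += 1
        for a, b in combinations(ids, 2):
            C[a, b] += 1
            C[b, a] += 1
    base = 1.0 - counts / n                         # rarer values start more outlying
    W = C / (C.sum(axis=1, keepdims=True) + 1e-9)   # row-normalized co-occurrence weights
    score = base.copy()
    for _ in range(rounds):
        score = (1 - damping) * base + damping * (W @ score)
    return dict(zip(values, np.round(score, 3)))

X = [["a", "x"], ["a", "x"], ["a", "y"], ["b", "y"], ["b", "x"], ["c", "z"]]
print(value_outlierness(X))
```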
DOI: 10.1145/3132847.3132994 | Published: 2017-11-06
Citations: 24