In this paper, we study 'networked bandits', a new bandit problem in which a set of interrelated arms varies over time and selecting one arm, given the contextual information, also invokes other correlated arms. This problem remains under-investigated despite its applicability to many practical settings. For instance, in social networks, selecting a user (arm) can yield payoffs from both that user and the user's relations, since content is often shared through the network. We examine whether it is possible to obtain multiple payoffs from several correlated arms based on these relationships. In particular, we formalize the networked bandit problem and propose an algorithm that considers not only the selected arm but also the relationships between arms. Our algorithm follows the 'optimism in the face of uncertainty' principle: it selects an arm based on integrated confidence sets constructed from historical data. We analyze its performance in simulation experiments and on two real-world offline datasets. The experimental results demonstrate the algorithm's effectiveness in the networked bandit setting.
{"title":"Networked bandits with disjoint linear payoffs","authors":"Meng Fang, D. Tao","doi":"10.1145/2623330.2623672","DOIUrl":"https://doi.org/10.1145/2623330.2623672","url":null,"abstract":"In this paper, we study `networked bandits', a new bandit problem where a set of interrelated arms varies over time and, given the contextual information that selects one arm, invokes other correlated arms. This problem remains under-investigated, in spite of its applicability to many practical problems. For instance, in social networks, an arm can obtain payoffs from both the selected user and its relations since they often share the content through the network. We examine whether it is possible to obtain multiple payoffs from several correlated arms based on the relationships. In particular, we formalize the networked bandit problem and propose an algorithm that considers not only the selected arm, but also the relationships between arms. Our algorithm is `optimism in face of uncertainty' style, in that it decides an arm depending on integrated confidence sets constructed from historical data. We analyze the performance in simulation experiments and on two real-world offline datasets. The experimental results demonstrate our algorithm's effectiveness in the networked bandit setting.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73863963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rapid development of online social networks, a growing number of people are willing to share their group activities, e.g., having dinner with colleagues or watching movies with a spouse. This motivates studies of group recommendation, which aims to recommend items to a group of users. Group recommendation is a challenging problem because different group members have different preferences, and how to trade off these preferences when making recommendations is still an open problem. In this paper, we propose a probabilistic model named COM (COnsensus Model) to model the generative process of group activities and make group recommendations. Intuitively, users in a group may have different degrees of influence, and those who are experts in topics relevant to the group are usually more influential. In addition, users in a group may behave differently as group members than they do as individuals. COM is designed based on these intuitions, and it is able to incorporate both users' selection histories and their personal consideration of content factors. When making recommendations, COM estimates the preference of a group for an item by aggregating the preferences of the group members with different weights. We conduct extensive experiments on four datasets, and the results show that the proposed model is effective in making group recommendations and significantly outperforms baseline methods.
{"title":"COM: a generative model for group recommendation","authors":"Quan Yuan, G. Cong, Chin-Yew Lin","doi":"10.1145/2623330.2623616","DOIUrl":"https://doi.org/10.1145/2623330.2623616","url":null,"abstract":"With the rapid development of online social networks, a growing number of people are willing to share their group activities, e.g. having dinners with colleagues, and watching movies with spouses. This motivates the studies on group recommendation, which aims to recommend items for a group of users. Group recommendation is a challenging problem because different group members have different preferences, and how to make a trade-off among their preferences for recommendation is still an open problem. In this paper, we propose a probabilistic model named COM (COnsensus Model) to model the generative process of group activities, and make group recommendations. Intuitively, users in a group may have different influences, and those who are expert in topics relevant to the group are usually more influential. In addition, users in a group may behave differently as group members from as individuals. COM is designed based on these intuitions, and is able to incorporate both users' selection history and personal considerations of content factors. When making recommendations, COM estimates the preference of a group to an item by aggregating the preferences of the group members with different weights. We conduct extensive experiments on four datasets, and the results show that the proposed model is effective in making group recommendations, and outperforms baseline methods significantly.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85244002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics for generating coherent topics. In practice, though, many document collections contain far fewer documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recent years, knowledge-based topic models have been proposed, which ask human users to provide prior domain knowledge to guide the model toward better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-links (meaning that two words should be in the same topic) and cannot-links (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge: wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
{"title":"Mining topics in documents: standing on the shoulders of big data","authors":"Zhiyuan Chen, B. Liu","doi":"10.1145/2623330.2623622","DOIUrl":"https://doi.org/10.1145/2623330.2623622","url":null,"abstract":"Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. However, in practice, many document collections do not have so many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recently years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-link (meaning that two words should be in the same topic) and cannot-link (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85403961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
It has traditionally been a challenge for home buyers to understand, compare, and contrast the investment values of real estate properties. While a number of estate appraisal methods have been developed to value real property, the performance of these methods has been limited by the traditional data sources for estate appraisal. However, with the development of new ways of collecting estate-related mobile data, there is a potential to leverage the geographic dependencies of estates to enhance estate appraisal. Indeed, the geographic dependencies of an estate's value can come from the characteristics of its own neighborhood (individual), the values of its nearby estates (peer), and the prosperity of the affiliated latent business area (zone). To this end, in this paper we propose a geographic method, named ClusRanking, for estate appraisal that leverages the mutual reinforcement of ranking and clustering. ClusRanking is able to exploit geographic individual, peer, and zone dependencies in a probabilistic ranking model. Specifically, we first extract the geographic utility of estates from geography data, estimate the neighborhood popularity of estates by mining taxicab trajectory data, and model the influence of latent business areas via ClusRanking. We then use a linear model to fuse these three influential factors and predict estate investment values. Moreover, we simultaneously consider individual, peer, and zone dependencies, and derive an estate-specific ranking likelihood as the objective function. Finally, we conduct a comprehensive evaluation with real-world estate-related data, and the experimental results demonstrate the effectiveness of our method.
{"title":"Exploiting geographic dependencies for real estate appraisal: a mutual perspective of ranking and clustering","authors":"Yanjie Fu, Hui Xiong, Yong Ge, Zijun Yao, Yu Zheng, Zhi-Hua Zhou","doi":"10.1145/2623330.2623675","DOIUrl":"https://doi.org/10.1145/2623330.2623675","url":null,"abstract":"It is traditionally a challenge for home buyers to understand, compare and contrast the investment values of real estates. While a number of estate appraisal methods have been developed to value real property, the performances of these methods have been limited by the traditional data sources for estate appraisal. However, with the development of new ways of collecting estate-related mobile data, there is a potential to leverage geographic dependencies of estates for enhancing estate appraisal. Indeed, the geographic dependencies of the value of an estate can be from the characteristics of its own neighborhood (individual), the values of its nearby estates (peer), and the prosperity of the affiliated latent business area (zone). To this end, in this paper, we propose a geographic method, named ClusRanking, for estate appraisal by leveraging the mutual enforcement of ranking and clustering power. ClusRanking is able to exploit geographic individual, peer, and zone dependencies in a probabilistic ranking model. Specifically, we first extract the geographic utility of estates from geography data, estimate the neighborhood popularity of estates by mining taxicab trajectory data, and model the influence of latent business areas via ClusRanking. Also, we use a linear model to fuse these three influential factors and predict estate investment values. Moreover, we simultaneously consider individual, peer and zone dependencies, and derive an estate-specific ranking likelihood as the objective function. Finally, we conduct a comprehensive evaluation with real-world estate related data, and the experimental results demonstrate the effectiveness of our method.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82278591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we study bid optimisation for real-time bidding (RTB) based display advertising. RTB allows advertisers to bid on a display ad impression in real time, as it is being generated. It goes beyond contextual advertising by basing bidding on user data, and it differs from the sponsored search auction, where the bid price is associated with keywords. For the demand side, a fundamental technical challenge is to automate the bidding process based on the budget, the campaign objective, and various information gathered at runtime and from historical data. In this paper, programmatic bidding is cast as a functional optimisation problem. Under certain dependency assumptions, we derive simple bidding functions that can be calculated in real time; our findings show that the optimal bid has a non-linear relationship with impression-level evaluations, such as the click-through rate and the conversion rate, which are estimated in real time from impression-level features. This differs from previous work, which has mainly focused on linear bidding functions. Our mathematical derivation suggests that optimal bidding strategies should try to bid on more impressions rather than focus on a small set of highly valued impressions, because, according to current RTB market data, lower-valued impressions are more cost-effective than higher-valued ones and the chances of winning them are relatively higher. Aside from the theoretical insights, offline experiments on a real dataset and online experiments on a production RTB system verify the effectiveness of our proposed optimal bidding strategies and the functional optimisation framework.
{"title":"Optimal real-time bidding for display advertising","authors":"Weinan Zhang, Shuai Yuan, Jun Wang","doi":"10.1145/2623330.2623633","DOIUrl":"https://doi.org/10.1145/2623330.2623633","url":null,"abstract":"In this paper we study bid optimisation for real-time bidding (RTB) based display advertising. RTB allows advertisers to bid on a display ad impression in real time when it is being generated. It goes beyond contextual advertising by motivating the bidding focused on user data and it is different from the sponsored search auction where the bid price is associated with keywords. For the demand side, a fundamental technical challenge is to automate the bidding process based on the budget, the campaign objective and various information gathered in runtime and in history. In this paper, the programmatic bidding is cast as a functional optimisation problem. Under certain dependency assumptions, we derive simple bidding functions that can be calculated in real time; our finding shows that the optimal bid has a non-linear relationship with the impression level evaluation such as the click-through rate and the conversion rate, which are estimated in real time from the impression level features. This is different from previous work that is mainly focused on a linear bidding function. Our mathematical derivation suggests that optimal bidding strategies should try to bid more impressions rather than focus on a small set of high valued impressions because according to the current RTB market data, compared to the higher evaluated impressions, the lower evaluated ones are more cost effective and the chances of winning them are relatively higher. Aside from the theoretical insights, offline experiments on a real dataset and online experiments on a production RTB system verify the effectiveness of our proposed optimal bidding strategies and the functional optimisation framework.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80543715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cardiac disease is the leading cause of death around the world, with ischemic heart disease alone claiming 7 million lives in 2011. This burden can be attributed, in part, to the absence of biomarkers that can reliably identify high-risk patients and match them to treatments that are appropriate for them. In recent clinical studies, we have demonstrated the ability of computation to extract information with substantial prognostic utility that is typically disregarded in time-series data collected from cardiac patients. Of particular interest are subtle variations in long-term electrocardiographic (ECG) data that are usually overlooked as noise but provide a useful assessment of myocardial instability. In multiple clinical cohorts, we have developed the pathophysiological basis for studying probabilistic variations in long-term ECG and demonstrated the ability of this information to effectively risk-stratify patients at risk of dying following heart attacks. In this paper, we extend this work and focus on the question of how to reduce its computational complexity for scalable use on large datasets or energy-constrained embedded devices. Our basic approach to uncovering pathological structure within the ECG focuses on characterizing beat-to-beat time-warped shape deformations of the ECG using a modified dynamic time-warping (DTW) and Lomb-Scargle periodogram-based algorithm. As part of our efforts to scale this work up, we explore a novel approach to address the quadratic runtime of DTW. We achieve this by developing the idea of adaptive downsampling to reduce the size of the inputs presented to DTW, and describe changes to the dynamic programming problem underlying DTW to exploit adaptively downsampled ECG signals. When evaluated on data from 765 patients in the DISPERSE2-TIMI33 trial, our results show that high morphologic variability is associated with an 8- to 9-fold increased risk of death within 90 days of a heart attack. Moreover, the use of adaptive downsampling with a modified DTW formulation achieves a 7- to almost 20-fold reduction in runtime relative to DTW, without a significant change in biomarker discrimination.
{"title":"Scalable noise mining in long-term electrocardiographic time-series to predict death following heart attacks","authors":"Chih-Chun Chia, Z. Syed","doi":"10.1145/2623330.2623702","DOIUrl":"https://doi.org/10.1145/2623330.2623702","url":null,"abstract":"Cardiac disease is the leading cause of death around the world; with ischemic heart disease alone claiming 7 million lives in 2011. This burden can be attributed, in part, to the absence of biomarkers that can reliably identify high risk patients and match them to treatments that are appropriate for them. In recent clinical studies, we have demonstrated the ability of computation to extract information with substantial prognostic utility that is typically disregarded in time-series data collected from cardiac patients. Of particular interest are subtle variations in long-term electrocardiographic (ECG) data that are usually overlooked as noise but provide a useful assessment of myocardial instability. In multiple clinical cohorts, we have developed the pathophysiological basis for studying probabilistic variations in long-term ECG and demonstrated the ability of this information to effectively risk stratify patients at risk of dying following heart attacks. In this paper, we extend this work and focus on the question of how to reduce its computational complexity for scalable use in large datasets or energy constrained embedded devices. Our basic approach to uncovering pathological structure within the ECG focuses on characterizing beat-to-beat time-warped shape deformations of the ECG using a modified dynamic time-warping (DTW) and Lomb-Scargle periodogram-based algorithm. As part of our efforts to scale this work up, we explore a novel approach to address the quadratic runtime of DTW. We achieve this by developing the idea of adaptive downsampling to reduce the size of the inputs presented to DTW, and describe changes to the dynamic programming problem underlying DTW to exploit adaptively downsampled ECG signals. When evaluated on data from 765 patients in the DISPERSE2-TIMI33 trial, our results show that high morphologic variability is associated with an 8- to 9-fold increased risk of death within 90 days of a heart attack. Moreover, the use of adaptive downsampling with a modified DTW formulation achieves a 7- to almost 20-fold reduction in runtime relative to DTW, without a significant change in biomarker discrimination.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80559556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Crime reduction and prevention strategies are essential to increase public safety and reduce the cost of crime to society. Law enforcement agencies have long realized the importance of analyzing co-offending networks---networks of offenders who have committed crimes together---for this purpose. Although network structure can contribute significantly to co-offence prediction, research in this area is very limited. Here we address this important problem by proposing a framework for co-offence prediction using supervised learning. Considering the available information about offenders, we introduce social, geographic, geo-social and similarity feature sets, which are used to classify potential negative and positive pairs of offenders. Like other social networks, co-offending networks suffer from a highly skewed distribution of positive and negative pairs. To address the class imbalance problem, we identify three types of criminal cooperation opportunities that help to reduce the class imbalance ratio significantly while retaining half of the co-offences. The proposed framework is evaluated on a large crime dataset for the Province of British Columbia, Canada. Our experimental evaluation of four different feature sets shows that the novel geo-social features are the best predictors. Overall, we experimentally demonstrate the high effectiveness of the proposed co-offence prediction framework. We believe that our framework will not only allow law enforcement agencies to improve their crime reduction and prevention strategies, but will also offer new criminological insights into criminal link formation between offenders.
{"title":"Spatially embedded co-offence prediction using supervised learning","authors":"M. A. Tayebi, M. Ester, U. Glässer, P. Brantingham","doi":"10.1145/2623330.2623353","DOIUrl":"https://doi.org/10.1145/2623330.2623353","url":null,"abstract":"Crime reduction and prevention strategies are essential to increase public safety and reduce the crime costs to society. Law enforcement agencies have long realized the importance of analyzing co-offending networks---networks of offenders who have committed crimes together---for this purpose. Although network structure can contribute significantly to co-offence prediction, research in this area is very limited. Here we address this important problem by proposing a framework for co-offence prediction using supervised learning. Considering the available information about offenders, we introduce social, geographic, geo-social and similarity feature sets which are used for classifying potential negative and positive pairs of offenders. Similar to other social networks, co-offending networks also suffer from a highly skewed distribution of positive and negative pairs. To address the class imbalance problem, we identify three types of criminal cooperation opportunities which help to reduce the class imbalance ratio significantly, while keeping half of the co-offences. The proposed framework is evaluated on a large crime dataset for the Province of British Columbia, Canada. Our experimental evaluation of four different feature sets show that the novel geo-social features are the best predictors. Overall, we experimentally show the high effectiveness of the proposed co-offence prediction framework. We believe that our framework will not only allow law enforcement agencies to improve their crime reduction and prevention strategies, but also offers new criminological insights into criminal link formation between offenders.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"146 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80567840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given a social network, can we quickly 'zoom out' of the graph? Is there a smaller, equivalent representation of the graph that preserves its propagation characteristics? Can we group nodes together based on their influence properties? These are important problems with applications in influence analysis, epidemiology, and viral marketing. In this paper, we first formulate a novel Graph Coarsening Problem: find a succinct representation of any graph while preserving key characteristics of diffusion processes on that graph. We then provide COARSENET, a fast and effective near-linear-time (in the number of nodes and edges) algorithm for this problem. Using extensive experiments on multiple real datasets, we demonstrate the quality and scalability of COARSENET, which enables us to reduce the graph by 90% in some cases without much loss of information. Finally, we show how our method can help in diverse applications such as influence maximization and detecting patterns of propagation at the level of automatically created groups on real cascade data.
{"title":"Fast influence-based coarsening for large networks","authors":"Manish Purohit, B. Prakash, Chanhyun Kang, Yao Zhang, V. S. Subrahmanian","doi":"10.1145/2623330.2623701","DOIUrl":"https://doi.org/10.1145/2623330.2623701","url":null,"abstract":"Given a social network, can we quickly 'zoom-out' of the graph? Is there a smaller equivalent representation of the graph that preserves its propagation characteristics? Can we group nodes together based on their influence properties? These are important problems with applications to influence analysis, epidemiology and viral marketing applications. In this paper, we first formulate a novel Graph Coarsening Problem to find a succinct representation of any graph while preserving key characteristics for diffusion processes on that graph. We then provide a fast and effective near-linear-time (in nodes and edges) algorithm COARSENET for the same. Using extensive experiments on multiple real datasets, we demonstrate the quality and scalability of COARSENET, enabling us to reduce the graph by 90% in some cases without much loss of information. Finally we also show how our method can help in diverse applications like influence maximization and detecting patterns of propagation at the level of automatically created groups on real cascade data.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83073710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Purchasing decisions in many product categories are heavily influenced by the shopper's aesthetic preferences. It is insufficient to simply match a shopper with popular items from the category in question; a successful shopping experience also identifies products that match those aesthetics. The challenge of capturing shoppers' styles becomes more difficult as the size and diversity of the marketplace increase. At Etsy, an online marketplace for handmade and vintage goods with over 30 million diverse listings, the problem of capturing taste is particularly important: users come to the site specifically to find items that match their eclectic styles. In this paper, we describe our methods and experiments for deploying two new style-based recommender systems on the Etsy site. We use Latent Dirichlet Allocation (LDA) to discover trending categories and styles on Etsy, which are then used to describe a user's "interest" profile. We also explore hashing methods to perform fast nearest-neighbor search on a map-reduce framework, in order to obtain recommendations efficiently. These techniques have been implemented successfully at very large scale, substantially improving many key business metrics.
{"title":"Style in the long tail: discovering unique interests with latent variable models in large scale social E-commerce","authors":"D. Hu, Robert J. Hall, Josh Attenberg","doi":"10.1145/2623330.2623338","DOIUrl":"https://doi.org/10.1145/2623330.2623338","url":null,"abstract":"Purchasing decisions in many product categories are heavily influenced by the shopper's aesthetic preferences. It's insufficient to simply match a shopper with popular items from the category in question; a successful shopping experience also identifies products that match those aesthetics. The challenge of capturing shoppers' styles becomes more difficult as the size and diversity of the marketplace increases. At Etsy, an online marketplace for handmade and vintage goods with over 30 million diverse listings, the problem of capturing taste is particularly important -- users come to the site specifically to find items that match their eclectic styles. In this paper, we describe our methods and experiments for deploying two new style-based recommender systems on the Etsy site. We use Latent Dirichlet Allocation (LDA) to discover trending categories and styles on Etsy, which are then used to describe a user's \"interest\" profile. We also explore hashing methods to perform fast nearest neighbor search on a map-reduce framework, in order to efficiently obtain recommendations. These techniques have been implemented successfully at very large scale, substantially improving many key business metrics.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83077151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Correlation clustering is a basic primitive in the data miner's toolkit, with applications ranging from entity matching to social network analysis. The goal in correlation clustering is, given a graph with signed edges, to partition the nodes into clusters so as to minimize the number of disagreements. In this paper we present a new algorithm for correlation clustering. Our algorithm is easily implementable in computational models such as MapReduce and streaming, and runs in a small number of rounds. In addition, we show that it obtains an almost 3-approximation to the optimal correlation clustering. Experiments on huge graphs demonstrate the scalability of our algorithm and its applicability to data mining problems.
{"title":"Correlation clustering in MapReduce","authors":"Flavio Chierichetti, Nilesh N. Dalvi, Ravi Kumar","doi":"10.1145/2623330.2623743","DOIUrl":"https://doi.org/10.1145/2623330.2623743","url":null,"abstract":"Correlation clustering is a basic primitive in data miner's toolkit with applications ranging from entity matching to social network analysis. The goal in correlation clustering is, given a graph with signed edges, partition the nodes into clusters to minimize the number of disagreements. In this paper we obtain a new algorithm for correlation clustering. Our algorithm is easily implementable in computational models such as MapReduce and streaming, and runs in a small number of rounds. In addition, we show that our algorithm obtains an almost 3-approximation to the optimal correlation clustering. Experiments on huge graphs demonstrate the scalability of our algorithm and its applicability to data mining problems.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83123825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}