2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA): Latest Papers

Pattern Matching Trajectories for Investigative Graph Searches
Benjamin W. K. Hung, A. Jayasumana, Vidarshana W. Bandara
Investigative graph search is the process of searching for and prioritizing entities of interest that may exhibit part or all of a pattern of attributes or connections for a latent behavior. In this work we formulate a related sub-problem of determining the pattern matching trajectories of such entities. The goal is to provide analysts not only with the ability to find full or partial matches against a query pattern, but also with a means to quantify the pace at which the indicators appear. This technology has a variety of potential applications, such as aiding in the detection of homegrown violent extremists before they carry out acts of domestic terrorism, detecting signs of post-traumatic stress in veterans, or tracking potential customer activities and experiences along a consumer journey. We propose a vectorized graph pattern matching approach that calculates the multi-hop class similarities between nodes in query and data graphs over time. By tracking partial match trajectories, we provide another dimension of analysis in investigative graph searches to highlight entities on a pathway towards a pattern of a latent behavior. We demonstrate the performance of our approach on a real-world BlogCatalog dataset of over 470K nodes and 4 million edges, where preprocessing filtered out 98.56% of nodes and 99.65% of edges, and we successfully detected the trajectory of the top 1,327 nodes towards a query pattern.
DOI: https://doi.org/10.1109/DSAA.2016.14 (published 2016-10-01)
Citations: 2
MedCare: Leveraging Medication Similarity for Disease Prediction
D. Dasgupta, N. Chawla
The emergence of electronic health records (EHRs) has made medical history, including past and current diseases and prescribed medications, easily available. This has facilitated the development of personalized and population health care management systems. Contemporary disease prediction systems leverage data such as disease diagnosis codes to compute patient similarity and predict the possible future disease risks of an individual. However, we posit that not all diseases (such as pre-existing conditions) may be represented in an EHR as a disease diagnosis code. It is likely that a patient is already taking a medication but does not have a corresponding disease recorded in the EHR. To that end, we posit that the medication history can serve as a proxy for disease diagnoses, and ask whether medication and disease diagnoses combined can improve the predictability of such systems. Building on our prior work in predicting disease risks (CARE), we develop two disease prediction systems: one using medication-based similarity (medCARE) and the other using both disease and medication-based similarity (combinedCARE). We show that combinedCARE provided greater coverage and a higher average rank.
DOI: https://doi.org/10.1109/DSAA.2016.90 (published 2016-10-01)
Citations: 3
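The MedCare abstract above does not say which similarity measure medCARE uses, so the following is only a minimal sketch of the general idea of medication-based patient similarity. It assumes Jaccard similarity over sets of medication names; the functions `jaccard` and `most_similar` and the toy records are hypothetical and are not taken from the paper.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def most_similar(
    target: str, medications: dict[str, set[str]], k: int = 2
) -> list[tuple[str, float]]:
    """Rank the other patients by medication overlap with `target`, highest first."""
    scores = [
        (other, jaccard(medications[target], meds))
        for other, meds in medications.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy records (assumption, for illustration only).
medications = {
    "p1": {"metformin", "lisinopril"},
    "p2": {"metformin", "lisinopril", "atorvastatin"},
    "p3": {"albuterol"},
}
print(most_similar("p1", medications))
```

In a collaborative-filtering setting, the diagnoses of the highest-ranked neighbours would then be candidate risks for the target patient; a combined variant could apply the same measure to the union of diagnosis codes and medications.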
Informative Priors and Bayesian Computation
Shirin Golchi
The use of prior distributions is often a controversial topic in Bayesian inference. Informative priors are often avoided at all costs. However, when prior information is available, informative priors are an appropriate way of introducing this information into the model. Furthermore, informative priors, when used properly and creatively, can provide solutions to computational issues and improve modeling efficiency. Through three examples with different applications, we demonstrate the importance and usefulness of informative priors in incorporating external information into the model and overcoming computational difficulties.
DOI: https://doi.org/10.1109/DSAA.2016.67 (published 2016-10-01)
Citations: 4
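The abstract above does not describe its three examples, so the sketch below is not one of them. It is a standard textbook illustration of how an informative prior introduces external information: a conjugate Beta-Binomial update, comparing a flat Beta(1, 1) prior with an informative Beta(20, 20) prior on the same small data set. The prior parameters and the data are assumptions chosen for illustration.

```python
def beta_posterior(
    alpha: float, beta: float, successes: int, failures: int
) -> tuple[float, float]:
    """Conjugate update: a Beta(alpha, beta) prior with binomial data gives a Beta posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: 3 successes in 4 trials.
successes, failures = 3, 1

flat = beta_posterior(1.0, 1.0, successes, failures)        # uniform prior
informed = beta_posterior(20.0, 20.0, successes, failures)  # prior centred on 0.5

print(beta_mean(*flat))      # posterior mean 4 / 6
print(beta_mean(*informed))  # posterior mean 23 / 44
```

With only four observations, the flat prior lets the data dominate, while the informative prior keeps the estimate close to 0.5. Whether that pull is desirable depends entirely on how trustworthy the external information is.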
Deconstructing Domain Names to Reveal Latent Topics
Cheryl J. Flynn, Kenneth E. Shirley, Wei Wang
Measurement of the lexical properties of domain names enables many types of relatively fast, lightweight web mining analyses. These include unsupervised learning tasks such as automatic categorization and clustering of websites, as well as supervised learning tasks, such as classifying websites as malicious or benign. In this paper we explore whether these tasks can be better accomplished by identifying semantically coherent groups of words in a large set of domain names using a combination of word segmentation and topic modeling methods. By segmenting domain names to generate a large set of new domain-level features, we compare three different unsupervised learning methods for identifying topics among domain name keywords: spherical k-means clustering (SKM), Latent Dirichlet Allocation (LDA), and the Biterm Topic Model (BTM). We successfully infer semantically coherent groups of words in two independent data sets, finding that BTM topics are quantitatively the most coherent. Using the BTM, we compare inferred topics across data sets and across time periods, and we also highlight instances of homophony within the topics. Finally, we show that the BTM topics can be used as features to improve the interpretability of a supervised learning model for the detection of malicious domain names. To our knowledge this is the first large-scale empirical analysis of the co-occurrence patterns of words within domain names.
DOI: https://doi.org/10.1109/DSAA.2016.63 (published 2016-10-01)
Citations: 0
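The abstract above names word segmentation as the first step but does not specify the algorithm. As a rough sketch of what segmenting a domain name can look like, the following uses a simple dictionary-based dynamic program that prefers the split with the fewest words. The function `segment` and the toy vocabulary are hypothetical; a practical segmenter would more likely score candidate splits with word frequencies.

```python
from __future__ import annotations


def segment(name: str, vocabulary: set[str]) -> list[str] | None:
    """Split `name` into the fewest vocabulary words, or return None if no split exists."""
    # best[i] holds the shortest known segmentation of name[:i], if any.
    best: list[list[str] | None] = [None] * (len(name) + 1)
    best[0] = []
    for end in range(1, len(name) + 1):
        for start in range(end):
            prefix = best[start]
            word = name[start:end]
            if prefix is None or word not in vocabulary:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[len(name)]


# Hypothetical toy vocabulary (assumption, for illustration only).
vocabulary = {"cheap", "flight", "flights", "s", "online"}
print(segment("cheapflightsonline", vocabulary))
```

The resulting tokens are the kind of domain-level features that a clustering or topic model could then consume.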
Infinite Langevin Mixture Modeling and Feature Selection
Ola Amayri, N. Bouguila
In this paper, we introduce data clustering based on infinite mixture models for spherical patterns. This clustering is based on the Langevin distribution, which has been shown to be effective for modeling this kind of data. The proposed learning algorithm uses a fully Bayesian approach. In contrast to classical Bayesian approaches, which suppose an unknown finite number of mixture components, the proposed approach assumes an infinite number of components; such infinite models have witnessed considerable theoretical and computational advances in recent years. In particular, we have developed a Markov Chain Monte Carlo (MCMC) algorithm to sample from the posterior distributions associated with the selected priors for the different model parameters. Moreover, we propose an infinite framework that allows simultaneous feature selection and parameter estimation. The usefulness of the developed framework is shown via a topic novelty detection application.
DOI: https://doi.org/10.1109/DSAA.2016.22 (published 2016-10-01)
Citations: 4
On the Tiny Yet Real Happiness Phenomenon in the Mobile Games Market
Po-Heng Chen, Yi-Pei Tu, Kuan-Ta Chen
This paper explores a counter-intuitive observation in the global mobile games market: that despite people in East Asian countries currently experiencing a challenging economic environment with lower disposable incomes and less leisure time than people in the West, they still spend much greater amounts of money on mobile gaming on a per-user basis. We link this situation to the tiny yet real happiness (TYRH) phenomenon: a term coined by Haruki Murakami, frequently rumored to be a future recipient of the Nobel Prize for Literature, in his 1986 book "Afternoon at Langerhan's Island". The TYRH phenomenon describes how, due to structural inequality problems, people (especially members of younger generations) may lose their ambition to actively develop their careers and instead cherish small, ordinary moments of bliss. More concretely, people implicated in this phenomenon tend to maintain an attitude of "living in the moment" without regard for their current and future lives, and may even retreat into various non-career-related activities, including mobile gaming. In this paper, we investigate the possible role of the TYRH phenomenon in influencing how smartphone users spend money (and time) on mobile games. We find that long work hours, higher scores on the Gini index, lower unemployment rates, and lower life satisfaction are all associated with higher per-user spending on mobile games on both the App Store and Google Play platforms. This suggests that the TYRH phenomenon is indeed positively associated with mobile game-playing and spending behavior, and that countries where the phenomenon is more prominent are likely to contribute disproportionately to the mobile games market, now and in the future.
DOI: https://doi.org/10.1109/DSAA.2016.76 (published 2016-10-01)
Citations: 1
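The abstract above uses national Gini index scores as one of its inequality indicators and does not say how those scores were obtained. For readers unfamiliar with the measure, here is a minimal sketch of the usual Gini coefficient computed from a list of incomes; the function `gini` and the sample inputs are assumptions for illustration, not data from the paper.

```python
def gini(incomes: list[float]) -> float:
    """Gini coefficient of non-negative incomes: 0 means perfect equality."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula over ascending values x_1..x_n:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([1, 1, 1, 1]))   # everyone earns the same
print(gini([0, 0, 0, 10]))  # one person earns everything
```

For a finite sample of size n the maximum value is (n - 1) / n, which approaches 1 as the population grows.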
A Parallel Framework for Grid-Based Bottom-Up Subspace Clustering
Poonam Goyal, S. Kumari, Shubham Singh, V. Kishore, S. Balasubramaniam, Navneet Goyal
Clustering is a popular data mining and machine learning technique which discovers interesting patterns from unlabeled data by grouping similar objects together. Clustering high-dimensional data is a challenging task because points in high-dimensional space are nearly equidistant from each other, rendering commonly used similarity measures ineffective. Subspace clustering has emerged as a possible solution to the problem of clustering high-dimensional data. In subspace clustering, we try to find clusters in different subspaces within a dataset. Many subspace clustering algorithms have been proposed in the last two decades to find clusters in multiple overlapping subspaces of high-dimensional data. Subspace clustering algorithms iteratively find the best subset of dimensions for a cluster from the 2^d - 1 possible combinations in d-dimensional data. Subspace clustering is extremely compute-intensive because of the exhaustive search of subspaces, especially in bottom-up subspace clustering algorithms. To address this issue, an efficient parallel framework for grid-based bottom-up subspace clustering algorithms is developed, considering popular algorithms belonging to this category. The framework is implemented for shared memory, distributed memory, and hybrid systems and is tested for three grid-based bottom-up subspace clustering algorithms: CLIQUE, MAFIA, and ENCLUS. All parallel implementations exhibit impressive speedup and scalability on real datasets.
DOI: https://doi.org/10.1109/DSAA.2016.42 (published 2016-10-01)
Citations: 5
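The size of the search space mentioned in the abstract above follows from counting the non-empty subsets of d dimensions: the sum over k = 1..d of C(d, k) equals 2^d - 1. The sketch below enumerates those subsets from low to high dimensionality, which is the order a bottom-up search visits them. It is only an illustration of the count (the helper `subspaces` is hypothetical) and does not implement CLIQUE, MAFIA, or ENCLUS, which prune most candidates instead of enumerating them all.

```python
from itertools import combinations
from typing import Iterator


def subspaces(dimensions: list[str]) -> Iterator[tuple[str, ...]]:
    """Yield every non-empty subset of dimensions, lowest dimensionality first."""
    for size in range(1, len(dimensions) + 1):
        yield from combinations(dimensions, size)


dims = ["x", "y", "z"]
all_subspaces = list(subspaces(dims))
print(all_subspaces)
print(len(all_subspaces) == 2 ** len(dims) - 1)
```

Because the count doubles with every added dimension, exhaustive enumeration is impractical beyond a few dozen dimensions, which is the motivation for both pruning and parallel execution.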
Evidence-Based Behavioral Model for Calendar Schedules of Individual Mobile Phone Users
Iqbal H. Sarker, M. A. Kabir, A. Colman, Jun Han
The electronic calendar usually serves as a personal organizer and is a valuable resource for managing users' daily activities or schedules. Naturally, a calendar provides various contextual information about an individual's scheduled events or appointments, e.g., meetings. A number of researchers have utilized such information to predict human behavior for mobile communication by assuming a predefined event-behavior mapping that is static and non-personalized. However, in the real world, people differ from each other in how they respond to incoming calls during their scheduled events; even a particular individual may respond differently depending on what type of event is scheduled in the calendar. Thus a static behavioral model does not necessarily map to calendar schedules and the corresponding phone call response behavior of individuals. Therefore, we propose an evidence-based behavioral model (EBM) that dynamically identifies the actual call response behavior of individuals for various calendar events based on their mobile phone logs, which record the data related to a user's phone call activities. Experiments on real datasets show that our proposed technique better captures the user's call response behavior for various calendar events, thereby enabling more appropriate rules to be created for the purpose of automated handling of incoming calls in an intelligent call interruption management system.
DOI: https://doi.org/10.1109/DSAA.2016.86 (published 2016-10-01)
Citations: 9
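The abstract above does not detail how the EBM derives behavior from phone logs. As a loose sketch of the underlying idea, learning per-user rules from observed evidence rather than assuming a fixed event-to-behavior mapping, the following counts how one user actually responded to calls during each type of calendar event and keeps the dominant response. The function `dominant_response`, the log format, and the sample entries are all hypothetical.

```python
from collections import Counter, defaultdict


def dominant_response(log: list[tuple[str, str]]) -> dict[str, tuple[str, float]]:
    """Map each event type to its most frequent call response and that response's share."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for event_type, response in log:
        counts[event_type][response] += 1
    rules = {}
    for event_type, responses in counts.items():
        response, hits = responses.most_common(1)[0]
        rules[event_type] = (response, hits / sum(responses.values()))
    return rules


# Hypothetical log entries for one user: (calendar event type, observed response).
log = [
    ("meeting", "reject"),
    ("meeting", "reject"),
    ("meeting", "accept"),
    ("lunch", "accept"),
    ("lunch", "accept"),
]
print(dominant_response(log))
```

The share returned alongside each response can serve as a confidence threshold: a call-handling rule might only be created when the dominant response is sufficiently consistent for that user.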
What Would a Data Scientist Ask? Automatically Formulating and Solving Predictive Problems
B. Schreck, K. Veeramachaneni
In this paper, we designed a formal language, called Trane, for describing prediction problems over relational datasets, and implemented a system that allows data scientists to specify problems in that language. We show that this language is able to describe several prediction problems, including ones on KAGGLE, a data science competition website. We express 29 different KAGGLE problems in this language. We designed an interpreter, which translates input from the user, specified in this language, into a series of transformation and aggregation operations to apply to a dataset in order to generate labels that can be used to train a supervised machine learning classifier. Using a smaller subset of this language, we developed a system to automatically enumerate, interpret, and solve prediction problems. We tested this system on the Walmart Store Sales Forecasting dataset found on KAGGLE, enumerated 1077 prediction problems, and built models that attempted to solve them, for which we produced 235 AUC scores. Considering that only one of those 1077 problems was the focus of a 2.5-month-long competition on KAGGLE, we expect this system to deliver a thousandfold increase in data scientists' productivity.
DOI: https://doi.org/10.1109/DSAA.2016.55 · Published: 2016-10-01
Citations: 5
Inconsistent Node Flattening for Improving Top-Down Hierarchical Classification
Azad Naik, H. Rangwala
Large-scale classification of data where classes are structurally organized in a hierarchy is an important area of research. Top-down approaches that exploit the hierarchy during the learning and prediction phases are efficient for large-scale hierarchical classification. However, the accuracy of top-down approaches is poor due to error propagation, i.e., prediction errors made at higher levels in the hierarchy cannot be corrected at lower levels. One of the main reasons behind errors at the higher levels is the presence of inconsistent nodes, which are introduced by the arbitrary process through which domain experts create these hierarchies. In this paper, we propose two different data-driven approaches (local and global) for hierarchical structure modification that identify and flatten inconsistent nodes present within the hierarchy. Our extensive empirical evaluation of the proposed approaches on several image and text datasets with varying distributions of features, classes, and training instances per class shows improved classification performance over competing hierarchical modification approaches. Specifically, we see an improvement of up to 7% in Macro-F1 score with our approach over the best top-down (TD) baseline. SOURCE CODE: http://www.cs.gmu.edu/ mlbio/InconsistentNodeFlattening.
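The flattening step itself can be sketched in a few lines. The sketch assumes that flattening a node means removing it and attaching its children directly to its parent, and it takes the set of inconsistent nodes as given; how the local and global approaches identify those nodes is not described in the abstract and is not modeled here. The function name `flatten_nodes` and the example hierarchy are hypothetical.

```python
def flatten_nodes(children, inconsistent):
    """Return a new hierarchy with each listed internal node removed.

    `children` maps a node to the list of its child nodes. The children of
    a removed node are attached to that node's parent, in the same position.
    The root and leaf nodes are never removed.
    """
    parent = {c: p for p, cs in children.items() for c in cs}
    new_children = {n: list(cs) for n, cs in children.items()}
    for node in inconsistent:
        p = parent.get(node)
        if p is None or not new_children.get(node):
            continue  # skip the root, leaves, and unknown nodes
        kids = new_children.pop(node)
        idx = new_children[p].index(node)
        new_children[p][idx:idx + 1] = kids  # splice children into the parent
        for k in kids:
            parent[k] = p
    return new_children


if __name__ == "__main__":
    hierarchy = {
        "root": ["animals", "vehicles"],
        "animals": ["pets", "wild"],
        "pets": ["cat", "dog"],
        "wild": ["lion"],
        "vehicles": ["car"],
    }
    print(flatten_nodes(hierarchy, {"pets"}))
```

After flattening `pets`, a top-down classifier no longer has to make the `pets` versus `wild` decision before distinguishing `cat` from `dog`, so one potential source of propagated error is removed.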
DOI: https://doi.org/10.1109/DSAA.2016.47 · Published: 2016-10-01
Citations: 16