
Latest publications: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

Exploiting a Bootstrapping Approach for Automatic Annotation of Emotions in Texts
Lea Canales, C. Strapparava, E. Boldrini, P. Martínez-Barco
The objective of this research is to develop a technique to automatically annotate emotional corpora. The automatic annotation of emotional corpora still presents numerous challenges, and thus there is a need for a technique that allows us to tackle the annotation task. The relevance of this research is demonstrated by the fact that people's emotions, and the patterns of those emotions, provide great value for business, individuals, society and politics. Hence, the creation of a robust emotion detection system becomes crucial. Due to the subjectivity of emotions, the main challenge in the creation of emotional resources is the annotation process. With this starting point in mind, the objective of our paper is to illustrate an innovative and effective bootstrapping process for the automatic annotation of emotional corpora. The evaluations carried out confirm the soundness of the proposed approach and allow us to consider the bootstrapping process an appropriate way to create resources, such as an emotional corpus, that can be employed in supervised machine learning to improve emotion detection systems.
DOI: 10.1109/DSAA.2016.78
Citations: 14
Role Models: Mining Role Transitions Data in IT Project Management
G. Palshikar, Sachin Pawar, Nitin Ramrakhiyani
The notion of roles is crucial in project management across various domains. A role indicates a broad set of tasks, activities, deliverables and responsibilities that a person needs to carry out within a project. Assigning roles to team members clarifies the work items each is expected to deliver and structures the interactions of the team among themselves as well as with external stakeholders. This paper analyzes a sizeable real-life dataset on the actual usage of roles in software development and maintenance projects in a large multinational IT organization. The paper introduces and formalizes concepts such as the seniority level of a role, career progression and career lines; formulates various business questions related to role-based project management; proposes analytics techniques to answer them; and outlines the actual results produced. The business questions concern dependencies between roles, patterns in role assignments and durations, predicting role changes, discovering insights useful for meeting career aspirations, interesting role sequences, and so on. The proposed analytics algorithms are based on Markov models, sequence mining, classification and survival analysis.
DOI: 10.1109/DSAA.2016.62
Citations: 4
Using Players' Gameplay Action-Decision Profiles to Prescribe Training: Reducing Training Costs with Serious Games Analytics
C. S. Loh, I. Li
Players' gameplay action-decision data can be used for profiling in serious games analytics. The insights gained can help support decisions for performance improvement and serve as 'prescriptions' for training – e.g., diagnosing who should receive training, how much training to give, informing the design of the game, and determining the contents for inclusion and exclusion. Data-driven training prescription can help learning organizations save money by cutting unnecessary training. Players' learning performance in games can be measured in lieu of their behaviors traced in situ in the training environment. Novice players' action-decision data are first converted into Courses of Action (COAs) before pairwise similarity comparison against those of the expert(s), revealing how similar they are to the training goal, or expert/model answer. We identified three Gameplay Action-Decision (GAD) profiles from these gameplay action-decision data and applied them as diagnostics for prescriptive training.
DOI: 10.1109/DSAA.2016.74
Citations: 3
Dilation of Chisini-Jensen-Shannon Divergence
P. Sharma, Gary Holness
Jensen-Shannon divergence (JSD) does not provide adequate separation when the difference between input distributions is subtle. A recently introduced technique, Chisini-Jensen-Shannon Divergence (CJSD), increases JSD's ability to discriminate between probability distributions by reformulating it with operators derived from the Chisini mean. As a consequence, CJSDs also carry additional robustness properties. The utility of this approach was validated in the form of two SVM kernels that give superior classification performance. Our work explores why this reformulation affords the performance improvement over JSD. We characterize the nature of this improvement using the idea of relative dilation, that is, how the Chisini mean transforms JSD's range, and prove a number of propositions that establish the degree of this separation. Finally, we provide empirical validation on a synthetic dataset that confirms our theoretical results pertaining to relative dilation.
DOI: 10.1109/DSAA.2016.25
Citations: 6
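As a rough, hypothetical illustration of the kind of reformulation the abstract describes (the paper's actual Chisini-mean operators are not reproduced here), baseline JSD and a geometric-mean variant can be sketched in a few lines:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are ignored."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def jsd(p, q):
    """Classic JSD: entropy of the arithmetic mixture minus the mean entropy."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

def geometric_cjsd(p, q):
    """Hypothetical Chisini-style variant: the arithmetic mixture is
    replaced by a renormalized geometric mean of the two distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    g = np.sqrt(p * q)
    g = g / g.sum()
    return entropy(g) - 0.5 * (entropy(p) + entropy(q))
```

The intuition the abstract points to is that swapping the arithmetic mixture for another Chisini mean stretches (dilates) the divergence's effective range, which can make subtle differences between distributions easier to separate.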
Fraud Detection in Energy Consumption: A Supervised Approach
Bernat Coma-Puig, J. Carmona, Ricard Gavaldà, Santiago Alcoverro, Victor Martin
Data from utility meters (gas, electricity, water) are a rich source of information for distribution companies, beyond billing. In this paper we present a supervised technique, which feeds primarily, but not exclusively, on meter information, to detect meter anomalies and customer fraudulent behavior (meter tampering). Our system detects anomalous meter readings on the basis of models built with machine learning techniques on past data. Unlike most previous work, it can incrementally incorporate the results of field checks to grow the database of fraud and non-fraud patterns, increasing model precision over time and potentially adapting to emerging fraud patterns. The full system has been developed with a company providing electricity and gas and has already been used to carry out several field checks, with large improvements in fraud detection over previous checks that used simpler techniques.
DOI: 10.1109/DSAA.2016.19
Citations: 55
Meeting Health Care Research Needs in a Kimball Integrated Data Warehouse
R. Hart, A. Kuo
Business Intelligence and the Kimball methodology, often referred to as dimensional modelling, are well established in data warehousing as a successful means of turning data into information. These techniques have been utilized in multiple business areas such as banking, manufacturing, marketing, sales, healthcare and more. Several articles have also shown how the Kimball approach can and has been used in the development of clinical research databases. However, these articles have also shown that there are weaknesses to the Kimball methodology when applied to complex areas such as clinical research. This paper describes our approach to address these weaknesses and meet the more sophisticated needs of health researchers by leveraging relationships within the underlying data and advanced techniques in the Kimball methodology.
DOI: 10.1109/DSAA.2016.91
Citations: 4
A Framework for Description and Analysis of Sampling-Based Approximate Triangle Counting Algorithms
M. H. Chehreghani
Counting the number of triangles in a large graph has many important applications in network analysis. Several frequently computed metrics, such as the clustering coefficient and the transitivity ratio, require counting the number of triangles. In this paper, we present a randomized framework for expressing and analyzing approximate triangle counting algorithms. We show that many existing approximate triangle counting algorithms can be described in terms of probability distributions given as parameters to the proposed framework. We then show that the framework provides a quantitative measure for the quality of different approximate algorithms. Finally, we perform experiments on real-world networks from different domains and show that no single sampling technique outperforms the others on all networks; the quality of a sampling technique depends on factors such as the structure of the network, the vertex degree-triangle correlation and the number of samples.
DOI: 10.1109/DSAA.2016.15
Citations: 0
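For readers unfamiliar with this family of algorithms, a minimal wedge-sampling estimator (one common sampling scheme; the framework above covers a range of such sampling distributions, which are not reproduced here) can be sketched as:

```python
import random
from itertools import combinations

def exact_triangles(adj):
    """Exact count. adj maps each node to its set of neighbours (symmetric).
    Each triangle is seen once per corner, hence the division by 3."""
    count = 0
    for u in adj:
        for v, w in combinations(adj[u], 2):
            if w in adj[v]:
                count += 1
    return count // 3

def wedge_sampling_estimate(adj, samples=20000, rng=random):
    """Approximate count via wedge sampling: the triangle count equals
    (closed-wedge fraction) * (total wedges) / 3."""
    nodes = [u for u in adj if len(adj[u]) >= 2]
    wedges = [len(adj[u]) * (len(adj[u]) - 1) // 2 for u in nodes]
    total = sum(wedges)
    closed = 0
    for _ in range(samples):
        # pick a wedge uniformly: a centre proportional to its wedge count,
        # then two random neighbours of that centre
        u = rng.choices(nodes, weights=wedges)[0]
        v, w = rng.sample(sorted(adj[u]), 2)
        if w in adj[v]:
            closed += 1
    return closed / samples * total / 3
```

On a complete graph every wedge is closed, so the estimate is exact; on sparser graphs the accuracy depends on the number of samples and on the degree-triangle correlation, echoing the factors the abstract identifies.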
Projecting "Better Than Randomly": How to Reduce the Dimensionality of Very Large Datasets in a Way That Outperforms Random Projections
M. Wojnowicz, Di Zhang, Glenn Chisholm, Xuan Zhao, M. Wolff
For very large datasets, random projections (RP) have become the tool of choice for dimensionality reduction, owing to the computational complexity of principal component analysis. However, the recent development of randomized principal component analysis (RPCA) has opened up the possibility of obtaining approximate principal components on very large datasets. In this paper, we compare the performance of RPCA and RP in dimensionality reduction for supervised learning. In Experiment 1, we study a malware classification task on a dataset with over 10 million samples, almost 100,000 features, and over 25 billion non-zero values, with the goal of reducing the dimensionality to a compressed representation of 5,000 features. In order to apply RPCA to this dataset, we develop a new algorithm called large sample RPCA (LS-RPCA), which extends the RPCA algorithm to work on datasets with arbitrarily many samples. We find that classification performance is much higher when using LS-RPCA for dimensionality reduction than when using random projections. In particular, across a range of target dimensionalities, we find that using LS-RPCA reduces classification error by between 37% and 54%. Experiment 2 generalizes the phenomenon to multiple datasets, feature representations, and classifiers. These findings have implications for a large number of research projects in which random projections were used as a preprocessing step for dimensionality reduction. As long as accuracy is at a premium and the target dimensionality is sufficiently less than the numeric rank of the dataset, randomized PCA may be a superior choice.
DOI: 10.1109/DSAA.2016.26
Citations: 10
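A compact way to see the two approaches side by side is the sketch below: a plain Gaussian random projection versus a Halko-style randomized PCA. This is generic textbook machinery, not the paper's LS-RPCA algorithm.

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Plain Gaussian random projection to k dimensions."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R

def randomized_pca_scores(X, k, oversample=10, seed=0):
    """Halko-style randomized PCA: approximate the range of the centred
    data with a random sketch, then take an exact SVD of the small
    projected matrix."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    G = rng.normal(size=(Xc.shape[1], k + oversample))
    Q, _ = np.linalg.qr(Xc @ G)            # orthonormal basis for the sketch
    _, _, Vt = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    return Xc @ Vt[:k].T                   # data projected onto top-k components
```

The difference the abstract measures is visible even here: the projection matrix `R` ignores the data entirely, while the randomized-PCA basis is adapted to the data's dominant directions at only modestly higher cost.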
On the Evaluation of Outlier Detection and One-Class Classification Methods
Lorne Swersky, Henrique O. Marques, J. Sander, R. Campello, A. Zimek
It has been shown that unsupervised outlier detection methods can be adapted to the one-class classification problem. In this paper, we focus on the comparison of one-class classification algorithms with such adapted unsupervised outlier detection methods, improving on previous comparison studies in several important aspects. We study a number of one-class classification and unsupervised outlier detection methods in a rigorous experimental setup, comparing them on a large number of datasets with different characteristics, using different performance measures. Our experiments led to conclusions that do not fully agree with those of previous work.
DOI: 10.1109/DSAA.2016.8
Citations: 52
A Symbolic Tree Model for Oil and Gas Production Prediction Using Time-Series Production Data
Bingjie Wei, Helen Pinto, Xin Wang
Oil and gas well production prediction takes place in the early stages of production to estimate future recovery. A data-driven workflow is proposed in this paper to construct a symbolic tree model that predicts new well production using historic time-series production data of analogous wells. Production data are first aggregated and symbolized for dimensionality reduction and discretization of the time-series data. A symbolic tree is constructed on the time-series symbol sequences, and two pre-pruning mechanisms – minimum node size and spatial information gain – are integrated to achieve a compact and informative tree. A coverage index is used to assess the tree size. A case study applying the proposed workflow to shale gas wells in the Montney-A pool in Canada demonstrated the feasibility and accuracy of the proposed method.
{"title":"A Symbolic Tree Model for Oil and Gas Production Prediction Using Time-Series Production Data","authors":"Bingjie Wei, Helen Pinto, Xin Wang","doi":"10.1109/DSAA.2016.36","DOIUrl":"https://doi.org/10.1109/DSAA.2016.36","url":null,"abstract":"Oil and gas well production prediction takes place in early stages of production to estimate future recovery. A data driven workflow is proposed in this paper to construct a symbolic tree model to predict new well production using historic time-series production data of analogous wells. Production data are firstly aggregated and symbolized for dimensionality reduction and data discretization of time-series data. A symbolic tree is constructed on time-series symbol sequences, and pre-pruning mechanisms – minimum node size and spatial information gain – are integrated to achieve a compact and informative tree. A coverage index is used to assess the tree size. A case study was conducted applying the proposed workflow to shale gas wells in Montney-A pool in Canada. It has proved the feasibility and accuracy of the proposed method.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131311408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
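The aggregation-and-symbolization step this abstract describes (dimensionality reduction followed by discretization of time-series data) can be sketched as a SAX-style transform: piecewise aggregate approximation, then mapping each segment mean to an alphabet symbol via fixed breakpoints. The segment count and breakpoints below are illustrative assumptions; the paper's exact aggregation scheme and alphabet are not specified here.

```python
def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    n = len(series)
    return [sum(series[i * n // segments:(i + 1) * n // segments]) /
            (((i + 1) * n // segments) - (i * n // segments))
            for i in range(segments)]

def symbolize(series, segments=4, breakpoints=(-0.43, 0.43)):
    """Map a (z-normalized) series to a symbol sequence: aggregate with PAA,
    then discretize each segment mean against fixed breakpoints (a, b, c, ...)."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    def symbol(v):
        idx = sum(v > bp for bp in breakpoints)  # count breakpoints below v
        return alphabet[idx]
    return "".join(symbol(m) for m in paa(series, segments))
```

The resulting symbol sequences are the kind of discrete input on which a symbolic tree with pre-pruning (minimum node size, information gain) could then be built.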
Journal
2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)