iCASSTLE : Imbalanced Classification Algorithm for Semi Supervised Text Learning

2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1012-1016. Pub Date: 2018-12-01. DOI: 10.1109/ICMLA.2018.00165

Debanjan Banerjee, Gyan Prabhat, Riyanka Bhowal


Citations: 4

Abstract

Text is abundant on the web today and can be mined to solve a wide range of problems. Customer reviews, for instance, arrive from multiple sources by the thousands each day and can be leveraged to obtain several insights. Our goal is to extract cases of a rare event (e.g., product recalls, allegations of ethical or legal concerns, or threats to product safety) from this enormous amount of data. Manually identifying the cases to be reported is extremely labour-intensive and time-sensitive, but failing to do so can have a fatal impact on the industry's overall health and dependability; missing even a single case may lead to huge penalties in terms of customer experience, product liability and industry reputation. In this paper, we discuss classification from Positive and Unlabeled data (PU classification), where the only class for which labeled instances are available is the rare event. In iCASSTLE, we propose a two-stage approach: Stage I leverages three unique components of text mining to procure representative training data containing instances of both classes in the right proportion, and Stage II uses the results of Stage I to run a semi-supervised classification. We applied iCASSTLE to multiple datasets that differ in the nature of the product-safety issue as well as in the nature of the imbalance, and it performed better than state-of-the-art methods on the relevant use cases.
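To make the two-stage idea concrete, the following is a minimal sketch of a generic two-step PU pipeline, not the authors' implementation. The abstract does not specify the three text-mining components of Stage I or the classifier used in Stage II, so everything here is an assumption chosen for illustration: bag-of-words cosine similarity to pick "reliable negatives" from the unlabeled pool, a nearest-centroid scorer with self-training for Stage II, and the hypothetical names `stage_one`, `stage_two`, `neg_per_pos` and `margin`.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Summed term counts; cosine is scale-invariant, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def stage_one(positives, unlabeled, neg_per_pos=2):
    """Assumed Stage I: treat the unlabeled texts least similar to the
    positive centroid as reliable negatives, in a chosen class proportion."""
    pos_centroid = centroid([vectorize(p) for p in positives])
    ranked = sorted(unlabeled, key=lambda u: cosine(vectorize(u), pos_centroid))
    n_neg = min(len(ranked), neg_per_pos * len(positives))
    return ranked[:n_neg], ranked[n_neg:]

def stage_two(positives, negatives, remaining, rounds=3, margin=0.1):
    """Assumed Stage II: self-training. Texts scored confidently by the
    current centroids join the labeled sets; the rest stay in the pool."""
    pos, neg, pool = list(positives), list(negatives), list(remaining)
    for _ in range(rounds):
        pc = centroid([vectorize(t) for t in pos])
        nc = centroid([vectorize(t) for t in neg])
        still_unsure = []
        for text in pool:
            v = vectorize(text)
            gap = cosine(v, pc) - cosine(v, nc)
            if gap > margin:
                pos.append(text)
            elif gap < -margin:
                neg.append(text)
            else:
                still_unsure.append(text)
        if len(still_unsure) == len(pool):
            break  # nothing new was labeled, so further rounds change nothing
        pool = still_unsure
    return pos, neg, pool

# Toy data, invented for illustration only.
positives = ["charger caught fire and burned my hand",
             "battery overheated and caught fire"]
unlabeled = ["great phone fast delivery",
             "love the color and fast delivery",
             "the battery caught fire overnight",
             "nice screen great price"]

negatives, remaining = stage_one(positives, unlabeled, neg_per_pos=1)
pos, neg, pool = stage_two(positives, negatives, remaining)
```

The `neg_per_pos` parameter stands in for the abstract's "right proportion" of the two classes; how iCASSTLE actually chooses that proportion is not stated in the abstract.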


