iCASSTLE : Imbalanced Classification Algorithm for Semi Supervised Text Learning

Debanjan Banerjee, Gyan Prabhat, Riyanka Bhowal
Published in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1012-1016
Publication date: 2018-12-01
DOI: 10.1109/ICMLA.2018.00165 (https://doi.org/10.1109/ICMLA.2018.00165)
Citations: 4

Abstract

Text is abundant on the web today and can be mined to solve a wide range of problems. Customer reviews, for instance, arrive from multiple sources by the thousands each day and can be leveraged for insights. Our goal is to extract cases of a rare event (e.g., product recalls, ethics allegations, legal concerns, or threats to product safety) from this enormous volume of data. Manually identifying the cases to report is extremely labour-intensive and time-sensitive, yet failing to do so can have a fatal impact on the industry's overall health and dependability; missing even a single case may lead to heavy penalties in terms of customer experience, product liability, and industry reputation. In this paper, we discuss classification from Positive and Unlabeled data (PU classification), where the only class for which labeled instances are available is the rare event. In iCASSTLE, we propose a two-stage approach: Stage I leverages three unique components of text mining to procure representative training data containing instances of both classes in the right proportion, and Stage II uses the results of Stage I to run a semi-supervised classification. We applied iCASSTLE to multiple datasets that differ in the nature of the product-safety concern and in the nature of the imbalance, and it performed better than state-of-the-art methods for the relevant use cases.
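
The abstract does not specify the three text-mining components of Stage I or the classifier used in Stage II, so the following is only a minimal sketch of a generic two-stage PU pattern, not the authors' method. It assumes bag-of-words vectors, cosine similarity to a positive centroid for choosing likely negatives, and nearest-centroid self-training; the function names, the `ratio` and `margin` parameters, and their defaults are all hypothetical.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts for one document (assumed representation)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average term-count vector over a non-empty list of documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_reliable_negatives(positives, unlabeled, ratio=2):
    """Stage I (illustrative stand-in): treat the unlabeled documents least
    similar to the positive centroid as likely negatives, `ratio` per positive.
    Returns (likely_negatives, still_unlabeled)."""
    pos_centroid = centroid([vectorize(t) for t in positives])
    ranked = sorted(unlabeled, key=lambda t: cosine(vectorize(t), pos_centroid))
    n_neg = min(len(ranked), ratio * len(positives))
    return ranked[:n_neg], ranked[n_neg:]


def self_train(positives, negatives, remaining, margin=0.1, max_rounds=5):
    """Stage II (illustrative stand-in): nearest-centroid self-training.
    Each round, documents whose similarity gap exceeds `margin` are moved
    into the matching class; the rest stay unlabeled."""
    positives, negatives = list(positives), list(negatives)
    remaining = list(remaining)
    for _ in range(max_rounds):
        pos_c = centroid([vectorize(t) for t in positives])
        neg_c = centroid([vectorize(t) for t in negatives])
        still_unlabeled = []
        for text in remaining:
            v = vectorize(text)
            score = cosine(v, pos_c) - cosine(v, neg_c)
            if score > margin:
                positives.append(text)
            elif score < -margin:
                negatives.append(text)
            else:
                still_unlabeled.append(text)
        if len(still_unlabeled) == len(remaining):
            break  # no document was confidently labeled this round
        remaining = still_unlabeled
    return positives, negatives, remaining
```

The `ratio` argument is where a class proportion for the training set would be controlled in this sketch; how iCASSTLE actually determines "the right proportion" is not stated in the abstract.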