Selecting Key Features of Online Behaviour on South African Informative Websites Prior to Unsupervised Machine Learning

Judah Soobramoney, R. Chifurira, T. Zewotir
Statistics, Optimization & Information Computing, published 2022-10-21
DOI: 10.19139/soic-2310-5070-1139 (https://doi.org/10.19139/soic-2310-5070-1139)
Citations: 1

Abstract

The main aim of the study was to explore feature selection for online web data prior to fitting unsupervised machine learning models. At the time of writing, no literature could be found reporting the use of feature selection in this context. Features were selected by inspecting the variability of each feature and the association between features. The variability of numeric features was quantified using the variance, mean absolute difference and dispersion ratio metrics, whilst the coefficient of unalikeability was employed for categorical features. To quantify association, correlation matrices were used between numeric features, chi-squared tests of independence between categorical features, and box-and-whisker plots between mixed (numeric and categorical) features. The main findings showed that the variance, mean absolute difference, dispersion ratio and coefficient of unalikeability successfully highlighted features with very low variability within the observed data. The correlation matrix, chi-squared test of independence and box-and-whisker plots highlighted possible redundancy, natural relationships and insightful relationships between the features, thereby suggesting features to be considered for omission prior to unsupervised modelling. The proposed methods and findings can be applied to other applications of feature selection and exploration.
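The variability screen named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the abstract gives no formulas, so the mean absolute difference is taken here as the mean absolute deviation from the mean, the dispersion ratio as the arithmetic mean divided by the geometric mean, and the coefficient of unalikeability as one minus the sum of squared category proportions. The feature names and values are hypothetical.

```python
import math
from collections import Counter
from statistics import fmean, pvariance


def mean_absolute_difference(values):
    """Mean absolute deviation from the arithmetic mean (assumed definition)."""
    centre = fmean(values)
    return fmean(abs(v - centre) for v in values)


def dispersion_ratio(values):
    """Arithmetic mean divided by geometric mean (assumed definition).

    Needs strictly positive values; close to 1 when the feature barely varies.
    """
    if any(v <= 0 for v in values):
        raise ValueError("dispersion ratio needs strictly positive values")
    geometric_mean = math.exp(fmean(math.log(v) for v in values))
    return fmean(values) / geometric_mean


def unalikeability(labels):
    """Coefficient of unalikeability: 1 - sum of squared category proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Hypothetical web-session features, invented for illustration only.
numeric_features = {
    "pages_per_visit": [3, 3, 3, 3, 3, 3],       # constant: no variability
    "session_seconds": [12, 45, 300, 8, 95, 60],
}
categorical_features = {
    "device": ["mobile", "mobile", "mobile", "mobile", "mobile", "mobile"],
    "channel": ["search", "direct", "social", "search", "email", "direct"],
}

for name, values in numeric_features.items():
    print(
        name,
        "variance=%.3f" % pvariance(values),
        "mad=%.3f" % mean_absolute_difference(values),
        "dispersion_ratio=%.3f" % dispersion_ratio(values),
    )

for name, labels in categorical_features.items():
    print(name, "unalikeability=%.3f" % unalikeability(labels))
```

Features whose metrics sit at or near the minimum (variance and mean absolute difference near 0, dispersion ratio near 1, unalikeability near 0) carry little information for clustering and are candidates for omission. Any cut-off is a judgment call that the abstract does not specify.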