K. P. N. V. Satya Sree, J. Karthik, Chava Niharika, P. Srinivas, N. Ravinder, Chitturi Prasad
Title: Optimized Conversion of Categorical and Numerical Features in Machine Learning Models
Published in: 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)
Publication date: 2021-11-11
DOI: 10.1109/I-SMAC52330.2021.9640967
Citations: 2
Abstract
Some data are explicitly numerical, but many other attributes, such as gender or nationality, are not normally expressed as numbers and are referred to as categorical data. Machine learning algorithms therefore need a numerical representation of categorical information before they can analyze it. Our project focuses on optimizing the conversion of categorical features to numerical form in order to maximize the effectiveness of various machine learning models. Among the methods evaluated, the wide and deep model was observed to be the most effective for datasets that contain high-cardinality features, compared with learned embeddings and one-hot encoding.
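
To make the contrast between two of the encodings named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the helper names (`build_vocab`, `one_hot`, `init_embedding_table`, `embed`), the sample nationality codes, and the embedding size of 2 are all assumptions chosen for illustration. The embedding values are random placeholders; in a real model they would be learned during training. The wide and deep model itself is not shown.

```python
import random


def build_vocab(values):
    """Map each distinct category to an integer index, in first-seen order."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab)
    return vocab


def one_hot(value, vocab):
    """Return a vector of length len(vocab) with a single 1 at the category's index."""
    vec = [0] * len(vocab)
    vec[vocab[value]] = 1
    return vec


def init_embedding_table(vocab, dim, seed=0):
    """Create one dense vector of length `dim` per category.

    Values are random placeholders (assumption); a trained model would
    update them jointly with its other parameters.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in vocab]


def embed(value, vocab, table):
    """Look up the dense vector for a category."""
    return table[vocab[value]]


# Hypothetical sample column of a categorical feature.
nationalities = ["IN", "US", "FR", "IN", "JP", "US"]
vocab = build_vocab(nationalities)
table = init_embedding_table(vocab, dim=2)

print(one_hot("FR", vocab))            # vector length equals the number of categories
print(len(embed("FR", vocab, table)))  # vector length is fixed at `dim`
```

The one-hot vector grows by one dimension for every new category, whereas the embedding width stays fixed, which is why high-cardinality features are the case where the choice of encoding matters most.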