K. P. N. V. Satya Sree, J. Karthik, Chava Niharika, P. Srinivas, N. Ravinder, Chitturi Prasad
Published in: 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), 2021-11-11
DOI: 10.1109/I-SMAC52330.2021.9640967 (https://doi.org/10.1109/I-SMAC52330.2021.9640967)
Optimized Conversion of Categorical and Numerical Features in Machine Learning Models
Some data are explicitly numerical, but many features, such as gender or nationality, are not naturally expressed as numbers; these are referred to as categorical data. Machine learning algorithms therefore need a numerical representation of categorical information before they can analyze it. Our project focuses on optimizing the conversion of categorical features to numerical form in order to maximize the effectiveness of various machine learning models. Among the methods evaluated, the wide and deep model was observed to be the most effective for datasets that contain high-cardinality features, compared with learned embeddings and one-hot encoding.
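To make the contrast between two of the encodings named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It encodes a toy nationality column both as one-hot vectors and as lookups into an embedding table. The column values, the helper names (`build_vocab`, `one_hot`, `make_embedding_table`, `embed`), and the embedding width of 2 are all assumptions made for illustration; the table is randomly initialised here, whereas in a real model its rows would be learned during training. The wide and deep model itself is not implemented.

```python
import random


def build_vocab(values):
    """Map each distinct category to an integer index, in order of first appearance."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab)
    return vocab


def one_hot(value, vocab):
    """One slot per category, so the vector width equals the feature's cardinality."""
    vec = [0.0] * len(vocab)
    vec[vocab[value]] = 1.0
    return vec


def make_embedding_table(vocab, dim, seed=0):
    """Hypothetical table of small random values; a trained model would learn these rows."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(len(vocab))]


def embed(value, vocab, table):
    """Look up the dense vector for a category; width is fixed regardless of cardinality."""
    return table[vocab[value]]


# Toy categorical column (assumed data, not taken from the paper).
nationalities = ["IN", "US", "FR", "IN", "JP", "US"]

vocab = build_vocab(nationalities)
table = make_embedding_table(vocab, dim=2)

encoded_one_hot = [one_hot(v, vocab) for v in nationalities]
encoded_embedded = [embed(v, vocab, table) for v in nationalities]

print(len(encoded_one_hot[0]), len(encoded_embedded[0]))
```

The one-hot width grows with the number of distinct categories, while the embedding width stays fixed; this difference is what matters for the high-cardinality features the abstract refers to.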