Bhanu Prakash Reddy Banda, Bianca Govan, K. Roy, Kelvin S. Bryant
DOI: 10.1109/icABCD59051.2023.10220515 (https://doi.org/10.1109/icABCD59051.2023.10220515). Published: 2023-08-03.
Malware detection using Explainable ML models based on Feature Extraction using API calls
Malware attacks have become a critical problem in modern life. From 2015 to 2021, about 56.1 billion malware attacks took place worldwide. A malware attack typically costs a business over 2.5 million dollars to remediate. According to Cybersecurity Ventures, the cost of cybercrime is expected to grow by 15% per year over the next five years, reaching 10.5 trillion USD annually by 2025, up from 3 trillion USD in 2015. There is a global epidemic of malware, and studies suggest that its effects are worsening. Malware detectors are the main defense against malware. It is therefore crucial to research malware detection methods to better understand their advantages and disadvantages. This research focuses on an Application Programming Interface (API) call-based malware detection strategy that uses Machine Learning to further improve malware detection. The limitations that motivated this project were the lack of datasets containing recent malware samples and the lack of methods that detect malware with good accuracy. The main goal of this research is to understand malware behavior on the Windows platform, use dynamic analysis to identify features that exhibit dangerous code patterns in malware samples, and employ malware and benign samples to construct and validate machine learning-based malware detection models. The data was gathered from publicly accessible sites and sampled using a sandbox approach. Machine Learning models were built using the new dataset. Supervised Learning models and Deep Learning models were applied to the dataset, and the results were compared and cross-checked to find the best-fit model.
This investigation demonstrated that a high-precision malware detection capability can be established by combining API calls with Machine Learning models. The strategy yielded a malware detection accuracy of 88.26% with the XGBoost model and 90.70% with the MLP classifier on Windows-based platforms. We used Explainable Machine Learning, namely the SHapley Additive exPlanations (SHAP) value method, to show which components or features are responsible for the model's predictions.
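The two ideas in the abstract, turning an API-call trace into features and attributing a prediction to those features with Shapley values, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four-name vocabulary, the example trace, and the hand-set linear weights are hypothetical and stand in for the paper's dataset and its trained XGBoost and MLP models. The Shapley values are computed exactly by enumerating coalitions, which is only feasible for a handful of features; the SHAP library uses faster model-specific or approximate methods.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature vocabulary of Windows API names (illustrative only).
VOCAB = ["CreateFileW", "WriteProcessMemory", "CreateRemoteThread", "RegSetValueExW"]

def extract_features(trace):
    """Turn a sandbox API-call trace (list of names) into a count vector over VOCAB."""
    return [trace.count(name) for name in VOCAB]

# Hypothetical hand-set weights standing in for a trained classifier.
WEIGHTS = [0.1, 1.5, 2.0, 0.4]

def score(x):
    """Toy linear score; the paper's models are XGBoost and an MLP, not this."""
    return sum(w * v for w, v in zip(WEIGHTS, x))

def shapley_values(x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(x)

    def value(coalition):
        return score([x[i] if i in coalition else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical trace, as a sandbox run might record it.
trace = ["CreateFileW", "WriteProcessMemory", "WriteProcessMemory", "CreateRemoteThread"]
x = extract_features(trace)
baseline = [0, 0, 0, 0]  # reference point: a sample that makes none of these calls
phi = shapley_values(x, baseline)
for name, contribution in zip(VOCAB, phi):
    print(f"{name}: {contribution:+.2f}")
```

The per-feature contributions sum to the difference between the score of the sample and the score of the baseline, which is the additivity property that SHAP explanations rely on.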
Big Data: COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; COMPUTER SCIENCE, THEORY & METHODS
CiteScore: 9.10
Self-citation rate: 2.20%
Articles published: 60
About the journal:
Big Data is the leading peer-reviewed journal covering the challenges and opportunities in collecting, analyzing, and disseminating vast amounts of data. The Journal addresses questions surrounding this powerful and growing field of data science and facilitates the efforts of researchers, business managers, analysts, developers, data scientists, physicists, statisticians, infrastructure developers, academics, and policymakers to improve operations, profitability, and communications within their businesses and institutions.
Spanning a broad array of disciplines focusing on novel big data technologies, policies, and innovations, the Journal brings together the community to address current challenges and enforce effective efforts to organize, store, disseminate, protect, manipulate, and, most importantly, find the most effective strategies to make this incredible amount of information work to benefit society, industry, academia, and government.
Big Data coverage includes:
Big data industry standards,
New technologies being developed specifically for big data,
Data acquisition, cleaning, distribution, and best practices,
Data protection, privacy, and policy,
Business interests from research to product,
The changing role of business intelligence,
Visualization and design principles of big data infrastructures,
Physical interfaces and robotics,
Social networking advantages for Facebook, Twitter, Amazon, Google, etc,
Opportunities around big data and how companies can harness it to their advantage.