Performance Analysis of Isolation Forest Algorithm in Fraud Detection of Credit Card Transactions

Khazanah Informatika : Jurnal Ilmu Komputer dan Informatika, Pub Date: 2020-10-27, DOI: 10.23917/khif.v6i2.10520

I. Waspada, N. Bahtiar, P. W. Wirawan, Bagus Dwi Ari Awan


Citations: 4



Performance Analysis of Isolation Forest Algorithm in Fraud Detection of Credit Card Transactions

Losses from fraud in e-commerce transactions, especially credit card transactions, continue to grow every year. One way to reduce the risk of fraudulent credit card transactions is to apply a detection technique to ongoing transactions. Credit card transaction data in its original state is unlabeled, the share of fraud in the training data is very small, which makes the data highly imbalanced, and fraud patterns keep changing. Isolation Forest is an unsupervised algorithm that detects anomalies efficiently. Previous studies analyzed the performance of Isolation Forest with the ROC-AUC metric, which can give misleading information. This study makes two contributions. The first is a performance analysis using both ROC-AUC and AUCPR, which shows that a high ROC-AUC value does not guarantee that the model detects fraud reliably; AUCPR describes the model's ability to capture fraudulent data more appropriately. The second is a set of techniques for improving the performance of the Isolation Forest model: optimizing the amount of training data, feature selection, the amount of fraud contamination, and the hyper-parameter settings in the modeling (training) stage. Experiments were carried out on a real-life dataset from ULB. The best results were obtained with a validation data split ratio of 60:40, the five most important features, only 60% of the fraud data, and hyper-parameters of 100 trees, a maximum of 128 samples, and a contamination of 0.001. On validation, this model achieves precision 0.809917, recall 0.710145, F1-score 0.756757, ROC-AUC 0.969728, and AUCPR 0.637993; on testing, it achieves precision 0.807143, recall 0.763514, F1-score 0.784722, ROC-AUC 0.97371, and AUCPR 0.759228.
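The gap between ROC-AUC and AUCPR on highly imbalanced data can be illustrated with a small sketch. It is not the authors' code and does not use the ULB dataset: the ranking below is synthetic, the helper names `roc_auc` and `average_precision` are hypothetical, and average precision is assumed here as one common estimate of AUCPR (the paper may compute it differently). The toy data places 10 frauds just behind the 100 highest-scored legitimate transactions out of 10,000.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random fraud outscores a random legitimate case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision values at the rank of each fraud.

    Assumes all scores are distinct (true for the toy data below).
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` items
    return total / hits


# Synthetic ranking (an assumption for illustration only):
# 100 legitimate transactions receive the highest anomaly scores,
# then the 10 frauds, then the remaining 9,900 legitimate transactions.
labels = [0] * 100 + [1] * 10 + [0] * 9900
n = len(labels)
scores = [1.0 - i / n for i in range(n)]  # strictly decreasing, all distinct

roc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC = {roc:.4f}")
print(f"AUCPR (average precision) = {ap:.4f}")
```

In this toy ranking every fraud outscores 9,900 of the 10,000 legitimate transactions, so ROC-AUC is 0.99, yet the average precision stays near 0.05 because each fraud is preceded by at least 100 false alarms. This is the kind of discrepancy that motivates reporting AUCPR alongside ROC-AUC.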

