Regional clustering based on economic potential with a modified fuzzy k-prototypes algorithm for village developing target determination
Pub Date: 2022-01-20 | DOI: 10.14710/jtsiskom.2021.14247
Hermawan Prasetyo
Clustering algorithms can group regions by economic potential using mixed-attribute data consisting of numeric and categorical attributes. This study aims to group villages according to their economic potential, to support the determination of village development targets in Demak Regency, using the fuzzy k-prototypes algorithm with a modified Eskin distance to measure the distance between categorical attributes. The data used are the PODES 2018 data and the 2019 Wilkerstat Mapping. The clustering produces three village clusters according to economic potential: low, medium, and high. The clusters with high economic potential lie along the main Semarang–Kudus and Semarang–Grobogan transportation routes. However, some villages on these main routes still fall into the low economic cluster; considering the urban/rural classification, most of these villages are categorized as urban. The clustering results can be used to determine village development targets for increasing the Village Developing Index in Demak Regency.
{"title":"Regional clustering based on economic potential with a modified fuzzy k-prototypes algorithm for village developing target determination","authors":"Hermawan Prasetyo","doi":"10.14710/jtsiskom.2021.14247","DOIUrl":"https://doi.org/10.14710/jtsiskom.2021.14247","url":null,"abstract":"The clustering algorithm can group regions based on economic potential with mixed attributes data, consisting of numeric and categorical data. This study aims to group villages according to their economic potential in determining village development targets in Demak Regency using the fuzzy k-prototypes algorithm and modified Eskin distance to measure the distance of categorical attributes. The data used are PODES2018 data and the 2019 Wilkerstat Mapping. Village clustering produces three village clusters according to their economic potential, namely low, medium, and high economic clusters. Clusters of high economic potential are located on the main transportation routes of Semarang–Kudus and Semarang–Grobogan. However, villages on the main transportation route are still included in the low economic cluster. Considering the status of the urban/rural village classification, most of these villages are included in the urban village category. The results of this clustering can be used to determine village development targets in increasing the Village Developing Index in Demak Regency.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45630957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Classification of beneficiaries for the rehabilitation of uninhabitable houses using the K-Nearest Neighbor algorithm
Pub Date: 2022-01-20 | DOI: 10.14710/jtsiskom.2021.14110
An-Naas Shahifatun Na’iema, Harminto Mulyo, Nur Aeni Widiastuti
The number of registrants for rehabilitation programs for uninhabitable houses increases every year. Processing this large volume of registrant data can lead to inaccuracies and takes a long time to determine which houses are uninhabitable (RTLH) and which are not (non-RTLH). This study aims to apply the K-Nearest Neighbor algorithm to classify the eligibility of recipients of uninhabitable-house rehabilitation assistance. The data used were 1289 records with 13 attributes from the Jepara Regency Public Housing and Settlement Service. Data processing consisted of attribute selection, categorization, outlier removal, and data normalization, followed by application of the method. The proposed system achieves its best classification at k = 5, with 97.93% accuracy, 96.88% precision, 99.53% recall, and an AUC of 0.964.
{"title":"Classification of beneficiaries for the rehabilitation of uninhabitable houses using the K-Nearest Neighbor algorithm","authors":"An-Naas Shahifatun Na’iema, Harminto Mulyo, Nur Aeni Widiastuti","doi":"10.14710/jtsiskom.2021.14110","DOIUrl":"https://doi.org/10.14710/jtsiskom.2021.14110","url":null,"abstract":"The registrars for rehabilitation programs for uninhabitable settlements are increasing every year. The large data processing of registrants may result in inaccuracies and need a long time to determine livable houses (RTLH) and unfit for habitation (non RTLH). This study aims to apply the K-Nearest Neighbor algorithm in classifying the eligibility of recipients of uninhabitable house rehabilitation assistance. The data used in this study were 1289 data with 13 attributes from the Jepara Regency Public Housing and Settlement Service. Data processing begins with attribute selection, categorization, outlier data cleaning, and data normalization and method application. The proposed system has the best classification at k of 5 with an accuracy of 97.93%, 96.88% precision, 99.53% recall, and an AUC value of 0.964.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48430069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
River water level measurement system using Sobel edge detection method
Pub Date: 2022-01-20 | DOI: 10.14710/jtsiskom.2021.14119
Faiz Miftakhur Rozaqi, W. Wahyono
Flooding is a natural disaster that occurs frequently in Indonesia, so a flood warning system is needed to reduce the losses it causes. In this study, a Sobel edge detection-based framework is proposed to measure the river water level, which is expected to serve as an early flood warning system. Sobel edge detection is used to locate the edge of the water surface; the pixel position of that edge is then extracted, and the height is calculated by relating the image to actual conditions. Tests of the system implemented on a prototype show that it achieves an RMSE of less than 0.6986 mm and runs at 12 fps, so that in the future it can be deployed directly on rivers.
{"title":"River water level measurement system using Sobel edge detection method","authors":"Faiz Miftakhur Rozaqi, W. Wahyono","doi":"10.14710/jtsiskom.2021.14119","DOIUrl":"https://doi.org/10.14710/jtsiskom.2021.14119","url":null,"abstract":"Flood is a natural disaster that often occurs in Indonesia. Therefore, a flood warning system is required to reduce the number of losses due to flooding. In this study, a Sobel edge detection-based framework is proposed to measure the river water level, which is expected to be used as an early flood warning system. Sobel edge detection is used to determine the edge of the water surface, which is then taken by the position of the pixels, and the height is calculated by comparing the image with actual conditions. The test results of the system implemented on the prototype show that this system has an RMSE less than 0.6986 mm and can run at 12 fps which in the future can be implemented directly on rivers.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46822175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TATOPSIS: A decision support system for selecting a major in university with a two-way approach and TOPSIS
Pub Date: 2022-01-20 | DOI: 10.14710/jtsiskom.2021.14074
Dewi Wardani, Widyaswari Mahayanti, Haryono Setiadi, Maria Ulfa, E. Wihidayat
Students often face problems when they feel they have chosen the wrong major at university; choosing a major is itself one of the problems students frequently encounter. Therefore, this study aims to develop a Decision Support System (DSS) to help students find majors that match their interests and abilities. The DSS takes a two-way approach, considering both the students and the majors' requirements, standards, and characteristics. Because it utilizes the TOPSIS method, it is called TATOPSIS, which stands for Two-way Approach TOPSIS. The evaluation showed that the two-way approach in Scenario 1 (without score normalization) and Scenario 3 (with score normalization) yields better agreement, 78.33% and 73.33% respectively, than the two-way approach in Scenario 2 and Scenario 4 and the student-only one-way approaches.
{"title":"TATOPSIS: A decision support system for selecting a major in university with a two-way approach and TOPSIS","authors":"Dewi Wardani, Widyaswari Mahayanti, Haryono Setiadi, Maria Ulfa, E. Wihidayat","doi":"10.14710/jtsiskom.2021.14074","DOIUrl":"https://doi.org/10.14710/jtsiskom.2021.14074","url":null,"abstract":"Several problems can occur when students feel they have made the wrong choice of major in university. Choosing a major is one of the problems that students often face. Therefore, this study aims to develop a Decision Support System (DSS) to help students find majors that match their interests and abilities. This DSS proposes a two-way approach by considering students and the major's requirements, standards, and characteristics. The DSS utilizes the TOPSIS method; therefore, it is called TATOPSIS, which stands for Two-way Approach TOPSIS. It showed that the two-way approach in Scenario 1 (without score normalization) and Scenario 3 (with score normalization) shows better agreement results in 78.33% and 73.33% than the two-way approach for Scenario 2, Scenario 4, and the student-one-way approaches.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45043425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SVM optimization using a grid search algorithm to identify robusta coffee bean images based on circularity and eccentricity
Pub Date: 2022-01-04 | DOI: 10.14710/jtsiskom.2021.13807
Herlina Apriani, J. Jaman, Riza Ibnu Adam
Coffee variety is one of the main factors affecting the quality and price of coffee, so recognizing coffee varieties is important. This study aims to optimize the recognition of robusta coffee beans based on circularity and eccentricity image features using a support vector machine (SVM) and a grid search algorithm. The methods used include image acquisition, preprocessing, feature extraction, classification, and evaluation. Circularity and eccentricity are used in the feature extraction process, while the grid search algorithm is used to optimize the SVM parameters in the classification process for four different kernels. The study produced its best classification model, with the highest accuracy of 94%, for the RBF and polynomial kernels.
{"title":"SVM optimization using a grid search algorithm to identify robusta coffee bean images based on circularity and eccentricity","authors":"Herlina Apriani, J. Jaman, Riza Ibnu Adam","doi":"10.14710/jtsiskom.2021.13807","DOIUrl":"https://doi.org/10.14710/jtsiskom.2021.13807","url":null,"abstract":"Coffee variety is one of the main factors affecting the quality and price of coffee, so it is important to recognize coffee varieties. This study aims to optimize the recognition of robusta coffee beans based on circularity and eccentricity image features using a support vector machine (SVM) and Grid search algorithm. The methods used included image acquisition, preprocessing, feature extraction, classification, and evaluation. Circularity and eccentricity are used in the feature extraction process, while the grid search algorithm is used to optimize SVM parameters in the classification process for four different kernels. This study produced the best classification model with the highest accuracy of 94% for the RBF and Polynomial kernels.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47933388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequence-based prediction of protein-protein interaction using autocorrelation features and machine learning
Pub Date: 2022-01-04 | DOI: 10.14710/jtsiskom.2021.13984
Syahid Abdullah, W. Kusuma, S. Wijaya
Protein-protein interactions (PPI) can help define a protein's function through the protein's position in a complex network of protein interactions. The number of PPIs that have been identified is relatively small; therefore, several studies have attempted to predict PPI from protein sequence information. This research compares the performance of three autocorrelation methods, Moran, Geary, and Moreau-Broto, for extracting protein sequence features to predict PPI. The features from the three extraction methods are then fed to three machine learning algorithms: k-Nearest Neighbor (KNN), Random Forest (RF), and Support Vector Machine (SVM). The prediction models built with the three autocorrelation methods achieve high average accuracy: 95.34% for Geary with KNN, 97.43% for Geary with RF, and 97.11% for Geary and Moran with SVM. In addition, interacting protein pairs tend to have similar autocorrelation characteristics. Thus, the autocorrelation methods can be used to predict PPI well.
{"title":"Sequence-based prediction of protein-protein interaction using autocorrelation features and machine learning","authors":"Syahid Abdullah, W. Kusuma, S. Wijaya","doi":"10.14710/jtsiskom.2021.13984","DOIUrl":"https://doi.org/10.14710/jtsiskom.2021.13984","url":null,"abstract":"Protein-protein interaction (PPI) can define a protein's function by knowing the protein's position in a complex network of protein interactions. The number of PPIs that have been identified is relatively small. Therefore, several studies were conducted to predict PPI using protein sequence information. This research compares the performance of three autocorrelation methods: Moran, Geary, and Moreau-Broto, in extracting protein sequence features to predict PPI. The results of the three extractions are then applied to three machine learning algorithms, namely k-Nearest Neighbor (KNN), Random Forest, and Support Vector Machine (SVM). The prediction models with the three autocorrelation methods can produce predictions with high average accuracy, which is 95.34% for Geary in KNN, 97.43% for Geary in RF, and 97.11% for Geary and Moran in SVM. In addition, the interacting protein pairs tend to have similar autocorrelation characteristics. Thus, the autocorrelation method can be used to predict PPI well.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45363200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data scaling performance on various machine learning algorithms to identify abalone sex
Pub Date: 2021-08-10 | DOI: 10.14710/jtsiskom.2021.14105
Willdan Aprizal Arifin, I. Ariawan, A. A. Rosalia, L. Lukman, Nabila Tufailah
This study analyzes the performance of machine learning algorithms combined with data scaling to assess the effectiveness of each method. It applies min-max (normalization) and zero-mean (standardization) scaling techniques to the abalone dataset. The stages carried out in this study included scaling the abalone physical measurement features, and model evaluation was performed using 10-fold cross-validation. The scaled data were fed to the machine learning algorithms Random Forest, Naïve Bayes, Decision Tree, and SVM (with RBF and linear kernels). On the eight features of the abalone dataset, data scaling did not strongly influence the machine learning algorithms: SVM performance increases with scaling, while Random Forest performance decreases. Random Forest achieves the highest average balanced accuracy (74.87%) without data scaling.
{"title":"Data scaling performance on various machine learning algorithms to identify abalone sex","authors":"Willdan Aprizal Arifin, I. Ariawan, A. A. Rosalia, L. Lukman, Nabila Tufailah","doi":"10.14710/jtsiskom.2021.14105","DOIUrl":"https://doi.org/10.14710/jtsiskom.2021.14105","url":null,"abstract":"This study aims to analyze the performance of machine learning algorithms with the data scaling process to show the method's effectiveness. It uses min-max (normalization) and zero-mean (standardization) data scaling techniques in the abalone dataset. The stages carried out in this study included data normalization on the data of abalone physical measurement features. The model evaluation was carried out using k-fold cross-validation with the number of k-fold 10. Abalone datasets were normalized in machine learning algorithms: Random Forest, Naïve Bayesian, Decision Tree, and SVM (RBF kernels and linear kernels). The eight features of the abalone dataset show that machine learning algorithms did not too influence data scaling. There is an increase in the performance of SVM, while Random Forest decreases when the abalone dataset is applied to data scaling. Random Forest has the highest average balanced accuracy (74.87%) without data scaling.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43092386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-time currency recognition on video using AKAZE algorithm
Pub Date: 2021-07-18 | DOI: 10.14710/JTSISKOM.2021.13970
F. Adhinata, R. Adhitama, A. Segara
Currency recognition is essential because everyone, in any country, must handle money; computer vision has therefore been developed to recognize currency. One existing approach to currency recognition uses the SIFT algorithm. Its recognition results are very accurate, but processing takes a considerable amount of time, making it impractical for real-time data such as video. The AKAZE algorithm was developed for real-time data processing because of its fast computation when processing video frames. This study proposes a faster real-time currency recognition system for video using the AKAZE algorithm. The purpose of the study is to compare the SIFT and AKAZE algorithms on real-time video data in terms of F1 value and speed. Based on the experimental results, the AKAZE algorithm yields an F1 value of 0.97 and processes each video frame in 0.251 seconds. At the same video resolution, the SIFT algorithm yields an F1 value of 0.65 and takes 0.305 seconds per frame. These results show that the AKAZE algorithm is faster and more accurate in processing video data.
{"title":"Real-time currency recognition on video using AKAZE algorithm","authors":"F. Adhinata, R. Adhitama, A. Segara","doi":"10.14710/JTSISKOM.2021.13970","DOIUrl":"https://doi.org/10.14710/JTSISKOM.2021.13970","url":null,"abstract":"Currency recognition is one of the essential things since everyone in any country must know money. Therefore, computer vision has been developed to recognize currency. One of the currency recognition uses the SIFT algorithm. The recognition results are very accurate, but the processing takes a considerable amount of time, making it impossible to run for real-time data such as video. AKAZE algorithm has been developed for real-time data processing because of its fast computation time to process video data frames. This study proposes the faster real-time currency recognition system on video using the AKAZE algorithm. The purpose of this study is to compare the SIFT and AKAZE algorithms related to a real-time video data processing to determine the value of F1 and its speed. Based on the experimental results, the AKAZE algorithm is resulting F1 value of 0.97, and the processing speed on each video frame is 0.251 seconds. Then at the same video resolution, the SIFT algorithm results in an F1 value of 0.65 and a speed of 0.305 seconds to process one frame. These results show that the AKAZE algorithm is faster and more accurate in processing video data.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46003704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced image security using residue number system and new Arnold transform
Pub Date: 2021-07-18 | DOI: 10.14710/JTSISKOM.2021.14038
A. N. Babatunde, Afeez Adeshina Oke, A. A. Oloyede, Aisha Oiza Bello
This paper aims to improve the image scrambling and encryption effect of the traditional two-dimensional discrete Arnold transform by introducing a residue number system (RNS) with three moduli together with a new Arnold transform. The study focuses on improving the classical discrete Arnold transform, which has quasi-affine properties, and applying it to image scrambling and encryption. The design of the method is specific to the moduli set {2^n, 2^(n+1)+1, 2^(n+1)-1}; this set contains balanced moduli of similar magnitude, which allows an efficient residue-to-binary converter. The study employs an arithmetic residue-to-binary converter and an improved Arnold transformation algorithm. The encryption process, implemented in MATLAB, accepts a digital image as input and converts it into an RNS representation; the residue images are then combined as a group, and the encrypted image is produced with the Arnold transformation algorithm. At decryption, the encrypted image is processed with the anti-Arnold (reverse Arnold) transformation algorithm to recover the original RNS representation (the original pixel values), and the RNS is then converted back to its binary form. Security analysis tests, such as histogram analysis, keyspace, key sensitivity, and correlation coefficient analysis, were applied to the encrypted image. The results show that the hybrid system can use the improved Arnold transform algorithm with better security and no constraint on image width and size.
{"title":"Enhanced image security using residue number system and new Arnold transform","authors":"A. N. Babatunde, Afeez Adeshina Oke, A. A. Oloyede, Aisha Oiza Bello","doi":"10.14710/JTSISKOM.2021.14038","DOIUrl":"https://doi.org/10.14710/JTSISKOM.2021.14038","url":null,"abstract":"This paper aims to improve the image scrambling and encryption effect in traditional two-dimensional discrete Arnold transform by introducing a new Residue number system (RNS) with three moduli and the New Arnold Transform. The study focuses on improving the classical discrete Arnold transform with quasi-affine properties, applying image scrambling and encryption research. The design of the method is explicit to three moduli set {2n, 2n+1+1, 2n+1-1}. These moduli set includes equalized and shapely moduli leading to the effective execution of the residue to binary converter. The study employs an arithmetic residue to the binary converter and an improved Arnold transformation algorithm. The encryption process uses MATLAB to accept a digital image input and subsequently convert the image into an RNS representation. The images are connected as a group. The resulting encrypted image uses the Arnold transformation algorithm. The encrypted image is used as input at decryption using the anti-Arnold (Reverse Arnold) transformation algorithm to convert the picture to the original RNS (original pixel value). Then the RNS was used to retransform the original RNS to its binary form. Security analysis tests, like histogram analysis, keyspace, key sensitivity, and correlation coefficient analysis, were administered on the encrypted image. Results show that the hybrid system can use the improved Arnold transform algorithm with better security and no constraint on image width and size.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47666119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Malicious URLs detection using data streaming algorithms
Pub Date: 2021-07-09 | DOI: 10.14710/JTSISKOM.2021.13965
K. Adewole, Muiz O. Raheem, M. Abdulraheem, I. D. Oladipo, A. Balogun, Omotola Fatimah Baker
As a result of advances in technology and technological devices, data is now generated at an enormous rate from a vast array of networks, devices, and daily operations such as credit card transactions and mobile phones. A data stream consists of sequential, continuous, real-time data in the form of an evolving stream. The traditional machine learning approach, however, is characterized by a batch learning model: labeled training data are given a priori to train a model with some machine learning algorithm. This technique requires the entire training sample to be available before learning begins, and the training procedure is mainly done offline in this setting because of the high training cost. Consequently, the traditional batch learning technique suffers severe drawbacks, such as poor scalability for real-time phishing website detection, and the model usually requires re-training from scratch on new training samples. This paper presents the application of streaming algorithms to detect malicious URLs using selected online learners: Hoeffding Tree (HT), Naïve Bayes (NB), and OzaBag. OzaBag produced promising results in terms of accuracy, Kappa, and Kappa Temporal on the dataset with large samples, while HT and NB had the shortest prediction time with accuracy and Kappa comparable to OzaBag for the real-time detection of phishing websites.
{"title":"Malicious URLs detection using data streaming algorithms","authors":"K. Adewole, Muiz O. Raheem, M. Abdulraheem, I. D. Oladipo, A. Balogun, Omotola Fatimah Baker","doi":"10.14710/JTSISKOM.2021.13965","DOIUrl":"https://doi.org/10.14710/JTSISKOM.2021.13965","url":null,"abstract":"As a result of advancements in technology and technological devices, data is now spawned at an infinite rate, emanating from a vast array of networks, devices, and daily operations like credit card transactions and mobile phones. Datastream entails sequential and real-time continuous data in the inform of evolving stream. However, the traditional machine learning approach is characterized by a batch learning model. Labeled training data are given apriori to train a model based on some machine learning algorithms. This technique necessitates the entire training sample to be readily accessible before the learning process. The training procedure is mainly done offline in this setting due to the high training cost. Consequently, the traditional batch learning technique suffers severe drawbacks, such as poor scalability for real-time phishing websites detection. The model mostly requires re-training from scratch using new training samples. This paper presents the application of streaming algorithms for detecting malicious URLs based on selected online learners: Hoeffding Tree (HT), Naïve Bayes (NB), and Ozabag. Ozabag produced promising results in terms of accuracy, Kappa and Kappa Temp on the dataset with large samples while HT and NB have the least prediction time with comparable accuracy and Kappa with Ozabag algorithm for the real-time detection of phishing websites.","PeriodicalId":56231,"journal":{"name":"Jurnal Teknologi dan Sistem Komputer","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44900728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}