Improvement of Moroccan Dialect Sentiment Analysis Using Arabic BERT-Based Models
Pub Date: 2024-02-01 | DOI: 10.3844/jcssp.2024.157.167
Ghizlane Bourahouat, Manar Abourezq, N. Daoudi
Abstract: This study addresses the crucial task of sentiment analysis in natural language processing, with a particular focus on Arabic, especially dialectal Arabic, which has been relatively understudied due to its inherent challenges. Our approach centers on sentiment analysis in Moroccan Arabic, leveraging BERT models pre-trained on Arabic, namely AraBERT, QARIB, ALBERT, AraELECTRA, and CAMeLBERT. These models are combined with deep learning and machine learning algorithms, including SVM and CNN, and the pre-trained models are also fine-tuned directly. Furthermore, we examine the impact of data imbalance by evaluating the models on three distinct datasets: an unbalanced set, a balanced set obtained through under-sampling, and a balanced set created by combining the initial dataset with another unbalanced one. Notably, the proposed approach achieves 96% accuracy with the QARIB model, even on imbalanced data. The novelty of this research lies in applying pre-trained Arabic BERT models to Moroccan sentiment analysis and in exploring their combined use with CNN and SVM algorithms. Our findings reveal that fine-tuned BERT-based models outperform their use in conjunction with CNN or SVM, marking a significant advancement in sentiment analysis for Moroccan Arabic. The method's effectiveness is highlighted through a comparative analysis with state-of-the-art approaches, providing valuable insights that contribute to the advancement of sentiment analysis in Arabic dialects.
{"title":"Improvement of Moroccan Dialect Sentiment Analysis Using Arabic BERT-Based Models","authors":"Ghizlane Bourahouat, Manar Abourezq, N. Daoudi","doi":"10.3844/jcssp.2024.157.167","DOIUrl":"https://doi.org/10.3844/jcssp.2024.157.167","url":null,"abstract":": This study addresses the crucial task of sentiment analysis in natural language processing, with a particular focus on Arabic, especially dialectal Arabic, which has been relatively understudied due to inherent challenges. Our approach centers on sentiment analysis in Moroccan Arabic, leveraging BERT models that are pre-trained in the Arabic language, namely AraBERT, QARIB, ALBERT, AraELECTRA, and CAMeLBERT. These models are integrated alongside deep learning and machine learning algorithms, including SVM and CNN, with additional fine-tuning of the pre-trained model. Furthermore, we examine the impact of data imbalance by evaluating the models on three distinct datasets: An unbalanced set, a balanced set obtained through under-sampling, and a balanced set created by combining the initial dataset with another unbalanced one. Notably, our proposed approach demonstrates impressive accuracy, achieving a notable 96% when employing the QARIB model even on imbalanced data. The novelty of this research lies in the integration of pre-trained Arabic BERT models for Moroccan sentiment analysis, as well as the exploration of their combined use with CNN and SVM algorithms. Furthermore, our findings reveal that employing BERT-based models yields superior results compared to their application in conjunction with CNN or SVM, marking a significant advancement in sentiment analysis for Moroccan Arabic. Our method's effectiveness is highlighted through a comparative analysis with state-of-the-art approaches, providing valuable insights that contribute to the advancement of sentiment analysis in Arabic dialects","PeriodicalId":40005,"journal":{"name":"Journal of Computer Science","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139683801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimization of Expert System Based on Interpolation, Forward Chaining, and Certainty Factor for Diagnosing Abdominal Colic
Pub Date: 2024-02-01 | DOI: 10.3844/jcssp.2024.191.197
Hari Soetanto, Painem, Muhammad Kamil Suryadewiansyah
Abstract: Abdominal colic is a common condition affecting infants, and it can be difficult to diagnose because it shares many symptoms with other conditions, such as gastric disease and appendicitis. Limitations of existing diagnostic methods include the unreliability of physical examinations and medical histories and the high cost and time-consuming nature of imaging tests. This research proposes an expert system based on interpolation, forward chaining, and certainty factors for diagnosing abdominal colic, implemented as a web application. The forward chaining method is used to establish the expert system's rules, which are based on the symptoms and diseases in the system's knowledge base. The interpolation method is used to normalize lab results, and the certainty factor method is used to process medical histories and physical examinations, which is necessary because both can be imprecise. The expert system was tested on a dataset of 100 cases and correctly diagnosed 96 patients, a 96% accuracy rate. This suggests the system could provide a more accurate and efficient way to diagnose abdominal colic, which could lead to better patient outcomes.
{"title":"Optimization of Expert System Based on Interpolation, Forward Chaining, and Certainty Factor for Diagnosing Abdominal Colic","authors":"Hari Soetanto, Painem, Muhammad Kamil Suryadewiansyah","doi":"10.3844/jcssp.2024.191.197","DOIUrl":"https://doi.org/10.3844/jcssp.2024.191.197","url":null,"abstract":": Abdominal colic is a common condition that affects infants and it can be difficult to diagnose because it shares many symptoms with other conditions, such as gastric disease and appendicitis. Limitations of existing diagnostic methods include the unreliability of physical examinations and medical histories and the high cost and time-consuming nature of imaging tests. This research proposes an expert system based on interpolation, forward chaining, and certainty factors for diagnosing abdominal colic. This system has the potential to provide a more accurate and efficient way to diagnose abdominal colic, which could lead to better patient outcomes. This research proposes an expert system based on interpolation, forward chaining, and certainty factors for diagnosing abdominal colic. This system is implemented as a web application model. The forward chaining method is used to establish rules for the expert system. The rules are based on the symptoms and diseases that are included in the system's knowledge base. The interpolation method is used to normalize lab results and the certainty factor method is used to process medical history and physical examinations. This is necessary because medical history and physical examinations can be imprecise. The expert system was tested on a dataset of 100 cases and it was able to accurately diagnose 96 patients, achieving a 96% accuracy rate. This suggests that the expert system has the potential to provide a more accurate and efficient way to diagnose abdominal colic, which could lead to better patient outcomes.","PeriodicalId":40005,"journal":{"name":"Journal of Computer Science","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139687689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting Smartphone Addiction in Teenagers: An Integrative Model Incorporating Machine Learning and Big Five Personality Traits
Pub Date: 2024-02-01 | DOI: 10.3844/jcssp.2024.181.190
Jacobo Osorio, Marko Figueroa, Lenis Wong
Abstract: Smartphone addiction has emerged as a growing concern in society, particularly among teenagers, due to its potential negative impact on physical, emotional, and social well-being. Excessive smartphone use has consistently been associated with negative outcomes, reflecting a strong dependence on these devices that often harms mental health, including heightened levels of anxiety, distress, stress, and depression. This psychological burden can further result in the neglect of daily activities as individuals become increasingly engrossed in seeking pleasure through their smartphones. The aim of this study is to develop a predictive model utilizing machine learning techniques to identify smartphone addiction based on the Big Five Personality Traits (BFPT). The model was developed following five of the six phases of the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology: business understanding, data understanding, data preparation, modeling, and evaluation. To construct the database, data were collected from a school using the Big Five Inventory (BFI) and the Smartphone Addiction Scale (SAS) questionnaires. Four algorithms (DT, RF, XGB, and LG) were then employed, and the correlation between personality traits and addiction was examined. The analysis revealed a relationship between the traits of neuroticism and conscientiousness and smartphone addiction. The results showed that the RF algorithm achieved an accuracy of 89.7%, a precision of 87.3%, and the highest AUC value on the ROC curve. These findings highlight the effectiveness of the proposed model in accurately predicting smartphone addiction among adolescents.
{"title":"Predicting Smartphone Addiction in Teenagers: An Integrative Model Incorporating Machine Learning and Big Five Personality Traits","authors":"Jacobo Osorio, Marko Figueroa, Lenis Wong","doi":"10.3844/jcssp.2024.181.190","DOIUrl":"https://doi.org/10.3844/jcssp.2024.181.190","url":null,"abstract":": Smartphone addiction has emerged as a growing concern in society, particularly among teenagers, due to its potential negative impact on physical, emotional social well-being. The excessive use of smartphones has consistently shown associations with negative outcomes, highlighting a strong dependence on these devices, which often leads to detrimental effects on mental health, including heightened levels of anxiety, distress, stress depression. This psychological burden can further result in the neglect of daily activities as individuals become increasingly engrossed in seeking pleasure through their smartphones. The aim of this study is to develop a predictive model utilizing machine learning techniques to identify smartphone addiction based on the \"Big Five Personality Traits (BFPT)\". The model was developed by following five out of the six phases of the \"Cross Industry Standard Process for Data Mining (CRISP-DM)\" methodology, namely \"business understanding,\" \"data understanding,\" \"data preparation,\" \"modeling,\" and \"evaluation.\" To construct the database, data was collected from a school using the Big Five Inventory (BFI) and the Smartphone Addiction Scale (SAS) questionnaires. Subsequently, four algorithms (DT, RF, XGB LG) were employed the correlation between the personality traits and addiction was examined. The analysis revealed a relationship between the traits of neuroticism and conscientiousness with smartphone addiction. The results demonstrated that the RF algorithm achieved an accuracy of 89.7%, a precision of 87.3% the highest AUC value on the ROC curve. These findings highlight the effectiveness of the proposed model in accurately predicting smartphone addiction among adolescents","PeriodicalId":40005,"journal":{"name":"Journal of Computer Science","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139688077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flock Optimization Algorithm-Based Deep Learning Model for Diabetic Disease Detection Improvement
Pub Date: 2024-02-01 | DOI: 10.3844/jcssp.2024.168.180
Divager Balasubramaniyan, N. Husin, N. Mustapha, N. Sharef, T.N. Mohd Aris
Abstract: Worldwide, 422 million people suffer from diabetes, and 1.5 million die from it yearly. Diabetes remains a threat to people who fail to cure or manage it, so predicting the disease accurately is challenging. Existing systems face data over-fitting issues, convergence problems, non-converging optimization, complex predictions, and difficulties extracting latent and predominant features. These issues affect system performance and reduce diabetic disease detection accuracy. Hence, the research objective is to create an improved diabetic disease detection system using a Flock Optimization Algorithm-Based Deep Learning Model (FOADLM), a feature modeling approach that leverages the PIMA Indian dataset to predict and classify diabetic disease cases. The collected data is processed with a Gaussian filtering approach that eliminates irrelevant information, reducing over-fitting. The flock optimization algorithm is then applied to detect feature sequences, reducing convergence and optimization problems. Finally, a recurrent neural approach is applied to classify normal and abnormal features. The implementation was carried out in MATLAB, and the results are analyzed in terms of accuracy, precision, recall, computational time, reliability, scalability, and error measures such as root mean square error, mean square error, and correlation coefficients. In conclusion, the system achieved 99.23% accuracy in predicting diabetic disease on these metrics.
{"title":"Flock Optimization Algorithm-Based Deep Learning Model for Diabetic Disease Detection Improvement","authors":"Divager Balasubramaniyan, N. Husin, N. Mustapha, N. Sharef, T.N. Mohd Aris","doi":"10.3844/jcssp.2024.168.180","DOIUrl":"https://doi.org/10.3844/jcssp.2024.168.180","url":null,"abstract":": Worldwide, 422 million people suffer from diabetic disease, and 1.5 million die yearly. Diabetes is a threat to people who still fail to cure or maintain it, so it is challenging to predict this disease accurately. The existing systems face data over-fitting issues, convergence problems, non-converging optimization complex predictions, and latent and predominant feature extraction. These issues affect the system's performance and reduce diabetic disease detection accuracy. Hence, the research objective is to create an improved diabetic disease detection system using a Flock Optimization Algorithm-Based Deep Learning Model (FOADLM) feature modeling approach that leverages the PIMA Indian dataset to predict and classify diabetic disease cases. The collected data is processed by a Gaussian filtering approach that eliminates irrelevant information, reducing the overfitting issues. Then flock optimization algorithm is applied to detect the sequence; this process is used to reduce the convergence and optimization problems. Finally, the recurrent neural approach is applied to classify the normal and abnormal features. The entire research implementation result is carried out with the help of the MATLAB program and the results are analyzed with accuracy, precision, recall, computational time, reliability scalability, and error rate measures like root mean square error, mean square error, and correlation coefficients. In conclusion, the system evaluation result produced 99.23% accuracy in predicting diabetic disease with the metrics.","PeriodicalId":40005,"journal":{"name":"Journal of Computer Science","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139684102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Impact of Information Reliability and Cloud Computing Efficiency on Website Design and E-Commerce Business in Thailand
Pub Date: 2024-02-01 | DOI: 10.3844/jcssp.2024.198.206
Charuay Savithi, Arisaphat Suttidee
Abstract: The security and reliability of cloud computing services continue to be major concerns that hinder their widespread adoption. This study explores how information reliability and cloud computing efficiency influence website design and e-commerce business development decisions on cloud computing. The researchers distributed 379 questionnaires, yielding 186 responses for a response rate of 46.50%. Various statistical tests, including the t-test, the F-test (ANOVA and MANOVA), multiple correlation analysis, and multiple regression analysis, were used to analyze the collected data. The results show a positive correlation and influence between information reliability, specifically confidentiality, stability, and verifiability, and the decision to design and develop websites. Furthermore, the efficiency of cloud computing, particularly in communication and processing, shows a positive relationship with and impact on website design and development. These findings highlight how important it is for e-commerce business leaders to understand information reliability and cloud computing efficiency: recognizing these factors can enhance their competitive advantage in the e-commerce industry and foster consistent and sustainable growth. The research also highlights the contribution of cloud technology and security to increasing confidence in the development of e-commerce businesses.
{"title":"The Impact of Information Reliability and Cloud Computing Efficiency on Website Design and E-Commerce Business in Thailand","authors":"Charuay Savithi, Arisaphat Suttidee","doi":"10.3844/jcssp.2024.198.206","DOIUrl":"https://doi.org/10.3844/jcssp.2024.198.206","url":null,"abstract":": The security and reliability of cloud computing services continue to be major concerns that hinder their widespread adoption. This study explores how information reliability and cloud computing efficiency influence website design and e-commerce business development decisions on cloud computing. The researchers distributed 379 questionnaires to determine the sample size, resulting in a 46.50% response rate of 46.50% with 186 participants. Various statistical tests, including the t-test, the f-test (ANOVA and MANOVA), multiple correlation analysis and multiple regression analysis, are used to analyses the collected data. The results of the study show a positive correlation and influence between the reliability of information, specifically in terms of confidentiality, stability and verifiability and the decision to design and develop websites. Furthermore, the efficiency of cloud computing, particularly in communication and processing, demonstrates a positive relationship and impact on website design and development. These findings highlight the importance for e-commerce business leaders to understand the importance of information reliability and cloud computing efficiency. Recognizing these factors can enhance their competitive advantage in the e-commerce industry and foster consistent and sustainable growth. Research also highlights the contribution of cloud technology and security to increasing confidence in the development of e-commerce businesses.","PeriodicalId":40005,"journal":{"name":"Journal of Computer Science","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139685595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Algorithm for Earthquake Prediction Using Machine Learning Methods
Pub Date: 2024-02-01 | DOI: 10.3844/jcssp.2024.150.156
N. Jarah, Abbas Hanon Hassin Alasadi, K. M. Hashim
Abstract: Earthquakes are among the most dangerous natural disasters people face, because they strike without prior warning and affect lives and property. In addition, to support future disaster-prevention measures for major earthquakes, it is necessary to predict earthquakes using Neural Networks (NN). A machine learning technique was developed to predict earthquakes from ground-controller data by measuring ground vibration and transmitting the data over a sensor network. The data were processed and recorded in a catalog of seismic events from 1900-2019 for Iraq and neighboring regions, then divided into 80% training data and 20% test data. The NN model gave better seismic prediction results than the other machine learning methods evaluated.
{"title":"A New Algorithm for Earthquake Prediction Using Machine Learning Methods","authors":"N. Jarah, Abbas Hanon Hassin Alasadi, K. M. Hashim","doi":"10.3844/jcssp.2024.150.156","DOIUrl":"https://doi.org/10.3844/jcssp.2024.150.156","url":null,"abstract":": Seismic tremors are among the foremost perilous normal fiascos individuals confront due to their event without earlier caution and their effect on their lives and properties. In expansion, to consider future disaster prevention measures for major earthquakes, it is necessary to predict earthquakes using Neural Networks (NN). A machine learning technique has developed a technology to predict earthquakes from ground controller data by measuring ground vibration and transmitting data by a sensor network. Devices to process this data and record it in a catalog of seismic data from 1900-2019 for Iraq and neighboring regions, then divide this data into 80% training data and 20% test data. It gave better results than other prediction algorithms, where the NN model performs better Seismic prediction than other machine learning methods.","PeriodicalId":40005,"journal":{"name":"Journal of Computer Science","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139685506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine Learning Oceanographic Data for Prediction of the Potential of Marine Resources
Pub Date: 2024-02-01 | DOI: 10.3844/jcssp.2024.129.139
Denny Arbahri, O. Nurhayati, Imam Mudita
Abstract: Marine data and information are very important for human survival, and they are attractive to investors because of their potential economic value. Such data have been difficult to obtain; the solution adopted here is to analyze oceanographic data for 2009-2019 collected from the marine database of the Agency for the Study and Application of Technology (BPPT). The data are the result of collaborative marine surveys between Indonesian researchers and foreign researchers from various countries sailing in various Indonesian waters. Raw oceanographic data were converted and classified into Conductivity, Temperature, and Depth (CTD) data, oceanographic parameters identified as mutually correlated predictor variables (X). The CTD data were processed into labeled numeric attributes for input and training. The data were modeled using supervised machine learning with the Decision Tree (DT), Linear Regression (LR), and Random Forest (RF) algorithms, interpreted according to the characteristics of the CTD data. Each model was then evaluated with accuracy metrics measuring the difference between predicted and actual values to obtain a good prediction model. The prediction results show a salinity level of 34.0 parts per thousand (ppt), meaning that in this area of marine waters salinity will affect the solubility of oxygen (O2) and play a major role in the sustainability and growth of biological resources, supported by a sea surface temperature of 29.2°C. The salinity values obtained using ML techniques and the marine resource potential can therefore be assumed to be strongly correlated. The results show that the RF model has the lowest prediction error: Mean Square Error (MSE) = 0.007, Root Mean Squared Error (RMSE) = 0.082, and Mean Absolute Error (MAE) = 0.007, compared to the DT model (MSE = 0.008, RMSE = 0.088, MAE = 0.012) and the LR model (MSE = 1.008, RMSE = 1.004, MAE = 0.281). The RF and DT models both have a coefficient of determination (R²) = 0.999, meaning they predict well, compared to the LR model with R² = 0.914. The correlation between variables shows that the LR model is highly linear, with a correlation coefficient (r) = 1.000, compared to the DT model (r = 0.621) and the RF model (r = 0.379); therefore, the algorithm with r = +1 has the best linearity. The use of ML to predict marine resource potential is a relatively new research field, so this research can contribute data and information as a reference for innovative studies and as investment decision material for investors.
{"title":"Machine Learning Oceanographic Data for Prediction of the Potential of Marine Resources","authors":"Denny Arbahri, O. Nurhayati, Imam Mudita","doi":"10.3844/jcssp.2024.129.139","DOIUrl":"https://doi.org/10.3844/jcssp.2024.129.139","url":null,"abstract":": Marine data and information are very important for human survival, therefore this data and information is attractive to investors because of the potential economic value. This data and information has been difficult to obtain, the solution to overcome this is by analyzing oceanographic data for 2009-2019 collected from the marine database belonging to the Agency for the Study and Application of Technology (BPPT). The data is the result of a collaborative marine survey between Indonesian and foreign researchers from various countries who sailed in various Indonesian waters. Raw oceanographic data is converted and classified into Conductivity, Temperature, and Depth (CTD) data as oceanographic data parameters identified as predictor variables (X) that are correlated with each other. CTD data is processed into numeric data attributes that have been labeled for input and training. The data was modeled using the Machine Learning (ML) type Supervised Learning (SL) method with the Decision Tree (DT), Linear Regression (LR) and Random Forest (RF) algorithms which were interpreted according to the characteristics of the CTD data. ML will learn data models to understand and store. Next, the model is evaluated using accuracy metrics by measuring the difference between the predicted value and the actual value to obtain a good prediction model. The prediction results show a salinity level of 34.0 parts per thousand (ppt), meaning that in this area of marine waters salinity will affect the solubility of Oxygen (O 2 ) and play a major role in the sustainability and growth of the fertility level of biological resources which is supported by sea surface temperature conditions 29.2°C. So the salinity values obtained using ML techniques and marine resource potential can be assumed to have a strong correlation. The research results show that the RF model has the lowest level of prediction error based on the values: Mean Square Error (MSE) = 0.007; Root Mean Squared Error (RMSE) = 0.082; Mean Absolute Error (MAE) = 0.007 compared to DT model: MSE = 0.008; RMSE = 0.088; MAE = 0.012 and LR model: MSE = 1.008; RMSE = 1.004; MAE = 0.281. The equivalent RF and DT models have a Determination Coefficient (R 2 ) = 0.999, meaning that a model is created that is good at predicting, compared to the LR model with a value of R 2 = 0.914. The correlation between variables shows that the LR model is very linear with a Correlation Coefficient (r) = 1.000 compared to the DT model (r) = 0.621 and the RF model (r) = 0.379. Therefore the algorithm that has a value of (r) +1 has the best level of accuracy. 
The use of ML to predict marine resource potential is a relatively new research field, so this research has the potential to contribute data and information as a reference for innovative studies and investment decision material for investors.","PeriodicalId":40005,"journal":{"name":"Journal of Computer Science","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139687562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
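The model comparison above can be reproduced in outline as follows: a minimal sketch fitting DT, LR, and RF regressors to synthetic CTD-style features and scoring them with MSE, RMSE, MAE, and R². The synthetic data is an illustrative stand-in for the BPPT database.

```python
# Minimal sketch: compare DT, LR, and RF regressors predicting salinity
# from CTD-style features. Synthetic data is an illustrative assumption.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(3, 6, 400),      # conductivity (S/m)
                     rng.uniform(5, 30, 400),     # temperature (°C)
                     rng.uniform(0, 2000, 400)])  # depth (m)
y = 30 + 0.8 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.1, 400)  # salinity (ppt)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, model in [("DT", DecisionTreeRegressor(random_state=0)),
                    ("LR", LinearRegression()),
                    ("RF", RandomForestRegressor(random_state=0))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, pred)
    print(name, mse, np.sqrt(mse), mean_absolute_error(y_te, pred),
          r2_score(y_te, pred))
```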
Periodic Service Behavior Strain Analysis-Based Intrusion Detection in Cloud
Pub Date: 2024-02-01 | DOI: 10.3844/jcssp.2024.140.149
S. Priya, R. S. Ponmagal
Abstract: The problem of intrusion detection in cloud environments has been well studied. Adversaries challenge data security in the cloud by generating intrusion attacks on cloud data, and these must be mitigated for the cloud environment to develop. Several techniques exist in the literature for mitigating intrusion attacks, using features such as frequency of access, payload details, and protocol mapping. However, these methods fall short of the expected detection performance. To handle this issue, an efficient Periodic Service Behavior Strain Analysis (PSBSA) model is presented. Unlike earlier methods, PSBSA analyzes user behavior over multiple time frames: historical, recent, and current spans. Rather than considering only current behavior, the model measures the user's trust from behavior at different times, performing Historical Strain Analysis (HSA) on historical access data, Current Strain Analysis (CSA) on current access data, and Recent Strain Analysis (RSA) on recent access data. The model estimates legitimacy support values from each analysis to conclude the trust of any user, and intrusion detection is performed according to these support values. The proposed PSBSA model achieves higher accuracy in intrusion detection in a cloud environment.
{"title":"Periodic Service Behavior Strain Analysis-Based Intrusion Detection in Cloud","authors":"S. Priya, R. S. Ponmagal","doi":"10.3844/jcssp.2024.140.149","DOIUrl":"https://doi.org/10.3844/jcssp.2024.140.149","url":null,"abstract":": The problem of intrusion detection in cloud environments has been well studied. The presence of adversaries would challenge data security in the cloud by generating intrusion attacks towards the cloud data and should be mitigated for the development of the cloud environment. In mitigating intrusion attacks, there exist several techniques in the literature. The method uses different features like frequency of access, payload details, protocol mapping, etc. However, the methods need to improve to achieve the expected performance in detecting intrusion attacks. An efficient Periodic Service Behavior Strain Analysis (PSBSA) is presented to handle this issue. Unlike earlier methods, the PSBSA model analyzes the behavior of users in various time frames like historical, recent, and current spans. The model focused on identifying intrusion attacks in several constraints, not just considering the current nature. The performance of intrusion detection can be improved by viewing the user's behavior in historical, present, and recent timespan. Unlike other approaches, the proposed PSBSA model considers the user's behavior at different times in measuring the user's trust towards intrusion detection. Accordingly, the proposed PSBSA model analyzes the behavior of users under various situations. It examines the behavior in accessing the services at historical, current, and recent times. The method performs Historical Strain Analysis (HSA) Current Strain Analysis (CSA) and Recent Strain Analysis (RSA). HSA analysis is performed according to the historical data, CSA is performed based on the current access data and RSA is performed with the recent access data. The model estimates various legitimacy support values on each analysis to conclude the trust of any user. According to the support values, intrusion detection has been performed. The proposed PSBSA model introduces higher accuracy in intrusion detection in a cloud environment.","PeriodicalId":40005,"journal":{"name":"Journal of Computer Science","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139685196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Analytics for Imbalanced Dataset
Pub Date: 2024-02-01 | DOI: 10.3844/jcssp.2024.207.217
Madhura Prabha R, Sasikala S
Abstract: The primary issue in real-time big data classification is imbalanced datasets. Although many balancing techniques exist to reduce the imbalance ratio, most are not suitable for big data because of scalability issues. This study explores different balancing techniques through an experimental comparison of their effectiveness, including cutting-edge approaches, on severely unbalanced data from online repositories. We apply the SMOTE, SMOTE ENN, and SMOTE Tomek balancing algorithms to the dermatology, wine quality, and diabetes datasets. After balancing, each dataset is classified with the AdaBoost and random forest algorithms. On all three datasets, the outcomes show that pairing a classifier with a balancing technique improves classification performance on imbalanced data. The experiments show that SMOTE ENN produces higher classification accuracy than SMOTE and SMOTE Tomek. The findings are also analyzed against other factors such as execution time and scalability: although SMOTE Tomek reaches 1.0 accuracy on a few datasets, its execution time is longer than SMOTE ENN's. SMOTE ENN with random forest classification therefore produces 1.0 accuracy on all three datasets with less execution time. This experimental study lays the groundwork for a novel ensemble technique for balancing highly imbalanced data.
{"title":"Data Analytics for Imbalanced Dataset","authors":"Madhura Prabha R, Sasikala S","doi":"10.3844/jcssp.2024.207.217","DOIUrl":"https://doi.org/10.3844/jcssp.2024.207.217","url":null,"abstract":": The primary issue in real-time big data classification is imbalanced datasets. Even though we have many balancing techniques to reduce imbalance ratio which is not suitable for big data that has scalability issues. This study is envisioned to explore different balancing techniques with experimental study. We tried comparing the effectiveness of various balancing strategies, including cutting-edge approaches for severely unbalanced data from online repositories. Here we apply SMOTE, SMOTE ENN and SMOTE Tomek balancing algorithms for dermatology, wine quality and diabetes datasets. After balancing the dataset, the balanced dataset is classified with AdaBoost and random forest algorithms. On three datasets, the outcomes show that the classification algorithm with the balancing technique improves the classification performance for imbalanced datasets. Experiment results showed that the SMOTE ENN technique produces higher classification with accuracy than the SMOTE and SMOTE Tomek techniques. The findings are analyzed with other factors like execution time and scalability. Though SMOTE Tomek produces 1.0 for a few datasets, its execution time is longer than SMOTE ENN. Therefore, SMOTE ENN with random forest classification produces 1.0 accuracy for all three datasets with less execution time. This experimental study analyses to create a novel ensemble technique for balancing highly imbalanced data.","PeriodicalId":40005,"journal":{"name":"Journal of Computer Science","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139884057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}