Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.04
Ayman H. Tanira, Wesam M. Ashour
The detection of outliers in text documents is a highly challenging task, primarily due to the unstructured nature of documents and the curse of dimensionality. Text document outliers are text data that deviate from the text found in other documents belonging to the same category. Mining text document outliers has wide applications in various domains, including spam email identification, digital libraries, medical archives, enhancing the performance of web search engines, and cleaning corpora used in document classification. To address the issue of dimensionality, it is crucial to employ feature selection techniques that reduce the large number of features without compromising their representativeness of the domain. In this paper, we propose a hybrid density-based approach that incorporates mutual information for text document outlier detection. The proposed approach uses normalized mutual information to identify the most distinctive features that characterize the target domain, and then customizes the well-known density-based local outlier factor algorithm to suit text document datasets. To evaluate the effectiveness of the proposed approach, we conduct experiments on twelve high-dimensional synthetic and real datasets. The results demonstrate that the proposed approach consistently outperforms conventional methods, achieving an average improvement of 5.73% in terms of the AUC metric. These findings highlight the gains achieved by combining normalized mutual information with a density-based algorithm, particularly on high-dimensional datasets.
Title: A Hybrid Unsupervised Density-based Approach with Mutual Information for Text Outlier Detection
Journal: International Journal of Intelligent Systems and Applications in Engineering
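A minimal sketch of the paper's two-stage idea, under stated assumptions: the toy binary bag-of-words corpus, the topic labels, the top-10 cutoff, and `n_neighbors=3` are all invented for illustration and are not the paper's data or settings. Terms are ranked by mutual information against a topic label, the most informative ones are kept, and documents are then scored with scikit-learn's Local Outlier Factor.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import LocalOutlierFactor

# Four on-topic sports documents plus two off-topic outliers.
docs = [
    "the team won the football match",
    "the coach praised the football team after the match",
    "players scored twice in the final football match",
    "the team lost the away football match",
    "quantum entanglement links distant photonic qubits",    # outlier
    "simmer the onions before adding the crushed tomatoes",  # outlier
]
labels = [0, 0, 0, 0, 1, 1]  # topic label, used only to rank features

# Binary presence features keep the mutual information estimate discrete.
X = CountVectorizer(binary=True).fit_transform(docs).toarray()
mi = mutual_info_classif(X, labels, discrete_features=True, random_state=0)
top = np.argsort(mi)[-10:]  # keep the 10 most informative terms
X_sel = X[:, top]

lof = LocalOutlierFactor(n_neighbors=3)
lof.fit(X_sel)
scores = lof.negative_outlier_factor_  # lower means more outlying
print(scores.round(2))
```

On this toy corpus the two off-topic documents receive the lowest (most outlying) scores, since feature selection preserves both the cluster-defining sports terms and the distinctive off-topic terms.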
Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.01
Martin C. Peter, Steve Adeshina, Olabode Idowu-Bismark, Opeyemi Osanaiye, Oluseun Oyeleke
The operational efficiency of water supply infrastructure has a direct impact on the quantity of potable water available to end users. It is commonplace to find water supply infrastructure in a declining operational state in rural and some urban centers in developing countries, where maintenance issues result in unabated wastage and shortage of supply to users. This work proposes a cost-effective solution to the problem of water distribution losses, using a microcontroller-based digital control method and Machine Learning (ML) to forecast and manage potable water production and system maintenance. The fundamental concept of hydrostatic pressure equilibrium was used for the detection and control of leakages from pipeline segments. Analysis of the collated data shows a direct linear relationship between water distribution loss and production quantity, and an inverse relationship between Mean Time Between Failure (MTBF) and yearly failure rate; these are the key factors affecting water supply efficiency and availability. Results from the prototype system test show a water supply efficiency of 99%, as distribution loss was reduced to 1% by the Line Control Unit (LCU) installed on the prototype pipeline. Hydrostatic pressure equilibrium, used as the logic criterion for leak detection and control, proved potent for significant efficiency improvement in the water supply infrastructure.
Title: Digital Control and Management of Water Supply Infrastructure Using Embedded Systems and Machine Learning
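The leak-detection logic can be sketched as follows. This is an illustrative reconstruction, not the paper's firmware: the tolerance, pressures, and elevation drop are made-up values, and a real LCU would read these from pressure sensors. A segment is flagged when the measured downstream pressure deviates from the hydrostatic-equilibrium prediction by more than a tolerance.

```python
# Hydrostatic prediction: over an elevation drop, pressure rises by rho*g*h.
RHO = 1000.0  # water density, kg/m^3
G = 9.81      # gravitational acceleration, m/s^2

def expected_downstream(p_upstream_pa: float, elevation_drop_m: float) -> float:
    """Downstream pressure implied by hydrostatic equilibrium (no leak)."""
    return p_upstream_pa + RHO * G * elevation_drop_m

def leak_detected(p_upstream_pa: float, p_downstream_pa: float,
                  elevation_drop_m: float, tol_pa: float = 5_000.0) -> bool:
    """Flag a segment when the unexplained pressure loss exceeds tolerance."""
    residual = expected_downstream(p_upstream_pa, elevation_drop_m) - p_downstream_pa
    return residual > tol_pa

# Intact segment: downstream matches the hydrostatic prediction.
print(leak_detected(200_000, expected_downstream(200_000, 2.0), 2.0))            # False
# Leaking segment: 20 kPa of pressure loss the equilibrium cannot explain.
print(leak_detected(200_000, expected_downstream(200_000, 2.0) - 20_000, 2.0))   # True
```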
Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.03
Aayush Juyal, Nandini Sharma, Pisati Rithya, Sandeep Kumar
Data Structures and Algorithms (DSA) is a widely explored domain in computer science. As a crucial topic in software engineering interviews, it is not to be taken lightly. Various platforms are available for understanding a particular data structure or algorithm, practicing programming problems, and studying implementations. HackerRank, LeetCode, GeeksForGeeks (GFG), and Codeforces are popular platforms that offer vast collections of programming problems to enhance skills. However, with the huge amount of DSA content available, it is challenging for users to identify which problems to focus on after covering the required domain. This work uses a content-based filtering (CBF) recommendation engine to suggest programming questions to users on different data structures and algorithms such as arrays, linked lists, trees, and graphs. The recommendations are generated using Natural Language Processing (NLP) techniques. The data set consists of approximately 500 problems, each represented by features such as the problem statement, related topics, level of difficulty, and platform link. Standard measures, namely cosine similarity, accuracy, precision, recall, and F1-score, are used to determine the proportion of correctly recommended problems; the percentages indicate how well the system performs on each measure. The results show that CBF achieves an accuracy of 83%, a precision of 83%, a recall of 80%, and an F1-score of 80%. The recommendation system is deployed in a web application with a suitable user interface that lets users interact with its other features, forming a complete e-learning application to aid prospective software engineers and computer science students. In the future, two further recommendation approaches, Collaborative Filtering (CF) and hybrid systems, can be implemented to compare and decide which is most suitable for the given problem statement.
Title: An Enhanced Approach to Recommend Data Structures and Algorithms Problems Using Content-based Filtering
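The core of content-based filtering with cosine similarity can be sketched in a few lines. The tiny five-problem catalogue below is invented for illustration (the paper's dataset has roughly 500 problems with richer features): each problem statement is vectorized with TF-IDF, and the problems most cosine-similar to one the user solved are recommended.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini-catalogue of problem statements.
problems = [
    "reverse a singly linked list iteratively",    # 0
    "detect a cycle in a linked list",             # 1
    "binary tree level order traversal",           # 2
    "shortest path in a weighted graph dijkstra",  # 3
    "merge two sorted linked lists",               # 4
]

tfidf = TfidfVectorizer().fit_transform(problems)
sims = cosine_similarity(tfidf)  # pairwise similarity matrix

def recommend(solved_idx: int, k: int = 2) -> list:
    """Return the k problems most similar to the one just solved."""
    ranked = sims[solved_idx].argsort()[::-1]      # descending similarity
    return [i for i in ranked if i != solved_idx][:k]

print(recommend(0))  # the other linked-list problems rank first
```

Precision, recall, and F1 are then computed by checking recommended problems against problems sharing the solved item's topic tags.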
Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.02
Arash Salehpour
This paper analyses the performance of machine learning models in forecasting the Tehran Stock Exchange's automobile index. Historical daily data from 2018-2022 were pre-processed and used to train Linear Regression (LR), Support Vector Regression (SVR), and Random Forest (RF) models. The models were evaluated on mean absolute error, mean squared error, root mean squared error, and R2 score. The results indicate that LR and SVR outperformed RF in predicting automobile stock prices, with LR achieving the lowest error scores. This demonstrates the capability of machine learning techniques to model complex, nonlinear relationships in financial time series data. This pioneering study on a previously unexplored dataset provides empirical evidence that LR and SVR can reliably forecast automobile stock market prices, holding promise for investment applications.
Title: Predicting Automobile Stock Prices Index in the Tehran Stock Exchange Using Machine Learning Models
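The comparison protocol can be sketched as below. The Tehran Stock Exchange data is not reproduced here, so a synthetic random-walk price series stands in; the 5-lag framing, 80/20 chronological split, and model hyperparameters are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0.1, 1.0, 600)) + 100  # synthetic price series

# Supervised framing: predict tomorrow's price from the last 5 prices.
lags = 5
X = np.array([prices[i:i + lags] for i in range(len(prices) - lags)])
y = prices[lags:]
split = int(0.8 * len(X))  # chronological split, no shuffling
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

results = {}
for name, model in [("LR", LinearRegression()),
                    ("SVR", SVR(kernel="rbf", C=10.0)),
                    ("RF", RandomForestRegressor(n_estimators=100, random_state=0))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {"MAE": mean_absolute_error(y_te, pred),
                     "RMSE": mean_squared_error(y_te, pred) ** 0.5,
                     "R2": r2_score(y_te, pred)}
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

On a drifting series the kernel and tree models struggle to extrapolate beyond the training price range, which illustrates one plausible reason LR fares well on level-price forecasting.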
Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.05
Deep Karan Singh, Nisha Rawat
Climate change, a significant and lasting alteration in global weather patterns, is profoundly impacting the stability and predictability of global temperature regimes. As the world continues to grapple with the far-reaching effects of climate change, accurate and timely temperature predictions have become pivotal to various sectors, including agriculture, energy, and public health. Crucially, precise temperature forecasting assists in developing effective climate change mitigation and adaptation strategies. With the advent of machine learning techniques, we now have powerful tools that can learn from vast climatic datasets and provide improved predictive performance. This study compares three such advanced machine learning models, XGBoost, Support Vector Machine (SVM), and Random Forest, in predicting daily maximum and minimum temperatures using a 45-year dataset from Visakhapatnam airport. Each model was rigorously trained and evaluated on key performance metrics including training loss, Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R2 score, Mean Absolute Percentage Error (MAPE), and Explained Variance Score. Although no single model dominated across all metrics, SVM and Random Forest showed slightly superior performance on several measures. These findings not only highlight the potential of machine learning techniques in enhancing the accuracy of temperature forecasting but also stress the importance of selecting an appropriate model and performance metrics aligned with the task at hand. This research accomplishes a thorough comparative analysis, conducts a rigorous evaluation of the models, and highlights the significance of model selection.
Title: Machine Learning for Weather Forecasting: XGBoost vs SVM vs Random Forest in Predicting Temperature for Visakhapatnam
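The multi-metric evaluation loop the study describes can be sketched as follows. The 45-year Visakhapatnam record is not bundled here, so ten years of synthetic seasonal maximum temperatures stand in, and the sin/cos day-of-year encoding and held-out tail are illustrative choices. XGBoost is omitted to keep the sketch to scikit-learn; SVM and Random Forest are scored with the same battery of metrics the paper reports.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             mean_absolute_percentage_error, explained_variance_score)

rng = np.random.default_rng(1)
day = np.arange(3650)  # ten synthetic years of daily records
tmax = 33 + 4 * np.sin(2 * np.pi * day / 365.25) + rng.normal(0, 1.0, day.size)

# Encode day-of-year as sin/cos so the annual cycle is learnable.
doy = day % 365.25
X = np.column_stack([np.sin(2 * np.pi * doy / 365.25),
                     np.cos(2 * np.pi * doy / 365.25)])
split = 2920  # hold out the last ~2 years
X_tr, X_te, y_tr, y_te = X[:split], X[split:], tmax[:split], tmax[split:]

metrics = {}
for name, model in [("SVM", SVR()), ("RF", RandomForestRegressor(random_state=0))]:
    p = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, p)
    metrics[name] = {"MAE": mean_absolute_error(y_te, p), "MSE": mse,
                     "RMSE": mse ** 0.5, "R2": r2_score(y_te, p),
                     "MAPE": mean_absolute_percentage_error(y_te, p),
                     "EVS": explained_variance_score(y_te, p)}
    print(name, {k: round(v, 3) for k, v in metrics[name].items()})
```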
Pub Date: 2023-08-08 | DOI: 10.5815/ijisa.2023.04.05
Sabbir Hossain, Rahman Sharar, Md. Ibrahim Bahadur, A. Sufian, Rashidul Hasan Nabil
The emergence of chatbots over the last 50 years has been driven primarily by the need for virtual assistance. Unlike their biological anthropomorphic counterparts, fellow homo sapiens, chatbots can instantaneously present themselves at the user's need and convenience. Be it something as benign as needing a friend to talk to, or a case as dire as requiring medical assistance, chatbots are unequivocally ubiquitous in their utility. This paper aims to develop one such chatbot, capable of not only analyzing human text (and, in the near future, speech) but also of assisting users medically by accumulating data from relevant datasets. Although Recurrent Neural Networks (RNNs) are often used to develop chatbots, the vanishing gradient issue brought about by backpropagation, coupled with the cumbersome process of parsing each word sequentially, has led to the increased usage of Transformer Neural Networks (TNNs) instead, which parse entire sentences at once while simultaneously giving them context via embeddings, leading to increased parallelization. Two variants of the TNN Bidirectional Encoder Representations from Transformers (BERT), namely KeyBERT and BioBERT, are used for tagging the keywords in each sentence and for contextual vectorization into Q/A pairs for matrix multiplication, respectively. A final GPT-2 (Generative Pre-trained Transformer 2) layer is applied to fine-tune the results from BioBERT into a human-readable form. The outcome of such a system could lessen the need for trips to the nearest physician, along with the temporal delay and financial resources those trips require.
Title: MediBERT: A Medical Chatbot Built Using KeyBERT, BioBERT and GPT-2
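The retrieval step of such a pipeline can be sketched with stand-in vectors: embeddings of stored questions are matrix-multiplied against the query embedding and the best-matching answer is returned. In the paper this vectorization comes from BioBERT; here the toy 4-d vectors and the three Q/A pairs are invented purely to show the matching mechanics.

```python
import numpy as np

# Invented Q/A store; real answers would come from curated medical datasets.
qa_pairs = [
    ("what helps a sore throat", "Warm fluids and rest are commonly advised."),
    ("how to treat a minor burn", "Cool the burn under running water."),
    ("what causes headaches", "Common triggers include dehydration and stress."),
]

# Stand-in question embeddings, one row per stored question, L2-normalised.
Q = np.array([[0.9, 0.1, 0.0, 0.1],
              [0.0, 0.9, 0.3, 0.1],
              [0.1, 0.0, 0.9, 0.2]])
Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)

def answer(query_vec: np.ndarray) -> str:
    """Return the answer whose question embedding is most similar to the query."""
    v = query_vec / np.linalg.norm(query_vec)
    scores = Q @ v  # cosine similarities via one matrix multiply
    return qa_pairs[int(np.argmax(scores))][1]

print(answer(np.array([0.0, 1.0, 0.2, 0.0])))  # closest stored question: the burn one
```

In the full system, a generative layer (GPT-2 in the paper) would then rewrite the retrieved answer into conversational prose.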
Pub Date: 2023-08-08 | DOI: 10.5815/ijisa.2023.04.04
Farhan M. A. Nashwan, Khaled A. M. Al Soufy, N. Al-Ashwal, Majed A. Al-Badany
Automatic Number Plate Recognition (ANPR) is an important tool in Intelligent Transport Systems (ITS). Plate features can be used to identify any vehicle, helping to ensure effective law enforcement and security. However, this is a challenging problem because of the diversity of plate formats and the varying scales, rotations, non-uniform illumination, and other conditions encountered during image acquisition. This work aims to design and implement an ANPR system specific to Yemeni vehicle plates. The proposed system involves several steps to detect, segment, and recognize Yemeni vehicle plate numbers. First, a dataset of images is manually collected. Then, the collected images undergo preprocessing, followed by plate extraction, digit segmentation, and feature extraction. Finally, the plate numbers are identified using a Support Vector Machine (SVM). When designing the proposed system, all conditions that could affect its efficiency were considered. The experimental results show that the proposed system achieved training and testing success rates of 96.98% and 99.19%, respectively.
Title: Design of Automatic Number Plate Recognition System for Yemeni Vehicles with Support Vector Machine
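The final recognition stage can be sketched as below, with scikit-learn's bundled 8x8 digit images standing in for segmented Yemeni plate digits (the paper's own dataset is not public here, and the RBF kernel and split are illustrative): pixel features are fed to an SVM classifier, mirroring the identification step the paper describes.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # bundled 8x8 grayscale digit images, no download needed
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# RBF-kernel SVM over raw pixel intensities as the digit recognizer.
clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

A real plate pipeline would precede this with the detection, extraction, and segmentation stages, and would likely use hand-crafted features rather than raw pixels.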
Pub Date: 2023-08-08 | DOI: 10.5815/ijisa.2023.04.01
Md. Abdur Rahman, A. Nayem, Mahfida Amjad, Md. Saeed Siddik
Toxic comments on social media platforms, news portals, and online forums are impolite, insulting, or unreasonable remarks that usually drive other users out of a conversation. Because of the sheer number of comments, moderating them manually is impractical, so online service providers detect toxicity automatically using Machine Learning (ML) algorithms. However, a model's toxicity identification performance relies on the best combination of classifier and feature extraction technique. In this empirical study, we set up a comparison environment for toxic comment classification using 15 frequently used supervised ML classifiers paired with the four most prominent feature extraction schemes. We considered the publicly available Jigsaw dataset of toxic comments written by human users. We tested, analyzed, and compared every investigated classifier/feature pair and report our conclusions, using accuracy and area under the ROC curve as the evaluation metrics. We found that Logistic Regression and AdaBoost are the best toxic comment classifiers: their average accuracies are 0.895 and 0.893, respectively, and both achieved the same area under the ROC curve score (0.828). The primary takeaway of this study is therefore that Logistic Regression and AdaBoost leveraging BoW, TF-IDF, or Hashing features perform sufficiently well for toxic comment classification.
Title: How do Machine Learning Algorithms Effectively Classify Toxic Comments? An Empirical Analysis
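One cell of such a classifier/feature grid, the Logistic Regression + TF-IDF pairing the study recommends, can be sketched as below. The six-comment corpus is invented for illustration; the actual study trains on the Jigsaw dataset and scores with accuracy and ROC-AUC.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled corpus: 1 = toxic, 0 = clean.
comments = [
    "you are an idiot and nobody wants you here",
    "shut up you worthless troll",
    "thanks for the detailed explanation",
    "great point, I learned something new",
    "this is garbage and so are you",
    "could you share the source for that claim",
]
toxic = [1, 1, 0, 0, 1, 0]

# TF-IDF features feeding a Logistic Regression classifier, as one grid cell.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(comments, toxic)

pred = clf.predict(["you are a troll", "thanks for the helpful explanation"])
print(pred)  # discriminative tokens push the first toxic, the second clean
```

Swapping `TfidfVectorizer` for `CountVectorizer` (BoW) or `HashingVectorizer`, and `LogisticRegression` for `AdaBoostClassifier`, reproduces the other pairings in the study's grid.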
Pub Date : 2023-08-08DOI: 10.5815/ijisa.2023.04.03
Rajan Prasad, P. Shukla
Autism spectrum disorder (ASD) is a chronic developmental condition that impairs a person's ability to communicate and connect with others. In people with ASD, social contact and reciprocal communication are continually jeopardized. People with ASD may require varying degrees of psychological support to gain greater independence, or they may require ongoing supervision and care. Early detection of ASD allows more time for individual rehabilitation. In this study, we propose a fuzzy classifier for ASD classification and test its interpretability with the fuzzy index and Nauck's index to ensure its reliability. The rule base is created with the Gauje tool, and the fuzzy rules are then applied to a fuzzy neural network to predict autism. The proposed model is built on a Mamdani rule set and optimized using the backpropagation algorithm; it uses a heuristic function and pattern evolution to classify the dataset. The model is evaluated using the benchmark metrics accuracy and F-measure, while Nauck's index and the fuzzy index quantify interpretability. The proposed model detects ASD more accurately than the compared classifiers, achieving an average accuracy of 91%.
{"title":"Interpretable Fuzzy System for Early Detection Autism Spectrum Disorder","authors":"Rajan Prasad, P. Shukla","doi":"10.5815/ijisa.2023.04.03","DOIUrl":"https://doi.org/10.5815/ijisa.2023.04.03","url":null,"abstract":"Autism spectrum disorder (ASD) is a chronic developmental impairment that impairs a person's ability to communicate and connect with others. In people with ASD, social contact and reciprocal communication are continually jeopardized. People with ASD may require varying degrees of psychological aid in order to gain greater independence, or they may require ongoing supervision and care. Early discovery of ASD results in more time allocated to individual rehabilitation. In this study, we proposed the fuzzy classifier for ASD classification and tested its interpretability with the fuzzy index and Nauck's index to ensure its reliability. Then, the rule base is created with the Gauje tool. The fuzzy rules were then applied to the fuzzy neural network to predict autism. The suggested model is built on the Mamdani rule set and optimized using the backpropagation algorithm. The proposed model uses a heuristic function and pattern evolution to classify dataset. The model is evaluated using the benchmark metrics accuracy and F-measure, and Nauck's index and fuzzy index are employed to quantify interpretability. 
The proposed model is superior in its ability to accurately detect ASD, with an average accuracy rate of 91% compared to other classifiers.","PeriodicalId":14067,"journal":{"name":"International Journal of Intelligent Systems and Applications in Engineering","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82401083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
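A Mamdani rule base of the kind the paper builds can be illustrated in miniature: two rules over a single hypothetical screening score, with min implication, max aggregation, and centroid defuzzification. The membership functions, score range, and rules below are illustrative assumptions for exposition, not the authors' actual system.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Input terms over a hypothetical screening score in [0, 10]
score_low  = lambda s: tri(s, -1.0, 0.0, 6.0)
score_high = lambda s: tri(s, 4.0, 10.0, 11.0)

# Output terms over a risk universe in [0, 1]
risk_low  = lambda r: tri(r, -0.1, 0.0, 0.6)
risk_high = lambda r: tri(r, 0.4, 1.0, 1.1)

def mamdani_risk(score, steps=200):
    """Evaluate two Mamdani rules and defuzzify by centroid.

    Rule 1: IF score is low  THEN risk is low
    Rule 2: IF score is high THEN risk is high
    Each rule's output set is clipped at its firing strength (min
    implication), the clipped sets are combined with max, and the
    centroid of the aggregate over a sampled universe is returned.
    """
    w1, w2 = score_low(score), score_high(score)
    num = den = 0.0
    for i in range(steps + 1):
        r = i / steps
        mu = max(min(w1, risk_low(r)), min(w2, risk_high(r)))
        num += r * mu
        den += mu
    return num / den if den else 0.5  # neutral output if no rule fires
```

Because the output sets are full fuzzy sets rather than singletons, this is Mamdani inference proper; replacing the centroid step with a weighted average of singleton outputs would turn it into a zero-order Sugeno system.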
Pub Date : 2023-08-08DOI: 10.5815/ijisa.2023.04.02
Ikechi Risi, C. Ogbonda, F. B. Sigalo, Isabona Joseph
The poor network service frequently experienced by some mobile phone users within dead zones in Nigeria is an issue that different researchers have attributed to the wrong positioning and planning of the evolved NodeB (eNodeB) transmitter using existing propagation loss models. To address this issue, an adaptive hybrid propagation loss model based on wavelet transform and genetic algorithm (GA) methods has been developed for cellular network planning and optimization. First, signal strengths were measured within four selected eNodeB cell sites in a long term evolution (LTE) network at 2600 MHz using the drive-test method. Secondly, the measured data were denoised with wavelet tools. Thirdly, the COST231 model was optimized and reduced to a generic model with tunable parameters. Fourthly, a genetic optimization algorithm automatically developed propagation loss models for the denoised signal data (designated the wavelet-GA model) and the unprocessed signal data (designated the GA model). The hybrid wavelet-GA, GA, and COST231 propagation loss models were compared using three error metrics: root mean square error (RMSE), mean absolute error (MAE), and correlation coefficient (R). The hybrid wavelet-GA model yielded the lowest RMSEs of 2.8813 dB, 3.9381 dB, 4.7643 dB, and 6.9366 dB, whereas the COST231 model gave the highest RMSE values. The hybrid wavelet-GA model also achieved the lowest MAEs compared with the COST231 and GA models: 2.2016 dB, 2.8672 dB, 3.4766 dB, and 5.8235 dB. Its correlation coefficients were 90.04%, 78.61%, 92.21%, and 91.23% for the four cell sites.
The hybrid wavelet-GA model was further validated by checking the correlation coefficient on measured signal data from eNodeB cell sites other than those used to develop it, and it achieved a validation correlation of 97.41%. The existing standard COST231 model cannot predict propagation loss with high accuracy and is therefore not well suited to parts of Port Harcourt, Nigeria. The proposed hybrid wavelet-GA model achieves a high performance level and is suitable for cellular network planning and optimization. In future work, more regions and locations should be considered to build more robust propagation loss models.
{"title":"An Adaptive Hybrid Outdoor Propagation Loss Prediction Modelling for Effective Cellular Systems Network Planning and Optimization","authors":"Ikechi Risi, C. Ogbonda, F. B. Sigalo, Isabona Joseph","doi":"10.5815/ijisa.2023.04.02","DOIUrl":"https://doi.org/10.5815/ijisa.2023.04.02","url":null,"abstract":"The frequent poor service network experienced by some mobile phone users within some deadlock areas in Nigeria is an issue which has been identified by different researchers due to wrong positioning and planning of the evolved NodeB (eNodeB) transmitter using existing propagation loss models. To effectively contribute towards this potential issue constantly experienced in some part of Nigeria, an adaptive hybrid propagation loss model that is based on wavelet transform and genetic algorithm methods has been developed for cellular network planning and optimization, with the capacity to resolve the problems absolutely. First, the signal strengths were measured within four selected eNodeB cell sites in long term evolution (LTE) at 2600MHz using drive-test method. Secondly, the measured data were denoised through wavelet tools. Thirdly, COST231 model was optimize and deduced to generic model with parameters. Fourthly, genetic optimization algorithm automatically developed the propagation loss models for denoised signal data (designated as wavelet-GA model) and unprocessed signal data (designated as GA model). The hybrid wavelet-GA propagation loss model, GA propagation loss model, and COST231 propagation loss model were compared based on three error metrics such as root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (R). The developed hybrid wavelet-GA model estimated the lowest RMSEs of 2.8813 dB, 3.9381 dB, 4.7643 dB, 6.9366 dB, whereas, COST231 model gave highest value of RMSE. 
The developed hybrid wavelet-GA model also derived the least value of MAE as compared with COST231 and the GA models, such as, 2.2016 dB, 2.8672 dB, 3.4766 dB, 5.8235 dB. The correlation coefficients were also compared, and it showed that the developed hybrid wavelet-GA model were 90.04%, 78.61%, 92.21% and 91.23% for the four cell sites. The developed hybrid wavelet-GA model was also validated to account for the performance level by checking for the correlation coefficient using another measured signal data from different eNodeB cell sites other than the once used for the developed of the hybrid wavelet-GA model. It was noticed that the developed hybrid wavelet-GA propagation loss model is 97.41% valid. Existing standard COST231 model are not able to predict propagation loss with high level of accuracy, as such not efficient to be applied within part of Port Harcourt, Nigeria. The proposed hybrid wavelet-GA model has proven to achieve high performance level and it is relevant to be utilized for cellular network planning and optimization. In future purposes, more regions and locations should be considered to form a broader view in the development of more robust propagation loss models.","PeriodicalId":14067,"journal":{"name":"International Journal of Intelligent Systems and Applications in Engineering","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84860678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
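The baseline being tuned above is the COST231-Hata empirical model, whose median path-loss formula can be written down directly. The sketch below uses the medium-city mobile-antenna correction; the frequency, antenna heights, and distance in the example are illustrative assumptions. Note that COST231-Hata is formally specified for 1500-2000 MHz, below the 2600 MHz LTE band measured in the study, which is one reason a fitted generic model can outperform it there.

```python
import math

def cost231_hata(f_mhz, d_km, h_base_m, h_mobile_m, urban=True):
    """COST231-Hata median path loss in dB.

    f_mhz: carrier frequency (formally valid 1500-2000 MHz)
    d_km: transmitter-receiver distance (1-20 km)
    h_base_m, h_mobile_m: base-station and mobile antenna heights
    urban: metropolitan (+3 dB correction) vs suburban/medium city (0 dB)
    """
    # Mobile-antenna correction for small/medium cities
    a_hm = ((1.1 * math.log10(f_mhz) - 0.7) * h_mobile_m
            - (1.56 * math.log10(f_mhz) - 0.8))
    c_m = 3.0 if urban else 0.0
    return (46.3 + 33.9 * math.log10(f_mhz)
            - 13.82 * math.log10(h_base_m) - a_hm
            + (44.9 - 6.55 * math.log10(h_base_m)) * math.log10(d_km)
            + c_m)

# Illustrative evaluation: 2 GHz carrier, 30 m base antenna, 1.5 m handset
pl_1km = cost231_hata(2000.0, 1.0, 30.0, 1.5)  # roughly 140 dB
pl_2km = cost231_hata(2000.0, 2.0, 30.0, 1.5)
```

A GA-based approach like the paper's would treat the constants in this formula as free parameters and search for the values that minimise RMSE against the drive-test measurements.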