Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.04
Ayman H. Tanira, Wesam M. Ashour
The detection of outliers in text documents is a highly challenging task, primarily due to the unstructured nature of documents and the curse of dimensionality. Text document outliers are text data that deviate from the text found in other documents belonging to the same category. Mining text document outliers has wide applications in various domains, including spam email identification, digital libraries, medical archives, enhancing the performance of web search engines, and cleaning corpora used in document classification. To address the issue of dimensionality, it is crucial to employ feature selection techniques that reduce the large number of features without compromising their representativeness of the domain. In this paper, we propose a hybrid density-based approach that incorporates mutual information for text document outlier detection. The proposed approach uses normalized mutual information to identify the most distinctive features that characterize the target domain, and then customizes the well-known density-based local outlier factor algorithm to suit text document datasets. To evaluate the effectiveness of the proposed approach, we conduct experiments on twelve high-dimensional synthetic and real datasets. The results demonstrate that the proposed approach consistently outperforms conventional methods, achieving an average improvement of 5.73% in terms of the AUC metric. These findings highlight the gains achieved by combining normalized mutual information with a density-based algorithm, particularly on high-dimensional datasets.
Title: A Hybrid Unsupervised Density-based Approach with Mutual Information for Text Outlier Detection
Journal: International Journal of Intelligent Systems and Applications in Engineering
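A minimal sketch of the paper's two-stage idea, under stated assumptions: the toy binary bag-of-words corpus, the topic labels, the top-10 cutoff, and `n_neighbors=3` are all invented for illustration and are not the paper's data or settings. Terms are ranked by mutual information against a topic label, the most informative ones are kept, and documents are then scored with scikit-learn's Local Outlier Factor.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import LocalOutlierFactor

# Four on-topic sports documents plus two off-topic outliers.
docs = [
    "the team won the football match",
    "the coach praised the football team after the match",
    "players scored twice in the final football match",
    "the team lost the away football match",
    "quantum entanglement links distant photonic qubits",    # outlier
    "simmer the onions before adding the crushed tomatoes",  # outlier
]
labels = [0, 0, 0, 0, 1, 1]  # topic label, used only to rank features

# Binary presence features keep the mutual information estimate discrete.
X = CountVectorizer(binary=True).fit_transform(docs).toarray()
mi = mutual_info_classif(X, labels, discrete_features=True, random_state=0)
top = np.argsort(mi)[-10:]  # keep the 10 most informative terms
X_sel = X[:, top]

lof = LocalOutlierFactor(n_neighbors=3)
lof.fit(X_sel)
scores = lof.negative_outlier_factor_  # lower means more outlying
print(scores.round(2))
```

On this toy corpus the two off-topic documents receive the lowest (most outlying) scores, since feature selection preserves both the cluster-defining sports terms and the distinctive off-topic terms.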
Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.01
Martin C. Peter, Steve Adeshina, Olabode Idowu-Bismark, Opeyemi Osanaiye, Oluseun Oyeleke
The operational efficiency of water supply infrastructure has a direct impact on the quantity of potable water available to end users. It is commonplace to find water supply infrastructure in a declining operational state in rural and some urban centers in developing countries, where maintenance issues result in unabated wastage and shortage of supply to users. This work proposes a cost-effective solution to the problem of water distribution losses, using a microcontroller-based digital control method and Machine Learning (ML) to forecast and manage potable water production and system maintenance. The fundamental concept of hydrostatic pressure equilibrium was used for the detection and control of leakages from pipeline segments. Analysis of the collated data shows a direct linear relationship between water distribution loss and production quantity, and an inverse relationship between Mean Time Between Failure (MTBF) and yearly failure rate; these are the key factors affecting water supply efficiency and availability. Results from the prototype system test show a water supply efficiency of 99%, as distribution loss was reduced to 1% by the Line Control Unit (LCU) installed on the prototype pipeline. Hydrostatic pressure equilibrium, used as the logic criterion for leak detection and control, proved potent for significant efficiency improvement in the water supply infrastructure.
Title: Digital Control and Management of Water Supply Infrastructure Using Embedded Systems and Machine Learning
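The leak-detection logic can be sketched as follows. This is an illustrative reconstruction, not the paper's firmware: the tolerance, pressures, and elevation drop are made-up values, and a real LCU would read these from pressure sensors. A segment is flagged when the measured downstream pressure deviates from the hydrostatic-equilibrium prediction by more than a tolerance.

```python
# Hydrostatic prediction: over an elevation drop, pressure rises by rho*g*h.
RHO = 1000.0  # water density, kg/m^3
G = 9.81      # gravitational acceleration, m/s^2

def expected_downstream(p_upstream_pa: float, elevation_drop_m: float) -> float:
    """Downstream pressure implied by hydrostatic equilibrium (no leak)."""
    return p_upstream_pa + RHO * G * elevation_drop_m

def leak_detected(p_upstream_pa: float, p_downstream_pa: float,
                  elevation_drop_m: float, tol_pa: float = 5_000.0) -> bool:
    """Flag a segment when the unexplained pressure loss exceeds tolerance."""
    residual = expected_downstream(p_upstream_pa, elevation_drop_m) - p_downstream_pa
    return residual > tol_pa

# Intact segment: downstream matches the hydrostatic prediction.
print(leak_detected(200_000, expected_downstream(200_000, 2.0), 2.0))            # False
# Leaking segment: 20 kPa of pressure loss the equilibrium cannot explain.
print(leak_detected(200_000, expected_downstream(200_000, 2.0) - 20_000, 2.0))   # True
```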
Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.03
Aayush Juyal, Nandini Sharma, Pisati Rithya, Sandeep Kumar
Data Structures and Algorithms (DSA) is a widely explored domain in computer science. As a crucial topic in software engineering interviews, it is not to be taken lightly. Various platforms are available for understanding a particular data structure or algorithm, practicing programming problems, and studying implementations. HackerRank, LeetCode, GeeksForGeeks (GFG), and Codeforces are popular platforms that offer vast collections of programming problems to enhance skills. However, with the huge amount of DSA content available, it is challenging for users to identify which problems to focus on after covering the required domain. This work uses a content-based filtering (CBF) recommendation engine to suggest programming questions to users on different data structures and algorithms such as arrays, linked lists, trees, and graphs. The recommendations are generated using Natural Language Processing (NLP) techniques. The data set consists of approximately 500 problems, each represented by features such as the problem statement, related topics, level of difficulty, and platform link. Standard measures, namely cosine similarity, accuracy, precision, recall, and F1-score, are used to determine the proportion of correctly recommended problems; the percentages indicate how well the system performs on each measure. The results show that CBF achieves an accuracy of 83%, a precision of 83%, a recall of 80%, and an F1-score of 80%. The recommendation system is deployed in a web application with a suitable user interface that lets users interact with its other features, forming a complete e-learning application to aid prospective software engineers and computer science students. In the future, two further recommendation approaches, Collaborative Filtering (CF) and hybrid systems, can be implemented to compare and decide which is most suitable for the given problem statement.
Title: An Enhanced Approach to Recommend Data Structures and Algorithms Problems Using Content-based Filtering
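The core of content-based filtering with cosine similarity can be sketched in a few lines. The tiny five-problem catalogue below is invented for illustration (the paper's dataset has roughly 500 problems with richer features): each problem statement is vectorized with TF-IDF, and the problems most cosine-similar to one the user solved are recommended.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini-catalogue of problem statements.
problems = [
    "reverse a singly linked list iteratively",    # 0
    "detect a cycle in a linked list",             # 1
    "binary tree level order traversal",           # 2
    "shortest path in a weighted graph dijkstra",  # 3
    "merge two sorted linked lists",               # 4
]

tfidf = TfidfVectorizer().fit_transform(problems)
sims = cosine_similarity(tfidf)  # pairwise similarity matrix

def recommend(solved_idx: int, k: int = 2) -> list:
    """Return the k problems most similar to the one just solved."""
    ranked = sims[solved_idx].argsort()[::-1]      # descending similarity
    return [i for i in ranked if i != solved_idx][:k]

print(recommend(0))  # the other linked-list problems rank first
```

Precision, recall, and F1 are then computed by checking recommended problems against problems sharing the solved item's topic tags.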
Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.02
Arash Salehpour
This paper analyses the performance of machine learning models in forecasting the Tehran Stock Exchange's automobile index. Historical daily data from 2018-2022 were pre-processed and used to train Linear Regression (LR), Support Vector Regression (SVR), and Random Forest (RF) models. The models were evaluated on mean absolute error, mean squared error, root mean squared error, and R2 score. The results indicate that LR and SVR outperformed RF in predicting automobile stock prices, with LR achieving the lowest error scores. This demonstrates the capability of machine learning techniques to model complex, nonlinear relationships in financial time series data. This pioneering study on a previously unexplored dataset provides empirical evidence that LR and SVR can reliably forecast automobile stock market prices, holding promise for investment applications.
Title: Predicting Automobile Stock Prices Index in the Tehran Stock Exchange Using Machine Learning Models
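The comparison protocol can be sketched as below. The Tehran Stock Exchange data is not reproduced here, so a synthetic random-walk price series stands in; the 5-lag framing, 80/20 chronological split, and model hyperparameters are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0.1, 1.0, 600)) + 100  # synthetic price series

# Supervised framing: predict tomorrow's price from the last 5 prices.
lags = 5
X = np.array([prices[i:i + lags] for i in range(len(prices) - lags)])
y = prices[lags:]
split = int(0.8 * len(X))  # chronological split, no shuffling
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

results = {}
for name, model in [("LR", LinearRegression()),
                    ("SVR", SVR(kernel="rbf", C=10.0)),
                    ("RF", RandomForestRegressor(n_estimators=100, random_state=0))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {"MAE": mean_absolute_error(y_te, pred),
                     "RMSE": mean_squared_error(y_te, pred) ** 0.5,
                     "R2": r2_score(y_te, pred)}
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

On a drifting series the kernel and tree models struggle to extrapolate beyond the training price range, which illustrates one plausible reason LR fares well on level-price forecasting.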
Pub Date: 2023-10-08 | DOI: 10.5815/ijisa.2023.05.05
Deep Karan Singh, Nisha Rawat
Climate change, a significant and lasting alteration in global weather patterns, is profoundly impacting the stability and predictability of global temperature regimes. As the world continues to grapple with the far-reaching effects of climate change, accurate and timely temperature predictions have become pivotal to various sectors, including agriculture, energy, and public health. Crucially, precise temperature forecasting assists in developing effective climate change mitigation and adaptation strategies. With the advent of machine learning techniques, we now have powerful tools that can learn from vast climatic datasets and provide improved predictive performance. This study compares three such advanced machine learning models, XGBoost, Support Vector Machine (SVM), and Random Forest, in predicting daily maximum and minimum temperatures using a 45-year dataset from Visakhapatnam airport. Each model was rigorously trained and evaluated on key performance metrics including training loss, Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R2 score, Mean Absolute Percentage Error (MAPE), and Explained Variance Score. Although no single model dominated across all metrics, SVM and Random Forest showed slightly superior performance on several measures. These findings not only highlight the potential of machine learning techniques in enhancing the accuracy of temperature forecasting but also stress the importance of selecting an appropriate model and performance metrics aligned with the task at hand. This research accomplishes a thorough comparative analysis, conducts a rigorous evaluation of the models, and highlights the significance of model selection.
Title: Machine Learning for Weather Forecasting: XGBoost vs SVM vs Random Forest in Predicting Temperature for Visakhapatnam
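The multi-metric evaluation loop the study describes can be sketched as follows. The 45-year Visakhapatnam record is not bundled here, so ten years of synthetic seasonal maximum temperatures stand in, and the sin/cos day-of-year encoding and held-out tail are illustrative choices. XGBoost is omitted to keep the sketch to scikit-learn; SVM and Random Forest are scored with the same battery of metrics the paper reports.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             mean_absolute_percentage_error, explained_variance_score)

rng = np.random.default_rng(1)
day = np.arange(3650)  # ten synthetic years of daily records
tmax = 33 + 4 * np.sin(2 * np.pi * day / 365.25) + rng.normal(0, 1.0, day.size)

# Encode day-of-year as sin/cos so the annual cycle is learnable.
doy = day % 365.25
X = np.column_stack([np.sin(2 * np.pi * doy / 365.25),
                     np.cos(2 * np.pi * doy / 365.25)])
split = 2920  # hold out the last ~2 years
X_tr, X_te, y_tr, y_te = X[:split], X[split:], tmax[:split], tmax[split:]

metrics = {}
for name, model in [("SVM", SVR()), ("RF", RandomForestRegressor(random_state=0))]:
    p = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, p)
    metrics[name] = {"MAE": mean_absolute_error(y_te, p), "MSE": mse,
                     "RMSE": mse ** 0.5, "R2": r2_score(y_te, p),
                     "MAPE": mean_absolute_percentage_error(y_te, p),
                     "EVS": explained_variance_score(y_te, p)}
    print(name, {k: round(v, 3) for k, v in metrics[name].items()})
```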
Pub Date: 2023-08-08 | DOI: 10.5815/ijisa.2023.04.05
Sabbir Hossain, Rahman Sharar, Md. Ibrahim Bahadur, A. Sufian, Rashidul Hasan Nabil
The emergence of chatbots over the last 50 years has been driven primarily by the need for virtual assistance. Unlike their biological anthropomorphic counterparts, fellow homo sapiens, chatbots can instantaneously present themselves at the user's need and convenience. Be it something as benign as needing a friend to talk to, or a case as dire as requiring medical assistance, chatbots are unequivocally ubiquitous in their utility. This paper aims to develop one such chatbot, capable of not only analyzing human text (and, in the near future, speech) but also of assisting users medically by accumulating data from relevant datasets. Although Recurrent Neural Networks (RNNs) are often used to develop chatbots, the vanishing gradient issue brought about by backpropagation, coupled with the cumbersome process of parsing each word sequentially, has led to the increased usage of Transformer Neural Networks (TNNs) instead, which parse entire sentences at once while simultaneously giving them context via embeddings, leading to increased parallelization. Two variants of the TNN Bidirectional Encoder Representations from Transformers (BERT), namely KeyBERT and BioBERT, are used for tagging the keywords in each sentence and for contextual vectorization into Q/A pairs for matrix multiplication, respectively. A final GPT-2 (Generative Pre-trained Transformer 2) layer is applied to fine-tune the results from BioBERT into a human-readable form. The outcome of such a system could lessen the need for trips to the nearest physician, along with the temporal delay and financial resources those trips require.
Title: MediBERT: A Medical Chatbot Built Using KeyBERT, BioBERT and GPT-2
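The retrieval step of such a pipeline can be sketched with stand-in vectors: embeddings of stored questions are matrix-multiplied against the query embedding and the best-matching answer is returned. In the paper this vectorization comes from BioBERT; here the toy 4-d vectors and the three Q/A pairs are invented purely to show the matching mechanics.

```python
import numpy as np

# Invented Q/A store; real answers would come from curated medical datasets.
qa_pairs = [
    ("what helps a sore throat", "Warm fluids and rest are commonly advised."),
    ("how to treat a minor burn", "Cool the burn under running water."),
    ("what causes headaches", "Common triggers include dehydration and stress."),
]

# Stand-in question embeddings, one row per stored question, L2-normalised.
Q = np.array([[0.9, 0.1, 0.0, 0.1],
              [0.0, 0.9, 0.3, 0.1],
              [0.1, 0.0, 0.9, 0.2]])
Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)

def answer(query_vec: np.ndarray) -> str:
    """Return the answer whose question embedding is most similar to the query."""
    v = query_vec / np.linalg.norm(query_vec)
    scores = Q @ v  # cosine similarities via one matrix multiply
    return qa_pairs[int(np.argmax(scores))][1]

print(answer(np.array([0.0, 1.0, 0.2, 0.0])))  # closest stored question: the burn one
```

In the full system, a generative layer (GPT-2 in the paper) would then rewrite the retrieved answer into conversational prose.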
Pub Date: 2023-08-08 | DOI: 10.5815/ijisa.2023.04.04
Farhan M. A. Nashwan, Khaled A. M. Al Soufy, N. Al-Ashwal, Majed A. Al-Badany
Automatic Number Plate Recognition (ANPR) is an important tool in Intelligent Transport Systems (ITS). Plate features can be used to identify any vehicle, helping to ensure effective law enforcement and security. However, this is a challenging problem because of the diversity of plate formats and the varying scales, rotations, non-uniform illumination, and other conditions encountered during image acquisition. This work aims to design and implement an ANPR system specific to Yemeni vehicle plates. The proposed system involves several steps to detect, segment, and recognize Yemeni vehicle plate numbers. First, a dataset of images is manually collected. Then, the collected images undergo preprocessing, followed by plate extraction, digit segmentation, and feature extraction. Finally, the plate numbers are identified using a Support Vector Machine (SVM). When designing the proposed system, all conditions that could affect its efficiency were considered. The experimental results show that the proposed system achieved training and testing success rates of 96.98% and 99.19%, respectively.
Title: Design of Automatic Number Plate Recognition System for Yemeni Vehicles with Support Vector Machine
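The final recognition stage can be sketched as below, with scikit-learn's bundled 8x8 digit images standing in for segmented Yemeni plate digits (the paper's own dataset is not public here, and the RBF kernel and split are illustrative): pixel features are fed to an SVM classifier, mirroring the identification step the paper describes.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # bundled 8x8 grayscale digit images, no download needed
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# RBF-kernel SVM over raw pixel intensities as the digit recognizer.
clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

A real plate pipeline would precede this with the detection, extraction, and segmentation stages, and would likely use hand-crafted features rather than raw pixels.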
Pub Date: 2023-08-08 | DOI: 10.5815/ijisa.2023.04.01
Md. Abdur Rahman, A. Nayem, Mahfida Amjad, Md. Saeed Siddik
Toxic comments on social media platforms, news portals, and online forums are impolite, insulting, or unreasonable remarks that usually drive other users out of a conversation. Because of the sheer number of comments, moderating them manually is impractical, so online service providers detect toxicity automatically using Machine Learning (ML) algorithms. However, a model's toxicity identification performance relies on the best combination of classifier and feature extraction technique. In this empirical study, we set up a comparison environment for toxic comment classification using 15 frequently used supervised ML classifiers paired with the four most prominent feature extraction schemes. We considered the publicly available Jigsaw dataset of toxic comments written by human users. We tested, analyzed, and compared every investigated classifier/feature pair and report our conclusions, using accuracy and area under the ROC curve as the evaluation metrics. We found that Logistic Regression and AdaBoost are the best toxic comment classifiers: their average accuracies are 0.895 and 0.893, respectively, and both achieved the same area under the ROC curve score (0.828). The primary takeaway of this study is therefore that Logistic Regression and AdaBoost leveraging BoW, TF-IDF, or Hashing features perform sufficiently well for toxic comment classification.
Title: How do Machine Learning Algorithms Effectively Classify Toxic Comments? An Empirical Analysis
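One cell of such a classifier/feature grid, the Logistic Regression + TF-IDF pairing the study recommends, can be sketched as below. The six-comment corpus is invented for illustration; the actual study trains on the Jigsaw dataset and scores with accuracy and ROC-AUC.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled corpus: 1 = toxic, 0 = clean.
comments = [
    "you are an idiot and nobody wants you here",
    "shut up you worthless troll",
    "thanks for the detailed explanation",
    "great point, I learned something new",
    "this is garbage and so are you",
    "could you share the source for that claim",
]
toxic = [1, 1, 0, 0, 1, 0]

# TF-IDF features feeding a Logistic Regression classifier, as one grid cell.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(comments, toxic)

pred = clf.predict(["you are a troll", "thanks for the helpful explanation"])
print(pred)  # discriminative tokens push the first toxic, the second clean
```

Swapping `TfidfVectorizer` for `CountVectorizer` (BoW) or `HashingVectorizer`, and `LogisticRegression` for `AdaBoostClassifier`, reproduces the other pairings in the study's grid.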
Pub Date : 2023-08-08DOI: 10.5815/ijisa.2023.04.03
Rajan Prasad, P. Shukla
Autism spectrum disorder (ASD) is a chronic developmental condition that impairs a person's ability to communicate and connect with others. In people with ASD, social contact and reciprocal communication are continually jeopardized. People with ASD may require varying degrees of psychological support to gain greater independence, or they may require ongoing supervision and care. Early detection of ASD allows more time for individual rehabilitation. In this study, we propose a fuzzy classifier for ASD classification and test its interpretability with the fuzzy index and Nauck's index to ensure its reliability. The rule base is created with the Gauje tool, and the fuzzy rules are then applied to a fuzzy neural network to predict autism. The proposed model is built on a Mamdani rule set and optimized using the backpropagation algorithm; it uses a heuristic function and pattern evolution to classify the dataset. The model is evaluated using the benchmark metrics accuracy and F-measure, while Nauck's index and the fuzzy index quantify interpretability. The proposed model detects ASD more accurately than the compared classifiers, achieving an average accuracy of 91%.
{"title":"Interpretable Fuzzy System for Early Detection Autism Spectrum Disorder","authors":"Rajan Prasad, P. Shukla","doi":"10.5815/ijisa.2023.04.03","DOIUrl":"https://doi.org/10.5815/ijisa.2023.04.03","url":null,"abstract":"Autism spectrum disorder (ASD) is a chronic developmental impairment that impairs a person's ability to communicate and connect with others. In people with ASD, social contact and reciprocal communication are continually jeopardized. People with ASD may require varying degrees of psychological aid in order to gain greater independence, or they may require ongoing supervision and care. Early discovery of ASD results in more time allocated to individual rehabilitation. In this study, we proposed the fuzzy classifier for ASD classification and tested its interpretability with the fuzzy index and Nauck's index to ensure its reliability. Then, the rule base is created with the Gauje tool. The fuzzy rules were then applied to the fuzzy neural network to predict autism. The suggested model is built on the Mamdani rule set and optimized using the backpropagation algorithm. The proposed model uses a heuristic function and pattern evolution to classify dataset. The model is evaluated using the benchmark metrics accuracy and F-measure, and Nauck's index and fuzzy index are employed to quantify interpretability. 
The proposed model is superior in its ability to accurately detect ASD, with an average accuracy rate of 91% compared to other classifiers.","PeriodicalId":14067,"journal":{"name":"International Journal of Intelligent Systems and Applications in Engineering","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82401083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
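A Mamdani rule base of the kind the paper builds can be illustrated in miniature: two rules over a single hypothetical screening score, with min implication, max aggregation, and centroid defuzzification. The membership functions, score range, and rules below are illustrative assumptions for exposition, not the authors' actual system.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Input terms over a hypothetical screening score in [0, 10]
score_low  = lambda s: tri(s, -1.0, 0.0, 6.0)
score_high = lambda s: tri(s, 4.0, 10.0, 11.0)

# Output terms over a risk universe in [0, 1]
risk_low  = lambda r: tri(r, -0.1, 0.0, 0.6)
risk_high = lambda r: tri(r, 0.4, 1.0, 1.1)

def mamdani_risk(score, steps=200):
    """Evaluate two Mamdani rules and defuzzify by centroid.

    Rule 1: IF score is low  THEN risk is low
    Rule 2: IF score is high THEN risk is high
    Each rule's output set is clipped at its firing strength (min
    implication), the clipped sets are combined with max, and the
    centroid of the aggregate over a sampled universe is returned.
    """
    w1, w2 = score_low(score), score_high(score)
    num = den = 0.0
    for i in range(steps + 1):
        r = i / steps
        mu = max(min(w1, risk_low(r)), min(w2, risk_high(r)))
        num += r * mu
        den += mu
    return num / den if den else 0.5  # neutral output if no rule fires
```

Because the output sets are full fuzzy sets rather than singletons, this is Mamdani inference proper; replacing the centroid step with a weighted average of singleton outputs would turn it into a zero-order Sugeno system.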
Pub Date : 2023-08-08DOI: 10.5815/ijisa.2023.04.02
Ikechi Risi, C. Ogbonda, F. B. Sigalo, Isabona Joseph
The poor network service frequently experienced by some mobile phone users within dead zones in Nigeria is an issue that different researchers have attributed to the wrong positioning and planning of the evolved NodeB (eNodeB) transmitter using existing propagation loss models. To address this issue, an adaptive hybrid propagation loss model based on wavelet transform and genetic algorithm (GA) methods has been developed for cellular network planning and optimization. First, signal strengths were measured within four selected eNodeB cell sites in a long term evolution (LTE) network at 2600 MHz using the drive-test method. Secondly, the measured data were denoised with wavelet tools. Thirdly, the COST231 model was optimized and reduced to a generic model with tunable parameters. Fourthly, a genetic optimization algorithm automatically developed propagation loss models for the denoised signal data (designated the wavelet-GA model) and the unprocessed signal data (designated the GA model). The hybrid wavelet-GA, GA, and COST231 propagation loss models were compared using three error metrics: root mean square error (RMSE), mean absolute error (MAE), and correlation coefficient (R). The hybrid wavelet-GA model yielded the lowest RMSEs of 2.8813 dB, 3.9381 dB, 4.7643 dB, and 6.9366 dB, whereas the COST231 model gave the highest RMSE values. The hybrid wavelet-GA model also achieved the lowest MAEs compared with the COST231 and GA models: 2.2016 dB, 2.8672 dB, 3.4766 dB, and 5.8235 dB. Its correlation coefficients were 90.04%, 78.61%, 92.21%, and 91.23% for the four cell sites.
The hybrid wavelet-GA model was further validated by checking the correlation coefficient on measured signal data from eNodeB cell sites other than those used to develop it, and it achieved a validation correlation of 97.41%. The existing standard COST231 model cannot predict propagation loss with high accuracy and is therefore not well suited to parts of Port Harcourt, Nigeria. The proposed hybrid wavelet-GA model achieves a high performance level and is suitable for cellular network planning and optimization. In future work, more regions and locations should be considered to build more robust propagation loss models.
{"title":"An Adaptive Hybrid Outdoor Propagation Loss Prediction Modelling for Effective Cellular Systems Network Planning and Optimization","authors":"Ikechi Risi, C. Ogbonda, F. B. Sigalo, Isabona Joseph","doi":"10.5815/ijisa.2023.04.02","DOIUrl":"https://doi.org/10.5815/ijisa.2023.04.02","url":null,"abstract":"The frequent poor service network experienced by some mobile phone users within some deadlock areas in Nigeria is an issue which has been identified by different researchers due to wrong positioning and planning of the evolved NodeB (eNodeB) transmitter using existing propagation loss models. To effectively contribute towards this potential issue constantly experienced in some part of Nigeria, an adaptive hybrid propagation loss model that is based on wavelet transform and genetic algorithm methods has been developed for cellular network planning and optimization, with the capacity to resolve the problems absolutely. First, the signal strengths were measured within four selected eNodeB cell sites in long term evolution (LTE) at 2600MHz using drive-test method. Secondly, the measured data were denoised through wavelet tools. Thirdly, COST231 model was optimize and deduced to generic model with parameters. Fourthly, genetic optimization algorithm automatically developed the propagation loss models for denoised signal data (designated as wavelet-GA model) and unprocessed signal data (designated as GA model). The hybrid wavelet-GA propagation loss model, GA propagation loss model, and COST231 propagation loss model were compared based on three error metrics such as root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (R). The developed hybrid wavelet-GA model estimated the lowest RMSEs of 2.8813 dB, 3.9381 dB, 4.7643 dB, 6.9366 dB, whereas, COST231 model gave highest value of RMSE. 
The developed hybrid wavelet-GA model also derived the least value of MAE as compared with COST231 and the GA models, such as, 2.2016 dB, 2.8672 dB, 3.4766 dB, 5.8235 dB. The correlation coefficients were also compared, and it showed that the developed hybrid wavelet-GA model were 90.04%, 78.61%, 92.21% and 91.23% for the four cell sites. The developed hybrid wavelet-GA model was also validated to account for the performance level by checking for the correlation coefficient using another measured signal data from different eNodeB cell sites other than the once used for the developed of the hybrid wavelet-GA model. It was noticed that the developed hybrid wavelet-GA propagation loss model is 97.41% valid. Existing standard COST231 model are not able to predict propagation loss with high level of accuracy, as such not efficient to be applied within part of Port Harcourt, Nigeria. The proposed hybrid wavelet-GA model has proven to achieve high performance level and it is relevant to be utilized for cellular network planning and optimization. In future purposes, more regions and locations should be considered to form a broader view in the development of more robust propagation loss models.","PeriodicalId":14067,"journal":{"name":"International Journal of Intelligent Systems and Applications in Engineering","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84860678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
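The baseline being tuned above is the COST231-Hata empirical model, whose median path-loss formula can be written down directly. The sketch below uses the medium-city mobile-antenna correction; the frequency, antenna heights, and distance in the example are illustrative assumptions. Note that COST231-Hata is formally specified for 1500-2000 MHz, below the 2600 MHz LTE band measured in the study, which is one reason a fitted generic model can outperform it there.

```python
import math

def cost231_hata(f_mhz, d_km, h_base_m, h_mobile_m, urban=True):
    """COST231-Hata median path loss in dB.

    f_mhz: carrier frequency (formally valid 1500-2000 MHz)
    d_km: transmitter-receiver distance (1-20 km)
    h_base_m, h_mobile_m: base-station and mobile antenna heights
    urban: metropolitan (+3 dB correction) vs suburban/medium city (0 dB)
    """
    # Mobile-antenna correction for small/medium cities
    a_hm = ((1.1 * math.log10(f_mhz) - 0.7) * h_mobile_m
            - (1.56 * math.log10(f_mhz) - 0.8))
    c_m = 3.0 if urban else 0.0
    return (46.3 + 33.9 * math.log10(f_mhz)
            - 13.82 * math.log10(h_base_m) - a_hm
            + (44.9 - 6.55 * math.log10(h_base_m)) * math.log10(d_km)
            + c_m)

# Illustrative evaluation: 2 GHz carrier, 30 m base antenna, 1.5 m handset
pl_1km = cost231_hata(2000.0, 1.0, 30.0, 1.5)  # roughly 140 dB
pl_2km = cost231_hata(2000.0, 2.0, 30.0, 1.5)
```

A GA-based approach like the paper's would treat the constants in this formula as free parameters and search for the values that minimise RMSE against the drive-test measurements.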