Linked Data: A Framework for Publishing Five-Star Open Government Data
Pub Date: 2021-12-08 | DOI: 10.5815/ijitcs.2021.06.01
Bassel Al-khatib, Ali A. Ali
With the increased adoption of open government initiatives around the world, huge amounts of raw governmental datasets have been released. However, the data is published in heterogeneous formats and vocabularies and is often of poor quality, suffering from inconsistency, messiness, and possible incorrectness because it is collected according to the practicalities of the source organization, which makes it hard to reuse and integrate for serving citizens and third-party apps. This research introduces the LDOG (Linked Data for Open Government) experimental framework, which aims to provide a modular architecture that can be integrated into the open government hierarchy, allowing huge amounts of data to be gathered at the source in a fine-grained manner and published directly as linked data based on Tim Berners-Lee's five-star deployment scheme, with a validation layer using SHACL, which results in high-quality data. The general idea is to model the hierarchy of government and classify government organizations into two types: modeling organizations at higher levels and data source organizations at lower levels. Linked data experts in the modeling organizations are responsible for designing data templates, ontologies, SHACL shapes, and linkage specifications, whereas non-experts in the data source organizations contribute their knowledge of the data to perform mapping, reconciliation, and data correction. This approach lowers the number of experts needed, which is a known barrier to linked data adoption. To test the functionality of our framework in action, we developed the LDOG platform, which utilizes the different modules of the framework to power a set of user interfaces for publishing government datasets. We used this platform to convert some of the UAE's government datasets into linked data. Finally, on top of the converted data, we built a proof-of-concept app to show the power of five-star linked data for integrating datasets from disparate organizations and to promote adoption by governments. Our work defines a clear path to integrating linked data into open governments and solid steps for publishing and enhancing it in a fine-grained and practical manner with fewer linked data experts. It also extends SHACL to define data shapes and convert CSV to RDF.
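As a minimal sketch of the kind of SHACL validation step the abstract describes (not the LDOG implementation), the snippet below checks RDF converted from a CSV row against a shape before publication. It assumes the rdflib and pyshacl Python packages; the example vocabulary, class names, and property names are purely illustrative.

```python
# Illustrative only: validate RDF converted from a CSV row against a SHACL shape.
# Assumes rdflib and pyshacl; the ex: vocabulary below is hypothetical.
from rdflib import Graph
from pyshacl import validate

data_ttl = """
@prefix ex: <http://example.org/gov#> .
ex:dataset1-row1 a ex:SchoolRecord ;
    ex:schoolName "Al Noor School" ;
    ex:studentCount 420 .
"""

shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/gov#> .
ex:SchoolRecordShape a sh:NodeShape ;
    sh:targetClass ex:SchoolRecord ;
    sh:property [ sh:path ex:schoolName ; sh:datatype xsd:string ; sh:minCount 1 ] ;
    sh:property [ sh:path ex:studentCount ; sh:datatype xsd:integer ; sh:minInclusive 0 ] .
"""

data_graph = Graph().parse(data=data_ttl, format="turtle")
shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

conforms, _, report_text = validate(data_graph, shacl_graph=shapes_graph)
print(conforms)      # True when the converted data satisfies the shape
print(report_text)   # human-readable validation report for data stewards
```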
{"title":"Linked Data: A Framework for Publishing FiveStar Open Government Data","authors":"Bassel Al-khatib, Ali A. Ali","doi":"10.5815/ijitcs.2021.06.01","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.06.01","url":null,"abstract":"With the increased adoption of open government initiatives around the world, a huge amount of governmental raw datasets was released. However, the data was published in heterogeneous formats and vocabularies and in many cases in bad quality due to inconsistency, messy, and maybe incorrectness as it has been collected by practicalities within the source organization, which makes it inefficient for reusing and integrating it for serving citizens and third-party apps. This research introduces the LDOG (Linked Data for Open Government) experimental framework, which aims to provide a modular architecture that can be integrated into the open government hierarchy, allowing huge amounts of data to be gathered in a fine-grained manner from source and directly publishing them as linked data based on Tim Berners lee’s five-star deployment scheme with a validation layer using SHACL, which results in high quality data. The general idea is to model the hierarchy of government and classify government organizations into two types, the modeling organizations at higher levels and data source organizations at lower levels. Modeling organization’s experts in linked data have the responsibility to design data templates, ontologies, SHACL shapes, and linkage specifications. whereas non-experts can be incorporated in data source organizations to utilize their knowledge in data to do mapping, reconciliation, and correcting data. This approach lowers the needed experts that represent a problem of linked data adoption. To test the functionality of our framework in action, we developed the LDOG platform which utilizes the different modules of the framework to power a set of user interfaces that can be used to publish government datasets. we used this platform to convert some of UAE's government datasets into linked data. Finally, on top of the converted data, we built a proof-of-concept app to show the power of five-star linked data for integrating datasets from disparate organizations and to promote the governments' adoption. Our work has defined a clear path to integrate the linked data into open governments and solid steps to publishing and enhancing it in a fine-grained and practical manner with a lower number of experts in linked data, It extends SHACL to define data shapes and convert CSV to RDF.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"215 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132348370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Myers-Briggs Personality Prediction and Sentiment Analysis of Twitter using Machine Learning Classifiers and BERT
Pub Date: 2021-12-08 | DOI: 10.5815/ijitcs.2021.06.04
Prajwal Kaushal, Nithin Bharadwaj B P, Pranav M S, K. S., Dr. Anjan K Koundinya
Twitter is one of the most sophisticated social networking platforms, its user base is growing exponentially, and terabytes of data are generated every day. Technology giants invest billions of dollars in drawing insights from these tweets, yet this huge amount of data remains underutilized. The main aim of this paper is to solve two tasks. The first is to build a sentiment analysis model using BERT (Bidirectional Encoder Representations from Transformers) that analyses tweets and predicts the sentiments of their users. The second is to build a personality prediction model using various machine learning classifiers under the umbrella of the Myers-Briggs Type Indicator (MBTI), one of the most widely used psychological instruments in the world. Using this, we intend to predict the traits and qualities of people based on their posts and interactions on Twitter. The model succeeds in predicting the personality traits and qualities of Twitter users. We intend to use the analyzed results in various applications such as market research, recruitment, psychological tests, and consulting in the future.
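As a minimal sketch of BERT-based tweet sentiment scoring of the kind the abstract describes (not the authors' exact model or data), the snippet below assumes the Hugging Face transformers package; the checkpoint name and example tweets are illustrative choices.

```python
# Illustrative sketch: tweet sentiment with a pretrained BERT-family model.
# Assumes the Hugging Face transformers package; the checkpoint is an example choice.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

tweets = [
    "Loving the new update, great work!",
    "This service keeps crashing, really frustrating.",
]

for tweet, result in zip(tweets, sentiment(tweets)):
    # Each result is a dict like {"label": "POSITIVE", "score": 0.99}
    print(f"{result['label']:>8}  {result['score']:.2f}  {tweet}")
```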
{"title":"Myers-briggs Personality Prediction and Sentiment Analysis of Twitter using Machine Learning Classifiers and BERT","authors":"Prajwal Kaushal, Nithin Bharadwaj B P, Pranav M S, K. S., Dr. Anjan K Koundinya","doi":"10.5815/ijitcs.2021.06.04","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.06.04","url":null,"abstract":"Twitter being one of the most sophisticated social networking platforms whose users base is growing exponentially, terabytes of data is being generated every day. Technology Giants invest billions of dollars in drawing insights from these tweets. The huge amount of data is still going underutilized. The main of this paper is to solve two tasks. Firstly, to build a sentiment analysis model using BERT (Bidirectional Encoder Representations from Transformers) which analyses the tweets and predicts the sentiments of the users. Secondly to build a personality prediction model using various machine learning classifiers under the umbrella of Myers-Briggs Personality Type Indicator. MBTI is one of the most widely used psychological instruments in the world. Using this we intend to predict the traits and qualities of people based on their posts and interactions in Twitter. The model succeeds to predict the personality traits and qualities on twitter users. We intend to use the analyzed results in various applications like market research, recruitment, psychological tests, consulting, etc, in future.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125357729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance of Machine Learning Algorithms with Different K Values in K-fold Cross-Validation
Pub Date: 2021-12-08 | DOI: 10.5815/ijitcs.2021.06.05
Isaac Kofi Nti, Owusu Nyarko-Boateng, J. Aning
The numerical value of k in the k-fold cross-validation technique used to train machine learning predictive models is an essential element that impacts the model's performance. A good choice of k results in better accuracy estimates, while a poorly chosen value for k might misrepresent the model's performance. In the literature, the most commonly used values of k are five (5) or ten (10), as these two values are believed to give test error rate estimates that suffer neither from extremely high bias nor from very high variance; however, there is no formal rule. To the best of our knowledge, few experimental studies have attempted to investigate the effect of diverse k values in training different machine learning models. This paper empirically analyses the prevalence and effect of distinct k values (3, 5, 7, 10, 15 and 20) on the validation performance of four well-known machine learning algorithms: Gradient Boosting Machine (GBM), Logistic Regression (LR), Decision Tree (DT) and K-Nearest Neighbours (KNN). It was observed that the best value of k and the model validation performance differ from one machine learning algorithm to another for the same classification task. However, our empirical results suggest that k = 7 offers a slight increase in validation accuracy and area under the curve, with lower computational complexity than k = 10, across most of the algorithms. We discuss the study outcomes in detail and outline guidelines for beginners in the machine learning field on selecting the best k value and machine learning algorithm for a given task.
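A minimal sketch, assuming scikit-learn and a synthetic stand-in dataset rather than the paper's real data, of how the studied k values can be compared for one of the four classifiers; the scoring choice and dataset are illustrative.

```python
# Illustrative sketch: compare validation accuracy for several k values in k-fold CV.
# Uses scikit-learn; the synthetic dataset stands in for the paper's real data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

for k in (3, 5, 7, 10, 15, 20):
    scores = cross_val_score(model, X, y, cv=k, scoring="accuracy")
    print(f"k={k:>2}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```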
{"title":"Performance of Machine Learning Algorithms with Different K Values in K-fold CrossValidation","authors":"Isaac Kofi Nti, Owusu N yarko-Boateng, J. Aning","doi":"10.5815/ijitcs.2021.06.05","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.06.05","url":null,"abstract":"The numerical value of k in a k-fold cross-validation training technique of machine learning predictive models is an essential element that impacts the model’s performance. A right choice of k results in better accuracy, while a poorly chosen value for k might affect the model’s performance. In literature, the most commonly used values of k are five (5) or ten (10), as these two values are believed to give test error rate estimates that suffer neither from extremely high bias nor very high variance. However, there is no formal rule. To the best of our knowledge, few experimental studies attempted to investigate the effect of diverse k values in training different machine learning models. This paper empirically analyses the prevalence and effect of distinct k values (3, 5, 7, 10, 15 and 20) on the validation performance of four well-known machine learning algorithms (Gradient Boosting Machine (GBM), Logistic Regression (LR), Decision Tree (DT) and K-Nearest Neighbours (KNN)). It was observed that the value of k and model validation performance differ from one machine-learning algorithm to another for the same classification task. However, our empirical suggest that k = 7 offers a slight increase in validations accuracy and area under the curve measure with lesser computational complexity than k = 10 across most MLA. We discuss in detail the study outcomes and outline some guidelines for beginners in the machine learning field in selecting the best k value and machine learning algorithm for a given task.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127511422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Psychosocial Features for Hate Speech Detection in Code-switched Texts
Pub Date: 2021-12-08 | DOI: 10.5815/ijitcs.2021.06.03
Edward Ombui, Lawrence Muchemi, P. Wagacha
This study examines the problem of hate speech identification in code-switched text from social media using a natural language processing approach. It explores different features for training nine models and empirically evaluates their predictiveness in identifying hate speech in a ~50k human-annotated dataset. The study espouses a novel, hierarchical approach that employs Latent Dirichlet Allocation to generate topic models that help build a high-level psychosocial feature set, which we abbreviate as PDC. PDC groups words with similar meanings into word families, which is significant in capturing code-switching during the preprocessing stage for supervised learning models. The high-level PDC features are based on a hate speech annotation framework [1] that is largely informed by the duplex theory of hate [2]. Results obtained from frequency-based models using the PDC features on a dataset of tweets generated during the 2012 and 2017 presidential elections in Kenya indicate an F-score of 83% (precision: 81%, recall: 85%) in identifying hate speech. The study is significant in that it publicly shares a unique code-switched hate speech dataset that is valuable for comparative studies. Secondly, it provides a methodology for building a novel PDC feature set to identify nuanced forms of hate speech, camouflaged in code-switched data, which conventional methods could not adequately identify.
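A minimal sketch of the topic-modelling step the abstract mentions, assuming scikit-learn and a tiny invented code-switched corpus; it only shows how LDA surfaces topic terms that could seed word families, and does not reproduce the authors' PDC construction.

```python
# Illustrative sketch: derive topics from a small code-switched corpus with LDA,
# as a starting point for grouping related words; scikit-learn, toy data only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "siasa hii ni chafu these politicians lie kila siku",
    "great rally today tuko pamoja kwa amani na umoja",
    "hawa watu must go we are tired of empty promises",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {idx}: {', '.join(top_terms)}")
```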
{"title":"Psychosocial Features for Hate Speech Detection in Code-switched Texts","authors":"Edward Ombui, Lawrence Muchemi, P. Wagacha","doi":"10.5815/ijitcs.2021.06.03","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.06.03","url":null,"abstract":"This study examines the problem of hate speech identification in codeswitched text from social media using a natural language processing approach. It explores different features in training nine models and empirically evaluates their predictiveness in identifying hate speech in a ~50k human-annotated dataset. The study espouses a novel approach to handle this challenge by introducing a hierarchical approach that employs Latent Dirichlet Analysis to generate topic models that help build a high-level Psychosocial feature set that we acronym PDC. PDC groups similar meaning words in word families, which is significant in capturing codeswitching during the preprocessing stage for supervised learning models. The high-level PDC features generated are based on a hate speech annotation framework [1] that is largely informed by the duplex theory of hate [2]. Results obtained from frequency-based models using the PDC feature on the dataset comprising of tweets generated during the 2012 and 2017 presidential elections in Kenya indicate an f-score of 83% (precision: 81%, recall: 85%) in identifying hate speech. The study is significant in that it publicly shares a unique codeswitched dataset for hate speech that is valuable for comparative studies. Secondly, it provides a methodology for building a novel PDC feature set to identify nuanced forms of hate speech, camouflaged in codeswitched data, which conventional methods could not adequately identify.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131331192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Mining for Cyberbullying and Harassment Detection in Arabic Texts
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.04
Eman Bashir, M. Bouguessa
Cyberbullying is broadly viewed as a severe social danger that affects many individuals around the globe, particularly young people and teenagers. The Arab world has embraced technology and continues to use it in different ways to communicate inside social media platforms. However, Arabic text is difficult to process due to its complexity and the scarcity of its resources. This paper investigates several questions related to detecting cyberbullying and harassment in Arabic content posted on Twitter. To answer these questions, we collected an Arabic corpus covering the relevant topics using specific keywords, which we explain in detail. We devised experiments in which we investigated several learning approaches. Our results suggest that deep learning models such as LSTM achieve better performance than traditional cyberbullying classifiers, with an accuracy of 72%.
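A minimal sketch of an LSTM text classifier of the kind compared in the abstract, assuming TensorFlow/Keras; the vocabulary size, sequence length, and layer sizes are arbitrary illustrative choices, and training data preparation (tokenizing Arabic tweets into padded integer sequences) is only indicated in a comment.

```python
# Illustrative sketch: a small LSTM classifier for bullying vs. neutral text.
# Uses TensorFlow/Keras; sizes below are assumptions, not the paper's settings.
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed tokenizer vocabulary
MAX_LEN = 50         # assumed maximum tweet length in tokens

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),  # 1 = bullying/harassment, 0 = neutral
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Training would then use padded integer sequences of tokenized tweets, e.g.:
# model.fit(x_train, y_train, validation_split=0.1, epochs=5, batch_size=32)
```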
{"title":"Data Mining for Cyberbullying and Harassment Detection in Arabic Texts","authors":"Eman Bashir, M. Bouguessa","doi":"10.5815/ijitcs.2021.05.04","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.04","url":null,"abstract":"Broadly cyberbullying is viewed as a severe social danger that influences many individuals around the globe, particularly young people and teenagers. The Arabic world has embraced technology and continues using it in different ways to communicate inside social media platforms. However, the Arabic text has drawbacks for its complexity, challenges, and scarcity of its resources. This paper investigates several questions related to the content of how to protect an Arabic text from cyberbullying/harassment through the information posted on Twitter. To answer this question, we collected the Arab corpus covering the topics with specific words, which will explain in detail. We devised experiments in which we investigated several learning approaches. Our results suggest that deep learning models like LSTM achieve better performance compared to other traditional yberbullying classifiers with an accuracy of 72%.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"478 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128002877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Effective Text Classifier using Machine Learning for Identifying Tweets’ Polarity Concerning Terrorist Connotation
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.02
Norah Al-Harbi, Amirrudin Bin Kamsin
Terrorist groups in the Arab world have been using social networking sites like Twitter and Facebook to rapidly spread terror for the past few years. Detection and suspension of such accounts is a way to control the menace to some extent. This research is aimed at building an effective text classifier, using machine learning to identify the polarity of tweets automatically. Five classifiers were chosen: AdB_SAMME, AdB_SAMME.R, Linear SVM, NB, and LR. These classifiers were applied to three features, namely S1 (one word, unigram), S2 (word pair, bigram), and S3 (word triplet, trigram). All five classifiers evaluated samples S1, S2, and S3 on 346 preprocessed tweets. The feature extraction process utilized one of the most widely applied weighting schemes, tf-idf (term frequency-inverse document frequency). The results were validated by four experts in the Arabic language (three teachers and an educational supervisor in Saudi Arabia) through a questionnaire. The study found that the Linear SVM classifier yielded the best result, 99.7% classification accuracy on S3, among all the classifiers used. When both classification accuracy and time were considered, the NB classifier demonstrated the best performance on S1 with 99.4% accuracy, which was comparable with Linear SVM. The Arab world has faced massive terrorist attacks in the past, and therefore the research is highly significant and relevant due to its specific focus on detecting terrorism messages in Arabic. The state-of-the-art methods developed so far for tweet classification are mostly focused on analyzing English text, and hence there was a dire need for devising machine learning algorithms for detecting Arabic terrorism messages. The innovative aspect of the model presented in the current study is that the five best classifiers were selected and applied to three language models S1, S2, and S3. The comparative analysis based on classification accuracy and time constraints proposed the best classifiers for sentiment analysis in the Arabic language.
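A minimal sketch, assuming scikit-learn, of how tf-idf features over unigrams (S1), bigrams (S2), and trigrams (S3) can be paired with a Linear SVM as in the compared classifiers; the toy English texts and labels stand in for the Arabic tweets and are purely illustrative.

```python
# Illustrative sketch: tf-idf over unigrams (S1), bigrams (S2) or trigrams (S3)
# feeding a Linear SVM; the texts and labels are toy stand-ins for Arabic tweets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["peaceful gathering downtown", "call to attack the checkpoint",
         "community charity event", "threat against civilians announced"]
labels = [0, 1, 0, 1]  # 1 = terrorist connotation, 0 = neutral (illustrative)

for name, ngram_range in [("S1", (1, 1)), ("S2", (2, 2)), ("S3", (3, 3))]:
    clf = make_pipeline(TfidfVectorizer(ngram_range=ngram_range), LinearSVC())
    clf.fit(texts, labels)
    print(name, clf.predict(["attack the checkpoint tonight"]))
```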
{"title":"An Effective Text Classifier using Machine Learning for Identifying Tweets’ Polarity Concerning Terrorist Connotation","authors":"Norah Al-Harbi, Amirrudin Bin Kamsin","doi":"10.5815/ijitcs.2021.05.02","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.02","url":null,"abstract":"Terrorist groups in the Arab world are using social networking sites like Twitter and Facebook to rapidly spread terror for the past few years. Detection and suspension of such accounts is a way to control the menace to some extent. This research is aimed at building an effective text classifier, using machine learning to identify the polarity of the tweets automatically. Five classifiers were chosen, which are AdB_SAMME, AdB_SAMME.R, Linear SVM, NB, and LR. These classifiers were applied on three features namely S1 (one word, unigram), S2 (word pair, bigram), and S3 (word triplet, trigram). All five classifiers evaluated samples S1, S2, and S3 in 346 preprocessed tweets. Feature extraction process utilized one of the most widely applied weighing schemes tf-idf (term frequency-inverse document frequency).The results were validated by four experts in Arabic language (three teachers and an educational supervisor in Saudi Arabia) through a questionnaire. The study found that the Linear SVM classifier yielded the best results of 99.7 % classification accuracy on S3 among all the other classifiers used. When both classification accuracy and time were considered, the NB classifier demonstrated the performance on S1 with 99.4% accuracy, which was comparable with Linear SVM. The Arab world has faced massive terrorist attacks in the past, and therefore, the research is highly significant and relevant due to its specific focus on detecting terrorism messages in Arabic. The state-of-the-art methods developed so far for tweets classification are mostly focused on analyzing English text, and hence, there was a dire need for devising machine learning algorithms for detecting Arabic terrorism messages. The innovative aspect of the model presented in the current study is that the five best classifiers were selected and applied on three language models S1, S2, and S3. The comparative analysis based on classification accuracy and time constraints proposed the best classifiers for sentiment analysis in the Arabic language.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127885439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cardiotocography Data Analysis to Predict Fetal Health Risks with Tree-Based Ensemble Learning
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.03
Pankaj Bhowmik, Pulak Chandra Bhowmik, U. Ali, Md. Sohrawordi
A sizeable number of women face difficulties during pregnancy, which can eventually lead the fetus towards serious health problems. However, early detection of these risks can save the invaluable lives of both infants and mothers. Cardiotocography (CTG) data provides sophisticated information by monitoring the heart rate signal of the fetus and is used to predict potential risks to fetal wellbeing and to make clinical conclusions. This paper proposes to analyze antepartum CTG data (available in the UCI Machine Learning Repository) and develop an efficient tree-based ensemble learning (EL) classifier model to predict fetal health status. In this study, EL follows the stacking approach, and a concise overview of this approach is discussed and developed accordingly. The study also applies distinct machine learning algorithmic techniques to the CTG dataset and determines their performance. The stacking EL technique in this paper involves four tree-based machine learning algorithms, namely the Random Forest, Decision Tree, Extra Trees, and Deep Forest classifiers, as base learners. The CTG dataset contains 21 features, but only the 10 most important features are selected with the chi-square method for this experiment, and the features are then normalized with min-max scaling. Following that, grid search is applied to tune the hyperparameters of the base algorithms. Subsequently, 10-fold cross-validation is performed to select the meta learner of the EL classifier model. A comparative assessment is made between the individual base learning algorithms and the EL classifier model, and the findings depict the EL classifier's superiority in fetal health risk prediction, achieving an accuracy of about 96.05%. This study concludes that the stacking EL approach can be a substantial paradigm in machine learning studies to improve model accuracy and reduce the error rate.
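A minimal sketch of the described pipeline, assuming scikit-learn and a synthetic stand-in for the 21-feature CTG dataset: chi-square selection of the top 10 features, min-max scaling, and a stacking ensemble of tree-based base learners. scikit-learn has no Deep Forest, so only three of the four base learners appear, and the meta learner and internal cv=10 setting are illustrative simplifications rather than the authors' tuned configuration.

```python
# Illustrative sketch: chi-square feature selection, min-max scaling, and a
# stacking ensemble of tree-based base learners (Deep Forest omitted; not in sklearn).
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 21-feature, 3-class CTG dataset.
X, y = make_classification(n_samples=500, n_features=21, n_informative=8,
                           n_classes=3, random_state=0)
X = X - X.min()  # chi2 requires non-negative feature values

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("et", ExtraTreesClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # meta learner (illustrative)
    cv=10,
)
pipe = make_pipeline(SelectKBest(chi2, k=10), MinMaxScaler(), stack)
print(cross_val_score(pipe, X, y, cv=5).mean())
```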
{"title":"Cardiotocography Data Analysis to Predict Fetal Health Risks with Tree-Based Ensemble Learning","authors":"Pankaj Bhowmik, Pulak Chandra Bhowmik, U. Ali, Md. Sohrawordi","doi":"10.5815/ijitcs.2021.05.03","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.03","url":null,"abstract":"A sizeable number of women face difficulties during pregnancy, which eventually can lead the fetus towards serious health problems. However, early detection of these risks can save both the invaluable life of infants and mothers. Cardiotocography (CTG) data provides sophisticated information by monitoring the heart rate signal of the fetus, is used to predict the potential risks of fetal wellbeing and for making clinical conclusions. This paper proposed to analyze the antepartum CTG data (available on UCI Machine Learning Repository) and develop an efficient tree-based ensemble learning (EL) classifier model to predict fetal health status. In this study, EL considers the Stacking approach, and a concise overview of this approach is discussed and developed accordingly. The study also endeavors to apply distinct machine learning algorithmic techniques on the CTG dataset and determine their performances. The Stacking EL technique, in this paper, involves four tree-based machine learning algorithms, namely, Random Forest classifier, Decision Tree classifier, Extra Trees classifier, and Deep Forest classifier as base learners. The CTG dataset contains 21 features, but only 10 most important features are selected from the dataset with the Chi-square method for this experiment, and then the features are normalized with Min-Max scaling. Following that, Grid Search is applied for tuning the hyperparameters of the base algorithms. Subsequently, 10-folds cross validation is performed to select the meta learner of the EL classifier model. However, a comparative model assessment is made between the individual base learning algorithms and the EL classifier model; and the finding depicts EL classifiers’ superiority in fetal health risks prediction with securing the accuracy of about 96.05%. Eventually, this study concludes that the Stacking EL approach can be a substantial paradigm in machine learning studies to improve models’ accuracy and reduce the error rate.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127393021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Enhanced List Based Packet Classifier for Performance Isolation in Internet Protocol Storage Area Networks
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.05
Josephine Kithinji, Makau S. Mutua, Gitonga D. Mwathi
Consolidation of storage into IP SANs (Internet Protocol storage area networks) has led to a combination of multiple workloads of varying demands and importance. To ensure that users meet their service level objectives (SLOs), a technique for isolating workloads is required. Existing solutions include cache partitioning and throttling of workloads. However, all these techniques require workloads to be classified in order to be isolated. Previous works on performance isolation overlooked the classification process as a source of overhead in implementing performance isolation, yet it is known that linear-search-based classifiers search linearly for rules that match packets in order to classify flows, which results in delays among other problems, especially when there are many rules. This paper looks at the various limitations of list-based classifiers. In addition, the paper proposes a technique that includes rule sorting, rule partitioning, and building a tree-rule firewall to reduce the cost of matching packets to rules during classification. Experiments were used to evaluate the proposed solution against existing solutions and showed that the linear-search-based classification process can result in performance degradation if not optimized. The results of the experiments showed that the proposed solution, when implemented, considerably reduces the time required for matching packets to their classes during classification, as evident in the throughput and latency observed.
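A toy Python illustration, not the authors' exact design, of the motivation behind rule partitioning: a linear scan touches every rule per packet, whereas pre-partitioning the rules (here simply by protocol) shrinks the set that must be searched. The rules, fields, and actions are invented for the example.

```python
# Toy illustration (not the authors' design): linear rule scan versus a lookup over
# rules pre-partitioned by protocol, which shrinks the set searched per packet.
from collections import defaultdict

rules = [
    {"id": 1, "proto": "tcp", "dst_port": 3260, "action": "high-priority"},  # iSCSI
    {"id": 2, "proto": "tcp", "dst_port": 80,   "action": "throttle"},
    {"id": 3, "proto": "udp", "dst_port": 514,  "action": "log-only"},
]

def classify_linear(pkt):
    for rule in rules:                      # O(number of rules) per packet
        if pkt["proto"] == rule["proto"] and pkt["dst_port"] == rule["dst_port"]:
            return rule["action"]
    return "default"

partitions = defaultdict(list)              # partition the rule list once, by protocol
for rule in rules:
    partitions[rule["proto"]].append(rule)

def classify_partitioned(pkt):
    for rule in partitions[pkt["proto"]]:   # only this protocol's rules are scanned
        if pkt["dst_port"] == rule["dst_port"]:
            return rule["action"]
    return "default"

pkt = {"proto": "tcp", "dst_port": 3260}
print(classify_linear(pkt), classify_partitioned(pkt))
```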
{"title":"An Enhanced List Based Packet Classifier for Performance Isolation in Internet Protocol Storage Area Networks","authors":"Josephine Kithinji, Makau S. Mutua, Gitonga D. M wathi","doi":"10.5815/ijitcs.2021.05.05","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.05","url":null,"abstract":"Consolidation of storage into IP SANs (Internet protocol storage area network) has led to a combination of multiple workloads of varying demands and importance. To ensure that users get their Service level objective (SLO) a technique for isolating workloads is required. Solutions that exist include cache partitioning and throttling of workloads. However, all these techniques require workloads to be classified in order to be isolated. Previous works on performance isolation overlooked the classification process as a source of overhead in implementing performance isolation. However, it’s known that linear search based classifiers search linearly for rules that match packets in order to classify flows which results in delays among other problems especially when rules are many. This paper looks at the various limitation of list based classifiers. In addition, the paper proposes a technique that includes rule sorting, rule partitioning and building a tree rule firewall to reduce the cost of matching packets to rules during classification. Experiments were used to evaluate the proposed solution against the existing solutions and proved that the linear search based classification process could result in performance degradation if not optimized. The results of the experiments showed that the proposed solution when implemented would considerably reduce the time required for matching packets to their classes during classification as evident in the throughput and latency experienced.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134037207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Risk-based Decision-making System for Information Processing Systems
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.01
S. Zybin, Yana Bielozorova
The article is dedicated to a methodology for building a decision support system under threats and risks. The method has been developed by modifying methods of targeted evaluation of options and is used for constructing a scheme of the decision support system. Decision support systems help make correct and effective decisions under time shortage and incomplete, uncertain, and unreliable information, while taking risks into account. When making decisions that take risks into account, the following tasks must be solved: determination of quantitative characteristics of risk; determination of quantitative indicators of the effectiveness of decisions in the presence of risks; and distribution of resources between means of countering threats and means aimed at improving information security. Known methods for solving the first task provide for the identification of risks (qualitative analysis) as well as the assessment of the probabilities and the extent of possible damage (quantitative analysis). However, the task of assessing the effectiveness of decisions taking risks into account is not solved and remains at the discretion of the expert. The suggested method of decision support under threats and risks has been developed by modifying methods of targeted evaluation of options. The relative efficiency of supporting measures is calculated as a function of time over a given time interval. The main idea of the proposed approach to analysing the impact of threats and risks on decision-making is that events that cause threats or risks are considered part of the decision support system. Therefore, such models of threats or risks are included in the hierarchy of goals, and their links with other parts of the system and its goals are established. The main functional modules that ensure the continuous and efficient operation of the decision support system are the following subsystems: a subsystem for analysing problems, risks, and threats; a subsystem for the formation of goals and criteria; a decision-making subsystem; and a subsystem for forming the decisive rule and analysing alternatives. Structural schemes of functioning are constructed for each subsystem. The resulting block diagram provides a full-fledged decision-making process.
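As an illustrative sketch only, the snippet below quantifies risk with the common textbook formula (expected loss = probability x damage) and ranks countermeasures by risk reduction per unit cost, which is one simple way to approach the resource distribution task listed above; the figures are invented and the formula is not necessarily the article's exact model.

```python
# Illustrative sketch: quantify risk as probability * expected damage and rank
# countermeasures by risk reduction per unit cost. Figures are invented; the
# formula is the common textbook one, not necessarily the article's exact model.
threats = {
    "data_leak":  {"probability": 0.10, "damage": 500_000},
    "ddos":       {"probability": 0.30, "damage": 120_000},
    "ransomware": {"probability": 0.05, "damage": 900_000},
}

countermeasures = [
    {"name": "dlp_gateway",    "threat": "data_leak",  "cost": 40_000, "prob_reduction": 0.06},
    {"name": "cdn_scrubber",   "threat": "ddos",       "cost": 25_000, "prob_reduction": 0.20},
    {"name": "offline_backup", "threat": "ransomware", "cost": 15_000, "prob_reduction": 0.03},
]

def expected_loss(prob, damage):
    return prob * damage

for cm in countermeasures:
    damage = threats[cm["threat"]]["damage"]
    saved = expected_loss(cm["prob_reduction"], damage)   # risk removed by the measure
    cm["benefit_per_cost"] = saved / cm["cost"]

# Spend the budget on measures with the best risk reduction per unit cost first.
for cm in sorted(countermeasures, key=lambda c: c["benefit_per_cost"], reverse=True):
    print(f"{cm['name']:<15} benefit/cost = {cm['benefit_per_cost']:.2f}")
```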
{"title":"Risk-based Decision-making System for Information Processing Systems","authors":"S. Zybin, Yana Bielozorova","doi":"10.5815/ijitcs.2021.05.01","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.01","url":null,"abstract":"The article is dedicated to using the methodology of building a decision support system under threats and risks. This method has been developed by modifying the methods of targeted evaluation of options and is used for constructing a scheme of the decision support system. Decision support systems help to make correct and effective solution to shortage of time, incompleteness, uncertainty and unreliability of information, and taking into account the risks. When we are making decisions taking into account the risks, it is necessary to solve the following tasks:determination of quantitative characteristics of risk; determination of quantitative indicators for the effectiveness of decisions in the presence of risks; distribution of resources between means of countering threats, and means that are aimed at improving information security. The known methods for solving the first problem provide for the identification of risks (qualitative analysis), as well as the assessment of the probabilities and the extent of possible damage (quantitative analysis). However, at the same time, the task of assessing the effectiveness of decisions taking into account risks is not solved and remains at the discretion of the expert. The suggesting method of decision support under threats and risks has been developed by modifying the methods of targeted evaluation of options. The relative efficiency in supporting measures to develop measures has been calculated as a function of time given on a time interval. The main idea of the proposed approach to the analysis of the impact of threats and risks in decision-making is that events that cause threats or risks are considered as a part of the decision support system. Therefore, such models of threats or risks are included in the hierarchy of goals, their links with other system's parts and goals are established. The main functional modules that ensure the continuous and efficient operation of the decision support system are the following subsystems: subsystem for analysing problems, risks and threats; subsystem for the formation of goals and criteria; decision-making subsystem; subsystem of formation of the decisive rule and analysis of alternatives. Structural schemes of functioning are constructed for each subsystem. The given block diagram provides a full-fledged decision-making process.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128696012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SBIoT: Scalable Broker Design for Real Time Streaming Big Data in the Internet of Things Environment
Pub Date: 2021-08-08 | DOI: 10.5815/ijitcs.2021.04.05
Halil Arslan, M. Yalçın, Yasin Şahan
Thanks to recent developments in technology, the number of IoT devices has increased dramatically. Therefore, industries have started to use IoT devices in their business processes, and many tasks can be automated thanks to them. For this purpose, a server is needed to process the sensor data, and transferring these data to the server without any loss is crucial for the accuracy of IoT applications. Therefore, in this work a scalable broker for real-time streaming data is proposed. Open-source technologies, namely NoSQL and in-memory databases, queueing, full-text index search, virtualization, and container management and orchestration, are used to increase the efficiency of the broker. The broker is first planned to be used at the biggest airport in Turkey to determine staff locations. Considering the experimental analysis, the proposed system is capable of transferring the data produced by the devices in that airport. In addition, the system can adapt to growth in the number of devices: if the number of devices increases over time, the number of nodes can be increased to capture more data.
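A standard-library Python sketch, not the SBIoT implementation, of the broker idea in the abstract: a queue decouples sensor ingestion from server-side processing so bursts of readings are buffered rather than lost; the in-memory database is stood in for by a plain list, and all names and sizes are illustrative.

```python
# Standard-library sketch (not the SBIoT implementation): a queue decouples sensor
# ingestion from processing so bursts of readings are buffered rather than lost.
import asyncio
import random

async def sensor(queue, sensor_id):
    for _ in range(3):                                   # each device emits a few readings
        await queue.put({"sensor": sensor_id, "value": random.random()})
        await asyncio.sleep(0.01)

async def worker(queue, store):
    while True:
        reading = await queue.get()
        store.append(reading)                            # stand-in for an in-memory database
        queue.task_done()

async def main():
    queue, store = asyncio.Queue(maxsize=1000), []
    workers = [asyncio.create_task(worker(queue, store)) for _ in range(2)]
    await asyncio.gather(*(sensor(queue, i) for i in range(5)))
    await queue.join()                                   # wait until all readings are stored
    for w in workers:
        w.cancel()
    print(f"stored {len(store)} readings")

asyncio.run(main())
```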
{"title":"SBIoT: Scalable Broker Design for Real Time Streaming Big Data in the Internet of Things Environment","authors":"Halil Arslan, M. Yalçın, Yasin Şahan","doi":"10.5815/ijitcs.2021.04.05","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.04.05","url":null,"abstract":"Thanks to the recent development in the technology number of IoT devices increased dramatically. Therefore,industries have been started to use IoT devices for their business processes. Many systems can be done automatically thanks to them. For this purpose, there is a server to process sensors data. Transferring these data to the server without any loss has crucial importance for the accuracy of IoT applications. Therefore, in this thesis a scalable broker for real time streaming data is proposed. Open source technologies, which are NoSql and in-memory databases, queueing, fulltext index search, virtualization and container management orchestration algorithms, are used to increase efficiency of the broker. Firstly, it is planned to be used for the biggest airport in Turkey to determine the staff location. Considering the experiment analysis, proposed system is good enough to transfer data produced by devices in that airport. In addition to this, the system can adapt to device increase, which means if number of devices increasing in time, number of nodes can be increased to capture more data.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134431994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}