Linked Data: A Framework for Publishing Five-Star Open Government Data
Pub Date: 2021-12-08 | DOI: 10.5815/ijitcs.2021.06.01
Bassel Al-khatib, Ali A. Ali
With the increased adoption of open government initiatives around the world, huge amounts of raw governmental datasets have been released. However, the data is published in heterogeneous formats and vocabularies and is often of poor quality, suffering from inconsistency, messiness, and possible incorrectness because it is collected according to the practicalities of the source organization, which makes it hard to reuse and integrate for serving citizens and third-party apps. This research introduces the LDOG (Linked Data for Open Government) experimental framework, which aims to provide a modular architecture that can be integrated into the open government hierarchy, allowing huge amounts of data to be gathered at the source in a fine-grained manner and published directly as linked data based on Tim Berners-Lee's five-star deployment scheme, with a validation layer using SHACL, which results in high-quality data. The general idea is to model the hierarchy of government and classify government organizations into two types: modeling organizations at higher levels and data source organizations at lower levels. Linked data experts in the modeling organizations are responsible for designing data templates, ontologies, SHACL shapes, and linkage specifications, whereas non-experts in the data source organizations contribute their knowledge of the data to perform mapping, reconciliation, and data correction. This approach lowers the number of experts needed, which is a known barrier to linked data adoption. To test the functionality of our framework in action, we developed the LDOG platform, which utilizes the different modules of the framework to power a set of user interfaces for publishing government datasets. We used this platform to convert some of the UAE's government datasets into linked data. Finally, on top of the converted data, we built a proof-of-concept app to show the power of five-star linked data for integrating datasets from disparate organizations and to promote adoption by governments. Our work defines a clear path to integrating linked data into open governments and solid steps for publishing and enhancing it in a fine-grained and practical manner with fewer linked data experts. It also extends SHACL to define data shapes and convert CSV to RDF.
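As a minimal sketch of the kind of SHACL validation step the abstract describes (not the LDOG implementation), the snippet below checks RDF converted from a CSV row against a shape before publication. It assumes the rdflib and pyshacl Python packages; the example vocabulary, class names, and property names are purely illustrative.

```python
# Illustrative only: validate RDF converted from a CSV row against a SHACL shape.
# Assumes rdflib and pyshacl; the ex: vocabulary below is hypothetical.
from rdflib import Graph
from pyshacl import validate

data_ttl = """
@prefix ex: <http://example.org/gov#> .
ex:dataset1-row1 a ex:SchoolRecord ;
    ex:schoolName "Al Noor School" ;
    ex:studentCount 420 .
"""

shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/gov#> .
ex:SchoolRecordShape a sh:NodeShape ;
    sh:targetClass ex:SchoolRecord ;
    sh:property [ sh:path ex:schoolName ; sh:datatype xsd:string ; sh:minCount 1 ] ;
    sh:property [ sh:path ex:studentCount ; sh:datatype xsd:integer ; sh:minInclusive 0 ] .
"""

data_graph = Graph().parse(data=data_ttl, format="turtle")
shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

conforms, _, report_text = validate(data_graph, shacl_graph=shapes_graph)
print(conforms)      # True when the converted data satisfies the shape
print(report_text)   # human-readable validation report for data stewards
```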
{"title":"Linked Data: A Framework for Publishing FiveStar Open Government Data","authors":"Bassel Al-khatib, Ali A. Ali","doi":"10.5815/ijitcs.2021.06.01","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.06.01","url":null,"abstract":"With the increased adoption of open government initiatives around the world, a huge amount of governmental raw datasets was released. However, the data was published in heterogeneous formats and vocabularies and in many cases in bad quality due to inconsistency, messy, and maybe incorrectness as it has been collected by practicalities within the source organization, which makes it inefficient for reusing and integrating it for serving citizens and third-party apps. This research introduces the LDOG (Linked Data for Open Government) experimental framework, which aims to provide a modular architecture that can be integrated into the open government hierarchy, allowing huge amounts of data to be gathered in a fine-grained manner from source and directly publishing them as linked data based on Tim Berners lee’s five-star deployment scheme with a validation layer using SHACL, which results in high quality data. The general idea is to model the hierarchy of government and classify government organizations into two types, the modeling organizations at higher levels and data source organizations at lower levels. Modeling organization’s experts in linked data have the responsibility to design data templates, ontologies, SHACL shapes, and linkage specifications. whereas non-experts can be incorporated in data source organizations to utilize their knowledge in data to do mapping, reconciliation, and correcting data. This approach lowers the needed experts that represent a problem of linked data adoption. To test the functionality of our framework in action, we developed the LDOG platform which utilizes the different modules of the framework to power a set of user interfaces that can be used to publish government datasets. we used this platform to convert some of UAE's government datasets into linked data. Finally, on top of the converted data, we built a proof-of-concept app to show the power of five-star linked data for integrating datasets from disparate organizations and to promote the governments' adoption. Our work has defined a clear path to integrate the linked data into open governments and solid steps to publishing and enhancing it in a fine-grained and practical manner with a lower number of experts in linked data, It extends SHACL to define data shapes and convert CSV to RDF.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"215 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132348370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Myers-Briggs Personality Prediction and Sentiment Analysis of Twitter using Machine Learning Classifiers and BERT
Pub Date: 2021-12-08 | DOI: 10.5815/ijitcs.2021.06.04
Prajwal Kaushal, Nithin Bharadwaj B P, Pranav M S, K. S., Dr. Anjan K Koundinya
Twitter is one of the most sophisticated social networking platforms, its user base is growing exponentially, and terabytes of data are generated every day. Technology giants invest billions of dollars in drawing insights from these tweets, yet this huge amount of data remains underutilized. The main aim of this paper is to solve two tasks. The first is to build a sentiment analysis model using BERT (Bidirectional Encoder Representations from Transformers) that analyses tweets and predicts the sentiments of their users. The second is to build a personality prediction model using various machine learning classifiers under the umbrella of the Myers-Briggs Type Indicator (MBTI), one of the most widely used psychological instruments in the world. Using this, we intend to predict the traits and qualities of people based on their posts and interactions on Twitter. The model succeeds in predicting the personality traits and qualities of Twitter users. We intend to use the analyzed results in various applications such as market research, recruitment, psychological tests, and consulting in the future.
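As a minimal sketch of BERT-based tweet sentiment scoring of the kind the abstract describes (not the authors' exact model or data), the snippet below assumes the Hugging Face transformers package; the checkpoint name and example tweets are illustrative choices.

```python
# Illustrative sketch: tweet sentiment with a pretrained BERT-family model.
# Assumes the Hugging Face transformers package; the checkpoint is an example choice.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

tweets = [
    "Loving the new update, great work!",
    "This service keeps crashing, really frustrating.",
]

for tweet, result in zip(tweets, sentiment(tweets)):
    # Each result is a dict like {"label": "POSITIVE", "score": 0.99}
    print(f"{result['label']:>8}  {result['score']:.2f}  {tweet}")
```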
{"title":"Myers-briggs Personality Prediction and Sentiment Analysis of Twitter using Machine Learning Classifiers and BERT","authors":"Prajwal Kaushal, Nithin Bharadwaj B P, Pranav M S, K. S., Dr. Anjan K Koundinya","doi":"10.5815/ijitcs.2021.06.04","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.06.04","url":null,"abstract":"Twitter being one of the most sophisticated social networking platforms whose users base is growing exponentially, terabytes of data is being generated every day. Technology Giants invest billions of dollars in drawing insights from these tweets. The huge amount of data is still going underutilized. The main of this paper is to solve two tasks. Firstly, to build a sentiment analysis model using BERT (Bidirectional Encoder Representations from Transformers) which analyses the tweets and predicts the sentiments of the users. Secondly to build a personality prediction model using various machine learning classifiers under the umbrella of Myers-Briggs Personality Type Indicator. MBTI is one of the most widely used psychological instruments in the world. Using this we intend to predict the traits and qualities of people based on their posts and interactions in Twitter. The model succeeds to predict the personality traits and qualities on twitter users. We intend to use the analyzed results in various applications like market research, recruitment, psychological tests, consulting, etc, in future.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125357729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance of Machine Learning Algorithms with Different K Values in K-fold Cross-Validation
Pub Date: 2021-12-08 | DOI: 10.5815/ijitcs.2021.06.05
Isaac Kofi Nti, Owusu Nyarko-Boateng, J. Aning
The numerical value of k in the k-fold cross-validation technique used to train machine learning predictive models is an essential element that impacts the model's performance. A good choice of k results in better accuracy estimates, while a poorly chosen value for k might misrepresent the model's performance. In the literature, the most commonly used values of k are five (5) or ten (10), as these two values are believed to give test error rate estimates that suffer neither from extremely high bias nor from very high variance; however, there is no formal rule. To the best of our knowledge, few experimental studies have attempted to investigate the effect of diverse k values in training different machine learning models. This paper empirically analyses the prevalence and effect of distinct k values (3, 5, 7, 10, 15 and 20) on the validation performance of four well-known machine learning algorithms: Gradient Boosting Machine (GBM), Logistic Regression (LR), Decision Tree (DT) and K-Nearest Neighbours (KNN). It was observed that the best value of k and the model validation performance differ from one machine learning algorithm to another for the same classification task. However, our empirical results suggest that k = 7 offers a slight increase in validation accuracy and area under the curve, with lower computational complexity than k = 10, across most of the algorithms. We discuss the study outcomes in detail and outline guidelines for beginners in the machine learning field on selecting the best k value and machine learning algorithm for a given task.
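A minimal sketch, assuming scikit-learn and a synthetic stand-in dataset rather than the paper's real data, of how the studied k values can be compared for one of the four classifiers; the scoring choice and dataset are illustrative.

```python
# Illustrative sketch: compare validation accuracy for several k values in k-fold CV.
# Uses scikit-learn; the synthetic dataset stands in for the paper's real data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

for k in (3, 5, 7, 10, 15, 20):
    scores = cross_val_score(model, X, y, cv=k, scoring="accuracy")
    print(f"k={k:>2}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```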
{"title":"Performance of Machine Learning Algorithms with Different K Values in K-fold CrossValidation","authors":"Isaac Kofi Nti, Owusu N yarko-Boateng, J. Aning","doi":"10.5815/ijitcs.2021.06.05","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.06.05","url":null,"abstract":"The numerical value of k in a k-fold cross-validation training technique of machine learning predictive models is an essential element that impacts the model’s performance. A right choice of k results in better accuracy, while a poorly chosen value for k might affect the model’s performance. In literature, the most commonly used values of k are five (5) or ten (10), as these two values are believed to give test error rate estimates that suffer neither from extremely high bias nor very high variance. However, there is no formal rule. To the best of our knowledge, few experimental studies attempted to investigate the effect of diverse k values in training different machine learning models. This paper empirically analyses the prevalence and effect of distinct k values (3, 5, 7, 10, 15 and 20) on the validation performance of four well-known machine learning algorithms (Gradient Boosting Machine (GBM), Logistic Regression (LR), Decision Tree (DT) and K-Nearest Neighbours (KNN)). It was observed that the value of k and model validation performance differ from one machine-learning algorithm to another for the same classification task. However, our empirical suggest that k = 7 offers a slight increase in validations accuracy and area under the curve measure with lesser computational complexity than k = 10 across most MLA. We discuss in detail the study outcomes and outline some guidelines for beginners in the machine learning field in selecting the best k value and machine learning algorithm for a given task.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127511422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Psychosocial Features for Hate Speech Detection in Code-switched Texts
Pub Date: 2021-12-08 | DOI: 10.5815/ijitcs.2021.06.03
Edward Ombui, Lawrence Muchemi, P. Wagacha
This study examines the problem of hate speech identification in code-switched text from social media using a natural language processing approach. It explores different features for training nine models and empirically evaluates their predictiveness in identifying hate speech in a ~50k human-annotated dataset. The study espouses a novel, hierarchical approach that employs Latent Dirichlet Allocation to generate topic models that help build a high-level psychosocial feature set, which we abbreviate as PDC. PDC groups words with similar meanings into word families, which is significant in capturing code-switching during the preprocessing stage for supervised learning models. The high-level PDC features are based on a hate speech annotation framework [1] that is largely informed by the duplex theory of hate [2]. Results obtained from frequency-based models using the PDC features on a dataset of tweets generated during the 2012 and 2017 presidential elections in Kenya indicate an F-score of 83% (precision: 81%, recall: 85%) in identifying hate speech. The study is significant in that it publicly shares a unique code-switched hate speech dataset that is valuable for comparative studies. Secondly, it provides a methodology for building a novel PDC feature set to identify nuanced forms of hate speech, camouflaged in code-switched data, which conventional methods could not adequately identify.
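A minimal sketch of the topic-modelling step the abstract mentions, assuming scikit-learn and a tiny invented code-switched corpus; it only shows how LDA surfaces topic terms that could seed word families, and does not reproduce the authors' PDC construction.

```python
# Illustrative sketch: derive topics from a small code-switched corpus with LDA,
# as a starting point for grouping related words; scikit-learn, toy data only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "siasa hii ni chafu these politicians lie kila siku",
    "great rally today tuko pamoja kwa amani na umoja",
    "hawa watu must go we are tired of empty promises",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {idx}: {', '.join(top_terms)}")
```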
{"title":"Psychosocial Features for Hate Speech Detection in Code-switched Texts","authors":"Edward Ombui, Lawrence Muchemi, P. Wagacha","doi":"10.5815/ijitcs.2021.06.03","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.06.03","url":null,"abstract":"This study examines the problem of hate speech identification in codeswitched text from social media using a natural language processing approach. It explores different features in training nine models and empirically evaluates their predictiveness in identifying hate speech in a ~50k human-annotated dataset. The study espouses a novel approach to handle this challenge by introducing a hierarchical approach that employs Latent Dirichlet Analysis to generate topic models that help build a high-level Psychosocial feature set that we acronym PDC. PDC groups similar meaning words in word families, which is significant in capturing codeswitching during the preprocessing stage for supervised learning models. The high-level PDC features generated are based on a hate speech annotation framework [1] that is largely informed by the duplex theory of hate [2]. Results obtained from frequency-based models using the PDC feature on the dataset comprising of tweets generated during the 2012 and 2017 presidential elections in Kenya indicate an f-score of 83% (precision: 81%, recall: 85%) in identifying hate speech. The study is significant in that it publicly shares a unique codeswitched dataset for hate speech that is valuable for comparative studies. Secondly, it provides a methodology for building a novel PDC feature set to identify nuanced forms of hate speech, camouflaged in codeswitched data, which conventional methods could not adequately identify.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131331192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Mining for Cyberbullying and Harassment Detection in Arabic Texts
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.04
Eman Bashir, M. Bouguessa
Cyberbullying is broadly viewed as a severe social danger that affects many individuals around the globe, particularly young people and teenagers. The Arab world has embraced technology and continues to use it in different ways to communicate inside social media platforms. However, Arabic text is difficult to process due to its complexity and the scarcity of its resources. This paper investigates several questions related to detecting cyberbullying and harassment in Arabic content posted on Twitter. To answer these questions, we collected an Arabic corpus covering the relevant topics using specific keywords, which we explain in detail. We devised experiments in which we investigated several learning approaches. Our results suggest that deep learning models such as LSTM achieve better performance than traditional cyberbullying classifiers, with an accuracy of 72%.
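A minimal sketch of an LSTM text classifier of the kind compared in the abstract, assuming TensorFlow/Keras; the vocabulary size, sequence length, and layer sizes are arbitrary illustrative choices, and training data preparation (tokenizing Arabic tweets into padded integer sequences) is only indicated in a comment.

```python
# Illustrative sketch: a small LSTM classifier for bullying vs. neutral text.
# Uses TensorFlow/Keras; sizes below are assumptions, not the paper's settings.
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed tokenizer vocabulary
MAX_LEN = 50         # assumed maximum tweet length in tokens

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),  # 1 = bullying/harassment, 0 = neutral
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Training would then use padded integer sequences of tokenized tweets, e.g.:
# model.fit(x_train, y_train, validation_split=0.1, epochs=5, batch_size=32)
```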
{"title":"Data Mining for Cyberbullying and Harassment Detection in Arabic Texts","authors":"Eman Bashir, M. Bouguessa","doi":"10.5815/ijitcs.2021.05.04","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.04","url":null,"abstract":"Broadly cyberbullying is viewed as a severe social danger that influences many individuals around the globe, particularly young people and teenagers. The Arabic world has embraced technology and continues using it in different ways to communicate inside social media platforms. However, the Arabic text has drawbacks for its complexity, challenges, and scarcity of its resources. This paper investigates several questions related to the content of how to protect an Arabic text from cyberbullying/harassment through the information posted on Twitter. To answer this question, we collected the Arab corpus covering the topics with specific words, which will explain in detail. We devised experiments in which we investigated several learning approaches. Our results suggest that deep learning models like LSTM achieve better performance compared to other traditional yberbullying classifiers with an accuracy of 72%.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"478 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128002877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Effective Text Classifier using Machine Learning for Identifying Tweets’ Polarity Concerning Terrorist Connotation
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.02
Norah Al-Harbi, Amirrudin Bin Kamsin
Terrorist groups in the Arab world have been using social networking sites like Twitter and Facebook to rapidly spread terror for the past few years. Detection and suspension of such accounts is a way to control the menace to some extent. This research is aimed at building an effective text classifier, using machine learning to identify the polarity of tweets automatically. Five classifiers were chosen: AdB_SAMME, AdB_SAMME.R, Linear SVM, NB, and LR. These classifiers were applied to three features, namely S1 (one word, unigram), S2 (word pair, bigram), and S3 (word triplet, trigram). All five classifiers evaluated samples S1, S2, and S3 on 346 preprocessed tweets. The feature extraction process utilized one of the most widely applied weighting schemes, tf-idf (term frequency-inverse document frequency). The results were validated by four experts in the Arabic language (three teachers and an educational supervisor in Saudi Arabia) through a questionnaire. The study found that the Linear SVM classifier yielded the best result, 99.7% classification accuracy on S3, among all the classifiers used. When both classification accuracy and time were considered, the NB classifier demonstrated the best performance on S1 with 99.4% accuracy, which was comparable with Linear SVM. The Arab world has faced massive terrorist attacks in the past, and therefore the research is highly significant and relevant due to its specific focus on detecting terrorism messages in Arabic. The state-of-the-art methods developed so far for tweet classification are mostly focused on analyzing English text, and hence there was a dire need for devising machine learning algorithms for detecting Arabic terrorism messages. The innovative aspect of the model presented in the current study is that the five best classifiers were selected and applied to three language models S1, S2, and S3. The comparative analysis based on classification accuracy and time constraints proposed the best classifiers for sentiment analysis in the Arabic language.
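A minimal sketch, assuming scikit-learn, of how tf-idf features over unigrams (S1), bigrams (S2), and trigrams (S3) can be paired with a Linear SVM as in the compared classifiers; the toy English texts and labels stand in for the Arabic tweets and are purely illustrative.

```python
# Illustrative sketch: tf-idf over unigrams (S1), bigrams (S2) or trigrams (S3)
# feeding a Linear SVM; the texts and labels are toy stand-ins for Arabic tweets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["peaceful gathering downtown", "call to attack the checkpoint",
         "community charity event", "threat against civilians announced"]
labels = [0, 1, 0, 1]  # 1 = terrorist connotation, 0 = neutral (illustrative)

for name, ngram_range in [("S1", (1, 1)), ("S2", (2, 2)), ("S3", (3, 3))]:
    clf = make_pipeline(TfidfVectorizer(ngram_range=ngram_range), LinearSVC())
    clf.fit(texts, labels)
    print(name, clf.predict(["attack the checkpoint tonight"]))
```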
{"title":"An Effective Text Classifier using Machine Learning for Identifying Tweets’ Polarity Concerning Terrorist Connotation","authors":"Norah Al-Harbi, Amirrudin Bin Kamsin","doi":"10.5815/ijitcs.2021.05.02","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.02","url":null,"abstract":"Terrorist groups in the Arab world are using social networking sites like Twitter and Facebook to rapidly spread terror for the past few years. Detection and suspension of such accounts is a way to control the menace to some extent. This research is aimed at building an effective text classifier, using machine learning to identify the polarity of the tweets automatically. Five classifiers were chosen, which are AdB_SAMME, AdB_SAMME.R, Linear SVM, NB, and LR. These classifiers were applied on three features namely S1 (one word, unigram), S2 (word pair, bigram), and S3 (word triplet, trigram). All five classifiers evaluated samples S1, S2, and S3 in 346 preprocessed tweets. Feature extraction process utilized one of the most widely applied weighing schemes tf-idf (term frequency-inverse document frequency).The results were validated by four experts in Arabic language (three teachers and an educational supervisor in Saudi Arabia) through a questionnaire. The study found that the Linear SVM classifier yielded the best results of 99.7 % classification accuracy on S3 among all the other classifiers used. When both classification accuracy and time were considered, the NB classifier demonstrated the performance on S1 with 99.4% accuracy, which was comparable with Linear SVM. The Arab world has faced massive terrorist attacks in the past, and therefore, the research is highly significant and relevant due to its specific focus on detecting terrorism messages in Arabic. The state-of-the-art methods developed so far for tweets classification are mostly focused on analyzing English text, and hence, there was a dire need for devising machine learning algorithms for detecting Arabic terrorism messages. The innovative aspect of the model presented in the current study is that the five best classifiers were selected and applied on three language models S1, S2, and S3. The comparative analysis based on classification accuracy and time constraints proposed the best classifiers for sentiment analysis in the Arabic language.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127885439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cardiotocography Data Analysis to Predict Fetal Health Risks with Tree-Based Ensemble Learning
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.03
Pankaj Bhowmik, Pulak Chandra Bhowmik, U. Ali, Md. Sohrawordi
A sizeable number of women face difficulties during pregnancy, which can eventually lead the fetus towards serious health problems. However, early detection of these risks can save the invaluable lives of both infants and mothers. Cardiotocography (CTG) data provides sophisticated information by monitoring the heart rate signal of the fetus and is used to predict potential risks to fetal wellbeing and to make clinical conclusions. This paper proposes to analyze antepartum CTG data (available in the UCI Machine Learning Repository) and develop an efficient tree-based ensemble learning (EL) classifier model to predict fetal health status. In this study, EL follows the stacking approach, and a concise overview of this approach is discussed and developed accordingly. The study also applies distinct machine learning algorithmic techniques to the CTG dataset and determines their performance. The stacking EL technique in this paper involves four tree-based machine learning algorithms, namely the Random Forest, Decision Tree, Extra Trees, and Deep Forest classifiers, as base learners. The CTG dataset contains 21 features, but only the 10 most important features are selected with the chi-square method for this experiment, and the features are then normalized with min-max scaling. Following that, grid search is applied to tune the hyperparameters of the base algorithms. Subsequently, 10-fold cross-validation is performed to select the meta learner of the EL classifier model. A comparative assessment is made between the individual base learning algorithms and the EL classifier model, and the findings depict the EL classifier's superiority in fetal health risk prediction, achieving an accuracy of about 96.05%. This study concludes that the stacking EL approach can be a substantial paradigm in machine learning studies to improve model accuracy and reduce the error rate.
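A minimal sketch of the described pipeline, assuming scikit-learn and a synthetic stand-in for the 21-feature CTG dataset: chi-square selection of the top 10 features, min-max scaling, and a stacking ensemble of tree-based base learners. scikit-learn has no Deep Forest, so only three of the four base learners appear, and the meta learner and internal cv=10 setting are illustrative simplifications rather than the authors' tuned configuration.

```python
# Illustrative sketch: chi-square feature selection, min-max scaling, and a
# stacking ensemble of tree-based base learners (Deep Forest omitted; not in sklearn).
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 21-feature, 3-class CTG dataset.
X, y = make_classification(n_samples=500, n_features=21, n_informative=8,
                           n_classes=3, random_state=0)
X = X - X.min()  # chi2 requires non-negative feature values

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("et", ExtraTreesClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # meta learner (illustrative)
    cv=10,
)
pipe = make_pipeline(SelectKBest(chi2, k=10), MinMaxScaler(), stack)
print(cross_val_score(pipe, X, y, cv=5).mean())
```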
{"title":"Cardiotocography Data Analysis to Predict Fetal Health Risks with Tree-Based Ensemble Learning","authors":"Pankaj Bhowmik, Pulak Chandra Bhowmik, U. Ali, Md. Sohrawordi","doi":"10.5815/ijitcs.2021.05.03","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.03","url":null,"abstract":"A sizeable number of women face difficulties during pregnancy, which eventually can lead the fetus towards serious health problems. However, early detection of these risks can save both the invaluable life of infants and mothers. Cardiotocography (CTG) data provides sophisticated information by monitoring the heart rate signal of the fetus, is used to predict the potential risks of fetal wellbeing and for making clinical conclusions. This paper proposed to analyze the antepartum CTG data (available on UCI Machine Learning Repository) and develop an efficient tree-based ensemble learning (EL) classifier model to predict fetal health status. In this study, EL considers the Stacking approach, and a concise overview of this approach is discussed and developed accordingly. The study also endeavors to apply distinct machine learning algorithmic techniques on the CTG dataset and determine their performances. The Stacking EL technique, in this paper, involves four tree-based machine learning algorithms, namely, Random Forest classifier, Decision Tree classifier, Extra Trees classifier, and Deep Forest classifier as base learners. The CTG dataset contains 21 features, but only 10 most important features are selected from the dataset with the Chi-square method for this experiment, and then the features are normalized with Min-Max scaling. Following that, Grid Search is applied for tuning the hyperparameters of the base algorithms. Subsequently, 10-folds cross validation is performed to select the meta learner of the EL classifier model. However, a comparative model assessment is made between the individual base learning algorithms and the EL classifier model; and the finding depicts EL classifiers’ superiority in fetal health risks prediction with securing the accuracy of about 96.05%. Eventually, this study concludes that the Stacking EL approach can be a substantial paradigm in machine learning studies to improve models’ accuracy and reduce the error rate.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127393021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Enhanced List Based Packet Classifier for Performance Isolation in Internet Protocol Storage Area Networks
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.05
Josephine Kithinji, Makau S. Mutua, Gitonga D. Mwathi
Consolidation of storage into IP SANs (Internet Protocol storage area networks) has led to a combination of multiple workloads of varying demands and importance. To ensure that users meet their service level objectives (SLOs), a technique for isolating workloads is required. Existing solutions include cache partitioning and throttling of workloads. However, all these techniques require workloads to be classified in order to be isolated. Previous works on performance isolation overlooked the classification process as a source of overhead in implementing performance isolation, yet it is known that linear-search-based classifiers search linearly for rules that match packets in order to classify flows, which results in delays among other problems, especially when there are many rules. This paper looks at the various limitations of list-based classifiers. In addition, the paper proposes a technique that includes rule sorting, rule partitioning, and building a tree-rule firewall to reduce the cost of matching packets to rules during classification. Experiments were used to evaluate the proposed solution against existing solutions and showed that the linear-search-based classification process can result in performance degradation if not optimized. The results of the experiments showed that the proposed solution, when implemented, considerably reduces the time required for matching packets to their classes during classification, as evident in the throughput and latency observed.
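A toy Python illustration, not the authors' exact design, of the motivation behind rule partitioning: a linear scan touches every rule per packet, whereas pre-partitioning the rules (here simply by protocol) shrinks the set that must be searched. The rules, fields, and actions are invented for the example.

```python
# Toy illustration (not the authors' design): linear rule scan versus a lookup over
# rules pre-partitioned by protocol, which shrinks the set searched per packet.
from collections import defaultdict

rules = [
    {"id": 1, "proto": "tcp", "dst_port": 3260, "action": "high-priority"},  # iSCSI
    {"id": 2, "proto": "tcp", "dst_port": 80,   "action": "throttle"},
    {"id": 3, "proto": "udp", "dst_port": 514,  "action": "log-only"},
]

def classify_linear(pkt):
    for rule in rules:                      # O(number of rules) per packet
        if pkt["proto"] == rule["proto"] and pkt["dst_port"] == rule["dst_port"]:
            return rule["action"]
    return "default"

partitions = defaultdict(list)              # partition the rule list once, by protocol
for rule in rules:
    partitions[rule["proto"]].append(rule)

def classify_partitioned(pkt):
    for rule in partitions[pkt["proto"]]:   # only this protocol's rules are scanned
        if pkt["dst_port"] == rule["dst_port"]:
            return rule["action"]
    return "default"

pkt = {"proto": "tcp", "dst_port": 3260}
print(classify_linear(pkt), classify_partitioned(pkt))
```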
{"title":"An Enhanced List Based Packet Classifier for Performance Isolation in Internet Protocol Storage Area Networks","authors":"Josephine Kithinji, Makau S. Mutua, Gitonga D. M wathi","doi":"10.5815/ijitcs.2021.05.05","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.05","url":null,"abstract":"Consolidation of storage into IP SANs (Internet protocol storage area network) has led to a combination of multiple workloads of varying demands and importance. To ensure that users get their Service level objective (SLO) a technique for isolating workloads is required. Solutions that exist include cache partitioning and throttling of workloads. However, all these techniques require workloads to be classified in order to be isolated. Previous works on performance isolation overlooked the classification process as a source of overhead in implementing performance isolation. However, it’s known that linear search based classifiers search linearly for rules that match packets in order to classify flows which results in delays among other problems especially when rules are many. This paper looks at the various limitation of list based classifiers. In addition, the paper proposes a technique that includes rule sorting, rule partitioning and building a tree rule firewall to reduce the cost of matching packets to rules during classification. Experiments were used to evaluate the proposed solution against the existing solutions and proved that the linear search based classification process could result in performance degradation if not optimized. The results of the experiments showed that the proposed solution when implemented would considerably reduce the time required for matching packets to their classes during classification as evident in the throughput and latency experienced.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134037207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Risk-based Decision-making System for Information Processing Systems
Pub Date: 2021-10-08 | DOI: 10.5815/ijitcs.2021.05.01
S. Zybin, Yana Bielozorova
The article is dedicated to a methodology for building a decision support system under threats and risks. The method has been developed by modifying methods of targeted evaluation of options and is used for constructing a scheme of the decision support system. Decision support systems help make correct and effective decisions under time shortage and incomplete, uncertain, and unreliable information, while taking risks into account. When making decisions that take risks into account, the following tasks must be solved: determination of quantitative characteristics of risk; determination of quantitative indicators of the effectiveness of decisions in the presence of risks; and distribution of resources between means of countering threats and means aimed at improving information security. Known methods for solving the first task provide for the identification of risks (qualitative analysis) as well as the assessment of the probabilities and the extent of possible damage (quantitative analysis). However, the task of assessing the effectiveness of decisions taking risks into account is not solved and remains at the discretion of the expert. The suggested method of decision support under threats and risks has been developed by modifying methods of targeted evaluation of options. The relative efficiency of supporting measures is calculated as a function of time over a given time interval. The main idea of the proposed approach to analysing the impact of threats and risks on decision-making is that events that cause threats or risks are considered part of the decision support system. Therefore, such models of threats or risks are included in the hierarchy of goals, and their links with other parts of the system and its goals are established. The main functional modules that ensure the continuous and efficient operation of the decision support system are the following subsystems: a subsystem for analysing problems, risks, and threats; a subsystem for the formation of goals and criteria; a decision-making subsystem; and a subsystem for forming the decisive rule and analysing alternatives. Structural schemes of functioning are constructed for each subsystem. The resulting block diagram provides a full-fledged decision-making process.
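As an illustrative sketch only, the snippet below quantifies risk with the common textbook formula (expected loss = probability x damage) and ranks countermeasures by risk reduction per unit cost, which is one simple way to approach the resource distribution task listed above; the figures are invented and the formula is not necessarily the article's exact model.

```python
# Illustrative sketch: quantify risk as probability * expected damage and rank
# countermeasures by risk reduction per unit cost. Figures are invented; the
# formula is the common textbook one, not necessarily the article's exact model.
threats = {
    "data_leak":  {"probability": 0.10, "damage": 500_000},
    "ddos":       {"probability": 0.30, "damage": 120_000},
    "ransomware": {"probability": 0.05, "damage": 900_000},
}

countermeasures = [
    {"name": "dlp_gateway",    "threat": "data_leak",  "cost": 40_000, "prob_reduction": 0.06},
    {"name": "cdn_scrubber",   "threat": "ddos",       "cost": 25_000, "prob_reduction": 0.20},
    {"name": "offline_backup", "threat": "ransomware", "cost": 15_000, "prob_reduction": 0.03},
]

def expected_loss(prob, damage):
    return prob * damage

for cm in countermeasures:
    damage = threats[cm["threat"]]["damage"]
    saved = expected_loss(cm["prob_reduction"], damage)   # risk removed by the measure
    cm["benefit_per_cost"] = saved / cm["cost"]

# Spend the budget on measures with the best risk reduction per unit cost first.
for cm in sorted(countermeasures, key=lambda c: c["benefit_per_cost"], reverse=True):
    print(f"{cm['name']:<15} benefit/cost = {cm['benefit_per_cost']:.2f}")
```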
{"title":"Risk-based Decision-making System for Information Processing Systems","authors":"S. Zybin, Yana Bielozorova","doi":"10.5815/ijitcs.2021.05.01","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.05.01","url":null,"abstract":"The article is dedicated to using the methodology of building a decision support system under threats and risks. This method has been developed by modifying the methods of targeted evaluation of options and is used for constructing a scheme of the decision support system. Decision support systems help to make correct and effective solution to shortage of time, incompleteness, uncertainty and unreliability of information, and taking into account the risks. When we are making decisions taking into account the risks, it is necessary to solve the following tasks:determination of quantitative characteristics of risk; determination of quantitative indicators for the effectiveness of decisions in the presence of risks; distribution of resources between means of countering threats, and means that are aimed at improving information security. The known methods for solving the first problem provide for the identification of risks (qualitative analysis), as well as the assessment of the probabilities and the extent of possible damage (quantitative analysis). However, at the same time, the task of assessing the effectiveness of decisions taking into account risks is not solved and remains at the discretion of the expert. The suggesting method of decision support under threats and risks has been developed by modifying the methods of targeted evaluation of options. The relative efficiency in supporting measures to develop measures has been calculated as a function of time given on a time interval. The main idea of the proposed approach to the analysis of the impact of threats and risks in decision-making is that events that cause threats or risks are considered as a part of the decision support system. Therefore, such models of threats or risks are included in the hierarchy of goals, their links with other system's parts and goals are established. The main functional modules that ensure the continuous and efficient operation of the decision support system are the following subsystems: subsystem for analysing problems, risks and threats; subsystem for the formation of goals and criteria; decision-making subsystem; subsystem of formation of the decisive rule and analysis of alternatives. Structural schemes of functioning are constructed for each subsystem. The given block diagram provides a full-fledged decision-making process.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128696012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SBIoT: Scalable Broker Design for Real Time Streaming Big Data in the Internet of Things Environment
Pub Date: 2021-08-08 | DOI: 10.5815/ijitcs.2021.04.05
Halil Arslan, M. Yalçın, Yasin Şahan
Thanks to recent developments in technology, the number of IoT devices has increased dramatically. Therefore, industries have started to use IoT devices in their business processes, and many tasks can be automated thanks to them. For this purpose, a server is needed to process the sensor data, and transferring these data to the server without any loss is crucial for the accuracy of IoT applications. Therefore, in this work a scalable broker for real-time streaming data is proposed. Open-source technologies, namely NoSQL and in-memory databases, queueing, full-text index search, virtualization, and container management and orchestration, are used to increase the efficiency of the broker. The broker is first planned to be used at the biggest airport in Turkey to determine staff locations. Considering the experimental analysis, the proposed system is capable of transferring the data produced by the devices in that airport. In addition, the system can adapt to growth in the number of devices: if the number of devices increases over time, the number of nodes can be increased to capture more data.
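A standard-library Python sketch, not the SBIoT implementation, of the broker idea in the abstract: a queue decouples sensor ingestion from server-side processing so bursts of readings are buffered rather than lost; the in-memory database is stood in for by a plain list, and all names and sizes are illustrative.

```python
# Standard-library sketch (not the SBIoT implementation): a queue decouples sensor
# ingestion from processing so bursts of readings are buffered rather than lost.
import asyncio
import random

async def sensor(queue, sensor_id):
    for _ in range(3):                                   # each device emits a few readings
        await queue.put({"sensor": sensor_id, "value": random.random()})
        await asyncio.sleep(0.01)

async def worker(queue, store):
    while True:
        reading = await queue.get()
        store.append(reading)                            # stand-in for an in-memory database
        queue.task_done()

async def main():
    queue, store = asyncio.Queue(maxsize=1000), []
    workers = [asyncio.create_task(worker(queue, store)) for _ in range(2)]
    await asyncio.gather(*(sensor(queue, i) for i in range(5)))
    await queue.join()                                   # wait until all readings are stored
    for w in workers:
        w.cancel()
    print(f"stored {len(store)} readings")

asyncio.run(main())
```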
{"title":"SBIoT: Scalable Broker Design for Real Time Streaming Big Data in the Internet of Things Environment","authors":"Halil Arslan, M. Yalçın, Yasin Şahan","doi":"10.5815/ijitcs.2021.04.05","DOIUrl":"https://doi.org/10.5815/ijitcs.2021.04.05","url":null,"abstract":"Thanks to the recent development in the technology number of IoT devices increased dramatically. Therefore,industries have been started to use IoT devices for their business processes. Many systems can be done automatically thanks to them. For this purpose, there is a server to process sensors data. Transferring these data to the server without any loss has crucial importance for the accuracy of IoT applications. Therefore, in this thesis a scalable broker for real time streaming data is proposed. Open source technologies, which are NoSql and in-memory databases, queueing, fulltext index search, virtualization and container management orchestration algorithms, are used to increase efficiency of the broker. Firstly, it is planned to be used for the biggest airport in Turkey to determine the staff location. Considering the experiment analysis, proposed system is good enough to transfer data produced by devices in that airport. In addition to this, the system can adapt to device increase, which means if number of devices increasing in time, number of nodes can be increased to capture more data.","PeriodicalId":130361,"journal":{"name":"International Journal of Information Technology and Computer Science","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134431994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}