Pub Date: 2022-02-18 | DOI: 10.1108/dta-11-2021-0315
3MO-AHP: an inconsistency reduction approach through mono-, multi- or many-objective quality measures
C. Floriano, Valdecy Pereira, Brunno e Souza Rodrigues
Purpose: Although the multi-criteria technique analytic hierarchy process (AHP) has been successfully applied in many areas, either to select or rank alternatives or to derive a priority vector (weights) for a set of criteria, the technique has a significant drawback: if the pairwise comparison matrix (PCM) contains inconsistent comparisons, that is, a consistency ratio (CR) above 0.1, the final solution cannot be validated. Many studies have addressed the inconsistency problem, but few have tried to satisfy different quality measures, namely minimum inconsistency (fMI), total number of adjusted pairwise comparisons (fNC), original rank preservation (fKT), minimum average weight adjustment (fWA) and, finally, minimum L1 matrix norm between the original and the adjusted PCM (fLM).
Design/methodology/approach: The approach is defined in four steps. First, the decision-maker chooses which quality measures to use, ranging from one to all of them. Second, the authors encode the PCM for a many-objective optimization algorithm (MOOA), in which each pairwise comparison can be adjusted individually. Third, consistent solutions carrying the desired quality measures are generated from the obtained Pareto-optimal front. Lastly, the decision-maker selects the solution most suitable for her/his problem. Remarkably, as the decision-maker can choose one (mono-objective), two (multi-objective), three or more (many-objective) quality measures, not all MOOAs can handle, or perform well on, mono- or multi-objective problems. The unified non-dominated sorting genetic algorithm III (U-NSGA-III) is the most appropriate MOOA for this scenario because it was specifically designed to handle mono-, multi- and many-objective problems.
Findings: The use of two quality measures does not guarantee that the adjusted PCM is similar to the original PCM; hence, the decision-maker should consider using more quality measures if the objective is to preserve the original PCM's characteristics.
Originality/value: For the first time, a many-objective approach reduces the CR to consistent levels while being able to consider one or more quality measures and allowing the decision-maker to adjust each pairwise comparison individually.
{"title":"3MO-AHP: an inconsistency reduction approach through mono-, multi- or many-objective quality measures","authors":"C. Floriano, Valdecy Pereira, Brunno e Souza Rodrigues","doi":"10.1108/dta-11-2021-0315","DOIUrl":"https://doi.org/10.1108/dta-11-2021-0315","url":null,"abstract":"PurposeAlthough the multi-criteria technique analytic hierarchy process (AHP) has successfully been applied in many areas, either selecting or ranking alternatives or to derive priority vector (weights) for a set of criteria, there is a significant drawback in using this technique if the pairwise comparison matrix (PCM) has inconsistent comparisons, in other words, a consistency ratio (CR) above the value of 0.1, the final solution cannot be validated. Many studies have been developed to treat the inconsistency problem, but few of them tried to satisfy different quality measures, which are minimum inconsistency (fMI), the total number of adjusted pairwise comparisons (fNC), original rank preservation (fKT), minimum average weights adjustment (fWA) and finally, minimum L1 matrix norm between the original PCM and the adjusted PCM (fLM).Design/methodology/approachThe approach is defined in four steps: first, the decision-maker should choose which quality measures she/he wishes to use, ranging from one to all quality measures. In the second step, the authors encode the PCM to be used in a many-objective optimization algorithm (MOOA), and each pairwise comparison can be adjusted individually. The authors generate consistent solutions from the obtained Pareto optimal front that carry the desired quality measures in the third step. Lastly, the decision-maker selects the most suitable solution for her/his problem. Remarkably, as the decision-maker can choose one (mono-objective), two (multi-objective), three or more (many-objectives) quality measures, not all MOOAs can handle or perform well in mono- or multi-objective problems. The unified non-sorting algorithm III (U-NSGA III) is the most appropriate MOOA for this type of scenario because it was specially designed to handle mono-, multi- and many-objective problems.FindingsThe use of two quality measures should not guarantee that the adjusted PCM is similar to the original PCM; hence, the decision-maker should consider using more quality measures if the objective is to preserve the original PCM characteristics.Originality/valueFor the first time, a many-objective approach reduces the CR to consistent levels with the ability to consider one or more quality measures and allows the decision-maker to adjust each pairwise comparison individually.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77746376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-15 | DOI: 10.1108/dta-10-2021-0280
Text mining the mission statements of the most ethical companies
T. Bayrak
Purpose: This paper explores and examines the mission statements of the most ethical companies across the globe in terms of their main purposes, values, goals and objectives, and what they say about their vision.
Design/methodology/approach: This study is based on data published by the Ethisphere Institute, the global leader in defining and advancing the standards of ethical business practices. Having compiled the mission statements into a text file, the authors conducted text mining with the commercially available tool SAS Enterprise Miner to examine whether the most ethical companies emphasize the same vision and mission themes, such as social responsibility and ethics.
Findings: A review of their mission statements indicated that some of the most ethical companies surveyed in this study, such as 3M and Voya, strive to be "socially responsible and ethical," support their "societies" and respect and protect "nature," the "planet" and the "environment." The world's most ethical companies that stress these weighted terms in their mission statements may do so to show their commitment to being socially responsible and ethical and to delivering sustainable business solutions to their customers.
Originality/value: This study provides a systematic and comprehensive exploration of the mission statements of the most ethical companies in an attempt to identify patterns of differences and similarities within these statements.
{"title":"Text mining the mission statements of the most ethical companies","authors":"T. Bayrak","doi":"10.1108/dta-10-2021-0280","DOIUrl":"https://doi.org/10.1108/dta-10-2021-0280","url":null,"abstract":"PurposeThis paper explores and examines the mission statements of the most ethical companies across the globe in terms of their main purposes, values, goals, and objective, and what they say about their vision and goals.Design/methodology/approachThis study is based on the data published by the Ethisphere Institute, the global leader in defining and advancing the standards of ethical business practices. Having compiled the mission statements into a text file, the authors conducted text mining using a commercially available text mining tool SAS Enterprise Miner to survey if the most ethical companies have valued the same vision and mission such as social responsibility and ethics.FindingsA review of their mission statements indicated that some of the most ethical companies surveyed in this study such as 3M and Voya strive to be “socially responsible and ethical,” support their “societies” and respect and protect the “nature,” “planet” and “environment.” The world's most ethical companies that stress these weighted terms in their mission statements may do so to show their commitment by being socially responsible and ethical, and delivering sustainable business solutions to their customers.Originality/valueThis study provides a systematic and comprehensive exploration of mission statements of the most ethical companies in an attempt to identify patterns of differences and similarities within these statements.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80594930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-15 | DOI: 10.1108/dta-09-2021-0261
Modular framework for similarity-based dataset discovery using external knowledge
M. Nečaský, P. Škoda, D. Bernhauer, Jakub Klímek, T. Skopal
Purpose: Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemas, shared attributes, vocabulary, structure and semantics. Existing dataset catalogs provide basic search functionality relying on keyword search over brief, incomplete or misleading textual metadata attached to the datasets, so the search results are often insufficient. However, there are many ways of improving dataset discovery, by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models, and so forth.
Design/methodology/approach: In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.
Findings: The study proposes several proof-of-concept pipelines, including an experimental evaluation, which showcase the usage of the framework.
Originality/value: To the best of the authors' knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. A prototype implementation of the framework is available on GitHub.
{"title":"Modular framework for similarity-based dataset discovery using external knowledge","authors":"M. Nečaský, P. Škoda, D. Bernhauer, Jakub Klímek, T. Skopal","doi":"10.1108/dta-09-2021-0261","DOIUrl":"https://doi.org/10.1108/dta-09-2021-0261","url":null,"abstract":"PurposeSemantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.Design/methodology/approachIn this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.FindingsThe study proposes several proof-of-concept pipelines including experimental evaluation, which showcase the usage of the framework.Originality/valueTo the best of authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78121200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-02-14 | DOI: 10.1108/dta-01-2021-0021
Social recruiting: an application of social network analysis for preselection of candidates
Stevan Milovanović, Z. Bogdanović, A. Labus, M. Despotović-Zrakić, Svetlana Mitrovic
Purpose: The paper studies social recruiting for finding suitable candidates on social networks. The main goal is to develop a methodological approach that enables preselection of candidates using social network analysis. The research focuses on automated data collection using the web scraping method. Based on the information collected from users' profiles, three clusters of skills and interests are created: technical, empirical and education-based. The identified clusters enable the recruiter to search effectively for suitable candidates.
Design/methodology/approach: This paper proposes a new methodological approach for the preselection of candidates based on social network analysis (SNA). The approach comprises the following phases: (1) social network selection according to the defined preselection goals; (2) automatic data collection from the selected social network using the web scraping method; (3) filtering, processing and statistical analysis of the data; (4) data analysis to identify information relevant to the preselection of candidates, using attribute clustering and SNA; and (5) preselection of candidates based on the information obtained.
Findings: It is possible to contribute to candidate preselection in the recruiting process by identifying key categories of candidates' skills and interests. The defined methodological approach allows recruiters to identify candidates who possess the skills and interests specified by the search. The method automates verification of the presence, or absence, of a particular category of skills or interests on the profiles of potential candidates. The primary intention is the screening and filtering of the skills and interests of potential candidates, which contributes to a more effective preselection process.
Research limitations/implications: A small sample of participants was used in the preliminary evaluation. A manual revision of the collected skills and interests was conducted. Recruiters should have basic knowledge of the SNA methodology in order to understand its application in the described method. The reliability of the collected data must be assessed, because users provide the data themselves when filling out their social network profiles.
Practical implications: The presented method could be applied to different social networks, such as GitHub or AngelList, for clustering profile skills; for a different social network, only the web scraping instructions would change. The method is composed of mutually independent steps, which means that each step can be implemented differently without changing the whole process. The results of a pilot evaluation indicate that HR experts are interested in the proposed method and would be willing to include it in their practice.
Social implications: The social implication should be the determination of relevant skills and interests during the preselection phase of candidates in the process of social recruiting.
{"title":"Social recruiting: an application of social network analysis for preselection of candidates","authors":"Stevan Milovanović, Z. Bogdanović, A. Labus, M. Despotović-Zrakić, Svetlana Mitrovic","doi":"10.1108/dta-01-2021-0021","DOIUrl":"https://doi.org/10.1108/dta-01-2021-0021","url":null,"abstract":"PurposeThe paper aims to studiy social recruiting for finding suitable candidates on social networks. The main goal is to develop a methodological approach that would enable preselection of candidates using social network analysis. The research focus is on the automated collection of data using the web scraping method. Based on the information collected from the users' profiles, three clusters of skills and interests are created: technical, empirical and education-based. The identified clusters enable the recruiter to effectively search for suitable candidates.Design/methodology/approachThis paper proposes a new methodological approach for the preselection of candidates based on social network analysis (SNA). The defined methodological approach includes the following phases: Social network selection according to the defined preselection goals; Automatic data collection from the selected social network using the web scraping method; Filtering, processing and statistical analysis of data. Data analysis to identify relevant information for the preselection of candidates using attributes clustering and SNA. Preselection of candidates is based on the information obtained.FindingsIt is possible to contribute to candidate preselection in the recruiting process by identifying key categories of skills and interests of candidates. Using a defined methodological approach allows recruiters to identify candidates who possess the skills and interests defined by the search. A defined method automates the verification of the existence, or absence, of a particular category of skills or interests on the profiles of the potential candidates. The primary intention is reflected in the screening and filtering of the skills and interests of potential candidates, which contributes to a more effective preselection process.Research limitations/implicationsA small sample of the participants is present in the preliminary evaluation. A manual revision of the collected skills and interests is conducted. The recruiters should have basic knowledge of the SNA methodology in order to understand its application in the described method. The reliability of the collected data is assessed, because users provide data themselves when filling out their social network profiles.Practical implicationsThe presented method could be applied on different social networks, such as GitHub or AngelList for clustering profile skills. For a different social network, only the web scraping instructions would change. This method is composed of mutually independent steps. This means that each step can be implemented differently, without changing the whole process. 
The results of a pilot project evaluation indicate that the HR experts are interested in the proposed method and that they would be willing to include it in their practice.Social implicationsThe social implication should be the determination of relevant skills and interests during the preselection phase of candidates in the process of social re","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83579165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
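A minimal sketch of the graph-based preselection idea described above, using a toy candidate-skill graph with invented profiles; it is not the authors' implementation, only an illustration of filtering candidates by required skill nodes.

import networkx as nx

# Toy candidate-skill graph standing in for profiles scraped from a social network;
# candidate names and skills are invented for illustration.
G = nx.Graph()
profiles = {
    "candidate_a": ["python", "sql", "msc_information_systems"],
    "candidate_b": ["project_management", "recruiting", "bsc_economics"],
    "candidate_c": ["python", "machine_learning", "phd_computer_science"],
}
for candidate, skills in profiles.items():
    for skill in skills:
        G.add_edge(candidate, skill)

# Preselection: keep candidates connected to every required skill node.
required = {"python", "machine_learning"}
shortlist = [c for c in profiles if required <= set(G.neighbors(c))]
print(shortlist)  # ['candidate_c']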
Pub Date: 2022-02-02 | DOI: 10.1108/dta-07-2021-0191
Exploring the effectiveness of word embedding based deep learning model for improving email classification
D. Asudani, N. K. Nagwani, Pradeep Singh
Purpose: Classifying emails as ham or spam based on their content is essential. Determining the semantic and syntactic meaning of words and putting them into a high-dimensional feature vector for processing is the most difficult challenge in email categorization. The purpose of this paper is to examine the effectiveness of pre-trained embedding models for the classification of emails using deep learning classifiers such as the long short-term memory (LSTM) model and the convolutional neural network (CNN) model.
Design/methodology/approach: In this paper, global vectors (GloVe) and Bidirectional Encoder Representations from Transformers (BERT) pre-trained word embeddings are used to identify relationships between words, which helps to classify emails into their relevant categories using machine learning and deep learning models. Two benchmark datasets, SpamAssassin and Enron, are used in the experimentation.
Findings: In the first set of experiments, among the machine learning classifiers, the support vector machine (SVM) model performs better than the other machine learning methodologies. The second set of experiments compares deep learning model performance without embedding, with GloVe embedding and with BERT embedding. The experiments show that GloVe embedding can be helpful for faster execution with better performance on large datasets.
Originality/value: The experiments reveal that the CNN model with GloVe embedding gives slightly better accuracy than the model with BERT embedding and than traditional machine learning algorithms for classifying an email as ham or spam. It is concluded that word embedding models improve the accuracy of email classifiers.
{"title":"Exploring the effectiveness of word embedding based deep learning model for improving email classification","authors":"D. Asudani, N. K. Nagwani, Pradeep Singh","doi":"10.1108/dta-07-2021-0191","DOIUrl":"https://doi.org/10.1108/dta-07-2021-0191","url":null,"abstract":"PurposeClassifying emails as ham or spam based on their content is essential. Determining the semantic and syntactic meaning of words and putting them in a high-dimensional feature vector form for processing is the most difficult challenge in email categorization. The purpose of this paper is to examine the effectiveness of the pre-trained embedding model for the classification of emails using deep learning classifiers such as the long short-term memory (LSTM) model and convolutional neural network (CNN) model.Design/methodology/approachIn this paper, global vectors (GloVe) and Bidirectional Encoder Representations Transformers (BERT) pre-trained word embedding are used to identify relationships between words, which helps to classify emails into their relevant categories using machine learning and deep learning models. Two benchmark datasets, SpamAssassin and Enron, are used in the experimentation.FindingsIn the first set of experiments, machine learning classifiers, the support vector machine (SVM) model, perform better than other machine learning methodologies. The second set of experiments compares the deep learning model performance without embedding, GloVe and BERT embedding. The experiments show that GloVe embedding can be helpful for faster execution with better performance on large-sized datasets.Originality/valueThe experiment reveals that the CNN model with GloVe embedding gives slightly better accuracy than the model with BERT embedding and traditional machine learning algorithms to classify an email as ham or spam. It is concluded that the word embedding models improve email classifiers accuracy.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79693644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-01-06 | DOI: 10.1108/dta-09-2021-0233
Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising
D. Sisodia, Dilip Singh Sisodia
Purpose: The problem of choosing the most useful features from hundreds of features of time-series user click data arises in online advertising for fraudulent publisher classification. Selecting feature subsets is a key issue in such classification tasks. In practice, the use of filter approaches is common; however, they neglect the correlations among features. Conversely, wrapper approaches cannot be applied because of their complexity. Moreover, existing feature selection methods in particular cannot handle such data, which is one of the major causes of instability in feature selection.
Design/methodology/approach: To overcome these issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing publishers' fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where an accumulated evaluation of relevant feature subsets is enumerated to search for an optimal feature subset using effective machine learning (ML) models.
Findings: Empirical results demonstrate enhanced classification performance with the proposed features in terms of average precision, recall, F1-score and AUC in publisher identification and classification.
Originality/value: FDAS is evaluated on the FDMA2012 user click data and nine other benchmark datasets to gauge its generalizing characteristics: first, considering the original features; second, with relevant feature subsets selected by feature selection (FS) methods; and third, with the optimal feature subset obtained by the proposed approach. An ANOVA significance test is conducted to demonstrate significant differences between independent features.
{"title":"Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising","authors":"D. Sisodia, Dilip Singh Sisodia","doi":"10.1108/dta-09-2021-0233","DOIUrl":"https://doi.org/10.1108/dta-09-2021-0233","url":null,"abstract":"PurposeThe problem of choosing the utmost useful features from hundreds of features from time-series user click data arises in online advertising toward fraudulent publisher's classification. Selecting feature subsets is a key issue in such classification tasks. Practically, the use of filter approaches is common; however, they neglect the correlations amid features. Conversely, wrapper approaches could not be applied due to their complexities. Moreover, in particular, existing feature selection methods could not handle such data, which is one of the major causes of instability of feature selection.Design/methodology/approachTo overcome such issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing the publisher's fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where we enumerated an accumulated evaluation of relevant feature subset to search for an optimal feature subset using effective machine learning (ML) models.FindingsEmpirical results prove enhanced classification performance with proposed features in average precision, recall, f1-score and AUC in publisher identification and classification.Originality/valueThe FDAS is evaluated on FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics, first, considering original features, second, with relevant feature subsets selected by feature selection (FS) methods, third, with optimal feature subset obtained by the proposed approach. ANOVA significance test is conducted to demonstrate significant differences between independent features.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73733228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-01-06 | DOI: 10.1108/dta-07-2021-0177
Techniques to detect terrorists/extremists on the dark web: a review
H. Alghamdi, A. Selamat
Purpose: With the proliferation of terrorist/extremist websites on the World Wide Web, it has become progressively more crucial to detect and analyze the content of these websites. Accordingly, the volume of research focused on identifying the techniques and activities of terrorist/extremist groups, as revealed by their sites on the so-called dark web, has also grown.
Design/methodology/approach: This study presents a review of the techniques used to detect and process the content of terrorist/extremist sites on the dark web. Forty of the most relevant data sources were examined, and various techniques were identified among them.
Findings: Based on this review, it was found that feature selection and feature extraction methods can be used for topic modeling, together with content analysis and text clustering.
Originality/value: At the end of the review, the current state of the art and certain open issues associated with Arabic dark web content analysis are presented.
{"title":"Techniques to detect terrorists/extremists on the dark web: a review","authors":"H. Alghamdi, A. Selamat","doi":"10.1108/dta-07-2021-0177","DOIUrl":"https://doi.org/10.1108/dta-07-2021-0177","url":null,"abstract":"PurposeWith the proliferation of terrorist/extremist websites on the World Wide Web, it has become progressively more crucial to detect and analyze the content on these websites. Accordingly, the volume of previous research focused on identifying the techniques and activities of terrorist/extremist groups, as revealed by their sites on the so-called dark web, has also grown.Design/methodology/approachThis study presents a review of the techniques used to detect and process the content of terrorist/extremist sites on the dark web. Forty of the most relevant data sources were examined, and various techniques were identified among them.FindingsBased on this review, it was found that methods of feature selection and feature extraction can be used as topic modeling with content analysis and text clustering.Originality/valueAt the end of the review, present the current state-of-the- art and certain open issues associated with Arabic dark Web content analysis.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86851666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-01-04 | DOI: 10.1108/dta-09-2021-0230
Artificial intelligence technologies for more flexible recommendation in uniforms
Chih-Hao Wen, Chih-Chan Cheng, Y. Shih
Purpose: This research aims to collect human body variables via 2D images captured by digital cameras. Based on those variables, forecasts and recommendations of the Digital Camouflage Uniform (DCU) for Taiwan's military personnel are made.
Design/methodology/approach: A total of 375 subjects were recruited (male: 253; female: 122). In this study, OpenPose converts the photographed 2D images into four body variables, which are compared with those obtained simultaneously by tape measure and 3D scanning. The recommendation model for the DCU is then built with a decision tree. Meanwhile, the Euclidean distance to each DCU size in the manufacturing specification is calculated to produce the best three recommendations.
Findings: The fitting score of the single size recommended by the decision tree is only 0.62 and 0.63. However, for the best three recommended options, the DCU fitting score can be as high as 0.8 or more. The results of OpenPose and 3D scanning have the highest correlation coefficient, even though the methods of measuring body size differ. This result confirms that OpenPose has significant measurement validity; that is, inexpensive equipment can be used to obtain reasonable results.
Originality/value: In general, the method proposed in this study is suitable for applications in e-commerce and the apparel industry in a long-distance, non-contact and non-pre-labeled manner while the world is facing COVID-19. In particular, it can reduce the measurement difficulties ordinary users face when purchasing clothing online.
{"title":"Artificial intelligence technologies for more flexible recommendation in uniforms","authors":"Chih-Hao Wen, Chih-Chan Cheng, Y. Shih","doi":"10.1108/dta-09-2021-0230","DOIUrl":"https://doi.org/10.1108/dta-09-2021-0230","url":null,"abstract":"PurposeThis research aims to collect human body variables via 2D images captured by digital cameras. Based on those human variables, the forecast and recommendation of the Digital Camouflage Uniforms (DCU) for Taiwan's military personnel are made.Design/methodology/approachA total of 375 subjects are recruited (male: 253; female: 122). In this study, OpenPose converts the photographed 2D images into four body variables, which are compared with those of a tape measure and 3D scanning simultaneously. Then, the recommendation model of the DCU is built by the decision tree. Meanwhile, the Euclidean distance of each size of the DCU in the manufacturing specification is calculated as the best three recommendations.FindingsThe recommended size established by the decision tree is only 0.62 and 0.63. However, for the recommendation result of the best three options, the DCU Fitting Score can be as high as 0.8 or more. The results of OpenPose and 3D scanning have the highest correlation coefficient even though the method of measuring body size is different. This result confirms that OpenPose has significant measurement validity. That is, inexpensive equipment can be used to obtain reasonable results.Originality/valueIn general, the method proposed in this study is suitable for applications in e-commerce and the apparel industry in a long-distance, non-contact and non-pre-labeled manner when the world is facing Covid-19. In particular, it can reduce the measurement troubles of ordinary users when purchasing clothing online.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75047026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-12-21 | DOI: 10.1108/dta-06-2021-0153
Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing
Laouni Djafri
Purpose: This work can be used as a building block in other settings such as GPU, MapReduce, Spark or any other. DDPML can also be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach: In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can later be used for prediction. This knowledge thus becomes a great asset in companies' hands, and it is precisely the objective of data mining. But with the production of large amounts of data and knowledge at a faster pace, we now speak of Big Data mining. For this reason, the proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem raised in this work is how to make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of the classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML). The work is divided into two parts. In the first, the authors propose a distributed architecture controlled by a MapReduce algorithm, which in turn depends on a random sampling technique. The distributed architecture is specifically designed to handle big data processing and operates coherently and efficiently with the sampling strategy proposed in this work. This architecture also helps to verify the classification results obtained using the representative learning base (RLB). In the second part, the representative learning base is extracted by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first level (PLBL1) and the second level (PLBL2). The experimental results show the efficiency of the proposed solution without significant loss in the classification results. Thus, in practical terms, the DDPML system is generally dedicated to big data mining processing and works effectively in distributed systems with a simple structure, such as client-server networks.
Findings: The authors obtained very satisfactory classification results.
Originality/value: The DDPML system is specially designed to smoothly handle big data mining classification.
{"title":"Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing","authors":"Laouni Djafri","doi":"10.1108/dta-06-2021-0153","DOIUrl":"https://doi.org/10.1108/dta-06-2021-0153","url":null,"abstract":"PurposeThis work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, clouds computing or other technologies.Design/methodology/approachIn the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, the authors are now talking about Big Data mining. For this reason, the authors’ proposed works mainly aim at solving the problem of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. So, the problem that the authors are raising in this work is how the authors can make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by Map-Reduce algorithm which in turn depends on random sampling technique. So, the distributed architecture that the authors designed is specially directed to handle big data processing that operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors have extracted the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning base for the first level (PLBL1) and the partial learning base for the second level (PLBL2). The experimental results show the efficiency of our solution that the authors provided without significant loss of the classification results. 
Thus, in practical terms, the system DDPML is generally dedicated to big data mining processing, and works effectively in distributed systems with a simple structure, such as client-server networks.FindingsThe authors got very satisfactory classification results.Originality/valueDDPML system is specially designed to smoothly handle big data mining classification.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78852754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
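The two-level stratified sampling used to build the partial learning bases can be sketched as follows; the sampling fractions and the scikit-learn utility are illustrative assumptions, not the authors' MapReduce implementation.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in for a large learning base; class labels define the strata.
X, y = make_classification(n_samples=100_000, n_features=20, weights=[0.7, 0.3],
                           random_state=0)

# Level 1: draw a class-stratified partial learning base (PLBL1) from the full data.
X_plbl1, _, y_plbl1, _ = train_test_split(X, y, train_size=0.10, stratify=y,
                                          random_state=0)

# Level 2: draw a smaller stratified sample (PLBL2) from the level-1 sample,
# yielding a compact representative sample per worker node.
X_plbl2, _, y_plbl2, _ = train_test_split(X_plbl1, y_plbl1, train_size=0.20,
                                          stratify=y_plbl1, random_state=0)

print(X_plbl1.shape, X_plbl2.shape)  # (10000, 20) (2000, 20)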