
Latest publications in Data Technologies and Applications

Modular framework for similarity-based dataset discovery using external knowledge
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-02-15 | DOI: 10.1108/dta-09-2021-0261
M. Nečaský, P. Škoda, D. Bernhauer, Jakub Klímek, T. Skopal
Purpose: Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemas, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.
Design/methodology/approach: In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.
Findings: The study proposes several proof-of-concept pipelines, including experimental evaluation, which showcase the usage of the framework.
Originality/value: To the best of the authors' knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.
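The component/pipeline idea lends itself to a compact illustration. The Python sketch below composes a pluggable representation component with a similarity component into a discovery pipeline; the class names and the TF-IDF/cosine choices are assumptions for illustration, not components from the authors' catalog.

# Minimal sketch of a component-based dataset-discovery pipeline.
# Component names and the TF-IDF/cosine choices are illustrative
# assumptions, not the components shipped with the authors' framework.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class TfidfRepresentation:
    """Maps dataset metadata strings to feature vectors."""
    def fit_transform(self, docs):
        self._vectorizer = TfidfVectorizer()
        return self._vectorizer.fit_transform(docs)

    def transform(self, docs):
        return self._vectorizer.transform(docs)

class CosineSimilarity:
    """Scores a query vector against all indexed vectors."""
    def score(self, query_vec, corpus_vecs):
        return cosine_similarity(query_vec, corpus_vecs).ravel()

class DiscoveryPipeline:
    def __init__(self, representation, similarity):
        self.representation = representation
        self.similarity = similarity

    def index(self, metadata_docs):
        self.docs = list(metadata_docs)
        self.vecs = self.representation.fit_transform(self.docs)

    def query(self, text, k=3):
        scores = self.similarity.score(
            self.representation.transform([text]), self.vecs)
        return sorted(zip(scores, self.docs), reverse=True)[:k]

if __name__ == "__main__":
    pipeline = DiscoveryPipeline(TfidfRepresentation(), CosineSimilarity())
    pipeline.index([
        "air quality measurements prague 2020",
        "municipal budget open data",
        "traffic sensor readings city of prague",
    ])
    for score, doc in pipeline.query("prague pollution sensors"):
        print(f"{score:.3f}  {doc}")

Because the representation and similarity components only meet through the pipeline interface, either one can be swapped (for example, for an embedding-based representation) without touching the rest, which is the modularity the abstract describes.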
{"title":"Modular framework for similarity-based dataset discovery using external knowledge","authors":"M. Nečaský, P. Škoda, D. Bernhauer, Jakub Klímek, T. Skopal","doi":"10.1108/dta-09-2021-0261","DOIUrl":"https://doi.org/10.1108/dta-09-2021-0261","url":null,"abstract":"PurposeSemantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.Design/methodology/approachIn this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.FindingsThe study proposes several proof-of-concept pipelines including experimental evaluation, which showcase the usage of the framework.Originality/valueTo the best of authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"38 1","pages":"506-535"},"PeriodicalIF":1.6,"publicationDate":"2022-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78121200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Social recruiting: an application of social network analysis for preselection of candidates
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-02-14 | DOI: 10.1108/dta-01-2021-0021
Stevan Milovanović, Z. Bogdanović, A. Labus, M. Despotović-Zrakić, Svetlana Mitrovic
Purpose: The paper aims to study social recruiting for finding suitable candidates on social networks. The main goal is to develop a methodological approach that would enable preselection of candidates using social network analysis. The research focus is on the automated collection of data using the web scraping method. Based on the information collected from the users' profiles, three clusters of skills and interests are created: technical, empirical and education-based. The identified clusters enable the recruiter to search effectively for suitable candidates.
Design/methodology/approach: This paper proposes a new methodological approach for the preselection of candidates based on social network analysis (SNA). The defined methodological approach includes the following phases: social network selection according to the defined preselection goals; automatic data collection from the selected social network using the web scraping method; filtering, processing and statistical analysis of data; data analysis to identify information relevant to the preselection of candidates using attribute clustering and SNA; and preselection of candidates based on the information obtained.
Findings: It is possible to contribute to candidate preselection in the recruiting process by identifying key categories of skills and interests of candidates. Using the defined methodological approach allows recruiters to identify candidates who possess the skills and interests defined by the search. The defined method automates the verification of the existence, or absence, of a particular category of skills or interests on the profiles of potential candidates. The primary intention is reflected in the screening and filtering of the skills and interests of potential candidates, which contributes to a more effective preselection process.
Research limitations/implications: Only a small sample of participants is present in the preliminary evaluation. A manual revision of the collected skills and interests is conducted. Recruiters should have basic knowledge of the SNA methodology in order to understand its application in the described method. The reliability of the collected data must be assessed, because users provide the data themselves when filling out their social network profiles.
Practical implications: The presented method could be applied to different social networks, such as GitHub or AngelList, for clustering profile skills. For a different social network, only the web scraping instructions would change. The method is composed of mutually independent steps, which means that each step can be implemented differently without changing the whole process. The results of a pilot project evaluation indicate that HR experts are interested in the proposed method and would be willing to include it in their practice.
Social implications: The social implication is the determination of relevant skills and interests during the preselection phase of candidates in the process of social recruiting.
Originality/value: In contrast to the prior research discussed in the paper, this study defines a method for automatic data collection using a web scraper tool. The described method allows more data to be collected in a shorter time. It also reduces the cost of creating the initial dataset by eliminating the costs of interviewers, questioners and personnel collecting data from social networks. The fully automated process of collecting data from a specific social network stands out from currently available solutions. Given the data collection method implemented here, the proposed approach offers the opportunity to extend the scope of collected data to implicit data, which cannot be achieved using the tools presented in other papers.
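The preselection step can be illustrated with a short sketch: scraped profile skills are matched against the three clusters named above. The cluster keyword lists and profiles below are invented for illustration; the paper builds its clusters from the collected profile data itself.

# Hedged sketch of the preselection step: keep candidates whose scraped
# skills overlap every required skill cluster. Cluster contents and the
# profile data are invented placeholders, not the paper's data.
SKILL_CLUSTERS = {
    "technical": {"python", "sql", "machine learning"},
    "empirical": {"project management", "agile"},
    "education": {"msc", "phd", "certification"},
}

profiles = [
    {"name": "candidate_a", "skills": {"python", "sql", "agile"}},
    {"name": "candidate_b", "skills": {"phd", "teaching"}},
]

def preselect(profiles, required):
    """Return names of profiles matching all required clusters."""
    selected = []
    for profile in profiles:
        if all(profile["skills"] & SKILL_CLUSTERS[c] for c in required):
            selected.append(profile["name"])
    return selected

print(preselect(profiles, required=["technical", "empirical"]))
# -> ['candidate_a']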
{"title":"Social recruiting: an application of social network analysis for preselection of candidates","authors":"Stevan Milovanović, Z. Bogdanović, A. Labus, M. Despotović-Zrakić, Svetlana Mitrovic","doi":"10.1108/dta-01-2021-0021","DOIUrl":"https://doi.org/10.1108/dta-01-2021-0021","url":null,"abstract":"PurposeThe paper aims to studiy social recruiting for finding suitable candidates on social networks. The main goal is to develop a methodological approach that would enable preselection of candidates using social network analysis. The research focus is on the automated collection of data using the web scraping method. Based on the information collected from the users' profiles, three clusters of skills and interests are created: technical, empirical and education-based. The identified clusters enable the recruiter to effectively search for suitable candidates.Design/methodology/approachThis paper proposes a new methodological approach for the preselection of candidates based on social network analysis (SNA). The defined methodological approach includes the following phases: Social network selection according to the defined preselection goals; Automatic data collection from the selected social network using the web scraping method; Filtering, processing and statistical analysis of data. Data analysis to identify relevant information for the preselection of candidates using attributes clustering and SNA. Preselection of candidates is based on the information obtained.FindingsIt is possible to contribute to candidate preselection in the recruiting process by identifying key categories of skills and interests of candidates. Using a defined methodological approach allows recruiters to identify candidates who possess the skills and interests defined by the search. A defined method automates the verification of the existence, or absence, of a particular category of skills or interests on the profiles of the potential candidates. The primary intention is reflected in the screening and filtering of the skills and interests of potential candidates, which contributes to a more effective preselection process.Research limitations/implicationsA small sample of the participants is present in the preliminary evaluation. A manual revision of the collected skills and interests is conducted. The recruiters should have basic knowledge of the SNA methodology in order to understand its application in the described method. The reliability of the collected data is assessed, because users provide data themselves when filling out their social network profiles.Practical implicationsThe presented method could be applied on different social networks, such as GitHub or AngelList for clustering profile skills. For a different social network, only the web scraping instructions would change. This method is composed of mutually independent steps. This means that each step can be implemented differently, without changing the whole process. 
The results of a pilot project evaluation indicate that the HR experts are interested in the proposed method and that they would be willing to include it in their practice.Social implicationsThe social implication should be the determination of relevant skills and interests during the preselection phase of candidates in the process of social re","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"58 1","pages":"536-557"},"PeriodicalIF":1.6,"publicationDate":"2022-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83579165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Exploring the effectiveness of word embedding based deep learning model for improving email classification
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-02-02 | DOI: 10.1108/dta-07-2021-0191
D. Asudani, N. K. Nagwani, Pradeep Singh
Purpose: Classifying emails as ham or spam based on their content is essential. Determining the semantic and syntactic meaning of words and putting them in a high-dimensional feature vector form for processing is the most difficult challenge in email categorization. The purpose of this paper is to examine the effectiveness of pre-trained embedding models for the classification of emails using deep learning classifiers such as the long short-term memory (LSTM) model and the convolutional neural network (CNN) model.
Design/methodology/approach: In this paper, global vectors (GloVe) and Bidirectional Encoder Representations from Transformers (BERT) pre-trained word embeddings are used to identify relationships between words, which helps to classify emails into their relevant categories using machine learning and deep learning models. Two benchmark datasets, SpamAssassin and Enron, are used in the experimentation.
Findings: In the first set of experiments, among the machine learning classifiers, the support vector machine (SVM) model performs better than the other machine learning methodologies. The second set of experiments compares deep learning model performance without embedding, with GloVe embedding and with BERT embedding. The experiments show that GloVe embedding can be helpful for faster execution with better performance on large datasets.
Originality/value: The experiments reveal that the CNN model with GloVe embedding gives slightly better accuracy than the model with BERT embedding and than traditional machine learning algorithms when classifying an email as ham or spam. It is concluded that word embedding models improve email classifiers' accuracy.
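A hedged sketch of the best-performing configuration reported here, a CNN over frozen GloVe vectors, is shown below. The vocabulary size, sequence length and GloVe file format are assumptions; the paper's exact architecture and hyperparameters are not reproduced.

# Sketch of a CNN text classifier over pretrained GloVe embeddings.
# VOCAB_SIZE, SEQ_LEN and the glove.6B-style file format are assumptions.
import numpy as np
import tensorflow as tf

VOCAB_SIZE, SEQ_LEN, EMB_DIM = 20_000, 200, 100

def load_glove_matrix(path, word_index):
    """Build an embedding matrix from a glove.6B-style text file."""
    matrix = np.zeros((VOCAB_SIZE, EMB_DIM), dtype="float32")
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *vec = line.split()
            idx = word_index.get(word)
            if idx is not None and idx < VOCAB_SIZE:
                matrix[idx] = np.asarray(vec, dtype="float32")
    return matrix

def build_cnn(embedding_matrix):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            VOCAB_SIZE, EMB_DIM,
            embeddings_initializer=tf.keras.initializers.Constant(
                embedding_matrix),
            trainable=False),                      # frozen GloVe vectors
        tf.keras.layers.Conv1D(128, 5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # ham vs spam
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    # random weights stand in for a real GloVe file in this sketch
    dummy_matrix = np.random.rand(VOCAB_SIZE, EMB_DIM).astype("float32")
    model = build_cnn(dummy_matrix)
    preds = model(np.random.randint(0, VOCAB_SIZE, size=(2, SEQ_LEN)))
    print(preds.shape)  # (2, 1): one spam probability per email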
{"title":"Exploring the effectiveness of word embedding based deep learning model for improving email classification","authors":"D. Asudani, N. K. Nagwani, Pradeep Singh","doi":"10.1108/dta-07-2021-0191","DOIUrl":"https://doi.org/10.1108/dta-07-2021-0191","url":null,"abstract":"PurposeClassifying emails as ham or spam based on their content is essential. Determining the semantic and syntactic meaning of words and putting them in a high-dimensional feature vector form for processing is the most difficult challenge in email categorization. The purpose of this paper is to examine the effectiveness of the pre-trained embedding model for the classification of emails using deep learning classifiers such as the long short-term memory (LSTM) model and convolutional neural network (CNN) model.Design/methodology/approachIn this paper, global vectors (GloVe) and Bidirectional Encoder Representations Transformers (BERT) pre-trained word embedding are used to identify relationships between words, which helps to classify emails into their relevant categories using machine learning and deep learning models. Two benchmark datasets, SpamAssassin and Enron, are used in the experimentation.FindingsIn the first set of experiments, machine learning classifiers, the support vector machine (SVM) model, perform better than other machine learning methodologies. The second set of experiments compares the deep learning model performance without embedding, GloVe and BERT embedding. The experiments show that GloVe embedding can be helpful for faster execution with better performance on large-sized datasets.Originality/valueThe experiment reveals that the CNN model with GloVe embedding gives slightly better accuracy than the model with BERT embedding and traditional machine learning algorithms to classify an email as ham or spam. It is concluded that the word embedding models improve email classifiers accuracy.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"95 1","pages":"483-505"},"PeriodicalIF":1.6,"publicationDate":"2022-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79693644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-01-06 | DOI: 10.1108/dta-09-2021-0233
D. Sisodia, Dilip Singh Sisodia
Purpose: The problem of choosing the most useful features from hundreds of features in time-series user click data arises in online advertising for fraudulent publisher classification. Selecting feature subsets is a key issue in such classification tasks. In practice, the use of filter approaches is common; however, they neglect the correlations among features. Conversely, wrapper approaches cannot be applied due to their complexity. Moreover, existing feature selection methods cannot handle such data, which is one of the major causes of instability in feature selection.
Design/methodology/approach: To overcome these issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing publishers' fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where an accumulated evaluation of relevant feature subsets is enumerated to search for an optimal feature subset using effective machine learning (ML) models.
Findings: Empirical results show enhanced classification performance with the proposed features in terms of average precision, recall, F1-score and AUC in publisher identification and classification.
Originality/value: FDAS is evaluated on the FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics: first, considering the original features; second, with relevant feature subsets selected by feature selection (FS) methods; third, with the optimal feature subset obtained by the proposed approach. An ANOVA significance test is conducted to demonstrate significant differences between independent features.
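The feature distillation phase can be sketched as majority voting over standard selectors. The selectors and the voting threshold below are illustrative stand-ins; the paper combines its own set of filter and wrapper methods.

# Sketch of majority-vote feature distillation: several standard selectors
# vote, and features chosen by a majority survive. Selector choices and
# the threshold of 2 are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, chi2, f_classif,
                                       mutual_info_classif)

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = X - X.min()  # chi2 requires non-negative feature values

selectors = [SelectKBest(chi2, k=8),
             SelectKBest(f_classif, k=8),
             SelectKBest(mutual_info_classif, k=8)]

votes = np.zeros(X.shape[1], dtype=int)
for sel in selectors:
    votes += sel.fit(X, y).get_support().astype(int)  # boolean mask per selector

majority = votes >= 2  # a feature is "distilled" if most selectors chose it
print("distilled feature indices:", np.flatnonzero(majority))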
{"title":"Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising","authors":"D. Sisodia, Dilip Singh Sisodia","doi":"10.1108/dta-09-2021-0233","DOIUrl":"https://doi.org/10.1108/dta-09-2021-0233","url":null,"abstract":"PurposeThe problem of choosing the utmost useful features from hundreds of features from time-series user click data arises in online advertising toward fraudulent publisher's classification. Selecting feature subsets is a key issue in such classification tasks. Practically, the use of filter approaches is common; however, they neglect the correlations amid features. Conversely, wrapper approaches could not be applied due to their complexities. Moreover, in particular, existing feature selection methods could not handle such data, which is one of the major causes of instability of feature selection.Design/methodology/approachTo overcome such issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing the publisher's fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where we enumerated an accumulated evaluation of relevant feature subset to search for an optimal feature subset using effective machine learning (ML) models.FindingsEmpirical results prove enhanced classification performance with proposed features in average precision, recall, f1-score and AUC in publisher identification and classification.Originality/valueThe FDAS is evaluated on FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics, first, considering original features, second, with relevant feature subsets selected by feature selection (FS) methods, third, with optimal feature subset obtained by the proposed approach. ANOVA significance test is conducted to demonstrate significant differences between independent features.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"101 1","pages":"602-625"},"PeriodicalIF":1.6,"publicationDate":"2022-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73733228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Techniques to detect terrorists/extremists on the dark web: a review
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-01-06 | DOI: 10.1108/dta-07-2021-0177
H. Alghamdi, A. Selamat
Purpose: With the proliferation of terrorist/extremist websites on the World Wide Web, it has become progressively more crucial to detect and analyze the content of these websites. Accordingly, the volume of research focused on identifying the techniques and activities of terrorist/extremist groups, as revealed by their sites on the so-called dark web, has also grown.
Design/methodology/approach: This study presents a review of the techniques used to detect and process the content of terrorist/extremist sites on the dark web. Forty of the most relevant data sources were examined, and various techniques were identified among them.
Findings: Based on this review, it was found that methods of feature selection and feature extraction can be used for topic modeling combined with content analysis and text clustering.
Originality/value: At the end of the review, the current state of the art and certain open issues associated with Arabic dark web content analysis are presented.
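The topic modeling and text clustering pattern highlighted in the findings can be sketched in a few lines; the corpus and parameters below are placeholders, not data from the review.

# Minimal sketch of topic modeling over site content with scikit-learn.
# Corpus, topic count and vectorizer settings are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["placeholder document about forum discussions",
          "placeholder document about marketplace listings",
          "another placeholder text about messaging channels"]

X = CountVectorizer(stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-document topic mixtures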
{"title":"Techniques to detect terrorists/extremists on the dark web: a review","authors":"H. Alghamdi, A. Selamat","doi":"10.1108/dta-07-2021-0177","DOIUrl":"https://doi.org/10.1108/dta-07-2021-0177","url":null,"abstract":"PurposeWith the proliferation of terrorist/extremist websites on the World Wide Web, it has become progressively more crucial to detect and analyze the content on these websites. Accordingly, the volume of previous research focused on identifying the techniques and activities of terrorist/extremist groups, as revealed by their sites on the so-called dark web, has also grown.Design/methodology/approachThis study presents a review of the techniques used to detect and process the content of terrorist/extremist sites on the dark web. Forty of the most relevant data sources were examined, and various techniques were identified among them.FindingsBased on this review, it was found that methods of feature selection and feature extraction can be used as topic modeling with content analysis and text clustering.Originality/valueAt the end of the review, present the current state-of-the- art and certain open issues associated with Arabic dark Web content analysis.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"434 1","pages":"461-482"},"PeriodicalIF":1.6,"publicationDate":"2022-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86851666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Artificial intelligence technologies for more flexible recommendation in uniforms
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-01-04 | DOI: 10.1108/dta-09-2021-0230
Chih-Hao Wen, Chih-Chan Cheng, Y. Shih
Purpose: This research aims to collect human body variables via 2D images captured by digital cameras. Based on those variables, the forecast and recommendation of Digital Camouflage Uniforms (DCU) for Taiwan's military personnel are made.
Design/methodology/approach: A total of 375 subjects are recruited (male: 253; female: 122). In this study, OpenPose converts the photographed 2D images into four body variables, which are compared with those obtained by tape measure and 3D scanning. Then, the recommendation model of the DCU is built with a decision tree. Meanwhile, the Euclidean distance between the body variables and each DCU size in the manufacturing specification is calculated to produce the best three recommendations.
Findings: The fitting score for the single size recommended by the decision tree is only 0.62 and 0.63. However, for the best three recommended options, the DCU fitting score can be as high as 0.8 or more. The results of OpenPose and 3D scanning have the highest correlation coefficient even though the methods of measuring body size differ. This result confirms that OpenPose has significant measurement validity; that is, inexpensive equipment can be used to obtain reasonable results.
Originality/value: In general, the method proposed in this study is suitable for applications in e-commerce and the apparel industry in a long-distance, non-contact and non-pre-labeled manner while the world is facing Covid-19. In particular, it can reduce the measurement troubles of ordinary users when purchasing clothing online.
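The best-three recommendation step reduces to ranking sizes by Euclidean distance between measured body variables and each size's specification. The size chart below is invented for illustration; the paper derives its four variables from OpenPose keypoints.

# Sketch of the "best three sizes" step: rank DCU sizes by Euclidean
# distance between measured body variables and each size's manufacturing
# specification. The size chart values below are invented placeholders.
import numpy as np

# (shoulder width, chest, waist, height) per size -- illustrative numbers
SIZE_SPECS = {
    "S": np.array([42.0, 92.0, 78.0, 165.0]),
    "M": np.array([45.0, 98.0, 84.0, 172.0]),
    "L": np.array([48.0, 104.0, 90.0, 178.0]),
    "XL": np.array([51.0, 110.0, 96.0, 184.0]),
}

def best_three(body_vars):
    """Return the three sizes closest to the measured body variables."""
    dists = {size: float(np.linalg.norm(body_vars - spec))
             for size, spec in SIZE_SPECS.items()}
    return sorted(dists, key=dists.get)[:3]

print(best_three(np.array([46.0, 99.0, 85.0, 173.0])))  # -> ['M', 'L', 'S']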
{"title":"Artificial intelligence technologies for more flexible recommendation in uniforms","authors":"Chih-Hao Wen, Chih-Chan Cheng, Y. Shih","doi":"10.1108/dta-09-2021-0230","DOIUrl":"https://doi.org/10.1108/dta-09-2021-0230","url":null,"abstract":"PurposeThis research aims to collect human body variables via 2D images captured by digital cameras. Based on those human variables, the forecast and recommendation of the Digital Camouflage Uniforms (DCU) for Taiwan's military personnel are made.Design/methodology/approachA total of 375 subjects are recruited (male: 253; female: 122). In this study, OpenPose converts the photographed 2D images into four body variables, which are compared with those of a tape measure and 3D scanning simultaneously. Then, the recommendation model of the DCU is built by the decision tree. Meanwhile, the Euclidean distance of each size of the DCU in the manufacturing specification is calculated as the best three recommendations.FindingsThe recommended size established by the decision tree is only 0.62 and 0.63. However, for the recommendation result of the best three options, the DCU Fitting Score can be as high as 0.8 or more. The results of OpenPose and 3D scanning have the highest correlation coefficient even though the method of measuring body size is different. This result confirms that OpenPose has significant measurement validity. That is, inexpensive equipment can be used to obtain reasonable results.Originality/valueIn general, the method proposed in this study is suitable for applications in e-commerce and the apparel industry in a long-distance, non-contact and non-pre-labeled manner when the world is facing Covid-19. In particular, it can reduce the measurement troubles of ordinary users when purchasing clothing online.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"13 1","pages":"626-643"},"PeriodicalIF":1.6,"publicationDate":"2022-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75047026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Credit default swap prediction based on generative adversarial networks
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-01-01 | DOI: 10.1108/DTA-09-2021-0260
Shu-Ying Lin, Duen-Ren Liu, Hsien-Pin Huang
{"title":"Credit default swap prediction based on generative adversarial networks","authors":"Shu-Ying Lin, Duen-Ren Liu, Hsien-Pin Huang","doi":"10.1108/DTA-09-2021-0260","DOIUrl":"https://doi.org/10.1108/DTA-09-2021-0260","url":null,"abstract":"","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"73 1","pages":"720-740"},"PeriodicalIF":1.6,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87009668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2021-12-21 | DOI: 10.1108/dta-06-2021-0153
Laouni Djafri
Purpose: This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other platform. DDPML can also be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach: In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands; this is precisely the objective of data mining. But with the production of large amounts of data and knowledge at a faster pace, the authors now speak of Big Data mining. For this reason, the proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem raised in this work is how machine learning algorithms can be made to work in a distributed and parallel way at the same time without losing classification accuracy. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML). To build it, the work is divided into two parts. In the first, the authors propose a distributed architecture controlled by a Map-Reduce algorithm, which in turn depends on a random sampling technique. The distributed architecture is designed to handle big data processing coherently and efficiently with the sampling strategy proposed in this work. This architecture also helps to verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first level (PLBL1) and the second level (PLBL2). The experimental results show the efficiency of the proposed solution without significant loss of classification quality. Thus, in practical terms, the DDPML system is generally dedicated to big data mining processing and works effectively in distributed systems with a simple structure, such as client-server networks.
Findings: The authors obtained very satisfactory classification results.
Originality/value: The DDPML system is specially designed to smoothly handle big data mining classification.
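The sampling idea at the core of DDPML, stratified random sampling that preserves class proportions in the representative learning base, can be sketched as follows; the dataset and split sizes are placeholders for illustration, and the paper applies the sampling at two levels inside a Map-Reduce-style architecture.

# Sketch of class-stratified random sampling for a representative
# learning base. Dataset, class mix and the 10% sample size are
# illustrative assumptions, not the paper's configuration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_classes=3, n_informative=5,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# stratify=y preserves the 60/30/10 class mix inside the 10% sample
X_rlb, _, y_rlb, _ = train_test_split(X, y, train_size=0.10,
                                      stratify=y, random_state=0)
print("sample size per class:",
      {c: int((y_rlb == c).sum()) for c in set(y_rlb)})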
{"title":"Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing","authors":"Laouni Djafri","doi":"10.1108/dta-06-2021-0153","DOIUrl":"https://doi.org/10.1108/dta-06-2021-0153","url":null,"abstract":"PurposeThis work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, clouds computing or other technologies.Design/methodology/approachIn the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, the authors are now talking about Big Data mining. For this reason, the authors’ proposed works mainly aim at solving the problem of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. So, the problem that the authors are raising in this work is how the authors can make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by Map-Reduce algorithm which in turn depends on random sampling technique. So, the distributed architecture that the authors designed is specially directed to handle big data processing that operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors have extracted the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning base for the first level (PLBL1) and the partial learning base for the second level (PLBL2). The experimental results show the efficiency of our solution that the authors provided without significant loss of the classification results. 
Thus, in practical terms, the system DDPML is generally dedicated to big data mining processing, and works effectively in distributed systems with a simple structure, such as client-server networks.FindingsThe authors got very satisfactory classification results.Originality/valueDDPML system is specially designed to smoothly handle big data mining classification.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"59 1","pages":"558-601"},"PeriodicalIF":1.6,"publicationDate":"2021-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78852754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
A robust framework for shoulder implant X-ray image classification
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2021-11-30 | DOI: 10.1108/dta-08-2021-0210
M. Vo, Anh H. Vo, Tuong Le
Purpose: Medical images are increasingly widely used; therefore, deep learning-based analysis of these images to help diagnose diseases has become more and more essential. Recently, the shoulder implant X-ray image classification (SIXIC) dataset, which includes X-ray images of implanted shoulder prostheses produced by four manufacturers, was released. Detecting the implant's model helps select the correct equipment and procedures for the upcoming surgery.
Design/methodology/approach: This study proposes a robust model named X-Net to improve predictive performance for shoulder implant X-ray image classification on the SIXIC dataset. The X-Net model utilizes a Squeeze-and-Excitation (SE) block integrated into a Residual Network (ResNet) module. The SE module weighs each feature map extracted by ResNet, which aids in improving performance. The feature extraction process of the X-Net model is performed by both modules: ResNet and SE. The final feature is obtained by incorporating the features extracted in the above steps, which captures the more important characteristics of the X-ray images in the input dataset. Next, X-Net uses this fine-grained feature to classify the input images into four classes (Cofield, Depuy, Zimmer and Tornier) in the SIXIC dataset.
Findings: Experiments are conducted to show the proposed approach's effectiveness compared with other state-of-the-art methods for SIXIC. The experimental results indicate that the approach outperforms the various baseline methods in terms of several performance metrics. In addition, the proposed approach provides new state-of-the-art results in all performance metrics, such as accuracy, precision, recall, F1-score and area under the curve (AUC), for the experimental dataset.
Originality/value: The proposed method, with its high predictive performance, can be used to assist in the treatment of injured shoulder joints.
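The Squeeze-and-Excitation block that X-Net integrates into ResNet is a standard construction; a minimal PyTorch sketch is shown below. The reduction ratio of 16 is the common default, an assumption here rather than a detail from the paper.

# Minimal SE block: global-average-pool ("squeeze"), a two-layer
# bottleneck ("excitation"), then channel-wise reweighting of the map.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B,C,H,W -> B,C,1,1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                   # excitation: reweight channels

feats = torch.randn(2, 64, 56, 56)           # e.g. a ResNet stage output
print(SEBlock(64)(feats).shape)              # torch.Size([2, 64, 56, 56])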
{"title":"A robust framework for shoulder implant X-ray image classification","authors":"M. Vo, Anh H. Vo, Tuong Le","doi":"10.1108/dta-08-2021-0210","DOIUrl":"https://doi.org/10.1108/dta-08-2021-0210","url":null,"abstract":"PurposeMedical images are increasingly popular; therefore, the analysis of these images based on deep learning helps diagnose diseases become more and more essential and necessary. Recently, the shoulder implant X-ray image classification (SIXIC) dataset that includes X-ray images of implanted shoulder prostheses produced by four manufacturers was released. The implant's model detection helps to select the correct equipment and procedures in the upcoming surgery.Design/methodology/approachThis study proposes a robust model named X-Net to improve the predictability for shoulder implants X-ray image classification in the SIXIC dataset. The X-Net model utilizes the Squeeze and Excitation (SE) block integrated into Residual Network (ResNet) module. The SE module aims to weigh each feature map extracted from ResNet, which aids in improving the performance. The feature extraction process of X-Net model is performed by both modules: ResNet and SE modules. The final feature is obtained by incorporating the extracted features from the above steps, which brings more important characteristics of X-ray images in the input dataset. Next, X-Net uses this fine-grained feature to classify the input images into four classes (Cofield, Depuy, Zimmer and Tornier) in the SIXIC dataset.FindingsExperiments are conducted to show the proposed approach's effectiveness compared with other state-of-the-art methods for SIXIC. The experimental results indicate that the approach outperforms the various experimental methods in terms of several performance metrics. In addition, the proposed approach provides the new state of the art results in all performance metrics, such as accuracy, precision, recall, F1-score and area under the curve (AUC), for the experimental dataset.Originality/valueThe proposed method with high predictive performance can be used to assist in the treatment of injured shoulder joints.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"2 1","pages":"447-460"},"PeriodicalIF":1.6,"publicationDate":"2021-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83468297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
A novel semi-supervised self-training method based on resampling for Twitter fake account identification
IF 1.6 | CAS Tier 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2021-11-29 | DOI: 10.1108/dta-07-2021-0196
Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun, Jie Yin
Purpose: Twitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, spread commercial propaganda or impersonate others. Effective identification of bot accounts helps the public accurately judge disseminated information. However, in actual fake account identification, manually labeling Twitter accounts is expensive and inefficient, and the labeled data are usually class-imbalanced. To this end, the authors propose a novel framework to solve these problems.
Design/methodology/approach: In the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to a real Twitter account dataset from Kaggle. Specifically, the authors first train the classifier on an initial small amount of labeled account data, then use the trained classifier to automatically label large-scale unlabeled account data. Next, high-confidence instances are iteratively selected from the unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained. It is worth mentioning that the resampling technique is integrated into the self-training process, and the data classes are balanced at the initial stage of the self-training iteration.
Findings: The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results on six different base classifiers, especially for the initial small-scale labeled Twitter accounts.
Originality/value: This paper provides novel insights into identifying Twitter fake accounts. First, the authors take the lead in introducing a self-training method to automatically label Twitter accounts in a semi-supervised setting. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on the identification results.
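The self-training loop with initial resampling can be sketched compactly. The 0.9 confidence threshold, the oversampling step and the logistic regression base classifier below are illustrative choices, not the paper's exact configuration.

# Sketch of semi-supervised self-training with an initial resampling
# step: balance the small labeled seed set, then repeatedly pseudo-label
# unlabeled accounts and absorb only high-confidence predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2],
                           random_state=0)
X_lab, y_lab, X_unlab = X[:100], y[:100], X[100:]

# resampling step: oversample the minority class in the labeled seed set
minority = y_lab == 1
X_min, y_min = resample(X_lab[minority], y_lab[minority],
                        n_samples=int((~minority).sum()), random_state=0)
X_lab = np.vstack([X_lab[~minority], X_min])
y_lab = np.concatenate([y_lab[~minority], y_min])

for _ in range(5):  # self-training iterations
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= 0.9  # keep high-confidence instances
    if not confident.any():
        break
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]

print("labeled set grew to", len(y_lab))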
{"title":"A novel semi-supervised self-training method based on resampling for Twitter fake account identification","authors":"Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun, Jie Yin","doi":"10.1108/dta-07-2021-0196","DOIUrl":"https://doi.org/10.1108/dta-07-2021-0196","url":null,"abstract":"PurposeTwitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, commercial propaganda or impersonate others. The effective identification of bot accounts is conducive to accurately judge the disseminated information for the public. However, in actual fake account identification, it is expensive and inefficient to manually label Twitter accounts, and the labeled data are usually unbalanced in classes. To this end, the authors propose a novel framework to solve these problems.Design/methodology/approachIn the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to the real Twitter account data set from Kaggle. Specifically, the authors first train the classifier in the initial small amount of labeled account data, then use the trained classifier to automatically label large-scale unlabeled account data. Next, iteratively select high confidence instances from unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained. It is worth mentioning that the resampling technique is integrated into the self-training process, and the data class is balanced at the initial stage of the self-training iteration.FindingsThe proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results on 6 different base classifiers, especially for the initial small-scale labeled Twitter accounts.Originality/valueThis paper provides novel insights in identifying Twitter fake accounts. First, the authors take the lead in introducing a self-training method to automatically label Twitter accounts from the semi-supervised background. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on the identification effect.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"94 1","pages":"409-428"},"PeriodicalIF":1.6,"publicationDate":"2021-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91039787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3