Purpose: The proliferation of bots in social networks has profoundly affected the interactions of legitimate users. Detecting and rejecting these unwelcome bots has become part of the collective Internet agenda. Unfortunately, as bot creators use more sophisticated approaches to avoid discovery, it has become increasingly difficult to distinguish social bots from legitimate users. This paper therefore proposes a novel social bot detection mechanism that adapts to new and different kinds of bots.
Design/methodology/approach: This paper proposes a research framework that improves the generalization of social bot detection along two dimensions: feature extraction and detection approaches. First, 36 features are extracted from four views. Then, the contribution of each feature is analyzed across different kinds of social bots, and the features that generalize best are identified. Finally, outlier detection approaches are introduced to improve the detection of ever-changing social bots.
Findings: The experimental results show that the more important features generalize more effectively to different social bot detection tasks. Compared with the traditional binary-class classifier, the proposed outlier detection approaches adapt better to ever-changing social bots, reaching an F1 score of 89.23 per cent.
Originality/value: Based on a visual interpretation of feature contribution, the features that generalize best across detection tasks are identified. Outlier detection approaches are introduced for the first time to improve the detection of ever-changing social bots.
Research on the generalization of social bot detection from two dimensions: feature extraction and detection approaches. Ziming Zeng, Tingting Li, Jingjing Sun, Shouqiang Sun, Yu Zhang. Data Technologies and Applications, pp. 177-198, published 2022-09-02. DOI: 10.1108/dta-02-2022-0084, https://doi.org/10.1108/dta-02-2022-0084
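The outlier-detection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes a simple z-score profile learned from legitimate users only, two made-up features, and a threshold taken from the training scores. The names `fit_profile`, `outlier_score` and `fit_threshold` are hypothetical.

```python
from statistics import mean, pstdev

def fit_profile(legit_rows):
    """Learn per-feature mean and standard deviation from legitimate users only."""
    columns = list(zip(*legit_rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 when a feature has zero spread, so the division is defined.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def outlier_score(row, profile):
    """Root-mean-square z-score of one account against the legitimate profile."""
    means, stds = profile
    zs = [(x - m) / s for x, m, s in zip(row, means, stds)]
    return (sum(z * z for z in zs) / len(zs)) ** 0.5

def fit_threshold(legit_rows, profile, quantile=0.95):
    """Pick the score below which roughly `quantile` of legitimate users fall."""
    scores = sorted(outlier_score(r, profile) for r in legit_rows)
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]

# Hypothetical features: [posts per day, follower/following ratio]
legit = [[3.0, 1.1], [5.0, 0.9], [4.0, 1.3], [2.0, 1.0], [6.0, 0.8], [4.5, 1.2]]
profile = fit_profile(legit)
threshold = fit_threshold(legit, profile)

def is_bot(row):
    """Flag an account whose score exceeds anything typical of legitimate users."""
    return outlier_score(row, profile) > threshold
```

Because the profile is fitted on legitimate accounts alone, a previously unseen kind of bot can still be flagged as long as it deviates from that profile, which is the property a binary classifier trained on known bot types may lack.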
Pub Date: 2022-09-02 DOI: 10.1108/dta-04-2022-0172
Jaeseung Park, Xinzhe Li, Qing Li, Jaekyeong Kim
Purpose: The existing collaborative filtering algorithm may select an insufficiently representative customer as the neighbor of a target customer, which makes its recommendations less accurate. This study investigates how selecting influential and representative customers affects recommendation performance.
Design/methodology/approach: Some studies have shown that review helpfulness and consistency significantly affect purchase decision-making. This study therefore selects influential and representative neighbors from customers who have written helpful and consistent reviews. The authors apply a text-mining approach to analyze review helpfulness and consistency, and they evaluate the proposed methodology on several real-world Amazon review data sets for experimental utility and reliability.
Findings: This study is the first to propose a methodology for investigating the effect of review consistency and helpfulness on recommendation performance. The experimental results confirmed that recommendation performance was better when neighbors were selected from customers who wrote consistent or helpful reviews than when neighbors were selected from all customers.
Originality/value: This study investigates the effect of review consistency and helpfulness on recommendation performance. Online reviews can enhance recommendation performance because they reflect the purchasing behavior of customers who consider reviews when purchasing items. The experimental results indicate that review helpfulness and consistency can enhance the performance of personalized recommendation services, increase customer satisfaction and increase confidence in a company.
Impact on recommendation performance of online review helpfulness and consistency. Jaeseung Park, Xinzhe Li, Qing Li, Jaekyeong Kim. Data Technologies and Applications, pp. 199-221, published 2022-09-02. DOI: 10.1108/dta-04-2022-0172, https://doi.org/10.1108/dta-04-2022-0172
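The neighbor-selection idea in this abstract can be sketched as user-based collaborative filtering in which only customers with sufficiently helpful reviews may act as neighbors. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the `helpfulness` scores, the `min_helpfulness` cutoff, cosine similarity and all customer data are hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity over the items both customers rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict(target, item, ratings, helpfulness, min_helpfulness=0.5, k=2):
    """Similarity-weighted rating from the k most similar qualified neighbors."""
    candidates = [
        (cosine(ratings[target], ratings[u]), u)
        for u in ratings
        if u != target
        and item in ratings[u]
        and helpfulness.get(u, 0.0) >= min_helpfulness  # neighbor filter
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no usable neighbor rated this item
    return sum(sim * ratings[u][item] for sim, u in neighbors) / total

# Hypothetical ratings and review-helpfulness scores.
ratings = {
    "ann": {"book": 5, "lamp": 3},
    "bob": {"book": 5, "lamp": 3, "desk": 4},
    "cat": {"book": 4, "lamp": 3, "desk": 2},
    "dan": {"book": 1, "lamp": 5, "desk": 1},
}
helpfulness = {"bob": 0.9, "cat": 0.7, "dan": 0.1}

filtered = predict("ann", "desk", ratings, helpfulness)
unfiltered = predict("ann", "desk", ratings, helpfulness, min_helpfulness=0.0, k=3)
```

Comparing `filtered` with `unfiltered` shows how excluding a low-helpfulness customer changes the prediction; whether that change improves accuracy is an empirical question the paper addresses on Amazon review data.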
Pub Date: 2022-09-01 DOI: 10.1108/dta-01-2022-0001
A. Keyhanipour, F. Oroumchian
Purpose: User feedback inferred from the user's search-time behavior could improve learning to rank (L2R) algorithms. Click models (CMs) provide probabilistic frameworks for describing and predicting the user's clicks during search sessions. Most CMs are based on common assumptions such as Attractiveness, Examination and User Satisfaction. CMs usually treat Attractiveness and Examination as pre- and post-estimators of the actual relevance, and they assume that User Satisfaction is a function of the actual relevance. This paper extends the authors' previous work by building a reinforcement learning (RL) model to predict relevance. Attractiveness, Examination and User Satisfaction are estimated from a limited number of the features of the benchmark data set and are then incorporated in the construction of an RL agent. The proposed RL model learns to predict the relevance label of documents with respect to a given query more effectively than the baseline RL models for those data sets.
Design/methodology/approach: In this paper, User Satisfaction is used as an indication of the relevance level of a query to a document. User Satisfaction itself is estimated through Attractiveness and Examination, which in turn are calculated by the random forest algorithm. Only a small subset of top information retrieval (IR) features is used, selected by mean average precision and normalized discounted cumulative gain values. Based on the authors' observations, the product of the Attractiveness and Examination values of a given query–document pair closely approximates the User Satisfaction and hence the relevance level. The RL model is designed so that the current state of the RL agent is determined by discretizing the estimated Attractiveness and Examination values. In this way, each query–document pair is mapped to a specific state based on its Attractiveness and Examination values. Then, based on the reward function, the RL agent tries to choose the action (relevance label) that maximizes the reward received in its current state. Using temporal difference (TD) learning algorithms, such as Q-learning and SARSA, the agent gradually learns to identify an appropriate relevance label in each state. The reward used in the RL agent is proportional to the difference between the User Satisfaction and the selected action.
Findings: Experimental results on the MSLR-WEB10K and WCL2R benchmark data sets demonstrate that the proposed algorithm, named SeaRank, outperforms baseline algorithms. The improvement is more noticeable in top-ranked results, which usually receive more attention from users.
Originality/value: This research provides a mapping from IR features to CM features and then uses these newly generated features to build an RL model. This RL model is proposed with the definition of the states, acti…
SeaRank: relevance prediction based on click models in a reinforcement learning framework. A. Keyhanipour, F. Oroumchian. Data Technologies and Applications, published 2022-09-01. DOI: 10.1108/dta-01-2022-0001, https://doi.org/10.1108/dta-01-2022-0001
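The state, action and reward design described in this abstract can be sketched with tabular Q-learning. This is a toy illustration, not SeaRank itself: the number of bins, the label range 0 to 4, the scaling of User Satisfaction, the sign of the reward (a negative absolute gap) and the way the next state is taken from the next pair in the stream are all assumptions, and the Attractiveness and Examination values are made up rather than produced by a random forest.

```python
import random
from collections import defaultdict

N_BINS = 5          # assumed discretization granularity
LABELS = range(5)   # assumed relevance labels 0..4

def to_state(attractiveness, examination):
    """Map a query-document pair to a discrete state from its two CM estimates."""
    def bin_of(x):
        return min(int(x * N_BINS), N_BINS - 1)
    return bin_of(attractiveness), bin_of(examination)

def reward(attractiveness, examination, action):
    """Assumed reward: negative gap between scaled satisfaction and the chosen label."""
    satisfaction = attractiveness * examination * max(LABELS)
    return -abs(satisfaction - action)

def train(pairs, steps=3000, alpha=0.5, gamma=0.5, epsilon=0.2, seed=0):
    """Q-learning over a stream of pairs; the next pair supplies the next state."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for t in range(steps):
        a_val, e_val = pairs[t % len(pairs)]
        state = to_state(a_val, e_val)
        next_state = to_state(*pairs[(t + 1) % len(pairs)])
        if rng.random() < epsilon:
            action = rng.choice(list(LABELS))          # explore
        else:
            action = max(LABELS, key=lambda a: q[(state, a)])  # exploit
        best_next = max(q[(next_state, a)] for a in LABELS)
        target = reward(a_val, e_val, action) + gamma * best_next
        q[(state, action)] += alpha * (target - q[(state, action)])
    return q

def greedy_label(q, attractiveness, examination):
    """Relevance label the trained agent prefers for a pair."""
    state = to_state(attractiveness, examination)
    return max(LABELS, key=lambda a: q[(state, a)])

# Hypothetical (Attractiveness, Examination) estimates for three pairs.
pairs = [(0.9, 0.85), (0.1, 0.2), (0.5, 0.55)]
q_table = train(pairs)
```

Swapping the `best_next` term for the value of the action actually taken in the next state would turn this update into SARSA, the other TD algorithm the abstract mentions.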
Pub Date: 2022-08-13 DOI: 10.1108/dta-05-2022-0209
H. Dibowski
Purpose: The curation of ontologies and knowledge graphs (KGs) is an essential task for industrial knowledge-based applications, as they rely on the contained knowledge being correct and error-free. Often, a significant part of a KG is curated by humans. Established validation methods, such as Shapes Constraint Language, Shape Expressions or Web Ontology Language, can detect wrong statements only after their materialization, which can be too late. What is required instead is an approach that avoids errors and adequately supports users.
Design/methodology/approach: To solve that problem, Property Assertion Constraints (PACs) have been developed. PACs extend the range definition of a property with additional logic expressed in SPARQL. For the context of a given instance and property, a tailored PAC query is dynamically built and run against the KG. It determines all values that will result in valid property value assertions.
Findings: PACs can effectively prevent KGs from being extended with invalid property value assertions, as the expertise they contain narrows down the valid options a user can choose from. This simplifies knowledge curation and, most notably, relieves users or machines of knowing and applying this expertise, leaving it to a computer instead.
Originality/value: PACs are fundamentally different from existing approaches. Instead of detecting erroneous materialized facts, they determine all semantically correct assertions before materializing them. This avoids invalid property value assertions and gives users informed, purposeful assistance. To the author's knowledge, PACs are the only such approach.
Property Assertion Constraints for ontologies and knowledge graphs. H. Dibowski. Data Technologies and Applications, pp. 157-176, published 2022-08-13. DOI: 10.1108/dta-05-2022-0209, https://doi.org/10.1108/dta-05-2022-0209
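The PAC principle of computing the valid values before an assertion is made can be mimicked on a toy triple set. The sketch below is only an analogy: real PACs are expressed in SPARQL and evaluated on a KG, whereas here a plain Python predicate stands in for the query, and the vocabulary (`monitoredBy`, `locatedIn`, the instance names) is entirely hypothetical.

```python
# A tiny in-memory "knowledge graph" of (subject, predicate, object) triples.
TRIPLES = {
    ("pump1", "type", "Pump"),
    ("sensorA", "type", "Sensor"),
    ("sensorB", "type", "Sensor"),
    ("valve1", "type", "Valve"),
    ("pump1", "locatedIn", "room1"),
    ("sensorA", "locatedIn", "room1"),
    ("sensorB", "locatedIn", "room2"),
    ("valve1", "locatedIn", "room1"),
}

def objects(subject, predicate):
    """All objects asserted for a subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def instances_of(cls):
    """All instances typed with the given class."""
    return {s for s, p, o in TRIPLES if p == "type" and o == cls}

def same_location(instance, candidate):
    """Extra logic beyond the range: both must share a location."""
    return bool(objects(instance, "locatedIn") & objects(candidate, "locatedIn"))

# Hypothetical constraint table: property -> (range class, additional check).
PACS = {"monitoredBy": ("Sensor", same_location)}

def valid_values(instance, prop):
    """Values that would yield a valid assertion (instance, prop, value)."""
    range_cls, constraint = PACS[prop]
    return sorted(c for c in instances_of(range_cls) if constraint(instance, c))
```

A plain range check would offer both sensors for `pump1`; the additional constraint narrows the choice to the sensor in the same room, so the invalid option is never presented to the user.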
Pub Date: 2022-08-10 DOI: 10.1108/dta-08-2021-0222
Jie Ma, Zhiyuan Hao, Mo Hu
Purpose: The density peak clustering algorithm (DP) identifies cluster centers using two parameters: the ρ value (local density) and the δ value (the distance between a point and the nearest point with a higher ρ value). According to the center-identifying principle of the DP, potential cluster centers should have a higher ρ value and a higher δ value than other points. However, this principle may prevent the DP from identifying categories with multiple centers or centers in lower-density regions. In addition, the assignment strategy of the DP can assign non-center points to the wrong cluster. This paper aims to address these issues and improve the clustering performance of the DP.
Design/methodology/approach: First, to identify as many potential cluster centers as possible, the authors construct a point-domain by introducing the pinhole imaging strategy, which extends the search range for potential cluster centers. Second, they design novel methods for calculating the domain distance, point-domain density and domain similarity. Third, they use domain similarity to merge domains and optimize the final clustering results.
Findings: Experiments on 12 synthetic data sets and 12 real-world data sets show that two-stage density peak clustering based on multi-strategy optimization (TMsDP) outperforms the DP and other state-of-the-art algorithms.
Originality/value: The authors propose a novel DP-based clustering method, TMsDP, which transforms the relationship between points into a relationship between domains to further optimize the clustering performance of the DP.
TMsDP: two-stage density peak clustering based on multi-strategy optimization. Jie Ma, Zhiyuan Hao, Mo Hu. Data Technologies and Applications, published 2022-08-10. DOI: 10.1108/dta-08-2021-0222, https://doi.org/10.1108/dta-08-2021-0222
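The two quantities that the baseline DP relies on can be computed in a few lines. The sketch below covers only the original density peak step (ρ and δ), not the TMsDP extensions; the cutoff distance `d_c`, the cutoff-count definition of density and the sample points are assumptions for illustration.

```python
from math import dist

def density_peaks(points, d_c):
    """Return (rho, delta) for every point.

    rho[i]   = number of other points closer than the cutoff d_c.
    delta[i] = distance to the nearest point of higher density; points with
               no denser neighbor get their largest distance to any point.
    """
    n = len(points)
    rho = [
        sum(1 for j in range(n) if j != i and dist(points[i], points[j]) < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist(points[i], points[j]) for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            delta.append(max(dist(points[i], points[j]) for j in range(n)))
    return rho, delta

# Two small, well-separated groups (made-up coordinates).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.4, 0.4),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (10.4, 10.4),
]
rho, delta = density_peaks(points, d_c=1.0)

# Centers stand out with both high rho and high delta; rank by their product.
ranked = sorted(range(len(points)), key=lambda i: rho[i] * delta[i], reverse=True)
centers = sorted(ranked[:2])
```

In this toy data each group has a single dense middle point, so ranking by ρ·δ recovers one center per group. The limitation named in the abstract appears when a category has several centers or a center in a sparse region, where this ranking can miss them.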
Pub Date: 2022-07-09 DOI: 10.1108/dta-02-2022-0072
Riju Bhattacharya, N. K. Nagwani, Sarsij Tripathi
Purpose: Social networking platforms increasingly use follower–followee link prediction (FFLP) to expand their user base. FFLP helps users discover people they do not yet know and can be used to determine the relationships among the nodes in a social network. Choosing the appropriate person to follow becomes crucial as the number of users increases. A hybrid model employing the Ensemble Learning algorithm for FFLP (HMELA) is proposed to recommend new follower links in large networks.
Design/methodology/approach: HMELA includes fundamental classification techniques for treating link prediction as a binary classification problem. The data sets are represented using a variety of machine-learning-friendly hybrid graph features. The HMELA is evaluated using six real-world social network data sets.
Findings: The first set of experiments used exploratory data analysis on a di-graph to produce a balanced matrix. The second set of experiments compared the benchmark and hybrid features on the data sets. This was followed by experiments with benchmark classifiers and ensemble learning methods. The experiments show that the proposed HMELA method predicts missing links better than other methods.
Practical implications: This paper proposes a hybrid model for link prediction. The HMELA model is evaluated by its area under the curve (AUC) scores when predicting new future links. The proposed approach facilitates comprehension and insight into the domain of link prediction. This work is aimed mainly at academics, practitioners and others involved in the field of social networks. The model is also effective for product recommendation and for recommending new friends and users on social networks.
Originality/value: The results on six benchmark data sets show that the AUC scores were higher when the HMELA strategy was applied to each of the selected data sets than when individual techniques were applied to the same data sets. Using the HMELA technique, the maximum AUC score on the Facebook data set increased by 10.3 percentage points, from 0.8449 to 0.9479. Accuracy on the Net Science, Karate Club and USAir data sets also increased by 8.53 per cent. As a result, the HMELA strategy outperforms every other strategy tested in the study.
A hybrid approach for predicting missing follower-followee links in social networks using topological features with ensemble learning. Riju Bhattacharya, N. K. Nagwani, Sarsij Tripathi. Data Technologies and Applications, pp. 131-153, published 2022-07-09. DOI: 10.1108/dta-02-2022-0072, https://doi.org/10.1108/dta-02-2022-0072
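Treating link prediction as binary classification and scoring it with AUC can be shown with one topological feature. This sketch is not HMELA: it assumes a single Jaccard feature over followee sets in place of the paper's hybrid features and ensemble, and the graph and candidate pairs are made up. It also works through the AUC gain quoted in the abstract.

```python
def jaccard_followees(graph, u, v):
    """Overlap of the accounts u and v follow, as a similarity in [0, 1]."""
    a, b = graph.get(u, set()), graph.get(v, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def auc(pos_scores, neg_scores):
    """Chance that a random true link outscores a random non-link (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical directed graph: account -> set of accounts it follows.
graph = {
    "a": {"c", "d", "e"},
    "b": {"c", "d"},
    "f": {"g"},
    "h": {"c"},
}

# Held-out pairs labeled as real links (positives) or non-links (negatives).
positives = [("a", "b"), ("a", "h")]
negatives = [("a", "f"), ("b", "h")]

pos_scores = [jaccard_followees(graph, u, v) for u, v in positives]
neg_scores = [jaccard_followees(graph, u, v) for u, v in negatives]
score = auc(pos_scores, neg_scores)

# The Facebook AUC figures from the abstract, absolute versus relative gain.
absolute_gain = 0.9479 - 0.8449
relative_gain = absolute_gain / 0.8449
```

The last two lines show that the step from 0.8449 to 0.9479 is an absolute gain of 0.103 in AUC, which corresponds to a relative improvement of roughly 12 per cent.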
Pub Date: 2022-07-05 DOI: 10.1108/dta-12-2021-0359
Jiho Kim, Hanjun Lee, Hongchul Lee
Purpose: This paper aims to find determinants that can predict the helpfulness of online customer reviews (OCRs) with a novel approach.
Design/methodology/approach: The approach combines feature engineering based on various text mining techniques, including BERT, with machine learning models that classify OCRs according to their potential helpfulness. Explainable artificial intelligence methodologies are then used to identify the determinants of helpfulness.
Findings: The boosting-based ensemble model showed the highest prediction performance. In addition, the sentiment features of OCRs and the reputation of reviewers were confirmed to be important determinants that increase review helpfulness.
Research limitations/implications: Each online community has different purposes, fields and characteristics, so the results of this study cannot be generalized. However, the approach is expected to be applicable to any platform where online reviews are used.
Originality/value: This paper incorporates feature engineering methodologies for online reviews, including the latest methodology. It also includes novel techniques that contribute to ongoing research on mining the determinants of review helpfulness.
Mining the determinants of review helpfulness: a novel approach using intelligent feature engineering and explainable AI. Jiho Kim, Hanjun Lee, Hongchul Lee. Data Technologies and Applications, pp. 108-130, published 2022-07-05. DOI: 10.1108/dta-12-2021-0359, https://doi.org/10.1108/dta-12-2021-0359
Pub Date: 2022-06-28 DOI: 10.1108/dta-02-2022-0076
Akhil Kumar
Purpose: This work presents a deep learning model for face mask detection in surveillance environments such as automatic teller machines (ATMs) and banks, to identify persons wearing face masks. In surveillance environments, complete visibility of the face area is a guideline, and criminals and law offenders commit crimes while hiding their faces behind a face mask. The proposed face mask detector can be integrated with surveillance cameras in autonomous surveillance environments to identify and catch law offenders and criminals.
Design/methodology/approach: The proposed face mask detector is developed by integrating the residual network (ResNet)34 feature extractor on top of three You Only Look Once (YOLO) detection layers, together with a spatial pyramid pooling (SPP) layer to extract a rich and dense feature map. At training time, data augmentation operations such as Mosaic and MixUp are applied to the feature extraction network so that it is trained with images of varying complexities. The detector is trained and tested on a custom face mask detection dataset of 52,635 images. For validation, it is compared with YOLO v1, v2, tiny YOLO v1, v2, v3 and v4 and other benchmark work in the literature using precision, recall, F1 score, mean average precision (mAP) for the overall dataset and average precision (AP) for each class of the dataset.
Findings: The proposed face mask detector achieved 4.75–9.75 per cent higher detection accuracy in terms of mAP, 5–31 per cent higher AP for detection of faces with masks and, specifically, 2–30 per cent higher AP for detection of face masks on the face region compared with the tested baseline variants of YOLO. Furthermore, using the ResNet34 feature extractor and SPP layer reduced the training time and the detection time. The model performs detection on an image in 0.45 s, which is 0.15–0.2 s less than the other tested YOLO variants.
Research limitations/implications: The proposed face mask detector can be used as a tool to detect persons with face masks who are a potential threat to automatic surveillance environments such as ATMs, banks and airport security checks. It can also be trained and tested for other object detection problems such as cancer detection in images, fish species detection and vehicle detection.
Practical implications: The proposed face mask detector can be integrated with automatic surveillance systems and used as a tool to detect persons with face masks who are potential threats to ATMs, banks, etc. and, in the present times of COVID-19, to detect whether people are following the COVID-appropriate behavior of wearing face masks in public places.
Originality/value: The novelty of this work lies in the use of the ResNet34 feature extractor with YOLO detection layers, which makes the proposed model a compact and powerful convolutional neural network-based face mask detector. In addition, the SPP layer is applied to the ResNet34 feature extractor so that it can extract a rich and dense feature map. Another novelty is the use of Mosaic and MixUp data augmentation in the training network, which provides the feature extractor with three times as many images of varying complexity and orientation and further helps achieve higher detection accuracy. The model extracts rich features, is augmented at training time and achieves high detection accuracy while maintaining detection speed.
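The evaluation metrics named in the abstract (precision, recall and F1 score) can be illustrated for a single class and a single image with a short sketch. The box format `(x1, y1, x2, y2)`, the greedy matching rule and the IoU threshold of 0.5 are assumptions made for illustration; the abstract does not state the matching protocol the author used.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predicted boxes to ground-truth boxes of one class.

    A prediction is a true positive if its best still-unmatched ground-truth
    box overlaps it with IoU >= iou_threshold (threshold is an assumption).
    """
    unmatched = list(ground_truth)
    tp = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            tp += 1
            unmatched.remove(best)  # each ground-truth box is matched once
    fp = len(predictions) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

AP and mAP build on the same matching step by sweeping the detector's confidence threshold and averaging precision over recall levels and classes; that part is omitted here to keep the sketch short.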
{"title":"A cascaded deep-learning-based model for face mask detection","authors":"Akhil Kumar","doi":"10.1108/dta-02-2022-0076","DOIUrl":"https://doi.org/10.1108/dta-02-2022-0076","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"1 1","pages":"84-107"},"PeriodicalIF":1.6,"publicationDate":"2022-06-28","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"85569302","RegionNum":4,"RegionCategory":"Computer Science"}
Pub Date: 2022-06-20 DOI: 10.1108/dta-10-2021-0296
L. Singh, R. Janghel, S. Sahu
Purpose: Automated skin lesion analysis plays a vital role in early detection. Relatively small, imbalanced skin lesion datasets impede learning and dominate research in automated skin lesion analysis. The lack of adequate data makes it difficult to develop classification methods because of the skewed class distribution.
Design/methodology/approach: Boosting-based transfer learning (TL) paradigms such as the Transfer AdaBoost algorithm can compensate for such a lack of samples by taking advantage of auxiliary data. However, in such methods, beneficial source instances representing the target have a fast and stochastic weight convergence, which results in “weight-drift” that negates transfer. In this paper, a framework is designed using “Rare-Transfer” (RT), a boosting-based TL algorithm that prevents “weight-drift” and simultaneously addresses absolute-rarity in skin lesion datasets. RT prevents the weights of source samples from converging quickly. It addresses absolute-rarity using an instance transfer approach that incorporates the best-fit set of auxiliary examples, which improves balanced error minimization. It simultaneously compensates for class imbalance and the scarcity of training samples under absolute-rarity to induce balanced error optimization.
Findings: Promising results are obtained with RT compared with state-of-the-art techniques on absolute-rare skin lesion datasets, with an accuracy of 92.5%. A Wilcoxon signed-rank test examines significant differences between the proposed RT algorithm and the conventional algorithms used in the experiment.
Originality/value: Experiments are performed on four absolute-rare skin lesion datasets, and the effectiveness of RT is assessed based on accuracy, sensitivity, specificity and area under the curve. The performance is compared with existing ensemble and boosting-based TL methods.
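To make the “weight-drift” issue concrete, the sketch below shows one round of the classical Transfer AdaBoost (TrAdaBoost) weight update that the abstract refers to as the baseline. It is not the Rare-Transfer algorithm, whose correction is not described in the abstract. The function name and the 0/1 misclassification flags are illustrative assumptions, and the clipping of the target error is a practical safeguard added here.

```python
import math


def tradaboost_update(source_w, target_w, source_miss, target_miss, n_rounds):
    """One classical TrAdaBoost weight update (not the RT variant).

    source_w, target_w:       current instance weights (non-empty lists).
    source_miss, target_miss: 1 where the weak learner misclassified, else 0.
    n_rounds:                 total number of boosting rounds N.
    """
    n = len(source_w)
    # Fixed shrink factor for misclassified source instances.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / n_rounds))
    # Weighted error is measured on the target data only.
    eps = sum(w * m for w, m in zip(target_w, target_miss)) / sum(target_w)
    eps = min(max(eps, 1e-10), 0.499)  # safeguard: keep beta_tgt in (0, 1)
    beta_tgt = eps / (1.0 - eps)
    # Misclassified source instances shrink; misclassified target ones grow.
    new_source = [w * beta_src ** m for w, m in zip(source_w, source_miss)]
    new_target = [w * beta_tgt ** (-m) for w, m in zip(target_w, target_miss)]
    # Normalize so that all weights sum to one.
    total = sum(new_source) + sum(new_target)
    return [w / total for w in new_source], [w / total for w in new_target]
```

In this update the source weights can only shrink or stay unchanged before normalization while misclassified target weights grow, so the share of total weight held by source instances, including correctly classified ones, tends to fall from round to round. That shrinking share is the effect the abstract calls weight-drift.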
{"title":"A boosting-based transfer learning method to address absolute-rarity in skin lesion datasets and prevent weight-drift for melanoma detection","authors":"L. Singh, R. Janghel, S. Sahu","doi":"10.1108/dta-10-2021-0296","DOIUrl":"https://doi.org/10.1108/dta-10-2021-0296","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"14 1","pages":"1-17"},"PeriodicalIF":1.6,"publicationDate":"2022-06-20","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"75255945","RegionNum":4,"RegionCategory":"Computer Science"}
Pub Date: 2022-06-03 DOI: 10.1108/dta-12-2021-0389
Xiyue Deng, Xiaoming Li, Zhenzhen Chen, Meng Zhu, N. Xiong, Li Shen
Purpose: Human group behavior is the driving force behind many complex social and economic phenomena. Few studies have integrated multi-dimensional travel patterns and city interest points to construct urban security risk indicators. This paper combines traffic data and urban alarm data to analyze the safe travel characteristics of the urban population. The results help to explore the diversity of human group behavior, capture its temporal and spatial regularities and reveal regional security risks. They provide a reference for optimizing resource deployment and group intelligence analysis in emergency management.
Design/methodology/approach: Based on the dynamics index of group behavior, this paper mines large-scale shared-bike and ride-hailing data from a large city in China. We integrate urban interest points and travel dynamic characteristics, construct an urban traffic safety index based on alarm behavior and further calculate the urban safety index.
Findings: This study found significant differences in the travel power index among ride-sharing users. There is a positive correlation between user shared-bike trips and the power-law bimodal phenomenon in the logarithmic coordinate system, and this is closely related to the urban public security index.
Originality/value: Based on a group-shared dynamic index integrated with alarm data, we constructed an urban public safety index and analyzed the correlation of travel alarm behavior. The results reveal the internal mechanism of the group behavior safety index and provide a valuable supplement for police intelligence analysis.
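A power law y = c · x^slope appears as a straight line in logarithmic coordinates, which is why such patterns are usually inspected on log-log axes. The sketch below estimates the slope by ordinary least squares on the log-transformed values. It is a generic illustration under assumed inputs (positive `xs` and `ys`, for example trip counts and their frequencies); the abstract does not describe the fitting procedure the authors used, and the function name is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(c) + slope * log(x).

    xs, ys must be positive. Returns (slope, c).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    c = math.exp(mean_y - slope * mean_x)
    return slope, c
```

A bimodal pattern would show up as two regimes with different slopes, in which case the fit could be applied separately to each range of `xs`; where to split the range is a modelling choice that this sketch leaves open.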
{"title":"Construction of public security indicators based on characteristics of shared group behavior patterns","authors":"Xiyue Deng, Xiaoming Li, Zhenzhen Chen, Meng Zhu, N. Xiong, Li Shen","doi":"10.1108/dta-12-2021-0389","DOIUrl":"https://doi.org/10.1108/dta-12-2021-0389","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications"},"PeriodicalIF":1.6,"publicationDate":"2022-06-03","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"48065288","RegionNum":4,"RegionCategory":"Computer Science"}