Measuring land lot shapes for property valuation
Pub Date: 2023-08-08 | DOI: 10.1108/dta-12-2022-0461
Chan-Jae Lee
Purpose: Unstructured data such as images have long resisted use in property valuation; instead, structured data in tabular format are commonly employed to estimate property prices. This study quantifies the shape of land lots and uses the resulting output as an input variable for subsequent land valuation models.
Design/methodology/approach: Imagery data containing land lot shapes are fed into a convolutional neural network, which classifies each lot into two categories, regular and irregular. The intermediate output (a regularity score) is then used in four downstream models to estimate land prices: random forest, gradient boosting, support vector machine and regression models.
Findings: Quantifying land lot shapes and exploiting them in valuation improved the predictive accuracy of all subsequent models.
Originality/value: The findings are expected to promote the adoption of elusive price determinants, such as the shape of a land lot, the appearance of a house and the landscape of a neighborhood, in property appraisal practice.
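The two-stage pipeline is straightforward to prototype. The sketch below is a minimal illustration, not the paper's architecture: an untrained small CNN's sigmoid output stands in for the regularity score, which is appended to tabular features for a downstream random forest. All shapes, dimensions and data are stand-ins.

```python
# Minimal sketch: a small CNN yields a "regularity score" in [0, 1] for each
# rasterized lot shape; the score is appended to tabular features for a
# downstream random-forest price model. All sizes and data are assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestRegressor

class RegularityNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16 * 16, 1))

    def forward(self, x):                       # x: (N, 1, 64, 64) lot masks
        return torch.sigmoid(self.head(self.features(x)))

net = RegularityNet()
lot_images = torch.rand(100, 1, 64, 64)         # stand-in rasterized lots
with torch.no_grad():
    score = net(lot_images).numpy()             # (100, 1) regularity scores

tabular = np.random.rand(100, 5)                # stand-in structured features
X = np.hstack([tabular, score])                 # augment with the CNN output
y = np.random.rand(100)                         # stand-in land prices
RandomForestRegressor(n_estimators=100).fit(X, y)
```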
{"title":"Measuring land lot shapes for property valuation","authors":"Chan-Jae Lee","doi":"10.1108/dta-12-2022-0461","DOIUrl":"https://doi.org/10.1108/dta-12-2022-0461","url":null,"abstract":"PurposeUnstructured data such as images have defied usage in property valuation for a long time. Instead, structured data in tabular format are commonly employed to estimate property prices. This study attempts to quantify the shape of land lots and uses the resultant output as an input variable for subsequent land valuation models.Design/methodology/approachImagery data containing land lot shapes are fed into a convolutional neural network, and the shape of land lots is classified into two categories, regular and irregular-shaped. Then, the intermediate output (regularity score) is utilized in four downstream models to estimate land prices: random forest, gradient boosting, support vector machine and regression models.FindingsQuantification of the land lot shapes and their exploitation in valuation led to an improvement in the predictive accuracy for all subsequent models.Originality/valueThe study findings are expected to promote the adoption of elusive price determinants such as the shape of a land lot, appearance of a house and the landscape of a neighborhood in property appraisal practices.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45251306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Savitar: an intelligent sign language translation approach for deafness and dysphonia in the COVID-19 era
Pub Date: 2023-07-07 | DOI: 10.1108/dta-09-2022-0375
Wuyan Liang, Xiaolong Xu
Purpose: In the COVID-19 era, sign language (SL) translation has gained attention in online learning, where it evaluates the physical gestures of each student and bridges the communication gap between people with dysphonia and hearing people. The purpose of this paper is to align SL sequences with natural-language sequences at high translation performance.
Design/methodology/approach: SL can be characterized as joint/bone location information in two-dimensional space over time, forming skeleton sequences. To encode joints, bones and their motion information, the authors propose a multistream hierarchy network (MHN) along with a vocab prediction network (VPN) and a joint network (JN) built on the recurrent neural network transducer. The JN concatenates the sequences encoded by the MHN and VPN and learns their alignments.
Findings: The effectiveness of the proposed approach is verified on three large-scale datasets: translation accuracy reaches 94.96, 54.52 and 92.88 per cent, and inference is 18 and 1.7 times faster than the listen-attend-spell network (LAS) and the visual hierarchy to lexical sequence network (H2SNet), respectively.
Originality/value: This paper proposes a novel framework that fuses multimodal input (i.e. joint, bone and motion streams) and aligns the input streams with natural language. Moreover, the framework benefits from the complementary properties of the MHN, VPN and JN. Experimental results on the three datasets demonstrate that the approach outperforms state-of-the-art methods in both translation accuracy and speed.
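The MHN/VPN/JN internals are specific to the paper, but the joint network follows the standard recurrent-neural-network-transducer pattern: encoder and prediction-network outputs are broadcast over the time and label axes, concatenated and projected to the vocabulary. Below is a minimal PyTorch sketch of that generic joiner, with illustrative dimensions; it is an assumption about the general pattern, not the authors' implementation.

```python
# Generic RNN-transducer joiner: the encoder output (standing in for the MHN
# over skeleton frames) and the prediction-network output (standing in for
# the VPN) are concatenated per (time, label) pair and projected to vocab
# logits. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class Joiner(nn.Module):
    def __init__(self, enc_dim=256, pred_dim=128, vocab=1000):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, vocab)

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim) skeleton-stream encodings
        # pred: (B, U, pred_dim) label-stream encodings
        T, U = enc.size(1), pred.size(1)
        enc = enc.unsqueeze(2).expand(-1, -1, U, -1)      # (B, T, U, enc_dim)
        pred = pred.unsqueeze(1).expand(-1, T, -1, -1)    # (B, T, U, pred_dim)
        return self.proj(torch.cat([enc, pred], dim=-1))  # (B, T, U, vocab)

joiner = Joiner()
logits = joiner(torch.rand(2, 50, 256), torch.rand(2, 12, 128))
print(logits.shape)  # torch.Size([2, 50, 12, 1000])
```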
{"title":"Savitar: an intelligent sign language translation approach for deafness and dysphonia in the COVID-19 era","authors":"Wuyan Liang, Xiaolong Xu","doi":"10.1108/dta-09-2022-0375","DOIUrl":"https://doi.org/10.1108/dta-09-2022-0375","url":null,"abstract":"PurposeIn the COVID-19 era, sign language (SL) translation has gained attention in online learning, which evaluates the physical gestures of each student and bridges the communication gap between dysphonia and hearing people. The purpose of this paper is to devote the alignment between SL sequence and nature language sequence with high translation performance.Design/methodology/approachSL can be characterized as joint/bone location information in two-dimensional space over time, forming skeleton sequences. To encode joint, bone and their motion information, we propose a multistream hierarchy network (MHN) along with a vocab prediction network (VPN) and a joint network (JN) with the recurrent neural network transducer. The JN is used to concatenate the sequences encoded by the MHN and VPN and learn their sequence alignments.FindingsWe verify the effectiveness of the proposed approach and provide experimental results on three large-scale datasets, which show that translation accuracy is 94.96, 54.52, and 92.88 per cent, and the inference time is 18 and 1.7 times faster than listen-attend-spell network (LAS) and visual hierarchy to lexical sequence network (H2SNet) , respectively.Originality/valueIn this paper, we propose a novel framework that can fuse multimodal input (i.e. joint, bone and their motion stream) and align input streams with nature language. Moreover, the provided framework is improved by the different properties of MHN, VPN and JN. Experimental results on the three datasets demonstrate that our approaches outperform the state-of-the-art methods in terms of translation accuracy and speed.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44500769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Giving life to dead: role of WayBack Machine in recovery of dead URLs
Pub Date: 2023-07-05 | DOI: 10.1108/dta-06-2022-0242
Fayaz Ahmad Loan, A. Khan, Syed Aasif Ahmad Andrabi, Sozia Rashid Sozia, Umer Yousuf Parray
Purpose: The purpose of the present study is to identify the active and dead links among the uniform resource locators (URLs) cited as web references and to compare the effectiveness of Chrome, Google and the WayBack Machine in retrieving dead URLs.
Design/methodology/approach: The web references in Library Hi Tech from 2004 to 2008 were selected for analysis. The URLs were extracted from the articles to verify their accessibility in terms of persistence and decay, then executed directly in a web browser (Chrome), a search engine (Google) and the Internet Archive's WayBack Machine. The collected data were recorded in an Excel file and presented in tables and diagrams for further analysis.
Findings: Of the 1,083 web references, the WayBack Machine retrieved the most (786; 72.6 per cent), followed by Google (501; 46.3 per cent), with Chrome retrieving the fewest (402; 37.1 per cent). The study concludes that the WayBack Machine is the most efficient, retrieves the largest number of missing web citations and largely fulfills its mission of preserving web sources.
Originality/value: Many studies have analyzed the persistence and decay of web references; the present study is unique in comparing the dead-URL retrieval effectiveness of a web browser (Chrome), a search engine (Google) and the Internet Archive's WayBack Machine.
Research limitations/implications: The web references of a single journal, Library Hi Tech, were analyzed for only five years. A larger study across disciplines and sources may yield better results.
Practical implications: URL decay is becoming a major problem in the preservation and citation of web resources. The study offers recommendations for authors, editors, publishers, librarians and web designers to improve the persistence of web references.
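The checking workflow lends itself to automation. The sketch below is an assumed workflow, not the authors' actual procedure: it tries the live URL first and, on failure, queries the Internet Archive's public availability API for the closest snapshot. The example URL is a placeholder.

```python
# Check a web reference: live first, then the Internet Archive's
# availability endpoint (https://archive.org/wayback/available).
import requests

def check_url(url, timeout=10):
    try:
        if requests.head(url, timeout=timeout, allow_redirects=True).ok:
            return "live", url
    except requests.RequestException:
        pass  # dead or unreachable; fall back to the archive
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=timeout).json()
    snap = resp.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return "archived", snap["url"]
    return "dead", None

print(check_url("https://example.com/some/old/page"))  # placeholder URL
```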
{"title":"Giving life to dead: role of WayBack Machine in recovery of dead URLs","authors":"Fayaz Ahmad Loan, A. Khan, Syed Aasif Ahmad Andrabi, Sozia Rashid Sozia, Umer Yousuf Parray","doi":"10.1108/dta-06-2022-0242","DOIUrl":"https://doi.org/10.1108/dta-06-2022-0242","url":null,"abstract":"PurposeThe purpose of the present study is to identify the active and dead links of uniform resource locators (URLs) associated with web references and to compare the effectiveness of Chrome, Google and WayBack Machine in retrieving the dead URLs.Design/methodology/approachThe web references of the Library Hi Tech from 2004 to 2008 were selected for analysis to fulfill the set objectives. The URLs were extracted from the articles to verify their accessibility in terms of persistence and decay. The URLs were then executed directly in the internet browser (Chrome), search engine (Google) and Internet Archive (WayBack Machine). The collected data were recorded in an excel file and presented in tables/diagrams for further analysis.FindingsFrom the total of 1,083 web references, a maximum number was retrieved by the WayBack Machine (786; 72.6 per cent) followed by Google (501; 46.3 per cent) and the lowest by Chrome (402; 37.1 per cent). The study concludes that the WayBack Machine is more efficient, retrieves a maximum number of missing web citations and fulfills the mission of preservation of web sources to a larger extent.Originality/valueA good number of studies have been conducted to analyze the persistence and decay of web-references; however, the present study is unique as it compared the dead URL retrieval effectiveness of internet explorer (Chrome), search engine giant (Google) and WayBack Machine of the Internet Archive.Research limitations/implicationsThe web references of a single journal, namely, Library Hi Tech, were analyzed for 5 years only. A major study across disciplines and sources may yield better results.Practical implicationsURL decay is becoming a major problem in the preservation and citation of web resources. The study has some healthy recommendations for authors, editors, publishers, librarians and web designers to improve the persistence of web references.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42079490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance prediction of multivariable linear regression based on the optimal influencing factors for ranking aggregation in crowdsourcing task
Pub Date: 2023-07-04 | DOI: 10.1108/dta-09-2022-0346
Yuping Xing, Yongzhao Zhan
Purpose: For ranking aggregation in crowdsourcing tasks, the key issue is how to select the optimal working group with a given number of workers so as to optimize aggregation performance. Performance prediction for ranking aggregation can solve this issue effectively. However, the accuracy of such prediction varies greatly with the influencing factors selected. Although why and how data fusion methods perform well has been thoroughly discussed in the past, there is little insight into how to select influencing factors for performance prediction and how much performance can thereby be improved.
Design/methodology/approach: This paper studies performance prediction by multivariable linear regression based on the optimal influencing factors for ranking aggregation in crowdsourcing tasks. An influencing-factor optimization selection method based on stepwise regression (IFOS-SR) is proposed to screen the optimal influencing factors. A working-group selection model based on these factors is then built to select the optimal working group with a given number of workers.
Findings: The proposed approach identifies the optimal influencing factors of ranking aggregation, predicts aggregation performance more accurately than state-of-the-art methods and selects the optimal working group with a given number of workers.
Originality/value: To determine under which conditions data fusion methods improve ranking aggregation in crowdsourcing tasks, the optimal influencing factors are identified by the IFOS-SR method. The paper analyzes the behavior of the linear combination method and the CombSUM method under these factors and optimizes task assignment with a given number of workers via the working-group selection model.
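IFOS-SR itself is the paper's contribution, but the underlying idea, stepwise screening of influencing factors for a linear performance model, can be sketched with scikit-learn's forward sequential feature selector as a simplified stand-in. The data, factor count and target below are synthetic.

```python
# Forward stepwise selection of "influencing factors", then a linear
# regression fit on the selected subset. A simplified stand-in for IFOS-SR.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 8))          # candidate influencing factors
performance = (2 * factors[:, 0] - factors[:, 3]
               + rng.normal(scale=0.1, size=200))  # synthetic target

model = LinearRegression()
selector = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward"
).fit(factors, performance)
selected = selector.get_support(indices=True)
model.fit(factors[:, selected], performance)  # regression on optimal factors
print("selected factor indices:", selected)
```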
{"title":"Performance prediction of multivariable linear regression based on the optimal influencing factors for ranking aggregation in crowdsourcing task","authors":"Yuping Xing, Yongzhao Zhan","doi":"10.1108/dta-09-2022-0346","DOIUrl":"https://doi.org/10.1108/dta-09-2022-0346","url":null,"abstract":"PurposeFor ranking aggregation in crowdsourcing task, the key issue is how to select the optimal working group with a given number of workers to optimize the performance of their aggregation. Performance prediction for ranking aggregation can solve this issue effectively. However, the performance prediction effect for ranking aggregation varies greatly due to the different influencing factors selected. Although questions on why and how data fusion methods perform well have been thoroughly discussed in the past, there is a lack of insight about how to select influencing factors to predict the performance and how much can be improved of.Design/methodology/approachIn this paper, performance prediction of multivariable linear regression based on the optimal influencing factors for ranking aggregation in crowdsourcing task is studied. An influencing factor optimization selection method based on stepwise regression (IFOS-SR) is proposed to screen the optimal influencing factors. A working group selection model based on the optimal influencing factors is built to select the optimal working group with a given number of workers.FindingsThe proposed approach can identify the optimal influencing factors of ranking aggregation, predict the aggregation performance more accurately than the state-of-the-art methods and select the optimal working group with a given number of workers.Originality/valueTo find out under which condition data fusion method may lead to performance improvement for ranking aggregation in crowdsourcing task, the optimal influencing factors are identified by the IFOS-SR method. This paper presents an analysis of the behavior of the linear combination method and the CombSUM method based on the optimal influencing factors, and optimizes the task assignment with a given number of workers by the optimal working group selection method.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43369891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Impact of information consistency in online reviews on consumer behavior in the e-commerce industry: a text mining approach
Pub Date: 2023-06-12 | DOI: 10.1108/dta-08-2022-0342
Qing Li, Jaeseung Park, Jaekyeong Kim
Purpose: The current study investigates how perceived review helpfulness is affected by the simultaneous processing of information from multiple cues, in various central and peripheral cue combinations, based on the elaboration likelihood model (ELM). It develops and tests hypotheses by analyzing real-world e-commerce review data with a text-mining approach to investigate how information consistency (rating inconsistency, review consistency and text similarity) influences perceived helpfulness, and it examines the role of product type.
Design/methodology/approach: The study collected 61,900 online reviews covering 600 products in six categories from Amazon.com. The 51,927 reviews that received helpfulness votes were retained, and text mining and negative binomial regression were applied.
Findings: Rating inconsistency and text similarity negatively affect perceived helpfulness, whereas review consistency positively affects it. Moreover, the peripheral cue (rating inconsistency) positively affects perceived helpfulness in reviews of experience goods rather than search goods. However, there is insufficient evidence that product type moderates the effect of the central cues (review consistency and text similarity) on perceived helpfulness.
Originality/value: Previous studies have mainly examined numerical and textual factors, and have mostly confirmed the factors affecting perceived helpfulness independently of one another. The current study investigates how information consistency affects perceived helpfulness and finds that various combinations of cues matter. The results contribute to the review-helpfulness and ELM literature by assessing perceived helpfulness from a comprehensive perspective of consumer reviews and information consistency.
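The measurement and modeling steps can be sketched as follows; this is a hedged illustration, not the authors' code. TF-IDF cosine similarity stands in for the paper's text-similarity measure, and a negative binomial GLM from statsmodels models helpfulness-vote counts on synthetic predictors.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: a review's text similarity to its product's other reviews
# (TF-IDF cosine is an assumed stand-in for the paper's measure).
reviews = ["great battery life", "battery lasts long", "terrible screen"]
sim = cosine_similarity(TfidfVectorizer().fit_transform(reviews))
print((sim.sum(axis=1) - 1) / (len(reviews) - 1))  # mean similarity to peers

# Step 2: negative binomial regression of helpfulness votes on review cues
# (all predictors and counts below are synthetic).
rng = np.random.default_rng(1)
n = 200
consistency = rng.normal(size=n)          # review consistency (synthetic)
similarity = rng.uniform(size=n)          # text similarity (synthetic)
votes = rng.poisson(np.exp(0.5 + 0.8 * consistency - 0.6 * similarity))
X = sm.add_constant(np.column_stack([consistency, similarity]))
print(sm.GLM(votes, X, family=sm.families.NegativeBinomial()).fit().params)
```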
{"title":"Impact of information consistency in online reviews on consumer behavior in the e-commerce industry: a text mining approach","authors":"Qing Li, Jaeseung Park, Jaekyeong Kim","doi":"10.1108/dta-08-2022-0342","DOIUrl":"https://doi.org/10.1108/dta-08-2022-0342","url":null,"abstract":"PurposeThe current study investigates the impact on perceived review helpfulness of the simultaneous processing of information from multiple cues with various central and peripheral cue combinations based on the elaboration likelihood model (ELM). Thus, the current study develops and tests hypotheses by analyzing real-world review data with a text mining approach in e-commerce to investigate how information consistency (rating inconsistency, review consistency and text similarity) influences perceived helpfulness. Moreover, the role of product type is examined in online consumer reviews of perceived helpfulness.Design/methodology/approachThe current study collected 61,900 online reviews, including 600 products in six categories, from Amazon.com. Additionally, 51,927 reviews were filtered that received helpfulness votes, and then text mining and negative binomial regression were applied.FindingsThe current study found that rating inconsistency and text similarity negatively affect perceived helpfulness and that review consistency positively affects perceived helpfulness. Moreover, peripheral cues (rating inconsistency) positively affect perceived helpfulness in reviews of experience goods rather than search goods. However, there is a lack of evidence to demonstrate the hypothesis that product types moderate the effectiveness of central cues (review consistency and text similarity) on perceived helpfulness.Originality/valuePrevious studies have mainly focused on numerical and textual factors to investigate the effect on perceived helpfulness. Additionally, previous studies have independently confirmed the factors that affect perceived helpfulness. The current study investigated how information consistency affects perceived helpfulness and found that various combinations of cues significantly affect perceived helpfulness. This result contributes to the review helpfulness and ELM literature by identifying the impact on perceived helpfulness from a comprehensive perspective of consumer review and information consistency.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44046328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A hybrid learning method for distinguishing lung adenocarcinoma and squamous cell carcinoma
Pub Date: 2023-05-19 | DOI: 10.1108/dta-10-2022-0384
Anil Kumar Swain, A. Swetapadma, Jitendra Kumar Rout, Bunil Kumar Balabantaray
Purpose: The objective of the proposed work is to identify the most commonly occurring non-small cell carcinoma types, adenocarcinoma and squamous cell carcinoma, within the human population, and to reduce the false positive rate during classification.
Design/methodology/approach: A hybrid method using convolutional neural networks (CNNs), extreme gradient boosting (XGBoost) and long short-term memory networks (LSTMs) is proposed to distinguish between lung adenocarcinoma and squamous cell carcinoma. A CNN with three convolution and three max-pooling layers extracts features from non-small cell lung carcinoma images; the XGBoost algorithm selects the most important of the extracted features; and an LSTM classifies the carcinoma type. The accuracy of the proposed method is 99.57 per cent, and the false positive rate is 0.427 per cent.
Findings: The proposed CNN-XGBoost-LSTM hybrid significantly improves the distinction between adenocarcinoma and squamous cell carcinoma. Its importance can be outlined as follows: it has a very low false positive rate (0.427 per cent) and very high accuracy (99.57 per cent); CNN-based features classify lung carcinoma accurately; and it has the potential to serve as an assisting aid for doctors.
Practical implications: It can be used by doctors as a secondary tool for the analysis of non-small cell lung cancers.
Social implications: It can help rural doctors refer patients to specialists for further analysis of lung cancer.
Originality/value: The work proposes a CNN-XGBoost-LSTM hybrid in which a three-convolution, three-max-pooling CNN extracts features, XGBoost selects the optimal features and an LSTM performs the final classification.
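A compressed sketch of the three-stage pipeline follows: a three-convolution, three-max-pooling CNN extracts features, XGBoost's importance scores rank them, and an LSTM classifies over the selected features. Layer sizes, the number of retained features and all data are illustrative assumptions, not the paper's configuration.

```python
# CNN feature extraction -> XGBoost importance-based selection -> LSTM
# classification, with synthetic stand-in data throughout.
import numpy as np
import torch
import torch.nn as nn
from xgboost import XGBClassifier

cnn = nn.Sequential(                    # 3 conv + 3 max-pool feature extractor
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)
images = torch.rand(64, 1, 64, 64)      # stand-in carcinoma image tiles
labels = np.random.randint(0, 2, 64)    # 0 = adenocarcinoma, 1 = squamous
with torch.no_grad():
    feats = cnn(images).numpy()         # (64, 32*8*8) = (64, 2048)

xgb = XGBClassifier(n_estimators=50).fit(feats, labels)
top = np.argsort(xgb.feature_importances_)[-32:]   # keep 32 best features

lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)
seq = torch.tensor(feats[:, top], dtype=torch.float32).unsqueeze(-1)
out, _ = lstm(seq)                      # treat selected features as a sequence
logits = head(out[:, -1])               # classify from the last hidden state
print(logits.shape)                     # torch.Size([64, 2])
```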
{"title":"A hybrid learning method for distinguishing lung adenocarcinoma and squamous cell carcinoma","authors":"Anil Kumar Swain, A. Swetapadma, Jitendra Kumar Rout, Bunil Kumar Balabantaray","doi":"10.1108/dta-10-2022-0384","DOIUrl":"https://doi.org/10.1108/dta-10-2022-0384","url":null,"abstract":"PurposeThe objective of the proposed work is to identify the most commonly occurring non–small cell carcinoma types, such as adenocarcinoma and squamous cell carcinoma, within the human population. Another objective of the work is to reduce the false positive rate during the classification.Design/methodology/approachIn this work, a hybrid method using convolutional neural networks (CNNs), extreme gradient boosting (XGBoost) and long-short-term memory networks (LSTMs) has been proposed to distinguish between lung adenocarcinoma and squamous cell carcinoma. To extract features from non–small cell lung carcinoma images, a three-layer convolution and three-layer max-pooling-based CNN is used. A few important features have been selected from the extracted features using the XGBoost algorithm as the optimal feature. Finally, LSTM has been used for the classification of carcinoma types. The accuracy of the proposed method is 99.57 per cent, and the false positive rate is 0.427 per cent.FindingsThe proposed CNN–XGBoost–LSTM hybrid method has significantly improved the results in distinguishing between adenocarcinoma and squamous cell carcinoma. The importance of the method can be outlined as follows: It has a very low false positive rate of 0.427 per cent. It has very high accuracy, i.e. 99.57 per cent. CNN-based features are providing accurate results in classifying lung carcinoma. It has the potential to serve as an assisting aid for doctors.Practical implicationsIt can be used by doctors as a secondary tool for the analysis of non–small cell lung cancers.Social implicationsIt can help rural doctors by sending the patients to specialized doctors for more analysis of lung cancer.Originality/valueIn this work, a hybrid method using CNN, XGBoost and LSTM has been proposed to distinguish between lung adenocarcinoma and squamous cell carcinoma. A three-layer convolution and three-layer max-pooling-based CNN is used to extract features from the non–small cell lung carcinoma images. A few important features have been selected from the extracted features using the XGBoost algorithm as the optimal feature. Finally, LSTM has been used for the classification of carcinoma types.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47665975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel word-graph-based query rewriting method for question answering
Pub Date: 2023-05-18 | DOI: 10.1108/dta-05-2022-0187
Rongen Yan, Depeng Dang, Huiyu Gao, Yan Wu, Wenhui Yu
Purpose: Question answering (QA) answers questions people ask in natural language. Because of user subjectivity, the same question can be expressed in different ways, which increases the difficulty of text retrieval. The purpose of this paper is therefore to explore a new query-rewriting method for QA that integrates multiple related questions (RQs) into an optimal question. It is also important to generate a new dataset pairing each original query (OQ) with multiple RQs.
Design/methodology/approach: This study builds a new dataset, SQuAD_extend, by crawling a QA community and models the collected OQs with a word graph. Beam search then finds the best path through the graph to obtain the best question. To represent the question's features deeply, the pretrained BERT model is used to encode sentences.
Findings: The experiments show three outstanding findings: (1) answer quality improves after adding the RQs of the OQs; (2) the word graph used to model the problem and choose the optimal path is conducive to finding the best question; and (3) BERT can deeply characterize the semantics of the exact problem.
Originality/value: The proposed method uses a word graph to construct multiple candidate questions and selects the optimal path to rewrite the question, yielding better answers than the baseline. In practice, the results can help guide users to clarify their query intentions and ultimately reach the best answer.
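The core mechanics, a word lattice merged from related questions plus a beam search for the best path, can be illustrated compactly. The toy graph and count-based edge weights below are invented for illustration; the paper scores candidates with BERT-based representations rather than raw counts.

```python
# Beam search over a word graph merged from an OQ and its RQs; the
# highest-scoring path is taken as the rewritten question.
import heapq

# Each node maps to (next_word, weight) edges, e.g. merge counts.
graph = {
    "<s>": [("how", 2), ("what", 1)],
    "how": [("do", 2)], "do": [("I", 2)], "I": [("reset", 2)],
    "what": [("is", 1)], "is": [("the", 1)], "the": [("way", 1)],
    "way": [("to", 1)], "to": [("reset", 1)],
    "reset": [("passwords", 1), ("password", 2)],
    "password": [("</s>", 2)], "passwords": [("</s>", 1)],
}

def beam_search(graph, beam=3, max_len=10):
    paths = [(0.0, ["<s>"])]
    for _ in range(max_len):
        nxt = []
        for score, path in paths:
            if path[-1] == "</s>":        # finished path: carry it forward
                nxt.append((score, path))
                continue
            for word, w in graph.get(path[-1], []):
                nxt.append((score + w, path + [word]))
        paths = heapq.nlargest(beam, nxt, key=lambda p: p[0])
    best = max(paths, key=lambda p: p[0])
    return " ".join(best[1][1:-1])        # strip the sentence markers

print(beam_search(graph))  # "how do I reset password"
```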
{"title":"A novel word-graph-based query rewriting method for question answering","authors":"Rongen Yan, Depeng Dang, Huiyu Gao, Yan Wu, Wenhui Yu","doi":"10.1108/dta-05-2022-0187","DOIUrl":"https://doi.org/10.1108/dta-05-2022-0187","url":null,"abstract":"PurposeQuestion answering (QA) answers the questions asked by people in the form of natural language. In the QA, due to the subjectivity of users, the questions they query have different expressions, which increases the difficulty of text retrieval. Therefore, the purpose of this paper is to explore new query rewriting method for QA that integrates multiple related questions (RQs) to form an optimal question. Moreover, it is important to generate a new dataset of the original query (OQ) with multiple RQs.Design/methodology/approachThis study collects a new dataset SQuAD_extend by crawling the QA community and uses word-graph to model the collected OQs. Next, Beam search finds the best path to get the best question. To deeply represent the features of the question, pretrained model BERT is used to model sentences.FindingsThe experimental results show three outstanding findings. (1) The quality of the answers is better after adding the RQs of the OQs. (2) The word-graph that is used to model the problem and choose the optimal path is conducive to finding the best question. (3) Finally, BERT can deeply characterize the semantics of the exact problem.Originality/valueThe proposed method can use word-graph to construct multiple questions and select the optimal path for rewriting the question, and the quality of answers is better than the baseline. In practice, the research results can help guide users to clarify their query intentions and finally achieve the best answer.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"1 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41453485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A model of image retrieval based on KD-Tree Random Forest
Pub Date: 2023-05-05 | DOI: 10.1108/dta-06-2022-0247
Nguyen Thi Dinh, Nguyen Vu Uyen Nhi, T. Le, Thanh The Van
Purpose: The problems of image retrieval and image description arise in many fields. This paper proposes a model of content-based image retrieval and image content extraction based on the KD-Tree structure.
Design/methodology/approach: A Random Forest is built on a balanced multibranch KD-Tree structure to classify the objects in each image. From this, a KD-Tree structure generated by the Random Forest retrieves a set of images similar to an input image. A KD-Tree structure is also applied to determine relationship words at the leaves, extracting the relationships between objects in the input image. The content of an input image is then described from the class names and the relationships between objects.
Findings: A model of image retrieval and image content extraction was built on the proposed theoretical basis and evaluated on the multi-object image datasets Microsoft COCO and Flickr, achieving average image retrieval precision of 0.9028 and 0.9163, respectively. The experimental results were compared with other works on the same datasets to demonstrate the effectiveness of the proposed method.
Originality/value: A balanced multibranch KD-Tree structure, extending the original KD-Tree, is applied to relationship classification. A KD-Tree Random Forest is then built to improve classifier performance and retrieve a set of images similar to an input image, while the image content is described by combining class names and the relationships between objects.
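The retrieval step can be approximated with an off-the-shelf KD-tree. Note that the paper builds a custom balanced multibranch KD-Tree inside a Random Forest; scikit-learn's KDTree below is only a simplified stand-in, and the 128-dimensional feature vectors are random placeholders for real image descriptors.

```python
# Standard KD-tree nearest-neighbour search over image feature vectors,
# as a simplified stand-in for the paper's KD-Tree Random Forest retrieval.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))  # stand-in descriptors of the corpus
tree = KDTree(features)

query = rng.normal(size=(1, 128))        # descriptor of the input image
dist, idx = tree.query(query, k=5)       # 5 most similar images
print(idx[0], dist[0])
```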
{"title":"A model of image retrieval based on KD-Tree Random Forest","authors":"Nguyen Thi Dinh, Nguyen Vu Uyen Nhi, T. Le, Thanh The Van","doi":"10.1108/dta-06-2022-0247","DOIUrl":"https://doi.org/10.1108/dta-06-2022-0247","url":null,"abstract":"PurposeThe problem of image retrieval and image description exists in various fields. In this paper, a model of content-based image retrieval and image content extraction based on the KD-Tree structure was proposed.Design/methodology/approachA Random Forest structure was built to classify the objects on each image on the basis of the balanced multibranch KD-Tree structure. From that purpose, a KD-Tree structure was generated by the Random Forest to retrieve a set of similar images for an input image. A KD-Tree structure is applied to determine a relationship word at leaves to extract the relationship between objects on an input image. An input image content is described based on class names and relationships between objects.FindingsA model of image retrieval and image content extraction was proposed based on the proposed theoretical basis; simultaneously, the experiment was built on multi-object image datasets including Microsoft COCO and Flickr with an average image retrieval precision of 0.9028 and 0.9163, respectively. The experimental results were compared with those of other works on the same image dataset to demonstrate the effectiveness of the proposed method.Originality/valueA balanced multibranch KD-Tree structure was built to apply to relationship classification on the basis of the original KD-Tree structure. Then, KD-Tree Random Forest was built to improve the classifier performance and retrieve a set of similar images for an input image. Concurrently, the image content was described in the process of combining class names and relationships between objects.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42664223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying business information through deep learning: analyzing the tender documents of an Internet-based logistics bidding platform
Pub Date: 2023-05-04 | DOI: 10.1108/dta-08-2022-0308
Yingwen Yu, Jing Ma
Purpose: Tender documents, an essential data source for internet-based logistics tendering platforms, contain massive fine-grained data, ranging from information on the tenderee to shipping locations and shipping items. Automated information extraction in this area is, however, under-researched, making the extraction process time- and effort-consuming. For Chinese logistics tender entities in particular, existing named entity recognition (NER) solutions are mostly unsuitable, as the documents involve domain-specific terminologies and distinct semantic features.
Design/methodology/approach: To tackle this problem, this paper proposes a novel lattice long short-term memory (LSTM) model, combining a variant contextual feature representation with a conditional random field (CRF) layer, to identify valuable entities in logistics tender documents. Instead of traditional word embeddings, the model uses the pretrained Bidirectional Encoder Representations from Transformers (BERT) model as input to enrich the contextual feature representation. The Lattice-LSTM then exploits both character and word information to avoid segmentation errors.
Findings: The proposed model is verified on a Chinese logistics tender named-entity corpus, and the results show that it outperforms other mainstream NER models on this corpus. The model underpins the automatic extraction of logistics tender information, enabling logistics companies to perceive ever-changing market trends and make far-sighted decisions.
Originality/value: (1) A practical model for logistics tender NER is proposed. By fine-tuning BERT on the downstream task with a small amount of data, the model outperforms existing models in the experiments. To the best of the authors' knowledge, this is the first study to extract named entities from Chinese logistics tender documents. (2) A real logistics tender corpus for practical use is constructed, and a program for online processing of real logistics tender documents is developed. The authors believe the model will help logistics companies convert unstructured documents into structured data and thus perceive ever-changing market trends to make far-sighted decisions.
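The BERT-as-input stage is easy to reproduce in outline. The sketch below extracts contextual character representations with a pretrained Chinese BERT and attaches an untrained linear tag head; the paper's Lattice-LSTM and CRF layers are omitted for brevity, and the example sentence and tag count are assumptions. It requires the transformers package, and the model weights download on first use.

```python
# Pretrained BERT supplies contextual character representations for a
# Chinese sentence; a linear head then emits per-character tag scores.
# (The paper stacks a Lattice-LSTM and a CRF on top; omitted here.)
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "货物从上海运往北京"   # "goods shipped from Shanghai to Beijing"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)

num_tags = 7                        # e.g. BIO tags for 3 assumed entity types
tag_head = nn.Linear(768, num_tags)
print(tag_head(hidden).argmax(-1))  # predicted tag ids per character
```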
{"title":"Identifying business information through deep learning: analyzing the tender documents of an Internet-based logistics bidding platform","authors":"Yingwen Yu, Jing Ma","doi":"10.1108/dta-08-2022-0308","DOIUrl":"https://doi.org/10.1108/dta-08-2022-0308","url":null,"abstract":"PurposeThe tender documents, an essential data source for internet-based logistics tendering platforms, incorporate massive fine-grained data, ranging from information on tenderee, shipping location and shipping items. Automated information extraction in this area is, however, under-researched, making the extraction process a time- and effort-consuming one. For Chinese logistics tender entities, in particular, existing named entity recognition (NER) solutions are mostly unsuitable as they involve domain-specific terminologies and possess different semantic features.Design/methodology/approachTo tackle this problem, a novel lattice long short-term memory (LSTM) model, combining a variant contextual feature representation and a conditional random field (CRF) layer, is proposed in this paper for identifying valuable entities from logistic tender documents. Instead of traditional word embedding, the proposed model uses the pretrained Bidirectional Encoder Representations from Transformers (BERT) model as input to augment the contextual feature representation. Subsequently, with the Lattice-LSTM model, the information of characters and words is effectively utilized to avoid error segmentation.FindingsThe proposed model is then verified by the Chinese logistic tender named entity corpus. Moreover, the results suggest that the proposed model excels in the logistics tender corpus over other mainstream NER models. The proposed model underpins the automatic extraction of logistics tender information, enabling logistic companies to perceive the ever-changing market trends and make far-sighted logistic decisions.Originality/value(1) A practical model for logistic tender NER is proposed in the manuscript. By employing and fine-tuning BERT into the downstream task with a small amount of data, the experiment results show that the model has a better performance than other existing models. This is the first study, to the best of the authors' knowledge, to extract named entities from Chinese logistic tender documents. (2) A real logistic tender corpus for practical use is constructed and a program of the model for online-processing real logistic tender documents is developed in this work. The authors believe that the model will facilitate logistic companies in converting unstructured documents to structured data and further perceive the ever-changing market trends to make far-sighted logistic decisions.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45262725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Risk assessment in machine learning enhanced failure mode and effects analysis
Pub Date: 2023-05-04 | DOI: 10.1108/dta-06-2022-0232
Zeping Wang, Hengte Du, Liangyan Tao, S. Javed
Purpose: Traditional failure mode and effects analysis (FMEA) has limitations, such as neglecting relevant historical data, subjective rating numbers and the limited rationality and accuracy of the Risk Priority Number. This study proposes a machine learning-enhanced FMEA (ML-FMEA) method based on a popular machine learning tool, the Waikato Environment for Knowledge Analysis (WEKA).
Design/methodology/approach: The work uses collected FMEA historical data to predict the probability of component/product failure risk by machine learning with several commonly used classifiers. Ten-fold cross-validation is employed to compare the correct classification rates of ML-FMEA across classifiers, and prediction error is estimated by repeated experiments with different random seeds under varying initialization settings. Finally, the submersible pump case of Bhattacharjee et al. (2020) is used to test the method's performance.
Findings: ML-FMEA, based on most of the commonly used classifiers, outperforms the Bhattacharjee model. For example, ML-FMEA based on Random Committee improves the correct classification rate from 77.47 to 90.09 per cent and the area under the receiver operating characteristic (ROC) curve from 80.9 to 91.8 per cent.
Originality/value: The proposed method not only enables decision-makers to use historical failure data to predict failure risk but may also pave a new way for applying machine learning techniques in FMEA.
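The evaluation protocol, 10-fold cross-validation of several off-the-shelf classifiers, is easy to replicate. The paper works in WEKA; the sketch below substitutes scikit-learn classifiers and synthetic data, so it illustrates the protocol rather than reproducing the paper's results.

```python
# 10-fold cross-validation of several classifiers on synthetic stand-in
# FMEA records (in place of the paper's WEKA workflow).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                  ("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("random forest", RandomForestClassifier(random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {acc.mean():.3f} +/- {acc.std():.3f}")
```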
{"title":"Risk assessment in machine learning enhanced failure mode and effects analysis","authors":"Zeping Wang, Hengte Du, Liangyan Tao, S. Javed","doi":"10.1108/dta-06-2022-0232","DOIUrl":"https://doi.org/10.1108/dta-06-2022-0232","url":null,"abstract":"PurposeThe traditional failure mode and effect analysis (FMEA) has some limitations, such as the neglect of relevant historical data, subjective use of rating numbering and the less rationality and accuracy of the Risk Priority Number. The current study proposes a machine learning–enhanced FMEA (ML-FMEA) method based on a popular machine learning tool, Waikato environment for knowledge analysis (WEKA).Design/methodology/approachThis work uses the collected FMEA historical data to predict the probability of component/product failure risk by machine learning based on different commonly used classifiers. To compare the correct classification rate of ML-FMEA based on different classifiers, the 10-fold cross-validation is employed. Moreover, the prediction error is estimated by repeated experiments with different random seeds under varying initialization settings. Finally, the case of the submersible pump in Bhattacharjee et al. (2020) is utilized to test the performance of the proposed method.FindingsThe results show that ML-FMEA, based on most of the commonly used classifiers, outperforms the Bhattacharjee model. For example, the ML-FMEA based on Random Committee improves the correct classification rate from 77.47 to 90.09 per cent and area under the curve of receiver operating characteristic curve (ROC) from 80.9 to 91.8 per cent, respectively.Originality/valueThe proposed method not only enables the decision-maker to use the historical failure data and predict the probability of the risk of failure but also may pave a new way for the application of machine learning techniques in FMEA.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47381854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}