Measuring land lot shapes for property valuation
Pub Date: 2023-08-08 | DOI: 10.1108/dta-12-2022-0461
Chan-Jae Lee
Purpose: Unstructured data such as images have long resisted use in property valuation; instead, structured data in tabular format are commonly employed to estimate property prices. This study quantifies the shape of land lots and uses the resulting output as an input variable for subsequent land valuation models.
Design/methodology/approach: Imagery data containing land lot shapes are fed into a convolutional neural network, which classifies each lot into two categories, regular and irregular. The intermediate output (a regularity score) is then used in four downstream models to estimate land prices: random forest, gradient boosting, support vector machine and regression models.
Findings: Quantifying land lot shapes and exploiting them in valuation improved the predictive accuracy of all subsequent models.
Originality/value: The findings are expected to promote the adoption of elusive price determinants, such as the shape of a land lot, the appearance of a house and the landscape of a neighborhood, in property appraisal practice.
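The two-stage pipeline is straightforward to prototype. The sketch below is a minimal illustration, not the paper's architecture: an untrained small CNN's sigmoid output stands in for the regularity score, which is appended to tabular features for a downstream random forest. All shapes, dimensions and data are stand-ins.

```python
# Minimal sketch: a small CNN yields a "regularity score" in [0, 1] for each
# rasterized lot shape; the score is appended to tabular features for a
# downstream random-forest price model. All sizes and data are assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestRegressor

class RegularityNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16 * 16, 1))

    def forward(self, x):                       # x: (N, 1, 64, 64) lot masks
        return torch.sigmoid(self.head(self.features(x)))

net = RegularityNet()
lot_images = torch.rand(100, 1, 64, 64)         # stand-in rasterized lots
with torch.no_grad():
    score = net(lot_images).numpy()             # (100, 1) regularity scores

tabular = np.random.rand(100, 5)                # stand-in structured features
X = np.hstack([tabular, score])                 # augment with the CNN output
y = np.random.rand(100)                         # stand-in land prices
RandomForestRegressor(n_estimators=100).fit(X, y)
```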
{"title":"Measuring land lot shapes for property valuation","authors":"Chan-Jae Lee","doi":"10.1108/dta-12-2022-0461","DOIUrl":"https://doi.org/10.1108/dta-12-2022-0461","url":null,"abstract":"PurposeUnstructured data such as images have defied usage in property valuation for a long time. Instead, structured data in tabular format are commonly employed to estimate property prices. This study attempts to quantify the shape of land lots and uses the resultant output as an input variable for subsequent land valuation models.Design/methodology/approachImagery data containing land lot shapes are fed into a convolutional neural network, and the shape of land lots is classified into two categories, regular and irregular-shaped. Then, the intermediate output (regularity score) is utilized in four downstream models to estimate land prices: random forest, gradient boosting, support vector machine and regression models.FindingsQuantification of the land lot shapes and their exploitation in valuation led to an improvement in the predictive accuracy for all subsequent models.Originality/valueThe study findings are expected to promote the adoption of elusive price determinants such as the shape of a land lot, appearance of a house and the landscape of a neighborhood in property appraisal practices.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45251306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Savitar: an intelligent sign language translation approach for deafness and dysphonia in the COVID-19 era
Pub Date: 2023-07-07 | DOI: 10.1108/dta-09-2022-0375
Wuyan Liang, Xiaolong Xu
Purpose: In the COVID-19 era, sign language (SL) translation has gained attention in online learning, where it evaluates the physical gestures of each student and bridges the communication gap between people with dysphonia and hearing people. The purpose of this paper is to align SL sequences with natural-language sequences at high translation performance.
Design/methodology/approach: SL can be characterized as joint/bone location information in two-dimensional space over time, forming skeleton sequences. To encode joints, bones and their motion information, the authors propose a multistream hierarchy network (MHN) along with a vocab prediction network (VPN) and a joint network (JN) built on the recurrent neural network transducer. The JN concatenates the sequences encoded by the MHN and VPN and learns their alignments.
Findings: The effectiveness of the proposed approach is verified on three large-scale datasets: translation accuracy reaches 94.96, 54.52 and 92.88 per cent, and inference is 18 and 1.7 times faster than the listen-attend-spell network (LAS) and the visual hierarchy to lexical sequence network (H2SNet), respectively.
Originality/value: This paper proposes a novel framework that fuses multimodal input (i.e. joint, bone and motion streams) and aligns the input streams with natural language. Moreover, the framework benefits from the complementary properties of the MHN, VPN and JN. Experimental results on the three datasets demonstrate that the approach outperforms state-of-the-art methods in both translation accuracy and speed.
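The MHN/VPN/JN internals are specific to the paper, but the joint network follows the standard recurrent-neural-network-transducer pattern: encoder and prediction-network outputs are broadcast over the time and label axes, concatenated and projected to the vocabulary. Below is a minimal PyTorch sketch of that generic joiner, with illustrative dimensions; it is an assumption about the general pattern, not the authors' implementation.

```python
# Generic RNN-transducer joiner: the encoder output (standing in for the MHN
# over skeleton frames) and the prediction-network output (standing in for
# the VPN) are concatenated per (time, label) pair and projected to vocab
# logits. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class Joiner(nn.Module):
    def __init__(self, enc_dim=256, pred_dim=128, vocab=1000):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, vocab)

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim) skeleton-stream encodings
        # pred: (B, U, pred_dim) label-stream encodings
        T, U = enc.size(1), pred.size(1)
        enc = enc.unsqueeze(2).expand(-1, -1, U, -1)      # (B, T, U, enc_dim)
        pred = pred.unsqueeze(1).expand(-1, T, -1, -1)    # (B, T, U, pred_dim)
        return self.proj(torch.cat([enc, pred], dim=-1))  # (B, T, U, vocab)

joiner = Joiner()
logits = joiner(torch.rand(2, 50, 256), torch.rand(2, 12, 128))
print(logits.shape)  # torch.Size([2, 50, 12, 1000])
```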
{"title":"Savitar: an intelligent sign language translation approach for deafness and dysphonia in the COVID-19 era","authors":"Wuyan Liang, Xiaolong Xu","doi":"10.1108/dta-09-2022-0375","DOIUrl":"https://doi.org/10.1108/dta-09-2022-0375","url":null,"abstract":"PurposeIn the COVID-19 era, sign language (SL) translation has gained attention in online learning, which evaluates the physical gestures of each student and bridges the communication gap between dysphonia and hearing people. The purpose of this paper is to devote the alignment between SL sequence and nature language sequence with high translation performance.Design/methodology/approachSL can be characterized as joint/bone location information in two-dimensional space over time, forming skeleton sequences. To encode joint, bone and their motion information, we propose a multistream hierarchy network (MHN) along with a vocab prediction network (VPN) and a joint network (JN) with the recurrent neural network transducer. The JN is used to concatenate the sequences encoded by the MHN and VPN and learn their sequence alignments.FindingsWe verify the effectiveness of the proposed approach and provide experimental results on three large-scale datasets, which show that translation accuracy is 94.96, 54.52, and 92.88 per cent, and the inference time is 18 and 1.7 times faster than listen-attend-spell network (LAS) and visual hierarchy to lexical sequence network (H2SNet) , respectively.Originality/valueIn this paper, we propose a novel framework that can fuse multimodal input (i.e. joint, bone and their motion stream) and align input streams with nature language. Moreover, the provided framework is improved by the different properties of MHN, VPN and JN. Experimental results on the three datasets demonstrate that our approaches outperform the state-of-the-art methods in terms of translation accuracy and speed.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44500769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Giving life to dead: role of WayBack Machine in recovery of dead URLs
Pub Date: 2023-07-05 | DOI: 10.1108/dta-06-2022-0242
Fayaz Ahmad Loan, A. Khan, Syed Aasif Ahmad Andrabi, Sozia Rashid Sozia, Umer Yousuf Parray
Purpose: The purpose of the present study is to identify the active and dead links among the uniform resource locators (URLs) cited as web references and to compare the effectiveness of Chrome, Google and the WayBack Machine in retrieving dead URLs.
Design/methodology/approach: The web references in Library Hi Tech from 2004 to 2008 were selected for analysis. The URLs were extracted from the articles to verify their accessibility in terms of persistence and decay, then executed directly in a web browser (Chrome), a search engine (Google) and the Internet Archive's WayBack Machine. The collected data were recorded in an Excel file and presented in tables and diagrams for further analysis.
Findings: Of the 1,083 web references, the WayBack Machine retrieved the most (786; 72.6 per cent), followed by Google (501; 46.3 per cent), with Chrome retrieving the fewest (402; 37.1 per cent). The study concludes that the WayBack Machine is the most efficient, retrieves the largest number of missing web citations and largely fulfills its mission of preserving web sources.
Originality/value: Many studies have analyzed the persistence and decay of web references; the present study is unique in comparing the dead-URL retrieval effectiveness of a web browser (Chrome), a search engine (Google) and the Internet Archive's WayBack Machine.
Research limitations/implications: The web references of a single journal, Library Hi Tech, were analyzed for only five years. A larger study across disciplines and sources may yield better results.
Practical implications: URL decay is becoming a major problem in the preservation and citation of web resources. The study offers recommendations for authors, editors, publishers, librarians and web designers to improve the persistence of web references.
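The checking workflow lends itself to automation. The sketch below is an assumed workflow, not the authors' actual procedure: it tries the live URL first and, on failure, queries the Internet Archive's public availability API for the closest snapshot. The example URL is a placeholder.

```python
# Check a web reference: live first, then the Internet Archive's
# availability endpoint (https://archive.org/wayback/available).
import requests

def check_url(url, timeout=10):
    try:
        if requests.head(url, timeout=timeout, allow_redirects=True).ok:
            return "live", url
    except requests.RequestException:
        pass  # dead or unreachable; fall back to the archive
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=timeout).json()
    snap = resp.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return "archived", snap["url"]
    return "dead", None

print(check_url("https://example.com/some/old/page"))  # placeholder URL
```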
{"title":"Giving life to dead: role of WayBack Machine in recovery of dead URLs","authors":"Fayaz Ahmad Loan, A. Khan, Syed Aasif Ahmad Andrabi, Sozia Rashid Sozia, Umer Yousuf Parray","doi":"10.1108/dta-06-2022-0242","DOIUrl":"https://doi.org/10.1108/dta-06-2022-0242","url":null,"abstract":"PurposeThe purpose of the present study is to identify the active and dead links of uniform resource locators (URLs) associated with web references and to compare the effectiveness of Chrome, Google and WayBack Machine in retrieving the dead URLs.Design/methodology/approachThe web references of the Library Hi Tech from 2004 to 2008 were selected for analysis to fulfill the set objectives. The URLs were extracted from the articles to verify their accessibility in terms of persistence and decay. The URLs were then executed directly in the internet browser (Chrome), search engine (Google) and Internet Archive (WayBack Machine). The collected data were recorded in an excel file and presented in tables/diagrams for further analysis.FindingsFrom the total of 1,083 web references, a maximum number was retrieved by the WayBack Machine (786; 72.6 per cent) followed by Google (501; 46.3 per cent) and the lowest by Chrome (402; 37.1 per cent). The study concludes that the WayBack Machine is more efficient, retrieves a maximum number of missing web citations and fulfills the mission of preservation of web sources to a larger extent.Originality/valueA good number of studies have been conducted to analyze the persistence and decay of web-references; however, the present study is unique as it compared the dead URL retrieval effectiveness of internet explorer (Chrome), search engine giant (Google) and WayBack Machine of the Internet Archive.Research limitations/implicationsThe web references of a single journal, namely, Library Hi Tech, were analyzed for 5 years only. A major study across disciplines and sources may yield better results.Practical implicationsURL decay is becoming a major problem in the preservation and citation of web resources. The study has some healthy recommendations for authors, editors, publishers, librarians and web designers to improve the persistence of web references.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42079490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance prediction of multivariable linear regression based on the optimal influencing factors for ranking aggregation in crowdsourcing task
Pub Date: 2023-07-04 | DOI: 10.1108/dta-09-2022-0346
Yuping Xing, Yongzhao Zhan
Purpose: For ranking aggregation in crowdsourcing tasks, the key issue is how to select the optimal working group with a given number of workers so as to optimize aggregation performance. Performance prediction for ranking aggregation can solve this issue effectively. However, the accuracy of such prediction varies greatly with the influencing factors selected. Although why and how data fusion methods perform well has been thoroughly discussed in the past, there is little insight into how to select influencing factors for performance prediction and how much performance can thereby be improved.
Design/methodology/approach: This paper studies performance prediction by multivariable linear regression based on the optimal influencing factors for ranking aggregation in crowdsourcing tasks. An influencing-factor optimization selection method based on stepwise regression (IFOS-SR) is proposed to screen the optimal influencing factors. A working-group selection model based on these factors is then built to select the optimal working group with a given number of workers.
Findings: The proposed approach identifies the optimal influencing factors of ranking aggregation, predicts aggregation performance more accurately than state-of-the-art methods and selects the optimal working group with a given number of workers.
Originality/value: To determine under which conditions data fusion methods improve ranking aggregation in crowdsourcing tasks, the optimal influencing factors are identified by the IFOS-SR method. The paper analyzes the behavior of the linear combination method and the CombSUM method under these factors and optimizes task assignment with a given number of workers via the working-group selection model.
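IFOS-SR itself is the paper's contribution, but the underlying idea, stepwise screening of influencing factors for a linear performance model, can be sketched with scikit-learn's forward sequential feature selector as a simplified stand-in. The data, factor count and target below are synthetic.

```python
# Forward stepwise selection of "influencing factors", then a linear
# regression fit on the selected subset. A simplified stand-in for IFOS-SR.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 8))          # candidate influencing factors
performance = (2 * factors[:, 0] - factors[:, 3]
               + rng.normal(scale=0.1, size=200))  # synthetic target

model = LinearRegression()
selector = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward"
).fit(factors, performance)
selected = selector.get_support(indices=True)
model.fit(factors[:, selected], performance)  # regression on optimal factors
print("selected factor indices:", selected)
```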
{"title":"Performance prediction of multivariable linear regression based on the optimal influencing factors for ranking aggregation in crowdsourcing task","authors":"Yuping Xing, Yongzhao Zhan","doi":"10.1108/dta-09-2022-0346","DOIUrl":"https://doi.org/10.1108/dta-09-2022-0346","url":null,"abstract":"PurposeFor ranking aggregation in crowdsourcing task, the key issue is how to select the optimal working group with a given number of workers to optimize the performance of their aggregation. Performance prediction for ranking aggregation can solve this issue effectively. However, the performance prediction effect for ranking aggregation varies greatly due to the different influencing factors selected. Although questions on why and how data fusion methods perform well have been thoroughly discussed in the past, there is a lack of insight about how to select influencing factors to predict the performance and how much can be improved of.Design/methodology/approachIn this paper, performance prediction of multivariable linear regression based on the optimal influencing factors for ranking aggregation in crowdsourcing task is studied. An influencing factor optimization selection method based on stepwise regression (IFOS-SR) is proposed to screen the optimal influencing factors. A working group selection model based on the optimal influencing factors is built to select the optimal working group with a given number of workers.FindingsThe proposed approach can identify the optimal influencing factors of ranking aggregation, predict the aggregation performance more accurately than the state-of-the-art methods and select the optimal working group with a given number of workers.Originality/valueTo find out under which condition data fusion method may lead to performance improvement for ranking aggregation in crowdsourcing task, the optimal influencing factors are identified by the IFOS-SR method. This paper presents an analysis of the behavior of the linear combination method and the CombSUM method based on the optimal influencing factors, and optimizes the task assignment with a given number of workers by the optimal working group selection method.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43369891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Impact of information consistency in online reviews on consumer behavior in the e-commerce industry: a text mining approach
Pub Date: 2023-06-12 | DOI: 10.1108/dta-08-2022-0342
Qing Li, Jaeseung Park, Jaekyeong Kim
Purpose: The current study investigates how perceived review helpfulness is affected by the simultaneous processing of information from multiple cues, in various central and peripheral cue combinations, based on the elaboration likelihood model (ELM). It develops and tests hypotheses by analyzing real-world e-commerce review data with a text-mining approach to investigate how information consistency (rating inconsistency, review consistency and text similarity) influences perceived helpfulness, and it examines the role of product type.
Design/methodology/approach: The study collected 61,900 online reviews covering 600 products in six categories from Amazon.com. The 51,927 reviews that received helpfulness votes were retained, and text mining and negative binomial regression were applied.
Findings: Rating inconsistency and text similarity negatively affect perceived helpfulness, whereas review consistency positively affects it. Moreover, the peripheral cue (rating inconsistency) positively affects perceived helpfulness in reviews of experience goods rather than search goods. However, there is insufficient evidence that product type moderates the effect of the central cues (review consistency and text similarity) on perceived helpfulness.
Originality/value: Previous studies have mainly examined numerical and textual factors, and have mostly confirmed the factors affecting perceived helpfulness independently of one another. The current study investigates how information consistency affects perceived helpfulness and finds that various combinations of cues matter. The results contribute to the review-helpfulness and ELM literature by assessing perceived helpfulness from a comprehensive perspective of consumer reviews and information consistency.
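The measurement and modeling steps can be sketched as follows; this is a hedged illustration, not the authors' code. TF-IDF cosine similarity stands in for the paper's text-similarity measure, and a negative binomial GLM from statsmodels models helpfulness-vote counts on synthetic predictors.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: a review's text similarity to its product's other reviews
# (TF-IDF cosine is an assumed stand-in for the paper's measure).
reviews = ["great battery life", "battery lasts long", "terrible screen"]
sim = cosine_similarity(TfidfVectorizer().fit_transform(reviews))
print((sim.sum(axis=1) - 1) / (len(reviews) - 1))  # mean similarity to peers

# Step 2: negative binomial regression of helpfulness votes on review cues
# (all predictors and counts below are synthetic).
rng = np.random.default_rng(1)
n = 200
consistency = rng.normal(size=n)          # review consistency (synthetic)
similarity = rng.uniform(size=n)          # text similarity (synthetic)
votes = rng.poisson(np.exp(0.5 + 0.8 * consistency - 0.6 * similarity))
X = sm.add_constant(np.column_stack([consistency, similarity]))
print(sm.GLM(votes, X, family=sm.families.NegativeBinomial()).fit().params)
```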
{"title":"Impact of information consistency in online reviews on consumer behavior in the e-commerce industry: a text mining approach","authors":"Qing Li, Jaeseung Park, Jaekyeong Kim","doi":"10.1108/dta-08-2022-0342","DOIUrl":"https://doi.org/10.1108/dta-08-2022-0342","url":null,"abstract":"PurposeThe current study investigates the impact on perceived review helpfulness of the simultaneous processing of information from multiple cues with various central and peripheral cue combinations based on the elaboration likelihood model (ELM). Thus, the current study develops and tests hypotheses by analyzing real-world review data with a text mining approach in e-commerce to investigate how information consistency (rating inconsistency, review consistency and text similarity) influences perceived helpfulness. Moreover, the role of product type is examined in online consumer reviews of perceived helpfulness.Design/methodology/approachThe current study collected 61,900 online reviews, including 600 products in six categories, from Amazon.com. Additionally, 51,927 reviews were filtered that received helpfulness votes, and then text mining and negative binomial regression were applied.FindingsThe current study found that rating inconsistency and text similarity negatively affect perceived helpfulness and that review consistency positively affects perceived helpfulness. Moreover, peripheral cues (rating inconsistency) positively affect perceived helpfulness in reviews of experience goods rather than search goods. However, there is a lack of evidence to demonstrate the hypothesis that product types moderate the effectiveness of central cues (review consistency and text similarity) on perceived helpfulness.Originality/valuePrevious studies have mainly focused on numerical and textual factors to investigate the effect on perceived helpfulness. Additionally, previous studies have independently confirmed the factors that affect perceived helpfulness. The current study investigated how information consistency affects perceived helpfulness and found that various combinations of cues significantly affect perceived helpfulness. This result contributes to the review helpfulness and ELM literature by identifying the impact on perceived helpfulness from a comprehensive perspective of consumer review and information consistency.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44046328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A hybrid learning method for distinguishing lung adenocarcinoma and squamous cell carcinoma
Pub Date: 2023-05-19 | DOI: 10.1108/dta-10-2022-0384
Anil Kumar Swain, A. Swetapadma, Jitendra Kumar Rout, Bunil Kumar Balabantaray
Purpose: The objective of the proposed work is to identify the most commonly occurring non-small cell carcinoma types, adenocarcinoma and squamous cell carcinoma, within the human population, and to reduce the false positive rate during classification.
Design/methodology/approach: A hybrid method using convolutional neural networks (CNNs), extreme gradient boosting (XGBoost) and long short-term memory networks (LSTMs) is proposed to distinguish between lung adenocarcinoma and squamous cell carcinoma. A CNN with three convolution and three max-pooling layers extracts features from non-small cell lung carcinoma images; the XGBoost algorithm selects the most important of the extracted features; and an LSTM classifies the carcinoma type. The accuracy of the proposed method is 99.57 per cent, and the false positive rate is 0.427 per cent.
Findings: The proposed CNN-XGBoost-LSTM hybrid significantly improves the distinction between adenocarcinoma and squamous cell carcinoma. Its importance can be outlined as follows: it has a very low false positive rate (0.427 per cent) and very high accuracy (99.57 per cent); CNN-based features classify lung carcinoma accurately; and it has the potential to serve as an assisting aid for doctors.
Practical implications: It can be used by doctors as a secondary tool for the analysis of non-small cell lung cancers.
Social implications: It can help rural doctors refer patients to specialists for further analysis of lung cancer.
Originality/value: The work proposes a CNN-XGBoost-LSTM hybrid in which a three-convolution, three-max-pooling CNN extracts features, XGBoost selects the optimal features and an LSTM performs the final classification.
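A compressed sketch of the three-stage pipeline follows: a three-convolution, three-max-pooling CNN extracts features, XGBoost's importance scores rank them, and an LSTM classifies over the selected features. Layer sizes, the number of retained features and all data are illustrative assumptions, not the paper's configuration.

```python
# CNN feature extraction -> XGBoost importance-based selection -> LSTM
# classification, with synthetic stand-in data throughout.
import numpy as np
import torch
import torch.nn as nn
from xgboost import XGBClassifier

cnn = nn.Sequential(                    # 3 conv + 3 max-pool feature extractor
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)
images = torch.rand(64, 1, 64, 64)      # stand-in carcinoma image tiles
labels = np.random.randint(0, 2, 64)    # 0 = adenocarcinoma, 1 = squamous
with torch.no_grad():
    feats = cnn(images).numpy()         # (64, 32*8*8) = (64, 2048)

xgb = XGBClassifier(n_estimators=50).fit(feats, labels)
top = np.argsort(xgb.feature_importances_)[-32:]   # keep 32 best features

lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)
seq = torch.tensor(feats[:, top], dtype=torch.float32).unsqueeze(-1)
out, _ = lstm(seq)                      # treat selected features as a sequence
logits = head(out[:, -1])               # classify from the last hidden state
print(logits.shape)                     # torch.Size([64, 2])
```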
{"title":"A hybrid learning method for distinguishing lung adenocarcinoma and squamous cell carcinoma","authors":"Anil Kumar Swain, A. Swetapadma, Jitendra Kumar Rout, Bunil Kumar Balabantaray","doi":"10.1108/dta-10-2022-0384","DOIUrl":"https://doi.org/10.1108/dta-10-2022-0384","url":null,"abstract":"PurposeThe objective of the proposed work is to identify the most commonly occurring non–small cell carcinoma types, such as adenocarcinoma and squamous cell carcinoma, within the human population. Another objective of the work is to reduce the false positive rate during the classification.Design/methodology/approachIn this work, a hybrid method using convolutional neural networks (CNNs), extreme gradient boosting (XGBoost) and long-short-term memory networks (LSTMs) has been proposed to distinguish between lung adenocarcinoma and squamous cell carcinoma. To extract features from non–small cell lung carcinoma images, a three-layer convolution and three-layer max-pooling-based CNN is used. A few important features have been selected from the extracted features using the XGBoost algorithm as the optimal feature. Finally, LSTM has been used for the classification of carcinoma types. The accuracy of the proposed method is 99.57 per cent, and the false positive rate is 0.427 per cent.FindingsThe proposed CNN–XGBoost–LSTM hybrid method has significantly improved the results in distinguishing between adenocarcinoma and squamous cell carcinoma. The importance of the method can be outlined as follows: It has a very low false positive rate of 0.427 per cent. It has very high accuracy, i.e. 99.57 per cent. CNN-based features are providing accurate results in classifying lung carcinoma. It has the potential to serve as an assisting aid for doctors.Practical implicationsIt can be used by doctors as a secondary tool for the analysis of non–small cell lung cancers.Social implicationsIt can help rural doctors by sending the patients to specialized doctors for more analysis of lung cancer.Originality/valueIn this work, a hybrid method using CNN, XGBoost and LSTM has been proposed to distinguish between lung adenocarcinoma and squamous cell carcinoma. A three-layer convolution and three-layer max-pooling-based CNN is used to extract features from the non–small cell lung carcinoma images. A few important features have been selected from the extracted features using the XGBoost algorithm as the optimal feature. Finally, LSTM has been used for the classification of carcinoma types.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47665975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel word-graph-based query rewriting method for question answering
Pub Date: 2023-05-18 | DOI: 10.1108/dta-05-2022-0187
Rongen Yan, Depeng Dang, Huiyu Gao, Yan Wu, Wenhui Yu
Purpose: Question answering (QA) answers questions people ask in natural language. Because of user subjectivity, the same question can be expressed in different ways, which increases the difficulty of text retrieval. The purpose of this paper is therefore to explore a new query-rewriting method for QA that integrates multiple related questions (RQs) into an optimal question. It is also important to generate a new dataset pairing each original query (OQ) with multiple RQs.
Design/methodology/approach: This study builds a new dataset, SQuAD_extend, by crawling a QA community and models the collected OQs with a word graph. Beam search then finds the best path through the graph to obtain the best question. To represent the question's features deeply, the pretrained BERT model is used to encode sentences.
Findings: The experiments show three outstanding findings: (1) answer quality improves after adding the RQs of the OQs; (2) the word graph used to model the problem and choose the optimal path is conducive to finding the best question; and (3) BERT can deeply characterize the semantics of the exact problem.
Originality/value: The proposed method uses a word graph to construct multiple candidate questions and selects the optimal path to rewrite the question, yielding better answers than the baseline. In practice, the results can help guide users to clarify their query intentions and ultimately reach the best answer.
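The core mechanics, a word lattice merged from related questions plus a beam search for the best path, can be illustrated compactly. The toy graph and count-based edge weights below are invented for illustration; the paper scores candidates with BERT-based representations rather than raw counts.

```python
# Beam search over a word graph merged from an OQ and its RQs; the
# highest-scoring path is taken as the rewritten question.
import heapq

# Each node maps to (next_word, weight) edges, e.g. merge counts.
graph = {
    "<s>": [("how", 2), ("what", 1)],
    "how": [("do", 2)], "do": [("I", 2)], "I": [("reset", 2)],
    "what": [("is", 1)], "is": [("the", 1)], "the": [("way", 1)],
    "way": [("to", 1)], "to": [("reset", 1)],
    "reset": [("passwords", 1), ("password", 2)],
    "password": [("</s>", 2)], "passwords": [("</s>", 1)],
}

def beam_search(graph, beam=3, max_len=10):
    paths = [(0.0, ["<s>"])]
    for _ in range(max_len):
        nxt = []
        for score, path in paths:
            if path[-1] == "</s>":        # finished path: carry it forward
                nxt.append((score, path))
                continue
            for word, w in graph.get(path[-1], []):
                nxt.append((score + w, path + [word]))
        paths = heapq.nlargest(beam, nxt, key=lambda p: p[0])
    best = max(paths, key=lambda p: p[0])
    return " ".join(best[1][1:-1])        # strip the sentence markers

print(beam_search(graph))  # "how do I reset password"
```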
{"title":"A novel word-graph-based query rewriting method for question answering","authors":"Rongen Yan, Depeng Dang, Huiyu Gao, Yan Wu, Wenhui Yu","doi":"10.1108/dta-05-2022-0187","DOIUrl":"https://doi.org/10.1108/dta-05-2022-0187","url":null,"abstract":"PurposeQuestion answering (QA) answers the questions asked by people in the form of natural language. In the QA, due to the subjectivity of users, the questions they query have different expressions, which increases the difficulty of text retrieval. Therefore, the purpose of this paper is to explore new query rewriting method for QA that integrates multiple related questions (RQs) to form an optimal question. Moreover, it is important to generate a new dataset of the original query (OQ) with multiple RQs.Design/methodology/approachThis study collects a new dataset SQuAD_extend by crawling the QA community and uses word-graph to model the collected OQs. Next, Beam search finds the best path to get the best question. To deeply represent the features of the question, pretrained model BERT is used to model sentences.FindingsThe experimental results show three outstanding findings. (1) The quality of the answers is better after adding the RQs of the OQs. (2) The word-graph that is used to model the problem and choose the optimal path is conducive to finding the best question. (3) Finally, BERT can deeply characterize the semantics of the exact problem.Originality/valueThe proposed method can use word-graph to construct multiple questions and select the optimal path for rewriting the question, and the quality of answers is better than the baseline. In practice, the research results can help guide users to clarify their query intentions and finally achieve the best answer.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"1 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41453485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A model of image retrieval based on KD-Tree Random Forest
Pub Date: 2023-05-05 | DOI: 10.1108/dta-06-2022-0247
Nguyen Thi Dinh, Nguyen Vu Uyen Nhi, T. Le, Thanh The Van
Purpose: The problems of image retrieval and image description arise in many fields. This paper proposes a model of content-based image retrieval and image content extraction based on the KD-Tree structure.
Design/methodology/approach: A Random Forest is built on a balanced multibranch KD-Tree structure to classify the objects in each image. From this, a KD-Tree structure generated by the Random Forest retrieves a set of images similar to an input image. A KD-Tree structure is also applied to determine relationship words at the leaves, extracting the relationships between objects in the input image. The content of an input image is then described from the class names and the relationships between objects.
Findings: A model of image retrieval and image content extraction was built on the proposed theoretical basis and evaluated on the multi-object image datasets Microsoft COCO and Flickr, achieving average image retrieval precision of 0.9028 and 0.9163, respectively. The experimental results were compared with other works on the same datasets to demonstrate the effectiveness of the proposed method.
Originality/value: A balanced multibranch KD-Tree structure, extending the original KD-Tree, is applied to relationship classification. A KD-Tree Random Forest is then built to improve classifier performance and retrieve a set of images similar to an input image, while the image content is described by combining class names and the relationships between objects.
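The retrieval step can be approximated with an off-the-shelf KD-tree. Note that the paper builds a custom balanced multibranch KD-Tree inside a Random Forest; scikit-learn's KDTree below is only a simplified stand-in, and the 128-dimensional feature vectors are random placeholders for real image descriptors.

```python
# Standard KD-tree nearest-neighbour search over image feature vectors,
# as a simplified stand-in for the paper's KD-Tree Random Forest retrieval.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))  # stand-in descriptors of the corpus
tree = KDTree(features)

query = rng.normal(size=(1, 128))        # descriptor of the input image
dist, idx = tree.query(query, k=5)       # 5 most similar images
print(idx[0], dist[0])
```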
{"title":"A model of image retrieval based on KD-Tree Random Forest","authors":"Nguyen Thi Dinh, Nguyen Vu Uyen Nhi, T. Le, Thanh The Van","doi":"10.1108/dta-06-2022-0247","DOIUrl":"https://doi.org/10.1108/dta-06-2022-0247","url":null,"abstract":"PurposeThe problem of image retrieval and image description exists in various fields. In this paper, a model of content-based image retrieval and image content extraction based on the KD-Tree structure was proposed.Design/methodology/approachA Random Forest structure was built to classify the objects on each image on the basis of the balanced multibranch KD-Tree structure. From that purpose, a KD-Tree structure was generated by the Random Forest to retrieve a set of similar images for an input image. A KD-Tree structure is applied to determine a relationship word at leaves to extract the relationship between objects on an input image. An input image content is described based on class names and relationships between objects.FindingsA model of image retrieval and image content extraction was proposed based on the proposed theoretical basis; simultaneously, the experiment was built on multi-object image datasets including Microsoft COCO and Flickr with an average image retrieval precision of 0.9028 and 0.9163, respectively. The experimental results were compared with those of other works on the same image dataset to demonstrate the effectiveness of the proposed method.Originality/valueA balanced multibranch KD-Tree structure was built to apply to relationship classification on the basis of the original KD-Tree structure. Then, KD-Tree Random Forest was built to improve the classifier performance and retrieve a set of similar images for an input image. Concurrently, the image content was described in the process of combining class names and relationships between objects.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42664223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying business information through deep learning: analyzing the tender documents of an Internet-based logistics bidding platform
Pub Date: 2023-05-04 | DOI: 10.1108/dta-08-2022-0308
Yingwen Yu, Jing Ma
Purpose: Tender documents, an essential data source for internet-based logistics tendering platforms, contain massive fine-grained data, ranging from information on the tenderee to shipping locations and shipping items. Automated information extraction in this area is, however, under-researched, making the extraction process time- and effort-consuming. For Chinese logistics tender entities in particular, existing named entity recognition (NER) solutions are mostly unsuitable, as the documents involve domain-specific terminologies and distinct semantic features.
Design/methodology/approach: To tackle this problem, this paper proposes a novel lattice long short-term memory (LSTM) model, combining a variant contextual feature representation with a conditional random field (CRF) layer, to identify valuable entities in logistics tender documents. Instead of traditional word embeddings, the model uses the pretrained Bidirectional Encoder Representations from Transformers (BERT) model as input to enrich the contextual feature representation. The Lattice-LSTM then exploits both character and word information to avoid segmentation errors.
Findings: The proposed model is verified on a Chinese logistics tender named-entity corpus, and the results show that it outperforms other mainstream NER models on this corpus. The model underpins the automatic extraction of logistics tender information, enabling logistics companies to perceive ever-changing market trends and make far-sighted decisions.
Originality/value: (1) A practical model for logistics tender NER is proposed. By fine-tuning BERT on the downstream task with a small amount of data, the model outperforms existing models in the experiments. To the best of the authors' knowledge, this is the first study to extract named entities from Chinese logistics tender documents. (2) A real logistics tender corpus for practical use is constructed, and a program for online processing of real logistics tender documents is developed. The authors believe the model will help logistics companies convert unstructured documents into structured data and thus perceive ever-changing market trends to make far-sighted decisions.
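The BERT-as-input stage is easy to reproduce in outline. The sketch below extracts contextual character representations with a pretrained Chinese BERT and attaches an untrained linear tag head; the paper's Lattice-LSTM and CRF layers are omitted for brevity, and the example sentence and tag count are assumptions. It requires the transformers package, and the model weights download on first use.

```python
# Pretrained BERT supplies contextual character representations for a
# Chinese sentence; a linear head then emits per-character tag scores.
# (The paper stacks a Lattice-LSTM and a CRF on top; omitted here.)
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "货物从上海运往北京"   # "goods shipped from Shanghai to Beijing"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)

num_tags = 7                        # e.g. BIO tags for 3 assumed entity types
tag_head = nn.Linear(768, num_tags)
print(tag_head(hidden).argmax(-1))  # predicted tag ids per character
```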
{"title":"Identifying business information through deep learning: analyzing the tender documents of an Internet-based logistics bidding platform","authors":"Yingwen Yu, Jing Ma","doi":"10.1108/dta-08-2022-0308","DOIUrl":"https://doi.org/10.1108/dta-08-2022-0308","url":null,"abstract":"PurposeThe tender documents, an essential data source for internet-based logistics tendering platforms, incorporate massive fine-grained data, ranging from information on tenderee, shipping location and shipping items. Automated information extraction in this area is, however, under-researched, making the extraction process a time- and effort-consuming one. For Chinese logistics tender entities, in particular, existing named entity recognition (NER) solutions are mostly unsuitable as they involve domain-specific terminologies and possess different semantic features.Design/methodology/approachTo tackle this problem, a novel lattice long short-term memory (LSTM) model, combining a variant contextual feature representation and a conditional random field (CRF) layer, is proposed in this paper for identifying valuable entities from logistic tender documents. Instead of traditional word embedding, the proposed model uses the pretrained Bidirectional Encoder Representations from Transformers (BERT) model as input to augment the contextual feature representation. Subsequently, with the Lattice-LSTM model, the information of characters and words is effectively utilized to avoid error segmentation.FindingsThe proposed model is then verified by the Chinese logistic tender named entity corpus. Moreover, the results suggest that the proposed model excels in the logistics tender corpus over other mainstream NER models. The proposed model underpins the automatic extraction of logistics tender information, enabling logistic companies to perceive the ever-changing market trends and make far-sighted logistic decisions.Originality/value(1) A practical model for logistic tender NER is proposed in the manuscript. By employing and fine-tuning BERT into the downstream task with a small amount of data, the experiment results show that the model has a better performance than other existing models. This is the first study, to the best of the authors' knowledge, to extract named entities from Chinese logistic tender documents. (2) A real logistic tender corpus for practical use is constructed and a program of the model for online-processing real logistic tender documents is developed in this work. The authors believe that the model will facilitate logistic companies in converting unstructured documents to structured data and further perceive the ever-changing market trends to make far-sighted logistic decisions.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45262725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Risk assessment in machine learning enhanced failure mode and effects analysis
Pub Date: 2023-05-04 | DOI: 10.1108/dta-06-2022-0232
Zeping Wang, Hengte Du, Liangyan Tao, S. Javed
Purpose: Traditional failure mode and effects analysis (FMEA) has limitations, such as neglecting relevant historical data, subjective rating numbers and the limited rationality and accuracy of the Risk Priority Number. This study proposes a machine learning-enhanced FMEA (ML-FMEA) method based on a popular machine learning tool, the Waikato Environment for Knowledge Analysis (WEKA).
Design/methodology/approach: The work uses collected FMEA historical data to predict the probability of component/product failure risk by machine learning with several commonly used classifiers. Ten-fold cross-validation is employed to compare the correct classification rates of ML-FMEA across classifiers, and prediction error is estimated by repeated experiments with different random seeds under varying initialization settings. Finally, the submersible pump case of Bhattacharjee et al. (2020) is used to test the method's performance.
Findings: ML-FMEA, based on most of the commonly used classifiers, outperforms the Bhattacharjee model. For example, ML-FMEA based on Random Committee improves the correct classification rate from 77.47 to 90.09 per cent and the area under the receiver operating characteristic (ROC) curve from 80.9 to 91.8 per cent.
Originality/value: The proposed method not only enables decision-makers to use historical failure data to predict failure risk but may also pave a new way for applying machine learning techniques in FMEA.
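The evaluation protocol, 10-fold cross-validation of several off-the-shelf classifiers, is easy to replicate. The paper works in WEKA; the sketch below substitutes scikit-learn classifiers and synthetic data, so it illustrates the protocol rather than reproducing the paper's results.

```python
# 10-fold cross-validation of several classifiers on synthetic stand-in
# FMEA records (in place of the paper's WEKA workflow).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                  ("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("random forest", RandomForestClassifier(random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {acc.mean():.3f} +/- {acc.std():.3f}")
```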
{"title":"Risk assessment in machine learning enhanced failure mode and effects analysis","authors":"Zeping Wang, Hengte Du, Liangyan Tao, S. Javed","doi":"10.1108/dta-06-2022-0232","DOIUrl":"https://doi.org/10.1108/dta-06-2022-0232","url":null,"abstract":"PurposeThe traditional failure mode and effect analysis (FMEA) has some limitations, such as the neglect of relevant historical data, subjective use of rating numbering and the less rationality and accuracy of the Risk Priority Number. The current study proposes a machine learning–enhanced FMEA (ML-FMEA) method based on a popular machine learning tool, Waikato environment for knowledge analysis (WEKA).Design/methodology/approachThis work uses the collected FMEA historical data to predict the probability of component/product failure risk by machine learning based on different commonly used classifiers. To compare the correct classification rate of ML-FMEA based on different classifiers, the 10-fold cross-validation is employed. Moreover, the prediction error is estimated by repeated experiments with different random seeds under varying initialization settings. Finally, the case of the submersible pump in Bhattacharjee et al. (2020) is utilized to test the performance of the proposed method.FindingsThe results show that ML-FMEA, based on most of the commonly used classifiers, outperforms the Bhattacharjee model. For example, the ML-FMEA based on Random Committee improves the correct classification rate from 77.47 to 90.09 per cent and area under the curve of receiver operating characteristic curve (ROC) from 80.9 to 91.8 per cent, respectively.Originality/valueThe proposed method not only enables the decision-maker to use the historical failure data and predict the probability of the risk of failure but also may pave a new way for the application of machine learning techniques in FMEA.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47381854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}