Ganesh Chandrasekaran, Tu N. Nguyen, Jude Hemanth D.
Sentiment analysis is essential for identifying and classifying opinions about a source material, that is, a product or service. It finds a variety of applications, such as product reviews, opinion polls, movie reviews on YouTube, news video analysis, and health care applications including stress and depression analysis. The traditional, text‐based approach to sentiment analysis involves collecting large amounts of textual data and applying different algorithms to extract the sentiment information from it. Multimodal sentimental analysis, by contrast, provides methods to carry out opinion analysis based on the combination of video, audio, and text, which goes well beyond conventional text‐based sentiment analysis in understanding human behaviors. The remarkable increase in the use of social media provides a large collection of multimodal data that reflects users' sentiment on certain aspects. The multimodal approach helps in classifying the polarity (positive, negative, or neutral) of individual sentiments. Our work aims to present a survey of recent developments in analyzing multimodal sentiments (involving text, audio, and video/image) that involve human–machine interaction, and of the challenges involved in analyzing them. A detailed survey of sentiment datasets, feature extraction algorithms, data fusion methods, and the efficiency of different classification techniques is presented in this work.
Multimodal sentimental analysis for social media applications: A comprehensive review. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-05-31. https://doi.org/10.1002/widm.1415
Namrata Bhattacharya, C. Nelson, Gaurav Ahuja, Debarka Sengupta
Single‐cell omics technologies provide biologists with a new dimension for systematically dissecting the underlying complexities within biological systems. These powerful technologies have triggered a wave of rapid development and deployment of new computational tools capable of teasing out critical insights by analysis of large volumes of omics data at single‐cell resolution. Some of the key advancements include identifying molecular signatures imparting cellular identities, their evolutionary relationships, identifying novel and rare cell‐types, and establishing a direct link between cellular genotypes and phenotypes. With the sharp increase in the throughput of single‐cell platforms, the demand for efficient computational algorithms has become prominent. As such, devising novel computational strategies is critical to ensure optimal use of this wealth of molecular data for gaining newer insights into cellular biology. Here we discuss some of the grand opportunities of computational breakthroughs which would accelerate single‐cell research. These are: predicting cellular identity, single‐cell guided in silico drug screening for precision medicine, transfer learning methods to handle sparsity and heterogeneity of expression data, establishing genotype–phenotype relationships at single‐cell resolution, and developing computational platforms for handling big data.
Big data analytics in single‐cell transcriptomics: Five grand opportunities. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-05-11. https://doi.org/10.1002/widm.1414
Clustering of structure‐rich heterogeneous information networks, which are composed of multiple types of objects and relationships, has become a challenge in data mining. Most existing methods for clustering heterogeneous networks focus on the internal information of the dataset while ignoring the domain knowledge outside the dataset. However, in real‐world scenarios, domain knowledge can often offer valuable information for clustering. In this study, we propose a three‐layer model, OntoHeteClus, which is able to cluster multitype objects in star‐structured heterogeneous networks by considering both the dataset itself and the background information quantified via the ontology. OntoHeteClus first evaluates the similarity between central objects according to formalized domain ontology information, based on which central objects are subsequently clustered. Finally, attribute objects are clustered according to the central object clustering result. A numerical example is presented to illustrate the modeling concept and working principle of the proposed method, and experiments on a real‐world dataset demonstrate the effectiveness of the proposed algorithms.
Incorporating domain ontology information into clustering in heterogeneous networks. Yue Huang. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-05-10. https://doi.org/10.1002/widm.1413
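The three steps named in the abstract above (ontology-based similarity between central objects, clustering of central objects, then clustering of attribute objects from that result) can be illustrated with a small sketch. This is not the OntoHeteClus algorithm itself: the toy ontology `PARENT`, the Jaccard-over-ancestors similarity, the single-link threshold grouping, the majority-vote rule, and all object names are assumptions chosen only to make the flow concrete.

```python
from collections import Counter

# Hypothetical toy ontology (assumption): child concept -> parent concept.
PARENT = {"svm": "classification", "knn": "classification",
          "kmeans": "clustering", "dbscan": "clustering",
          "classification": "mining", "clustering": "mining"}

def ancestors(concept):
    """Return the concept together with all of its ancestors."""
    out = {concept}
    while concept in PARENT:
        concept = PARENT[concept]
        out.add(concept)
    return out

def similarity(a, b):
    """Step 1 (assumed measure): Jaccard overlap of two concepts' ancestor sets."""
    sa, sb = ancestors(a), ancestors(b)
    return len(sa & sb) / len(sa | sb)

def cluster_central(concept_of, threshold):
    """Step 2 (assumed rule): single-link grouping of central objects
    whose ontology similarity reaches the threshold."""
    labels, next_label = {}, 0
    for obj in concept_of:
        if obj in labels:
            continue
        labels[obj] = next_label
        stack = [obj]
        while stack:
            cur = stack.pop()
            for other in concept_of:
                if other not in labels and \
                        similarity(concept_of[cur], concept_of[other]) >= threshold:
                    labels[other] = next_label
                    stack.append(other)
        next_label += 1
    return labels

def cluster_attributes(links, central_labels):
    """Step 3 (assumed rule): an attribute object takes the majority
    cluster of the central objects it is linked to."""
    return {attr: Counter(central_labels[c] for c in centrals).most_common(1)[0][0]
            for attr, centrals in links.items()}

# Star-structured toy network: papers are central objects, authors are attribute objects.
concept_of = {"p1": "svm", "p2": "knn", "p3": "kmeans", "p4": "dbscan"}
links = {"alice": ["p1", "p2"], "bob": ["p3", "p4"], "carol": ["p1", "p2", "p3"]}

central_labels = cluster_central(concept_of, threshold=0.4)
attribute_labels = cluster_attributes(links, central_labels)
print(central_labels)
print(attribute_labels)
```

In this toy setup the two classification papers end up together and the two clustering papers end up together, and each author follows the cluster that holds most of their papers. The abstract does not specify how OntoHeteClus performs either step, so each rule here is only a stand-in.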
D. G. Goroso, Alvaro Fraga, Michel Macedo, Carla Fernanda de Miranda Rodrigues, Bruno Mendes de Oliveira Silva, W. Watanabe, D. P. D. Silva, R. R. Silva, J. Puglisi, James Marcin, M. Dharmar
A new predictive model to classify childhood obesity was implemented using machine learning techniques. The first step was to calculate the most relevant anthropometric and cardiovascular parameters of 187 children through principal component analysis (PCA) and cluster classification. Then a Naïve‐Bayes method classified these children into six groups using anthropometric Z score, measurements of abdominal obesity, and arterial pressure: Group I (20.32% of total): composed mainly of children with accentuated malnutrition or malnutrition; Group II (36.36%): composed primarily of eutrophic children; Group III (21.4%): constituted by eutrophic plus overweight children; Group IV (14.97%): composed mainly of overweight and obese children; Group V (5.34%): obese and overweight children; and Group VI (1.6%): obese at‐risk children. From Group II to VI, the proportion of pre‐hypertensive and hypertensive children increased monotonically from 5% to 33%. This classification model was tested on 66 children who were not originally included, with a success rate of 97%. This predictive model will facilitate future longitudinal studies of obesity in children and will help plan interventions and evaluations of their results.
Automatic segmentation to characterize anthropometric parameters and cardiovascular indicators in children. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-05-03. https://doi.org/10.1002/widm.1411
To realize a low‐carbon and sustainable energy transition, smart energy systems (SES) assisted by data and information technology are regarded as promising solutions for energy system integration (ESI) and have been put into regional practice. However, little attention has been paid to the development of multiregional smart energy systems (MRSES), which include three or more areas. This article aims to analyze the concepts and practices of SES and to offer a new perspective on MRSES. The conceptual evolution and regional practices of SES in the world were first reviewed, and it was found that SES does not mean the end of the conceptual evolution of ESI. Current regional practices are still limited to small areas, typically remote areas, urban areas, and industrial areas. Secondly, the review of concepts and practices of SES in China indicates that the understanding of SES concepts is still confused at the national scale, and the apparent regional disparity in China calls for attention to the development of MRSES. Finally, a preliminary concept of MRSES was proposed, and its prospects in China and the world were discussed. The concept is composed of four connected sub‐SES and is named a coordinated development of “smart energy farms + smart energy towns + smart energy industrial parks + smart energy transportation networks.” The first three sub‐SES are identified according to the economic characteristics and resource endowments of different regions, and they are all connected by the fourth sub‐SES. Although this concept is still preliminary, it offers a vision of future large‐scale SES, and its realization requires further breakthroughs in data technology.
The development of regional smart energy systems in the World and China: The concepts, practices, and a new perspective. Yunlong Zhao, Linwei Ma, Zheng Li, W. Ni. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-04-26. https://doi.org/10.1002/widm.1409
N. Kurian, A. Sethi, Anil Reddy Konduru, A. Mahajan, S. Rane
Deep learning (DL)‐based interpretation of medical images has reached a critical juncture of expanding from research projects into translational ones, and is ready to make its way to the clinics. Advances over the last decade in data availability, DL techniques, and computing capabilities have accelerated this journey. Through this journey, we now have a better understanding of the challenges to and pitfalls of wider adoption of DL into clinical care, which, in our view, should and will drive the advances in this field in the next few years. The most important among these challenges are the lack of an appropriately digitized environment within healthcare institutions, the lack of adequate open and representative datasets on which DL algorithms can be trained and tested, and the lack of robustness of widely used DL training algorithms to certain pervasive pathological characteristics of medical images and repositories. In this review, we provide an overview of the role of imaging in oncology, the different techniques that are shaping the way DL algorithms are being made ready for clinical use, and the problems that DL techniques still need to address before DL can find a home in clinics. Finally, we provide a summary of how DL can potentially drive the adoption of digital pathology, vendor neutral archives, and picture archival and communication systems. We caution that the respective researchers may find the coverage of their own fields to be at a high level. This is by design: the format is meant to help those looking in from outside deep learning and medical research, respectively, gain an appreciation for the main concerns and limitations of these two fields, rather than to tell them something new about their own.
A 2021 update on cancer image analytics with deep learning. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-04-22. https://doi.org/10.1002/widm.1410
Table understanding methods extract, transform, and interpret the information contained in tabular data embedded in documents and files of different formats. Such automatic understanding would allow tabular information to be exploited for accurately answering queries, integrating heterogeneous repositories of information in a common knowledge base, or exchanging information among different sources. The purpose of this survey is to provide a comprehensive analysis of the research efforts so far devoted to the problem of table understanding and to describe systems that support the transformation of heterogeneous tables into meaningful information.
Table understanding approaches for extracting knowledge from heterogeneous tables. Sara Bonfitto, E. Casiraghi, M. Mesiti. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-03-28. https://doi.org/10.1002/widm.1407
An in‐depth study of big data mining is urgently needed for next‐generation energy systems, which are characterized by a deep integration of cyber, physical, and social components. This paper presents an initial discussion of big data mining and its applications in intelligent energy systems. New progress in big data mining, such as deep learning, transfer learning, randomized learning, granular computing, and multisource data fusion, is introduced first. Some applications of data mining in energy systems, such as load forecasting and modeling, integrated power and transportation systems, and electricity market forecasting and simulation, are then discussed. Finally, some research problems in energy system data mining that require further attention, such as cyber–physical–social system modeling and super‐resolution perception for smart meter data, are also discussed.
Data mining for energy systems: Review and prospect. Wenxuan Liu, Junhua Zhao, Dianhui Wang. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-03-24. https://doi.org/10.1002/widm.1406
The role played by domain knowledge in all stages of knowledge discovery has been recognized for many years. However, the real‐world semantics embedded in data is often still not fully considered in traditional data mining methods. In this article, we argue that the quality of data mining results is directly related to the extent to which they reflect important properties of the real‐world entities represented therein. Analyzing and characterizing the nature of these entities is the very business of the area of formal ontology. We briefly elaborate on two particular types of artifacts produced by this area: foundational ontologies and ontology‐driven conceptual modeling languages grounded on them. We then elaborate on the benefits they can bring to several activities in a data mining process.
Foundational ontologies, ontology‐driven conceptual modeling, and their multiple benefits to data mining. G. Amaral, F. Baião, G. Guizzardi. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-03-24. https://doi.org/10.1002/widm.1408
Cluster analysis refers to a wide range of data analytic techniques for class discovery and is popular in many application fields. To assess the quality of a clustering result, different cluster validation procedures have been proposed in the literature. While there is extensive work on classical validation techniques, such as internal and external validation, less attention has been given to validating and replicating a clustering result using a validation dataset. Such a dataset may be part of the original dataset, which is separated before analysis begins, or it could be an independently collected dataset. We present a systematic, structured review of the existing literature about this topic. For this purpose, we outline a formal framework that covers most existing approaches for validating clustering results on validation data. In particular, we review classical validation techniques such as internal and external validation, stability analysis, and visual validation, and show how they can be interpreted in terms of our framework. We define and formalize different types of validation of clustering results on a validation dataset, and give examples of how clustering studies from the applied literature that used a validation dataset can be seen as instances of our framework.
Validation of cluster analysis results on validation data: A systematic framework. Theresa Ullmann, C. Hennig, A. Boulesteix. Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, 2021-03-01. https://doi.org/10.1002/widm.1444
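One simple way to picture validating a clustering result on a validation dataset, as discussed in the abstract above, is to check whether the cluster structure found on the original ("discovery") data reappears on held-out data. The sketch below is only an illustration under stated assumptions and is not taken from the paper's framework: it uses a plain 1-D k-means with fixed starting centers, transfers the discovery centers to the validation points, reclusters the validation points independently, and compares the two labelings with the Rand index. All function names and data values are hypothetical.

```python
from itertools import combinations

def assign(x, centers):
    """Index of the nearest center for a 1-D point."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))

def kmeans_1d(points, centers, iters=20):
    """Plain Lloyd iterations on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in points:
            groups[assign(x, centers)].append(x)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def rand_index(a, b):
    """Fraction of point pairs treated the same way (together or apart) by both labelings."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Hypothetical data: a discovery set and a separately held-out validation set.
discovery = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
validation = [0.9, 1.1, 5.1, 4.8, 1.3]

# (a) Cluster the discovery data and transfer its centers to the validation points.
disc_centers = kmeans_1d(discovery, centers=[0.0, 6.0])
transferred = [assign(x, disc_centers) for x in validation]

# (b) Cluster the validation data on its own.
val_centers = kmeans_1d(validation, centers=[0.0, 6.0])
reclustered = [assign(x, val_centers) for x in validation]

# (c) High agreement suggests the discovery clustering replicates on validation data.
agreement = rand_index(transferred, reclustered)
print(transferred, reclustered, agreement)
```

The Rand index is used here because it compares partitions without requiring cluster labels to match. Other agreement measures or validation designs would fit the same pattern.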