Oualid Ouarem, Farid Nouioua, Philippe Fournier-Viger
Episode mining is a research area in data mining, where the aim is to discover interesting episodes, that is, subsequences of events, in an event sequence. The most popular episode-mining task is frequent episode mining (FEM), which consists of identifying episodes that appear frequently in an event sequence, but this task has also been extended in various ways. It was shown that episode mining can reveal insightful patterns for numerous applications such as web stream analysis, network fault management, and cybersecurity, and that episodes can be useful for prediction. Episode mining is an active research area, and there have been numerous advances in the field over the last 25 years. However, due to the rapid evolution of the pattern mining field, there is no prior study that summarizes and gives a detailed overview of this field. The contribution of this article is to fill this gap by presenting an up-to-date survey that provides an introduction to episode mining and an overview of recent developments and research opportunities. This advanced review first gives an introduction to the field of episode mining and the first algorithms. Then, the main concepts used in these algorithms are explained. After that, several recent studies are reviewed that have addressed some limitations of these algorithms and proposed novel solutions to overcome them. Finally, the paper lists some possible extensions of the existing frameworks to mine more meaningful patterns and presents some possible orientations for future work that may contribute to the evolution of the episode mining field.
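The frequent episode mining task described in the abstract can be illustrated with a minimal sketch. The abstract does not fix a frequency definition, so this example assumes a simplified sliding-window count: an episode's support is the number of fixed-width windows in which its events occur in order. The names `occurs_in_order` and `window_support`, the integer timestamps, the window width, and the minimum-support threshold are all hypothetical choices made for illustration, not part of the surveyed algorithms.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, str]  # (integer timestamp, event type); assumed sorted by timestamp


def occurs_in_order(episode: Sequence[str], events: Sequence[str]) -> bool:
    """Return True if `episode` is a subsequence of `events` (order kept, gaps allowed)."""
    it = iter(events)
    # Each `any` call consumes the iterator up to the first match, so later
    # episode events can only match later positions in `events`.
    return all(any(e == target for e in it) for target in episode)


def window_support(sequence: List[Event], episode: Sequence[str], width: int) -> int:
    """Count the windows [t, t + width) that contain `episode` in order.

    Simplifying assumption: windows start at every integer timestamp between
    the first and the last event of the sequence.
    """
    if not sequence:
        return 0
    first, last = sequence[0][0], sequence[-1][0]
    count = 0
    for start in range(first, last + 1):
        window = [etype for (ts, etype) in sequence if start <= ts < start + width]
        if occurs_in_order(episode, window):
            count += 1
    return count


sequence = [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "B")]
support = window_support(sequence, ["A", "B"], width=3)
is_frequent = support >= 2  # hypothetical minimum-support threshold
```

Under these assumptions, the episode "A followed by B" is found in the windows starting at timestamps 1 and 3, so its support is 2 and it meets the illustrative threshold. A real miner would also enumerate candidate episodes rather than test a single one.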
A survey of episode mining. WIREs Data Mining and Knowledge Discovery, published 2023-11-28. DOI: 10.1002/widm.1524 (https://doi.org/10.1002/widm.1524)
Sin Liang Lim, Jaya Sreevalsan-Nair, B. S. Daya Sagar
This article gives a brief overview of various aspects of data mining of multispectral image data. We focus specifically on remote sensing satellite images acquired using multispectral imaging (MSI), since the technology is used, with considerable variation, across multiple knowledge domains such as chemistry, medical imaging, and remote sensing. In this article, the different data mining processes are reviewed along with state-of-the-art methods and applications. To study data mining, it is important to know how the data are acquired and preprocessed. Hence, those topics are briefly covered in the article. The article concludes with applications demonstrating knowledge discovery from data mining, modern challenges, and promising future directions for MSI data mining research.
Multispectral data mining: A focus on remote sensing satellite images. WIREs Data Mining and Knowledge Discovery, published 2023-11-22. DOI: 10.1002/widm.1522 (https://doi.org/10.1002/widm.1522)
Arash Heidari, Nima Jafari Navimipour, Hasan Dag, Mehmet Unal
Deep Learning (DL) has been applied effectively to complex problems in healthcare, industry, and academia, including thyroid diagnosis, lung nodule recognition, computer vision, large data analytics, and human-level control. Nevertheless, developments in digital technology have also been used to produce software that poses a threat to democracy, national security, and confidentiality. Deepfake is one such DL-powered application that has recently surfaced. Deepfake systems can create fake images, movies, and sounds, primarily by replacing scenes or images, that humans cannot tell apart from real ones. Various technologies have put the capacity to alter or synthesize speech, images, or video at our fingertips. Furthermore, video and image forgeries are now so convincing that it is hard to distinguish between false and authentic content with the naked eye. This can cause problems ranging from deceiving public opinion to the use of doctored evidence in court. For these reasons, it is critical to have technologies that can assist us in discerning reality. This study gives a complete assessment of the literature on deepfake detection strategies using DL-based algorithms. We categorize deepfake detection methods in this work based on their applications, which include video detection, image detection, audio detection, and hybrid multimedia detection. The objective of this paper is to give the reader a better knowledge of (1) how deepfakes are generated and identified, (2) the latest developments and breakthroughs in this realm, (3) weaknesses of existing security methods, and (4) areas requiring more investigation and consideration. The results suggest that Convolutional Neural Networks (CNN) are the most frequently employed DL method in the reviewed publications. The majority of the articles address video deepfake detection, and most focus on improving only one parameter, with accuracy receiving the most attention.
Deepfake detection using deep learning methods: A systematic and comprehensive review. WIREs Data Mining and Knowledge Discovery, published 2023-11-20. DOI: 10.1002/widm.1520 (https://doi.org/10.1002/widm.1520)
Bruno I. Grisci, Bruno César Feltes, Joice de Faria Poloni, Pedro H. Narloch, Márcio Dorn
Feature selection algorithms are frequently employed in the preprocessing stage of machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA-seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA-seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more prevalent in publications with computer science backgrounds than in publications from biology, and they can lead to inaccurate and misleading biological results.
The use of gene expression datasets in feature selection research: 20 years of inherent bias? WIREs Data Mining and Knowledge Discovery, published 2023-11-16. DOI: 10.1002/widm.1523 (https://doi.org/10.1002/widm.1523)