Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery: Latest Articles

Taxonomy of machine learning paradigms: A data‐centric perspective
IF 7.8 | Zone 2, Computer Science | Q1 Computer Science | Pub Date: 2022-06-03 | DOI: 10.1002/widm.1470
F. Emmert-Streib, M. Dehmer
Machine learning is a field composed of various pillars. Traditionally, supervised learning (SL), unsupervised learning (UL), and reinforcement learning (RL) are the dominating learning paradigms that inspired the field since the 1950s. Based on these, thousands of different methods have been developed during the last seven decades used in nearly all application domains. However, recently, other learning paradigms are gaining momentum which complement and extend the above learning paradigms significantly. These are multi‐label learning (MLL), semi‐supervised learning (SSL), one‐class classification (OCC), positive‐unlabeled learning (PUL), transfer learning (TL), multi‐task learning (MTL), and one‐shot learning (OSL). The purpose of this article is a systematic discussion of these modern learning paradigms and their connection to the traditional ones. We discuss each of the learning paradigms formally by defining key constituents and paying particular attention to the data requirements for allowing an easy connection to applications. That means, we assume a data‐driven perspective. This perspective will also allow a systematic identification of relations between the individual learning paradigms in the form of a learning‐paradigm graph (LP‐graph). Overall, the LP‐graph establishes a taxonomy among 10 different learning paradigms.
Citations: 10
Machine intelligence in dynamical systems: A state‐of‐art review
IF 7.8 | Zone 2, Computer Science | Q1 Computer Science | Pub Date: 2022-05-13 | DOI: 10.1002/widm.1461
A. Sahoo, S. Chakraverty
This article is dedicated to studying the impact of machine intelligence (MI) methods, viz., various types of neural models, for investigating dynamical systems arising in interdisciplinary areas. Different types of artificial neural network (ANN) methods, viz., recurrent neural network, functional‐link neural network, convolutional neural network, symplectic artificial neural network, genetic algorithm neural network, and so on, have been applied by different researchers to investigate these problems. Although researchers have developed various traditional methods to solve these dynamical problems, those methods may be problem dependent, require repeated simulations, and fail to handle nonlinear behavior. In this regard, neural network model based methods are more general: their solutions are continuous over the given domain of integration, and they are self‐adaptive and can be used as a black box. As such, in this article, we review and analyze different MI methods that have been applied to investigate these problems.
Citations: 3
Critical review of bio‐inspired data optimization techniques: An image steganalysis perspective
IF 7.8 | Zone 2, Computer Science | Q1 Computer Science | Pub Date: 2022-05-03 | DOI: 10.1002/widm.1460
Anita Christaline Johnvictor, Austin Joe Amalanathan, Ramya Meghana Pariti Venkata, Nishtha Jethi
Image steganalysis involves the discovery of secret information embedded in an image. The common method is blind image steganalysis, which is a two‐class classification problem. Blind steganalysis extracts all possible feature variations in an image due to embedding, selects the most appropriate feature data, and then classifies the image. The dimensionality of the extracted image features is high and demands data reduction to identify the most relevant features and to aid accurate classification of an image. The classification has two classes, namely clean (cover) images and stego images (images with embedded secret data). Since the classification accuracy depends on selecting the most appropriate features, choosing the best data reduction or data optimization algorithm becomes a prime requisite. Research shows that most statistical optimization techniques converge to local minima and yield lower classification accuracy than bio‐inspired methods. Bio‐inspired optimization methods obtain improved classification accuracy by reducing the high‐dimensional image features. These methods start with an initial population and then optimize it in steps until a global optimal point is reached. Examples of such methods include Ant Lion Optimization (ALO) and the Fire Fly Algorithm (FFA); the literature reports around 54 such algorithms. Bio‐inspired optimization has been applied in various fields of design optimization and is novel to image steganalysis. This article analyses the various bio‐inspired optimization techniques and their accuracy in image steganalysis, covering the discovery of embedded information in both JPEG and spatial domain steganalysis.
Citations: 1
Artificial intelligence for climate change adaptation
IF 7.8 | Zone 2, Computer Science | Q1 Computer Science | Pub Date: 2022-04-12 | DOI: 10.1002/widm.1459
S. Cheong, K. Sankaran, Hamsa Bastani
Although artificial intelligence (AI; inclusive of machine learning) is gaining traction in supporting climate change projections and impact assessments, limited work has used AI to address climate change adaptation. We identify this gap and highlight the value of AI, especially in supporting complex adaptation choices and implementation. We illustrate how AI can effectively leverage precise, real‐time information in data‐scarce settings. We focus on supervised learning, transfer learning, reinforcement learning, and multimodal learning to illustrate how innovative AI methods can enable better‐informed choices, tailor adaptation measures to heterogeneous groups, and generate effective synergies and trade‐offs.
Citations: 5
A review on data fusion in multimodal learning analytics and educational data mining
IF 7.8 | Zone 2, Computer Science | Q1 Computer Science | Pub Date: 2022-04-05 | DOI: 10.1002/widm.1458
Wilson Chango, J. Lara, Rebeca Cerezo, C. Romero
New educational models such as smart learning environments make use of digital and context‐aware devices to facilitate the learning process. In this new educational scenario, a huge quantity of multimodal student data from a variety of sources can be captured, fused, and analyzed. This offers researchers and educators a unique opportunity to discover new knowledge, better understand the learning process, and intervene if necessary. However, data fusion approaches and techniques must be applied correctly in order to combine the various sources of multimodal learning analytics (MLA). These sources or modalities in MLA include audio, video, electrodermal activity data, eye‐tracking, user logs, and click‐stream data, but also learning artifacts and more natural human signals such as gestures, gaze, speech, or writing. This survey introduces data fusion in learning analytics (LA) and educational data mining (EDM) and describes how these data fusion techniques have been applied in smart learning. It shows the current state of the art by reviewing the main publications, the main types of fused educational data, and the data fusion approaches and techniques used in EDM/LA, as well as the main open problems, trends, and challenges in this specific research area.
Citations: 17
A review of bus arrival time prediction using artificial intelligence
IF 7.8 | Zone 2, Computer Science | Q1 Computer Science | Pub Date: 2022-04-03 | DOI: 10.1002/widm.1457
Nisha Singh, K. Kumar
Buses are an important part of the public transport system. Providing accurate information about bus arrival and departure times at bus stops is one of the main requirements of good quality public transport. Accurate arrival and departure time information is important for a public transport mode since it increases ridership as well as traveler satisfaction. With accurate arrival‐time and departure‐time information, travelers can make informed decisions about their journey. The application of artificial intelligence (AI) based methods/algorithms to predict the bus arrival time (BAT) is reviewed in detail. Existing research that applies the different branches of AI is surveyed systematically. Prediction models are grouped under their respective branches of AI. A thorough discussion elaborates on the different branches of AI that have been applied to several aspects of BAT prediction. Research gaps and possible directions for further research are summarized.
Citations: 5
Gaining insights in datasets in the shade of “garbage in, garbage out” rationale: Feature space distribution fitting
IF 7.8 | Zone 2, Computer Science | Q1 Computer Science | Pub Date: 2022-03-30 | DOI: 10.1002/widm.1456
Gürol Canbek
This article emphasizes comprehending the “Garbage In, Garbage Out” (GIGO) rationale and ensuring dataset quality in Machine Learning (ML) applications to achieve high and generalizable performance. An initial step should be added to an ML workflow in which researchers evaluate the insights gained by quantitative analysis of the datasets' sample and feature spaces. This study contributes towards achieving such a goal by suggesting a technique to quantify datasets in terms of feature frequency distribution characteristics. Hence a unique insight is provided into how frequently the features occur in the available dataset samples. The technique was demonstrated on 11 benign and malign (malware) Android application datasets belonging to six academic Android mobile malware classification studies. The permissions requested by applications, such as CALL_PHONE, compose a relatively high‐dimensional binary feature space. The results showed that the distributions fit well into two of the four long right‐tail statistical distributions considered: log‐normal, exponential, power law, and Poisson. Specifically, log‐normal was the most frequently exhibited distribution; the exceptions were two malign datasets that followed an exponential distribution. This study also explores statistical distribution fit/unfit feature analysis that enhances the insights into the feature space. Finally, the study compiles examples of phenomena in the literature exhibiting these statistical distributions, which should be considered when interpreting the fitted distributions. In conclusion, conducting well‐formed statistical methods provides a clear understanding of the datasets and of intra‐class and inter‐class differences before proceeding with selecting features and building a classifier model. Feature distribution characteristics should be among the properties analyzed beforehand.
Citations: 4
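The feature frequency distribution fitting described in the entry above can be illustrated with a minimal sketch. It is not the author's procedure: it assumes each sample is simply a set of requested permissions, covers only two of the four candidate distributions (log‐normal and exponential), treats the discrete frequencies as continuous values, and compares fits by maximized log‐likelihood. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from statistics import fmean


def feature_frequencies(samples):
    """Count in how many samples each binary feature (e.g. a permission) appears."""
    counts = Counter()
    for features in samples:
        counts.update(set(features))
    return counts


def lognormal_loglik(xs):
    """Log-likelihood of xs under a log-normal with maximum-likelihood parameters."""
    logs = [math.log(x) for x in xs]
    mu = fmean(logs)
    var = fmean([(v - mu) ** 2 for v in logs])  # ML estimate of the variance
    if var == 0:
        raise ValueError("log-normal fit needs at least two distinct values")
    n = len(xs)
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n


def exponential_loglik(xs):
    """Log-likelihood of xs under an exponential with rate = 1 / mean."""
    rate = 1.0 / fmean(xs)
    return len(xs) * math.log(rate) - rate * sum(xs)


def compare_fits(xs):
    """Return the log-likelihood of each candidate distribution (higher is better)."""
    return {
        "log-normal": lognormal_loglik(xs),
        "exponential": exponential_loglik(xs),
    }


# Toy data (assumption): four applications and the permissions they request.
samples = [
    {"INTERNET", "CALL_PHONE"},
    {"INTERNET"},
    {"INTERNET", "SEND_SMS"},
    {"CALL_PHONE", "INTERNET"},
]
freqs = sorted(feature_frequencies(samples).values(), reverse=True)
scores = compare_fits(freqs)
print(max(scores, key=scores.get), scores)
```

Both candidates here have a single fitted location or rate parameter plus, for the log‐normal, a variance, so a raw log‐likelihood comparison favors the more flexible model slightly; a real analysis would likely use a goodness‐of‐fit test or an information criterion and far more features than this toy set.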
Review and data mining of linguistic studies of English modal verbs
IF 7.8 | Zone 2, Computer Science | Q1 Computer Science | Pub Date: 2022-03-29 | DOI: 10.1002/widm.1455
Jianping Yu, Jilin Fu, Tana Bai, Xueping Xu
Modal verbs express modality. Modality is concerned with the status of the proposition that describes an event; it also expresses the opinion and attitude of a speaker toward the proposition of an utterance. Since modalities are directly related to the objective world, the subjective world, and language use, they have been a hot topic for philosophers, logicians, and linguists. Philosophers are concerned with the relations between the objective world and the true/false values of the modality; logicians are interested in the relations among possibility, necessity, and the objective world; and linguists pay attention to the modality category, sense category, function, recognition, and application of modal verbs. In recent years, the linguistic study of modal verbs has extended from general linguistics to computational linguistics. Since modal verbs form a complex semantic system and are often indeterminate in sense, they have been a tough issue in linguistic studies and have attracted great attention. Clarifying the status of previous linguistic studies of modal verbs and revealing their characteristics will be of great significance for further study. Therefore, this article reviews previous linguistic studies of English modal verbs, applies data mining to the characteristics of those studies, and, based on a summary of them, gives suggestions for the further study of English modal verbs.
Citations: 0
Subgraph mining in a large graph: A review
IF 7.8 Zone 2 (Computer Science) Q1 Computer Science Pub Date: 2022-03-08 DOI: 10.1002/widm.1454
Lam B. Q. Nguyen, I. Zelinka, V. Snášel, Loan T. T. Nguyen, Bay Vo
Large graphs are often used to simulate and model complex systems in various research and application fields. Frequent subgraph mining (FSM) in a single large graph is therefore a vital problem that has recently attracted numerous researchers and plays an important role in many research and application tasks. FSM aims to find all subgraphs whose number of appearances in a large graph is greater than or equal to a given frequency threshold. In most recent applications, the underlying graphs, such as social networks, are very large. Algorithms for FSM on a single large graph have consequently developed rapidly, but all of them face NP‐hard (nondeterministic polynomial‐time hard) complexity with huge search spaces and therefore still need a lot of time and memory to store and process. In this article, we present an overview of the problems of FSM, the important phases in FSM, and the main groups of FSM algorithms, and we survey many modern applied algorithms. This overview covers many practical applications and is a fundamental premise for many future studies.
Citations: 8
Machine learning in postgenomic biology and personalized medicine.
IF 7.8 Zone 2 (Computer Science) Q1 Computer Science Pub Date: 2022-03-01 DOI: 10.1002/widm.1451
Animesh Ray

In recent years, artificial intelligence in the form of machine learning has been revolutionizing biology, biomedical sciences, and gene-based agricultural technology. On the one hand, the massive data generated in the biological sciences by rapid and deep gene sequencing and by protein or other molecular structure determination require data analysis capabilities using machine learning that are distinctly different from classical statistical methods; on the other, these large datasets are enabling the adoption of novel data-intensive machine learning algorithms for solving biological problems that until recently had relied on computationally expensive mechanistic model-based approaches. This review provides a bird's-eye view of the applications of machine learning in post-genomic biology. An attempt is also made to indicate, as far as possible, the areas of research that are poised to make further impacts, including the importance of explainable artificial intelligence (XAI) in human health. Further contributions of machine learning are expected to transform medicine, public health, and agricultural technology, and to provide invaluable gene-based guidance for the management of complex environments in this age of global warming.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9371441/pdf/nihms-1770264.pdf
Citations: 0