Machine learning and knowledge extraction: latest articles, page 4

PCa-Clf: A Classifier of Prostate Cancer Patients into Patients with Indolent and Aggressive Tumors Using Machine Learning

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-27 DOI: 10.3390/make5040066

Yashwanth Karthik Kumar Mamidi, Tarun Karthik Kumar Mamidi, Md Wasi Ul Kabir, Jiande Wu, Md Tamjidul Hoque, Chindo Hicks

A critical unmet medical need in prostate cancer (PCa) clinical management is distinguishing indolent from aggressive tumors. Traditionally, Gleason grading has been used for this purpose. However, classification of Gleason Grade 7 tumors is often ambiguous, as these tumors follow a variable clinical course. This study investigated the application of machine learning (ML) techniques to classify patients into indolent and aggressive PCas. We used gene expression data from The Cancer Genome Atlas and compared gene expression levels between indolent and aggressive tumors to identify features for developing and validating a range of ML and stacking algorithms. ML algorithms accurately distinguished indolent from aggressive PCas. With an accuracy of 96%, the stacking model was superior to individual ML algorithms when all samples with primary Gleason Grades 6 to 10 were used. Excluding samples with Gleason Grade 7 improved accuracy to 97%. This study shows that ML algorithms and stacking models are powerful approaches for the accurate classification of indolent versus aggressive PCas. Future implementation of this methodology may significantly impact clinical decision making and patient outcomes in the clinical management of prostate cancer.



Citations: 0

Unraveling COVID-19 Dynamics via Machine Learning and XAI: Investigating Variant Influence and Prognostic Classification

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-25 DOI: 10.3390/make5040064

Oliver Lohaj, Ján Paralič, Peter Bednár, Zuzana Paraličová, Matúš Huba

Machine learning (ML) has been used in different ways in the fight against COVID-19. ML models have been developed, e.g., for diagnostic or prognostic purposes, using various data modalities (e.g., textual, visual, or structured). Due to the many specific aspects of this disease and its evolution over time, the factors influencing the course of COVID-19 in particular patients are still not well enough understood. In all aspects of our work, a medical expert was strongly involved, following the human-in-the-loop principle. This is a very important but usually neglected part of the ML and knowledge extraction (KE) process. Our research shows that explainable artificial intelligence (XAI) may significantly support this part of ML and KE. Our research focused on using ML for knowledge extraction in two specific scenarios. In the first scenario, we aimed to discover whether adding information about the predominant COVID-19 variant impacts the performance of the ML models. In the second scenario, we focused on prognostic classification models that predict whether a given patient will need an intensive care unit, in combination with different XAI methods. We used nine ML algorithms, namely XGBoost, CatBoost, LightGBM, logistic regression, Naive Bayes, random forest, SGD, SVM-linear, and SVM-RBF. We measured the performance of the resulting models using precision, accuracy, and AUC metrics. Subsequently, we focused on knowledge extraction from the best-performing models using two different approaches: (a) features extracted automatically by forward stepwise selection (FSS); (b) attributes and their interactions discovered by model explainability methods. Both were compared with the attributes selected in advance by the medical experts based on their domain expertise. Our experiments showed that adding information about the COVID-19 variant did not influence the performance of the resulting ML models. It also turned out that medical experts were much more precise in the identification of significant attributes than FSS. Explainability methods identified almost the same attributes as a medical expert, along with interesting interactions among them, which the expert discussed from a medical point of view. The results of our research and their consequences are discussed.
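Forward stepwise selection, as named in the abstract above, can be sketched generically. The following is a minimal illustration, not the authors' implementation: the `score` callable stands in for whatever validation metric is used, and the feature names and toy usefulness values are hypothetical.

```python
def forward_stepwise_selection(features, score, min_gain=1e-9):
    """Greedily add the feature that most improves score(subset).

    `score` is any callable mapping a list of feature names to a number
    (for example, cross-validated AUC). Stops when no candidate improves
    the score by more than `min_gain`.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate every candidate extension of the current subset.
        cand_score, cand = max(
            ((score(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if cand_score - best <= min_gain:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best


# Hypothetical toy score: additive usefulness minus a small per-feature penalty.
USEFULNESS = {"age": 0.3, "crp": 0.2, "noise": 0.0}

def toy_score(subset):
    return sum(USEFULNESS[f] for f in subset) - 0.01 * len(subset)

selected, best = forward_stepwise_selection(list(USEFULNESS), toy_score)
```

In practice the score would come from retraining a model on each candidate subset, which is why the procedure becomes expensive as the number of attributes grows.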



Citations: 0

Brainstorming Will Never Be the Same Again—A Human Group Supported by Artificial Intelligence

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-25 DOI: 10.3390/make5040065

Franc Lavrič, Andrej Škraba

A modification of the brainstorming process by the application of artificial intelligence (AI) was proposed. Here, we describe the design of the software system “kresilnik”, which enables hybrid work between a human group and AI. The proposed system integrates the OpenAI GPT-3.5-turbo model, with the server side providing the results to clients. The proposed architecture makes it possible not only to generate ideas but also to categorize them and set priorities. With the developed prototype, 760 ideas were generated on the topic of the design of the Gorenjska region’s development plan, using the OpenAI GPT-3.5-turbo model at eight different temperatures. For the set of generated ideas, the entropy was determined, as well as the time needed for their generation. The distributions of the entropy of the human-generated and the AI-generated sets of ideas at different temperatures are provided in the form of histograms. Ideas are presented as word clouds and histograms for the human group and the AI-generated sets. A comparison of the process of generating ideas between the human group and AI was conducted. A Mann-Whitney U-test was performed, which confirmed significant differences in the average entropy of the generated ideas. Correlations between the length of the generated ideas and the time needed were determined for the human group and AI. The distributions of the time needed and the length of the ideas were also determined; these are possible indicators for distinguishing between human and artificial processes of generating ideas.
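The abstract does not say how the entropy of an idea is computed. As a rough sketch, assuming character-level Shannon entropy of the idea text, the two quantities it mentions (entropy per idea and the Mann-Whitney U statistic between two groups) could be computed as follows. The function names are illustrative only.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy in bits of the character distribution of `text`
    (an assumed definition; the source does not specify one)."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mann_whitney_u(xs, ys):
    """U statistic for sample xs: number of (x, y) pairs with x > y,
    counting ties as one half. No p-value is computed here."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )
```

For an actual significance test, a statistics library routine such as `scipy.stats.mannwhitneyu` would normally be used instead of the bare U statistic.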



Citations: 0

Beyond Weisfeiler–Lehman with Local Ego-Network Encodings

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-22 DOI: 10.3390/make5040063

Nurudin Alvarez-Gonzalez, Andreas Kaltenbrunner, Vicenç Gómez

Identifying similar network structures is key to capturing graph isomorphisms and learning representations that exploit structural information encoded in graph data. This work shows that ego networks can produce a structural encoding scheme for arbitrary graphs with greater expressivity than the Weisfeiler–Lehman (1-WL) test. We introduce IGEL, a preprocessing step to produce features that augment node representations by encoding ego networks into sparse vectors that enrich message passing (MP) graph neural networks (GNNs) beyond 1-WL expressivity. We formally describe the relation between IGEL and 1-WL, and characterize its expressive power and limitations. Experiments show that IGEL matches the empirical expressivity of state-of-the-art methods on isomorphism detection while improving performance on nine GNN architectures and six graph machine learning tasks.
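To illustrate the kind of gap the abstract refers to, the sketch below runs 1-WL colour refinement on a 6-cycle and on two disjoint triangles, which 1-WL cannot tell apart because both graphs are 2-regular. It then computes a simplified ego-network histogram of (distance to root, degree inside the ego network) pairs. That histogram is an assumption made for illustration in the spirit of the abstract, not the exact IGEL encoding.

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    """1-WL colour refinement; colours are nested tuples so that they are
    directly comparable across graphs."""
    colors = {v: () for v in adj}
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())

def ego_histogram(adj, radius=1):
    """Count (distance, degree-within-ego-network) pairs over all roots."""
    hist = Counter()
    for root in adj:
        dist = {root: 0}
        frontier = [root]
        for d in range(1, radius + 1):  # breadth-first search up to `radius`
            nxt = []
            for v in frontier:
                for u in adj[v]:
                    if u not in dist:
                        dist[u] = d
                        nxt.append(u)
            frontier = nxt
        for v, d in dist.items():
            degree_in_ego = sum(1 for u in adj[v] if u in dist)
            hist[(d, degree_in_ego)] += 1
    return hist

cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

same_under_wl = wl_histogram(cycle6) == wl_histogram(two_triangles)
same_under_ego = ego_histogram(cycle6) == ego_histogram(two_triangles)
```

Here the neighbours of a root have degree 1 inside a 1-hop ego network of the cycle but degree 2 inside a triangle, which is the local structure that plain message passing over node degrees misses.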


Citations: 0

Multi-Task Representation Learning for Renewable-Power Forecasting: A Comparative Analysis of Unified Autoencoder Variants and Task-Embedding Dimensions

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-20 DOI: 10.3390/make5030062

Chandana Priya Nivarthi, Stephan Vogt, Bernhard Sick

Typically, renewable-power-generation forecasting using machine learning involves creating separate models for each photovoltaic or wind park, known as single-task learning models. However, transfer learning has gained popularity in recent years, as it allows for the transfer of knowledge from source parks to target parks. Nevertheless, determining the most similar source park(s) for transfer learning can be challenging, particularly when the target park has limited or no historical data samples. To address this issue, we propose a multi-task learning architecture that employs a Unified Autoencoder (UAE) to first learn a common representation of input weather features among tasks and then uses a Task-Embedding layer in a Neural Network (TENN) to learn task-specific information. The proposed UAE-TENN architecture can be easily extended to new parks with or without historical data. We evaluate the performance of our proposed architecture and compare it to single-task learning models on six photovoltaic and wind farm datasets comprising a total of 529 parks. Our results show that the UAE-TENN architecture significantly improves power-forecasting performance by 10 to 19% for photovoltaic parks and 5 to 15% for wind parks compared to baseline models. We also demonstrate that UAE-TENN improves forecast accuracy for new photovoltaic parks by 19%, even in a zero-shot learning scenario where there is no historical data. Additionally, we propose variants of the Unified Autoencoder with convolutional and LSTM layers, compare their performance, and provide a comparison among architectures with different numbers of task-embedding dimensions. Finally, we demonstrate the utility of trained task embeddings for interpretation and visualization purposes.



Citations: 0

Early Thyroid Risk Prediction by Data Mining and Ensemble Classifiers

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-18 DOI: 10.3390/make5030061

Mohammad H. Alshayeji

Thyroid disease is among the most prevalent endocrinopathies worldwide. As the thyroid gland controls human metabolism, thyroid illness is a matter of concern for human health. To save time and reduce error rates, an automatic, reliable, and accurate machine-learning (ML) system for thyroid disease identification is essential. The proposed model aims to address limitations of existing work, such as the lack of detailed feature analysis and visualization and the need for better prediction accuracy and reliability. Here, a public thyroid illness dataset containing 29 clinical features from the University of California, Irvine ML repository was used. The clinical features helped us to build an ML model that can predict thyroid illness by analyzing early symptoms, replacing the manual analysis of these attributes. Feature analysis and visualization facilitate an understanding of the role of features in thyroid prediction tasks. In addition, the overfitting problem was eliminated by 5-fold cross-validation and data balancing using the synthetic minority oversampling technique (SMOTE). Ensemble learning ensures prediction model reliability owing to the involvement of multiple classifiers in the prediction decisions. With the boosting method, the proposed model achieved 99.5% accuracy, 99.39% sensitivity, and 99.59% specificity, and it is applicable to real-time computer-aided diagnosis (CAD) systems to ease diagnosis and promote early treatment.
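The core step of SMOTE mentioned in the abstract is interpolation between a minority-class sample and one of its nearest minority-class neighbours. The following is a minimal standard-library sketch of that step only. The function name, the parameters `k` and `seed`, and the toy points are assumptions for illustration, and a maintained library implementation would be preferable in real use.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic points by interpolating between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical 2-D minority-class samples.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority_points, n_new=5)
```

When combined with cross-validation, oversampling is usually applied to the training folds only, so that synthetic points derived from a sample never appear in the fold used to evaluate it.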



Citations: 0

Gradient-Based Neural Architecture Search: A Comprehensive Evaluation

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-14 DOI: 10.3390/make5030060

Sarwat Ali, M. Arif Wani

One of the challenges in deep learning involves discovering the optimal architecture for a specific task. This is effectively tackled through Neural Architecture Search (NAS). NAS encompasses three prominent approaches (reinforcement learning, evolutionary algorithms, and gradient descent) that have demonstrated noteworthy potential in identifying good candidate architectures. However, approaches based on reinforcement learning and evolutionary algorithms often require extensive computational resources, on the order of hundreds of GPU days or more. Therefore, we confine this work to the gradient-based approach because of its lower computational demands. Our objective is to identify the best gradient-based NAS method and to pinpoint opportunities for future enhancements. To achieve this, we provide a comprehensive evaluation of four major gradient-descent-based architecture search methods for discovering the best neural architecture for image classification tasks. An overview of these gradient-based methods, i.e., DARTS, PDARTS, Fair DARTS and Att-DARTS, is presented. A theoretical comparison of how they derive the best neural architecture, based on search spaces, continuous relaxation strategy and bi-level optimization, is then provided. The strengths and weaknesses of these methods are also listed. Experimental results comparing the error rate and computational cost of these gradient-based methods are analyzed. These experiments used the benchmark datasets CIFAR-10, CIFAR-100 and ImageNet. The results show that PDARTS is both better and faster than the other examined methods, making it a strong candidate for automating Neural Architecture Search. Through this comparative analysis, our research provides valuable insights and future research directions to address the criticism and gaps in the literature.
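The continuous relaxation that DARTS-style methods rely on replaces the discrete choice of one operation on an edge with a softmax-weighted mixture of all candidate operations, governed by learnable architecture parameters. A minimal scalar sketch follows. The three toy operations and all names are hypothetical, and the bi-level optimization of weights and architecture parameters is deliberately omitted.

```python
import math

def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    peak = max(alphas)
    exps = [math.exp(a - peak) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy candidate operations acting on a scalar "feature map".
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_op(x, alphas):
    """Continuous relaxation: softmax-weighted sum of all candidate ops."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(alphas):
    """After the search, keep only the operation with the largest alpha."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]
```

Because `mixed_op` is differentiable in `alphas`, the architecture parameters can be updated by gradient descent alongside the network weights, which is what keeps the search cost well below that of reinforcement-learning or evolutionary search.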



Citations: 0

Autoencoder-Based Visual Anomaly Localization for Manufacturing Quality Control

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-13 DOI: 10.3390/make6010001

Devang Mehta, Noah Klarmann

Manufacturing industries require efficient, high-volume production of high-quality finished goods. In the context of Industry 4.0, visual anomaly detection offers a promising solution for automated, high-precision product quality control. In general, automation based on computer vision is a promising way to prevent bottlenecks at the product quality checkpoint. We considered recent advancements in machine learning to improve visual defect localization, but challenges persist in obtaining a balanced feature set and database of the wide variety of defects occurring in the production line. Hence, this paper proposes a defect localizing autoencoder with unsupervised class selection, performed by applying k-means clustering to the features extracted from a pretrained VGG16 network. Moreover, the selected classes of defects are augmented with natural wild textures to simulate artificial defects. The study demonstrates the effectiveness of the defect localizing autoencoder with unsupervised class selection for improving defect detection in manufacturing industries. The proposed methodology shows promising results, with precise and accurate localization of quality defects on melamine-faced boards for the furniture industry. Incorporating artificial defects into the training data shows significant potential for practical implementation in real-world quality control scenarios.
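The unsupervised class selection step mentioned in the abstract (k-means over extracted features) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors below are synthetic stand-ins for VGG16 features, and the function names, the farthest-first initialisation, and the choice of k = 2 are assumptions made purely for illustration.

```python
import numpy as np


def farthest_first_init(features, k):
    """Deterministic initialisation (an assumption, not from the paper):
    start from the first vector, then repeatedly add the vector that is
    farthest from all centroids chosen so far."""
    centroids = [features[0]]
    for _ in range(k - 1):
        # Squared distance from each vector to its nearest chosen centroid.
        nearest = np.min(
            [((features - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        centroids.append(features[nearest.argmax()])
    return np.array(centroids)


def kmeans(features, k, n_iter=50):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = farthest_first_init(features, k)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        # Assign every vector to its nearest centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        updated = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids


# Synthetic stand-in for feature vectors from a pretrained network:
# two well-separated groups of 8-dimensional vectors.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 8))
features = np.vstack([group_a, group_b])

labels, centroids = kmeans(features, k=2)
print(labels)
```

In a real pipeline the `features` array would hold one row per image patch or image, taken from an intermediate layer of the pretrained network; the resulting cluster labels would then serve as the candidate defect classes.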



Citations: 0

Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-12 DOI: 10.3390/make5030059

Taja Kuzman, Igor Mozetič, Nikola Ljubešić

Massive text collections are the backbone of large language models, the main ingredient of the current significant progress in artificial intelligence. However, as these collections are mostly compiled using automatic methods, researchers have little insight into what types of texts they contain. Automatic genre identification is a text classification task that enriches texts with genre labels, such as promotional and legal, providing meaningful insights into the composition of these large text collections. In this paper, we evaluate machine learning approaches for the genre identification task based on their generalizability across different datasets to assess which model is the most suitable for the downstream task of enriching large web corpora with genre information. We train and test multiple fine-tuned BERT-like Transformer-based models and show that merging different genre-annotated datasets yields superior results. Moreover, we explore the zero-shot capabilities of large GPT Transformer models in this task and discuss the advantages and disadvantages of the zero-shot approach. We also publish the best-performing fine-tuned model that enables automatic genre annotation in multiple languages. In addition, to promote further research in this area, we plan to share, upon request, a new benchmark for automatic genre annotation, so that the latest large language models are not exposed to it.
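Merging genre-annotated datasets generally requires mapping each dataset's own labels onto one shared label set first. The sketch below shows one way this could be done; the dataset names, the label mappings, and the `merge_datasets` helper are hypothetical and are not taken from the paper. Only the genre names "promotional" and "legal" appear in the abstract.

```python
def merge_datasets(datasets, label_maps):
    """Combine several labelled datasets into one list of
    (text, shared_label, source_name) tuples.

    datasets:   {source_name: [(text, original_label), ...]}
    label_maps: {source_name: {original_label: shared_label}}
    Examples whose label has no counterpart in the shared set are dropped.
    """
    merged = []
    for source_name, examples in datasets.items():
        mapping = label_maps[source_name]
        for text, original_label in examples:
            shared_label = mapping.get(original_label)
            if shared_label is None:
                continue  # no equivalent in the shared schema
            merged.append((text, shared_label, source_name))
    return merged


# Hypothetical toy data: two sources that name the same genres differently.
datasets = {
    "corpus_a": [("Buy now, 50% off!", "promo"), ("Section 1. Definitions", "law")],
    "corpus_b": [("Limited offer today", "advertisement"), ("la la la", "lyrics")],
}
label_maps = {
    "corpus_a": {"promo": "promotional", "law": "legal"},
    "corpus_b": {"advertisement": "promotional"},  # "lyrics" has no shared label
}

merged = merge_datasets(datasets, label_maps)
for row in merged:
    print(row)
```

Keeping the source name alongside each example makes it possible to hold out one whole dataset later and check how well a model trained on the rest generalizes to it.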



Citations: 0

Cyberattack Detection in Social Network Messages Based on Convolutional Neural Networks and NLP Techniques

Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine learning and knowledge extraction

Pub Date : 2023-09-01 DOI: 10.3390/make5030058

Jorge E. Coyac-Torres, Grigori Sidorov, E. Aguirre-Anaya, Gerardo Hernández-Oregón

Social networks have captured the attention of many people worldwide. However, these services have also attracted a considerable number of malicious users whose aim is to compromise the digital assets of other users by using messages as an attack vector to execute different types of cyberattacks against them. This work presents an approach based on natural language processing tools and a convolutional neural network architecture to detect and classify four types of cyberattacks in social network messages: malware, phishing, spam, and one whose aim is to deceive a user into spreading malicious messages to other users, which, in this work, is identified as a bot attack. One notable feature of this work is that it analyzes textual content without depending on any characteristics of a specific social network, making its analysis independent of particular data sources. Finally, the approach was tested on real data in two stages. The first stage detected whether any of the four types of cyberattacks was present in the message, achieving an accuracy of 0.91. Once a message had been detected as a cyberattack, the second stage classified it as one of the four types of cyberattack, achieving an accuracy of 0.82.
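The two-stage procedure described in the abstract (first detect whether a message is an attack, then classify the attack type) can be sketched as a simple cascade. The code below only illustrates that control flow: the keyword-based `toy_detector` and `toy_type_classifier` are hypothetical stand-ins for the trained models, and none of the names come from the paper.

```python
from typing import Callable, Optional

# The four attack types named in the abstract.
ATTACK_TYPES = ("malware", "phishing", "spam", "bot")


def two_stage_classify(
    message: str,
    detector: Callable[[str], bool],
    type_classifier: Callable[[str], str],
) -> Optional[str]:
    """Return None for a benign message, otherwise the attack type.

    Stage 1 decides attack vs. benign; stage 2 runs only on messages
    that stage 1 flagged.
    """
    if not detector(message):
        return None
    label = type_classifier(message)
    if label not in ATTACK_TYPES:
        raise ValueError(f"unexpected attack type: {label!r}")
    return label


# Hypothetical keyword rules standing in for the trained models.
def toy_detector(message: str) -> bool:
    return "http" in message.lower()


def toy_type_classifier(message: str) -> str:
    text = message.lower()
    if "password" in text:
        return "phishing"
    if ".exe" in text:
        return "malware"
    if "forward this" in text:
        return "bot"
    return "spam"


messages = [
    "See you at lunch tomorrow",
    "Reset your password at http://example.test/login",
    "Free tool: http://example.test/setup.exe",
    "Forward this to all friends: http://example.test",
    "Cheap watches http://example.test",
]
results = [two_stage_classify(m, toy_detector, toy_type_classifier) for m in messages]
print(results)
```

One consequence of this design is that stage 2 never sees messages that stage 1 rejects, so an attack missed in the first stage cannot be recovered in the second.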



Citations: 0