
Discover artificial intelligence: Latest Articles

AI-guided vectorization for efficient storage and semantic retrieval of visual data.
Pub Date: 2026-01-01 Epub Date: 2025-12-05 DOI: 10.1007/s44163-025-00713-y
Ahmed A Harby, Farhana Zulkernine, Hanady M Abdulsalam

The rapid growth of multimedia content has increased the demand for effective methods that reduce storage requirements while maintaining quality and enabling fast data transmission. Existing standards and generative-model approaches often involve high computational cost, require extensive parameter tuning, and produce inconsistent results, particularly in environments with limited processing resources. This paper presents a convolutional autoencoder framework for reducing the storage footprint of image and video data. The proposed method is designed for efficient integration with existing storage and retrieval systems. Several autoencoder architectures are developed and evaluated on diverse datasets including CelebA, IMDb Faces, Oxford Flowers 102, MNIST, and UCF101. The results show a 56.6% to 70.8% reduction in image data volume with minimal degradation in perceptual quality. The system incorporates a latent representation module that supports compact storage, efficient indexing, and accurate reconstruction. These capabilities are essential for practical deployment in multimedia platforms. Experimental evaluation demonstrates that the proposed approach performs competitively with recent techniques while providing greater consistency and reduced computational overhead. In comparison to generative models, the method achieves a higher peak signal-to-noise ratio and improved structural fidelity. This study offers a practical and reproducible solution for storage reduction, well suited for large-scale image and video archiving and retrieval under constrained or high-throughput conditions.
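The two headline quantities in this abstract, storage reduction and peak signal-to-noise ratio, can be illustrated with a minimal Python sketch. It uses the standard definitions (percentage of bytes saved, and 10·log10(MAX²/MSE) for PSNR); the function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def storage_reduction(original_bytes: int, compressed_bytes: int) -> float:
    """Percentage of storage saved relative to the original size."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 100.0 * (1.0 - compressed_bytes / original_bytes)


def psnr(original, reconstructed, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel positions.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 8-bit pixel values standing in for an image and its reconstruction.
pixels = [52, 55, 61, 66, 70, 61, 64, 73]
decoded = [54, 55, 60, 67, 69, 62, 64, 71]

print(f"storage reduction: {storage_reduction(1000, 300):.1f}%")
print(f"PSNR: {psnr(pixels, decoded):.2f} dB")
```

A higher PSNR means the reconstruction is numerically closer to the original; it does not by itself measure perceptual or structural quality.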

Journal: Discover artificial intelligence, vol. 6, no. 1, art. 16. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12795887/pdf/
Citations: 0
A review of machine learning and deep learning for Parkinson's disease detection.
Pub Date: 2025-01-01 Epub Date: 2025-03-12 DOI: 10.1007/s44163-025-00241-9
Hajar Rabie, Moulay A Akhloufi

Millions of people worldwide suffer from Parkinson's disease (PD), a neurodegenerative disorder marked by motor symptoms such as tremors, bradykinesia, and stiffness. Accurate early diagnosis is crucial for effective management and treatment. This article presents a novel review of Machine Learning (ML) and Deep Learning (DL) techniques for PD detection and progression monitoring, offering new perspectives by integrating diverse data sources. We examine the public datasets recently used in studies, including audio recordings, gait analysis, and medical imaging. We discuss the preprocessing methods applied, the state-of-the-art models utilized, and their performance. Our evaluation included different algorithms such as support vector machines (SVM), random forests (RF), and convolutional neural networks (CNN). These algorithms have shown promising results in PD diagnosis, with accuracy rates exceeding 99% in some studies that combine data sources. Our analysis particularly showcases the effectiveness of audio analysis for early symptom detection and of gait analysis, together with the Unified Parkinson's Disease Rating Scale (UPDRS), for monitoring disease progression. Medical imaging, enhanced by DL techniques, has improved the identification of PD. The application of ML and DL in PD research offers significant potential for improving diagnostic accuracy. However, challenges remain, including the need for large and diverse datasets, data privacy concerns, and data quality in healthcare. Additionally, developing explainable AI is crucial to ensure that clinicians can trust and understand ML and DL models. Our review highlights these key challenges, which must be addressed to enhance the robustness and applicability of AI models in PD diagnosis, and lays the groundwork for future research to overcome them.

Journal: Discover artificial intelligence, vol. 5, no. 1, art. 24. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11903556/pdf/
Citations: 0
A two-phase hybrid clustering framework exploring transitional activities in HAR.
Pub Date: 2025-01-01 Epub Date: 2025-08-31 DOI: 10.1007/s44163-025-00503-6
Martin Woo, Ahmed A Harby, Farhana Zulkernine, Hanady M Abdulsalam

Human Activity Recognition (HAR) using data streams from wearable sensors is challenging due to high data dimensionality, noise, and the lack of labeled data in unsupervised settings. Our prior work showed that traditional clustering models, which achieve state-of-the-art performance on simulated datasets, perform poorly on time-series numeric sensor data. This paper explores different autoencoder (AE) architectures to extract latent features with reduced dimensionality from streaming HAR datasets; these features are then clustered with a clustering model to identify different activity patterns. Since the vanilla AE has shortcomings in learning distinguishing patterns from spatio-temporal time-series sensor data, we extend it with convolutional layers, long short-term memory (LSTM) layers, and a combination of convolutional and LSTM layers in multiple design phases. We apply supervised learning to train a superior spatio-temporal feature extraction AE model. Using the data features extracted by the trained AE, we train a clustering model with an unsupervised learning approach. Our end-to-end integrated hybrid convolutional AE+LSTM feature extractor and K-Means clustering model achieves state-of-the-art clustering accuracy of up to 0.99 in terms of Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) scores on the MobiAct and UCI HAR datasets, improving clustering performance by over 50% compared to previous methods. Further improvements are achieved through rigorous experimentation and advanced data preprocessing methods. We also present a visualization of the clusters, which explains the transitional activity patterns in the overlapping parts of the clusters.
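For readers unfamiliar with the Adjusted Rand Index cited above, the following is a minimal pure-Python sketch of the standard pair-counting formula. It is an illustration only, not the authors' evaluation code; the function name and the toy label lists are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred) -> float:
    """Adjusted Rand Index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have equal length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as perfect agreement

    # Pairs of samples that share a cell of the contingency table,
    # a true class, or a predicted cluster, respectively.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = 0.5 * (sum_true + sum_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical activity labels and cluster assignments.
truth = ["walk", "walk", "sit", "sit"]
clusters = [1, 1, 0, 0]
print(adjusted_rand_index(truth, clusters))
```

The score is invariant to how clusters are numbered, which is why it suits unsupervised settings where cluster IDs carry no meaning: only the grouping of samples matters.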

Journal: Discover artificial intelligence, vol. 5, no. 1, art. 233. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12399724/pdf/
Citations: 0
BMWP: the first Bengali math word problems dataset for operation prediction and solving.
Pub Date: 2025-01-01 Epub Date: 2025-03-13 DOI: 10.1007/s44163-025-00243-7
Sanchita Mondal, Debnarayan Khatua, Sourav Mandal, Dilip K Prasad, Arif Ahmed Sekh

Solving math word problems of varying complexities is one of the most challenging and exciting research questions in artificial intelligence (AI), particularly in natural language processing (NLP) and machine learning (ML). Foundational language models such as GPT must be evaluated for intelligence, and solving word problems is a key method for this assessment. These problems become especially difficult when presented in low-resource regional languages such as Bengali. Word problem solving integrates the cognitive domains of language processing, comprehension, and transformation into real-world solutions. Over the past decade, AI and machine learning have made significant progress in addressing this complex issue. Although researchers worldwide have primarily utilized datasets in English and some in Chinese, standard datasets for low-resource languages such as Bengali have been lacking. In this pioneering study, we introduce the first Bengali Math Word Problem Benchmark Data Set (BMWP), comprising 8653 word problems. We detail the creation of this dataset and the benchmarking methods employed. Furthermore, we investigate operation prediction from Bengali word problems using state-of-the-art deep learning (DL) techniques. We implemented and compared various standard DL-based neural network architectures, achieving an accuracy of 92 ± 2%. The data set and the code will be available at https://github.com/SanchitaMondal/BMWP.

Journal: Discover artificial intelligence, vol. 5, no. 1, art. 25. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11903620/pdf/
Citations: 0
The effect of resampling techniques on the performances of machine learning clinical risk prediction models in the setting of severe class imbalance: development and internal validation in a retrospective cohort.
Pub Date: 2024-01-01 Epub Date: 2024-11-26 DOI: 10.1007/s44163-024-00199-0
Janny Xue Chen Ke, Arunachalam DhakshinaMurthy, Ronald B George, Paula Branco

Purpose: The availability of population datasets and machine learning techniques heralded a new era of sophisticated prediction models involving a large number of routinely collected variables. However, severe class imbalance in clinical datasets is a major challenge. The aim of this study is to investigate the impact of commonly used resampling techniques in combination with commonly used machine learning algorithms in a clinical dataset, and to determine whether any combination of these approaches improves upon the original multivariable logistic regression with no resampling.

Methods: We previously developed and internally validated a multivariable logistic regression 30-day mortality prediction model in 30,619 patients using preoperative and intraoperative features. Using the same dataset, we systematically evaluated and compared model performances after application of resampling techniques [random under-sampling, near miss under-sampling, random oversampling, and synthetic minority oversampling (SMOTE)] in combination with machine learning algorithms (logistic regression, elastic net, decision trees, random forest, and extreme gradient boosting).

Results: We found that in the setting of severe class imbalance, the impact of resampling techniques on model performance varied by the machine learning algorithm and the evaluation metric. Existing resampling techniques did not meaningfully improve the area under the receiver operating characteristic curve (AUROC). The area under the precision-recall curve (AUPRC) was increased only by random under-sampling and SMOTE for decision trees, and by oversampling and SMOTE for extreme gradient boosting. Importantly, some combinations of algorithm and resampling technique decreased AUROC and AUPRC compared to no resampling.

Conclusion: Existing resampling techniques had a variable impact on models, depending on the algorithms and the evaluation metrics. Future research is needed to improve predictive performance in the setting of severe class imbalance.
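Two of the resampling techniques named in the Methods, random under-sampling and random oversampling, can be sketched in a few lines of pure Python. This is an illustrative sketch under simple assumptions (every class is balanced to the smallest or largest class size); the function names and toy data are hypothetical, and it is not the study's pipeline. Near miss under-sampling and SMOTE are not shown.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(rows, labels):
    """Collect rows into lists keyed by their class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))  # sample without replacement
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)  # originals plus random duplicates
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical imbalanced toy data: 95 negatives and 5 positives.
rows = list(range(100))
labels = [0] * 95 + [1] * 5
_, under_labels = random_undersample(rows, labels)
_, over_labels = random_oversample(rows, labels)
print(Counter(under_labels), Counter(over_labels))
```

In practice, resampling of this kind is normally applied to the training split only, so that validation metrics such as AUROC and AUPRC still reflect the original class distribution.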

Journal: Discover artificial intelligence, vol. 4, no. 1, art. 91. Type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11610218/pdf/
Citations: 0