Optimization-based convolutional neural model for the classification of white blood cells
Pub Date : 2024-06-26 DOI: 10.1186/s40537-024-00949-y
Tulasi Gayatri Devi, Nagamma Patil
White blood cells (WBCs) are one of the most significant parts of the human immune system, and they play a crucial role in diagnosing pathological and blood-related diseases. The characteristics of WBCs are well defined by the morphological behavior of their nuclei, and the number and types of WBCs can often indicate the presence of disease. Generally, there are several types of WBCs, and their accurate classification aids proper diagnosis and treatment. Although various classification models have been developed in the past, they suffer from low classification accuracy, high error rates, and long execution times. Hence, a novel classification strategy named the African Buffalo-based Convolutional Neural Model (ABCNM) is proposed to classify the types of WBCs accurately. The proposed strategy commences with collecting WBC sample databases, which are preprocessed and fed into the system for training and classification. The preprocessing phase removes noise and training flaws, which helps improve the dataset's quality and consistency. Further, feature extraction is performed to segment the WBCs, and the African Buffalo fitness is updated in the classification layer to classify the WBCs correctly. The proposed framework is modeled in Python, and the experimental analysis shows that it achieved 99.12% accuracy, 98.16% precision, 99% sensitivity, 99.04% specificity, and 99.02% F-measure. Furthermore, a comparative assessment with existing techniques validated that the proposed strategy outperforms conventional models.
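For illustration, the sketch below shows the African Buffalo Optimization update rules that a hybrid like ABCNM typically couples with a CNN: each buffalo balances a herd-wide best signal against its personal best. This is a minimal sketch under stated assumptions; the fitness function, parameter values, and the mapping to CNN hyperparameters are illustrative and not taken from the paper.

```python
import numpy as np

def abo_minimize(fitness, dim, n_buffalo=20, iters=100, lp1=0.6, lp2=0.4, lam=1.0):
    """Minimal African Buffalo Optimization sketch; all parameter values
    here are illustrative assumptions, not the paper's settings."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-1, 1, (n_buffalo, dim))   # exploration moves
    m = np.zeros((n_buffalo, dim))             # exploitation moves
    bp = w.copy()                              # each buffalo's best position
    bp_fit = np.array([fitness(x) for x in w])
    bg = bp[bp_fit.argmin()].copy()            # herd-wide best position
    for _ in range(iters):
        # democratic update: move toward the herd best and the personal best
        m = m + lp1 * (bg - w) + lp2 * (bp - w)
        w = (w + m) / lam
        fit = np.array([fitness(x) for x in w])
        improved = fit < bp_fit
        bp[improved], bp_fit[improved] = w[improved], fit[improved]
        bg = bp[bp_fit.argmin()].copy()
    return bg, bp_fit.min()

# Example: minimize a toy surrogate loss over two hypothetical
# classifier hyperparameters (stand-in for the classification-layer fitness).
best, loss = abo_minimize(lambda x: np.sum(x ** 2), dim=2)
```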
{"title":"Optimization-based convolutional neural model for the classification of white blood cells","authors":"Tulasi Gayatri Devi, Nagamma Patil","doi":"10.1186/s40537-024-00949-y","DOIUrl":"https://doi.org/10.1186/s40537-024-00949-y","url":null,"abstract":"<p>White blood cells (WBCs) are one of the most significant parts of the human immune system, and they play a crucial role in diagnosing the characteristics of pathologists and blood-related diseases. The characteristics of WBCs are well-defined based on the morphological behavior of their nuclei, and the number and types of WBCs can often determine the presence of diseases or illnesses. Generally, there are different types of WBCs, and the accurate classification of WBCs helps in proper diagnosis and treatment. Although various classification models were developed in the past, they face issues like less classification accuracy, high error rate, and large execution. Hence, a novel classification strategy named the African Buffalo-based Convolutional Neural Model (ABCNM) is proposed to classify the types of WBCs accurately. The proposed strategy commences with collecting WBC sample databases, which are preprocessed and trained into the system for classification. The preprocessing phase removes the noises and training flaws, which helps improve the dataset's quality and consistency. Further, feature extraction is performed to segment the WBCs, and African Buffalo fitness is updated in the classification layer for the correct classification of WBCs. The proposed framework is modeled in Python, and the experimental analysis depicts that it achieved 99.12% accuracy, 98.16% precision, 99% sensitivity, 99.04% specificity, and 99.02% f-measure. Furthermore, a comparative assessment with the existing techniques validated that the proposed strategy obtained better performances than the conventional models.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"38 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141503493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feature reduction for hepatocellular carcinoma prediction using machine learning algorithms
Pub Date : 2024-06-18 DOI: 10.1186/s40537-024-00944-3
Ghada Mostafa, Hamdi Mahmoud, Tarek Abd El-Hafeez, Mohamed E. ElAraby
Hepatocellular carcinoma (HCC) is a highly prevalent form of liver cancer that necessitates accurate prediction models for early diagnosis and effective treatment. Machine learning algorithms have demonstrated promising results in various medical domains, including cancer prediction. In this study, we propose a comprehensive approach for HCC prediction by comparing the performance of different machine learning algorithms before and after applying feature reduction methods. We employ popular feature reduction techniques, such as feature weighting, hidden-feature correlation, feature selection, and optimized selection, to extract a reduced feature subset that captures the most relevant information related to HCC. Subsequently, we apply multiple algorithms, including Naive Bayes, support vector machines (SVM), neural networks, decision trees, and k-nearest neighbors (KNN), to both the original high-dimensional dataset and the reduced feature set. By comparing the predictive accuracy, precision, F-score, recall, and execution time of each algorithm, we assess the effectiveness of feature reduction in enhancing the performance of HCC prediction models. Our experimental results, obtained using a comprehensive dataset comprising clinical features of HCC patients, demonstrate that feature reduction significantly improves the performance of all examined algorithms. Notably, the reduced feature set consistently outperforms the original high-dimensional dataset in terms of prediction accuracy and execution time. After applying feature reduction techniques, the employed algorithms, namely decision trees, Naive Bayes, KNN, neural networks, and SVM, achieved accuracies of 96%, 97.33%, 94.67%, 96%, and 96%, respectively.
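As a rough illustration of the before/after comparison described above, the sketch below cross-validates several of the named classifiers on the full feature set and on a reduced subset, timing each run. The stand-in dataset, the univariate filter (SelectKBest), and the subset size are assumptions; the paper's own reduction techniques (feature weighting, hidden-feature correlation, optimized selection) are not reproduced.

```python
import time
from sklearn.datasets import load_breast_cancer  # stand-in clinical dataset
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}
for name, model in models.items():
    for k in (X.shape[1], 10):  # full feature set vs. an assumed reduced subset
        pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=k), model)
        t0 = time.time()
        acc = cross_val_score(pipe, X, y, cv=5).mean()
        print(f"{name:13s} k={k:2d} acc={acc:.3f} time={time.time() - t0:.2f}s")
```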
{"title":"Feature reduction for hepatocellular carcinoma prediction using machine learning algorithms","authors":"Ghada Mostafa, Hamdi Mahmoud, Tarek Abd El-Hafeez, Mohamed E. ElAraby","doi":"10.1186/s40537-024-00944-3","DOIUrl":"https://doi.org/10.1186/s40537-024-00944-3","url":null,"abstract":"<p>Hepatocellular carcinoma (HCC) is a highly prevalent form of liver cancer that necessitates accurate prediction models for early diagnosis and effective treatment. Machine learning algorithms have demonstrated promising results in various medical domains, including cancer prediction. In this study, we propose a comprehensive approach for HCC prediction by comparing the performance of different machine learning algorithms before and after applying feature reduction methods. We employ popular feature reduction techniques, such as weighting features, hidden features correlation, feature selection, and optimized selection, to extract a reduced feature subset that captures the most relevant information related to HCC. Subsequently, we apply multiple algorithms, including Naive Bayes, support vector machines (SVM), Neural Networks, Decision Tree, and K nearest neighbors (KNN), to both the original high-dimensional dataset and the reduced feature set. By comparing the predictive accuracy, precision, F Score, recall, and execution time of each algorithm, we assess the effectiveness of feature reduction in enhancing the performance of HCC prediction models. Our experimental results, obtained using a comprehensive dataset comprising clinical features of HCC patients, demonstrate that feature reduction significantly improves the performance of all examined algorithms. Notably, the reduced feature set consistently outperforms the original high-dimensional dataset in terms of prediction accuracy and execution time. After applying feature reduction techniques, the employed algorithms, namely decision trees, Naive Bayes, KNN, neural networks, and SVM achieved accuracies of 96%, 97.33%, 94.67%, 96%, and 96.00%, respectively.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"22 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141503495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advanced RIME architecture for global optimization and feature selection
Pub Date : 2024-06-18 DOI: 10.1186/s40537-024-00931-8
Ruba Abu Khurma, Malik Braik, Abdullah Alzaqebah, Krishna Gopal Dhal, Robertas Damaševičius, Bilal Abu-Salih
The article introduces an innovative approach to global optimization and feature selection (FS) using the RIME algorithm, inspired by rime-ice formation. The RIME algorithm employs a soft-rime search strategy and a hard-rime puncture mechanism, along with an improved positive greedy selection mechanism, to avoid getting trapped in local optima and to enhance its overall search capabilities. The article also introduces binary modified RIME (mRIME), a binary adaptation of the RIME algorithm that addresses the unique challenges posed by FS problems, which typically involve binary search spaces. Four types of transfer functions (TFs) were selected for the FS variant, and their efficacy was investigated on the CEC2011 and CEC2017 global-optimization benchmarks and on FS tasks related to disease diagnosis. The proposed mRIME was compared against ten well-established optimization algorithms. The advanced RIME architecture demonstrated superior performance in global optimization and FS tasks, providing an effective solution to complex optimization problems in various domains.
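Transfer functions are the bridge between RIME's continuous search space and the binary feature masks that FS requires. The sketch below shows the usual S-shaped and V-shaped variants; which four TFs the paper evaluates is not specified here, so these are representative choices, not the paper's exact set.

```python
import numpy as np

rng = np.random.default_rng(0)

def s_shaped(x):
    """S-shaped transfer function (sigmoid)."""
    return 1.0 / (1.0 + np.exp(-x))

def v_shaped(x):
    """V-shaped transfer function (|tanh|)."""
    return np.abs(np.tanh(x))

def binarize(position, tf):
    """Map a continuous optimizer position to a 0/1 feature mask by
    sampling against the transfer-function probability."""
    prob = tf(position)
    return (rng.random(position.shape) < prob).astype(int)

x = rng.normal(size=8)        # continuous position from the optimizer
mask = binarize(x, s_shaped)  # 1 = feature selected, 0 = feature dropped
print(mask)
```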
{"title":"Advanced RIME architecture for global optimization and feature selection","authors":"Ruba Abu Khurma, Malik Braik, Abdullah Alzaqebah, Krishna Gopal Dhal, Robertas Damaševičius, Bilal Abu-Salih","doi":"10.1186/s40537-024-00931-8","DOIUrl":"https://doi.org/10.1186/s40537-024-00931-8","url":null,"abstract":"<p>The article introduces an innovative approach to global optimization and feature selection (FS) using the RIME algorithm, inspired by RIME-ice formation. The RIME algorithm employs a soft-RIME search strategy and a hard-RIME puncture mechanism, along with an improved positive greedy selection mechanism, to resist getting trapped in local optima and enhance its overall search capabilities. The article also introduces Binary modified RIME (mRIME), a binary adaptation of the RIME algorithm to address the unique challenges posed by FS problems, which typically involve binary search spaces. Four different types of transfer functions (TFs) were selected for FS issues, and their efficacy was investigated for global optimization using CEC2011 and CEC2017 and FS tasks related to disease diagnosis. The results of the proposed mRIME were tested on ten reliable optimization algorithms. The advanced RIME architecture demonstrated superior performance in global optimization and FS tasks, providing an effective solution to complex optimization problems in various domains.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"22 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141503494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PoLYTC: a novel BERT-based classifier to detect political leaning of YouTube videos based on their titles
Pub Date : 2024-06-05 DOI: 10.1186/s40537-024-00946-1
Nouar AlDahoul, Talal Rahwan, Yasir Zaki
Over two-thirds of the U.S. population uses YouTube, and a quarter of U.S. adults regularly receive their news from it. Despite the vast amount of political content available on the platform, to date, no classifier has been proposed to determine the political leaning of YouTube videos. The only exception is a classifier that requires extensive information about each video (rather than just the title) and classifies videos into just three classes (rather than the widely used categorization into six classes). To fill this gap, “PoLYTC” (Political Leaning YouTube Classifier) is proposed to classify YouTube videos into six political classes based on their titles. PoLYTC utilizes a large language model, namely BERT, and is fine-tuned on a public dataset of 11.5 million YouTube videos. Experiments reveal that the proposed solution achieves high accuracy (75%) and a high F1-score (77%), thereby outperforming the state of the art. To further validate the solution’s classification performance, several videos were collected from the YouTube channels of prominent news agencies with widely known political leanings, such as Fox News and The New York Times. These videos were classified based on their titles, and in the vast majority of cases the predicted political leaning matched that of the news agency. PoLYTC can help YouTube users make informed decisions about which videos to watch and can help researchers analyze the political content on YouTube.
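The inference pipeline such a title classifier implies is sketched below with Hugging Face Transformers. The checkpoint name and the six class labels are placeholder assumptions, and the classification head here is untrained: the actual PoLYTC weights and its fine-tuning on the 11.5-million-video dataset are not reproduced.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical six-class label set; the paper's exact taxonomy may differ.
LABELS = ["far left", "left", "center left", "center right", "right", "far right"]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# The sequence-classification head is randomly initialized here; real use
# requires fine-tuning it on labeled titles, as the paper describes.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

def classify_title(title: str) -> str:
    inputs = tok(title, truncation=True, max_length=64, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(classify_title("Senate passes sweeping new budget bill"))
```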
{"title":"PoLYTC: a novel BERT-based classifier to detect political leaning of YouTube videos based on their titles","authors":"Nouar AlDahoul, Talal Rahwan, Yasir Zaki","doi":"10.1186/s40537-024-00946-1","DOIUrl":"https://doi.org/10.1186/s40537-024-00946-1","url":null,"abstract":"<p>Over two-thirds of the U.S. population uses YouTube, and a quarter of U.S. adults regularly receive their news from it. Despite the massive political content available on the platform, to date, no classifier has been proposed to classify the political leaning of YouTube videos. The only exception is a classifier that requires extensive information about each video (rather than just the title) and classifies the videos into just three classes (rather than the widely-used categorization into six classes). To fill this gap, “PoLYTC” (Political Leaning YouTube Classifier) is proposed to classify YouTube videos based on their titles into six political classes. PoLYTC utilizes a large language model, namely BERT, and is fine-tuned on a public dataset of 11.5 million YouTube videos. Experiments reveal that the proposed solution achieves high accuracy (75%) and high F1-score (77%), thereby outperforming the state of the art. To further validate the solution’s classification performance, several videos were collected from numerous prominent news agencies’ YouTube channels, such as Fox News and The New York Times, which have widely known political leanings. These videos were classified based on their titles, and the results have shown that, in the vast majority of cases, the predicted political leaning matches that of the news agency. PoLYTC can help YouTube users make informed decisions about which videos to watch and can help researchers analyze the political content on YouTube.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"74 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141520246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GB-AFS: graph-based automatic feature selection for multi-class classification via Mean Simplified Silhouette
Pub Date : 2024-05-31 DOI: 10.1186/s40537-024-00934-5
David Levin, Gonen Singer
This paper introduces a novel graph-based filter method for automatic feature selection (abbreviated as GB-AFS) for multi-class classification tasks. The method determines the minimum combination of features required to sustain prediction performance while maintaining complementary discriminating abilities between different classes. It does not require any user-defined parameters such as the number of features to select. The minimum number of features is selected using our newly developed Mean Simplified Silhouette (abbreviated as MSS) index, designed to evaluate the clustering results for the feature selection task. To illustrate the effectiveness and generality of the method, we applied the GB-AFS method using various combinations of statistical measures and dimensionality reduction techniques. The experimental results demonstrate the superior performance of the proposed GB-AFS over other filter-based techniques and automatic feature selection approaches, and demonstrate that the GB-AFS method is independent of the statistical measure or the dimensionality reduction technique chosen by the user. Moreover, the proposed method maintained the accuracy achieved when utilizing all features while using only 7-30% of the original features. This resulted in average time savings ranging from 15% for the smallest dataset to 70% for the largest. Our code is available at https://github.com/davidlevinwork/gbfs/.
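For reference, the sketch below computes the classic simplified silhouette averaged over all points, the quantity the MSS index builds on: a(i) is the distance from a point to its own cluster centroid and b(i) the distance to the nearest other centroid. The paper's specific MSS adaptation for the feature-selection setting is not reproduced here.

```python
import numpy as np

def mean_simplified_silhouette(X, labels, centroids):
    """Classic simplified silhouette, averaged over all points. The MSS
    index of the paper adapts this idea to score clusterings used for
    feature selection; that adaptation is an assumption not shown here."""
    # Pairwise distances from every point to every centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    a = d[np.arange(len(X)), labels]          # distance to own centroid
    d_other = d.copy()
    d_other[np.arange(len(X)), labels] = np.inf
    b = d_other.min(axis=1)                   # distance to nearest other centroid
    denom = np.maximum(np.maximum(a, b), 1e-12)
    return float(np.mean((b - a) / denom))

# Toy usage: two well-separated clusters give a score close to 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.05, 0.0], [5.05, 5.0]])
print(mean_simplified_silhouette(X, labels, centroids))
```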
{"title":"GB-AFS: graph-based automatic feature selection for multi-class classification via Mean Simplified Silhouette","authors":"David Levin, Gonen Singer","doi":"10.1186/s40537-024-00934-5","DOIUrl":"https://doi.org/10.1186/s40537-024-00934-5","url":null,"abstract":"<p>This paper introduces a novel graph-based filter method for automatic feature selection (abbreviated as GB-AFS) for multi-class classification tasks. The method determines the minimum combination of features required to sustain prediction performance while maintaining complementary discriminating abilities between different classes. It does not require any user-defined parameters such as the number of features to select. The minimum number of features is selected using our newly developed Mean Simplified Silhouette (abbreviated as MSS) index, designed to evaluate the clustering results for the feature selection task. To illustrate the effectiveness and generality of the method, we applied the GB-AFS method using various combinations of statistical measures and dimensionality reduction techniques. The experimental results demonstrate the superior performance of the proposed GB-AFS over other filter-based techniques and automatic feature selection approaches, and demonstrate that the GB-AFS method is independent of the statistical measure or the dimensionality reduction technique chosen by the user. Moreover, the proposed method maintained the accuracy achieved when utilizing all features while using only 7–<span>(30%)</span> of the original features. This resulted in an average time saving ranging from <span>(15%)</span> for the smallest dataset to <span>(70%)</span> for the largest. Our code is available at https://github.com/davidlevinwork/gbfs/.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"117 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141192655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integration of feature enhancement technique in Google inception network for breast cancer detection and classification
Pub Date : 2024-05-28 DOI: 10.1186/s40537-024-00936-3
Wasyihun Sema Admass, Yirga Yayeh Munaye, Ayodeji Olalekan Salau
Breast cancer is a major public health concern, and early detection and classification are essential for improving patient outcomes. However, malignant breast tumors can be difficult to distinguish from benign ones, leading to high false positive rates in screening: both benign and malignant tumors lack a consistent shape, occur at the same positions, vary in size, and are highly correlated. The ambiguity of the correlation challenges computer-aided systems, and the inconsistency of morphology challenges experts in identifying and classifying what is positive and what is negative. As a result, breast cancer screening is often prone to false positives. This paper introduces a feature enhancement method into the Google Inception network for breast cancer detection and classification. The proposed model preserves both local and global information, which is important for addressing the variability of breast tumor morphology and its complex correlations. A locality-preserving projection transformation function is introduced to retain local information that might be lost in the intermediate output of the Inception model. Additionally, transfer learning is used to improve the performance of the proposed model on limited datasets. The proposed model is evaluated on a dataset of ultrasound images and achieves an accuracy of 99.81%, recall of 96.48%, and sensitivity of 93.0%. These results demonstrate the effectiveness of the proposed method for breast cancer detection and classification.
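A minimal sketch of Locality Preserving Projections (LPP), the kind of locality-preserving transform the paper injects into the Inception pipeline, is given below. The neighborhood size and heat-kernel width are assumptions, and the integration with the network's intermediate feature maps is not shown.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def lpp(X, n_components=2, k=5, t=1.0):
    """Locality Preserving Projections sketch; k and t are illustrative
    assumptions, not the paper's settings."""
    # Heat-kernel affinities over a k-nearest-neighbor graph.
    W = kneighbors_graph(X, k, mode="distance", include_self=False).toarray()
    W = np.where(W > 0, np.exp(-W ** 2 / t), 0.0)
    W = np.maximum(W, W.T)                        # symmetrize the graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                     # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-9 * np.eye(X.shape[1])   # regularize for stability
    # Generalized eigenproblem; smallest eigenvalues preserve locality.
    vals, vecs = eigh(A, B)
    return X @ vecs[:, :n_components]

# Toy usage on random feature vectors standing in for intermediate features.
Y = lpp(np.random.default_rng(0).normal(size=(50, 10)))
```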
{"title":"Integration of feature enhancement technique in Google inception network for breast cancer detection and classification","authors":"Wasyihun Sema Admass, Yirga Yayeh Munaye, Ayodeji Olalekan Salau","doi":"10.1186/s40537-024-00936-3","DOIUrl":"https://doi.org/10.1186/s40537-024-00936-3","url":null,"abstract":"<p>Breast cancer is a major public health concern, and early detection and classification are essential for improving patient outcomes. However, breast tumors can be difficult to distinguish from benign tumors, leading to high false positive rates in screening. The reason is that both benign and malignant tumors have no consistent shape, are found at the same position, have variable sizes, and have high correlations. The ambiguity of the correlation challenges the computer-aided system, and the inconsistency of morphology challenges an expert in identifying and classifying what is positive and what is negative. Due to this, most of the time, breast cancer screen is prone to false positive rates. This research paper presents the introduction of a feature enhancement method into the Google inception network for breast cancer detection and classification. The proposed model preserves both local and global information, which is important for addressing the variability of breast tumor morphology and their complex correlations. A locally preserving projection transformation function is introduced to retain local information that might be lost in the intermediate output of the inception model. Additionally, transfer learning is used to improve the performance of the proposed model on limited datasets. The proposed model is evaluated on a dataset of ultrasound images and achieves an accuracy of 99.81%, recall of 96.48%, and sensitivity of 93.0%. These results demonstrate the effectiveness of the proposed method for breast cancer detection and classification.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"29 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficiently approaching vertical federated learning by combining data reduction and conditional computation techniques
Pub Date : 2024-05-28 DOI: 10.1186/s40537-024-00933-6
Francesco Folino, Gianluigi Folino, Francesco Sergio Pisani, Luigi Pontieri, Pietro Sabatino
In this paper, a framework based on a sparse Mixture of Experts (MoE) architecture is proposed for the federated learning and application of a distributed classification model in domains (such as cybersecurity and healthcare) where different parties of the federation store different subsets of features for a number of data instances. The framework is designed to limit the risk of information leakage and the computation/communication costs of both model training (through data sampling) and application (by leveraging the conditional-computation abilities of sparse MoEs). Experiments on real data show that the proposed approach ensures a better balance between efficiency and model accuracy than other vertical federated learning (VFL) solutions. Notably, in a real-life cybersecurity case study focused on malware classification (the KronoDroid dataset), the proposed method surpasses its competitors even though it uses only 50% and 75% of the training set, which the competing approaches use in full. It reduces the false positive rate by 16.9% and 18.2%, respectively, and also delivers satisfactory results on the other evaluation metrics. These results showcase the framework’s potential to significantly enhance cybersecurity threat detection and prevention in a collaborative yet secure manner.
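The conditional-computation idea is easiest to see in a top-k gated MoE layer: only k experts run per input, so a party holding only some features can skip most of the computation. The sketch below is illustrative; the layer sizes, k, and the federated routing logic are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Top-k sparse Mixture-of-Experts layer: per input, only k experts
    are executed, which is the conditional computation the framework
    leverages. All sizes here are illustrative assumptions."""
    def __init__(self, d_in=16, d_out=8, n_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gate = nn.Linear(d_in, n_experts)
        self.k = k

    def forward(self, x):
        scores = self.gate(x)                       # (batch, n_experts)
        topv, topi = scores.topk(self.k, dim=-1)    # keep only k experts
        weights = torch.softmax(topv, dim=-1)
        out = torch.zeros(x.size(0), self.experts[0].out_features)
        for slot in range(self.k):                  # run only selected experts
            for e, expert in enumerate(self.experts):
                sel = topi[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot, None] * expert(x[sel])
        return out

moe = SparseMoE()
y = moe(torch.randn(5, 16))   # each of the 5 inputs touches only 2 experts
```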
{"title":"Efficiently approaching vertical federated learning by combining data reduction and conditional computation techniques","authors":"Francesco Folino, Gianluigi Folino, Francesco Sergio Pisani, Luigi Pontieri, Pietro Sabatino","doi":"10.1186/s40537-024-00933-6","DOIUrl":"https://doi.org/10.1186/s40537-024-00933-6","url":null,"abstract":"<p>In this paper, a framework based on a sparse Mixture of Experts (MoE) architecture is proposed for the federated learning and application of a distributed classification model in domains (like cybersecurity and healthcare) where different parties of the federation store different subsets of features for a number of data instances. The framework is designed to limit the risk of information leakage and computation/communication costs in both model training (through data sampling) and application (leveraging the conditional-computation abilities of sparse MoEs). Experiments on real data have shown the proposed approach to ensure a better balance between efficiency and model accuracy, compared to other VFL-based solutions. Notably, in a real-life cybersecurity case study focused on malware classification (the KronoDroid dataset), the proposed method surpasses competitors even though it utilizes only 50% and 75% of the training set, which is fully utilized by the other approaches in the competition. This method achieves reductions in the rate of false positives by 16.9% and 18.2%, respectively, and also delivers satisfactory results on the other evaluation metrics. These results showcase our framework’s potential to significantly enhance cybersecurity threat detection and prevention in a collaborative yet secure manner.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"23 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
15 years of Big Data: a systematic literature review
Pub Date : 2024-05-14 DOI: 10.1186/s40537-024-00914-9
Davide Tosi, Redon Kokaj, Marco Roccetti
Big Data continues to gain attention as a fundamental building block of the Artificial Intelligence and Machine Learning world, and a great deal of effort has therefore been invested in Big Data research over the last 15 years. The objective of this Systematic Literature Review is to summarize the state of the art of those 15 years of Big Data research by answering a set of research questions concerning the main application domains of Big Data analytics; the significant challenges and limitations researchers have encountered in Big Data analysis; and emerging research trends and future directions in Big Data. The review follows a predefined procedure that automatically searches five well-known digital libraries. After applying the selection criteria to the results, 189 primary studies were identified as relevant, of which 32 were Systematic Literature Reviews. The required information was extracted from these 32 studies and summarized. Our Systematic Literature Review sketches the picture of 15 years of research in Big Data, identifying application domains, challenges, and future directions in this research field. We believe that a substantial amount of work remains to be done to align and seamlessly integrate Big Data into the data-driven advanced software solutions of the future.
{"title":"15 years of Big Data: a systematic literature review","authors":"Davide Tosi, Redon Kokaj, Marco Roccetti","doi":"10.1186/s40537-024-00914-9","DOIUrl":"https://doi.org/10.1186/s40537-024-00914-9","url":null,"abstract":"<p>Big Data is still gaining attention as a fundamental building block of the Artificial Intelligence and Machine Learning world. Therefore, a lot of effort has been pushed into Big Data research in the last 15 years. The objective of this Systematic Literature Review is to summarize the current state of the art of the previous 15 years of research about Big Data by providing answers to a set of research questions related to the main application domains for Big Data analytics; the significant challenges and limitations researchers have encountered in Big Data analysis, and emerging research trends and future directions in Big Data. The review follows a predefined procedure that automatically searches five well-known digital libraries. After applying the selection criteria to the results, 189 primary studies were identified as relevant, of which 32 were Systematic Literature Reviews. Required information was extracted from the 32 studies and summarized. Our Systematic Literature Review sketched the picture of 15 years of research in Big Data, identifying application domains, challenges, and future directions in this research field. We believe that a substantial amount of work remains to be done to align and seamlessly integrate Big Data into data-driven advanced software solutions of the future.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"100 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140931958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Skyline query under multidimensional incomplete data based on classification tree
Pub Date : 2024-05-12 DOI: 10.1186/s40537-024-00923-8
Dengke Yuan, Liping Zhang, Song Li, Guanglu Sun
Existing skyline queries over multidimensional incomplete data must process a large amount of useless data, which leads to low query efficiency and poor algorithm performance. To address this problem, a skyline query method for multidimensional incomplete data based on a classification tree is proposed. The method consists of two main parts. The first part is the proposed weighted classification tree for incomplete data: the incomplete data set is classified using this tree, and the classified data serve as the basis for the second step of the query. The second part is a skyline query algorithm for multidimensional incomplete data. The concept of optimal virtual points is introduced, which effectively reduces the number of comparisons over large amounts of data and thereby improves query efficiency for incomplete data. Theoretical analysis and experiments show that the proposed method performs skyline queries over multidimensional incomplete data well, with high query efficiency and accuracy.
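The core primitive of any skyline query is the dominance test, which incomplete data complicates because tuples can only be compared on dimensions both observe. The sketch below uses one common convention (NaN marks a missing value; smaller is better) as an illustration rather than the paper's exact rule, and omits the classification-tree pruning and virtual-point optimization.

```python
import numpy as np

def dominates(p, q):
    """p dominates q if, on the dimensions observed in both tuples, p is
    no worse everywhere and strictly better somewhere. This is a common
    convention for incomplete data, used here only as an illustration."""
    shared = ~(np.isnan(p) | np.isnan(q))
    if not shared.any():
        return False
    return bool(np.all(p[shared] <= q[shared]) and np.any(p[shared] < q[shared]))

def skyline(points):
    """Naive O(n^2) skyline: keep every point no other point dominates."""
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if i != j)]

pts = [np.array([1.0, np.nan, 3.0]),
       np.array([2.0, 2.0, np.nan]),
       np.array([1.0, 1.0, 1.0])]
print(skyline(pts))   # only [1, 1, 1] survives
```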
{"title":"Skyline query under multidimensional incomplete data based on classification tree","authors":"Dengke Yuan, Liping Zhang, Song Li, Guanglu Sun","doi":"10.1186/s40537-024-00923-8","DOIUrl":"https://doi.org/10.1186/s40537-024-00923-8","url":null,"abstract":"<p>A method for skyline query of multidimensional incomplete data based on a classification tree has been proposed to address the problem of a large amount of useless data in existing skyline queries with multidimensional incomplete data, which leads to low query efficiency and algorithm performance. This method consists of two main parts. The first part is the proposed incomplete data weighted classification tree algorithm. In the first part, an incomplete data weighted classification tree is proposed, and the incomplete data set is classified using this tree. The data classified in the first part serves as the basis for the second step of the query. The second part proposes a skyline query algorithm for multidimensional incomplete data. The concept of optimal virtual points has been recently introduced, effectively reducing the number of comparisons of a large amount of data, thereby improving the query efficiency for incomplete data. Theoretical research and experimental analysis have shown that the proposed method can perform skyline queries for multidimensional incomplete data well, with high query efficiency and accuracy of the algorithm.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"147 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140931881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting air quality index using attention hybrid deep learning and quantum-inspired particle swarm optimization
Pub Date : 2024-05-11 DOI: 10.1186/s40537-024-00926-5
Anh Tuan Nguyen, Duy Hoang Pham, Bee Lan Oo, Yonghan Ahn, Benson T. H. Lim
Air pollution poses a significant threat to the environment and human well-being. The air quality index (AQI) is an important measure of air pollution that describes the degree of pollution and its impact on health. Accurate and reliable prediction of the AQI is therefore critical but challenging due to the non-linear and stochastic nature of air particles. This research proposes a hybrid deep learning model for AQI prediction based on Attention Convolutional Neural Networks (ACNN), the Autoregressive Integrated Moving Average (ARIMA) model, Quantum Particle Swarm Optimization (QPSO)-enhanced Long Short-Term Memory (LSTM), and XGBoost. Daily air quality data were collected from the official Seoul Air registry for the period 2021 to 2022. The data were first passed through the ARIMA model to capture and fit the linear part of the series, followed by a hybrid deep learning architecture, developed in a pretraining-finetuning framework, for the non-linear part. This hybrid model first used convolution to extract deep features from the original air quality data, then used QPSO to optimize the hyperparameters of the LSTM network for mining long-term time-series features, and finally adopted the XGBoost model to fine-tune the AQI prediction. The robustness and reliability of the resulting model were assessed against other widely used models and across meteorological stations. Our proposed model achieves up to a 31.13% reduction in MSE, a 19.03% reduction in MAE, and a 2% improvement in R-squared compared to the best conventional model, indicating a much stronger relationship between predicted and actual values. The overall results show that the attentive hybrid deep quantum-inspired Particle Swarm Optimization model is more feasible and efficient for predicting the air quality index at both city-wide and station-specific levels.
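The first stage of this decomposition is easy to sketch: ARIMA fits the linear structure of the AQI series, and its residuals are handed to the deep (CNN/QPSO-LSTM/XGBoost) stage. The synthetic series and the (p, d, q) order below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily AQI stand-in; the real input is the Seoul Air registry data.
rng = np.random.default_rng(0)
aqi = pd.Series(50 + np.cumsum(rng.normal(0, 2, 365)),
                index=pd.date_range("2021-01-01", periods=365, freq="D"))

# ARIMA captures the linear part; the order (2, 1, 2) is an assumption.
arima = ARIMA(aqi, order=(2, 1, 2)).fit()
linear_part = arima.predict(start=1, end=len(aqi) - 1)
residuals = arima.resid   # non-linear remainder, to be modeled by the deep stage

# Final forecast = ARIMA forecast + the deep model's residual prediction.
print(arima.forecast(steps=7))
```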
{"title":"Predicting air quality index using attention hybrid deep learning and quantum-inspired particle swarm optimization","authors":"Anh Tuan Nguyen, Duy Hoang Pham, Bee Lan Oo, Yonghan Ahn, Benson T. H. Lim","doi":"10.1186/s40537-024-00926-5","DOIUrl":"https://doi.org/10.1186/s40537-024-00926-5","url":null,"abstract":"<p>Air pollution poses a significant threat to the health of the environment and human well-being. The air quality index (AQI) is an important measure of air pollution that describes the degree of air pollution and its impact on health. Therefore, accurate and reliable prediction of the AQI is critical but challenging due to the non-linearity and stochastic nature of air particles. This research aims to propose an AQI prediction hybrid deep learning model based on the Attention Convolutional Neural Networks (ACNN), Autoregressive Integrated Moving Average (ARIMA), Quantum Particle Swarm Optimization (QPSO)-enhanced-Long Short-Term Memory (LSTM) and XGBoost modelling techniques. Daily air quality data were collected from the official Seoul Air registry for the period 2021 to 2022. The data were first preprocessed through the ARIMA model to capture and fit the linear part of the data and followed by a hybrid deep learning architecture developed in the pretraining–finetuning framework for the non-linear part of the data. This hybrid model first used convolution to extract the deep features of the original air quality data, and then used the QPSO to optimize the hyperparameter for LSTM network for mining the long-terms time series features, and the XGBoost model was adopted to fine-tune the final AQI prediction model. The robustness and reliability of the resulting model were assessed and compared with other widely used models and across meteorological stations. Our proposed model achieves up to 31.13% reduction in MSE, 19.03% reduction in MAE and 2% improvement in R-squared compared to the best appropriate conventional model, indicating a much stronger magnitude of relationships between predicted and actual values. The overall results show that the attentive hybrid deep Quantum inspired Particle Swarm Optimization model is more feasible and efficient in predicting air quality index at both city-wide and station-specific levels.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"12 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140931880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}