The pay-what-you-use model of serverless Cloud computing (or serverless, for short) offers significant benefits to users. This computing paradigm is ideal for short-running, ephemeral tasks; however, it is not suitable for stateful, long-running tasks such as complex data analytics and query processing. We propose FunDa, an on-premises serverless data analytics framework that extends DaskDB, our previously proposed system for unified data analytics and in situ SQL query processing. Unlike existing serverless solutions, which struggle with stateful, long-running data analytics tasks, FunDa overcomes their limitations. Our ongoing research focuses on developing a robust architecture for FunDa that enables true serverless operation in on-premises environments while remaining able to run on a public Cloud such as AWS. We have evaluated our system on several benchmarks at different scale factors. Our experimental results in both on-premises and AWS Cloud settings demonstrate FunDa's ability to support automatic scaling, low-latency execution of data analytics workloads, and greater flexibility for serverless users.
{"title":"<i>F</i>u<i>n</i>Da: scalable serverless data analytics and in situ query processing.","authors":"Elyes Lounissi, Suvam Kumar Das, Ronnit Peter, Xiaozheng Zhang, Suprio Ray, Lianyin Jia","doi":"10.1186/s40537-025-01141-6","DOIUrl":"https://doi.org/10.1186/s40537-025-01141-6","url":null,"abstract":"<p><p>The pay-what-you-use model of serverless Cloud computing (or serverless, for short) offers significant benefits to the users. This computing paradigm is ideal for short running ephemeral tasks, however, it is not suitable for stateful long running tasks, such as complex data analytics and query processing. We propose <i>F</i>u<i>n</i>Da, an on-premises serverless data analytics framework, which extends our previously proposed system for unified data analytics and in situ SQL query processing called DaskDB. Unlike existing serverless solutions, which struggle with stateful and long running data analytics tasks, <i>F</i>u<i>n</i>Da overcomes their limitations. Our ongoing research focuses on developing a robust architecture for <i>F</i>u<i>n</i>Da, enabling true serverless in on-premises environments, while being able to operate on a public Cloud, such as AWS Cloud. We have evaluated our system on several benchmarks with different scale factors. Our experimental results in both on-premises and AWS Cloud settings demonstrate <i>F</i>u<i>n</i>Da's ability to support automatic scaling, low-latency execution of data analytics workloads, and more flexibility to serverless users.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"12 1","pages":"116"},"PeriodicalIF":8.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12064580/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143991480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01. Epub Date: 2025-11-17. DOI: 10.1186/s40537-025-01307-2
Abdulrauf A Gidado, C I Ezeife
To date, most large corporations still have their core solutions on relational databases and use non-relational (i.e., NoSQL) database management systems (DBMSs) only for non-core systems that favour availability and scalability through partitioning while trading off consistency. NoSQL systems are built on the CAP (Consistency, Availability, and Partition tolerance) theorem, trading off one of these properties while maintaining the others. The need for system availability and scalability drives the use of NoSQL, while the lack of the consistency and robust query engines obtainable in relational databases impedes it. To mitigate these drawbacks, researchers and companies like Amazon, Google, and Facebook run 'SQL over NoSQL' systems such as Dynamo, Google's Spanner, Memcache, Zidian, Apache Hive, and SparkSQL. These systems create a query-engine layer over NoSQL systems but suffer from data redundancy and lack the consistency obtainable in relational DBMSs. Moreover, their query engines are not relationally complete, because they cannot process all relational-algebra-based queries as a relational database can. In this paper, we present a 'Unique NoSQL over SQL Database' (UniqueNOSD) system, an extension of NOSD and an inverse of existing approaches. This approach is motivated by the need for existing systems to fully deploy NoSQL data store functionalities without the limitation of building an extra SQL layer for querying. To allow appropriate storage and retrieval of data on document-based NoSQL databases without data redundancy and inconsistency, while encouraging both horizontal and vertical partitioning, we propose the NoSQL over SQL Block as a Value (BaaV) data storage strategy. Unlike the relational database model, where a relation is represented as R(K, A1, …, An), with a key attribute K that is the primary key for the set of attributes A1, …, An of the relation, in BaaV a relation is represented as a tuple (K, B), where K means key and B means block. We represent a relation as R(K, r1, …, rn), with a key attribute K and a set of n relations (i.e., ri) called blocks B; each ri contains a set of its own attributes and is denoted ri(k, a1, …, am), with a key attribute k and a set of attributes typical of a relational model. The relations ri in R of BaaV are related through foreign key relationships. Using existing benchmark systems of 'SQL over NoSQL', relational databases, and real-life datasets for our experiments, we demonstrated that our NoSQL over SQL system outperforms existing relational databases and SQL over NoSQL systems; it is novel in ensuring data consistency, scalability, and query execution, and it improves data storage and retrieval in large database systems without data loss while enhancing the performance of NoSQL databases.
{"title":"UniqueNOSD: a novel framework for NoSQL over SQL databases.","authors":"Abdulrauf A Gidado, C I Ezeife","doi":"10.1186/s40537-025-01307-2","DOIUrl":"https://doi.org/10.1186/s40537-025-01307-2","url":null,"abstract":"<p><p>To date, most large corporations still have their core solutions on relational databases but only use non-relational (i.e. NoSQL) database management systems (DBMS) for their non-core systems that favour availability and scalability through partitioning while trading off consistency. NoSQL systems are built based on the CAP (i.e., Consistency, Availability and Partitioning) database theorem, which trades off one of these features while maintaining the others. The need for systems availability and scalability drives the use of NoSQL, while the lack of consistency and robust query engines as obtainable in relational databases, impede their usage. To mitigate these drawbacks, researchers and companies like Amazon, Google, and Facebook run 'SQL over NoSQL' systems such as Dynamo, Google's Spanner, Memcache, Zidian, Apache Hive and SparkSQL. These systems create a query engine layer over NoSQL systems but suffer from data redundancy and lack consistency obtainable in relational DBMS. Also, their query engine is not relational complete because they cannot process all relational algebra-based queries as obtainable in a relational database. In this paper, we present a 'Unique NoSQL over SQL Database' (UniqueNOSD) system, an extension of NOSD and an inverse of existing approaches. This approach is motivated by the need for existing systems to fully deploy NoSQL data store functionalities without the limitation of building an extra SQL layer for querying. To allow appropriate storage and retrieval of data on document-based NoSQL databases without data redundancy and inconsistency while encouraging both horizontal and vertical partitioning, we propose NoSQL over SQL Block as a Value ([Formula: see text]) data storage strategy. Unlike relational database model where a relation is represented as [Formula: see text], with a key attribute [Formula: see text] and [Formula: see text] is the primary key to the set of attributes [Formula: see text] of the relation, in [Formula: see text] (represented as a tuple (<i>K</i>, <i>B</i>) where <i>K</i> means key and <i>B</i> means block). We represent a relation as [Formula: see text] with a key attribute <i>K</i> and a set of <i>n</i> relations (i.e., <i>r</i>) called blocks <i>B</i> and each <i>r</i> [Formula: see text] contains a set of its own attributes and is denoted as [Formula: see text] with a key attribute <i>k</i> and a set of <i>n</i> attributes typical to a relational model. The relations [Formula: see text] in <i>R</i> of [Formula: see text] are related through foreign key relationships. 
Using existing benchmark systems of 'SQL over NoSQL', relational databases and real-life datasets for our experiments, we demonstrated that our NoSQL over SQL system outperforms existing relational databases, SQL over NoSQL systems and is novel in ensuring data consistency, scalability, query execution and improving data storage and retrieval in large database systems without data loss and enhancing improved performan","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"12 1","pages":"255"},"PeriodicalIF":6.4,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12628391/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145563910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
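To make the BaaV layout above concrete, here is a loose sketch assuming a JSON-style document store; all field names are invented for illustration, not taken from the paper:

```python
# Relational model: each entity is its own flat relation R(K, A1, ..., An).
customer_row = {"cust_id": 42, "name": "Ada", "city": "Windsor"}
order_row = {"order_id": 7, "cust_id": 42, "total": 99.5}  # cust_id is a foreign key

# BaaV: one document per key (K, B); the value B is a set of blocks,
# each block being a relation r_i(k, a1, ...) with its own key attribute.
baav_document = {
    "K": 42,
    "B": {
        "customer": {"k": 42, "name": "Ada", "city": "Windsor"},
        "orders": [{"k": 7, "cust": 42, "total": 99.5}],  # linked to customer via cust
    },
}
# Horizontal partitioning (by K) and vertical partitioning (by block) both
# fall out of this shape, which is the property the strategy aims for.
```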
Pub Date: 2024-09-18. DOI: 10.1186/s40537-024-00985-8
Esraa Hassan, Samar Elbedwehy, Mahmoud Y. Shams, Tarek Abd El-Hafeez, Nora El-Rashidy
This study introduces a novel deep learning-based approach for classifying poultry audio signals, incorporating a custom Burn Layer to enhance model robustness. The methodology integrates digital audio signal processing, convolutional neural networks (CNNs), and the innovative Burn Layer, which injects controlled random noise during training to reinforce the model's resilience to input signal variations. The proposed architecture is streamlined, with convolutional blocks, densely connected layers, dropout, and the additional Burn Layer to fortify robustness. The model is efficient, reducing trainable parameters to 191,235, compared to over 1.7 million in traditional architectures. It uses the Burn Layer, with burn intensity as a parameter, together with the Adamax optimizer to address the overfitting problem. Thorough evaluation using six standard classification metrics shows the model's superior performance: sensitivity of 96.77%, specificity of 100.00%, precision of 100.00%, negative predictive value (NPV) of 95.00%, accuracy of 98.55%, F1 score of 98.36%, and Matthews correlation coefficient (MCC) of 95.88%. This research contributes valuable insights to audio signal processing, animal health monitoring, and robust deep learning classification systems. The study presents a systematic approach for developing and evaluating a deep learning-based poultry audio classification system: it processes raw audio data and labels into digital representations, uses the Burn Layer for training variability, and constructs a CNN with convolutional blocks, pooling, and dense layers, optimized with the Adamax algorithm and trained with data augmentation and early stopping. Rigorous assessment on a test dataset using standard metrics demonstrates the model's robustness and efficiency, with the potential to significantly advance animal health monitoring and disease detection through audio signal analysis.
{"title":"Optimizing poultry audio signal classification with deep learning and burn layer fusion","authors":"Esraa Hassan, Samar Elbedwehy, Mahmoud Y. Shams, Tarek Abd El-Hafeez, Nora El-Rashidy","doi":"10.1186/s40537-024-00985-8","DOIUrl":"https://doi.org/10.1186/s40537-024-00985-8","url":null,"abstract":"<p>This study introduces a novel deep learning-based approach for classifying poultry audio signals, incorporating a custom Burn Layer to enhance model robustness. The methodology integrates digital audio signal processing, convolutional neural networks (CNNs), and the innovative Burn Layer, which injects controlled random noise during training to reinforce the model's resilience to input signal variations. The proposed architecture is streamlined, with convolutional blocks, densely connected layers, dropout, and an additional Burn Layer to fortify robustness. The model demonstrates efficiency by reducing trainable parameters to 191,235, compared to traditional architectures with over 1.7 million parameters. The proposed model utilizes a Burn Layer with burn intensity as a parameter and an Adamax optimizer to optimize and address the overfitting problem. Thorough evaluation using six standard classification metrics showcases the model's superior performance, achieving exceptional sensitivity (96.77%), specificity (100.00%), precision (100.00%), negative predictive value (NPV) (95.00%), accuracy (98.55%), F1 score (98.36%), and Matthew’s correlation coefficient (MCC) (95.88%). This research contributes valuable insights into the fields of audio signal processing, animal health monitoring, and robust deep-learning classification systems. The proposed model presents a systematic approach for developing and evaluating a deep learning-based poultry audio classification system. It processes raw audio data and labels to generate digital representations, utilizes a Burn Layer for training variability, and constructs a CNN model with convolutional blocks, pooling, and dense layers. The model is optimized using the Adamax algorithm and trained with data augmentation and early-stopping techniques. Rigorous assessment on a test dataset using standard metrics demonstrates the model's robustness and efficiency, with the potential to significantly advance animal health monitoring and disease detection through audio signal analysis.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"23 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-18. DOI: 10.1186/s40537-024-00991-w
Doaa El-Shahat, Ahmed Tolba, Mohamed Abouhawwash, Mohamed Abdel-Basset
In late 2023, the United Nations conference on climate change (COP28), held in Dubai, encouraged a quick move from fossil fuels to renewable energy. Solar energy is one of the most promising forms of energy, being both sustainable and renewable. Generally, photovoltaic systems transform solar irradiance into electricity. Unfortunately, instability and intermittency in solar radiation can lead to interruptions in electricity production. Accurate forecasting of solar irradiance helps guarantee sustainable power production even when solar irradiance is absent: batteries can store solar energy for use during such periods. Additionally, deterministic models depend on the specifications of particular PV systems and may not be accurate at low solar irradiance. This paper presents a comparative study of the most common Deep Learning (DL) and Machine Learning (ML) algorithms for short-term solar irradiance forecasting. The dataset was gathered in Islamabad over a five-year period, from 2015 to 2019, at hourly intervals with accurate meteorological sensors. Furthermore, Grid Search Cross Validation (GSCV) with five folds is applied to the ML and DL models to optimize their hyperparameters. Several performance metrics are used to assess the algorithms, including the Adjusted R2 score, Normalized Root Mean Square Error (NRMSE), Mean Absolute Deviation (MAD), Mean Absolute Error (MAE), and Mean Square Error (MSE). The statistical analysis shows that CNN-LSTM outperforms nine well-known DL counterparts with an Adjusted R2 score of 0.984. Among the ML algorithms, gradient boosting regression is an effective forecasting method with an Adjusted R2 score of 0.962, beating six rival ML models. Furthermore, SHAP and LIME, examples of explainable Artificial Intelligence (XAI), are used to understand the reasons behind the obtained results.
{"title":"Machine learning and deep learning models based grid search cross validation for short-term solar irradiance forecasting","authors":"Doaa El-Shahat, Ahmed Tolba, Mohamed Abouhawwash, Mohamed Abdel-Basset","doi":"10.1186/s40537-024-00991-w","DOIUrl":"https://doi.org/10.1186/s40537-024-00991-w","url":null,"abstract":"<p>In late 2023, the United Nations conference on climate change (COP28), which was held in Dubai, encouraged a quick move from fossil fuels to renewable energy. Solar energy is one of the most promising forms of energy that is both sustainable and renewable. Generally, photovoltaic systems transform solar irradiance into electricity. Unfortunately, instability and intermittency in solar radiation can lead to interruptions in electricity production. The accurate forecasting of solar irradiance guarantees sustainable power production even when solar irradiance is not present. Batteries can store solar energy to be used during periods of solar absence. Additionally, deterministic models take into account the specification of technical PV systems and may be not accurate for low solar irradiance. This paper presents a comparative study for the most common Deep Learning (DL) and Machine Learning (ML) algorithms employed for short-term solar irradiance forecasting. The dataset was gathered in Islamabad during a five-year period, from 2015 to 2019, at hourly intervals with accurate meteorological sensors. Furthermore, the Grid Search Cross Validation (GSCV) with five folds is introduced to ML and DL models for optimizing the hyperparameters of these models. Several performance metrics are used to assess the algorithms, such as the <i>Adjusted R</i><sup><i>2</i></sup><i> score</i>, <i>Normalized Root Mean Square Error</i> (NRMSE), <i>Mean Absolute Deviation</i> (MAD), <i>Mean Absolute Error</i> (MAE) and <i>Mean Square Error</i> (MSE). The statistical analysis shows that CNN-LSTM outperforms its counterparts of nine well-known DL models with <i>Adjusted R</i><sup><i>2</i></sup><i> score</i> value of 0.984. For ML algorithms, gradient boosting regression is an effective forecasting method with <i>Adjusted R</i><sup><i>2</i></sup><i> score</i> value of 0.962, beating its rivals of six ML models. Furthermore, SHAP and LIME are examples of explainable Artificial Intelligence (XAI) utilized for understanding the reasons behind the obtained results.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"13 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-18. DOI: 10.1186/s40537-024-00994-7
Ali Mohammed Alsaffar, Mostafa Nouri-Baygi, Hamed M. Zolbanin
The frequent usage of computer networks and the Internet has made computer networks vulnerable to numerous attacks, highlighting the critical need to enhance the precision of security mechanisms. One of the most essential measures to safeguard networking resources and infrastructures is the intrusion detection system (IDS). IDSs are widely used to detect, identify, and track malicious threats. Although various machine learning algorithms have been used successfully in IDSs, they still suffer from low prediction performance. One reason behind the low accuracy of IDSs is that existing network traffic datasets have high computational complexity, caused mainly by redundant, incomplete, and irrelevant features. Furthermore, standalone classifiers exhibit restricted classification performance and typically fail to produce satisfactory outcomes when dealing with imbalanced, multi-category traffic data. To address these issues, we propose an efficient intrusion detection model based on hybrid feature selection and stack ensemble learning. Our hybrid feature selection method, called MI-Boruta, combines mutual information (MI) as a filter method and the Boruta algorithm as a wrapper method to determine optimal features from our datasets. We then apply stacked ensemble learning using random forest (RF), CatBoost, and XGBoost as base learners with a multilayer perceptron (MLP) as the meta-learner. We test our intrusion detection model on two widely recognized benchmark datasets, UNSW-NB15 and CICIDS2017, and show that our proposed IDS outperforms existing IDSs on almost all performance criteria, including accuracy, recall, precision, F1-score, false positive rate, true positive rate, and error rate.
{"title":"Shielding networks: enhancing intrusion detection with hybrid feature selection and stack ensemble learning","authors":"Ali Mohammed Alsaffar, Mostafa Nouri-Baygi, Hamed M. Zolbanin","doi":"10.1186/s40537-024-00994-7","DOIUrl":"https://doi.org/10.1186/s40537-024-00994-7","url":null,"abstract":"<p>The frequent usage of computer networks and the Internet has made computer networks vulnerable to numerous attacks, highlighting the critical need to enhance the precision of security mechanisms. One of the most essential measures to safeguard networking resources and infrastructures is an intrusion detection system (IDS). IDSs are widely used to detect, identify, and track malicious threats. Although various machine learning algorithms have been used successfully in IDSs, they are still suffering from low prediction performances. One reason behind the low accuracy of IDSs is that existing network traffic datasets have high computational complexities that are mainly caused by redundant, incomplete, and irrelevant features. Furthermore, standalone classifiers exhibit restricted classification performance and typically fail to produce satisfactory outcomes when dealing with imbalanced, multi-category traffic data. To address these issues, we propose an efficient intrusion detection model, which is based on hybrid feature selection and stack ensemble learning. Our hybrid feature selection method, called MI-Boruta, combines mutual information (MI) as a filter method and the Boruta algorithm as a wrapper method to determine optimal features from our datasets. Then, we apply stacked ensemble learning by using random forest (RF), Catboost, and XGBoost algorithms as base learners with multilayer perceptron (MLP) as meta-learner. We test our intrusion detection model on two widely recognized benchmark datasets, namely UNSW-NB15 and CICIDS2017. We show that our proposed IDS outperforms existing IDSs in almost all performance criteria, including accuracy, recall, precision, F1-Score, false positive rate, true positive rate, and error rate.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"19 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background
The tumor microenvironment (TME) provides a region for intricate interactions within or between immune and non-immune cells. We aimed to reveal the tissue architecture and comprehensive landscape of cells within the TME of colorectal cancer (CRC).
Methods
Fresh-frozen invasive adenocarcinoma tissue of the large intestine, from the 10x Genomics Datasets, was obtained from BioIVT Asterand. The integration of microarray-based spatial transcriptomics (ST) and RNA sequencing (RNA-seq) was applied to characterize gene expression and the cell landscape within the TME of the CRC tissue architecture. Multiple R packages and deconvolution algorithms, including the MCPcounter, XCELL, EPIC, and ESTIMATE methods, were used for further immune distribution analysis.
Results
The subpopulations of immune and non-immune cells within the TME of the CRC tissue architecture were appropriately annotated. According to ST and RNA-seq analyses, a heterogeneous spatial atlas of gene distribution and cell landscape was comprehensively characterized. We distinguished between the cancer and stromal regions of CRC tissues. As expected, epithelial cells were located in the cancerous region, whereas fibroblasts were mainly located in the stroma. In addition, the fibroblasts were further subdivided into two subgroups (F1 and F2) according to the differentially expressed genes (DEGs), which were mainly enriched in pathways including hallmark-oxidative-phosphorylation, hallmark-e2f-targets and hallmark-unfolded-protein-response. Furthermore, the top 5 DEGs, SPP1, CXCL10, APOE, APOC1, and LYZ, were found to be closely related to immunoregulation of the TME, methylation, and survival of CRC patients.
Conclusions
This study characterized the heterogeneous spatial landscape of various cell subtypes within the TME of the tissue architecture. The TME-related roles of fibroblast subsets highlighted potential crosstalk among diverse cells.
{"title":"Integrating microarray-based spatial transcriptomics and RNA-seq reveals tissue architecture in colorectal cancer","authors":"Zheng Li, Xiaojie Zhang, Chongyuan Sun, Zefeng Li, He Fei, Dongbing Zhao","doi":"10.1186/s40537-024-00992-9","DOIUrl":"https://doi.org/10.1186/s40537-024-00992-9","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Background</h3><p>The tumor microenvironment (TME) provides a region for intricate interactions within or between immune and non-immune cells. We aimed to reveal the tissue architecture and comprehensive landscape of cells within the TME of colorectal cancer (CRC).</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>Fresh frozen invasive adenocarcinoma of the large intestine tissue from 10× Genomics Datasets was obtained from BioIVT Asterand. The integration of microarray-based spatial transcriptomics (ST) and RNA sequencing (RNA-seq) was applied to characterize gene expression and cell landscape within the TME of CRC tissue architecture. Multiple R packages and deconvolution algorithms including MCPcounter, XCELL, EPIC, and ESTIMATE methods were performed for further immune distribution analysis.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>The subpopulations of immune and non-immune cells within the TME of the CRC tissue architecture were appropriately annotated. According to ST and RNA-seq analyses, a heterogeneous spatial atlas of gene distribution and cell landscape was comprehensively characterized. We distinguished between the cancer and stromal regions of CRC tissues. As expected, epithelial cells were located in the cancerous region, whereas fibroblasts were mainly located in the stroma. In addition, the fibroblasts were further subdivided into two subgroups (F1 and F2) according to the differentially expressed genes (DEGs), which were mainly enriched in pathways including hallmark-oxidative-phosphorylation, hallmark-e2f-targets and hallmark-unfolded-protein-response. Furthermore, the top 5 DEGs, SPP1, CXCL10, APOE, APOC1, and LYZ, were found to be closely related to immunoregulation of the TME, methylation, and survival of CRC patients.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>This study characterized the heterogeneous spatial landscape of various cell subtypes within the TME of the tissue architecture. The TME-related roles of fibroblast subsets addressed the potential crosstalk among diverse cells.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"26 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-12. DOI: 10.1186/s40537-024-00968-9
Wei Feng, Bingjie Wang, Dan Song, Mengda Li, Anming Chen, Jing Wang, Siyong Lin, Yiran Zhao, Bin Wang, Zongyuan Ge, Shuyi Xu, Yuntao Hu
Diabetic retinopathy (DR) is the most prevalent cause of preventable vision loss worldwide, imposing a significant economic and medical burden on society, and early identification is the cornerstone of its management. The diagnosis and severity grading of DR rely on scales based on clinically visualized features but lack detailed quantitative parameters. The retinal non-perfusion area (NPA) is a pathogenic characteristic of DR that reflects retinal hypoxia and has been found to be intimately associated with disease progression, prognosis, and management. However, the practical value of NPA is constrained, since it appears on fundus fluorescein angiography (FFA) as distributed, irregularly shaped, darker plaques that are challenging to measure manually. In this study, we propose a deep learning-based method, NPA-Net, for accurate and automatic segmentation of NPAs from FFA images acquired in clinical practice. NPA-Net uses the U-Net encoder-decoder structure as its backbone. To enhance recognition of NPAs, we adaptively incorporate multi-scale features and contextual information in feature learning and design three modules: an Adaptive Encoder Feature Fusion (AEFF) module, a multilayer deep supervised loss, and an Atrous Spatial Pyramid Pooling (ASPP) module, which together enhance the model's ability to recognize NPAs of different sizes from different perspectives. We conducted extensive experiments on a clinical dataset of 163 eyes with NPAs manually annotated by ophthalmologists; NPA-Net achieved better segmentation performance than existing methods, with an area under the receiver operating characteristic curve (AUC) of 0.9752, accuracy of 0.9431, sensitivity of 0.8794, specificity of 0.9459, IoU of 0.3876, and Dice of 0.5686. This new automatic segmentation model is useful for identifying NPA in clinical practice, generating quantitative parameters that can inform further research and guide DR detection, severity grading, treatment planning, and prognosis.
{"title":"Development and evaluation of a deep learning model for automatic segmentation of non-perfusion area in fundus fluorescein angiography","authors":"Wei Feng, Bingjie Wang, Dan Song, Mengda Li, Anming Chen, Jing Wang, Siyong Lin, Yiran Zhao, Bin Wang, Zongyuan Ge, Shuyi Xu, Yuntao Hu","doi":"10.1186/s40537-024-00968-9","DOIUrl":"https://doi.org/10.1186/s40537-024-00968-9","url":null,"abstract":"<p>Diabetic retinopathy (DR) is the most prevalent cause of preventable vision loss worldwide, imposing a significant economic and medical burden on society today, of which early identification is the cornerstones of the management. The diagnosis and severity grading of DR rely on scales based on clinical visualized features, but lack detailed quantitative parameters. Retinal non-perfusion area (NPA) is a pathogenic characteristic of DR that symbolizes retinal hypoxia conditions, and was found to be intimately associated with disease progression, prognosis, and management. However, the practical value of NPA is constrained since it appears on fundus fluorescein angiography (FFA) as distributed, irregularly shaped, darker plaques that are challenging to measure manually. In this study, we propose a deep learning-based method, NPA-Net, for accurate and automatic segmentation of NPAs from FFA images acquired in clinical practice. NPA-Net uses the U-net structure as the basic backbone, which has an encoder-decoder model structure. To enhance the recognition performance of the model for NPA, we adaptively incorporate multi-scale features and contextual information in feature learning and design three modules: Adaptive Encoder Feature Fusion (AEFF) module, Multilayer Deep Supervised Loss, and Atrous Spatial Pyramid Pooling (ASPP) module, which enhance the recognition ability of the model for NPAs of different sizes from different perspectives. We conducted extensive experiments on a clinical dataset with 163 eyes with NPAs manually annotated by ophthalmologists, and NPA-Net achieved better segmentation performance compared to other existing methods with an area under the receiver operating characteristic curve (AUC) of 0.9752, accuracy of 0.9431, sensitivity of 0.8794, specificity of 0.9459, IOU of 0.3876 and Dice of 0.5686. This new automatic segmentation model is useful for identifying NPA in clinical practice, generating quantitative parameters that can be useful for further research as well as guiding DR detection, grading severity, treatment planning, and prognosis.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"37 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background
The long-term impacts of COVID-19 on human health are a major concern, yet comprehensive evaluations of its effects on various health conditions are lacking.
Methods
This study aims to evaluate the role of various diseases in relation to COVID-19 by analyzing genetic data from a large-scale population of over 2,000,000 individuals. A bidirectional two-sample Mendelian randomization (MR) approach was used, with exposures including COVID-19 susceptibility, hospitalization, and severity, and outcomes encompassing 86 different diseases or traits. A reverse Mendelian randomization analysis was performed to assess the impact of these diseases on COVID-19.
Results
Our analysis identified causal relationships between COVID-19 susceptibility and several conditions, including breast cancer (OR = 1.0073, 95% CI = 1.0032–1.0114, p = 5 × 10⁻⁴), ER+ breast cancer (OR = 0.5252, 95% CI = 0.3589–0.7685, p = 9 × 10⁻⁴), and heart failure (OR = 1.0026, 95% CI = 1.001–1.0042, p = 0.002). COVID-19 hospitalization was causally linked to heart failure (OR = 1.0017, 95% CI = 1.0006–1.0028, p = 0.002) and Alzheimer's disease (OR = 1.5092, 95% CI = 1.1942–1.9072, p = 0.0006). COVID-19 severity had causal effects on primary biliary cirrhosis (OR = 2.6333, 95% CI = 1.8274–3.7948, p = 2.059 × 10⁻⁷), celiac disease (OR = 0.0708, 95% CI = 0.0538–0.0932, p = 9.438 × 10⁻⁸⁰), and Alzheimer's disease (OR = 1.5092, 95% CI = 1.1942–1.9072, p = 0.0006). Reverse MR analysis indicated that rheumatoid arthritis, diabetic nephropathy, multiple sclerosis, and total testosterone (female) influence COVID-19 outcomes. We assessed heterogeneity and horizontal pleiotropy to ensure result reliability and employed the Steiger directionality test to confirm the direction of causality.
Conclusions
This study provides a comprehensive analysis of the causal relationships between COVID-19 and diverse health conditions. Our findings highlight the long-term impacts of COVID-19 on human health, emphasizing the need for continuous monitoring and targeted interventions for affected individuals. Future research should explore these relationships to develop comprehensive healthcare strategies.
{"title":"Leveraging large-scale genetic data to assess the causal impact of COVID-19 on multisystemic diseases","authors":"Xiangyang Zhang, Zhaohui Jiang, Jiayao Ma, Yaru Qi, Yin Li, Yan Zhang, Yihan Liu, Chaochao Wei, Yihong Chen, Ping Liu, Yinghui Peng, Jun Tan, Ying Han, Shan Zeng, Changjing Cai, Hong Shen","doi":"10.1186/s40537-024-00997-4","DOIUrl":"https://doi.org/10.1186/s40537-024-00997-4","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Background</h3><p>The long-term impacts of COVID-19 on human health are a major concern, yet comprehensive evaluations of its effects on various health conditions are lacking.</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>This study aims to evaluate the role of various diseases in relation to COVID-19 by analyzing genetic data from a large-scale population over 2,000,000 individuals. A bidirectional two-sample Mendelian randomization approach was used, with exposures including COVID-19 susceptibility, hospitalization, and severity, and outcomes encompassing 86 different diseases or traits. A reverse Mendelian randomization analysis was performed to assess the impact of these diseases on COVID-19.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>Our analysis identified causal relationships between COVID-19 susceptibility and several conditions, including breast cancer (OR = 1.0073, 95% CI = 1.0032–1.0114, <i>p</i> = 5 × 10 − 4), ER + breast cancer (OR = 0.5252, 95% CI = 0.3589–0.7685, <i>p</i> = 9 × 10 − 4), and heart failure (OR = 1.0026, 95% CI = 1.001–1.0042, <i>p</i> = 0.002). COVID-19 hospitalization was causally linked to heart failure (OR = 1.0017, 95% CI = 1.0006–1.0028, <i>p</i> = 0.002) and Alzheimer’s disease (OR = 1.5092, 95% CI = 1.1942–1.9072, <i>p</i> = 0.0006). COVID-19 severity had causal effects on primary biliary cirrhosis (OR = 2.6333, 95% CI = 1.8274–3.7948, <i>p</i> = 2.059 × 10 − 7), celiac disease (OR = 0.0708, 95% CI = 0.0538–0.0932, <i>p</i> = 9.438 × 10–80), and Alzheimer’s disease (OR = 1.5092, 95% CI = 1.1942–1.9072, <i>p</i> = 0.0006). Reverse MR analysis indicated that rheumatoid arthritis, diabetic nephropathy, multiple sclerosis, and total testosterone (female) influence COVID-19 outcomes. We assessed heterogeneity and horizontal pleiotropy to ensure result reliability and employed the Steiger directionality test to confirm the direction of causality.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>This study provides a comprehensive analysis of the causal relationships between COVID-19 and diverse health conditions. Our findings highlight the long-term impacts of COVID-19 on human health, emphasizing the need for continuous monitoring and targeted interventions for affected individuals. Future research should explore these relationships to develop comprehensive healthcare strategies.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"1 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image object recognition and detection technologies are widely used in many scenarios. In recent years, big data has become increasingly abundant, and big data-driven artificial intelligence models have attracted more and more attention. Evolutionary computation has also provided a powerful driving force for the optimization and improvement of deep learning models. In this paper, we propose an image object detection method based on self-supervised, data-driven learning. Our approach stands out from other methods through its innovative use of multispectral data fusion and evolutionary computation for model optimization. Specifically, our method combines visible-light images and infrared images to detect and identify image targets. First, we utilize a self-supervised learning method and the AutoEncoder model to perform high-dimensional feature extraction on the two types of images. Second, we fuse the features extracted from the visible-light and infrared images to detect and identify objects. Third, we introduce a model parameter optimization method using evolutionary learning algorithms to enhance model performance. Validation on public datasets shows that our method achieves performance comparable or superior to existing methods.
{"title":"Evolutionary computation-based self-supervised learning for image processing: a big data-driven approach to feature extraction and fusion for multispectral object detection","authors":"Xiaoyang Shen, Haibin Li, Achyut Shankar, Wattana Viriyasitavat, Vinay Chamola","doi":"10.1186/s40537-024-00988-5","DOIUrl":"https://doi.org/10.1186/s40537-024-00988-5","url":null,"abstract":"<p>The image object recognition and detection technology are widely used in many scenarios. In recent years, big data has become increasingly abundant, and big data-driven artificial intelligence models have attracted more and more attention. Evolutionary computation has also provided a powerful driving force for the optimization and improvement of deep learning models. In this paper, we propose an image object detection method based on self-supervised and data-driven learning. Differ from other methods, our approach stands out due to its innovative use of multispectral data fusion and evolutionary computation for model optimization. Specifically, our method uniquely combines visible light images and infrared images to detect and identify image targets. Firstly, we utilize a self-supervised learning method and the AutoEncoder model to perform high-dimensional feature extraction on the two types of images. Secondly, we fuse the extracted features from the visible light and infrared images to detect and identify objects. Thirdly, we introduce a model parameter optimization method using evolutionary learning algorithms to enhance model performance. Validation on public datasets shows that our method achieves comparable or superior performance to existing methods.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"6 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-12. DOI: 10.1186/s40537-024-00965-y
Asefeh Asemi, Adeleh Asemi, Andrea Ko
This article presents an investment recommender system based on an Adaptive Neuro-Fuzzy Inference System (ANFIS) and pre-trained weights from a Multimodal Neural Network (MNN). The model is designed to support customers' investment processes and considers seven factors, implemented over a data set of customers and potential investors. The system takes input from a web-based questionnaire that collects data on investors' preferences and investment goals. The data is then preprocessed and clustered using ETL tools, JMP, MATLAB, and Python. The ANFIS-based recommender system is designed with three inputs and one output and trained using a hybrid approach over three epochs with 188 data pairs and 18 fuzzy rules. The system's performance is evaluated using metrics such as RMSE, accuracy, precision, recall, and F1-score. The system also incorporates expert feedback and opinions from investors to customize and improve investment recommendations. The article concludes that the proposed ANFIS-based investment recommender system is effective and accurate in generating investment recommendations that meet investors' preferences and goals.
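To illustrate the ANFIS forward pass behind such a recommender, here is a toy zero-order Sugeno-style inference with three inputs; the membership functions, the eight rules, and the consequents are invented for illustration (the actual system trains 18 rules with a hybrid method):

```python
import numpy as np

def gauss(x, c, s):
    # Gaussian membership grade of input x for a fuzzy set centered at c.
    return np.exp(-((x - c) ** 2) / (2 * s ** 2))

# Two fuzzy sets per input ("low" at 0.3, "high" at 0.7) -> 2**3 = 8 toy rules.
centers = [(0.3, 0.7)] * 3
rules = [(i, j, k) for i in range(2) for j in range(2) for k in range(2)]
consequents = np.linspace(0.0, 1.0, len(rules))  # crisp output per rule (invented)

def anfis_predict(x, sigma=0.15):
    # Layers 1-2: firing strength of each rule = product of its membership grades;
    # layers 3-5: normalize and take the weighted sum of rule outputs.
    w = np.array([
        np.prod([gauss(x[d], centers[d][r[d]], sigma) for d in range(3)])
        for r in rules
    ])
    return float(np.dot(w / w.sum(), consequents))

print(anfis_predict([0.2, 0.8, 0.5]))  # e.g., a recommendation score in [0, 1]
```

In a trained ANFIS, the hybrid procedure the abstract mentions would fit the membership parameters by gradient descent and the consequents by least squares, rather than fixing them as done here.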