
Journal of Big Data: Latest Articles

FunDa: scalable serverless data analytics and in situ query processing.
IF 8.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-01-01 Epub Date: 2025-05-09 DOI: 10.1186/s40537-025-01141-6
Elyes Lounissi, Suvam Kumar Das, Ronnit Peter, Xiaozheng Zhang, Suprio Ray, Lianyin Jia

The pay-what-you-use model of serverless Cloud computing (or serverless, for short) offers significant benefits to users. This computing paradigm is ideal for short-running, ephemeral tasks; however, it is not suitable for stateful, long-running tasks such as complex data analytics and query processing. We propose FunDa, an on-premises serverless data analytics framework that extends DaskDB, our previously proposed system for unified data analytics and in situ SQL query processing. Existing serverless solutions struggle with stateful and long-running data analytics tasks; FunDa overcomes these limitations. Our ongoing research focuses on developing a robust architecture for FunDa that enables true serverless operation in on-premises environments while also being able to run on a public Cloud, such as AWS Cloud. We have evaluated our system on several benchmarks with different scale factors. Our experimental results in both on-premises and AWS Cloud settings demonstrate FunDa's ability to support automatic scaling and low-latency execution of data analytics workloads, and to offer more flexibility to serverless users.

Journal of Big Data 12(1), 116 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12064580/pdf/
Citations: 0
UniqueNOSD: a novel framework for NoSQL over SQL databases.
IF 6.4 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-01-01 Epub Date: 2025-11-17 DOI: 10.1186/s40537-025-01307-2
Abdulrauf A Gidado, C I Ezeife

To date, most large corporations still keep their core solutions on relational databases and use non-relational (i.e., NoSQL) database management systems (DBMS) only for non-core systems that favour availability and scalability through partitioning while trading off consistency. NoSQL systems are built on the CAP (i.e., Consistency, Availability and Partition tolerance) theorem, which trades off one of these features while maintaining the others. The need for system availability and scalability drives the use of NoSQL, while the lack of the consistency and robust query engines obtainable in relational databases impedes its usage. To mitigate these drawbacks, researchers and companies like Amazon, Google, and Facebook run 'SQL over NoSQL' systems such as Dynamo, Google's Spanner, Memcache, Zidian, Apache Hive and SparkSQL. These systems create a query engine layer over NoSQL systems but suffer from data redundancy and lack the consistency obtainable in a relational DBMS. Also, their query engines are not relationally complete, because they cannot process all of the relational algebra-based queries that a relational database can. In this paper, we present a 'Unique NoSQL over SQL Database' (UniqueNOSD) system, an extension of NOSD and an inverse of existing approaches. This approach is motivated by the need for existing systems to fully deploy NoSQL data store functionalities without the limitation of building an extra SQL layer for querying. To allow appropriate storage and retrieval of data on document-based NoSQL databases without data redundancy and inconsistency, while encouraging both horizontal and vertical partitioning, we propose the NoSQL over SQL Block as a Value ([Formula: see text]) data storage strategy. Unlike the relational database model, where a relation is represented as [Formula: see text], with a key attribute [Formula: see text], and [Formula: see text] is the primary key to the set of attributes [Formula: see text] of the relation, in [Formula: see text] (represented as a tuple (K, B), where K means key and B means block) we represent a relation as [Formula: see text], with a key attribute K and a set of n relations (i.e., r) called blocks B. Each r [Formula: see text] contains a set of its own attributes and is denoted as [Formula: see text], with a key attribute k and a set of n attributes typical of a relational model. The relations [Formula: see text] in R of [Formula: see text] are related through foreign key relationships. Using existing 'SQL over NoSQL' benchmark systems, relational databases and real-life datasets in our experiments, we demonstrate that our NoSQL over SQL system outperforms existing relational databases and SQL over NoSQL systems, and that it is novel in ensuring data consistency, scalability and query execution, and in improving data storage and retrieval in large database systems without data loss while improving the performance of NoSQL databases.
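The (K, B) description above can be pictured with a small sketch. This is only an illustration under stated assumptions: the formal definitions are elided in this listing ("[Formula: see text]"), so the class names `Block` and `Record`, the helper `join_on_foreign_key`, and the customer/orders data are all hypothetical and are not the authors' formalization or implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Block:
    """One block r: a small relation with its own key attribute k (hypothetical model)."""
    name: str
    key: str  # name of the block's key attribute k
    rows: List[Dict[str, Any]] = field(default_factory=list)

    def find(self, key_value: Any) -> List[Dict[str, Any]]:
        """Return the rows whose key attribute equals key_value."""
        return [row for row in self.rows if row[self.key] == key_value]


@dataclass
class Record:
    """A (K, B) pair: a key K whose value is a set of named blocks B."""
    K: Any
    B: Dict[str, Block] = field(default_factory=dict)

    def add_block(self, block: Block) -> None:
        self.B[block.name] = block


def join_on_foreign_key(
    record: Record, parent: str, child: str, fk: str
) -> List[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Pair each child row with the parent row that its foreign key points to."""
    parent_block = record.B[parent]
    child_block = record.B[child]
    index = {row[parent_block.key]: row for row in parent_block.rows}
    return [(index[row[fk]], row) for row in child_block.rows if row[fk] in index]


# Illustrative data only: one document keyed by a customer id, holding two blocks.
record = Record(K="C001")
record.add_block(
    Block("customer", key="customer_id",
          rows=[{"customer_id": "C001", "name": "Ada"}])
)
record.add_block(
    Block("orders", key="order_id",
          rows=[
              {"order_id": 1, "customer_id": "C001", "total": 40.0},
              {"order_id": 2, "customer_id": "C001", "total": 15.5},
          ])
)
pairs = join_on_foreign_key(record, "customer", "orders", fk="customer_id")
print(len(pairs))
```

In this reading, each customer row is stored once and the orders refer to it through the foreign key, which is one way blocks could avoid the redundancy the abstract mentions.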

Journal of Big Data 12(1), 255 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12628391/pdf/
Citations: 0
Optimizing poultry audio signal classification with deep learning and burn layer fusion
IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-18 DOI: 10.1186/s40537-024-00985-8
Esraa Hassan, Samar Elbedwehy, Mahmoud Y. Shams, Tarek Abd El-Hafeez, Nora El-Rashidy

This study introduces a novel deep learning-based approach for classifying poultry audio signals, incorporating a custom Burn Layer to enhance model robustness. The methodology integrates digital audio signal processing, convolutional neural networks (CNNs), and the Burn Layer, which injects controlled random noise during training to reinforce the model's resilience to input signal variations. The proposed architecture is streamlined, with convolutional blocks, densely connected layers, dropout, and an additional Burn Layer to fortify robustness. The model demonstrates efficiency by reducing trainable parameters to 191,235, compared to traditional architectures with over 1.7 million parameters. The proposed model uses a Burn Layer with burn intensity as a parameter and an Adamax optimizer to address the overfitting problem. Thorough evaluation using seven standard classification metrics showcases the model's performance: sensitivity (96.77%), specificity (100.00%), precision (100.00%), negative predictive value (NPV) (95.00%), accuracy (98.55%), F1 score (98.36%), and Matthews correlation coefficient (MCC) (95.88%). This research contributes valuable insights into the fields of audio signal processing, animal health monitoring, and robust deep-learning classification systems. The proposed model presents a systematic approach for developing and evaluating a deep learning-based poultry audio classification system. It processes raw audio data and labels to generate digital representations, uses a Burn Layer for training variability, and constructs a CNN model with convolutional blocks, pooling, and dense layers. The model is optimized using the Adamax algorithm and trained with data augmentation and early-stopping techniques. Rigorous assessment on a test dataset using standard metrics demonstrates the model's robustness and efficiency, with the potential to significantly advance animal health monitoring and disease detection through audio signal analysis.
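The noise-injection idea behind the Burn Layer can be sketched in a few lines. The abstract only states that the layer injects controlled random noise during training, with burn intensity as a parameter; the additive zero-mean Gaussian noise, the class name `BurnLayer`, and the `training` flag below are assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import List, Optional


class BurnLayer:
    """Adds random noise to its inputs during training only (illustrative sketch)."""

    def __init__(self, burn_intensity: float, seed: Optional[int] = None) -> None:
        if burn_intensity < 0:
            raise ValueError("burn_intensity must be non-negative")
        # Assumption: burn intensity is used as the standard deviation of the noise.
        self.burn_intensity = burn_intensity
        self._rng = random.Random(seed)

    def __call__(self, inputs: List[float], training: bool) -> List[float]:
        if not training:
            # At inference time the layer passes its inputs through unchanged.
            return list(inputs)
        # Assumption: zero-mean Gaussian noise added independently to each value.
        return [x + self._rng.gauss(0.0, self.burn_intensity) for x in inputs]


layer = BurnLayer(burn_intensity=0.05, seed=0)
clean = [0.1, -0.3, 0.7]
noisy = layer(clean, training=True)       # perturbed copy, used while fitting
unchanged = layer(clean, training=False)  # identical to the input
print(noisy, unchanged)
```

Disabling the noise at inference mirrors how dropout is usually handled, so the perturbation acts only as a training-time regularizer.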

Citations: 0
Machine learning and deep learning models based grid search cross validation for short-term solar irradiance forecasting
IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-18 DOI: 10.1186/s40537-024-00991-w
Doaa El-Shahat, Ahmed Tolba, Mohamed Abouhawwash, Mohamed Abdel-Basset

In late 2023, the United Nations conference on climate change (COP28), held in Dubai, encouraged a quick move from fossil fuels to renewable energy. Solar energy is one of the most promising forms of energy that is both sustainable and renewable. Generally, photovoltaic (PV) systems transform solar irradiance into electricity. Unfortunately, instability and intermittency in solar radiation can lead to interruptions in electricity production. Accurate forecasting of solar irradiance supports sustainable power production even when solar irradiance is not present, since batteries can store solar energy for use during periods of solar absence. Additionally, deterministic models take into account the technical specifications of PV systems and may not be accurate for low solar irradiance. This paper presents a comparative study of the most common Deep Learning (DL) and Machine Learning (ML) algorithms employed for short-term solar irradiance forecasting. The dataset was gathered in Islamabad over a five-year period, from 2015 to 2019, at hourly intervals with accurate meteorological sensors. Furthermore, Grid Search Cross Validation (GSCV) with five folds is applied to the ML and DL models to optimize their hyperparameters. Several performance metrics are used to assess the algorithms, such as the Adjusted R² score, Normalized Root Mean Square Error (NRMSE), Mean Absolute Deviation (MAD), Mean Absolute Error (MAE) and Mean Square Error (MSE). The statistical analysis shows that CNN-LSTM outperforms its counterparts of nine well-known DL models with an Adjusted R² score of 0.984. For ML algorithms, gradient boosting regression is an effective forecasting method with an Adjusted R² score of 0.962, beating its rivals of six ML models. Furthermore, SHAP and LIME are examples of explainable Artificial Intelligence (XAI) used to understand the reasons behind the obtained results.
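Five-fold grid search cross validation and the Adjusted R² score can be sketched as follows. This is a minimal standard-library illustration, not the study's pipeline: the one-feature ridge model, the `alpha` grid, the contiguous (unshuffled) folds, and the synthetic data are all assumptions chosen to keep the example small. The Adjusted R² formula used is the standard one, 1 - (1 - R²)(n - 1)/(n - p - 1) for n samples and p features.

```python
from itertools import product
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def k_fold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k contiguous folds of (train, validation) indices."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds


def fit_ridge_1d(x: Sequence[float], y: Sequence[float], alpha: float) -> float:
    """Closed-form ridge fit of y ~ w * x (one feature, no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


def grid_search_cv(x, y, grid: Dict[str, Sequence[float]], k: int = 5):
    """Return the parameter combination with the lowest mean validation MSE."""
    best_params, best_score = None, float("inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train, val in k_fold_indices(len(x), k):
            w = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], **params)
            scores.append(mse([y[i] for i in val], [w * x[i] for i in val]))
        if mean(scores) < best_score:
            best_params, best_score = params, mean(scores)
    return best_params, best_score


def adjusted_r2(y_true: Sequence[float], y_pred: Sequence[float], n_features: int) -> float:
    n, y_bar = len(y_true), mean(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)


# Synthetic, noise-free data (y = 2x), so the unregularized fit is exact.
x = [float(i) for i in range(1, 21)]
y = [2.0 * v for v in x]
best_params, best_score = grid_search_cv(x, y, {"alpha": [0.0, 1.0, 10.0]}, k=5)
w = fit_ridge_1d(x, y, **best_params)
score = adjusted_r2(y, [w * v for v in x], n_features=1)
print(best_params, best_score, score)
```

For hourly time series such as the one described, contiguous folds avoid mixing neighbouring hours between training and validation data; whether the study shuffled its folds is not stated in the abstract.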

Citations: 0
Shielding networks: enhancing intrusion detection with hybrid feature selection and stack ensemble learning
IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-18 DOI: 10.1186/s40537-024-00994-7
Ali Mohammed Alsaffar, Mostafa Nouri-Baygi, Hamed M. Zolbanin

The frequent use of computer networks and the Internet has made computer networks vulnerable to numerous attacks, highlighting the critical need to enhance the precision of security mechanisms. One of the most essential measures to safeguard networking resources and infrastructures is an intrusion detection system (IDS). IDSs are widely used to detect, identify, and track malicious threats. Although various machine learning algorithms have been used successfully in IDSs, they still suffer from low prediction performance. One reason behind the low accuracy of IDSs is that existing network traffic datasets have high computational complexity, mainly caused by redundant, incomplete, and irrelevant features. Furthermore, standalone classifiers exhibit restricted classification performance and typically fail to produce satisfactory outcomes when dealing with imbalanced, multi-category traffic data. To address these issues, we propose an efficient intrusion detection model based on hybrid feature selection and stack ensemble learning. Our hybrid feature selection method, called MI-Boruta, combines mutual information (MI) as a filter method and the Boruta algorithm as a wrapper method to determine optimal features from our datasets. Then, we apply stacked ensemble learning, using random forest (RF), Catboost, and XGBoost algorithms as base learners with a multilayer perceptron (MLP) as meta-learner. We test our intrusion detection model on two widely recognized benchmark datasets, namely UNSW-NB15 and CICIDS2017. We show that our proposed IDS outperforms existing IDSs in almost all performance criteria, including accuracy, recall, precision, F1-Score, false positive rate, true positive rate, and error rate.
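The filter stage of a hybrid selection scheme like the one described can be sketched with a plain mutual information ranking. This covers only the MI filter; the Boruta wrapper and the stacked ensemble are not shown. The function names, the feature names `flag`, `partial` and `noise`, and the toy data are hypothetical, and the sketch assumes discrete feature values (continuous traffic features would need binning or a different estimator first).

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, List, Sequence


def mutual_information(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """I(X; Y) in bits for two discrete sequences of equal length."""
    n = len(labels)
    px, py = Counter(feature), Counter(labels)
    pxy = Counter(zip(feature, labels))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


def mi_filter(
    features: Dict[str, Sequence[Hashable]], labels: Sequence[Hashable], top_k: int
) -> List[str]:
    """Keep the top_k feature names ranked by mutual information with the label."""
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data: 1 = attack, 0 = benign (illustrative only).
labels = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],    # independent of the label
    "flag": [0, 0, 1, 1, 0, 0, 1, 1],     # identical to the label
    "partial": [0, 0, 1, 1, 0, 1, 1, 0],  # agrees with the label on 6 of 8 rows
}
selected = mi_filter(features, labels, top_k=2)
print(selected)
```

A filter like this is cheap because it scores each feature independently of any classifier, which is why it is commonly run before a costlier wrapper method.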

Citations: 0
Integrating microarray-based spatial transcriptomics and RNA-seq reveals tissue architecture in colorectal cancer
IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-17 DOI: 10.1186/s40537-024-00992-9
Zheng Li, Xiaojie Zhang, Chongyuan Sun, Zefeng Li, He Fei, Dongbing Zhao

Background

The tumor microenvironment (TME) provides a region for intricate interactions within or between immune and non-immune cells. We aimed to reveal the tissue architecture and comprehensive landscape of cells within the TME of colorectal cancer (CRC).

Methods

Fresh frozen tissue from an invasive adenocarcinoma of the large intestine, included in the 10× Genomics Datasets, was obtained from BioIVT Asterand. Microarray-based spatial transcriptomics (ST) and RNA sequencing (RNA-seq) were integrated to characterize gene expression and the cell landscape within the TME of the CRC tissue architecture. Multiple R packages and deconvolution algorithms, including the MCPcounter, XCELL, EPIC, and ESTIMATE methods, were applied for further immune distribution analysis.

Results

The subpopulations of immune and non-immune cells within the TME of the CRC tissue architecture were appropriately annotated. According to ST and RNA-seq analyses, a heterogeneous spatial atlas of gene distribution and cell landscape was comprehensively characterized. We distinguished between the cancer and stromal regions of CRC tissues. As expected, epithelial cells were located in the cancerous region, whereas fibroblasts were mainly located in the stroma. In addition, the fibroblasts were further subdivided into two subgroups (F1 and F2) according to the differentially expressed genes (DEGs), which were mainly enriched in pathways including hallmark-oxidative-phosphorylation, hallmark-e2f-targets and hallmark-unfolded-protein-response. Furthermore, the top 5 DEGs, SPP1, CXCL10, APOE, APOC1, and LYZ, were found to be closely related to immunoregulation of the TME, methylation, and survival of CRC patients.

Conclusions

This study characterized the heterogeneous spatial landscape of various cell subtypes within the TME of the tissue architecture. The TME-related roles of fibroblast subsets addressed the potential crosstalk among diverse cells.

{"title":"Integrating microarray-based spatial transcriptomics and RNA-seq reveals tissue architecture in colorectal cancer","authors":"Zheng Li, Xiaojie Zhang, Chongyuan Sun, Zefeng Li, He Fei, Dongbing Zhao","doi":"10.1186/s40537-024-00992-9","DOIUrl":"https://doi.org/10.1186/s40537-024-00992-9","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Background</h3><p>The tumor microenvironment (TME) provides a region for intricate interactions within or between immune and non-immune cells. We aimed to reveal the tissue architecture and comprehensive landscape of cells within the TME of colorectal cancer (CRC).</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>Fresh frozen invasive adenocarcinoma of the large intestine tissue from 10× Genomics Datasets was obtained from BioIVT Asterand. The integration of microarray-based spatial transcriptomics (ST) and RNA sequencing (RNA-seq) was applied to characterize gene expression and cell landscape within the TME of CRC tissue architecture. Multiple R packages and deconvolution algorithms including MCPcounter, XCELL, EPIC, and ESTIMATE methods were performed for further immune distribution analysis.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>The subpopulations of immune and non-immune cells within the TME of the CRC tissue architecture were appropriately annotated. According to ST and RNA-seq analyses, a heterogeneous spatial atlas of gene distribution and cell landscape was comprehensively characterized. We distinguished between the cancer and stromal regions of CRC tissues. As expected, epithelial cells were located in the cancerous region, whereas fibroblasts were mainly located in the stroma. In addition, the fibroblasts were further subdivided into two subgroups (F1 and F2) according to the differentially expressed genes (DEGs), which were mainly enriched in pathways including hallmark-oxidative-phosphorylation, hallmark-e2f-targets and hallmark-unfolded-protein-response. 
Furthermore, the top 5 DEGs, SPP1, CXCL10, APOE, APOC1, and LYZ, were found to be closely related to immunoregulation of the TME, methylation, and survival of CRC patients.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>This study characterized the heterogeneous spatial landscape of various cell subtypes within the TME of the tissue architecture. The TME-related roles of fibroblast subsets addressed the potential crosstalk among diverse cells.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"26 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Development and evaluation of a deep learning model for automatic segmentation of non-perfusion area in fundus fluorescein angiography
IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-12 DOI: 10.1186/s40537-024-00968-9
Wei Feng, Bingjie Wang, Dan Song, Mengda Li, Anming Chen, Jing Wang, Siyong Lin, Yiran Zhao, Bin Wang, Zongyuan Ge, Shuyi Xu, Yuntao Hu

Diabetic retinopathy (DR) is the most prevalent cause of preventable vision loss worldwide and imposes a significant economic and medical burden on society; early identification is the cornerstone of its management. The diagnosis and severity grading of DR rely on scales based on clinically visible features, which lack detailed quantitative parameters. Retinal non-perfusion area (NPA) is a pathogenic characteristic of DR that indicates retinal hypoxia and has been found to be closely associated with disease progression, prognosis, and management. However, the practical value of NPA is limited because it appears on fundus fluorescein angiography (FFA) as scattered, irregularly shaped, darker plaques that are difficult to measure manually. In this study, we propose a deep learning-based method, NPA-Net, for accurate and automatic segmentation of NPAs from FFA images acquired in clinical practice. NPA-Net uses U-net, an encoder-decoder architecture, as its backbone. To improve the model's recognition of NPAs, we adaptively incorporate multi-scale features and contextual information in feature learning and design three modules: an Adaptive Encoder Feature Fusion (AEFF) module, a Multilayer Deep Supervised Loss, and an Atrous Spatial Pyramid Pooling (ASPP) module, which improve recognition of NPAs of different sizes from different perspectives. We conducted extensive experiments on a clinical dataset of 163 eyes with NPAs manually annotated by ophthalmologists. NPA-Net achieved better segmentation performance than other existing methods, with an area under the receiver operating characteristic curve (AUC) of 0.9752, accuracy of 0.9431, sensitivity of 0.8794, specificity of 0.9459, IOU of 0.3876, and Dice of 0.5686. 
This new automatic segmentation model is useful for identifying NPA in clinical practice, generating quantitative parameters that can be useful for further research as well as guiding DR detection, grading severity, treatment planning, and prognosis.
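
For readers unfamiliar with the segmentation metrics quoted above, the following is a minimal sketch of how IOU, Dice, sensitivity, specificity, and accuracy can be computed pixel-wise from a predicted mask and a reference mask. The function name `segmentation_metrics` and the 4×4 masks are hypothetical illustrations, not code or data from the study, and the sketch assumes both masks contain at least one foreground and one background pixel.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise metrics for two binary masks given as equal-sized 2D lists of 0/1."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # predicted foreground, truly foreground
            elif p and not t:
                fp += 1          # predicted foreground, truly background
            elif t:
                fn += 1          # missed foreground pixel
            else:
                tn += 1          # correctly predicted background
    return {
        "iou": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical 4x4 masks (1 = non-perfusion pixel), not data from the study.
truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
metrics = segmentation_metrics(pred, truth)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and specificity count background pixels, so they can be high when the target region is small even if IOU and Dice remain modest. For a single pair of masks, Dice equals 2·IOU / (1 + IOU); values averaged over many images need not satisfy that identity exactly.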

Leveraging large-scale genetic data to assess the causal impact of COVID-19 on multisystemic diseases
IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-12 DOI: 10.1186/s40537-024-00997-4
Xiangyang Zhang, Zhaohui Jiang, Jiayao Ma, Yaru Qi, Yin Li, Yan Zhang, Yihan Liu, Chaochao Wei, Yihong Chen, Ping Liu, Yinghui Peng, Jun Tan, Ying Han, Shan Zeng, Changjing Cai, Hong Shen

Background

The long-term impacts of COVID-19 on human health are a major concern, yet comprehensive evaluations of its effects on various health conditions are lacking.

Methods

This study aims to evaluate the role of various diseases in relation to COVID-19 by analyzing genetic data from a large-scale population of more than 2,000,000 individuals. A bidirectional two-sample Mendelian randomization approach was used, with exposures including COVID-19 susceptibility, hospitalization, and severity, and outcomes encompassing 86 different diseases or traits. A reverse Mendelian randomization analysis was performed to assess the impact of these diseases on COVID-19.

Results

Our analysis identified causal relationships between COVID-19 susceptibility and several conditions, including breast cancer (OR = 1.0073, 95% CI = 1.0032–1.0114, p = 5 × 10⁻⁴), ER+ breast cancer (OR = 0.5252, 95% CI = 0.3589–0.7685, p = 9 × 10⁻⁴), and heart failure (OR = 1.0026, 95% CI = 1.001–1.0042, p = 0.002). COVID-19 hospitalization was causally linked to heart failure (OR = 1.0017, 95% CI = 1.0006–1.0028, p = 0.002) and Alzheimer’s disease (OR = 1.5092, 95% CI = 1.1942–1.9072, p = 0.0006). COVID-19 severity had causal effects on primary biliary cirrhosis (OR = 2.6333, 95% CI = 1.8274–3.7948, p = 2.059 × 10⁻⁷), celiac disease (OR = 0.0708, 95% CI = 0.0538–0.0932, p = 9.438 × 10⁻⁸⁰), and Alzheimer’s disease (OR = 1.5092, 95% CI = 1.1942–1.9072, p = 0.0006). Reverse MR analysis indicated that rheumatoid arthritis, diabetic nephropathy, multiple sclerosis, and total testosterone (female) influence COVID-19 outcomes. We assessed heterogeneity and horizontal pleiotropy to ensure result reliability and employed the Steiger directionality test to confirm the direction of causality.
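
As a reading aid for the odds ratios and confidence intervals above, the sketch below shows one common way to recover a standard error, z-statistic, and two-sided p-value from a reported OR and 95% CI. It assumes the interval is a symmetric Wald interval on the log-odds scale, which the abstract does not state; the helper name `wald_from_or` is hypothetical and this is not the authors' analysis code.

```python
import math


def wald_from_or(or_value, ci_low, ci_high, z_crit=1.959964):
    """Recover log-OR, standard error, z and two-sided p from an OR and its 95% CI.

    Assumes a symmetric Wald interval on the log scale:
    log(CI bounds) = log(OR) +/- z_crit * SE.
    """
    log_or = math.log(or_value)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return log_or, se, z, p


# Heart failure estimate for COVID-19 susceptibility, as quoted in the Results.
log_or, se, z, p = wald_from_or(1.0026, 1.001, 1.0042)
print(f"log(OR) = {log_or:.6f}, SE = {se:.6f}, z = {z:.2f}, p = {p:.4f}")
```

If the Wald assumption holds, the recovered p-value should be of the same order as the reported one; rounding of the published bounds limits how closely the two can agree.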

Conclusions

This study provides a comprehensive analysis of the causal relationships between COVID-19 and diverse health conditions. Our findings highlight the long-term impacts of COVID-19 on human health, emphasizing the need for continuous monitoring and targeted interventions for affected individuals. Future research should explore these relationships to develop comprehensive healthcare strategies.

Evolutionary computation-based self-supervised learning for image processing: a big data-driven approach to feature extraction and fusion for multispectral object detection
IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-12 DOI: 10.1186/s40537-024-00988-5
Xiaoyang Shen, Haibin Li, Achyut Shankar, Wattana Viriyasitavat, Vinay Chamola

Image object recognition and detection technologies are widely used in many scenarios. In recent years, big data has become increasingly abundant, and big data-driven artificial intelligence models have attracted growing attention. Evolutionary computation has also become a powerful driver for optimizing and improving deep learning models. In this paper, we propose an image object detection method based on self-supervised and data-driven learning. Unlike other methods, our approach uses multispectral data fusion and evolutionary computation for model optimization. Specifically, our method combines visible light images and infrared images to detect and identify image targets. First, we use a self-supervised learning method and the AutoEncoder model to perform high-dimensional feature extraction on the two types of images. Second, we fuse the features extracted from the visible light and infrared images to detect and identify objects. Third, we introduce a model parameter optimization method that uses evolutionary learning algorithms to enhance model performance. Validation on public datasets shows that our method achieves comparable or superior performance to existing methods.
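
The abstract does not say which evolutionary algorithm is used for parameter optimization. As an illustration of the general idea only, the sketch below applies a simple (1+λ) evolution strategy, chosen here as a stand-in, to tune two fusion weights that combine a visible-light feature and an infrared feature. The `evolve` function, the toy samples, and the squared-error loss are all hypothetical.

```python
import random


def evolve(loss, start, generations=200, offspring=8, sigma=0.3, seed=0):
    """(1+lambda) evolution strategy: keep the parent unless a child is at least as good."""
    rng = random.Random(seed)
    parent, parent_loss = list(start), loss(start)
    for _ in range(generations):
        # Mutate the parent with Gaussian noise to create the offspring.
        children = [
            [w + rng.gauss(0.0, sigma) for w in parent] for _ in range(offspring)
        ]
        best_child = min(children, key=loss)
        best_child_loss = loss(best_child)
        # Elitist selection: the parent survives if no child matches or beats it.
        if best_child_loss <= parent_loss:
            parent, parent_loss = best_child, best_child_loss
    return parent, parent_loss


# Hypothetical samples: (visible feature, infrared feature, target score).
samples = [(1.0, 0.0, 0.7), (0.0, 1.0, 0.3), (1.0, 1.0, 1.0)]


def fusion_loss(weights):
    """Squared error of a weighted sum of the two modality features."""
    w_vis, w_ir = weights
    return sum((w_vis * v + w_ir * r - y) ** 2 for v, r, y in samples)


initial_loss = fusion_loss([0.0, 0.0])
best_weights, final_loss = evolve(fusion_loss, [0.0, 0.0])
print(f"initial loss = {initial_loss:.4f}, final loss = {final_loss:.4f}")
print(f"fusion weights = {best_weights}")
```

Because selection is elitist, the loss never increases from one generation to the next. In a real detector the loss would be a validation metric of the full model rather than this toy objective.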

A model for investment type recommender system based on the potential investors based on investors and experts feedback using ANFIS and MNN
IF 8.1 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-12 DOI: 10.1186/s40537-024-00965-y
Asefeh Asemi, Adeleh Asemi, Andrea Ko

This article presents an investment recommender system based on an Adaptive Neuro-Fuzzy Inference System (ANFIS) and pre-trained weights from a Multimodal Neural Network (MNN). The model is designed to support the investment process for the customers and takes into consideration seven factors to implement the proposed investment system model through the customer or potential investor data set. The system takes input from a web-based questionnaire that collects data on investors' preferences and investment goals. The data is then preprocessed and clustered using ETL tools, JMP, MATLAB, and Python. The ANFIS-based recommender system is designed with three inputs and one output and trained using a hybrid approach over three epochs with 188 data pairs and 18 fuzzy rules. The system's performance is evaluated using metrics such as RMSE, accuracy, precision, recall, and F1-score. The system is also designed to incorporate expert feedback and opinions from investors to customize and improve investment recommendations. The article concludes that the proposed ANFIS-based investment recommender system is effective and accurate in generating investment recommendations that meet investors' preferences and goals.
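
To make the ANFIS description more concrete, here is a minimal sketch of the forward pass of a first-order Sugeno fuzzy system, the inference structure that ANFIS trains. The abstract reports three inputs, one output, and 18 fuzzy rules; the 3 × 3 × 2 grid of Gaussian membership functions used below is one partition that yields 18 rules and is an assumption, as are all parameter values and the name `sugeno_forward`. Training is not shown.

```python
import math
from itertools import product


def gaussian(x, center, width):
    """Gaussian membership degree of x, in (0, 1]."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)


def sugeno_forward(inputs, memberships, consequents):
    """First-order Sugeno inference.

    memberships[i] is a list of (center, width) pairs for input i.
    consequents maps each rule (a tuple of membership indices, one per input)
    to (coefficients, bias) of its linear output function.
    """
    weighted_sum = 0.0
    total_strength = 0.0
    for rule in product(*(range(len(mfs)) for mfs in memberships)):
        # Firing strength: product of the membership degrees selected by the rule.
        strength = 1.0
        for x, mfs, idx in zip(inputs, memberships, rule):
            center, width = mfs[idx]
            strength *= gaussian(x, center, width)
        coefs, bias = consequents[rule]
        rule_output = sum(c * x for c, x in zip(coefs, inputs)) + bias
        weighted_sum += strength * rule_output
        total_strength += strength
    # Normalized weighted average of the rule outputs.
    return weighted_sum / total_strength


# Hypothetical partition: 3, 3 and 2 membership functions -> 3 * 3 * 2 = 18 rules.
memberships = [
    [(0.0, 0.3), (0.5, 0.3), (1.0, 0.3)],
    [(0.0, 0.3), (0.5, 0.3), (1.0, 0.3)],
    [(0.25, 0.4), (0.75, 0.4)],
]
rules = list(product(range(3), range(3), range(2)))
# Placeholder consequents: every rule outputs the constant 0.5.
consequents = {rule: ((0.0, 0.0, 0.0), 0.5) for rule in rules}

score = sugeno_forward([0.2, 0.8, 0.5], memberships, consequents)
print(len(rules), score)
```

With identical constant consequents, the normalized output equals that constant regardless of the inputs, which is a quick sanity check on the normalization step. A trained system would have distinct consequent parameters per rule.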

Journal: Journal of Big Data