
Journal of Big Data: Latest Publications

De-occlusion and recognition of frontal face images: a comparative study of multiple imputation methods
IF 8.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-29 DOI: 10.1186/s40537-024-00925-6
Joseph Agyapong Mensah, Ezekiel N. N. Nortey, Eric Ocran, Samuel Iddi, Louis Asiedu

Increasingly, automatic face recognition algorithms have become necessary with the development and extensive use of face recognition technology, particularly in the era of machine learning and artificial intelligence. However, unconstrained environmental conditions degrade the quality of acquired face images and may deteriorate the performance of many classical face recognition algorithms. Against this backdrop, many researchers have given considerable attention to image restoration and enhancement mechanisms, but with minimal focus on occlusion-related and multiple-constrained problems. Although occlusion-robust face recognition modules based on sparse representation have been explored, they require a large number of features to achieve correct computations and to maximize robustness to occlusions. Such an approach may therefore become deficient in the presence of random occlusions of even moderate magnitude. This study assesses the robustness of a face recognition module combining Principal Component Analysis and Singular Value Decomposition, with Discrete Wavelet Transformation for preprocessing and city block distance for classification (DWT-PCA/SVD-L1), to image degradation caused by random occlusions of varying magnitudes (10% and 20%) in test images acquired under varying expressions. Numerical evaluation showed that using de-occluded faces for recognition significantly enhanced the performance of the study's recognition module at each occlusion level (10% and 20%). The algorithm attained its highest recognition rates, 85.94% and 78.65% at 10% and 20% occlusion respectively, when the MICE de-occluded face images were used for recognition. With the exception of entropy, for which the MICE de-occluded face images attained the highest average value, MICE and RegEM produced images of similar quality as measured by absolute mean brightness error (AMBE) and peak signal-to-noise ratio (PSNR). The study therefore recommends MICE as a suitable imputation mechanism for de-occlusion of face images acquired under varying expressions.
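The city block (L1) distance used for classification in the DWT-PCA/SVD-L1 module amounts to a nearest-neighbour rule over projected feature vectors. A minimal sketch, with illustrative feature vectors rather than the paper's actual PCA/SVD projections:

```python
# Nearest-neighbour classification with the city block (L1 / Manhattan)
# distance, the classification rule of the DWT-PCA/SVD-L1 module.
# The gallery features and probe vector are illustrative, not the
# paper's actual projected face features.

def city_block(u, v):
    """L1 (Manhattan) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def classify(probe, gallery):
    """Return the label of the gallery vector closest to the probe."""
    return min(gallery, key=lambda item: city_block(probe, item[1]))[0]

# Hypothetical projected face features: (label, feature vector)
gallery = [
    ("subject_A", [0.9, 0.1, 0.4]),
    ("subject_B", [0.2, 0.8, 0.7]),
]
probe = [0.85, 0.15, 0.5]

print(classify(probe, gallery))  # subject_A: nearest in L1 distance
```

The probe is assigned to subject_A because its summed coordinate-wise absolute difference (0.2) is smaller than that to subject_B (1.5).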

Citations: 0
Profitability trend prediction in crypto financial markets using Fibonacci technical indicator and hybrid CNN model
IF 8.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-28 DOI: 10.1186/s40537-024-00908-7
Bilal Hassan Ahmed Khattak, Imran Shafi, Chaudhary Hamza Rashid, Mejdl Safran, Sultan Alfarhood, Imran Ashraf

Cryptocurrency has become a popular trading asset due to its security, anonymity, and decentralization. However, predicting the direction of the financial market can be challenging, leading to difficult financial decisions and potential losses. The purpose of this study is to gain insight into the impact of the Fibonacci technical indicator (TI) and of multi-class classification based on trend direction and price strength (trend-strength) on the performance and profitability of artificial intelligence (AI) models, particularly a hybrid convolutional neural network (CNN) incorporating long short-term memory (LSTM), and to modify the model to reduce its complexity. The main contribution of this paper lies in its introduction of the Fibonacci TI, demonstrating its impact on financial prediction, and its incorporation of a multi-class classification technique focusing on trend strength, thereby enhancing the depth and accuracy of predictions. Lastly, profitability analysis sheds light on the tangible benefits of utilizing Fibonacci levels and multi-class classification. The research methodology for the profitability analysis is based on a hybrid investment strategy (direction and strength) employing a six-stage predictive system: data collection, preprocessing, sampling, training and prediction, investment simulation, and evaluation. Empirical findings show that the Fibonacci TI improved the performance (in 44% of configurations) and profitability (in 68% of configurations) of the AI models. Hybrid CNNs showed the largest performance improvements, particularly the C-LSTM model for trend (binary: 0.0023) and trend-strength (4-class: 0.0020; 6-class: 0.0099) prediction. Hybrid CNNs also showed improved profitability, particularly C-LSTM, and improved performance in the modified C-LSTM (C-LSTM mod). Trend-strength prediction showed the largest improvements in long-strategy ROI (6.89%) and in average ROI for the long-short strategy. Regarding the choice between hybrid CNNs, the C-LSTM mod is a viable option for trend-strength prediction at 4-class and 6-class granularity due to its better performance and profitability.
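The Fibonacci retracement levels that such a technical indicator is built on follow a standard formula over a swing high and swing low. A minimal sketch with illustrative prices; the paper's exact feature construction from these levels may differ:

```python
# Standard Fibonacci retracement levels between a swing high and swing
# low, the price levels underlying a Fibonacci technical indicator (TI).
# The swing prices here are illustrative.

FIB_RATIOS = (0.0, 0.236, 0.382, 0.5, 0.618, 0.786, 1.0)

def retracement_levels(swing_high, swing_low):
    """Price level at each standard Fibonacci ratio, measured down
    from the swing high (uptrend retracement)."""
    span = swing_high - swing_low
    return {r: swing_high - r * span for r in FIB_RATIOS}

levels = retracement_levels(swing_high=100.0, swing_low=50.0)
print(levels[0.5])                 # 75.0 (midpoint retracement)
print(round(levels[0.618], 3))     # 69.1 (the "golden" retracement)
```

A price pulling back toward one of these levels (commonly 0.382, 0.5, or 0.618) is conventionally read as a potential support or resistance zone, which is what makes the levels usable as model features.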

Citations: 0
Big data resolving using Apache Spark for load forecasting and demand response in smart grid: a case study of Low Carbon London Project
IF 8.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-28 DOI: 10.1186/s40537-024-00909-6
Hussien Ali El-Sayed Ali, M. H. Alham, Doaa Khalil Ibrahim

Using recent information and communication technologies for monitoring and management has initiated a revolution in the smart grid. These technologies generate massive data that can only be processed using big data tools. This paper emphasizes the role of big data in resolving load forecasting, renewable energy source integration, and demand response as significant aspects of smart grids. Meter data from the Low Carbon London Project are investigated as a case study. Because of the immense stream of meter readings and the exogenous data added to load forecasting models, the problem is addressed in a big data context. Descriptive analytics are developed using Spark SQL to gain insights into household energy consumption. Spark MLlib is utilized for predictive analytics by building scalable machine learning models that accommodate the meter data streams. Multivariate polynomial regression and decision tree models are preferred here from a big data standpoint and because the literature indicates they are both accurate and interpretable. The results confirmed the ability of descriptive analytics and data visualization to provide valuable insights, guide the feature selection process, and enhance the accuracy of the load forecasting models. Accordingly, proper evaluation of demand response programs and integration of renewable energy resources is accomplished using the achieved load forecasting results.
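The descriptive-analytics step, aggregating smart-meter readings with SQL, can be sketched as follows. Standard-library sqlite3 stands in for Spark SQL so the sketch is self-contained; on a cluster the same query shape would run via `spark.sql()` over a DataFrame. The table layout, column names, and household IDs are hypothetical:

```python
# Sketch of SQL-based descriptive analytics over half-hourly smart-meter
# readings.  sqlite3 (stdlib) stands in for Spark SQL here; the query
# shape is what matters.  Table/column names and IDs are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (household TEXT, ts TEXT, kwh REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("MAC000001", "2013-01-01T00:00", 0.21),
        ("MAC000001", "2013-01-01T00:30", 0.35),
        ("MAC000002", "2013-01-01T00:00", 0.12),
    ],
)

# Per-household consumption summary: the kind of insight used to guide
# feature selection for the load forecasting models.
rows = conn.execute(
    """SELECT household, COUNT(*) AS n, ROUND(SUM(kwh), 2) AS total_kwh
       FROM readings GROUP BY household ORDER BY household"""
).fetchall()
print(rows)  # one (household, reading_count, total_kwh) row each
```

Swapping the sqlite3 connection for a SparkSession turns this into a distributed aggregation without changing the SQL, which is the scalability argument the paper makes for Spark.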

Citations: 0
Green and sustainable AI research: an integrated thematic and topic modeling analysis
IF 8.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-22 DOI: 10.1186/s40537-024-00920-x
Raghu Raman, Debidutta Pattnaik, Hiran H. Lathabai, Chandan Kumar, Kannan Govindan, Prema Nedungadi

This investigation delves into the Green AI and Sustainable AI literature through a dual-analytical approach, combining thematic analysis with BERTopic modeling to reveal both broad thematic clusters and nuanced emerging topics. It identifies three major thematic clusters: (1) Responsible AI for Sustainable Development, focusing on integrating sustainability and ethics within AI technologies; (2) Advancements in Green AI for Energy Optimization, centering on energy efficiency; and (3) Big Data-Driven Computational Advances, emphasizing AI’s influence on socio-economic and environmental aspects. Concurrently, BERTopic modeling uncovers five emerging topics: Ethical Eco-Intelligence, Sustainable Neural Computing, Ethical Healthcare Intelligence, AI Learning Quest, and Cognitive AI Innovation, indicating a trend toward embedding ethical and sustainability considerations into AI research. The study reveals novel intersections between Sustainable and Ethical AI and Green Computing, pointing to significant research trends and identifying Ethical Healthcare Intelligence and AI Learning Quest as evolving areas within AI’s socio-economic and societal impacts. The study advocates for a unified approach to innovation in AI, promoting environmental sustainability and ethical integrity to foster responsible AI development. This aligns with the Sustainable Development Goals, emphasizing the need for ecological balance, societal welfare, and responsible innovation. This refined focus underscores the critical need for integrating ethical and environmental considerations into the AI development lifecycle, offering insights for future research directions and policy interventions.

Citations: 0
An improved deep hashing model for image retrieval with binary code similarities
IF 8.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-18 DOI: 10.1186/s40537-024-00919-4
Huawen Liu, Zongda Wu, Minghao Yin, Donghua Yu, Xinzhong Zhu, Jungang Lou

The exponential growth of data raises an unprecedented challenge in data analysis: how to retrieve interesting information from such large-scale data. Hash learning is a promising solution to this challenge, because projecting high-dimensional data to compact binary codes may bring many potential advantages, such as extremely high efficiency and low storage cost. However, traditional hash learning algorithms often suffer from semantic inconsistency, where images with similar semantic features may have different binary codes. In this paper, we propose a novel end-to-end deep hashing method based on the similarities of binary codes, dubbed CSDH (Code Similarity-based Deep Hashing), for image retrieval. Specifically, it extracts deep features from images to capture semantic information using a pre-trained deep convolutional neural network. Additionally, a fully connected hidden layer is attached at the end of the deep network to derive hash bits via an activation function. To preserve the semantic consistency of images, a loss function is introduced that takes both label similarities and Hamming embedding distances into consideration. By doing so, CSDH can learn more compact and powerful hash codes, which not only preserve semantic similarity but also yield small Hamming distances between similar images. To verify the effectiveness of CSDH, we evaluate it on two public benchmark image collections, CIFAR-10 and NUS-WIDE, against five classic shallow hashing models and six popular deep hashing ones. The experimental results show that CSDH achieves performance competitive with the popular deep hashing algorithms.
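The retrieval payoff of such hash codes is that ranking by Hamming distance is cheap on integer-packed bits. A minimal sketch with illustrative 8-bit codes, not outputs of the actual network:

```python
# Retrieval by Hamming distance over compact binary hash codes, the
# property CSDH's loss is designed to preserve.  The 8-bit codes and
# image IDs below are illustrative.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-packed codes."""
    return bin(a ^ b).count("1")

# Hypothetical database of (image_id, 8-bit hash code)
database = [
    ("cat_01", 0b10110100),
    ("cat_02", 0b10110110),  # close to cat_01: semantically similar
    ("car_07", 0b01001011),
]
query = 0b10110101

# Rank database images by Hamming distance to the query code.
ranked = sorted(database, key=lambda item: hamming(query, item[1]))
print([img for img, _ in ranked])  # ['cat_01', 'cat_02', 'car_07']
```

Because XOR and popcount run in a few machine instructions per code, this ranking scales to very large databases, which is the efficiency advantage the abstract attributes to hash learning.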

Citations: 0
Revisiting the potential value of vital signs in the real-time prediction of mortality risk in intensive care unit patients
IF 8.1 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-18 DOI: 10.1186/s40537-024-00896-8
Pan Pan, Yue Wang, Chang Liu, Yanhui Tu, Haibo Cheng, Qingyun Yang, Fei Xie, Yuan Li, Lixin Xie, Yuhong Liu
<h3 data-test="abstract-sub-heading">Background</h3><p>Predicting patient mortality risk facilitates early intervention for intensive care unit (ICU) patients at greater risk of disease progression. This study applies machine learning methods to multidimensional clinical data to dynamically predict mortality risk in ICU patients.</p><h3 data-test="abstract-sub-heading">Methods</h3><p>A total of 33,798 patients in the MIMIC-III database were collected. An integrated model, NIMRF (Network Integrating Memory Module and Random Forest), based on multidimensional variables such as vital sign and laboratory variables, was developed to predict the risk of death for ICU patients in four non-overlapping time windows of 0–1 h, 1–3 h, 3–6 h, and 6–12 h. Prediction over these four time windows was externally validated on data from 889 patients in the respiratory critical care unit of the Chinese PLA General Hospital and compared with LSTM, random forest, and time-dependent Cox regression (survival analysis) methods. We also interpret the developed model to obtain the factors most important for predicting mortality risk across time windows. The code can be found at https://github.com/wyuexiao/NIMRF.</p><h3 data-test="abstract-sub-heading">Results</h3><p>The NIMRF model developed in this study could predict the risk of death in four non-overlapping time windows (0–1 h, 1–3 h, 3–6 h, 6–12 h) after any time point in ICU patients. Internal validation suggested the model is more accurate than LSTM, random forest, and time-dependent Cox regression (area under the receiver operating characteristic curve, or AUC, 0–1 h: 0.8015 [95% CI 0.7725–0.8304] vs. 0.7144 [95% CI 0.6824–0.7464] vs. 0.7606 [95% CI 0.7300–0.7913] vs. 0.3867 [95% CI 0.3573–0.4161]; 1–3 h: 0.7100 [95% CI 0.6777–0.7423] vs. 0.6389 [95% CI 0.6055–0.6723] vs. 0.6992 [95% CI 0.6667–0.7318] vs. 0.3854 [95% CI 0.3559–0.4150]; 3–6 h: 0.6760 [95% CI 0.6425–0.7097] vs. 0.5964 [95% CI 0.5622–0.6306] vs. 0.6760 [95% CI 0.6427–0.7099] vs. 0.3967 [95% CI 0.3662–0.4271]; 6–12 h: 0.6380 [95% CI 0.6031–0.6729] vs. 0.6032 [95% CI 0.5705–0.6406] vs. 0.6055 [95% CI 0.5682–0.6383] vs. 0.4023 [95% CI 0.3709–0.4337]). External validation was performed on the data of patients in the respiratory critical care unit of the Chinese PLA General Hospital. Compared with LSTM, random forest, and time-dependent Cox regression, the NIMRF model was still the best, with an AUC of 0.9366 [95% CI 0.9157–0.9575] for predicting death risk in 0–1 h; the corresponding AUCs of LSTM, random forest, and time-dependent Cox regression were 0.9263 [95% CI 0.9039–0.9486], 0.7437 [95% CI 0.7083–0.7791], and 0.2447 [95% CI 0.2202–0.2692], respectively. Interpretation of the model revealed that vital signs (systolic blood pressure, heart rate, diastolic blood pressure, respiratory rate, and body temperature) were highly correlated with death events.</p><h3 data-test="abstract-sub-heading">Conclusion</h3><p>Using the NIMRF model, multidimensional ICU variable data, especially vital sign data, can be integrated to accurately predict death events in ICU patients. These predictions can help clinicians choose more timely and precise treatments and interventions and, more importantly, reduce invasive procedures and save medical costs.</p>
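The AUC values compared above can be computed for any risk scorer via the rank-sum (Mann-Whitney) identity: the probability that a randomly chosen positive case is scored above a randomly chosen negative one, counting ties as one half. A minimal sketch with illustrative labels and scores, not the study's data:

```python
# Area under the ROC curve (AUC) via the Mann-Whitney / rank-sum
# identity: the fraction of positive-negative pairs in which the
# positive case receives the higher score (ties count 1/2).
# Labels and scores below are illustrative.

def roc_auc(labels, scores):
    """AUC for binary labels (1 = event, 0 = no event)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy mortality-risk scores: higher score = higher predicted risk.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.7, 0.3, 0.2]
print(roc_auc(labels, scores))  # 5/6: one pos-neg pair is mis-ordered
```

An AUC of 0.5 corresponds to random ranking, which is why values well below 0.5 (such as the Cox model's 0.2447 above) indicate a scorer whose ordering is largely inverted on that data.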
{"title":"Revisiting the potential value of vital signs in the real-time prediction of mortality risk in intensive care unit patients","authors":"Pan Pan, Yue Wang, Chang Liu, Yanhui Tu, Haibo Cheng, Qingyun Yang, Fei Xie, Yuan Li, Lixin Xie, Yuhong Liu","doi":"10.1186/s40537-024-00896-8","DOIUrl":"https://doi.org/10.1186/s40537-024-00896-8","url":null,"abstract":"&lt;h3 data-test=\"abstract-sub-heading\"&gt;Background&lt;/h3&gt;&lt;p&gt;Predicting patient mortality risk facilitates early intervention in intensive care unit (ICU) patients at greater risk of disease progression. This study applies machine learning methods to multidimensional clinical data to dynamically predict mortality risk in ICU patients.&lt;/p&gt;&lt;h3 data-test=\"abstract-sub-heading\"&gt;Methods&lt;/h3&gt;&lt;p&gt;A total of 33,798 patients in the MIMIC-III database were collected. An integrated model NIMRF (Network Integrating Memory Module and Random Forest) based on multidimensional variables such as vital sign variables and laboratory variables was developed to predict the risk of death for ICU patients in four non overlapping time windows of 0–1 h, 1–3 h, 3–6 h, and 6–12 h. Mortality risk in four nonoverlapping time windows of 12 h was externally validated on data from 889 patients in the respiratory critical care unit of the Chinese PLA General Hospital and compared with LSTM, random forest and time-dependent cox regression model (survival analysis) methods. We also interpret the developed model to obtain important factors for predicting mortality risk across time windows. 
The code can be found in https://github.com/wyuexiao/NIMRF.&lt;/p&gt;&lt;h3 data-test=\"abstract-sub-heading\"&gt;Results&lt;/h3&gt;&lt;p&gt;The NIMRF model developed in this study could predict the risk of death in four nonoverlapping time windows (0–1 h, 1–3 h, 3–6 h, 6–12 h) after any time point in ICU patients, and in internal data validation, it is suggested that the model is more accurate than LSTM, random forest prediction and time-dependent cox regression model (area under receiver operating characteristic curve, or AUC, 0–1 h: 0.8015 [95% CI 0.7725–0.8304] vs. 0.7144 [95%] CI 0.6824–0.7464] vs. 0.7606 [95% CI 0.7300–0.7913] vs 0.3867 [95% CI 0.3573–0.4161]; 1–3 h: 0.7100 [95% CI 0.6777–0.7423] vs. 0.6389 [95% CI 0.6055–0.6723] vs. 0.6992 [95% CI 0.6667–0.7318] vs 0.3854 [95% CI 0.3559–0.4150]; 3–6 h: 0.6760 [95% CI 0.6425–0.7097] vs. 0.5964 [95% CI 0.5622–0.6306] vs. 0.6760 [95% CI 0.6427–0.7099] vs 0.3967 [95% CI 0.3662–0.4271]; 6–12 h: 0.6380 [0.6031–0.6729] vs. 0.6032 [0.5705–0.6406] vs. 0.6055 [0.5682–0.6383] vs 0.4023 [95% CI 0.3709–0.4337]). External validation was performed on the data of patients in the respiratory critical care unit of the Chinese PLA General Hospital. Compared with LSTM, random forest and time-dependent cox regression model, the NIMRF model was still the best, with an AUC of 0.9366 [95% CI 0.9157–0.9575 for predicting death risk in 0–1 h]. The corresponding AUCs of LSTM, random forest and time-dependent cox regression model were 0.9263 [95% CI 0.9039–0.9486], 0.7437 [95% CI 0.7083–0.7791] and 0.2447 [95% CI 0.2202–0.2692], respectively. 
Interpretation of the model revealed that vital signs (systolic blood pressure, heart rate, diastolic blood pressure, respiratory rate, and body temperature) were highly correlated with events of death.&lt;/p&gt;&lt;h3 data-test=\"abstract-sub-heading\"&gt;Conclusion&lt;/h3&gt;&lt;p&gt;","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"11 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140626658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
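The abstract reports each AUC with a 95% confidence interval but does not say how the intervals were computed. As a minimal pure-Python sketch, assuming a percentile-bootstrap interval (one common choice; the authors may have used another method such as DeLong's), the rank-based AUC and its CI can be computed like this:

```python
import random

def auc(labels, scores):
    """Rank-based AUC: probability that a random positive outranks a random
    negative, with ties counted as 0.5 (the Mann-Whitney U formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n, stats = len(labels), []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # resample must contain both classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

With `labels` as 0/1 death indicators and `scores` as model risk predictions, `bootstrap_ci` returns the lower and upper bounds in the same `[low, high]` form the abstract reports.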
Enhancing academic performance prediction with temporal graph networks for massive open online courses
IF 8.1, CAS Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-04-13. DOI: 10.1186/s40537-024-00918-5
Qionghao Huang, Jili Chen

Educational big data significantly impacts education, and Massive Open Online Courses (MOOCs), a crucial learning approach, have evolved to be more intelligent with these technologies. Deep neural networks have significantly advanced a crucial task within MOOCs: predicting student academic performance. However, most deep learning-based methods ignore the temporal information and interaction behaviors that occur during learning activities, even though these signals can effectively enhance a model's predictive accuracy. To tackle this, we formulate the learning processes of e-learning students as dynamic temporal graphs that encode the temporal information and interaction behaviors during their studying. We propose a novel academic performance prediction model (APP-TGN) based on temporal graph neural networks. Specifically, in APP-TGN, a dynamic graph is constructed from online learning activity logs. A temporal graph network with low-high filters learns potential academic performance variations encoded in dynamic graphs. Furthermore, a global sampling module is developed to mitigate the problem of false correlations in deep learning-based models. Finally, multi-head attention is utilized for predicting academic outcomes. Extensive experiments are conducted on a well-known public dataset. The experimental results indicate that APP-TGN significantly surpasses existing methods and demonstrates excellent potential in automated feedback and personalized learning.

{"title":"Enhancing academic performance prediction with temporal graph networks for massive open online courses","authors":"Qionghao Huang, Jili Chen","doi":"10.1186/s40537-024-00918-5","DOIUrl":"https://doi.org/10.1186/s40537-024-00918-5","url":null,"abstract":"<p>Educational big data significantly impacts education, and Massive Open Online Courses (MOOCs), a crucial learning approach, have evolved to be more intelligent with these technologies. Deep neural networks have significantly advanced the crucial task within MOOCs, predicting student academic performance. However, most deep learning-based methods usually ignore the temporal information and interaction behaviors during the learning activities, which can effectively enhance the model’s predictive accuracy. To tackle this, we formulate the learning processes of e-learning students as dynamic temporal graphs to encode the temporal information and interaction behaviors during their studying. We propose a novel academic performance prediction model (APP-TGN) based on temporal graph neural networks. Specifically, in APP-TGN, a dynamic graph is constructed from online learning activity logs. A temporal graph network with low-high filters learns potential academic performance variations encoded in dynamic graphs. Furthermore, a global sampling module is developed to mitigate the problem of false correlations in deep learning-based models. Finally, multi-head attention is utilized for predicting academic outcomes. Extensive experiments are conducted on a well-known public dataset. 
The experimental results indicate that APP-TGN significantly surpasses existing methods and demonstrates excellent potential in automated feedback and personalized learning.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"8 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140573466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
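The paper constructs dynamic graphs from online learning activity logs, but the exact log schema is not given. The sketch below therefore assumes hypothetical (student, resource, timestamp) tuples and groups them into fixed-width time-window snapshots, a minimal illustration of the dynamic temporal graph idea rather than APP-TGN itself:

```python
from collections import defaultdict

def build_temporal_graph(logs, window=3600):
    """Group (student, resource, timestamp) interaction logs into a sequence
    of graph snapshots, one per time window. Each snapshot maps a student to
    the resources they touched in that window, with counts as edge weights."""
    t0 = min(t for _, _, t in logs)
    snapshots = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for student, resource, t in logs:
        snapshots[(t - t0) // window][student][resource] += 1
    # return snapshots in chronological order as plain nested dicts
    return [{s: dict(r) for s, r in snapshots[k].items()}
            for k in sorted(snapshots)]

logs = [("alice", "video_1", 0), ("alice", "quiz_1", 100),
        ("bob", "video_1", 200), ("alice", "video_1", 4000)]
snaps = build_temporal_graph(logs, window=3600)
# snaps[0] covers t in [0, 3600); alice's later replay lands in snaps[1]
```

A real pipeline would attach node features and feed the snapshot sequence to a temporal graph network; this only shows how temporal structure survives the log-to-graph step.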
The differences in gastric cancer epidemiological data between SEER and GBD: a joinpoint and age-period-cohort analysis
IF 8.1, CAS Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-04-13. DOI: 10.1186/s40537-024-00907-8
Zenghong Wu, Kun Zhang, Weijun Wang, Mengke Fan, Rong Lin

Background

The worldwide burden of gastric cancer (GC) needs further clarification to help us understand the current situation of GC.

Methods

In the present study, we estimated disability-adjusted life-years (DALYs) and mortality rates attributable to several major GC risk factors, including smoking, dietary risk, and behavioral risk. In addition, we evaluated the incidence rate and trends of incidence-based mortality (IBM) due to GC in the United States (US) during 1992–2018.

Results

Globally, GC incidences increased from 883,395 in 1990 to 1,269,805 in 2019, while GC-associated mortality increased from 788,316 in 1990 to 957,185 in 2019. In 2019, the age-standardized rate (ASR) of GC varied around the world, with Mongolia having the highest observed ASR (43.7 per 100,000), followed by Bolivia (34 per 100,000) and China (30.6 per 100,000). A negative association was found between the estimated annual percentage change (EAPC) and the ASR (age-standardized incidence rate (ASIR): r = − 0.28, p < 0.001; age-standardized death rate (ASDR): r = − 0.19, p = 0.005). There were 74,966 incidences of GC and 69,374 GC-related deaths recorded between 1992 and 2018. The significant decrease in GC incidences, as well as the decreasing trend in the IBM of GC, was first detected in 1994. The GC IBM increased significantly at a rate of 35%/y from 1992 to 1994 (95% CI 21.2% to 50.4%/y) and then began to decrease at a rate of − 1.4%/y from 1994 to 2018 (95% CI − 1.6% to − 1.2%/y).

Conclusion

These findings mirror the global disease burden of GC and are important for development of targeted prevention strategies.

{"title":"The differences in gastric cancer epidemiological data between SEER and GBD: a joinpoint and age-period-cohort analysis","authors":"Zenghong Wu, Kun Zhang, Weijun Wang, Mengke Fan, Rong Lin","doi":"10.1186/s40537-024-00907-8","DOIUrl":"https://doi.org/10.1186/s40537-024-00907-8","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Background</h3><p>The burden of gastric cancer (GC) should be further clarified worldwide, and helped us to understand the current situation of GC.</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>In the present study, we estimated disability-adjusted life-years (DALYs) and mortality rates attributable to several major GC risk factors, including smoking, dietary risk, and behavioral risk. In addition, we evaluated the incidence rate and trends of incidence-based mortality (IBM) due to GC in the United States (US) during 1992–2018.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>Globally, GC incidences increased from 883,395 in 1990 to 1,269,805 in 2019 while GC-associated mortality increased from 788,316 in 1990 to 957,185 in 2019. In 2019, the age-standardized rate (ASR) of GC exhibited variations around the world, with Mongolia having the highest observed ASR (43.7 per 100,000), followed by Bolivia (34 per 100,000) and China (30.6 per 100,000). A negative association was found among estimated annual percentage change (EAPC) and ASR (age-standardized incidence rate (ASIR): r = − 0.28, <i>p</i> &lt; 0.001; age-standardized death rate (ASDR): r = − 0.19, <i>p</i> = 0.005). There were 74,966 incidences of GC and 69,374 GC-related deaths recorded between 1992 and 2018. The significant decrease in GC incidences as well as decreasing trends in IBM of GC were first detected in 1994. 
The GC IBM significantly increased at a rate of 35%/y from 1992 to 1994 (95% CI 21.2% to 50.4%/y), and then begun to decrease at a rate of − 1.4%/y from 1994 to 2018 (95% CI − 1.6% to − 1.2%/y).</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>These findings mirror the global disease burden of GC and are important for development of targeted prevention strategies.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"26 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140573468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
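The EAPC figures in the abstract follow the convention used in GBD-style trend analyses: regress the log of the age-standardized rate on calendar year and transform the slope. A minimal sketch, assuming ordinary least squares on log rates (the standard EAPC computation; the joinpoint segmentation itself is a separate, more involved procedure):

```python
import math

def eapc(years, rates):
    """Estimated annual percentage change: fit ln(rate) = a + b*year by
    ordinary least squares, then EAPC = 100 * (exp(b) - 1)."""
    ys = [math.log(r) for r in rates]
    n = len(years)
    mx, my = sum(years) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(years, ys))
         / sum((x - mx) ** 2 for x in years))
    return 100.0 * (math.exp(b) - 1.0)

# A rate series growing exactly 2% per year recovers an EAPC of 2%.
years = list(range(1990, 2000))
rates = [10.0 * 1.02 ** (y - 1990) for y in years]
print(round(eapc(years, rates), 6))  # → 2.0
```

A positive EAPC indicates rising rates and a negative one declining rates, which is how the 35%/y and − 1.4%/y segments above should be read.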
DAPS diagrams for defining Data Science projects
IF 8.1, CAS Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-04-12. DOI: 10.1186/s40537-024-00916-7
Jeroen de Mast, Joran Lokkerbol

Background

Models for structuring big-data and data-analytics projects typically start with a definition of the project’s goals and the business value they are expected to create. The literature identifies proper project definition as crucial for a project’s success, and also recognizes that the translation of business objectives into data-analytic problems is a difficult task. Unfortunately, common project structures, such as CRISP-DM, provide little guidance for this crucial stage when compared to subsequent project stages such as data preparation and modeling.

Contribution

This paper contributes structure to the project-definition stage of data-analytic projects by proposing the Data-Analytic Problem Structure (DAPS). The diagrammatic technique facilitates the collaborative development of a consistent and precise definition of a data-analytic problem, and the articulation of how it contributes to the organization’s goals. In addition, the technique helps to identify important assumptions and to break down large ambitions into manageable subprojects.

Methods

The semi-formal specification technique took other models for problem structuring — common in fields such as operations research and business analytics — as a point of departure. The proposed technique was applied in 47 real data-analytic projects and refined based on the results, following a design-science approach.

{"title":"DAPS diagrams for defining Data Science projects","authors":"Jeroen de Mast, Joran Lokkerbol","doi":"10.1186/s40537-024-00916-7","DOIUrl":"https://doi.org/10.1186/s40537-024-00916-7","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Background</h3><p>Models for structuring big-data and data-analytics projects typically start with a definition of the project’s goals and the business value they are expected to create. The literature identifies proper project definition as crucial for a project’s success, and also recognizes that the translation of business objectives into data-analytic problems is a difficult task. Unfortunately, common project structures, such as CRISP-DM, provide little guidance for this crucial stage when compared to subsequent project stages such as data preparation and modeling.</p><h3 data-test=\"abstract-sub-heading\">Contribution</h3><p>This paper contributes structure to the project-definition stage of data-analytic projects by proposing the Data-Analytic Problem Structure (DAPS). The diagrammatic technique facilitates the collaborative development of a consistent and precise definition of a data-analytic problem, and the articulation of how it contributes to the organization’s goals. In addition, the technique helps to identify important assumptions, and to break down large ambitions in manageable subprojects.</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>The semi-formal specification technique took other models for problem structuring — common in fields such as operations research and business analytics — as a point of departure. 
The proposed technique was applied in 47 real data-analytic projects and refined based on the results, following a design-science approach.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"36 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140573982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
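DAPS is described as a diagrammatic, semi-formal technique, so no code accompanies the paper. Purely as an illustration, the elements the abstract emphasizes (a precise problem statement, the link to organizational goals, explicit assumptions, and a breakdown into subprojects) can be captured in a structured record. All field names below are hypothetical, not the paper's actual DAPS notation:

```python
from dataclasses import dataclass, field

@dataclass
class DataAnalyticProblem:
    """Illustrative container for a project-definition record, mirroring the
    ingredients the DAPS abstract highlights (field names are hypothetical)."""
    problem_statement: str            # the precise data-analytic question
    business_goal: str                # how the answer creates business value
    unit_of_analysis: str             # what one row/prediction refers to
    assumptions: list = field(default_factory=list)
    subprojects: list = field(default_factory=list)

daps = DataAnalyticProblem(
    problem_statement="Predict 30-day churn probability per customer",
    business_goal="Reduce churn-driven revenue loss",
    unit_of_analysis="customer",
    assumptions=["Historical churn labels are reliable"],
    subprojects=["Baseline model", "Retention-campaign uplift analysis"],
)
```

Writing the definition down in one place, before any data preparation or modeling, is the point the paper makes about the under-served project-definition stage.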
B-CAT: a model for detecting botnet attacks using deep attack behavior analysis on network traffic flows
IF 8.1, CAS Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-04-10. DOI: 10.1186/s40537-024-00900-1
Muhammad Aidiel Rachman Putra, Tohari Ahmad, Dandy Pramana Hostiadi

Threats on computer networks have been increasing rapidly, and irresponsible parties are always trying to exploit vulnerabilities in the network to do various dangerous things. One way to exploit vulnerabilities in a computer network is by employing malware. Botnets are a type of malware that infects and attacks targets in groups. Botnets develop quickly; attacks that were initially sporadic have become periodic and simultaneous. This rapid development shows that botnets are advanced and require more attention and proper handling. Many studies have introduced detection models for botnet attack activity on computer networks. Apart from detecting the presence of botnet attacks, those studies have attempted to explore the characteristics of botnets, such as attack intensity, relationships between activities, and time-segment analysis. However, no research has explicitly detected those characteristics. Each botnet characteristic requires different handling, and recognizing these characteristics can help network administrators make appropriate decisions. For these reasons, this research builds a detection model that can recognize botnet characteristics using sequential traffic mining and similarity analysis. The proposed method consists of two main processes: training, which builds a knowledge base, and testing, which detects botnet activity and attack characteristics. It uses dynamic thresholds to improve the model's sensitivity in recognizing attack characteristics through similarity analysis. The novelty lies in developing and combining analytical techniques of sequential traffic mining, similarity analysis, and dynamic thresholds to detect and recognize the characteristics of botnet attacks explicitly from actual behavior in network traffic. Extensive experiments were conducted for the evaluation on three different datasets, and the results show better performance than other methods.

{"title":"B-CAT: a model for detecting botnet attacks using deep attack behavior analysis on network traffic flows","authors":"Muhammad Aidiel Rachman Putra, Tohari Ahmad, Dandy Pramana Hostiadi","doi":"10.1186/s40537-024-00900-1","DOIUrl":"https://doi.org/10.1186/s40537-024-00900-1","url":null,"abstract":"<p>Threats on computer networks have been increasing rapidly, and irresponsible parties are always trying to exploit vulnerabilities in the network to do various dangerous things. One way to exploit vulnerabilities in a computer network is by employing malware. Botnets are a type of malware that infects and attacks targets in groups. Botnets develop quickly; the characteristics of initially sporadic attacks have grown into periodic and simultaneous. This rapid development has proved that the botnet is advanced and requires more attention and proper handling. Many studies have introduced detection models for botnet attack activity on computer networks. Apart from detecting the presence of botnet attacks, those studies have attempted to explore the characteristics of botnets, such as attack intensity, relationships between activities, and time segment analysis. However, there has been no research that explicitly detects those characteristics. On the other hand, each botnet characteristic requires different handling, while recognizing the characteristics of the botnet can help network administrators make appropriate decisions. Based on these reasons, this research builds a detection model that can recognize botnet characteristics using sequential traffic mining and similarity analysis. The proposed method consists of two main processes. The first is training to build a knowledge base, and the second is testing to detect botnet activity and attack characteristics. It involves dynamic thresholds to improve the model sensitivity in recognizing attack characteristics through similarity analysis. 
The novelty includes developing and combining analytical techniques of sequential traffic mining, similarity analysis, and dynamic threshold to detect and recognize the characteristics of botnet attacks explicitly on actual behavior in network traffic. Extensive experiments have been conducted for the evaluation using three different datasets whose results show better performance than others.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"82 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140573686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
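The abstract does not specify the flow features, the similarity measure, or the dynamic-threshold rule, so the following is only a hedged sketch: cosine similarity of hypothetical flow-feature vectors against a knowledge base of attack signatures, with an illustrative mean + k*std threshold standing in for the paper's dynamic threshold:

```python
import math

def cosine(u, v):
    """Cosine similarity between two flow-feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def detect(flow, knowledge_base, k=0.5):
    """Compare one flow against known attack signatures and flag it when its
    best similarity clears a dynamic threshold set at mean + k*std of all
    similarity scores (an illustrative rule, not the paper's exact one)."""
    sims = [cosine(flow, sig) for sig in knowledge_base]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    threshold = mean + k * std
    best = max(sims)
    return best >= threshold, best, threshold
```

Because the threshold is derived from the score distribution rather than fixed, it adapts per flow, which is the general idea behind using dynamic thresholds to keep sensitivity high.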