
Latest Publications in IET Software

Segmented Frequency-Domain Correlation Prediction Model for Long-Term Time Series Forecasting Using Transformer
IF 1.5 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-07-08 | DOI: 10.1049/2024/2920167
Haozhuo Tong, Lingyun Kong, Jie Liu, Shiyan Gao, Yilu Xu, Yuezhe Chen

Long-term time series forecasting has received significant attention from researchers in recent years. Transformer model-based approaches have emerged as promising solutions in this domain. Nevertheless, most existing methods rely on point-by-point self-attention mechanisms or employ transformations, decompositions, and reconstructions of the entire sequence to capture dependencies. The point-by-point self-attention mechanism becomes impractical for long-term time series forecasting due to its quadratic complexity with respect to the time series length. Decomposition and reconstruction methods may introduce information loss, leading to performance bottlenecks in the models. In this paper, we propose a Transformer-based forecasting model called NPformer. Our method introduces a novel multiscale segmented Fourier attention mechanism. By segmenting the long-term time series and performing discrete Fourier transforms on different segments, we aim to identify frequency-domain correlations between these segments. This allows us to capture dependencies more effectively. In addition, we incorporate a normalization module and a desmoothing factor into the model. These components address the problem of oversmoothing that arises in sequence decomposition methods. Furthermore, we introduce an isometry convolution method to enhance the prediction accuracy of the model. The experimental results demonstrate that NPformer outperforms other Transformer-based methods in long-term time series forecasting.
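As an illustration only, the NumPy sketch below implements the segmentation-plus-DFT step as we read it from the abstract: split the series into segments, take each segment's spectrum, and score pairwise frequency-domain correlations. NPformer's actual multiscale attention, normalization module, desmoothing factor, and isometry convolution are not reproduced, and all names and the segment length are our assumptions.

```python
import numpy as np

def segment_spectra(x: np.ndarray, seg_len: int) -> np.ndarray:
    """Split x into non-overlapping segments and return per-segment DFT magnitudes."""
    n_seg = len(x) // seg_len
    segments = x[: n_seg * seg_len].reshape(n_seg, seg_len)
    return np.abs(np.fft.rfft(segments, axis=1))   # shape (n_seg, seg_len // 2 + 1)

def frequency_domain_correlation(x: np.ndarray, seg_len: int) -> np.ndarray:
    """Pairwise Pearson correlation between segment spectra (the dependency scores)."""
    return np.corrcoef(segment_spectra(x, seg_len))

# Example: a noisy two-tone signal split into 8 segments of length 128.
rng = np.random.default_rng(0)
t = np.arange(1024)
x = np.sin(0.1 * t) + 0.5 * np.sin(0.31 * t) + 0.1 * rng.standard_normal(1024)
print(frequency_domain_correlation(x, seg_len=128).round(2))   # 8 x 8 dependency matrix
```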

Citations: 0
Accounting Management and Optimizing Production Based on Distributed Semantic Recognition
IF 1.6 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-06-18 | DOI: 10.1049/2024/8425877
Ruina Guo, Shu Wang, Guangsen Wei

Accounting management and production optimization are vital aspects of enterprise management, serving as indispensable core components in the modern business landscape. However, conventional methods reliant on manual input exhibit drawbacks such as low recognition accuracy and excessive memory consumption. To address these challenges, semantic recognition technology based on voice signals has emerged as a pivotal solution across various industries. Building on this premise, this paper introduces a distributed semantic recognition-based algorithm for accounting management and production optimization. The proposed algorithm encompasses multiple modules, including a front-end feature extraction module, a channel transmission module, and a voice quality vector quantization module. Additionally, a semantic recognition module is introduced to process the voice signals and generate prediction results. By leveraging extensive accounting management and production data for learning and analysis, the algorithm automatically uncovers patterns and regularities within the data, extracting valuable information. To validate the proposed algorithm, this study uses a dataset from the UCI machine learning repository for analysis and processing. The experimental findings demonstrate that the algorithm introduced in this paper outperforms alternative methods: it achieves a notable 9.3% improvement in comprehensive recognition accuracy and reduces memory usage by 34.4%. These results highlight the algorithm’s efficacy in enhancing the understanding and analysis of customer needs, market trends, competitors, and other information pertinent to companies’ commercial applications.
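Of the modules listed, the voice-quality vector-quantization step is the most self-contained to illustrate. Below is a minimal sketch, assuming a generic codebook quantizer learned with scikit-learn's KMeans over stand-in acoustic frame features; nothing here reflects the paper's actual module design.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.standard_normal((500, 13))   # stand-in for 13-dim per-frame voice features

# Learn a 32-entry codebook and map each frame to its nearest codeword.
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(frames)
codes = codebook.predict(frames)          # one integer codeword index per frame

# The quantized stream (codes) is what a downstream semantic-recognition
# module would consume in place of the raw feature vectors.
print(codes[:20])
```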

Citations: 0
Modeling Chandy–Lamport Distributed Snapshot Algorithm Using Colored Petri Net
IF 1.6 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-06-07 | DOI: 10.1049/2024/6582682
Saeid Pashazadeh, Basheer Zuhair Jaafar Al-Basseer, Jafar Tanha

Distributed global snapshot (DGS) is one of the fundamental protocols in distributed systems. It is used for applications such as collecting information from a distributed system and taking checkpoints for process rollback. The Chandy–Lamport protocol (CLP) is the best-known protocol for taking a DGS. Its main aim was to generate consistent cuts without interrupting the regular operation of the distributed system, and it originated and inspired many later protocols. The first aim of this paper is to propose a novel formal hierarchical parametric colored Petri net model of CLP, in which the number of constituting processes is a parameter. The second aim is to automatically generate a novel message sequence chart (MSC) that shows the detailed steps of each simulation run of the snapshot protocol. The third aim is model checking of the proposed formal model to verify the correctness of CLP and of our colored Petri net model. Such tools greatly help in testing the correct operation of newly proposed distributed snapshot protocols: the proposed model of CLP can easily be used to visually test the correct operation of future DGS protocols under development, and it permits formal verification of their correct operation. The model can serve as a simple, powerful, and visual tool for running CLP step by step, model checking it, and teaching it to postgraduate students. The same approach applies to similarly complicated distributed protocols.
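For readers unfamiliar with the protocol being modeled, the plain-Python simulation below illustrates the textbook Chandy–Lamport marker rule on FIFO channels (not the paper's Petri net model; all class and method names are ours): the initiator records its state and sends markers on its outgoing channels; a process receiving its first marker records its state and starts recording its other incoming channels; a second marker on a channel closes that channel's recording.

```python
from collections import deque

MARKER = "MARKER"

class Node:
    def __init__(self, name, state):
        self.name, self.state = name, state
        self.snapshot = None     # recorded local state
        self.recording = {}      # in-channel -> messages still being captured
        self.captured = {}       # in-channel -> finished channel snapshot

    def initiate(self, net):
        self._record(net)

    def _record(self, net, skip=None):
        self.snapshot = self.state
        for src in net.in_channels(self.name):
            if src == skip:
                self.captured[src] = []   # marker channel is empty by FIFO order
            else:
                self.recording[src] = []
        net.send_markers(self.name)

    def deliver(self, src, msg, net):
        if msg == MARKER:
            if self.snapshot is None:
                self._record(net, skip=src)                    # first marker seen
            else:
                self.captured[src] = self.recording.pop(src, [])  # channel done
        elif src in self.recording:
            self.recording[src].append(msg)                    # in-flight message

class Net:
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.chan = {(a, b): deque() for a in self.nodes
                     for b in self.nodes if a != b}

    def in_channels(self, name):
        return [a for (a, b) in self.chan if b == name]

    def send_markers(self, src):
        for (a, b), buf in self.chan.items():
            if a == src:
                buf.append(MARKER)

    def run(self):
        while any(self.chan.values()):
            for (src, dst), buf in list(self.chan.items()):
                if buf:
                    self.nodes[dst].deliver(src, buf.popleft(), self)

p, q = Node("p", 100), Node("q", 50)
net = Net([p, q])
net.chan[("p", "q")].append(10)   # an application message still in flight
q.initiate(net)                   # q starts the snapshot
net.run()
print(p.snapshot, q.snapshot, q.captured)   # 100 50 {'p': [10]}
```

The in-flight message lands in the channel record rather than in either local state, which is exactly the consistent-cut property that the generated MSCs visualize.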

Citations: 0
Software Defect Prediction Using Deep Q-Learning Network-Based Feature Extraction
IF 1.6 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-05-30 | DOI: 10.1049/2024/3946655
Qinhe Zhang, Jiachen Zhang, Tie Feng, Jialang Xue, Xinxin Zhu, Ningyang Zhu, Zhiheng Li

Machine learning-based software defect prediction (SDP) approaches have been commonly proposed to help deliver high-quality software. Unfortunately, previous research conducted without effective feature reduction suffers from high-dimensional data, leading to unsatisfactory prediction performance. Moreover, without proper feature reduction, the interpretability and generalization ability of machine learning models in SDP may be compromised, hindering their practical utility in diverse software development environments. In this paper, an SDP approach using deep Q-learning network (DQN)-based feature extraction is proposed to eliminate irrelevant, redundant, and noisy features and improve classification performance. In the data preprocessing phase, the undersampling method of BalanceCascade is applied to divide the original datasets. As the first step of feature extraction, the weight ranking of all the metric elements is calculated according to the expected cross-entropy. Then, the relation matrix is constructed by applying random matrix theory. After that, a reward principle is defined for computing the Q value of Q-learning based on the weight ranking, the relation matrix, and the number of errors; according to this principle, a convolutional neural network model is trained on each dataset until a sequence of metric pairs is generated to act as the revised feature set. Various experiments have been conducted on 11 NASA and 11 PROMISE repository datasets. Sensitivity analysis experiments show that binary classification algorithms based on SDP approaches using the DQN-based feature extraction outperform those without it. We also compared our approach with four state-of-the-art approaches on common datasets; our approach is superior in precision, F-measure, area under the receiver operating characteristic curve, and Matthews correlation coefficient.
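As a loose illustration of the first feature-extraction step, the sketch below ranks metrics by an expected cross-entropy score. Binarizing each metric at its median, and every name used here, are our assumptions made to obtain a runnable example; the abstract does not spell out this computation.

```python
import numpy as np

def expected_cross_entropy(feature: np.ndarray, labels: np.ndarray) -> float:
    """Score one metric: ECE = P(t) * sum_c P(c|t) * log(P(c|t) / P(c)),
    where t is the event "metric above its median" (our binarization choice)."""
    present = feature > np.median(feature)
    p_t = present.mean()
    score = 0.0
    for c in np.unique(labels):
        p_c = (labels == c).mean()
        p_c_t = (labels[present] == c).mean()
        if p_c_t > 0:
            score += p_c_t * np.log(p_c_t / p_c)
    return p_t * score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)               # defect labels
X = rng.standard_normal((200, 5))         # five software metrics
X[:, 2] += y                              # metric 2 correlates with defects
scores = [expected_cross_entropy(X[:, j], y) for j in range(5)]
print(np.argsort(scores)[::-1])           # metric 2 should rank first
```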

Citations: 0
Balanced Adversarial Tight Matching for Cross-Project Defect Prediction
IF 1.6 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-05-16 | DOI: 10.1049/2024/1561351
Siyu Jiang, Jiapeng Zhang, Feng Guo, Teng Ouyang, Jing Li

Cross-project defect prediction (CPDP) is an attractive research area in software testing. It identifies defects in projects with limited labeled data (target projects) by utilizing predictive models from data-rich projects (source projects). Existing CPDP methods based on transfer learning mainly rely on the assumption of a unimodal distribution, considering the case where the feature distribution has one obvious peak. In actual situations, however, the feature distribution of project samples often exhibits multiple peaks that cannot be ignored: it manifests as a multimodal distribution, making it challenging to align distributions between different projects. To address this issue, we propose a balanced adversarial tight-matching model for CPDP. Specifically, the method employs multilinear conditioning to obtain the cross-covariance of both features and classifier predictions, capturing the multimodal distribution of the features. Reducing the captured multimodal distribution differences requires pseudo-labels, but pseudo-labels carry uncertainty; we therefore add an auxiliary classifier and generate pseudo-labels with a strategy that has less uncertainty. Finally, the feature generator and the two classifiers undergo adversarial training to align the multimodal distributions of different projects. The method outperforms state-of-the-art CPDP models on the benchmark dataset.
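The multilinear conditioning mentioned here has a standard form in conditional adversarial adaptation: feed the discriminator the flattened outer product of the instance feature and the classifier's softmax output, so that the feature–prediction cross-covariance is what gets aligned. The PyTorch sketch below shows that map alone; shapes and names are illustrative rather than the paper's.

```python
import torch

def multilinear_map(features: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """features: (B, d); logits: (B, k) -> discriminator input (B, d * k)."""
    preds = torch.softmax(logits, dim=1)
    outer = torch.bmm(features.unsqueeze(2), preds.unsqueeze(1))  # (B, d, k)
    return outer.flatten(start_dim=1)

f = torch.randn(8, 64)               # instance features from the feature generator
g = torch.randn(8, 2)                # defect / non-defect logits from a classifier
print(multilinear_map(f, g).shape)   # torch.Size([8, 128])
```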

Citations: 0
An Empirical Study on Downstream Dependency Package Groups in Software Packaging Ecosystems
IF 1.6 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-04-30 | DOI: 10.1049/2024/4488412
Qing Qi, Jian Cao

The role of focal packages in packaging ecosystems is crucial for the development of the entire ecosystem, as they are the packages on which other packages depend. However, the evolution of dependency groups in packaging ecosystems has not been systematically investigated. In this study, we examine the downstream dependency package groups (DDGs) in three typical packaging ecosystems—Cargo for Rust, Comprehensive Perl Archive Network for Perl, and RubyGems for Ruby—to identify their features and evolution. We also identify and analyze a special type of DDG, the collaborative downstream dependency package group (CDDG), which requires shared contributors. Our findings show that the overall development of DDGs, particularly CDDGs, is consistent with the status of the whole ecosystem, and the size of DDGs and CDDGs follows a power law distribution. Furthermore, the interaction mechanisms between focal packages and downstream packages differ between ecosystems, but focal packages always play a leading role in the development of DDGs and CDDGs. Finally, we investigate predictive models for the development of CDDGs in the next stage based on their features, and our results show that random forest and Gradient Boosting Regression Tree achieve acceptable prediction accuracy. We provide the raw data and scripts used for our analysis at https://github.com/onion616/DDG.
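The power-law claim about group sizes can be checked with the standard maximum-likelihood estimator for the exponent. The sketch below applies it to synthetic Pareto-distributed sizes; reproducing the real DDG sizes would require the authors' released data at https://github.com/onion616/DDG.

```python
import numpy as np

def powerlaw_alpha(sizes, xmin=1.0):
    """Continuous MLE for the power-law exponent: alpha = 1 + n / sum(ln(x / xmin))."""
    x = np.asarray([s for s in sizes if s >= xmin], dtype=float)
    return 1.0 + len(x) / np.log(x / xmin).sum()

rng = np.random.default_rng(0)
sizes = rng.pareto(1.5, 10_000) + 1.0     # classical Pareto: density exponent 1.5 + 1 = 2.5
print(round(powerlaw_alpha(sizes), 2))    # recovers ~2.5
```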

Citations: 0
Exploiting DBSCAN and Combination Strategy to Prioritize the Test Suite in Regression Testing
IF 1.6 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-04-04 | DOI: 10.1049/2024/9942959
Zikang Zhang, Jinfu Chen, Yuechao Gu, Zhehao Li, Rexford Nii Ayitey Sosu

Test case prioritization techniques improve the fault detection rate by adjusting the execution sequence of test cases. Existing static black-box prioritization methods generally improve the fault detection rate by increasing the early diversity of execution sequences based on string distance differences. However, such methods have a high time overhead and are less stable. This paper proposes a novel test case prioritization method (DC-TCP) based on density-based spatial clustering of applications with noise (DBSCAN) and a combination strategy. The combination strategy models the inputs to generate a mapping model, so that test inputs are mapped to consistent types to improve generality. The DBSCAN method is then used to refine the classification of test cases, and finally the Firefly search strategy is introduced to improve the effectiveness of sequence merging. Extensive experimental results demonstrate that the proposed DC-TCP method outperforms several existing static black-box methods in terms of the average percentage of faults detected and exhibits advantages in time efficiency.
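A minimal sketch of the clustering idea, assuming a character n-gram embedding of test inputs and a round-robin pass over the resulting DBSCAN clusters to push diversity to the front of the order; the paper's actual mapping model, parameter settings, and Firefly-based merging are not reproduced.

```python
from itertools import zip_longest
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

tests = ["login ok", "login bad password", "logout", "pay by card",
         "pay by wallet", "refund card", "login locked account"]

# Embed test inputs as character n-gram vectors and cluster by cosine distance.
X = TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(tests)
labels = DBSCAN(eps=0.6, min_samples=2, metric="cosine").fit_predict(X)

clusters = {}
for idx, lab in enumerate(labels):        # noise points (label -1) form a group too
    clusters.setdefault(lab, []).append(idx)

# Interleave clusters so the earliest tests cover many behavior groups.
order = [i for group in zip_longest(*clusters.values())
         for i in group if i is not None]
print([tests[i] for i in order])
```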

Citations: 0
An Expository Examination of Temporally Evolving Graph-Based Approaches for the Visual Investigation of Autonomous Driving
IF 1.6 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-20 | DOI: 10.1049/2024/5802816
Li Wan, Wenzhi Cheng

With the continuous advancement of autonomous driving technology, visual analysis techniques have emerged as a prominent research topic. The data generated by autonomous driving is large-scale and time-varying, and existing visual analytics methods are insufficient to deal with such complex data effectively. Time-varying graphs can model and visualize the dynamic relationships in various complex systems and can visually describe the data trends in autonomous driving systems. To this end, this paper introduces a time-varying graph-based method for visual analysis in autonomous driving. The proposed method employs a graph structure to represent the relative positional relationships between the target and obstacle interferences. By incorporating the time dimension, a time-varying graph model is constructed. The method explores the characteristic changes of nodes in the graph at different time instances, establishing feature expressions that differentiate target and obstacle motion patterns. The analysis demonstrates that eigenvector centrality in the time-varying graph effectively captures the distinctions in motion patterns between targets and obstacles. These features can be utilized for accurate target and obstacle recognition, achieving high recognition accuracy. To evaluate the proposed time-varying graph-based visual analytic autopilot method, a comparative study is conducted against traditional visual analytic methods such as frame differencing and advanced methods like visual lidar odometry and mapping. Robustness, accuracy, and resource-consumption experiments are performed using the publicly available KITTI dataset to analyze and compare the three methods. The experimental results show that the proposed time-varying graph-based method exhibits superior accuracy and robustness. This study offers valuable insights and solution ideas for developing deep integration between intelligent networked vehicles and intelligent transportation, and it provides a reference for advancing intelligent transportation systems and their integration with autonomous driving technologies.
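Since the discriminative feature is eigenvector centrality tracked over time, a small sketch makes the pipeline concrete: build one proximity graph per frame and record each object's centrality, yielding a per-object time series. The graph-construction details (the interference radius and the chain edges added to keep every frame connected) are our simplifications, not the paper's.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
series = []
for t in range(5):                                   # five time steps
    pos = rng.random((6, 2)) * 10                    # six objects on a 10 x 10 plane
    g = nx.Graph()
    g.add_edges_from((i, i + 1) for i in range(5))   # keeps the graph connected
    for i in range(6):
        for j in range(i + 1, 6):
            if np.linalg.norm(pos[i] - pos[j]) < 4:  # "interference" radius
                g.add_edge(i, j)
    cent = nx.eigenvector_centrality(g, max_iter=1000)
    series.append([cent[n] for n in range(6)])

print(np.round(series, 2))   # rows: frames; columns: per-object centrality over time
```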

Citations: 0
Cross-Project Defect Prediction Using Transfer Learning with Long Short-Term Memory Networks
IF 1.6 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-18 | DOI: 10.1049/2024/5550801
Hongwei Tao, Lianyou Fu, Qiaoling Cao, Xiaoxu Niu, Haoran Chen, Songtao Shang, Yang Xian

With the increasing number of software projects, within-project defect prediction (WPDP) has become unable to meet demand, and cross-project defect prediction (CPDP) is playing an increasingly significant role in software engineering. Classic CPDP methods mainly concentrated on applying metric features to predict defects. However, these approaches failed to consider the rich semantic information, which usually contains the relationship between software defects and context. Since traditional methods are unable to exploit this characteristic, their performance is often unsatisfactory. In this paper, a transfer long short-term memory (TLSTM) network model is first proposed. Transfer semantic features are extracted by adding a transfer learning algorithm to the long short-term memory (LSTM) network. Then, the traditional metric features and semantic features are combined for CPDP. First, abstract syntax trees (AST) are generated from the source code. Second, the AST node contents are converted into integer vectors as inputs to the TLSTM model, which extracts the semantic features of the program. On the other hand, transferable metric features are extracted by transfer component analysis (TCA). Finally, the semantic and metric features are combined and input into a logistic regression (LR) classifier for training. The presented TLSTM model performs better on the f-measure indicator than other machine and deep learning models, according to the outcomes on several open-source projects from the PROMISE repository. The TLSTM model built with a single feature achieves 0.7% and 2.1% improvement on Log4j-1.2 and Xalan-2.7, respectively. When using combined features to train the prediction model, we call this model a transfer long short-term memory for defect prediction (DPTLSTM); it achieves a 2.9% and 5% improvement on Synapse-1.2 and Xerces-1.4.4, respectively. Both results demonstrate the superiority of the proposed model on the CPDP task, because LSTMs capture long-term dependencies in sequence data and extract features that contain source code structure and context information. It can be concluded that (1) the TLSTM model has the advantage of preserving information, which better retains the semantic features related to software defects, and (2) compared with a CPDP model trained only with traditional metric features, the model's performance can be effectively enhanced by combining semantic and metric features.
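The encoding front end, AST nodes to integer vectors, is easy to picture. The sketch below uses Python's built-in ast module purely for illustration; the PROMISE projects studied in the paper are Java, so the authors' parser and vocabulary necessarily differ.

```python
import ast

source = """
def add(a, b):
    return a + b
"""

# Walk the AST, collect node-type names, and map them to integer IDs.
nodes = [type(n).__name__ for n in ast.walk(ast.parse(source))]
vocab = {name: i + 1 for i, name in enumerate(sorted(set(nodes)))}  # 0 reserved for padding
int_vector = [vocab[name] for name in nodes]   # the (T)LSTM's input sequence
print(nodes)
print(int_vector)
```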

Citations: 0
Design and Efficacy of a Data Lake Architecture for Multimodal Emotion Feature Extraction in Social Media
IF 1.6 | CAS Q4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING | Pub Date: 2024-03-08 | DOI: 10.1049/2024/6819714
Yuanyuan Fan, Xifeng Mi

In the rapidly evolving landscape of social media, the demand for precise sentiment analysis (SA) on multimodal data has become increasingly pivotal. This paper introduces a sophisticated data lake architecture tailored for efficient multimodal emotion feature extraction, addressing the challenges posed by diverse data types. The proposed framework encompasses a robust storage solution and an innovative SA model, multilevel spatial attention fusion (MLSAF), adept at handling text and visual data concurrently. The data lake architecture comprises five layers, facilitating real-time and offline data collection, storage, processing, standardized interface services, and data mining analysis. The MLSAF model, integrated into the data lake architecture, utilizes a novel approach to SA. It employs a text-guided spatial attention mechanism, fusing textual and visual features to discern subtle emotional interplays. The model’s end-to-end learning approach and attention modules contribute to its efficacy in capturing nuanced sentiment expressions. Empirical evaluations on established multimodal sentiment datasets, MVSA-Single and MVSA-Multi, validate the proposed methodology’s effectiveness. Comparative analyses with state-of-the-art models showcase the superior performance of our approach, with an accuracy improvement of 6% on MVSA-Single and 1.6% on MVSA-Multi. This research significantly contributes to optimizing SA in social media data by offering a versatile and potent framework for data management and analysis. The integration of MLSAF with a scalable data lake architecture presents a strategic innovation poised to navigate the evolving complexities of social media data analytics.
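A minimal PyTorch sketch of text-guided spatial attention, the fusion mechanism named above, assuming a single learned projection: a sentence embedding scores every spatial cell of a CNN feature map, and the attended visual vector is the softmax-weighted sum. Dimensions and scaling are our assumptions, not the published MLSAF configuration.

```python
import torch
import torch.nn as nn

class TextGuidedSpatialAttention(nn.Module):
    def __init__(self, text_dim=256, vis_dim=512):
        super().__init__()
        self.query = nn.Linear(text_dim, vis_dim)   # project text into visual space

    def forward(self, text_vec, feat_map):
        # text_vec: (B, text_dim); feat_map: (B, vis_dim, H, W)
        b, c, h, w = feat_map.shape
        cells = feat_map.flatten(2).transpose(1, 2)         # (B, H*W, C) spatial cells
        q = self.query(text_vec).unsqueeze(2)               # (B, C, 1) text query
        attn = torch.softmax(cells @ q / c ** 0.5, dim=1)   # (B, H*W, 1) cell weights
        return (attn * cells).sum(dim=1)                    # (B, C) attended visual vector

attend = TextGuidedSpatialAttention()
out = attend(torch.randn(4, 256), torch.randn(4, 512, 7, 7))
print(out.shape)   # torch.Size([4, 512])
```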

Citations: 0