IEEE Transactions on Big Data: Latest Articles

CLIP2LE: A Label Enhancement Fair Representation Method via CLIP
IF 5.7 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-10-06 | DOI: 10.1109/TBDATA.2025.3618450
Pu Wang;YinSong Xiong;Zhuoran Zheng
Label enhancement is a novel label shift strategy that aims to integrate the feature space with the logical label space to obtain a high-quality label distribution. This label distribution can serve as a soft target for algorithmic learning, akin to label smoothing, thereby enhancing the performance of various learning paradigms, including multi-label learning, single positive multi-label learning, and partial-label learning. However, limited by dataset type and annotation inaccuracy, the same label enhancement algorithm struggles to achieve consistent performance across different datasets, for two reasons: 1) Differential Contribution of Feature Space and Logical Label Space: The feature space and logical label space of different datasets contribute differently to generating an accurate label distribution; 2) Presence of Noise and Incorrect Labels: Some datasets contain noise and inaccurately labeled samples, leading to divergent outputs for similar inputs. To address these challenges, we propose leveraging CLIP (Contrastive Language-Image Pre-training) as a foundational strategy, treating the feature space and the logical label space as two distinct modalities. By recoding these modalities before applying the label enhancement algorithm, we aim to achieve a fair and robust representation. In addition, we further explain the rationale for our motivation in the discussion section. Extensive experimental results demonstrate that our approach helps existing label enhancement algorithms improve their performance on several benchmarks.
IEEE Transactions on Big Data, vol. 12, no. 1, pp. 224-235 (Journal Article)
Citations: 0
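The CLIP2LE abstract above treats the feature space and the logical label space as two modalities to be aligned in the CLIP manner. The abstract does not state the training objective, so the following is only a minimal sketch of the standard CLIP-style symmetric contrastive loss, assuming each sample already has a feature embedding and a label embedding of equal dimension. The function names and the temperature value are hypothetical choices for illustration, not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(feature_emb, label_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    feature_emb[i] and label_emb[i] belong to the same sample (assumed pairing).
    """
    f = [l2_normalize(v) for v in feature_emb]
    g = [l2_normalize(v) for v in label_emb]
    n = len(f)
    # Cosine similarity between every feature and every label, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(f[i], g[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Feature -> label direction: row i should pick column i.
    loss_f2l = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Label -> feature direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_l2f = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_f2l + loss_l2f)
```

Under this objective, correctly paired embeddings yield a loss near zero, while mismatched pairs are penalized, which is the sense in which the two spaces are "recoded" into a shared representation before label enhancement.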
Guest Editorial Special Issue on Federated Learning for Big Data Applications
IF 5.7 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-09-03 | DOI: 10.1109/TBDATA.2024.3417057
Xiaowen Chu;Wei Wang;Cong Wang;Yang Liu;Rongfei Zeng;Christopher G. Brinton
IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2099-2101 (Journal Article). PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11149636
Citations: 0
MuGNet-CMI: Multi-Head Hybrid Graph Neural Network for Predicting circRNA-miRNA Interactions With Global High-Order and Local Low-Order Information
IF 5.7 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-08-29 | DOI: 10.1109/TBDATA.2025.3604175
Chen Jiang;Lei Wang;Changqing Yu;Zhuhong You;Xinfei Wang;Mengmeng Wei;Mianshuo Lu
Circular RNAs (circRNAs) are non-coding RNA molecules that play a crucial role in regulating genes and contributing to disease progression. CircRNAs can function as sponges for microRNAs (miRNAs), thereby regulating gene expression and influencing disease outcomes. Identifying associations between circRNAs and miRNAs through computational methods enhances the understanding of complex disease mechanisms and offers a reliable tool for pre-selecting candidates for experimental validation. Existing models, however, are limited in their ability to capture either global or local node information, so predicting circRNA-miRNA interactions remains challenging. To address this problem, we propose a novel framework for predicting circRNA-miRNA interactions (CMIs), known as MuGNet-CMI, which leverages a multi-head hybrid graph neural network together with global high-order and local low-order information. The model employs the MetaPath2Vec algorithm to generate high-quality node embeddings within the circRNA-miRNA heterogeneous matrix. A multi-head dynamic attention mechanism, combined with GraphSAGE, is incorporated to efficiently capture both global high-order and local low-order node information. Additionally, we integrate neural aggregators into the multi-head dynamic attention mechanism to aggregate feature information from the captured nodes. Validation using three real datasets demonstrates that MuGNet-CMI delivers good performance in predicting CMIs, offering valuable insights to guide experimental research in gene regulation.
IEEE Transactions on Big Data, vol. 12, no. 1, pp. 159-173 (Journal Article)
Citations: 0
Optimal Transport Barycentric Aggregation for Byzantine-Resilient Federated Learning
IF 5.7 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-08-29 | DOI: 10.1109/TBDATA.2025.3604177
K Naveen Kumar;Srinivasa Rao Chalamala;Ajeet Kumar Singh;C Krishna Mohan
Federated learning (FL) has emerged as a promising solution to enable distributed learning without sharing sensitive data. However, FL is vulnerable to data poisoning attacks, where malicious clients inject malicious data during training to compromise the global model. Existing FL defenses suffer from the assumptions of independent and identically distributed (IID) model updates, asymptotic optimal error rate bounds, and strong convexity in the optimization problem. Hence, we propose a novel framework called Federated Learning Optimal Transport (FLOT) that leverages the Wasserstein barycentric technique to obtain a global model from a set of locally trained non-IID models on client devices. In addition, we introduce a loss function-based rejection (LFR) mechanism to suppress malicious updates and a dynamic weighting scheme to optimize the Wasserstein barycentric aggregation function. We provide the theoretical proof of the Byzantine resilience and convergence of FLOT to highlight its efficacy. We evaluate FLOT on four benchmark datasets: GTSRB, KBTS, CIFAR10, and EMNIST. The experimental results underscore the practical significance of FLOT as an effective defense mechanism against data poisoning attacks in FL while maintaining high accuracy and scalability. Also, we observe that FLOT serves as a robust client selection technique under no attack, which demonstrates its effectiveness.
IEEE Transactions on Big Data, vol. 12, no. 1, pp. 174-185 (Journal Article)
Citations: 0
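The FLOT abstract above builds its aggregation on Wasserstein barycenters. How FLOT applies this to model parameters is not described in the abstract, so the sketch below only illustrates the underlying concept in its simplest setting: for one-dimensional empirical distributions with the same number of samples, the 2-Wasserstein barycenter is obtained by averaging the sorted samples (that is, the quantile functions) with the given weights. The function name and the per-client weights are hypothetical.

```python
def wasserstein_barycenter_1d(samples, weights):
    """2-Wasserstein barycenter of 1D empirical distributions.

    samples: one list of floats per client, all of the same length (assumption).
    weights: one non-negative weight per client (need not sum to 1).
    Returns the sorted support points of the barycenter.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all sample lists must have the same length")
    if len(weights) != len(samples):
        raise ValueError("need exactly one weight per sample list")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Sorting turns each sample list into its empirical quantile function.
    quantiles = [sorted(s) for s in samples]
    # Averaging quantile functions gives the barycenter in one dimension.
    return [
        sum(w * q[k] for w, q in zip(weights, quantiles)) / total
        for k in range(n)
    ]
```

Unlike coordinate-wise averaging, the result does not depend on the order in which each client lists its values, which hints at why transport-based aggregation can be attractive when client updates are not identically distributed.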
Optimizing Deduplication Parameters via a Change-Estimation Analytical Model
IF 5.7 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-08-29 | DOI: 10.1109/TBDATA.2025.3604171
Owen Randall;Luke Schultz;Paul Lu
Variable-sized, content-defined deduplication is a technique to find and eliminate redundant chunks of data for efficient data backups, reduced data transfers, and reduced data-storage overheads. For big datasets, especially those with incremental updates over time such as backups and gathered data, deduplication makes data management faster and more efficient. While many existing deduplication systems use default expected chunk lengths such as 4 KB or 8 KB, these defaults are suboptimal. Poorly optimized deduplication systems can significantly increase storage costs and network usage, making large datasets prohibitively expensive to manage. We present the design, implementation, and empirical validation of our Deduplication Change-Estimation Analytical Model (DCAM), which predicts the performance of sliding window-based deduplication parameters on any given dataset, to be used for parameter optimization. Our empirical evaluation includes workloads based on source code (Linux kernel, Kubernetes, TensorFlow), open-research datasets (CORD-19), and articles (Wikipedia). Validated using both our system and the Destor deduplication system, a DCAM-based search finds deduplication parameters that require up to 3.8× less storage relative to a common baseline. DCAM Search optimizes parameters up to 19.8× faster than previously possible, and the sizes of the resulting deduplicated datasets are all within 5.15% of the best results found by searching using actual deduplication.
IEEE Transactions on Big Data, vol. 12, no. 1, pp. 135-146 (Journal Article)
Citations: 0
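To make the parameters discussed in the deduplication abstract above concrete, here is a minimal sketch of sliding-window, content-defined chunking followed by fingerprint-based deduplication. It is not DCAM and not the Destor implementation: the polynomial rolling hash, the parameter names (`window`, `mask_bits`, `min_size`, `max_size`), and their default values are assumptions chosen for illustration. A boundary is declared when the low `mask_bits` bits of the window hash are zero, so the expected chunk length is roughly `min_size + 2**mask_bits` bytes.

```python
import hashlib


def chunk_boundaries(data, window=16, mask_bits=6, min_size=16, max_size=1024):
    """Yield variable-sized chunks of `data` cut at content-defined boundaries."""
    base, mod = 257, (1 << 31) - 1
    mask = (1 << mask_bits) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    h = 0
    start = 0
    for i, byte in enumerate(data):
        if i >= window:
            # Slide the window: drop the oldest byte's contribution.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        size = i + 1 - start
        # Cut when the hash matches the mask, or force a cut at max_size.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]  # trailing chunk, may be shorter than min_size


def dedup_stats(versions, **params):
    """Return (total_bytes, stored_bytes) after deduplicating all versions."""
    store = {}
    total = 0
    for data in versions:
        for chunk in chunk_boundaries(data, **params):
            total += len(chunk)
            # Identical chunks share one fingerprint and are stored once.
            store.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return total, sum(store.values())
```

Because the hash depends only on the bytes inside the window, boundaries tend to realign after a local edit, so unchanged regions of a new version map to chunks that are already stored. Smaller expected chunk lengths usually find more duplicates but create more chunks to index, which is the kind of trade-off a parameter search has to weigh.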
scProGraph: A Cell Bagging Strategy for Cell Type Annotation With Gene Interaction-Aware Explainability
IF 5.7 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-08-29 | DOI: 10.1109/TBDATA.2025.3604169
Xinyuan Li;Yue-Chao Li;Hai-Ru You;Xuequn Shang;Leon Wong;Zhi-An Huang;Zhu-Hong You;Yu-An Huang
The rapid advancement of scRNA-seq has generated massive data for cell type annotation. However, current automated annotation methods remain limited: most approaches separately model either cell-cell similarities or gene-gene relationships, neglecting their synergistic effects, which leads to suboptimal accuracy and poor biological interpretability. To address this, we propose scProGraph, a prototype-guided graph neural network that jointly models cell type classification and functional gene subgraph discovery. By constructing a cell similarity graph and incorporating cell-type prototypes as prior anchors, our method simultaneously optimizes classification boundaries and the interpretability of gene subgraphs. Experiments on seven independent datasets spanning three disease categories demonstrate that scProGraph achieves over 90% accuracy on four datasets and exceeds 80% on six datasets, outperforming state-of-the-art methods. Further analysis reveals that the gene subgraphs extracted by scProGraph for Macrophage, Fibroblast, and Monocyte cover 26.92%, 26.83%, and 22.22% of a protein-protein interaction networks dataset, respectively, validating the biological relevance of the identified gene modules. This study not only provides a high-accuracy tool for single-cell annotation but also opens new avenues for discovering novel biomarkers and regulatory mechanisms through gene relationship mining.
IEEE Transactions on Big Data, vol. 12, no. 1, pp. 147-158 (Journal Article)
Citations: 0
SDEC: Semantic Deep Embedded Clustering
IF 5.7 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-08-28 | DOI: 10.1109/TBDATA.2025.3603433
Mohammad Wali Ur Rahman;Ric Nevarez;Lamia Tasnim Mim;Salim Hariri
The high-dimensional and semantically complex nature of textual big data presents significant challenges for text clustering, frequently leading to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, SDEC uses a semantic refinement stage that takes advantage of the contextual richness of transformer embeddings to further improve a clustering layer with soft cluster assignments and distributional loss. The capabilities of SDEC are demonstrated by extensive testing on five benchmark datasets: AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5. The framework not only outperformed existing methods with a clustering accuracy of 85.7% on AG News and set a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora. These findings highlight the significant improvements in accuracy and semantic comprehension of text data provided by SDEC's advances in unsupervised text clustering.
IEEE Transactions on Big Data, vol. 12, no. 1, pp. 119-134 (Journal Article)
Citations: 0
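The SDEC abstract above combines Mean Squared Error with a Cosine Similarity Loss in the autoencoder's reconstruction objective. The abstract does not say how the two terms are weighted, so the sketch below assumes a simple convex combination controlled by a hypothetical parameter `alpha`; the function names are likewise illustrative.

```python
import math


def mse(x, x_hat):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cosine_similarity_loss(x, x_hat, eps=1e-12):
    """1 - cosine similarity: 0 when the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_h = math.sqrt(sum(b * b for b in x_hat))
    return 1.0 - dot / max(norm_x * norm_h, eps)


def reconstruction_loss(x, x_hat, alpha=0.5):
    """Weighted sum of MSE and cosine loss (the weighting is an assumption)."""
    return alpha * mse(x, x_hat) + (1.0 - alpha) * cosine_similarity_loss(x, x_hat)
```

The two terms penalize different errors: MSE is sensitive to magnitude, while the cosine term only measures direction, which is what similarity search over text embeddings typically relies on. A reconstruction that is a scaled copy of the input has near-zero cosine loss but non-zero MSE.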
Community-Imbalanced Graph Sampling
IF 5.7 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-08-19 | DOI: 10.1109/TBDATA.2025.3600032
Ying Zhao;Genghuai Bai;Yusheng Qiu;Yiwen Liu;Chuhan Zhang;Chi Han;Yitao Wu;Kehua Guo;Jian Zhang;Fangfang Zhou
A community-imbalanced graph is a graph containing multiple communities that differ greatly in node and edge scale. Graph sampling is a widely used graph reduction technique to accelerate graph computations and simplify graph visualizations. However, existing graph sampling algorithms may encounter several problems when maintaining the community structures in a community-imbalanced graph, including the loss of small communities, disconnections between communities, and distortions of the community scale distribution. In this work, a new quality indicator is proposed to determine whether a graph can be regarded as a community-imbalanced graph. A community-imbalanced graph sampling (CIGS) algorithm is proposed to address these sampling problems. Three new evaluation metrics are proposed to assess how well graph sampling maintains community structure. An algorithm performance experiment and a user study are conducted to evaluate the effectiveness of the proposed CIGS.
{"title":"Community-Imbalanced Graph Sampling","authors":"Ying Zhao;Genghuai Bai;Yusheng Qiu;Yiwen Liu;Chuhan Zhang;Chi Han;Yitao Wu;Kehua Guo;Jian Zhang;Fangfang Zhou","doi":"10.1109/TBDATA.2025.3600032","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3600032","url":null,"abstract":"A community-imbalanced graph refers to a graph containing multiple communities with large differences in node and edge scales. Graph sampling is a widely used graph reduction technique to accelerate graph computations and simplify graph visualizations. However, existing graph sampling algorithms may encounter several problems, including the loss of small communities, disconnections between communities, and distortions of community scale distribution, on maintaining the community structures in a community-imbalanced graph. In this work, a new quality indicator is proposed to determine if a graph can be regarded as a community-imbalanced graph. A community-imbalanced graph sampling (CIGS) algorithm is proposed to address the community-imbalanced graph sampling problems. Three new evaluation metrics are proposed to assess the performance of community structure maintenance of graph sampling. An algorithm performance experiment and a user study are conducted to evaluate the effectiveness of the proposed CIGS.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"105-118"},"PeriodicalIF":5.7,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
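The concern raised in the abstract above, that sampling can drop small communities entirely, can be illustrated with a minimal sketch. This is not the CIGS algorithm, whose details are not given here; it is a plain per-community (stratified) node sampler. The function name `stratified_sample`, the `rate` parameter, and the toy community sizes are all assumptions made for illustration, and edges are ignored.

```python
import random

def stratified_sample(communities, rate, seed=0):
    """Sample nodes per community, keeping at least one node from each.

    Hypothetical helper for illustration only; not the CIGS algorithm.
    """
    rng = random.Random(seed)
    sample = {}
    for name, nodes in communities.items():
        # A uniform rate would round a tiny community down to zero nodes,
        # so enforce a floor of one node per community.
        k = max(1, round(len(nodes) * rate))
        sample[name] = rng.sample(sorted(nodes), k)
    return sample

# Toy community-imbalanced node sets (sizes 1000, 100, and 5).
communities = {
    "large": set(range(0, 1000)),
    "medium": set(range(1000, 1100)),
    "small": set(range(1100, 1105)),
}
sample = stratified_sample(communities, rate=0.05)
sizes = {name: len(nodes) for name, nodes in sample.items()}
```

With a 5% rate the smallest community would receive zero nodes under naive proportional rounding; the floor keeps it represented, at the cost of slightly distorting the size distribution.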
TAG: Triple Alignment With Rationale Generation for Knowledge-Based Visual Question Answering
IF 5.7 Zone 3 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-08-19 DOI: 10.1109/TBDATA.2025.3600012
Sihang Cai;Xuan Lin;Wenqiang Xu;Jingtong Wu;Tao Jin;Zhou Zhao;Fei Wu;Jun Yu
Knowledge-based Visual Question Answering (VQA) involves answering questions based not only on the given image but also on external knowledge. Existing methods for knowledge-based VQA fall into two main categories: those that rely on external knowledge bases, and those that use Large Language Models (LLMs) as implicit knowledge engines. However, the former approach relies heavily on the quality of information retrieval, which introduces additional information bias into the entire system, while the latter suffers from extremely high computational cost and the loss of image information. To address these issues, we propose a novel framework called TAG that reformulates knowledge-based VQA as a contrastive learning problem. We propose a triple asymmetric paradigm, which aligns a lightweight text encoder to the image space at an extremely low training cost (0.0152B trainable parameters) and enhances its understanding of semantic granularity. TAG is both computationally efficient and effective, and we evaluate it on the knowledge-based VQA datasets A-OKVQA, OK-VQA, and VCR. The results show that TAG (0.387B) achieves state-of-the-art performance compared with methods using fewer than 1B parameters. TAG also remains competitive with LLM-based methods.
IEEE Transactions on Big Data, vol. 12, no. 1, pp. 47-61. Publication type: Journal Article.
Citations: 0
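The TAG abstract above frames knowledge-based VQA as a contrastive learning problem that aligns text embeddings to the image space. As a rough sketch of what a contrastive alignment objective looks like in general, the block below computes a standard text-to-image InfoNCE loss in plain Python. It is not TAG's actual loss or its triple asymmetric paradigm, which the abstract does not specify; the function names, the `temperature` value, and the toy 2-D embeddings are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(text_emb, image_emb, temperature=0.1):
    """Mean text-to-image InfoNCE loss; row i of each list is a matched pair."""
    total = 0.0
    for i, text in enumerate(text_emb):
        logits = [cosine(text, img) / temperature for img in image_emb]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the matched image for this text.
        total += log_denom - logits[i]
    return total / len(text_emb)

# Toy embeddings (assumed values): two images and two candidate text batches.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # each text is close to its own image
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # each text is close to the other image

loss_aligned = info_nce(aligned_text, images)
loss_swapped = info_nce(swapped_text, images)
```

The loss is small when each text embedding is closest to its paired image and large when the pairing is swapped, which is the signal a text encoder would be trained on in such a setup.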
Can GNNs Learn Link Heuristics? A Concise Review and Evaluation of Link Prediction Methods
IF 5.7 Zone 3 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-08-19 DOI: 10.1109/TBDATA.2025.3600031
Shuming Liang;Yu Ding;Zhidong Li;Bin Liang;Siqi Zhang;Yang Wang;Fang Chen
This paper explores the ability of Graph Neural Networks (GNNs) to learn various forms of information for link prediction, alongside a brief review of existing link prediction methods. Our analysis reveals that GNNs cannot effectively learn structural information related to the number of common neighbors between two nodes, primarily because the neighborhood aggregation scheme relies on set-based pooling. Our extensive experiments also indicate that trainable node embeddings can improve the performance of GNN-based link prediction models. Importantly, we observe that the denser the graph, the greater the improvement. We attribute this to the characteristics of node embeddings: the link state of each link sample can be encoded into the embeddings of the nodes involved in the neighborhood aggregation of the two nodes in that link sample. In denser graphs, every node has more opportunities to participate in the neighborhood aggregation of other nodes and to encode the states of more link samples into its embedding, thus learning better node embeddings for link prediction. Lastly, we demonstrate that the insights gained from our research carry important implications for identifying the limitations of existing link prediction methods, which could guide the future development of more robust algorithms.
IEEE Transactions on Big Data, vol. 12, no. 1, pp. 1-14. Publication type: Journal Article.
Citations: 0
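For readers unfamiliar with the terms in the link prediction abstract above, the sketch below shows the common-neighbors heuristic next to a mean neighborhood aggregation, one form of set-based pooling. It offers one simple intuition only and does not reproduce the paper's analysis: with uniform node features, mean pooling gives every node the same representation even though node pairs differ in their common-neighbor counts. The toy graph, the feature values, and the function names are assumptions.

```python
def common_neighbors(adj, u, v):
    """Number of neighbors shared by nodes u and v (a classic link heuristic)."""
    return len(adj[u] & adj[v])

def mean_aggregate(adj, features, node):
    """Mean of the neighbors' scalar features, a simple set-based pooling."""
    neighbors = adj[node]
    return sum(features[n] for n in neighbors) / len(neighbors)

# Toy undirected graph as an adjacency map (assumed for illustration).
adj = {
    "a": {"c", "d", "e"},
    "b": {"c", "d", "f"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"a"},
    "f": {"b"},
}
# Uniform features, as when nodes carry no distinguishing attributes.
features = {node: 1.0 for node in adj}

cn_ab = common_neighbors(adj, "a", "b")  # a and b share c and d
cn_ef = common_neighbors(adj, "e", "f")  # e and f share no neighbor
aggregated = {node: mean_aggregate(adj, features, node) for node in adj}
```

Here the pairs (a, b) and (e, f) have different common-neighbor counts, yet every node's aggregated value is identical, so a model fed only these pooled values could not tell the two pairs apart.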