
Latest Publications in IEEE Transactions on Big Data

Label Distribution Learning Based on Horizontal and Vertical Mining of Label Correlations
IF 7.2 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-11-30 | DOI: 10.1109/TBDATA.2023.3338023
Yaojin Lin;Yulin Li;Chenxi Wang;Lei Guo;Jinkun Chen
Label distribution learning (LDL) is a novel approach that outputs, for each instance, labels with varying degrees of description. To enhance the performance of LDL algorithms, researchers have developed algorithms that mine label correlations globally, locally, or both. However, existing LDL algorithms for mining local label correlations roughly assume that samples within a cluster share the same label correlations, which may not hold for all samples. Moreover, existing LDL algorithms apply global and local label correlations to the same parameter matrix, which cannot fully exploit their respective advantages. To address these issues, a novel LDL method based on horizontal and vertical mining of label correlations (LDL-HVLC) is proposed in this paper. The method first encodes a unique local influence vector for each sample from the label distributions of its neighbor samples. Then, this vector is appended as additional features to assist in predicting unknown instances, and a penalty term is designed to correct erroneous local influence vectors (horizontal mining). Finally, to capture both local and global label correlations, a new regularization term is constructed to constrain the global label correlations on the output results (vertical mining). Extensive experiments on real datasets demonstrate that the proposed method effectively solves the label distribution problem and outperforms current state-of-the-art methods.
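The key step of the horizontal mining described above is deriving a per-sample local influence vector from the label distributions of neighboring samples. Below is a minimal sketch of that idea, not the authors' code: it assumes plain Euclidean k-nearest neighbors and simple averaging of the neighbors' label distributions, and all function and variable names are our own.

```python
import numpy as np

def local_influence_vectors(X, D, k=5):
    """X: (n, d) features; D: (n, c) label distributions (rows sum to 1).
    Returns an (n, c) matrix whose i-th row averages the label
    distributions of sample i's k nearest neighbors (excluding itself)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # exclude the sample itself
    neighbors = np.argsort(dists, axis=1)[:, :k]
    return D[neighbors].mean(axis=1)           # (n, c)

# Usage: append the influence vectors as extra features before training,
# mirroring how the paper extends them to assist prediction.
X = np.random.rand(100, 20)
D = np.random.dirichlet(np.ones(6), size=100)
X_aug = np.hstack([X, local_influence_vectors(X, D, k=5)])
```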
Citations: 0
Hierarchical Deep Reinforcement Learning for VWAP Strategy Optimization
IF 7.2 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-11-30 | DOI: 10.1109/TBDATA.2023.3338011
Xiaodong Li;Pangjing Wu;Chenxin Zou;Qing Li
Designing algorithmic trading strategies that target the volume-weighted average price (VWAP) for long-duration orders is a critical concern for brokers. Traditional rule-based strategies are explicitly predetermined and lack the adaptability needed to achieve lower transaction costs in dynamic markets. Numerous studies have attempted to minimize transaction costs through reinforcement learning. However, the improvement for long-duration order trading strategies, such as the VWAP strategy, remains limited due to intraday liquidity pattern changes and sparse reward signals. To address this issue, we propose a joint model called Macro-Meta-Micro Trader, which combines deep learning and hierarchical reinforcement learning. This model optimizes parent order allocation and child order execution in the VWAP strategy, thereby reducing transaction costs for long-duration orders. It effectively captures market patterns and executes orders across different temporal scales. Our experiments on stocks listed on the Shanghai Stock Exchange demonstrate that our approach outperforms the best baselines in terms of VWAP slippage, saving up to 2.22 basis points, and verify that further splitting tranches into several subgoals can effectively reduce transaction costs.
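For readers unfamiliar with the evaluation metric, the sketch below computes VWAP slippage in basis points for an executed order against the market VWAP; the sign convention (positive means executing worse than the market VWAP on a buy order) and all names are our assumptions, not taken from the paper.

```python
import numpy as np

def vwap(prices, volumes):
    prices, volumes = np.asarray(prices, float), np.asarray(volumes, float)
    return (prices * volumes).sum() / volumes.sum()

def vwap_slippage_bps(exec_prices, exec_volumes, mkt_prices, mkt_volumes):
    exec_vwap = vwap(exec_prices, exec_volumes)
    mkt_vwap = vwap(mkt_prices, mkt_volumes)
    return (exec_vwap - mkt_vwap) / mkt_vwap * 1e4   # slippage in basis points

# Example: a buy order filled by three child orders, compared against
# intraday market bars.
print(vwap_slippage_bps([10.02, 10.05, 10.01], [300, 500, 200],
                        [10.00, 10.04, 10.02], [8e4, 1.2e5, 9e4]))
```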
Citations: 0
Learning Balanced Bayesian Classifiers From Labeled and Unlabeled Data
IF 7.5 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-11-30 | DOI: 10.1109/TBDATA.2023.3338019
Lu Guo;Limin Wang;Qilong Li;Kuo Li
How to train learners over unbalanced data with asymmetric costs has been recognized as one of the most significant challenges in data mining. The Bayesian network classifier (BNC) provides a powerful probabilistic tool to encode the probabilistic dependencies among random variables in a directed acyclic graph (DAG), but unbalanced data will result in an unbalanced network topology. This leads to a biased estimate of the conditional or joint probability distribution, and ultimately a reduction in classification accuracy. To address this issue, we propose to redefine the information-theoretic metrics to uniformly represent the balanced dependencies between attributes or between attribute values. A heuristic search strategy and a thresholding operation are then introduced to learn refined DAGs from labeled and unlabeled data, respectively. The experimental results on 32 benchmark datasets reveal that the proposed highly scalable algorithm is competitive with or superior to a number of state-of-the-art single and ensemble learners.
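One way to read a "balanced" information-theoretic metric is to weight samples so that every class contributes equally when estimating the dependence between an attribute and the class, regardless of class frequency. The sketch below is only that illustrative reading, not the paper's exact redefinition.

```python
import numpy as np

def balanced_mutual_information(x, y):
    """x: discrete attribute values; y: class labels (both 1-D arrays).
    Mutual information under a class-balanced empirical distribution."""
    x, y = np.asarray(x), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # weight samples so each class carries total weight 1 / n_classes
    w = np.zeros(len(y))
    for c, cnt in zip(classes, counts):
        w[y == c] = 1.0 / (cnt * len(classes))
    mi = 0.0
    for xv in np.unique(x):
        p_x = w[x == xv].sum()
        for c in classes:
            p_xy = w[(x == xv) & (y == c)].sum()
            p_y = w[y == c].sum()          # equals 1 / n_classes by construction
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```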
Citations: 0
FATS: Feature Distribution Analysis-Based Test Selection for Deep Learning Enhancement
IF 7.2 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-11-20 | DOI: 10.1109/TBDATA.2023.3334648
Li Li;Chuanqi Tao;Hongjing Guo;Jingxuan Zhang;Xiaobing Sun
Deep learning has been applied to many applications across different domains. However, the distribution shift between test data and training data is a major factor impacting the quality of deep neural networks (DNNs). To address this issue, existing research mainly focuses on enhancing DNN models by retraining them with labeled test data. However, labeling test data is costly, which seriously reduces the efficiency of DNN testing. To solve this problem, test selection strategically selects a small set of tests to label. Unfortunately, existing test selection methods seldom focus on the data distribution shift. This paper therefore proposes a test selection approach named Feature Distribution Analysis-Based Test Selection (FATS). FATS analyzes the distributions of test data and training data and then adopts learning to rank (a kind of supervised machine learning for ranking tasks) to intelligently combine the analysis results for test selection. We conduct an empirical study on popular datasets and DNN models, and compare FATS with seven test selection methods. Experiment results show that FATS effectively alleviates the impact of distribution shifts and outperforms the compared methods, with average accuracy improvements of 19.6%–69.7% for DNN model enhancement.
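As a rough illustration of distribution-aware test selection, the sketch below scores each unlabeled test sample by its Mahalanobis distance to the training feature distribution and labels the highest-scoring ones. FATS itself fuses several such signals with learning to rank, so this is only the single-signal case under our own assumptions and naming.

```python
import numpy as np

def select_by_shift(train_feats, test_feats, budget):
    """Rank test samples by squared Mahalanobis distance to the training
    feature distribution and return the indices of the `budget` farthest ones."""
    mu = train_feats.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(train_feats, rowvar=False))  # pseudo-inverse for stability
    diff = test_feats - mu
    scores = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.argsort(-scores)[:budget]

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 32))
# test set containing a shifted subpopulation
test = np.vstack([rng.normal(size=(400, 32)), rng.normal(2.0, 1.5, size=(100, 32))])
picked = select_by_shift(train, test, budget=50)   # indices of tests to label
```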
Citations: 0
Graph Structure Aware Contrastive Multi-View Clustering
IF 7.2 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-11-20 | DOI: 10.1109/TBDATA.2023.3334674
Rui Chen;Yongqiang Tang;Xiangrui Cai;Xiaojie Yuan;Wenlong Feng;Wensheng Zhang
Multi-view clustering has become a research hotspot in recent decades because of its effectiveness in heterogeneous data fusion. Although a large number of related studies have been developed, most of them concern only the characteristics of the data themselves and overlook the inherent connections among samples, which keeps them from exploring the structural knowledge of graph space. Moreover, many current works tend to emphasize the compactness of a single cluster without taking the differences between clusters into account. To tackle these two drawbacks, in this article we propose a graph structure aware contrastive multi-view clustering (GCMC) approach. Specifically, we combine a well-designed graph autoencoder with a conventional multi-layer perceptron autoencoder to extract the structural and high-level representations of multi-view data, so that the underlying correlations among samples can be effectively exploited for model learning. The contrastive learning paradigm is then applied to multiple pseudo-label distributions to ensure that positive pairs of pseudo-label representations share complementarity across views while the divergence between negative pairs is sufficiently large. This makes each semantic cluster more discriminative, i.e., jointly satisfying intra-cluster compactness and inter-cluster exclusiveness. Through comprehensive experiments on eight widely known datasets, we show that the proposed approach performs better than state-of-the-art competitors.
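The contrastive step over pseudo-label distributions can be pictured with a generic NT-Xent-style loss in which the two views' distributions for the same sample form the positive pair and all other pairs in the batch act as negatives. The sketch below uses cosine similarity and a single temperature as simplifying assumptions; it is not the paper's exact objective.

```python
import numpy as np

def contrastive_loss(P1, P2, tau=0.5):
    """P1, P2: (n, c) pseudo-label distributions of the same n samples
    from two views; positives sit on the diagonal of the similarity matrix."""
    Z1 = P1 / np.linalg.norm(P1, axis=1, keepdims=True)
    Z2 = P2 / np.linalg.norm(P2, axis=1, keepdims=True)
    sim = Z1 @ Z2.T / tau                       # (n, n) cosine similarities
    sim -= sim.max(axis=1, keepdims=True)       # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # pull positives together

P1 = np.random.dirichlet(np.ones(10), size=64)
P2 = np.random.dirichlet(np.ones(10), size=64)
print(contrastive_loss(P1, P2))
```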
Citations: 0
Don’t Be Misled by Emotion! Disentangle Emotions and Semantics for Cross-Language and Cross-Domain Rumor Detection
IF 7.2 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-11-20 | DOI: 10.1109/TBDATA.2023.3334634
Yu Shi;Xi Zhang;Yuming Shang;Ning Yu
Cross-language and cross-domain rumor detection is a crucial research topic for maintaining a healthy social media environment. Previous studies reveal that the emotions expressed in posts are important features for rumor detection. However, existing studies typically leverage the entangled representation of semantics and emotions, ignoring the fact that different languages and domains have different emotions toward rumors. Therefore, it inevitably leads to a biased adaptation of the features learned from the source to the target language and domain. To address this issue, this paper proposes a novel approach to adapt the knowledge obtained from the source to the target dataset by disentangling the emotional and semantic features of the datasets. Specifically, the proposed method mainly consists of three steps: (1) disentanglement, which encodes rumors into two separate semantic and emotional spaces to prevent emotional interference; (2) adaptation, merging semantics with the emotions from another language and domain for contrastive alignment to ensure effective adaptation; (3) joint training strategy, which enables the above two steps to work in synergy and mutually promote each other. Extensive experimental results demonstrate that the proposed method outperforms state-of-the-art baselines.
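A minimal architectural sketch of the disentanglement idea is given below: two separate encoders map a post embedding into a semantic space and an emotion space, and the classifier consumes the semantic part together with emotion features from the target language and domain. Layer sizes, the fusion by concatenation, and all names are assumptions for illustration only, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DisentangledRumorClassifier(nn.Module):
    def __init__(self, in_dim=768, hid=128, n_classes=2):
        super().__init__()
        self.semantic_enc = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.emotion_enc = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.classifier = nn.Linear(2 * hid, n_classes)

    def forward(self, post_emb, target_emotion_emb):
        sem = self.semantic_enc(post_emb)            # language/domain-shared semantics
        emo = self.emotion_enc(target_emotion_emb)   # emotion taken from the target side
        logits = self.classifier(torch.cat([sem, emo], dim=-1))
        return logits, sem, emo                      # sem/emo can feed alignment losses

model = DisentangledRumorClassifier()
logits, sem, emo = model(torch.randn(8, 768), torch.randn(8, 768))
```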
Citations: 0
AMDECDA: Attention Mechanism Combined With Data Ensemble Strategy for Predicting CircRNA-Disease Association
IF 7.5 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-11-20 | DOI: 10.1109/TBDATA.2023.3334673
Lei Wang;Leon Wong;Zhu-Hong You;De-Shuang Huang
Accumulating evidence from recent research reveals that circRNA is tightly linked to complex human diseases and plays an important regulatory role in disease progression. Identifying disease-associated circRNAs therefore plays a key role in research on disease pathogenesis. In this study, we propose a new model, AMDECDA, for predicting circRNA-disease associations (CDAs) by combining an attention mechanism with a data ensemble strategy. First, we fuse heterogeneous information including the circRNA Gaussian interaction profile (GIP), disease semantics, and the disease GIP, and then use the attention mechanism of the graph attention network (GAT) to focus on the critical information in the data, reasonably allocate resources, and extract essential features. Finally, the ensemble deep RVFL network (edRVFL) is utilized to quickly and accurately predict CDAs in a non-iterative manner using closed-form solutions. In the five-fold cross-validation experiment on the benchmark data set, AMDECDA achieves an accuracy of 93.10% with a sensitivity of 97.56% and an AUC of 0.9235. In comparison with previous models, AMDECDA is highly competitive. Furthermore, 26 of AMDECDA's top 30 predicted unknown CDAs are supported by the related literature. These results indicate that AMDECDA can effectively predict latent CDAs and provide help for further biological wet-lab experiments.
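The Gaussian interaction profile (GIP) kernel mentioned in the abstract can be computed directly from a binary circRNA-disease association matrix. The sketch below follows the common GIP formulation with the bandwidth set from the mean squared profile norm; whether AMDECDA uses exactly this variant is an assumption here, and the names are ours.

```python
import numpy as np

def gip_kernel(A):
    """A: (n, m) binary association matrix; returns the (n, n) GIP
    similarity between the row entities (pass A.T for the column entities)."""
    norms_sq = (A ** 2).sum(axis=1)
    gamma = 1.0 / norms_sq.mean()                          # bandwidth parameter
    # squared Euclidean distances between interaction profiles
    d2 = norms_sq[:, None] + norms_sq[None, :] - 2 * A @ A.T
    return np.exp(-gamma * np.clip(d2, 0, None))

A = (np.random.rand(50, 30) < 0.1).astype(float)   # toy circRNA-disease matrix
K_circ = gip_kernel(A)       # circRNA GIP similarity
K_dis = gip_kernel(A.T)      # disease GIP similarity
```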
Citations: 0
Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records
IF 7.2 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-10-30 | DOI: 10.1109/TBDATA.2023.3328433
Wenrui Li;Xiaoyu Wang;Yuetian Sun;Snezana Milanovic;Mark Kon;Julio Enrique Castrillón-Candás
It has long been recognized that many datasets contain significant levels of missing numerical data. Addressing this problem is a potentially critical prerequisite for applying machine learning methods to such datasets, yet it remains a challenging task. In this article, we apply a recently developed multi-level stochastic optimization approach to the problem of imputation in massive medical records. The approach is based on computational applied mathematics techniques and is highly accurate. In particular, for the Best Linear Unbiased Predictor (BLUP) this multi-level formulation is exact, and is significantly faster and more numerically stable. This permits practical application of Kriging methods to data imputation problems for massive datasets. We test this approach on data from the National Inpatient Sample (NIS) data records, Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality. Numerical results show that the multi-level method significantly outperforms current approaches and is numerically robust. It has superior accuracy compared with methods recommended in the recent report from HCUP, with benchmark tests showing up to 75% reductions in error. Furthermore, the results are also superior to recent state-of-the-art methods such as discriminative deep learning.
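For context, the BLUP used for imputation reduces, in the Gaussian setting, to the conditional mean of the missing entries given the observed ones. The dense solve below is only the textbook baseline that the paper's multilevel formulation accelerates and stabilizes; the interface and names are our assumptions.

```python
import numpy as np

def blup_impute(y, observed, mu, Sigma):
    """y: (d,) record with NaNs at missing positions; observed: (d,) boolean mask;
    mu: (d,) mean vector; Sigma: (d, d) covariance matrix.
    Returns y with missing entries replaced by the BLUP / Gaussian conditional mean."""
    o, m = observed, ~observed
    S_oo = Sigma[np.ix_(o, o)]
    S_mo = Sigma[np.ix_(m, o)]
    y_filled = y.copy()
    y_filled[m] = mu[m] + S_mo @ np.linalg.solve(S_oo, y[o] - mu[o])
    return y_filled
```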
Citations: 0
Mix2SFL: Two-Way Mixup for Scalable, Accurate, and Communication-Efficient Split Federated Learning
IF 7.2 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-10-30 | DOI: 10.1109/TBDATA.2023.3328424
Seungeun Oh;Hyelin Nam;Jihong Park;Praneeth Vepakomma;Ramesh Raskar;Mehdi Bennis;Seong-Lyun Kim
In recent years, split learning (SL) has emerged as a promising distributed learning framework that can utilize Big Data in parallel without privacy leakage while reducing client-side computing resources. In the initial implementation of SL, however, the server serves multiple clients sequentially, incurring high latency. Parallel implementations of SL can alleviate this latency problem, but existing Parallel SL algorithms compromise scalability due to a fundamental structural problem. To this end, our previous works proposed two scalable Parallel SL algorithms, dubbed SGLR and LocFedMix-SL, by solving this fundamental problem of the Parallel SL structure. In this article, we propose a novel Parallel SL framework, coined Mix2SFL, that improves both accuracy and communication efficiency while still ensuring scalability. Mix2SFL first supplies more samples to the server through a manifold mixup of the smashed data uploaded to the server, as in the SmashMix of LocFedMix-SL, then averages the split-layer gradients, as in the GradMix of SGLR, followed by local model aggregation as in SFL. Numerical evaluation corroborates that Mix2SFL achieves improved performance in both accuracy and latency compared to the state-of-the-art SL algorithm with scalability guarantees. Moreover, its convergence speed and privacy guarantees are validated through the experimental results.
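The server-side mixup of smashed data can be sketched as a convex combination of two clients' uploaded activations and of their labels, with a Beta-distributed mixing weight as in standard mixup. Batching details and names below are our assumptions rather than the paper's implementation.

```python
import numpy as np

def mixup_smashed(h_a, y_a, h_b, y_b, alpha=1.0, rng=None):
    """h_a, h_b: (batch, feat) smashed activations from two clients;
    y_a, y_b: matching one-hot label matrices. Returns mixed activations,
    mixed labels, and the sampled mixing coefficient."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    h_mix = lam * h_a + (1.0 - lam) * h_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return h_mix, y_mix, lam
```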
Citations: 0
BL: An Efficient Index for Reachability Queries on Large Graphs
IF 7.2 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-10-25 | DOI: 10.1109/TBDATA.2023.3327215
Changyong Yu;Tianmei Ren;Wenyu Li;Huimin Liu;Haitao Ma;Yuhai Zhao
Reachability queries have important applications in many fields such as social networks, the Semantic Web, and biological information networks. How to improve query efficiency in directed acyclic graphs (DAGs) has long been the central problem in reachability query research. Existing methods either cannot prune enough unreachable pairs or cannot perform well on both index size and query time. In this paper, we propose BL (Bridging Label), a general index framework for reachability queries in large graphs. First, we summarize the relationships between BL and existing label indices. Second, we propose a specific index, named minBL, which avoids redundant labels. Moreover, we propose TFD-minBL and CTFD-minBL, which generate minBL in a single pass under a TFD-based permutation and incrementally, respectively. Finally, we conduct extensive experiments on real and synthetic datasets. The experimental results show that our methods are much faster and incur less storage overhead than existing reachability query methods.
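To illustrate the general prune-then-verify pattern behind label-based reachability indices, the sketch below assigns each vertex a topological rank that serves as a cheap negative filter (u can reach v only if rank[u] < rank[v]) and falls back to a pruned DFS for queries that survive the filter. BL's bridging labels are far more selective; this is only a toy baseline, with all names our own.

```python
from collections import deque

def topological_rank(n, adj):
    """Kahn's algorithm: rank[u] < rank[v] whenever u can reach v in the DAG."""
    indeg = [0] * n
    for u in range(n):
        for v in adj[u]:
            indeg[v] += 1
    q = deque(u for u in range(n) if indeg[u] == 0)
    rank, r = [0] * n, 0
    while q:
        u = q.popleft()
        rank[u], r = r, r + 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return rank

def reachable(u, v, adj, rank):
    if rank[u] >= rank[v]:
        return u == v                      # pruned: u cannot reach a lower-ranked vertex
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for w in adj[x]:
            if w not in seen and rank[w] <= rank[v]:   # skip vertices ranked past v
                seen.add(w)
                stack.append(w)
    return False

adj = [[1, 2], [3], [3], []]               # small DAG: 0 -> {1,2} -> 3
rank = topological_rank(4, adj)
print(reachable(0, 3, adj, rank), reachable(3, 0, adj, rank))   # True False
```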
Citations: 0