
Latest Publications in IEEE Transactions on Big Data

Risk-Constrained Reinforcement Learning With Augmented Lagrangian Multiplier for Portfolio Optimization
IF 5.7, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2025-01-31. DOI: 10.1109/TBDATA.2025.3533905
Bayaraa Enkhsaikhan;Ohyun Jo
We explored the application of risk-averse reinforcement learning (Risk-averse RL) in a Constrained Markov Decision Process (CMDP) to optimize investment portfolios while assessing constraints. An investment portfolio must always be constrained by the risk characteristics set by investors and regulators; hard constraints are therefore necessary for practical portfolio optimization. Moreover, traditional portfolio optimization techniques lack the flexibility to model complex, dynamic financial markets. To address these issues, an Augmented Lagrangian Multiplier (ALM) was employed to enforce constraints on the agent, mitigating the impact of risk in the decision process. The proposed risk-constrained RL algorithm demonstrated no constraint violations during the testing phase and outperformed other Risk-averse RL algorithms, fulfilling our primary goal. This suggests that incorporating a risk-constrained RL technique holds promise for portfolio optimization, particularly for risk-averse investors.
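The augmented-Lagrangian mechanism described above can be illustrated with a toy sketch: maximize reward subject to a risk cap by alternating a penalized action search with a dual (multiplier) update. All names, dynamics, and numbers below are illustrative assumptions, not the paper's algorithm.

```python
# Toy augmented-Lagrangian loop for a risk-capped decision.
# Inequality constraint: risk(a) - risk_limit <= 0.

def augmented_lagrangian_search(reward, risk, risk_limit,
                                actions, rho=10.0, iters=50):
    """Pick the action maximizing reward minus penalty, then update
    the multiplier by dual ascent after each step."""
    lam = 0.0
    best = None
    for _ in range(iters):
        def penalized(a):
            g = risk(a) - risk_limit          # g(a) <= 0 is feasible
            return reward(a) - lam * g - 0.5 * rho * max(0.0, g) ** 2
        best = max(actions, key=penalized)
        g = risk(best) - risk_limit
        lam = max(0.0, lam + rho * g)         # multiplier (dual) update
    return best, lam

# Example: reward grows with exposure, risk (exposure squared) must
# stay below 0.25, so the constrained optimum is an exposure of 0.5.
actions = [i / 100 for i in range(101)]       # exposure grid in [0, 1]
a, lam = augmented_lagrangian_search(lambda x: x,        # reward
                                     lambda x: x * x,    # risk
                                     0.25, actions)
```

The multiplier settles near the shadow price of the risk cap, and the chosen exposure converges to the constraint boundary rather than violating it.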
IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2489-2502.
Citations: 0
Large Language Model-Informed ECG Dual Attention Network for Heart Failure Risk Prediction
IF 7.5, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2025-01-30. DOI: 10.1109/TBDATA.2025.3536922
Chen Chen;Lei Li;Marcel Beetz;Abhirup Banerjee;Ramneek Gupta;Vicente Grau
Heart failure (HF) poses a significant public health challenge, with a rising global mortality rate. Early detection and prevention of HF could significantly reduce its impact. We introduce a novel methodology for predicting HF risk using 12-lead electrocardiograms (ECGs). We present a lightweight dual attention ECG network designed to capture the complex ECG features essential for early HF risk prediction, despite the notable imbalance between low- and high-risk groups. The network incorporates a cross-lead attention module and 12 lead-specific temporal attention modules, focusing on cross-lead interactions and each lead's local dynamics. To further alleviate model overfitting, we leverage a large language model (LLM) with a public ECG-Report dataset for pretraining on an ECG-Report alignment task. The network is then fine-tuned for HF risk prediction using two specific cohorts from the U.K. Biobank study: patients with hypertension (UKB-HYP) and those who have had a myocardial infarction (UKB-MI). The results reveal that LLM-informed pretraining substantially enhances HF risk prediction in these cohorts. The dual attention design improves not only interpretability but also predictive accuracy, outperforming existing competitive methods with C-index scores of 0.6349 for UKB-HYP and 0.5805 for UKB-MI. This demonstrates our method's potential in advancing HF risk assessment with complex clinical ECG data.
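The C-index scores reported above are concordance indices, the standard ranking metric for survival-risk models. A minimal sketch of Harrell's C-index (simplified: right-censoring handled only through the event indicator):

```python
# Harrell's concordance index: the fraction of comparable subject
# pairs where the subject with the earlier observed event also has
# the higher predicted risk score.

def c_index(times, events, scores):
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable if i had an event before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    concordant += 0.5       # ties get half credit
    return concordant / comparable

times  = [2, 4, 5, 7]          # follow-up times
events = [1, 1, 0, 1]          # 1 = event observed, 0 = censored
scores = [0.9, 0.6, 0.4, 0.2]  # predicted risk, perfectly ordered
print(c_index(times, events, scores))  # -> 1.0
```

A value of 0.5 corresponds to random ranking, so the 0.6349 and 0.5805 scores quantify how far above chance each cohort's risk ordering is.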
IEEE Transactions on Big Data, vol. 11, no. 3, pp. 948-960. Open Access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10858425
Citations: 0
Use of Transfer Learning for Affordable In-Context Fake Review Generation
IF 7.5, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2025-01-30. DOI: 10.1109/TBDATA.2025.3536927
Luis Ibañez-Lissen;Lorena González-Manzano;José M. de Fuentes;Manuel Goyanes
Fake content is a noteworthy threat that is countered by assorted means. It is a serious problem for online shopping platforms, whose products can be affected by negative or positive reviews. Artificial intelligence is commonly applied to fake review generation, with transfer learning a promising approach to reducing training requirements. However, the feasibility of generating in-context fake reviews using transfer learning has not been explored yet. This paper analyses the suitability of two transformer models (T5 and BART) for generating realistic in-context fake reviews. Results show that 1) the diversity of generated reviews is comparable to existing works; 2) human-based detection is close to random; 3) only reviews generated with one of the two transformers can be detected, with 38% precision; and 4) 1 h of training and 8 k real reviews are needed to produce realistic fake reviews.
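One common way to quantify the "diversity of generated reviews" mentioned above is a distinct-n score (the paper's exact metric is not specified here, so this is a hedged illustration):

```python
# distinct-n: ratio of unique n-grams to total n-grams across a set of
# generated texts. Higher means less repetitive generation.

def distinct_n(texts, n=2):
    total, unique = 0, set()
    for t in texts:
        tokens = t.lower().split()
        grams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

reviews = ["great product fast shipping",
           "great product would buy again",
           "terrible product slow shipping"]
print(distinct_n(reviews, n=2))  # -> 0.9  (9 unique of 10 bigrams)
```

Comparing such a score between generated and real review corpora is one way to back the claim that generated diversity matches existing work.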
IEEE Transactions on Big Data, vol. 11, no. 3, pp. 976-987. Open Access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10858443
Citations: 0
Hierarchical Multi-Relational Graph Representation Learning for Large-Scale Prediction of Drug-Drug Interactions
IF 7.5, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2025-01-30. DOI: 10.1109/TBDATA.2025.3536924
Mengying Jiang;Guizhong Liu;Yuanchao Su;Weiqiang Jin;Biao Zhao
Most existing methods for predicting drug-drug interactions (DDIs) concentrate predominantly on capturing the explicit relationships among drugs, overlooking the valuable implicit correlations between drug pairs (DPs), which leads to weak predictions. To address this issue, this paper introduces a hierarchical multi-relational graph representation learning (HMGRL) approach. Within the HMGRL framework, we leverage a wealth of drug-related heterogeneous data sources to construct heterogeneous graphs, where nodes represent drugs and edges denote clear, diverse associations. A relational graph convolutional network (RGCN) is employed to capture diverse explicit relationships between drugs from these heterogeneous graphs. Additionally, a multi-view differentiable spectral clustering (MVDSC) module is developed to capture multiple valuable implicit correlations between DPs. Within the MVDSC, we utilize multiple DP features to construct graphs, where nodes represent DPs and edges denote different implicit correlations. Subsequently, multiple DP representations are generated through graph cutting, each emphasizing distinct implicit correlations. The graph-cutting strategy enables our HMGRL to identify strongly connected communities of graphs, thereby reducing the fusion of irrelevant features. By combining every representation view of a DP, we create high-level DP representations for predicting DDIs. Two real-world datasets spanning three distinct tasks are adopted to gauge the efficacy of our HMGRL. Experimental outcomes indicate that HMGRL surpasses several leading-edge methods in performance.
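The graph-cutting step above partitions drug-pair nodes into strongly connected communities. As a simplified stand-in (the actual HMGRL uses differentiable spectral clustering, not this), one can threshold pairwise similarity and extract connected communities with union-find:

```python
# Illustrative approximation of community extraction over drug-pair
# (DP) feature vectors: connect DPs whose cosine similarity exceeds a
# threshold, then read off the connected components.

def communities(features, threshold=0.9):
    n = len(features)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(features[i], features[j]) >= threshold:
                parent[find(i)] = find(j)   # merge the two communities

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

dp_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # toy DP features
print(sorted(map(sorted, communities(dp_feats))))  # -> [[0, 1], [2]]
```

Keeping feature fusion within such communities, rather than across all DPs, is what reduces the mixing of irrelevant features.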
IEEE Transactions on Big Data, vol. 11, no. 3, pp. 961-975.
Citations: 0
TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models
IF 7.5, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2025-01-30. DOI: 10.1109/TBDATA.2025.3536930
Wenqi Shao;Meng Lei;Yutao Hu;Peng Gao;Peng Xu;Kaipeng Zhang;Fanqing Meng;Siyuan Huang;Hongsheng Li;Yu Qiao;Ping Luo
Large Vision-Language Models (LVLMs) have made significant strides in various multimodal tasks. Notably, GPT4V, Claude, Gemini, and others showcase exceptional multimodal capabilities, marked by profound comprehension and reasoning skills. This study introduces a comprehensive and efficient evaluation framework, TinyLVLM-eHub, to assess LVLMs' performance, including that of proprietary models. TinyLVLM-eHub covers six key multimodal capabilities: visual perception, knowledge acquisition, reasoning, commonsense understanding, object hallucination, and embodied intelligence. The benchmark, built on 2.1K image-text pairs, provides a user-friendly and accessible platform for LVLM evaluation. The evaluation employs the ChatGPT Ensemble Evaluation (CEE) method, which aligns better with human evaluation than word-matching approaches. Results reveal that closed-source API models such as GPT4V and GeminiPro-V excel in most capabilities compared to previous open-source LVLMs, though they show some vulnerability to object hallucination. This evaluation underscores areas for LVLM improvement in real-world applications and serves as a foundational assessment for future multimodal advancements.
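The contrast between CEE and word matching can be sketched as follows. The verdict values and the majority-vote aggregation are illustrative assumptions in the spirit of an ensemble of judge prompts, not the paper's exact protocol:

```python
from collections import Counter

def ensemble_verdict(verdicts):
    """Majority vote over per-judge verdicts ('correct' / 'wrong')."""
    counts = Counter(verdicts)
    winner, _ = counts.most_common(1)[0]
    return winner

def word_match(prediction, reference):
    """The naive baseline an LLM-judge ensemble improves on:
    mark correct only if the reference string literally appears."""
    return "correct" if reference.lower() in prediction.lower() else "wrong"

# A paraphrased answer fails word matching, but judge models that
# understand the paraphrase can still accept it by majority vote.
pred, ref = "the image shows a small dog", "puppy"
print(word_match(pred, ref))                              # -> wrong
print(ensemble_verdict(["correct", "correct", "wrong"]))  # -> correct
```

This paraphrase gap is exactly why judge ensembles track human evaluation more closely than string overlap.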
IEEE Transactions on Big Data, vol. 11, no. 3, pp. 933-947.
Citations: 0
CHEAT: A Large-Scale Dataset for Detecting CHatGPT-writtEn AbsTracts
IF 7.5, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2025-01-30. DOI: 10.1109/TBDATA.2025.3536929
Peipeng Yu;Jiahan Chen;Xuan Feng;Zhihua Xia
The powerful capabilities of ChatGPT have caused widespread concern in the academic community. Malicious users could synthesize dummy academic content with ChatGPT, which is extremely harmful to academic rigor and originality. Developing detection algorithms for ChatGPT-written content calls for large-scale datasets. In this paper, we first investigate the possible negative impact of ChatGPT on academia, and present a large-scale CHatGPT-writtEn AbsTract dataset (CHEAT) to support the development of detection algorithms. In particular, the dataset contains 35,304 synthetic abstracts, with $Generation$, $Polish$, and $Fusion$ as prominent representatives. Based on these data, we perform a thorough analysis of existing text synthesis detection algorithms. We show that ChatGPT-written abstracts are detectable by well-trained detectors, while detection difficulty increases as more human guidance is involved.
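To make a dataset organized this way concrete, here is a hypothetical record layout using the three subset names the abstract mentions; the field names are illustrative assumptions, not the actual CHEAT schema:

```python
from dataclasses import dataclass

@dataclass
class AbstractSample:
    text: str
    subset: str           # "Generation", "Polish", or "Fusion"
    machine_written: bool  # detection label

# Toy corpus mixing machine-written and human abstracts.
corpus = [
    AbstractSample("a fully generated abstract", "Generation", True),
    AbstractSample("a human abstract polished by ChatGPT", "Polish", True),
    AbstractSample("an original human-written abstract", "Generation", False),
]

machine = [s for s in corpus if s.machine_written]
print(len(machine))  # -> 2
```

Note how the Polish subset blurs the label boundary: the text is human-originated but machine-edited, which matches the abstract's point that detection gets harder as human guidance increases.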
IEEE Transactions on Big Data, vol. 11, no. 3, pp. 898-906.
Citations: 0
To Write or Not to Write as a Machine? That’s the Question
IF 7.5, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2025-01-30. DOI: 10.1109/TBDATA.2025.3536938
Robiert Sepúlveda-Torres;Iván Martínez-Murillo;Estela Saquete;Elena Lloret;Manuel Palomar
Considering the potential of tools such as ChatGPT or Gemini to generate texts in a way similar to how a human would, having reliable detectors of AI-generated content (AIGC) is vital to combat the misuse and negative consequences surrounding those tools. Most research on AIGC detection has focused on the English language, often overlooking other languages that also have tools capable of generating human-like texts, as is the case for Spanish. This paper proposes a novel multilingual and multi-task approach for detecting machine- versus human-generated text. The first task classifies whether a text is written by a machine or by a human, which is the research objective of this paper. The second task consists of detecting the language of the text. To evaluate the results of our approach, this study works within the scope of the AuTexTification shared task, and we have also collected a separate dataset in Spanish. Experiments carried out in Spanish and English show that our approach is highly competitive with the state of the art and generalizes better, enabling it to detect AI-generated text across multiple domains.
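The two tasks above are typically trained jointly against a weighted sum of per-task losses over a shared encoder. The weights, loss form, and toy probabilities below are illustrative assumptions, not the paper's configuration:

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

def multitask_loss(origin_probs, origin_y, lang_probs, lang_y,
                   w_origin=1.0, w_lang=0.5):
    """Joint objective: machine-vs-human head plus language-ID head."""
    return (w_origin * cross_entropy(origin_probs, origin_y)
            + w_lang * cross_entropy(lang_probs, lang_y))

# Confident, correct predictions on both tasks give a small loss...
loss = multitask_loss([0.9, 0.1], 0, [0.8, 0.2], 0)
# ...while hesitant predictions are penalized more.
worse = multitask_loss([0.6, 0.4], 0, [0.5, 0.5], 0)
```

Sharing the encoder lets the language-ID signal regularize the detection head, which is one plausible reason such a setup generalizes across languages.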
IEEE Transactions on Big Data, vol. 11, no. 3, pp. 1042-1053. Open Access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10858399
Citations: 0
Terrain Scene Generation Using a Lightweight Vector Quantized Generative Adversarial Network
IF 7.5, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS. Pub Date: 2025-01-30. DOI: 10.1109/TBDATA.2025.3536926
Yan Wang;Huiyu Zhou;Xinghui Dong
Natural terrain scene images play an important role in geographical research and applications. However, it is challenging to collect a large set of terrain scene images. Recently, great progress has been made in image generation. Although impressive results can be achieved, the efficiency of state-of-the-art methods, e.g., the Vector Quantized Generative Adversarial Network (VQGAN), is still unsatisfactory. The VQGAN confronts two issues: high space complexity and heavy computational demand. To fulfill the terrain scene generation task efficiently, we first collect a Natural Terrain Scene Data Set (NTSD), which contains 36,672 images divided into 38 classes. We then propose a Lightweight VQGAN (Lit-VQGAN), which uses fewer parameters and has lower computational complexity than the VQGAN. A lightweight super-resolution network is further adopted to speedily derive a high-resolution image from the image that the Lit-VQGAN generates. The Lit-VQGAN can be trained and tested on the NTSD. To our knowledge, neither the NTSD nor the Lit-VQGAN has been exploited before. Experimental results show that the Lit-VQGAN is more efficient and effective than the VQGAN for the image generation task. We attribute these promising results to the lightweight yet effective networks that we design.
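The "vector quantized" part of a VQGAN replaces each encoder output vector with its nearest learned codebook entry. A pure-Python sketch of that core lookup, with a toy hand-picked codebook (the real model learns the codebook jointly with an encoder/decoder and an adversarial loss):

```python
# Nearest-codebook quantization: map each latent vector to the index
# of the closest codebook entry under squared Euclidean distance.

def quantize(vectors, codebook):
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy learned entries
latents  = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]]  # toy encoder outputs
print(quantize(latents, codebook))  # -> [0, 1, 2]
```

Shrinking the codebook and the encoder/decoder around this lookup is one natural route to the parameter and compute savings a "lightweight" VQGAN targets.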
IEEE Transactions on Big Data, vol. 11, no. 3, pp. 988-1000.
Citations: 0
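The vector-quantization bottleneck at the heart of VQGAN-style models such as the Lit-VQGAN above can be illustrated in a few lines: each encoder feature vector is snapped to its nearest codebook entry. The following is a minimal NumPy sketch of that step only; the function name, shapes, and toy codebook are illustrative and not taken from the paper.

```python
# Minimal sketch of the vector-quantization step used in VQGAN-style models:
# each encoder feature vector is replaced by its nearest codebook entry.
import numpy as np

def vector_quantize(features, codebook):
    """Map each row of `features` (N, D) to its nearest row of `codebook` (K, D).

    Returns the quantized features (N, D) and the chosen codebook indices (N,).
    """
    # Squared Euclidean distance between every feature and every code:
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2, computed with broadcasting.
    dists = (
        np.sum(features ** 2, axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + np.sum(codebook ** 2, axis=1)
    )
    indices = np.argmin(dists, axis=1)
    return codebook[indices], indices

# Tiny usage example: 4 features perturbed away from a 3-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
features = codebook[[0, 2, 2, 1]] + 0.01  # small offset keeps nearest code unchanged
quantized, idx = vector_quantize(features, codebook)
print(idx.tolist())  # → [0, 2, 2, 1]
```

In a full model, the gradient through this non-differentiable lookup is usually handled with a straight-through estimator, which the sketch omits.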
Adapt Anything: Tailor Any Image Classifier Across Domains and Categories Using Text-to-Image Diffusion Models
IF 7.5 CAS Tier 3, Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-01-30 DOI: 10.1109/TBDATA.2025.3536933
Weijie Chen;Haoyu Wang;Shicai Yang;Lei Zhang;Wei Wei;Yanning Zhang;Luojun Lin;Di Xie;Yueting Zhuang
We study a novel problem in this paper: whether a modern text-to-image diffusion model can tailor any image classifier across domains and categories. Existing domain adaptation works exploit both source and target data for domain alignment so as to transfer knowledge from the labeled source data to the unlabeled target data. However, with the development of text-to-image diffusion models, we wonder whether high-fidelity synthetic data can serve as a surrogate for real-world source data. In this way, we do not need to collect and annotate source data for each image classification task in a one-for-one manner. Instead, we utilize a single off-the-shelf text-to-image model to synthesize images with labels derived from text prompts, and then leverage them as a bridge to transfer knowledge from the task-agnostic text-to-image generator to the task-oriented image classifier via domain adaptation. Such a one-for-all adaptation paradigm allows us to adapt anything in the world using only one text-to-image generator and any unlabeled target data. Extensive experiments validate the feasibility of this idea, which surprisingly even surpasses state-of-the-art domain adaptation works that use source data collected and annotated in the real world.
{"title":"Adapt Anything: Tailor Any Image Classifier Across Domains and Categories Using Text-to-Image Diffusion Models","authors":"Weijie Chen;Haoyu Wang;Shicai Yang;Lei Zhang;Wei Wei;Yanning Zhang;Luojun Lin;Di Xie;Yueting Zhuang","doi":"10.1109/TBDATA.2025.3536933","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3536933","url":null,"abstract":"We study a novel problem in this paper, that is, if a modern text-to-image diffusion model can tailor any image classifier across domains and categories. Existing domain adaption works exploit both source and target data for domain alignment so as to transfer the knowledge from the labeled source data to the unlabeled target data. However, as the development of text-to-image diffusion models, we wonder if the high-fidelity synthetic data can serve as a surrogate of the source data in real world. In this way, we do not need to collect and annotate the source data for each image classification task in a one-for-one manner. Instead, we utilize only one off-the-shelf text-to-image model to synthesize images with labels derived from text prompts, and then leverage them as a bridge to dig out the knowledge from the task-agnostic text-to-image generator to the task-oriented image classifier via domain adaptation. Such a one-for-all adaptation paradigm allows us to adapt anything in the world using only one text-to-image generator as well as any unlabeled target data. 
Extensive experiments validate the feasibility of this idea, which even surprisingly surpasses the state-of-the-art domain adaptation works using the source data collected and annotated in real world.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1013-1026"},"PeriodicalIF":7.5,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
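The first stage of the pipeline the abstract describes — turning class names into text prompts and letting a text-to-image model produce a labeled surrogate source set — can be sketched as below. This is an illustrative skeleton only: `fake_text_to_image` is a stand-in for an actual diffusion model (e.g., a Stable Diffusion pipeline), and the prompt template and function names are assumptions, not the paper's code.

```python
# Illustrative skeleton of a one-for-all synthetic-source pipeline:
# class names become text prompts, a text-to-image model turns prompts
# into labeled synthetic images that stand in for real source data.
import numpy as np

def fake_text_to_image(prompt, size=(8, 8, 3), seed=0):
    # Stand-in for a real diffusion model; returns a random "image"
    # whose content is seeded by the prompt.
    rng = np.random.default_rng(abs(hash((prompt, seed))) % (2 ** 32))
    return rng.random(size)

def build_synthetic_source_set(class_names, per_class=2):
    """Synthesize a labeled surrogate source set from class-name prompts."""
    images, labels = [], []
    for label, name in enumerate(class_names):
        prompt = f"a photo of a {name}"  # assumed prompt template
        for seed in range(per_class):
            images.append(fake_text_to_image(prompt, seed=seed))
            labels.append(label)
    return np.stack(images), np.array(labels)

images, labels = build_synthetic_source_set(["cat", "dog", "car"])
print(images.shape, labels.tolist())  # (6, 8, 8, 3) [0, 0, 1, 1, 2, 2]
```

The second stage — aligning a classifier trained on this synthetic set with the unlabeled target data — would then apply a standard domain-adaptation method, which the sketch does not cover.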
AugGPT: Leveraging ChatGPT for Text Data Augmentation
IF 7.5 CAS Tier 3, Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-01-30 DOI: 10.1109/TBDATA.2025.3536934
Haixing Dai;Zhengliang Liu;Wenxiong Liao;Xiaoke Huang;Yihan Cao;Zihao Wu;Lin Zhao;Shaochen Xu;Fang Zeng;Wei Liu;Ninghao Liu;Sheng Li;Dajiang Zhu;Hongmin Cai;Lichao Sun;Quanzheng Li;Dinggang Shen;Tianming Liu;Xiang Li
Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in the few-shot learning (FSL) scenario, where data in the target domain is generally much scarcer and of lower quality. A natural and widely used strategy to mitigate such challenges is to perform data augmentation to better capture data invariance and increase the sample size. However, current text data augmentation methods either cannot ensure the correct labeling of the generated data (lacking faithfulness), cannot ensure sufficient diversity in the generated data (lacking compactness), or both. Inspired by the recent success of large language models (LLMs), especially the development of ChatGPT, we propose a text data augmentation approach based on ChatGPT, named “AugGPT”. AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. Experimental results on multiple few-shot text classification tasks show the superior performance of the proposed AugGPT approach over state-of-the-art text data augmentation methods in terms of testing accuracy and the distribution of the augmented samples.
{"title":"AugGPT: Leveraging ChatGPT for Text Data Augmentation","authors":"Haixing Dai;Zhengliang Liu;Wenxiong Liao;Xiaoke Huang;Yihan Cao;Zihao Wu;Lin Zhao;Shaochen Xu;Fang Zeng;Wei Liu;Ninghao Liu;Sheng Li;Dajiang Zhu;Hongmin Cai;Lichao Sun;Quanzheng Li;Dinggang Shen;Tianming Liu;Xiang Li","doi":"10.1109/TBDATA.2025.3536934","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3536934","url":null,"abstract":"Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in the few-shot learning (FSL) scenario, where the data in the target domain is generally much scarcer and of lowered quality. A natural and widely used strategy to mitigate such challenges is to perform data augmentation to better capture data invariance and increase the sample size. However, current text data augmentation methods either can’t ensure the correct labeling of the generated data (lacking faithfulness), or can’t ensure sufficient diversity in the generated data (lacking compactness), or both. Inspired by the recent success of large language models (LLM), especially the development of ChatGPT, we propose a text data augmentation approach based on ChatGPT (named ”AugGPT”). AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. 
Experiment results on multiple few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach over state-of-the-art text data augmentation methods in terms of testing accuracy and distribution of the augmented samples.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"907-918"},"PeriodicalIF":7.5,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
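The augmentation loop the AugGPT abstract describes — rephrasing each labeled training sentence into several variants that inherit the original label — can be sketched as follows. The `paraphrase` function is a placeholder for a ChatGPT API call (e.g., a "rephrase this sentence" prompt); the function names and the template output are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of an AugGPT-style augmentation loop: every labeled
# training sentence is expanded into several label-preserving rephrasings.
def paraphrase(sentence, n_variants=3):
    # Placeholder: a real implementation would query an LLM here and
    # return n_variants genuinely rephrased sentences.
    return [f"{sentence} (variant {i + 1})" for i in range(n_variants)]

def augment_dataset(samples, n_variants=3):
    """Expand (text, label) pairs; each variant inherits the original label."""
    augmented = []
    for text, label in samples:
        augmented.append((text, label))
        for variant in paraphrase(text, n_variants):
            augmented.append((variant, label))
    return augmented

train = [("the battery lasts all day", "positive")]
print(len(augment_dataset(train)))  # → 4  (original + 3 variants)
```

Because every variant keeps its source label, the faithfulness concern raised in the abstract reduces to how well the paraphrasing model preserves the sentence's meaning.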