
Big Data: Latest Articles

DMHANT: DropMessage Hypergraph Attention Network for Information Propagation Prediction.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-23 DOI: 10.1089/big.2023.0131
Qi Ouyang, Hongchang Chen, Shuxin Liu, Liming Pu, Dongdong Ge, Ke Fan

Predicting propagation cascades is crucial for understanding information propagation in social networks. Existing methods typically focus on the structure or order of infected users in a single cascade sequence and ignore the global dependencies between cascades and users, which is insufficient to characterize their dynamic interaction preferences. Moreover, existing methods handle model robustness poorly. To address these issues, we propose a prediction model named DropMessage Hypergraph Attention Networks, which constructs a hypergraph based on the cascade sequence. Specifically, to dynamically obtain user preferences, we divide the diffusion hypergraph into multiple subgraphs according to the time stamps, develop hypergraph attention networks to explicitly learn complete interactions, and adopt a gated fusion strategy to connect them for user cascade prediction. In addition, a new dropping method, DropMessage, is added to increase the robustness of the model. Experimental results on three real-world datasets indicate that the proposed model significantly outperforms the most advanced information propagation prediction models in both MAP@k and Hits@k metrics, and the experiments also show that the model achieves better prediction performance than existing models under data perturbation.
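
The two ranking metrics named in the DMHANT abstract can be made concrete with a minimal sketch. It assumes a setup that the abstract does not spell out: each prediction step has exactly one true next user, and the model returns a ranked list of candidate user IDs. The function names `hits_at_k` and `map_at_k` are hypothetical.

```python
from typing import Sequence


def hits_at_k(ranked: Sequence[Sequence[int]], targets: Sequence[int], k: int) -> float:
    """Fraction of steps whose true next user appears in the top-k predictions."""
    hits = sum(1 for r, t in zip(ranked, targets) if t in r[:k])
    return hits / len(targets)


def map_at_k(ranked: Sequence[Sequence[int]], targets: Sequence[int], k: int) -> float:
    """Mean average precision with one relevant user per step (assumption):
    a step scores 1/rank if the target is within the top k, otherwise 0."""
    total = 0.0
    for r, t in zip(ranked, targets):
        top = list(r[:k])
        if t in top:
            total += 1.0 / (top.index(t) + 1)
    return total / len(targets)


# Toy data: three prediction steps, the true next user is always user 1.
ranked_lists = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
true_users = [1, 1, 1]
```

Under this single-target assumption MAP@k reduces to a truncated mean reciprocal rank, so it rewards placing the true user near the top, while Hits@k only checks membership in the top k.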

Citations: 0
Maximizing Influence in Social Networks Using Combined Local Features and Deep Learning-Based Node Embedding.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-22 DOI: 10.1089/big.2023.0117
Asgarali Bouyer, Hamid Ahmadi Beni, Amin Golzari Oskouei, Alireza Rouhi, Bahman Arasteh, Xiaoyang Liu

The influence maximization problem has several issues, including low infection rates and high time complexity. Many proposed methods are not suitable for large-scale networks due to their time complexity or free parameter usage. To address these challenges, this article proposes a local heuristic called Embedding Technique for Influence Maximization (ETIM) that uses shell decomposition, graph embedding, and reduction, as well as combined local structural features. The algorithm selects candidate nodes based on their connections among network shells and topological features, reducing the search space and computational overhead. It uses a deep learning-based node embedding technique to create a multidimensional vector of candidate nodes and calculates the dependency on spreading for each node based on local topological features. Finally, influential nodes are identified using the results of the previous phases and newly defined local features. The proposed algorithm is evaluated using the independent cascade model, showing its competitiveness and ability to achieve the best performance in terms of solution quality. Compared with the collective influence global algorithm, ETIM is significantly faster and improves the infection rate by an average of 12%.
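
The ETIM abstract evaluates seed sets under the independent cascade model. As a minimal sketch of that evaluation step only (not of ETIM itself), the code below simulates the cascade on a directed graph stored as an adjacency dictionary. A single uniform activation probability `p` is an assumption, and the names `independent_cascade` and `expected_spread` are hypothetical.

```python
import random


def independent_cascade(graph, seeds, p, rng):
    """Run one cascade: each newly activated node gets a single chance
    to activate each inactive out-neighbor with probability p."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def expected_spread(graph, seeds, p, runs=1000, seed=0):
    """Monte Carlo estimate of the mean number of activated nodes."""
    rng = random.Random(seed)
    total = sum(len(independent_cascade(graph, seeds, p, rng)) for _ in range(runs))
    return total / runs


# Toy directed graph: node 0 reaches node 3 through nodes 1 and 2.
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
```

Averaging over many runs matters because a single cascade is random; the infection rate reported for a seed set is normally such an average.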

Citations: 0
Attribute-Based Adaptive Homomorphic Encryption for Big Data Security.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-01 Epub Date: 2021-12-13 DOI: 10.1089/big.2021.0176
R Thenmozhi, S Shridevi, Sachi Nandan Mohanty, Vicente García-Díaz, Deepak Gupta, Prayag Tiwari, Mohammad Shorfuzzaman

Internet usage has increased drastically across the globe, thanks to mobile phone penetration. This extreme Internet usage generates huge volumes of data, in other words, big data. Security and privacy are the main issues to be considered in big data management. Hence, in this article, Attribute-based Adaptive Homomorphic Encryption (AAHE) is developed to enhance the security of big data. In the proposed methodology, Oppositional Based Black Widow Optimization (OBWO) is introduced to select the optimal key parameters by following the AAHE method. By considering an oppositional function, the convergence of Black Widow Optimization (BWO) was enhanced. The proposed methodology comprises three processes, namely, setup, encryption, and decryption. The researchers evaluated the proposed methodology with non-abelian rings and the homomorphism process in ciphertext format. Further, it is also utilized in improving one-way security related to the conjugacy examination issue. Afterward, homomorphic encryption is developed to secure the big data. The study considered two types of big data, adult datasets and anonymous Microsoft web datasets, to validate the proposed methodology. Using performance metrics such as encryption time, decryption time, key size, processing time, downloading time, and uploading time, the proposed method was evaluated and compared against conventional cryptography techniques such as Rivest-Shamir-Adleman (RSA) and Elliptic Curve Cryptography (ECC). Further, the key generation process was also compared against conventional methods such as BWO, Particle Swarm Optimization (PSO), and Firefly Algorithm (FA). The results established that the proposed method is superior to the compared methods and can be applied in real time in the near future.
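
The "oppositional" step mentioned in the abstract can be illustrated with the standard opposition-based learning rule, in which a candidate x inside the bounds [lower, upper] has the opposite point lower + upper - x. Whether OBWO uses exactly this rule is an assumption, as is the minimization setting; the names `opposite` and `opposition_step` are hypothetical.

```python
def opposite(x, lower, upper):
    """Opposite point of x within per-dimension bounds: lower + upper - x."""
    return [lo + hi - xi for xi, lo, hi in zip(x, lower, upper)]


def opposition_step(population, lower, upper, fitness):
    """For each candidate keep the fitter of the candidate and its opposite.
    Assumes a minimization problem (smaller fitness is better)."""
    survivors = []
    for x in population:
        x_opp = opposite(x, lower, upper)
        survivors.append(min(x, x_opp, key=fitness))
    return survivors


def toy_fitness(x):
    """Hypothetical fitness: squared distance to the point (9, 9)."""
    return sum((xi - 9) ** 2 for xi in x)
```

Evaluating each candidate together with its opposite doubles the number of fitness calls per step; the usual motivation is that one of the two points tends to lie closer to the optimum, which can speed up convergence.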

Citations: 0
Hybrid Deep Learning Approach for Traffic Speed Prediction.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-01 Epub Date: 2022-02-02 DOI: 10.1089/big.2021.0251
Fei Dai, Pengfei Cao, Penggui Huang, Qi Mo, Bi Huang

Traffic speed prediction plays a fundamental role in traffic management and driving route planning. However, timely and accurate traffic speed prediction is challenging because it is affected by complex spatial and temporal correlations. Most existing works cannot simultaneously model spatial and temporal correlations in traffic data, resulting in unsatisfactory prediction performance. In this article, we propose a novel hybrid deep learning approach, named HDL4TSP, to predict traffic speed in each region of a city. It consists of an input layer, a spatial layer, a temporal layer, a fusion layer, and an output layer. Specifically, first, the spatial layer employs graph convolutional networks to capture near and distant dependencies in the spatial dimension. Second, the temporal layer employs convolutional long short-term memory (ConvLSTM) networks to model closeness, daily periodicity, and weekly periodicity in the temporal dimension. Third, the fusion layer uses a fusion component to merge the outputs of the ConvLSTM networks. Finally, we conduct extensive experiments, and the results show that HDL4TSP outperforms four baselines on two real-world data sets.
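
The three temporal views named in the abstract (closeness, daily periodicity, weekly periodicity) are commonly built by picking different history steps for the same target step. The sketch below shows one such indexing scheme; the window lengths, the fixed number of steps per day, and the name `temporal_indices` are assumptions, not details taken from the paper.

```python
def temporal_indices(t, steps_per_day, n_close=3, n_day=2, n_week=1):
    """Return the history indices used to predict step t.

    closeness: the n_close steps immediately before t
    daily:     the same time of day on each of the previous n_day days
    weekly:    the same time of day and weekday in the previous n_week weeks
    """
    steps_per_week = 7 * steps_per_day
    closeness = [t - i for i in range(n_close, 0, -1)]
    daily = [t - i * steps_per_day for i in range(n_day, 0, -1)]
    weekly = [t - i * steps_per_week for i in range(n_week, 0, -1)]
    if min(closeness + daily + weekly) < 0:
        raise ValueError("not enough history before step t")
    return closeness, daily, weekly


# Hourly data (24 steps per day), predicting step 200.
close_idx, day_idx, week_idx = temporal_indices(200, steps_per_day=24)
```

Each index list would select a stack of past speed maps, and each stack would feed its own ConvLSTM branch before the fusion layer merges the three outputs.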

Citations: 0
A Weighted GraphSAGE-Based Context-Aware Approach for Big Data Access Control.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-01 Epub Date: 2023-08-01 DOI: 10.1089/big.2021.0473
Dibin Shan, Xuehui Du, Wenjuan Wang, Aodi Liu, Na Wang

Context information is the key element in realizing dynamic access control of big data. However, existing context-aware access control (CAAC) methods do not support automatic context awareness and cannot automatically model and reason about context relationships. To solve these problems, this article proposes a weighted GraphSAGE-based context-aware approach for big data access control. First, graph modeling is performed on the access record data set, transforming the access control context-awareness problem into a graph neural network (GNN) node learning problem. Then, a GNN model, WGraphSAGE, is proposed to achieve automatic context awareness and automatic generation of CAAC rules. Finally, weighted neighbor sampling and weighted aggregation algorithms are designed for the model to realize automatic modeling and reasoning of node relationships and relationship strengths simultaneously in the graph node learning process. The experimental results show that the proposed method has clear advantages in context awareness and context relationship reasoning compared with similar GNN models. Meanwhile, it obtains better results in dynamic access control decisions than the existing CAAC models.
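
The two weighted operations named in the abstract can be sketched in a few lines. This is a simplified illustration, not the WGraphSAGE algorithm itself: it assumes that edge weights are non-negative numbers, that neighbors are sampled with replacement with probability proportional to weight, and that aggregation is a weighted mean. The names `sample_neighbors` and `weighted_mean` are hypothetical.

```python
import random


def sample_neighbors(neighbor_weights, k, rng):
    """Sample k neighbors with replacement, probability proportional to edge weight.

    neighbor_weights: dict mapping neighbor id -> non-negative edge weight
    """
    nodes = list(neighbor_weights)
    weights = [neighbor_weights[n] for n in nodes]
    return rng.choices(nodes, weights=weights, k=k)


def weighted_mean(features, sampled, neighbor_weights):
    """Aggregate the sampled neighbors' feature vectors as a weighted mean."""
    total = sum(neighbor_weights[n] for n in sampled)
    dim = len(features[sampled[0]])
    return [
        sum(neighbor_weights[n] * features[n][d] for n in sampled) / total
        for d in range(dim)
    ]


# Toy neighborhood: neighbor "b" is connected three times as strongly as "a".
toy_weights = {"a": 1.0, "b": 3.0}
toy_features = {"a": [0.0, 0.0], "b": [4.0, 8.0]}
```

In a full GraphSAGE-style layer the aggregated vector would be combined with the node's own features and passed through a learned linear map and nonlinearity; that part is omitted here.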

Citations: 0
Special Issue: Big Scientific Data and Machine Learning in Science and Engineering.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2024-07-31 DOI: 10.1089/big.2024.59218.kpa
Farhad Pourkamali-Anaraki
Citations: 0
A Unified Training Process for Fake News Detection Based on Finetuned Bidirectional Encoder Representation from Transformers Model.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2023-03-22 DOI: 10.1089/big.2022.0050
Vijay Srinivas Tida, Sonya Hsu, Xiali Hei

An efficient fake news detector becomes essential as the accessibility of social media platforms increases rapidly. Previous studies mainly focused on designing models based solely on individual data sets and might suffer from degraded performance. Therefore, developing a robust model for a combined data set with diverse knowledge becomes crucial. However, designing a model with a combined data set requires extensive training time and a sequential workload to obtain optimal performance when no prior knowledge about the model's parameters is available. The present study helps solve these issues by introducing a unified training strategy that derives a base structure for the classifier and all hyperparameters from individual models using a pretrained transformer model. The performance of the proposed model is reported on three publicly available data sets, namely ISOT and others from the Kaggle website. The results indicate that the proposed unified training strategy surpassed existing models such as Random Forests, convolutional neural networks, and long short-term memory, with 97% accuracy and an F1 score of 0.97. Furthermore, removing words shorter than three letters from the input samples reduced training time by a factor of almost 1.5 to 1.8. We also performed an extensive performance analysis by varying the number of encoder blocks to build compact models trained on the combined data set. The obtained results show that reducing the number of encoder blocks lowered performance.
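
The preprocessing step credited with the training-time reduction, dropping words shorter than three letters, can be sketched as follows. Whitespace tokenization and ignoring surrounding punctuation when measuring word length are assumptions, and the name `drop_short_words` is hypothetical.

```python
import string


def drop_short_words(text: str, min_len: int = 3) -> str:
    """Remove whitespace-separated words with fewer than min_len characters.

    Leading and trailing punctuation is ignored when measuring length,
    so "No," counts as a two-letter word.
    """
    kept = [
        word
        for word in text.split()
        if len(word.strip(string.punctuation)) >= min_len
    ]
    return " ".join(kept)


sample = "It is a big win for us all"
cleaned = drop_short_words(sample)
```

Shorter inputs mean fewer tokens per sample for the transformer to process, which is one plausible reason for the reported speedup.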

Citations: 0
A New Filter Approach Based on Effective Ranges for Classification of Gene Expression Data.
IF 2.6 Zone 4, Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2023-09-04 DOI: 10.1089/big.2022.0086
Derya Turfan, Bulent Altunkaynak, Özgür Yeniay

Over the years, many studies have been carried out to reduce and eliminate the effects of diseases on human health. Gene expression data sets play a critical role in diagnosing and treating diseases. These data sets consist of thousands of genes and a small number of samples. This situation creates the curse of dimensionality and makes such data sets difficult to analyze. One of the most effective strategies for solving this problem is feature selection. Feature selection is a preprocessing step that improves classification performance by selecting the most relevant and informative features. In this article, we propose a new statistically based filter method for feature selection named Effective Range-based Feature Selection Algorithm (FSAER). As an extension of the previous Effective Range based Gene Selection (ERGS) and Improved Feature Selection based on Effective Range (IFSER) algorithms, our novel method includes the advantages of both methods while taking into account the disjoint area. To illustrate the efficacy of the proposed algorithm, experiments have been conducted on six benchmark gene expression data sets. The results of FSAER and the other filter methods have been compared in terms of classification accuracy to demonstrate the effectiveness of the proposed method. For classification, support vector machines, the naive Bayes classifier, and the k-nearest neighbor algorithm have been used.
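
The idea behind effective-range filters can be sketched in simplified form: for each feature, build one interval per class around the class mean, then score the feature by how much those intervals overlap. The interval definition used below (mean ± (1 − prior) · γ · standard deviation with γ = 1.732) is an assumption modeled on the ERGS family and may differ from the exact formulas in ERGS, IFSER, or FSAER; in particular, the disjoint-area term that FSAER adds is not modeled. The names `effective_range` and `overlap_score` are hypothetical.

```python
from statistics import mean, pstdev

GAMMA = 1.732  # assumed width factor for the per-class interval


def effective_range(values, prior, gamma=GAMMA):
    """Interval around the class mean; rarer classes get wider intervals."""
    center = mean(values)
    half_width = (1.0 - prior) * gamma * pstdev(values)
    return (center - half_width, center + half_width)


def overlap_score(feature_by_class):
    """Overlap between adjacent class intervals divided by the total span.

    feature_by_class: dict mapping class label -> list of feature values
    Returns a value in [0, 1]; lower means the classes are better separated.
    """
    n = sum(len(v) for v in feature_by_class.values())
    ranges = sorted(
        effective_range(v, len(v) / n) for v in feature_by_class.values()
    )
    overlap = 0.0
    for (_, hi1), (lo2, hi2) in zip(ranges, ranges[1:]):
        overlap += max(0.0, min(hi1, hi2) - lo2)
    span = max(hi for _, hi in ranges) - min(lo for lo, _ in ranges)
    return overlap / span if span > 0 else 1.0


separated = overlap_score({"healthy": [1, 2, 3], "tumor": [101, 102, 103]})
identical = overlap_score({"healthy": [1, 2, 3], "tumor": [1, 2, 3]})
```

A filter method would compute such a score for every gene and keep the genes with the lowest overlap before training a classifier.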

Citations: 0
Hybrid Generalized Regularized Extreme Learning Machine Through Gradient-Based Optimizer Model for Self-Cleansing Nondeposition with Clean Bed Mode of Sediment Transport.
IF 2.6 Zone 4 Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2023-03-07 Pages: 282-298 DOI: 10.1089/big.2022.0120
Enes Gul, Mir Jafar Sadegh Safari

Sediment transport modeling is important for minimizing sedimentation in open channels, which can lead to unexpected operating expenses. From an engineering perspective, accurate models based on the effective variables involved in flow velocity computation could provide a reliable solution for channel design. Furthermore, the validity of sediment transport models is linked to the range of data used for model development. Existing design models were established on limited data ranges. Thus, the present study aimed to utilize all experimental data available in the literature, including recently published datasets that cover an extensive range of hydraulic properties. The extreme learning machine (ELM) algorithm and the generalized regularized extreme learning machine (GRELM) were implemented for the modeling, and particle swarm optimization (PSO) and the gradient-based optimizer (GBO) were then used for the hybridization of ELM and GRELM. The GRELM-PSO and GRELM-GBO findings were compared with the standalone ELM, GRELM, and existing regression models to assess their accuracy. The analysis demonstrated the robustness of the models that incorporate a channel parameter. The poor results of some existing regression models seem to be linked to their disregard of the channel parameter. Statistical analysis of the model outcomes showed that GRELM-GBO outperformed the ELM, GRELM, GRELM-PSO, and regression models, although its margin over the GRELM-PSO counterpart was slight. The mean accuracy of GRELM-GBO was 18.5% better than that of the best regression model. The promising findings of the current study may not only encourage the use of the recommended algorithms for channel design in practice but also further the application of novel ELM-based methods to other environmental problems.
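A minimal sketch of the basic idea behind a regularized ELM, for readers unfamiliar with it: the hidden-layer weights are drawn at random and left fixed, and only the output weights are fitted, here by ridge-regularized least squares. This is a generic textbook formulation, not the authors' GRELM, and it omits the PSO/GBO hybridization entirely. The class name `RidgeELM`, the helper `solve_linear`, the hyperparameter values, and the toy data are all assumptions made for illustration.

```python
import math
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


class RidgeELM:
    """Single-hidden-layer ELM with ridge-regularized output weights (hypothetical helper)."""

    def __init__(self, n_inputs, n_hidden=20, ridge=1e-3, seed=0):
        rng = random.Random(seed)
        # Random input weights and biases; these are never trained.
        self.w = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.ridge = ridge
        self.beta = [0.0] * n_hidden

    def _hidden(self, x):
        # tanh activation of each hidden unit for one input vector x.
        return [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(self.w, self.b)]

    def fit(self, xs, ys):
        h = [self._hidden(x) for x in xs]
        k = len(self.b)
        # Normal equations: (H^T H + ridge * I) beta = H^T y
        gram = [[sum(row[i] * row[j] for row in h)
                 + (self.ridge if i == j else 0.0)
                 for j in range(k)] for i in range(k)]
        rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(k)]
        self.beta = solve_linear(gram, rhs)
        return self

    def predict(self, x):
        return sum(bj * hj for bj, hj in zip(self.beta, self._hidden(x)))


# Toy data (illustrative only, not from the study): y = sin(x) on [0, 3].
xs = [[3.0 * i / 39] for i in range(40)]
ys = [math.sin(x[0]) for x in xs]
model = RidgeELM(n_inputs=1).fit(xs, ys)
```

In a hybrid scheme of the kind the abstract describes, a metaheuristic such as PSO or GBO would typically search over quantities that are fixed here (for example the regularization strength or the random hidden weights); how the authors set this up is not stated in the abstract.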

Citations: 0
Vertical and Horizontal Water Penetration Velocity Modeling in Nonhomogenous Soil Using Fast Multi-Output Relevance Vector Regression.
IF 2.6 Zone 4 Computer Science Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-08-01 Epub Date: 2023-03-14 Pages: 299-311 DOI: 10.1089/big.2022.0125
Babak Vaheddoost, Shervin Rahimzadeh Arashloo, Mir Jafar Sadegh Safari

This study addresses the joint determination of horizontal and vertical water movement through a porous medium using fast multi-output relevance vector regression (FMRVR). To do this, a data set from experiments conducted in a Plexiglas sand box with dimensions of 300 × 300 × 150 mm is used. A random mixture of sand with a grain size of 0.5-1 mm is used to simulate the porous medium. In the experiments, 2, 3, 7, and 12 cm walls are used together with different injection locations at 130.7, 91.3, and 51.8 mm, measured from the cutoff wall at the upstream. Then, the Cartesian coordinates of the tracer, the time interval, the length of the wall in each setup, and two dummy variables for determination of the initial point are considered as independent variables for joint estimation of the horizontal and vertical velocity of water movement in the porous medium. For comparison, multi-linear regression, random forest, and support vector regression are used as alternatives to the FMRVR method. It was concluded that FMRVR outperforms the other models, while the uncertainty in the estimation of horizontal penetration is larger than that of vertical penetration.
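One way to read the predictor set listed in the abstract: with three injection locations, two 0/1 dummy variables are enough if one location serves as the reference level. The sketch below assembles a six-element input row for a two-output target (horizontal and vertical velocity). Which location the authors treated as the reference, the units of the time interval, and the function names `injection_dummies` and `feature_row` are assumptions made for illustration; the abstract does not specify them.

```python
# Injection locations in mm from the upstream cutoff wall, as listed in the abstract.
INJECTION_LOCATIONS_MM = (130.7, 91.3, 51.8)


def injection_dummies(location_mm):
    """Encode three injection locations with two 0/1 dummy variables.

    The first location (130.7 mm) is taken as the reference level (0, 0);
    this choice of reference is an assumption, not stated in the abstract.
    Raises ValueError for a location that is not one of the three.
    """
    idx = INJECTION_LOCATIONS_MM.index(location_mm)
    return (int(idx == 1), int(idx == 2))


def feature_row(x_mm, y_mm, time_interval, wall_cm, location_mm):
    """Build one input row: tracer coordinates, time interval, wall length, two dummies.

    The matching target for a multi-output regressor would be the pair
    (horizontal velocity, vertical velocity).
    """
    d1, d2 = injection_dummies(location_mm)
    return [x_mm, y_mm, time_interval, wall_cm, d1, d2]


# Hypothetical measurement, for illustration only.
row = feature_row(10.0, 20.0, 5.0, 7, 51.8)
```

Using two dummies instead of three avoids a redundant column: the third location is implied when both indicators are zero.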

Citations: 0