Big Data Research: Latest Articles

Incomplete data classification via positive approximation based rough subspaces ensemble
IF 3.5; Zone 3 (Computer Science); Q2 (Computer Science, Artificial Intelligence); Pub Date: 2024-11-14; DOI: 10.1016/j.bdr.2024.100496
Yuanting Yan, Meili Yang, Zhong Zheng, Hao Ge, Yiwen Zhang, Yanping Zhang
Classifying incomplete data using ensemble techniques is a prevalent method for addressing missing values, where multiple classifiers are trained on diverse subsets of features. However, current ensemble-based methods overlook the redundancy within feature subsets, which makes it harder to train robust prediction models because redundant features can hinder the learning of the underlying rules in the data. In this paper, we propose a Reduct-Missing Pattern Fusion (RMPF) method to address this limitation. It leverages both the advantages of rough set theory and the effectiveness of missing patterns in classifying incomplete data. RMPF employs a heuristic algorithm to generate a set of positive approximation-based attribute reducts. It then integrates the missing patterns with these reducts through a fusion strategy to minimize data redundancy. Finally, the optimized subsets are used to train a group of base classifiers, and a selective prediction procedure is applied to produce the ensemble prediction results. Experimental results show that our method is superior to the compared state-of-the-art methods in both performance and robustness, and that its advantage is especially significant on data with high missing rates.
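To make the notion of a "missing pattern" concrete, the following is a minimal Python sketch that groups rows by which features they have observed. It is not the authors' RMPF implementation: the function names are hypothetical, and the `fuse` step (a plain intersection of a pattern with a reduct) is only an assumed stand-in, since the abstract does not describe the actual fusion strategy.

```python
from collections import defaultdict

def missing_pattern(row):
    """Return the indices of the features that are observed (not None) in a row."""
    return tuple(i for i, value in enumerate(row) if value is not None)

def group_by_missing_pattern(rows):
    """Map each missing pattern to the indices of the rows that share it."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[missing_pattern(row)].append(idx)
    return dict(groups)

def fuse(pattern, reduct):
    """ASSUMPTION: keep only the observed features that also belong to the reduct."""
    return tuple(f for f in pattern if f in reduct)

# Toy data: None marks a missing value.
rows = [
    [1.0, None, 3.0],
    [2.0, None, 1.5],
    [0.5, 2.0, None],
]
groups = group_by_missing_pattern(rows)
subset = fuse((0, 2), {0, 1})  # hypothetical reduct {0, 1}
```

Each resulting feature subset could then be used to train one base classifier on the rows that observe all of its features.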
Big Data Research, vol. 38, Article 100496
Citations: 0
Joint embedding in hierarchical distance and semantic representation learning for link prediction
IF 3.5; Zone 3 (Computer Science); Q2 (Computer Science, Artificial Intelligence); Pub Date: 2024-11-13; DOI: 10.1016/j.bdr.2024.100495
Jin Liu, Jianye Chen, Chongfeng Fan, Fengyu Zhou
The link prediction task aims to predict missing entities or relations in a knowledge graph and is essential for downstream applications. Existing well-known models deal with this task mainly by representing knowledge graph triplets in the distance space or the semantic space. However, they cannot fully capture the information of head and tail entities, nor do they make good use of hierarchical level information. Thus, in this paper, we propose a novel knowledge graph embedding model for the link prediction task, namely HIE, which models each triplet (h, r, t) in the distance measurement space and the semantic measurement space simultaneously. Moreover, HIE is introduced into a hierarchical-aware space to leverage the rich hierarchical information of entities and relations for better representation learning. Specifically, we apply a distance transformation operation to the head entity in the distance space to obtain the tail entity, instead of using translation-based or rotation-based approaches. Experimental results on four real-world datasets show that HIE outperforms several existing state-of-the-art knowledge graph embedding methods on the link prediction task and deals with complex relations accurately.
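For context, the translation-based family that the abstract contrasts HIE against can be sketched in a few lines. The Python example below scores a triplet in the style of TransE, where a plausible tail lies close to head + relation. It illustrates the baseline idea only and is not the HIE model; the embeddings and entity names are made up for the example.

```python
import math

def transe_score(head, relation, tail):
    """Translation-based plausibility: negative L2 distance between (head + relation) and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def rank_tails(head, relation, candidates):
    """Sort candidate tail names from most to least plausible."""
    return sorted(
        candidates,
        key=lambda name: transe_score(head, relation, candidates[name]),
        reverse=True,
    )

# Hypothetical 2-dimensional embeddings.
head = [1.0, 0.0]
relation = [0.0, 1.0]
candidates = {"a": [1.0, 1.0], "b": [3.0, 3.0]}
ranking = rank_tails(head, relation, candidates)
```

Link prediction then amounts to ranking all candidate tails (or heads) for a query and reporting where the true entity lands.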
Big Data Research, vol. 38, Article 100495
Citations: 0
Deep semantics-preserving cross-modal hashing
IF 3.5; Zone 3 (Computer Science); Q2 (Computer Science, Artificial Intelligence); Pub Date: 2024-11-07; DOI: 10.1016/j.bdr.2024.100494
Zhihui Lai, Xiaomei Fang, Heng Kong
Cross-modal hashing has received widespread attention in recent years due to its outstanding performance in cross-modal data retrieval. Cross-modal hashing can be decomposed into two steps: feature learning and binarization. However, most existing cross-modal hashing methods do not take the supervisory information of the data into consideration during binary quantization, and thus often fail to adequately preserve semantic information. To solve this problem, this paper proposes a novel deep cross-modal hashing method called deep semantics-preserving cross-modal hashing (DSCMH), which makes full use of intra- and inter-modal semantic information to improve the model's performance. Moreover, by designing a label network for semantic alignment during the binarization process, DSCMH's performance can be further improved. To verify the performance of the proposed method, extensive experiments were conducted on four large datasets. The results show that the proposed method is better than most existing cross-modal hashing methods. In addition, the ablation experiment shows that the proposed regularization terms all have positive effects on the model's performance in cross-modal retrieval. The code of this paper can be downloaded from http://www.scholat.com/laizhihui.
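The binarization step and the retrieval it enables can be illustrated with a small Python sketch. It assumes the feature-learning step has already produced real-valued vectors for both modalities; the sign-based quantizer and the toy vectors are generic illustrations, not DSCMH's label-guided binarization.

```python
def binarize(features):
    """Quantize real-valued features to a +1/-1 hash code with the sign function."""
    return [1 if x >= 0 else -1 for x in features]

def hamming(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

def retrieve(query_code, database):
    """Return database keys sorted by Hamming distance to the query code."""
    return sorted(database, key=lambda key: hamming(query_code, database[key]))

# Hypothetical learned features: one image query, two text items.
image_code = binarize([0.7, -0.2, 0.1, -0.9])
text_codes = {
    "t1": binarize([0.5, -0.4, 0.3, -0.1]),
    "t2": binarize([-0.6, 0.2, 0.3, -0.8]),
}
ranking = retrieve(image_code, text_codes)
```

Because codes from both modalities live in the same Hamming space, an image query can rank text items (and vice versa) with cheap bit comparisons.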
Big Data Research, vol. 38, Article 100494
Citations: 0
Research on the characteristics of information propagation dynamic on the weighted multiplex Weibo networks
IF 3.5; Zone 3 (Computer Science); Q2 (Computer Science, Artificial Intelligence); Pub Date: 2024-09-27; DOI: 10.1016/j.bdr.2024.100493
Yinuo Qian, Fuzhong Nian
To simulate the forwarding of different categories of Weibo posts and discover interesting propagation phenomena in different layers of Weibo networks, this paper proposes retweeting weighted multiplex networks and a propagation model coupled with multi-class Weibo. First, the weighted multiplex social network is constructed by processing Weibo network data. Second, a new information propagation model is established by using the weight and interlayer information of the Weibo multiplex network combined with the coupling factors in the propagation. Finally, the information propagation simulated by the model is compared with real data, so as to summarize the different information propagation phenomena in multiplex social networks. In addition, by comparing the structure of the forwarding weighted multiplex networks constructed from short-time and long-time data, we find that the forwarding weighted multiplex network is self-similar, which demonstrates that the experiment generalizes. This research explores the Weibo social network in depth and opens up a new perspective for studying social media information propagation.
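As a rough illustration of propagation on a weighted multiplex network, the Python sketch below runs an independent-cascade-style simulation in which each layer holds weighted follower edges. It is not the paper's model: the per-layer `coupling` factor, the layer names, and the rule "forward with probability weight times coupling" are assumptions made for the example.

```python
import random

def simulate_forwarding(layers, coupling, seeds, rng):
    """Independent-cascade-style spread over several weighted layers.

    layers:   {layer_name: {user: [(follower, weight), ...]}}, weight in [0, 1]
    coupling: {layer_name: factor in [0, 1]} scaling each layer's weights (ASSUMPTION)
    """
    reached = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for user in frontier:
            for name, edges in layers.items():
                for follower, weight in edges.get(user, []):
                    # A follower forwards with probability weight * coupling[name].
                    if follower not in reached and rng.random() < weight * coupling[name]:
                        reached.add(follower)
                        next_frontier.append(follower)
        frontier = next_frontier
    return reached

# Toy two-layer network with hypothetical topic layers.
layers = {
    "news": {"a": [("b", 1.0)], "b": [("c", 0.0)]},
    "sports": {"b": [("d", 1.0)]},
}
coupling = {"news": 1.0, "sports": 1.0}
reached = simulate_forwarding(layers, coupling, ["a"], random.Random(0))
```

Repeating the simulation over many random seeds and comparing the reach curves with observed retweet counts is one way such a model could be checked against real data.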
Big Data Research, vol. 38, Article 100493
Citations: 0
Leveraging social computing for epidemic surveillance: A case study
IF 3.5; Zone 3 (Computer Science); Q2 (Computer Science, Artificial Intelligence); Pub Date: 2024-08-08; DOI: 10.1016/j.bdr.2024.100483
Bilal Tahir, Muhammad Amir Mehmood

Social media platforms have become a popular source of information for real-time monitoring of events and user behavior. In particular, Twitter provides invaluable information related to diseases and public health for building real-time disease surveillance systems. Effective use of such social media platforms for public health surveillance requires data-driven AI models, which are hindered by the difficult, expensive, and time-consuming task of collecting high-quality, large-scale datasets. In this paper, we build and analyze the Epidemic TweetBank (EpiBank) dataset containing 271 million English tweets related to six epidemic-prone diseases: COVID19, Flu, Hepatitis, Dengue, Malaria, and HIV/AIDS. For this purpose, we develop a tool, ESS-T (Epidemic Surveillance Study via Twitter), which collects tweets according to provided input parameters and keywords. Our tool also assigns locations to tweets with 95% accuracy and analyzes the collected tweets with a focus on temporal distribution, spatial patterns, users, entities, sentiment, and misinformation. Leveraging ESS-T, we build two geo-tagged datasets, EpiBank-global and EpiBank-Pak, containing 86 million tweets from 190 countries and 2.6 million tweets from Pakistan, respectively. Our spatial analysis of EpiBank-global for COVID19, Malaria, and Dengue indicates that our framework correctly identifies high-risk epidemic-prone countries according to World Health Organization (WHO) statistics.
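Keyword-driven collection of the kind described can be sketched as a simple filter. The Python example below tags each tweet with the diseases whose keywords it mentions and keeps only tweets with at least one tag. The keyword lists and function names are hypothetical and are not taken from ESS-T, which the abstract describes only at a high level.

```python
# ASSUMPTION: illustrative keyword lists, not the ones used by ESS-T.
KEYWORDS = {
    "COVID19": ["covid", "coronavirus"],
    "Dengue": ["dengue"],
    "Malaria": ["malaria"],
}

def tag_diseases(text, keywords=KEYWORDS):
    """Return the sorted disease names whose keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(
        disease
        for disease, words in keywords.items()
        if any(word in lowered for word in words)
    )

def collect(tweets, keywords=KEYWORDS):
    """Keep only tweets matching at least one keyword, paired with their disease tags."""
    tagged = ((tweet, tag_diseases(tweet, keywords)) for tweet in tweets)
    return [(tweet, tags) for tweet, tags in tagged if tags]

sample = ["nice weather today", "new covid wave", "Dengue and MALARIA cases rising"]
kept = collect(sample)
```

Substring matching like this is deliberately crude; a production collector would also need tokenization, language filtering, and deduplication.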

Big Data Research, vol. 38, Article 100483
Citations: 0
Anomaly detection based on system text logs of virtual network functions
IF 3.5; Zone 3 (Computer Science); Q2 (Computer Science, Artificial Intelligence); Pub Date: 2024-08-02; DOI: 10.1016/j.bdr.2024.100485
Daniela N. Rim, DongNyeong Heo, Chungjun Lee, Sukhyun Nam, Jae-Hyoung Yoo, James Won-Ki Hong, Heeyoul Choi

In virtual network environments, building secure and effective systems is crucial for correct functioning, and anomaly detection is therefore a core task. To uncover and predict abnormalities in the behavior of a virtual machine, it is desirable to extract relevant information from system text logs. The main issue is that text is unstructured, symbolic data that is very expensive to process. However, recent advances in deep learning have shown remarkable capabilities in handling such data. In this work, we propose using a simple LSTM recurrent network on top of a pre-trained Sentence-BERT, which encodes the system logs into fixed-length vectors. We trained the model in an unsupervised fashion to learn the likelihood of the represented sequences of logs. This way, the model can trigger a warning with an accuracy of 81% when a virtual machine generates an abnormal sequence. Our approach is not only easy to train and computationally cheap, but also generalizes to the content of any input.
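The final decision step, turning a learned sequence likelihood into a warning, can be sketched independently of the neural model. The Python example below assumes some trained model has already assigned a probability to each log line given the preceding ones; the per-step probabilities, the averaging, and the threshold value are all assumptions for illustration, not details from the paper.

```python
import math

def mean_nll(step_probs):
    """Average negative log-likelihood of a log sequence, given per-step model probabilities."""
    return -sum(math.log(p) for p in step_probs) / len(step_probs)

def is_anomalous(step_probs, threshold):
    """Flag the sequence when the model finds it less likely than the threshold allows."""
    return mean_nll(step_probs) > threshold

# Hypothetical model outputs: the second sequence contains one very unlikely log line.
normal = [0.9, 0.8, 0.95]
suspicious = [0.9, 0.01, 0.8]
THRESHOLD = 1.0  # ASSUMPTION: would be tuned on held-out normal sequences
flags = (is_anomalous(normal, THRESHOLD), is_anomalous(suspicious, THRESHOLD))
```

In practice the threshold trades false alarms against missed anomalies, so it is usually chosen from the score distribution on known-normal data.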

Big Data Research, vol. 38, Article 100485
Citations: 0
A dual algorithmic approach to deal with multiclass imbalanced classification problems
IF 3.5; Zone 3 (Computer Science); Q2 (Computer Science, Artificial Intelligence); Pub Date: 2024-08-02; DOI: 10.1016/j.bdr.2024.100484
S. Sridhar, S. Anusuya

Many real-world applications involve multiclass classification problems, and often the data across classes is not evenly distributed. Due to this disproportion, supervised learning models tend to classify instances towards the class with the maximum number of instances, which is a severe issue that needs to be addressed. In multiclass imbalanced data classification, machine learning researchers try to reduce the learning model's bias towards the class with a high sample count. They attempt to reduce this unfairness by balancing the data before the classifier learns it, by modifying the classifier's learning phase to pay more attention to the class with the fewest instances, or by combining both. Existing algorithmic approaches find it difficult to identify a clear boundary between the samples of different classes due to unfair class distribution and overlapping issues. As a result, the minority class recognition rate is poor. A new algorithmic approach is proposed that uses dual decision trees. The first tree creates an induced dataset using a PCA-based grouping approach and by assigning weights to the data samples; the second tree then learns and predicts from the induced dataset. The distinct feature of this algorithmic approach is that it recognizes the data instances without altering their underlying data distribution and is applicable to all categories of multiclass imbalanced datasets. Five multiclass imbalanced datasets from UCI were classified using our proposed algorithm, and the results revealed that the duo-decision tree approach pays better attention to both the minor and major class samples.
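One common way to "pay more attention" to small classes is to weight each sample inversely to its class frequency. The Python sketch below shows that heuristic; it is a generic illustration and not the weighting scheme of the proposed dual decision tree approach, which the abstract does not specify.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each sample by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]

# Toy imbalanced label set: 6 of class "a", 3 of class "b", 1 of class "c".
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
weights = inverse_frequency_weights(labels)
```

With these weights every class contributes the same total weight, so a weighted learner no longer favors the majority class simply because it has more rows.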

Big Data Research, vol. 38, Article 100484
Citations: 0
Multi-step trend aware graph neural network for traffic flow forecasting
IF 3.5; Zone 3 (Computer Science); Q2 (Computer Science, Artificial Intelligence); Pub Date: 2024-07-29; DOI: 10.1016/j.bdr.2024.100482
Lipeng Zhao, Bing Guo, Cheng Dai, Yan Shen, Fei Chen, Mingjie Zhao, Yuchuan Hu

Traffic flow prediction plays an important role in smart cities. Although many neural network models that can predict traffic flow already exist, they still have shortcomings when faced with complex spatio-temporal data. First, although they take local spatio-temporal relations into account, they ignore global information and therefore cannot capture the global trend. Second, although most models construct spatio-temporal graphs for convolution, they ignore the dynamic characteristics of these graphs and therefore cannot capture local fluctuations. Finally, currently popular models need a lot of training time to obtain good prediction results, resulting in high computing cost. To this end, we propose a new model, the Multi-Step Trend Aware Graph Neural Network (MSTAGNN), which considers the influence of global spatio-temporal information and captures the dynamic characteristics of the spatio-temporal graph. It not only accurately captures local fluctuations, but also extracts the global trend and dramatically reduces computing cost. The experimental results show that our proposed model achieves the best results compared with the baselines. In particular, on the PEMSD8 dataset, mean absolute error (MAE) was reduced by 6.25% and the total training time was reduced by 79%. The source code is available at: https://github.com/Vitalitypi/MSTAGNN.
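To clarify how the reported figures are read, the Python sketch below computes mean absolute error and the percentage by which one score improves on another. The flow values and the baseline and model MAE numbers are invented for the example and are not results from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline

# Hypothetical traffic flow readings and predictions (vehicles per interval).
observed = [10, 20, 30]
predicted = [12, 18, 33]
mae = mean_absolute_error(observed, predicted)

# Hypothetical MAE values: a drop from 16.0 to 15.0 is a 6.25% reduction.
improvement = relative_reduction(16.0, 15.0)
```

The same `relative_reduction` helper applies to training time, so a 79% reduction means the new model needs about one fifth of the baseline's time.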

Traffic flow prediction plays an important role in smart cities. Although many neural network models can already predict traffic flow, they still have shortcomings when faced with complex spatio-temporal data. First, they take local spatio-temporal relations into account but ignore global information, so they cannot capture global trends. Second, most models construct spatio-temporal graphs for convolution but ignore the dynamic characteristics of those graphs, so they cannot capture local fluctuations. Finally, currently popular models need long training times to obtain good predictions, resulting in high computing cost. To this end, we propose a new model, the Multi-Step Trend Aware Graph Neural Network (MSTAGNN), which considers the influence of global spatio-temporal information and captures the dynamic characteristics of the spatio-temporal graph. It not only captures local fluctuations accurately but also extracts global trends and dramatically reduces computing cost. Experimental results show that the proposed model achieves the best results among the compared baselines. In particular, on the PEMSD8 dataset, mean absolute error (MAE) is reduced by 6.25% and total training time by 79%. The source code is available at: https://github.com/Vitalitypi/MSTAGNN.
Lipeng Zhao, Bing Guo, Cheng Dai, Yan Shen, Fei Chen, Mingjie Zhao, Yuchuan Hu. Multi-step trend aware graph neural network for traffic flow forecasting. Big Data Research, Volume 38, Article 100482. Pub Date: 2024-07-29. DOI: 10.1016/j.bdr.2024.100482
Citations: 0
Unmasking hate in the pandemic: A cross-platform study of the COVID-19 infodemic
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2024-06-25; DOI: 10.1016/j.bdr.2024.100481
Fatima Zahrah, Jason R.C. Nurse, Michael Goldsmith

The past few decades have established how digital technologies and platforms have provided an effective medium for spreading hateful content, which has been linked to several catastrophic consequences. Recent academic studies have also highlighted how online hate is a phenomenon that strategically makes use of multiple online platforms. In this article, we seek to advance the current research landscape by harnessing a cross-platform approach to computationally analyse content relating to the 2020 COVID-19 pandemic. More specifically, we analyse content on hate-specific environments from Twitter, Reddit, 4chan and Stormfront. Our findings show how content and posting activity can change across platforms, and how the psychological components of online content can differ depending on the platform being used. Through this, we provide unique insight into the cross-platform behaviours of online hate. We further define several avenues for future research within this field so as to gain a more comprehensive understanding of the global hate ecosystem.

Big Data Research, Volume 37, Article 100481. PDF: https://www.sciencedirect.com/science/article/pii/S2214579624000558/pdfft?md5=a8e2330701051448866927c6cb877d10&pid=1-s2.0-S2214579624000558-main.pdf
Citations: 0
Two-dimensional data partitioning for non-negative matrix tri-factorization
IF 3.5; Zone 3 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); Pub Date: 2024-06-19; DOI: 10.1016/j.bdr.2024.100473
Jiaxing Yan, Hai Liu, Zhiqi Lei, Yanghui Rao, Guan Liu, Haoran Xie, Xiaohui Tao, Fu Lee Wang

As a two-sided clustering and dimensionality reduction paradigm, Non-negative Matrix Tri-Factorization (NMTF) has attracted much attention from machine learning and data mining researchers due to its excellent performance and reliable theoretical support. Unlike Non-negative Matrix Factorization (NMF) methods, which apply to one-sided clustering only, NMTF introduces an additional factor matrix and uses the inherent duality of data so that sample clustering and feature clustering reinforce each other, which gives it great advantages in many scenarios (e.g., text co-clustering). However, existing methods for solving NMTF usually involve intensive matrix multiplication with high time and space complexity; that is, they suffer from slow convergence of the multiplicative update rules and high memory overhead. To address these problems, this paper develops a distributed parallel algorithm with a 2-dimensional data partition scheme for NMTF (i.e., PNMTF-2D). Experiments on multiple text datasets show that the proposed PNMTF-2D can substantially improve the computational efficiency of NMTF (e.g., the average iteration time is reduced by up to 99.7% on Amazon) while ensuring the effectiveness of convergence and co-clustering.
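
For readers unfamiliar with the multiplicative update rules the abstract mentions, the following is a minimal single-machine sketch of plain NMTF, which factorizes a non-negative matrix X into F S Gᵀ under a Frobenius-norm loss. It is not the authors' PNMTF-2D algorithm and involves no data partitioning or parallelism; the function name `nmtf`, its parameters, and the random test matrix are all illustrative assumptions.

```python
import numpy as np

def nmtf(X, k, l, n_iter=200, eps=1e-9, seed=0):
    """Plain NMTF sketch: X (n x m) ~ F (n x k) @ S (k x l) @ G.T (l x m).

    Uses the standard unconstrained multiplicative updates for the
    Frobenius loss. Hypothetical helper, not the paper's PNMTF-2D.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random initialization keeps all factors non-negative.
    F = rng.random((n, k)) + eps
    S = rng.random((k, l)) + eps
    G = rng.random((m, l)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update multiplies a factor by (negative gradient part) /
        # (positive gradient part); eps guards against division by zero.
        F *= (X @ G @ S.T) / (F @ (S @ (G.T @ G) @ S.T) + eps)
        G *= (X.T @ F @ S) / (G @ (S.T @ (F.T @ F) @ S) + eps)
        S *= (F.T @ X @ G) / ((F.T @ F) @ S @ (G.T @ G) + eps)
        errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors

# Toy usage on a random non-negative matrix (illustrative data only).
X = np.random.default_rng(1).random((60, 40))
F, S, G, errors = nmtf(X, k=4, l=3)
row_clusters = F.argmax(axis=1)  # cluster label per sample (row)
col_clusters = G.argmax(axis=1)  # cluster label per feature (column)
```

Every iteration is dominated by dense products involving the full matrix X, which is the cost the abstract identifies as the bottleneck and the reason a partitioned, distributed scheme is attractive for large inputs.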

Big Data Research, Volume 37, Article 100473.
Citations: 0