Big Data Research: Latest Articles

Research on the characteristics of information propagation dynamic on the weighted multiplex Weibo networks
IF 3.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-27 | DOI: 10.1016/j.bdr.2024.100493
Yinuo Qian, Fuzhong Nian
To simulate how different categories of Weibo posts are forwarded and to uncover propagation phenomena in different layers of Weibo networks, this paper proposes a weighted multiplex retweeting network and a propagation model coupled with multi-class Weibo. First, the weighted multiplex social network is constructed by processing Weibo network data. Second, a new information propagation model is established from the weights and interlayer information of the Weibo multiplex network, combined with the coupling factors in the propagation. Finally, the information propagation simulated by the model is compared with real data to summarize the different propagation phenomena in the multiplex social network. In addition, by comparing the structures of the forwarding weighted multiplex networks built from short-time and long-time data, we find that the network is self-similar, which indicates that the experiment generalizes. This research explores the Weibo social network in depth and opens up a new perspective for studying information propagation on social media.
Citations: 0
Leveraging social computing for epidemic surveillance: A case study
IF 3.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-08 | DOI: 10.1016/j.bdr.2024.100483
Bilal Tahir, Muhammad Amir Mehmood

Social media platforms have become a popular source of information for real-time monitoring of events and user behavior. In particular, Twitter provides invaluable information related to diseases and public health for building real-time disease surveillance systems. Effective use of such platforms for public health surveillance requires data-driven AI models, which are hindered by the difficult, expensive, and time-consuming task of collecting high-quality, large-scale datasets. In this paper, we build and analyze the Epidemic TweetBank (EpiBank) dataset, which contains 271 million English tweets related to six epidemic-prone diseases: COVID19, Flu, Hepatitis, Dengue, Malaria, and HIV/AIDS. For this purpose, we develop a tool, ESS-T (Epidemic Surveillance Study via Twitter), which collects tweets according to the provided input parameters and keywords. The tool also assigns locations to tweets with 95% accuracy and analyzes the collected tweets with a focus on temporal distribution, spatial patterns, users, entities, sentiment, and misinformation. Leveraging ESS-T, we build two geo-tagged datasets, EpiBank-global and EpiBank-Pak, containing 86 million tweets from 190 countries and 2.6 million tweets from Pakistan, respectively. Our spatial analysis of EpiBank-global for COVID19, Malaria, and Dengue indicates that our framework correctly identifies high-risk epidemic-prone countries according to World Health Organization (WHO) statistics.

Citations: 0
Anomaly detection based on system text logs of virtual network functions
IF 3.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-02 | DOI: 10.1016/j.bdr.2024.100485
Daniela N. Rim, DongNyeong Heo, Chungjun Lee, Sukhyun Nam, Jae-Hyoung Yoo, James Won-Ki Hong, Heeyoul Choi

In virtual network environments, building secure and effective systems is crucial for correct functioning, so anomaly detection is a core task. To uncover and predict abnormalities in the behavior of a virtual machine, it is desirable to extract relevant information from system text logs. The main issue is that text is unstructured, symbolic data that is also very expensive to process. However, recent advances in deep learning have shown remarkable capabilities for handling such data. In this work, we propose using a simple LSTM recurrent network on top of a pre-trained Sentence-BERT, which encodes the system logs into fixed-length vectors. We trained the model in an unsupervised fashion to learn the likelihood of the represented sequences of logs. This way, the model can trigger a warning with an accuracy of 81% when a virtual machine generates an abnormal sequence. Our approach is not only easy to train and computationally cheap but also generalizes to the content of any input.
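
The abstract does not say how a learned sequence likelihood is turned into a warning. One common way, shown here only as a sketch under that assumption, is to flag a sequence when its log-likelihood falls below a threshold taken from a low quantile of scores observed on normal sequences. The function names, the quantile, and the scores are hypothetical; the scores stand in for whatever the sequence model would output.

```python
def fit_threshold(normal_scores, quantile=0.1):
    """Pick the log-likelihood below which a sequence is treated as abnormal.

    Assumption: `normal_scores` are log-likelihoods the sequence model
    assigned to log sequences known to be normal.
    """
    if not normal_scores:
        raise ValueError("need at least one score")
    ordered = sorted(normal_scores)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]


def is_anomalous(score, threshold):
    """Trigger a warning when the sequence is less likely than the threshold."""
    return score < threshold


# Hypothetical log-likelihoods for eleven normal sequences: -1.0 ... -11.0.
normal_scores = [-float(i) for i in range(1, 12)]
threshold = fit_threshold(normal_scores, quantile=0.1)
```

A lower quantile produces fewer warnings on normal traffic at the cost of missing milder anomalies, so in practice it would be tuned on held-out data.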

Citations: 0
A dual algorithmic approach to deal with multiclass imbalanced classification problems
IF 3.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-02 | DOI: 10.1016/j.bdr.2024.100484
S. Sridhar, S. Anusuya

Many real-world applications involve multiclass classification problems, and the data is often not evenly distributed across classes. Because of this disproportion, supervised learning models tend to assign instances to the class with the most instances, which is a severe issue that needs to be addressed. In multiclass imbalanced data classification, machine learning researchers try to reduce the learning model's bias towards the class with a high sample count. They attempt to do so by balancing the data before the classifier learns from it, by modifying the classifier's learning phase to pay more attention to the class with the fewest instances, or by combining both. Existing algorithmic approaches struggle to find a clear boundary between the samples of different classes because of the uneven class distribution and class overlap. As a result, the minority class recognition rate is poor. A new algorithmic approach is proposed that uses dual decision trees. The first tree creates an induced dataset using a PCA-based grouping approach and by assigning weights to the data samples; the second tree then learns and predicts from the induced dataset. The distinct feature of this approach is that it recognizes the data instances without altering their underlying data distribution and is applicable to all categories of multiclass imbalanced datasets. Five multiclass imbalanced datasets from UCI were classified using the proposed algorithm, and the results revealed that the dual decision tree approach pays better attention to both the minority and majority class samples.

Citations: 0
Multi-step trend aware graph neural network for traffic flow forecasting
IF 3.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-29 | DOI: 10.1016/j.bdr.2024.100482
Lipeng Zhao, Bing Guo, Cheng Dai, Yan Shen, Fei Chen, Mingjie Zhao, Yuchuan Hu

Traffic flow prediction plays an important role in smart cities. Although many neural network models can already predict traffic flow, they still have shortcomings when faced with complex spatio-temporal data. First, although they take local spatio-temporal relations into account, they ignore global information and therefore cannot capture the global trend. Second, although most models construct spatio-temporal graphs for convolution, they ignore the dynamic characteristics of those graphs and therefore cannot capture local fluctuations. Finally, currently popular models need long training times to obtain good prediction results, which raises computing cost. To this end, we propose a new model, the Multi-Step Trend Aware Graph Neural Network (MSTAGNN), which considers the influence of global spatio-temporal information and captures the dynamic characteristics of the spatio-temporal graph. It not only captures local fluctuations accurately but also extracts the global trend and dramatically reduces computing cost. Experimental results show that the proposed model achieved the best results compared with the baselines: on the PEMSD8 dataset, mean absolute error (MAE) was reduced by 6.25% and total training time by 79%. The source code is available at: https://github.com/Vitalitypi/MSTAGNN.
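
For readers unfamiliar with the metric, the sketch below shows how MAE and a relative reduction such as the reported 6.25% are computed. The flow values and the two MAE figures are made-up illustrations, not numbers from the paper, and the function names are chosen here.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


# Hypothetical traffic-flow readings and forecasts.
mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])

# Hypothetical baseline and model MAE: dropping from 16.0 to 15.0 is a 6.25% reduction.
reduction = relative_reduction(16.0, 15.0)
```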

Citations: 0
Unmasking hate in the pandemic: A cross-platform study of the COVID-19 infodemic
IF 3.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-06-25 | DOI: 10.1016/j.bdr.2024.100481
Fatima Zahrah, Jason R.C. Nurse, Michael Goldsmith

The past few decades have established how digital technologies and platforms have provided an effective medium for spreading hateful content, which has been linked to several catastrophic consequences. Recent academic studies have also highlighted how online hate is a phenomenon that strategically makes use of multiple online platforms. In this article, we seek to advance the current research landscape by harnessing a cross-platform approach to computationally analyse content relating to the 2020 COVID-19 pandemic. More specifically, we analyse content on hate-specific environments from Twitter, Reddit, 4chan and Stormfront. Our findings show how content and posting activity can change across platforms, and how the psychological components of online content can differ depending on the platform being used. Through this, we provide unique insight into the cross-platform behaviours of online hate. We further define several avenues for future research within this field so as to gain a more comprehensive understanding of the global hate ecosystem.

Citations: 0
Two-dimensional data partitioning for non-negative matrix tri-factorization
IF 3.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-06-19 | DOI: 10.1016/j.bdr.2024.100473
Jiaxing Yan, Hai Liu, Zhiqi Lei, Yanghui Rao, Guan Liu, Haoran Xie, Xiaohui Tao, Fu Lee Wang

As a two-sided clustering and dimensionality reduction paradigm, Non-negative Matrix Tri-Factorization (NMTF) has attracted much attention from machine learning and data mining researchers due to its excellent performance and reliable theoretical support. Unlike Non-negative Matrix Factorization (NMF) methods, which apply to one-sided clustering only, NMTF introduces an additional factor matrix and uses the inherent duality of data so that sample clustering and feature clustering reinforce each other, which gives it great advantages in many scenarios (e.g., text co-clustering). However, existing methods for solving NMTF usually involve intensive matrix multiplication, which incurs high time and space complexity: the multiplicative update rules converge slowly and the memory overhead is high. To solve these problems, this paper develops a distributed parallel algorithm with a 2-dimensional data partition scheme for NMTF (i.e., PNMTF-2D). Experiments on multiple text datasets show that the proposed PNMTF-2D can substantially improve the computational efficiency of NMTF (e.g., the average iteration time is reduced by up to 99.7% on Amazon) while ensuring the effectiveness of convergence and co-clustering.
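
For reference, the standard unconstrained NMTF problem and its usual multiplicative updates can be written as below. The symbols (X, F, S, G and the block indices) are notation chosen here rather than taken from the paper, and the last line only illustrates why a two-dimensional block partition lets the matrix products be computed piecewise; it is not the paper's exact PNMTF-2D scheme.

```latex
% Objective: approximate a non-negative m x n data matrix X by three factors.
\min_{F \ge 0,\; S \ge 0,\; G \ge 0} \; \lVert X - F S G^{\top} \rVert_F^{2},
\qquad X \in \mathbb{R}_{+}^{m \times n},\;
F \in \mathbb{R}_{+}^{m \times k},\;
S \in \mathbb{R}_{+}^{k \times l},\;
G \in \mathbb{R}_{+}^{n \times l}

% Multiplicative updates (\odot and the fraction bar are element-wise).
F \leftarrow F \odot \frac{X G S^{\top}}{F S G^{\top} G S^{\top}}, \qquad
G \leftarrow G \odot \frac{X^{\top} F S}{G S^{\top} F^{\top} F S}, \qquad
S \leftarrow S \odot \frac{F^{\top} X G}{F^{\top} F S G^{\top} G}

% With X split into p x q blocks X_{ij} and G into q row blocks G_j,
% the product X G is assembled from independent block products.
(X G)_{i} = \sum_{j=1}^{q} X_{ij} \, G_{j}, \qquad i = 1, \dots, p
```

Each update needs several dense products involving X, which is where the time and memory cost mentioned in the abstract comes from.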

Citations: 0
Assessment of soil fertility in Xinjiang oasis cotton field based on big data techniques
IF 3.3 | Zone 3, Computer Science | Q1 Business, Management and Accounting | Pub Date: 2024-06-13 | DOI: 10.1016/j.bdr.2024.100480
Peng Wang, Jiang Li, Yingli Wang, Youchun Liu, Yu Zhang

Assessing soil fertility through traditional methods has faced challenges due to the vast amount of meteorological data and the complexity of heterogeneous data. In this study, we address these challenges by employing the K-means algorithm for cluster analysis on soil fertility data and developing a novel K-means algorithm within the Hadoop framework. Our research aims to provide a comprehensive analysis of soil fertility in the Shihezi region, particularly in the context of oasis cotton fields, leveraging big data techniques. The methodology involves utilizing soil nutrient data from 29 sampling points across six round fields in 2022. Through K-means clustering with varying K values, we determine that setting K to 3 yields optimal cluster effects, aligning closely with the actual soil fertility distribution. Furthermore, we compare the performance of our proposed K-means algorithm under the MapReduce framework with traditional serial K-means algorithms, demonstrating significant improvements in operational speed and successful completion of large-scale data computations. Our findings reveal that soil fertility in the Shihezi region can be classified into four distinct grades, providing valuable insights for agricultural practices and land management strategies. This classification contributes to a better understanding of soil resources in oasis cotton fields and facilitates informed decision-making processes for farmers and policymakers alike.
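
The abstract does not describe the authors' modified algorithm, so the following is only a generic sketch of how one K-means iteration decomposes into a map step (assign each sample to its nearest centroid) and a reduce step (average the samples per cluster). It runs in a single Python process and uses no Hadoop API; the function names and the sample values are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map step: emit (cluster_id, point) for every sample."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce step: average the points that share a cluster id."""
    sums = {}
    counts = defaultdict(int)
    for cluster_id, point in pairs:
        previous = sums.get(cluster_id, (0.0,) * len(point))
        sums[cluster_id] = tuple(s + p for s, p in zip(previous, point))
        counts[cluster_id] += 1
    updated = list(centroids)  # an empty cluster keeps its old centroid
    for cluster_id, total in sums.items():
        updated[cluster_id] = tuple(s / counts[cluster_id] for s in total)
    return updated


def kmeans(points, initial_centroids, max_iterations=20):
    """Repeat map and reduce until the centroids stop changing."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical two-feature samples (not data from the study).
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = kmeans(samples, [(1.0, 1.0), (9.0, 9.0)])
```

In an actual MapReduce job the map step would run in parallel over data splits and the framework would group the emitted pairs by cluster id before the reduce step, which is what makes the approach scale beyond a serial implementation.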

Citations: 0
Intelligent geological interpretation of AMT data based on machine learning
IF 3.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-06-13 DOI: 10.1016/j.bdr.2024.100475
Shuo Wang, Xiang Yu, Dan Zhao, Guocai Ma, Wei Ren, Shuxin Duan

The audio-magnetotelluric (AMT) method is widely used to map geological settings associated with sandstone-type uranium deposits, such as the extent of buried sand bodies and the top boundary of the bedrock. However, these settings cannot be read directly from survey sections; they require geological interpretation, which relies heavily on the interpreter's experience and judgment. Moreover, as 3D techniques develop, manual geological interpretation has shown low efficiency and reliability. In this paper, a machine learning model built on U-Net is used for the geological interpretation of AMT data in the Naren-Yihegaole area. To train the model, a training dataset was built from simulated data generated from random models, which addresses the shortage of data samples. In the prediction stage, sand bodies and bedrock were delineated from the inverted resistivity images. The machine learning interpretation was highly consistent with the manual one while taking less time, which indicates that this technology is more individualized and effective than the traditional approach.

Citations: 0
Semi-supervised topic representation through sentiment analysis and semantic networks
IF 3.5 Zone 3 Computer Science Q1 Business, Management and Accounting Pub Date: 2024-06-13 DOI: 10.1016/j.bdr.2024.100474
Marco Ortu, Maurizio Romano, Andrea Carta

This paper proposes a novel approach to topic detection aimed at improving the semi-supervised clustering of customer reviews in the context of customer services. The proposed methodology, named SeMi-supervised clustering for Assessment of Reviews using Topic and Sentiment (SMARTS) for Topic-Community Representation with Semantic Networks, combines semantic and sentiment analysis of words to derive topics related to positive and negative reviews of specific services. To achieve this, a semantic network of words is constructed from the semantic similarity of their word embeddings, identifying relationships between the words used in the reviews. The resulting network is then used to derive the topics present in users' reviews, which are grouped by positive and negative sentiment based on words related to specific services. Clusters of words, obtained from the network's communities, are used to extract topics related to particular services and to improve the interpretation of users' assessments of those services. The methodology is applied to tourism review data from Booking.com, and the results demonstrate that it makes the topics obtained by semi-supervised clustering easier to interpret. The methodology could give service providers and decision-makers useful insight into customer sentiment toward tourism services and help them improve service quality.
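As a rough illustration of the network step described above, the sketch below links words whose embeddings have high cosine similarity and then groups the linked words. It is a toy under stated assumptions, not the SMARTS implementation: the two-dimensional embeddings, the 0.8 threshold, and the function names are hypothetical, and plain connected components stand in for whatever community detection algorithm the authors use.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def build_network(embeddings, threshold):
    """Link every pair of words whose similarity reaches the threshold."""
    graph = {word: set() for word in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def word_groups(graph):
    """Connected components, used here as a simple stand-in for communities."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(group)
    return groups


# Hypothetical toy embeddings; real ones would come from a trained model.
toy = {
    "clean": (1.0, 0.1),
    "tidy": (0.9, 0.2),
    "rude": (0.1, 1.0),
    "unfriendly": (0.2, 0.9),
}
print(word_groups(build_network(toy, threshold=0.8)))
```

Each resulting group of words can then be read as a candidate topic, and splitting the reviews by sentiment before building the network would give separate positive and negative topic sets.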

Citations: 0