Social media platforms have become a popular source of information for real-time monitoring of events and user behavior. In particular, Twitter provides invaluable information related to diseases and public health, which can be used to build real-time disease surveillance systems. Effective use of such platforms for public health surveillance requires data-driven AI models, which are hindered by the difficult, expensive, and time-consuming task of collecting high-quality, large-scale datasets. In this paper, we build and analyze the Epidemic TweetBank (EpiBank) dataset, containing 271 million English tweets related to six epidemic-prone diseases: COVID-19, Flu, Hepatitis, Dengue, Malaria, and HIV/AIDS. For this purpose, we develop a tool, ESS-T (Epidemic Surveillance Study via Twitter), which collects tweets according to provided input parameters and keywords. The tool also assigns locations to tweets with 95% accuracy and analyzes the collected tweets with a focus on temporal distribution, spatial patterns, users, entities, sentiment, and misinformation. Leveraging ESS-T, we build two geo-tagged datasets, EpiBank-global and EpiBank-Pak, containing 86 million tweets from 190 countries and 2.6 million tweets from Pakistan, respectively. Our spatial analysis of EpiBank-global for COVID-19, Malaria, and Dengue indicates that our framework correctly identifies high-risk epidemic-prone countries according to World Health Organization (WHO) statistics.
In virtual network environments, building secure and effective systems is crucial for their correct functioning, and the anomaly detection task is at the core of this effort. To uncover and predict abnormalities in the behavior of a virtual machine, it is desirable to extract relevant information from system text logs. The main issue is that text is unstructured, symbolic data that is very expensive to process. However, recent advances in deep learning have shown remarkable capabilities for handling such data. In this work, we propose using a simple LSTM recurrent network on top of a pre-trained Sentence-BERT, which encodes the system logs into fixed-length vectors. We trained the model in an unsupervised fashion to learn the likelihood of the represented sequences of logs. This way, the model can trigger a warning with an accuracy of 81% when a virtual machine generates an abnormal sequence. Our approach is not only easy to train and computationally cheap, but also generalizes to the content of any input.
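The idea of learning the likelihood of log sequences and warning on unlikely ones can be illustrated with a much simpler stand-in. The sketch below is not the LSTM over Sentence-BERT vectors described above: it assumes logs have already been reduced to discrete event names and fits a smoothed bigram model instead. The event names and the threshold of 1.0 are hypothetical values chosen for illustration.

```python
import math
from collections import Counter


def fit_bigram(sequences):
    """Count event-to-event transitions in sequences assumed to be normal."""
    vocab = {tok for seq in sequences for tok in seq}
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return vocab, pair_counts, prev_counts


def avg_neg_log_likelihood(seq, model):
    """Average negative log-likelihood per transition (higher = more unusual)."""
    vocab, pair_counts, prev_counts = model
    size = len(vocab) + 1  # add-one smoothing, with one slot for unseen events
    transitions = list(zip(seq, seq[1:]))
    if not transitions:
        return 0.0
    total = 0.0
    for prev, cur in transitions:
        prob = (pair_counts[(prev, cur)] + 1) / (prev_counts[prev] + size)
        total -= math.log(prob)
    return total / len(transitions)


# Hypothetical event names standing in for encoded log lines.
normal_logs = [["boot", "mount", "start", "ready"]] * 20
model = fit_bigram(normal_logs)

THRESHOLD = 1.0  # hypothetical; in practice tuned on held-out normal data
normal_score = avg_neg_log_likelihood(["boot", "mount", "start", "ready"], model)
odd_score = avg_neg_log_likelihood(["boot", "ready", "boot", "ready"], model)
warn_on_normal = normal_score > THRESHOLD
warn_on_odd = odd_score > THRESHOLD
```

With these toy inputs the familiar sequence scores below the threshold and the reordered one scores above it, which is the warning behavior the abstract describes, only with a far weaker sequence model.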
Many real-world applications involve multiclass classification problems, and the data are often unevenly distributed across classes. Because of this imbalance, supervised learning models tend to assign instances to the class with the largest number of instances, which is a severe issue that needs to be addressed. In multiclass imbalanced data classification, researchers try to reduce the learning model's bias towards the class with a high sample count by balancing the data before the classifier learns from it, by modifying the classifier's learning phase to pay more attention to the class with the fewest instances, or by a combination of both. Existing algorithmic approaches struggle to find a clear boundary between the samples of different classes because of the skewed class distribution and class overlap. As a result, the minority class recognition rate is poor. We propose a new algorithmic approach that uses two decision trees. The first creates an induced dataset using a PCA-based grouping approach and by assigning weights to the data samples; the second learns and predicts from the induced dataset. The distinctive feature of this approach is that it recognizes data instances without altering their underlying distribution and is applicable to all categories of multiclass imbalanced datasets. We evaluated the proposed algorithm on five multiclass imbalanced datasets from UCI, and the results show that the duo-decision tree approach pays better attention to both minority and majority class samples.
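As a point of reference for the "modify the learning phase" strategy mentioned above, the following minimal sketch computes inverse-frequency class weights, a common generic heuristic. It is not the PCA-based weighting of the proposed duo-decision tree approach, and the class names and counts are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical imbalanced label set: 6 / 3 / 1 samples per class.
labels = ["healthy"] * 6 + ["mild"] * 3 + ["severe"] * 1
weights = balanced_class_weights(labels)

# Per-sample weights that a weighted learner could consume.
sample_weights = [weights[y] for y in labels]
```

Under this scheme every class contributes the same total weight, so the rarest class receives the largest per-sample weight while the overall weight mass stays equal to the number of samples.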
Traffic flow prediction plays an important role in smart cities. Although many neural network models that can predict traffic flow already exist, they still have shortcomings when faced with complex spatio-temporal data. First, although they take local spatio-temporal relations into account, they ignore global information and therefore cannot capture global trends. Second, although most models construct spatio-temporal graphs for convolution, they ignore the dynamic characteristics of these graphs and therefore cannot capture local fluctuations. Finally, currently popular models need long training times to obtain better prediction results, resulting in high computing costs. To this end, we propose a new model, the Multi-Step Trend Aware Graph Neural Network (MSTAGNN), which considers the influence of global spatio-temporal information and captures the dynamic characteristics of spatio-temporal graphs. It not only accurately captures local fluctuations, but also extracts global trends and dramatically reduces computing cost. The experimental results show that our proposed model outperforms the baselines. In particular, on the PEMSD8 dataset, mean absolute error (MAE) was reduced by 6.25% and total training time was reduced by 79%. The source code is available at: https://github.com/Vitalitypi/MSTAGNN.
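For readers less familiar with the metric, the sketch below shows how MAE and a relative reduction against a baseline are computed. The flow values are toy numbers invented for illustration; they are not PEMSD8 data and do not reproduce the reported experiments.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all paired observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline, proposed):
    """Fractional reduction of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline


# Toy traffic-flow values (hypothetical).
observed = [100.0, 120.0, 90.0, 110.0]
baseline_pred = [108.0, 112.0, 98.0, 102.0]
proposed_pred = [107.5, 112.5, 97.5, 102.5]

baseline_mae = mean_absolute_error(observed, baseline_pred)  # 8.0
proposed_mae = mean_absolute_error(observed, proposed_pred)  # 7.5
reduction = relative_reduction(baseline_mae, proposed_mae)   # 0.0625
```

A reduction of 0.0625 corresponds to a 6.25% lower MAE, which is how a percentage improvement over a baseline is usually reported.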
The past few decades have established how digital technologies and platforms have provided an effective medium for spreading hateful content, which has been linked to several catastrophic consequences. Recent academic studies have also highlighted how online hate is a phenomenon that strategically makes use of multiple online platforms. In this article, we seek to advance the current research landscape by harnessing a cross-platform approach to computationally analyse content relating to the 2020 COVID-19 pandemic. More specifically, we analyse content on hate-specific environments from Twitter, Reddit, 4chan and Stormfront. Our findings show how content and posting activity can change across platforms, and how the psychological components of online content can differ depending on the platform being used. Through this, we provide unique insight into the cross-platform behaviours of online hate. We further define several avenues for future research within this field so as to gain a more comprehensive understanding of the global hate ecosystem.
As a two-sided clustering and dimensionality reduction paradigm, Non-negative Matrix Tri-Factorization (NMTF) has attracted much attention among machine learning and data mining researchers due to its excellent performance and reliable theoretical support. Unlike Non-negative Matrix Factorization (NMF) methods, which are applicable to one-sided clustering only, NMTF introduces an additional factor matrix and uses the inherent duality of data so that sample clustering and feature clustering reinforce each other, thus showing great advantages in many scenarios (e.g., text co-clustering). However, existing methods for solving NMTF usually involve intensive matrix multiplication with high time and space complexity; in practice, they suffer from slow convergence of the multiplicative update rules and high memory overhead. To address these problems, this paper develops a distributed parallel algorithm for NMTF with a 2-dimensional data partition scheme (PNMTF-2D). Experiments on multiple text datasets show that the proposed PNMTF-2D can substantially improve the computational efficiency of NMTF (e.g., the average iteration time is reduced by up to 99.7% on Amazon) while maintaining convergence and co-clustering effectiveness.
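To make the cost of the multiplicative updates concrete, a common unconstrained formulation of NMTF is written out below. The symbols $X$, $F$, $S$, $G$ and the dimensions $m$, $n$, $k$, $l$ are notation assumed here for illustration; the paper's exact objective and constraints may differ.

```latex
% Objective: factor a non-negative m x n matrix X into three non-negative factors.
\min_{F \ge 0,\; S \ge 0,\; G \ge 0} \; \lVert X - F S G^{\top} \rVert_F^2,
\qquad F \in \mathbb{R}^{m \times k},\; S \in \mathbb{R}^{k \times l},\; G \in \mathbb{R}^{n \times l}

% Multiplicative updates (\odot and the fraction bar are element-wise).
F \leftarrow F \odot \frac{X G S^{\top}}{F S G^{\top} G S^{\top}}, \qquad
G \leftarrow G \odot \frac{X^{\top} F S}{G S^{\top} F^{\top} F S}, \qquad
S \leftarrow S \odot \frac{F^{\top} X G}{F^{\top} F S G^{\top} G}
```

Every update multiplies by the full data matrix $X$ or its transpose, so each iteration touches all $m \times n$ entries. This is the kind of workload that partitioning $X$ along both rows and columns is intended to distribute.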
Assessing soil fertility through traditional methods has faced challenges due to the vast amount of meteorological data and the complexity of heterogeneous data. In this study, we address these challenges by employing the K-means algorithm for cluster analysis on soil fertility data and developing a novel K-means algorithm within the Hadoop framework. Our research aims to provide a comprehensive analysis of soil fertility in the Shihezi region, particularly in the context of oasis cotton fields, leveraging big data techniques. The methodology involves utilizing soil nutrient data from 29 sampling points across six round fields in 2022. Through K-means clustering with varying K values, we determine that setting K to 3 yields optimal cluster effects, aligning closely with the actual soil fertility distribution. Furthermore, we compare the performance of our proposed K-means algorithm under the MapReduce framework with traditional serial K-means algorithms, demonstrating significant improvements in operational speed and successful completion of large-scale data computations. Our findings reveal that soil fertility in the Shihezi region can be classified into four distinct grades, providing valuable insights for agricultural practices and land management strategies. This classification contributes to a better understanding of soil resources in oasis cotton fields and facilitates informed decision-making processes for farmers and policymakers alike.
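The MapReduce formulation of K-means mentioned above can be sketched in a single process as follows. This is a minimal illustration of the map and reduce roles, not the Hadoop implementation used in the study; the function names, the two-dimensional sample points, and the initial centroids are all hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(points, centroids):
    """Map step: emit (index of nearest centroid, point) pairs."""
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: squared_distance(p, centroids[i]))
        yield idx, p


def reduce_phase(pairs, centroids):
    """Reduce step: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    updated = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            updated.append(old)  # keep the centroid of an empty cluster
            continue
        updated.append(tuple(sum(m[d] for m in members) / len(members)
                             for d in range(len(old))))
    return updated


def kmeans(points, centroids, max_iter=100):
    """Alternate map and reduce steps until the centroids stop moving."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical 2-D samples and initial centroids with K = 3.
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0),
           (10.0, 12.0), (20.0, 0.0), (22.0, 0.0)]
final_centroids = kmeans(samples, [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)])
# final_centroids == [(0.0, 1.0), (10.0, 11.0), (21.0, 0.0)]
```

In a distributed setting the map step would run in parallel over data splits and the reduce step would aggregate per-centroid partial sums, which is where the speedup over a serial loop would come from.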
AMT (Audio Magnetotelluric) is widely used for obtaining geological settings related to sandstone-type uranium deposits, such as the extent of buried sand bodies and the top boundary of baserock. However, these geological settings are hard to read directly from survey sections without geological interpretation, which relies heavily on the interpreter's experience and judgment. Moreover, with the development of 3D technology, manual geological interpretation shows low efficiency and reliability. In this paper, a machine learning model constructed using U-net was used for the geological interpretation of AMT data in the Naren-Yihegaole area. To train the model, a training dataset was built from simulated data generated from random models, which addresses the issue of insufficient data samples. In the prediction stage, sand bodies and baserock were delineated from the inversion resistivity images. The machine learning interpretation showed high consistency with the manual one while taking less time. This indicates that the technique is more individualized and effective than the traditional approach.
This paper proposes a novel approach to topic detection aimed at improving the semi-supervised clustering of customer reviews in the context of customer services. The proposed methodology, named SeMi-supervised clustering for Assessment of Reviews using Topic and Sentiment (SMARTS) for Topic-Community Representation with Semantic Networks, combines semantic and sentiment analysis of words to derive topics related to positive and negative reviews of specific services. To achieve this, a semantic network of words is constructed based on word embedding semantic similarity to identify relationships between words used in the reviews. The resulting network is then used to derive the topics present in users' reviews, which are grouped by positive and negative sentiment based on words related to specific services. Clusters of words, obtained from the network's communities, are used to extract topics related to particular services and to improve the interpretation of users' assessments of those services. The proposed methodology is applied to tourism review data from Booking.com, and the results demonstrate the efficacy of the approach in enhancing the interpretability of the topics obtained by semi-supervised clustering. The methodology has the potential to provide valuable insights into customer sentiment toward tourism services, which service providers and decision-makers could use to enhance the quality of their services.
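The construction of a semantic word network from embedding similarity can be sketched as follows. This is a simplified illustration under stated assumptions: the two-dimensional vectors are invented toy embeddings, the similarity threshold of 0.8 is arbitrary, and connected components stand in for the community detection step, whose actual algorithm is not specified above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_network(embeddings, threshold):
    """Link two words when their embedding similarity reaches the threshold."""
    words = sorted(embeddings)
    graph = {w: set() for w in words}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if cosine(embeddings[w1], embeddings[w2]) >= threshold:
                graph[w1].add(w2)
                graph[w2].add(w1)
    return graph


def connected_components(graph):
    """Group words into components (a simple stand-in for communities)."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components


# Toy 2-D "embeddings" (hypothetical values, not from a trained model).
toy_embeddings = {
    "room": (1.0, 0.1),
    "bed": (0.9, 0.2),
    "breakfast": (0.1, 1.0),
    "coffee": (0.2, 0.9),
}
network = build_network(toy_embeddings, threshold=0.8)
topics = connected_components(network)
# topics == [["bed", "room"], ["breakfast", "coffee"]]
```

Each resulting word group can be read as a candidate topic, here one about the room and one about breakfast. A dedicated community detection method would typically replace the connected-components step on a real, denser network.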