Pub Date: 2021-12-27; DOI: 10.26599/BDMA.2021.9020018
He Wang;Zhoujian Cao;Yue Zhou;Zong-Kuan Guo;Zhixiang Ren
Extracting knowledge from high-dimensional data has been notoriously difficult, primarily due to the so-called "curse of dimensionality" and the complex joint distributions of these dimensions. This issue is particularly acute for high-dimensional gravitational wave data analysis, where one must conduct Bayesian inference and estimate joint posterior distributions. In this study, we incorporate prior physical knowledge by sampling from desired interim distributions to construct the training dataset. Accordingly, the more relevant regions of the high-dimensional feature space are covered by additional data points, so that the model can learn the subtle but important details. We adapt the normalizing flow method to be more expressive and trainable, such that the information can be effectively extracted and represented by the transformation between the prior and target distributions. Once trained, our model takes only approximately 1 s on a single V100 GPU to generate thousands of samples for probabilistic inference purposes. The evaluation of our approach confirms the efficacy and efficiency of gravitational wave data inference and points to a promising direction for similar research. The source code, specifications, and detailed procedures are publicly accessible on GitHub.
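The training-set construction idea described above (drawing waveform parameters from a chosen interim distribution so the relevant region of parameter space is densely covered, then pairing each draw with simulated strain to train the normalizing flow) can be illustrated with a minimal, hedged sketch. The parameter names, ranges, and interim distribution below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_interim_parameters(n):
    """Draw binary parameters from an assumed interim distribution that
    concentrates samples in the region of interest, instead of a broad
    uninformative prior. Ranges and distributions are illustrative only."""
    m1 = rng.uniform(20.0, 50.0, n)                  # primary mass (solar masses)
    m2 = rng.uniform(20.0, 50.0, n)
    m1, m2 = np.maximum(m1, m2), np.minimum(m1, m2)  # enforce m1 >= m2
    distance = rng.power(3, n) * 1000.0              # p(d) ~ d^2, up to ~1 Gpc
    phase = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.stack([m1, m2, distance, phase], axis=1)

# Each parameter draw would then be paired with a simulated detector strain
# (waveform plus noise) to form one training example for the normalizing flow.
theta = sample_interim_parameters(10_000)
print(theta.shape)  # (10000, 4)
```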
{"title":"Sampling with prior knowledge for high-dimensional gravitational wave data analysis","authors":"He Wang;Zhoujian Cao;Yue Zhou;Zong-Kuan Guo;Zhixiang Ren","doi":"10.26599/BDMA.2021.9020018","journal":"Big Data Mining and Analytics","volume":"5 1","pages":"53-63","publicationDate":"2021-12-27","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9663253/09663260.pdf"}
Pub Date: 2021-12-27; DOI: 10.26599/BDMA.2021.9020025
Satellite images are enormous sources of data that require efficient methods for knowledge discovery. The increased availability of Earth data from satellite images offers immense opportunities in various fields. However, the volume and heterogeneity of these data pose serious computational challenges. The development of efficient techniques has the potential to uncover hidden information in these images, and this knowledge can be used in activities related to planning, monitoring, and managing Earth resources. Deep learning is widely used for image analysis and processing, and deep-learning-based models can be effectively applied to mining and knowledge discovery from satellite images.
{"title":"Call for papers: Special issue on deep learning and evolutionary computation for satellite imagery","doi":"10.26599/BDMA.2021.9020025","journal":"Big Data Mining and Analytics","volume":"5 1","pages":"79-79","publicationDate":"2021-12-27","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9663253/09663262.pdf"}
Pub Date: 2021-12-27; DOI: 10.26599/BDMA.2021.9020026
Artificial Intelligence of Things (AIoT) is experiencing an unprecedented boom with the popularization of end devices and advanced machine learning and data processing techniques. An increasing volume of data is collected every second to enable Artificial Intelligence (AI) on the Internet of Things (IoT). This explosion of data brings significant benefits to intelligent industries that provide predictive services and to research institutes advancing human knowledge in data-intensive fields. To make the best use of the collected data, various data mining techniques have been deployed to extract data patterns. In classic scenarios, the data collected from IoT devices are sent directly to cloud servers for processing, for example to train machine learning models. However, the network between cloud servers and massive numbers of end devices may not be stable due to irregular bursts of traffic, weather, etc. Therefore, autonomous data mining, self-organized by a group of local devices to maintain ongoing and robust AI services, plays an increasingly important role for critical IoT infrastructures. Privacy issues become more pressing in this scenario: data transmitted via autonomous networks are accessible to all internal participants, which increases the risk of exposure, and data mining techniques themselves may reveal sensitive information from the collected data. Various attacks, such as inference attacks, are emerging and evolving to breach sensitive data because of the great financial benefits involved. Motivated by this, it is essential to devise novel privacy-preserving autonomous data mining solutions for AIoT. In this Special Issue, we aim to gather state-of-the-art advances in privacy-preserving data mining and autonomous data processing solutions for AIoT. Topics include, but are not limited to, the following: • Privacy-preserving federated learning for AIoT • Differentially private machine learning for AIoT • Personalized privacy-preserving data mining • Decentralized machine learning paradigms for autonomous data mining using blockchain • AI-enhanced edge data mining for AIoT • AI and blockchain empowered privacy-preserving big data analytics for AIoT • Anomaly detection and inference attack defense for AIoT • Privacy protection measurement metrics • Zero trust architectures for privacy protection management • Privacy protection data mining and analysis via blockchain-enabled digital twin.
{"title":"Call for papers: Special issue on privacy-preserving data mining for artificial intelligence of things","doi":"10.26599/BDMA.2021.9020026","journal":"Big Data Mining and Analytics","volume":"5 1","pages":"80-80","publicationDate":"2021-12-27","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9663253/09663263.pdf"}
Pub Date: 2021-08-26; DOI: 10.26599/BDMA.2021.9020008
Jintao Zhang;Quan Xu
Graph Neural Networks (GNNs), which are built on homogeneous networks, are powerful tools for learning embedding representations of graph-structured data and have been widely used in various data mining tasks. Applying a GNN to embed a Heterogeneous Information Network (HIN) is a major challenge, mainly because HINs contain many different types of nodes and many different types of relationships between nodes. A HIN carries rich semantic and structural information, which calls for a specially designed graph neural network. However, existing HIN-based graph neural network models rarely consider the interaction information hidden between the meta-paths of a HIN, which results in poor node embeddings. In this paper, we propose an Attention-aware Heterogeneous graph Neural Network (AHNN) model to effectively extract useful information from a HIN and use it to learn node embeddings. Specifically, we first use node-level attention to aggregate and update the node embedding representations, and then concatenate the node embeddings obtained on different meta-paths. Finally, a semantic-level neural network is proposed to extract the feature interactions across different meta-paths and learn the final node embeddings. Experimental results on three widely used datasets show that the AHNN model significantly outperforms state-of-the-art models.
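As a rough illustration of the semantic-level step described above (combining node embeddings obtained on different meta-paths), the following hedged PyTorch sketch learns one importance weight per meta-path and blends the embeddings accordingly. The module name, dimensions, and scoring network are assumptions and not the actual AHNN implementation.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Weight per-meta-path node embeddings with a learned attention score
    and sum them into one embedding per node (sizes are assumptions)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.project = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, 1, bias=False))

    def forward(self, z):                  # z: (num_meta_paths, num_nodes, dim)
        w = self.project(z).mean(dim=1)    # (num_meta_paths, 1) importance per meta-path
        beta = torch.softmax(w, dim=0)     # normalize across meta-paths
        return (beta.unsqueeze(-1) * z).sum(dim=0)   # (num_nodes, dim)

z = torch.randn(3, 100, 64)        # 3 meta-paths, 100 nodes, 64-dim embeddings
fused = SemanticAttention(64)(z)   # (100, 64) fused node embeddings
```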
{"title":"Attention-aware heterogeneous graph neural network","authors":"Jintao Zhang;Quan Xu","doi":"10.26599/BDMA.2021.9020008","journal":"Big Data Mining and Analytics","volume":"4 4","pages":"233-241","publicationDate":"2021-08-26","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523497.pdf"}
Pub Date: 2021-08-26; DOI: 10.26599/BDMA.2021.9020006
Changjie Wang;Zhihua Li;Benjamin Sarpong
Existing identity-recognition technologies require assistive equipment, are expensive, and often have poor recognition accuracy. To overcome these deficiencies, this paper proposes several gait-based identification algorithms. First, gait information collected from the triaxial accelerometers of smartphones is preprocessed, and multimodal fusion with existing standard datasets is used to build a multimodal synthetic dataset. Then, based on the multimodal characteristics of the collected biological gait information, a Convolutional Neural Network based Gait Recognition (CNN-GR) model and a related scheme for the multimodal features are developed. Finally, using the proposed CNN-GR model and scheme, a single-gait feature identification algorithm and a multimodal gait feature fusion identification algorithm are proposed. Experimental results show that the proposed algorithms perform well in terms of recognition accuracy, the confusion matrix, and the kappa statistic, and that they achieve better recognition scores and robustness than the compared algorithms; thus, the proposed approach shows prominent promise in practice.
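A minimal sketch of a convolutional classifier over fixed-length windows of triaxial accelerometer data, in the spirit of the CNN-GR model described above; the layer sizes, window length, and number of subjects are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GaitCNN(nn.Module):
    """Toy 1D CNN over accelerometer windows of shape (3 axes, 128 samples)."""
    def __init__(self, num_subjects=10, window=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(64 * (window // 4), num_subjects)

    def forward(self, x):               # x: (batch, 3, window)
        h = self.features(x)
        return self.classifier(h.flatten(1))   # per-subject class scores

logits = GaitCNN()(torch.randn(8, 3, 128))     # (8, 10)
```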
{"title":"Multimodal adaptive identity-recognition algorithm fused with gait perception","authors":"Changjie Wang;Zhihua Li;Benjamin Sarpong","doi":"10.26599/BDMA.2021.9020006","journal":"Big Data Mining and Analytics","volume":"4 4","pages":"223-232","publicationDate":"2021-08-26","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523496.pdf"}
Pub Date: 2021-08-26; DOI: 10.26599/BDMA.2021.9020012
Sudhir Kumar Patnaik;C. Narendra Babu;Mukul Bhave
Data are crucial to the growth of e-commerce in today's world of highly demanding, hyper-personalized consumer experiences, and they are collected using advanced web scraping technologies. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system built with convolutional and Long Short-Term Memory (LSTM) networks: automated web page detection uses the You Only Look Once (Yolo) algorithm, and Tesseract LSTM extracts product details, which are detected as images from web pages. This state-of-the-art system does not need a core data extraction engine and can therefore adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained on an input dataset of 45 objects or images.
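The extraction step, cropping each region that a detector such as Yolo marks as a product detail and running Tesseract's LSTM OCR on the crop, can be sketched as below. This is a hedged illustration: the bounding boxes and file name are placeholders, and only the standard Pillow and pytesseract calls are used; the paper's own pipeline may differ.

```python
from PIL import Image
import pytesseract

def extract_product_text(screenshot_path, boxes):
    """Run Tesseract OCR on detected regions of a rendered web page.
    `boxes` are (left, top, right, bottom) pixel coordinates, e.g., produced
    by an object detector such as Yolo; they are placeholders here."""
    page = Image.open(screenshot_path)
    texts = []
    for box in boxes:
        crop = page.crop(box)
        # --oem 1 selects Tesseract's LSTM engine; --psm 6 assumes a text block.
        texts.append(pytesseract.image_to_string(crop, config="--oem 1 --psm 6"))
    return texts

# Hypothetical usage with detector output:
# details = extract_product_text("product_page.png", [(40, 120, 600, 180)])
```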
{"title":"Intelligent and adaptive web data extraction system using convolutional and long short-term memory deep learning networks","authors":"Sudhir Kumar Patnaik;C. Narendra Babu;Mukul Bhave","doi":"10.26599/BDMA.2021.9020012","journal":"Big Data Mining and Analytics","volume":"4 4","pages":"279-297","publicationDate":"2021-08-26","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523501.pdf"}
Pub Date: 2021-08-26; DOI: 10.26599/BDMA.2021.9020010
Xueting Liao;Danyang Zheng;Xiaojun Cao
The COVID-19 pandemic has hit the world hard, and reactions to pandemic-related issues have been pouring into social platforms such as Twitter. Many public officials and governments use Twitter to make policy announcements, and people keep close track of the related information and express their concerns about the policies on Twitter. It is beneficial yet challenging to derive important information or knowledge from such Twitter data. In this paper, we propose a Tripartite Graph Clustering for Pandemic Data Analysis (TGC-PDA) framework that builds on three components: (1) tripartite graph representation, (2) non-negative matrix factorization with regularization, and (3) sentiment analysis. We collect tweets containing a set of keywords related to the coronavirus pandemic as the ground-truth data. Our framework can detect communities of Twitter users and analyze the topics discussed within those communities. Extensive experiments show that the TGC-PDA framework can effectively and efficiently identify topics and correlations within Twitter data for monitoring and understanding public opinion, which would provide policy makers with useful information and statistics for decision making.
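The factorization step in the pipeline above can be illustrated with plain non-negative matrix factorization on a user-by-keyword count matrix. This hedged sketch uses scikit-learn's standard NMF rather than the paper's regularized tripartite formulation, and the matrix and component count are invented for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy user-by-keyword count matrix (rows: Twitter users, columns: pandemic keywords).
X = np.random.default_rng(0).poisson(1.0, size=(200, 30)).astype(float)

model = NMF(n_components=5, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(X)   # (200, 5) user-community memberships
H = model.components_        # (5, 30) per-community keyword weights

communities = W.argmax(axis=1)             # assign each user to its strongest community
top_keywords = H.argsort(axis=1)[:, -5:]   # five most indicative keywords per community
print(communities.shape, top_keywords.shape)
```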
{"title":"Coronavirus pandemic analysis through tripartite graph clustering in online social networks","authors":"Xueting Liao;Danyang Zheng;Xiaojun Cao","doi":"10.26599/BDMA.2021.9020010","journal":"Big Data Mining and Analytics","volume":"4 4","pages":"242-251","publicationDate":"2021-08-26","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523498.pdf"}
Pub Date: 2021-08-26; DOI: 10.26599/BDMA.2021.9020009
Xiaohan Li;Bowen Yu;Guanyu Feng;Haojie Wang;Wenguang Chen
In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module that offers relational analysis on Spark with Structured Query Language (SQL) and provides convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still suffers from Spark's inefficiency caused by the Java virtual machine and by unnecessary data serialization and deserialization. Adopting native languages such as C++ can help avoid such bottlenecks: benefiting from a bare-metal runtime environment and template usage, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the required programming and debugging effort. In this work, we present LotusSQL, an engine that provides SQL support for the dataset abstraction of a native backend, Lotus. We employ a convenient SQL processing framework to handle frontend jobs and add advanced query optimization techniques to improve the quality of execution plans. On top of the compute engine's storage design and user interface, LotusSQL implements a set of structured dataset operations with high efficiency and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× on certain queries and outperforms SparkSQL on a standard query benchmark by more than 2× on average.
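For context, the SparkSQL workflow that LotusSQL is benchmarked against looks roughly like the PySpark snippet below. The abstract does not show LotusSQL's own interface, so only the Spark side is sketched here, and the table and query are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-baseline").getOrCreate()

# Register a small DataFrame as a temporary view and query it with SQL.
df = spark.createDataFrame([(1, "a", 3.5), (2, "b", 1.2), (1, "c", 7.9)],
                           ["id", "tag", "score"])
df.createOrReplaceTempView("events")

spark.sql("""
    SELECT id, COUNT(*) AS n, AVG(score) AS avg_score
    FROM events
    GROUP BY id
    ORDER BY id
""").show()

spark.stop()
```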
{"title":"LotusSQL: SQL engine for high-performance big data systems","authors":"Xiaohan Li;Bowen Yu;Guanyu Feng;Haojie Wang;Wenguang Chen","doi":"10.26599/BDMA.2021.9020009","journal":"Big Data Mining and Analytics","volume":"4 4","pages":"252-265","publicationDate":"2021-08-26","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523499.pdf"}
Pub Date: 2021-08-26; DOI: 10.26599/BDMA.2021.9020011
Chenyu Hou;Jiawei Wu;Bin Cao;Jing Fan
Time series forecasting has attracted wide attention in recent decades. However, some time series are imbalanced and show different patterns between special and normal periods, which degrades the prediction accuracy for the special periods. In this paper, we aim to develop a unified model that alleviates this imbalance and thus improves the prediction accuracy for special periods. This task is challenging for two reasons: (1) the temporal dependency of the series, and (2) the tradeoff between mining similar patterns and distinguishing the different distributions of different periods. To tackle these issues, we propose a self-attention-based time-varying prediction model with a two-stage training strategy. First, we use an encoder-decoder module with a multi-head self-attention mechanism to extract the common patterns of the time series. Then, we propose a time-varying optimization module to refine the results for special periods and eliminate the imbalance. Moreover, we propose reverse distance attention in place of traditional dot-product attention to highlight the importance of similar historical values for the forecast results. Finally, extensive experiments show that our model outperforms other baselines in terms of mean absolute error and mean absolute percentage error.
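The abstract does not spell out the reverse distance attention formula, so the following is one plausible, hedged reading: attention scores that grow as a query gets closer to a historical key (reciprocal Euclidean distance) rather than scores based on dot products. The exact formulation in the paper may differ.

```python
import torch

def reverse_distance_attention(q, k, v, eps=1e-6):
    """Assumed variant: weight values by the inverse distance between queries
    and keys, so similar historical points contribute more to the forecast.
    q: (batch, Lq, d), k: (batch, Lk, d), v: (batch, Lk, dv)."""
    dist = torch.cdist(q, k)                           # pairwise distances (batch, Lq, Lk)
    weights = torch.softmax(1.0 / (dist + eps), dim=-1)
    return weights @ v                                 # (batch, Lq, dv)

out = reverse_distance_attention(torch.randn(2, 10, 16),
                                 torch.randn(2, 24, 16),
                                 torch.randn(2, 24, 16))
print(out.shape)  # torch.Size([2, 10, 16])
```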
{"title":"A deep-learning prediction model for imbalanced time series data forecasting","authors":"Chenyu Hou;Jiawei Wu;Bin Cao;Jing Fan","doi":"10.26599/BDMA.2021.9020011","journal":"Big Data Mining and Analytics","volume":"4 4","pages":"266-278","publicationDate":"2021-08-26","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523500.pdf"}
{"title":"Total contents","journal":"Big Data Mining and Analytics","volume":"4 4","pages":"I-II","publicationDate":"2021-08-26","openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523502.pdf"}