Pub Date: 2023-08-22 DOI: 10.1016/j.bdr.2023.100407
Felipe Tomazelli Lima, Vinicius M.A. Souza
Normalization is a mandatory preprocessing step in time series problems: it makes similarity comparisons invariant to unexpected distortions in amplitude and offset. Such distortions are common in most time series data. A typical example is gait recognition from motion data collected on subjects of varying body height and width. To rescale the data to a common range of values, the vast majority of researchers adopt z-normalization as the default method, regardless of application domain, data, or task. This choice is made without a search process of the kind used to set an algorithm's parameters, and without experimental evidence in the literature covering a variety of scenarios to support it. To address this gap, we evaluate the impact of different normalization methods on time series data. Our analysis is based on an extensive experimental comparison on classification problems involving 10 normalization methods, 3 state-of-the-art classifiers, and 38 benchmark datasets. We consider the classification task because of its simple experimental settings and well-defined metrics. However, our findings can be extrapolated to other time series mining tasks, such as forecasting or clustering. Based on our results, we suggest evaluating maximum absolute scaling as an alternative to z-normalization. Besides being time efficient, this alternative shows promising results for similarity-based methods using Euclidean distance. For deep learning, mean normalization could be considered.
Title: A Large Comparison of Normalization Methods on Time Series. Journal: Big Data Research.
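The three normalization methods named in the abstract above can be illustrated with a minimal sketch. The formulas below are the commonly used textbook definitions and are an assumption here; the abstract does not spell out the exact variants the authors evaluated, and the function names are hypothetical.

```python
import math

def z_normalize(series):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        return [0.0] * n  # constant series: avoid division by zero
    return [(x - mean) / std for x in series]

def max_abs_scale(series):
    """Divide every value by the largest absolute value in the series."""
    peak = max(abs(x) for x in series)
    if peak == 0:
        return [0.0] * len(series)
    return [x / peak for x in series]

def mean_normalize(series):
    """Subtract the mean and divide by the range (max - min)."""
    mean = sum(series) / len(series)
    span = max(series) - min(series)
    if span == 0:
        return [0.0] * len(series)
    return [(x - mean) / span for x in series]

# Toy series used only for illustration.
series = [2.0, 4.0, 6.0, 8.0]
z_scores = z_normalize(series)
max_abs = max_abs_scale(series)
mean_norm = mean_normalize(series)
print(z_scores, max_abs, mean_norm)
```

Maximum absolute scaling needs a single pass to find the peak and one division per point, which is consistent with the abstract's remark that it is time efficient.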
Pub Date: 2023-08-01 DOI: 10.1016/j.bdr.2023.100398
Amr M. Abdeltif, K. Hosny, M. M. Darwish, Ahmad Salah, KenLi Li
Title: Parallel Framework for Memory-Efficient Computation of Image Descriptors for Megapixel Images. Journal: Big Data Research.
Pub Date: 2023-05-28 DOI: 10.1016/j.bdr.2023.100379
Liqing Qiu, Jingcheng Zhou, Caixia Jing, Yuying Liu
Heterogeneous graph embedding maps a high-dimensional graph with different types of nodes and edges to a low-dimensional space so that it performs well in downstream tasks. Existing models mainly use two approaches to explore and embed heterogeneous graph information. One is to use meta-paths to mine heterogeneous information; the other is to use special modules designed by researchers to explore heterogeneous information. These models show excellent performance in heterogeneous graph embedding tasks. However, none of them uses the number of meta-path instances between nodes to improve the performance of heterogeneous graph embedding. This paper proposes a Heterogeneous Graph Convolutional Network based on Correlation Matrix (CMHGCN), which makes full use of the number of meta-path instances between nodes to discover interactive information between nodes in heterogeneous graphs. CMHGCN contains two core components: the node-level correlation component and the semantic-level correlation component. The node-level correlation component uses the number of meta-path instances between nodes to calculate the correlation between nodes guided by different meta-paths. The semantic-level correlation component integrates such information from different meta-paths. According to experiments on three benchmark heterogeneous datasets, CMHGCN outperforms baselines in node classification and clustering on heterogeneous graphs with a large number of meta-path instances.
Title: Heterogeneous Graph Convolutional Network Based on Correlation Matrix. Journal: Big Data Research.
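To make "the number of meta-path instances between nodes" concrete, the sketch below counts instances of an Author-Paper-Author meta-path by multiplying an author-paper incidence matrix with its transpose, then row-normalizes the counts. This is a generic illustration, not CMHGCN itself: the toy graph, the function names, and the row normalization are all assumptions, since the abstract does not define how the correlation matrix is computed.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]

def matmul(a, b):
    """Plain matrix product for small list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def metapath_counts(author_paper):
    """Entry [i][j] is the number of Author-Paper-Author instances from i to j."""
    return matmul(author_paper, transpose(author_paper))

def row_normalize(counts):
    """Turn raw counts into per-node weights that sum to 1 (assumed normalization)."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

# Hypothetical toy graph: 3 authors (rows) and 3 papers (columns).
author_paper = [
    [1, 1, 0],  # author 0 wrote papers 0 and 1
    [1, 0, 1],  # author 1 wrote papers 0 and 2
    [0, 0, 1],  # author 2 wrote paper 2
]
counts = metapath_counts(author_paper)
correlation = row_normalize(counts)
print(counts)
print(correlation)
```

The diagonal entries count paths that return to the same author through one of that author's own papers; whether to keep them is a design choice the abstract does not address.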
Pub Date: 2023-05-28 DOI: 10.1016/j.bdr.2023.100380
Jinghui Peng, Xinyu Hu, Wenbo Huang, Jian Yang
With the explosive growth of multi-modal information on the Internet, the multi-modal knowledge graph (MMKG) has become an important research topic in knowledge graphs, driven by the needs of data management and applications. Most research on MMKGs takes image-text data as the research object and uses multi-modal deep learning to process it. By contrast, there is no uniform definition of the structure of an MMKG. This paper focuses on MMKGs, introduces the related theories of multi-modal knowledge, and analyzes several common ideas about MMKG construction. The survey also explains the structural evolution, proposes mirror node alignment to represent cross-modal knowledge for MMKGs, lists the difficulties of some tasks, and finally gives a sample MMKG for the news scene.
Title: What Is a Multi-Modal Knowledge Graph: A Survey. Journal: Big Data Research.
Pub Date: 2023-05-28 DOI: 10.1016/j.bdr.2023.100384
Junru Wang, Shixin Zhang, Anbang Dai
Introduction: Influenza still poses a great threat to humans. Knowledge of the systematic disease burden of influenza in Japan is limited. This study aimed to investigate the spatio-temporal characteristics of the influenza burden and its influencing factors over the past three decades.
Methods: Data on annual deaths, years lived with disability (YLDs), years of life lost (YLLs) and disability-adjusted life years (DALYs) due to influenza in Japan from 1990 to 2019 were obtained from the Global Health Data Exchange (GHDx), and annual social household data were obtained from e-Stat in Japan. A joinpoint regression model was used to assess the trends of influenza from 1990 to 2019, a discrete Poisson model to analyze the spatial and temporal clusters of influenza, and a generalized linear model to assess the association of influenza deaths and DALYs with social household factors.
Results: From 1990 to 2019, the mortality rate in Japan increased from 9.95 per 100000 to 19.49 per 100000, with an AAPC of 2.2% (95% CI: 1.5, 3.0, P<0.05). The DALYs rate increased from 153.86 per 100000 to 209.22 per 100000, with an AAPC of 1.0% (95% CI: 0.1, 1.9, P<0.05). The mortality rate ranged from 1.98 per 100000 (Chiba) to 16.9 per 100000 (Kochi) in 1990, and from 5.10 per 100000 (Chiba) to 35.74 per 100000 (Akita) in 2019. The population aged 60+ had the highest mortality rate, which went from 53.79 per 100000 in 1990 to 55.74 per 100000 in 2019 (AAPC: 0.0%, 95% CI: -0.5, 0.6, P=0.944), and the highest DALYs rate, which went from 713.43 per 100000 to 565.22 per 100000 (AAPC: -0.9%, 95% CI: -1.5, -0.3, P<0.05). YLLs and DALYs rates among the population aged 1-4 were also high from 1990 to 2019, ranking after those of the population aged 60+. The mortality rate showed two spatio-temporal clusters across Japan: northern Japan in 2005-2019 (RR = 1.36, P < 0.001) and southern Japan over the same period (RR = 1.36, P < 0.001). The generalized linear model (GLM) indicated that year was positively correlated with the mortality rate of influenza (β = 0.18, p<0.01), while the proportion of households ordering via the internet and the population were negatively correlated with it (β = -4.41, p<0.05 and β = -0.17, p<0.01, respectively).
Conclusions: The disease burden of influenza in Japan increased over the past three decades, especially among the population aged 60+ years, followed by the population aged 1-4 years. It showed two spatio-temporal clusters across Japan. The lifestyle of households that order via the internet contributed to a lower mortality rate of influenza.
Title: Spatio-Temporal Characteristics of Influenza Burden and Its Influence Factors in Japan in the Past Three Decades: An Influenza Disease Burden Data-Based Modeling Study. Journal: Big Data Research.
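As a rough plausibility check on the AAPC figures reported above, the sketch below computes the constant yearly percent change that would carry the 1990 rate to the 2019 rate. This is deliberately not joinpoint regression: the AAPC in the study is derived from fitted trend segments, so the endpoint-based value is only expected to be of similar magnitude, not identical. The function name is hypothetical.

```python
def compound_annual_change(start_rate, end_rate, years):
    """Constant yearly percent change that takes start_rate to end_rate."""
    return ((end_rate / start_rate) ** (1.0 / years) - 1.0) * 100.0

years = 2019 - 1990  # 29 yearly steps between the two endpoints

# Rates per 100000 taken from the Results paragraph of the abstract.
mortality_change = compound_annual_change(9.95, 19.49, years)
daly_change = compound_annual_change(153.86, 209.22, years)

print(f"mortality: {mortality_change:.2f}% per year")
print(f"DALYs:     {daly_change:.2f}% per year")
```

The two values come out a little over 2% and about 1% per year, the same order of magnitude as the reported AAPCs of 2.2% and 1.0%.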
Pub Date: 2023-05-28 DOI: 10.1016/j.bdr.2023.100378
Yuxin Wu, Guofeng Deng
Sentiment analysis has always been an important basic task in the NLP field. Recently, graph convolutional networks (GCNs) have been widely used in aspect-level sentiment analysis. Because GCNs have good aggregation effects, every node can contain neighboring node information. However, in previous studies, most models used only a single GCN to learn contextual information. The GCN relies on the construction method of the graph, and a single GCN will cause the model to focus on a certain relationship of nodes that depends on the construction method and ignore other information. In addition, when the GCN aggregates node information, it cannot determine whether the aggregated information is useful, so it will inevitably introduce noise. We propose a model that fuses two parallel GCNs to learn different relational features between sentences at the same time, and we add a gate mechanism to the GCN to filter the noise introduced by the GCN when aggregating information. Finally, we validate our model on public datasets, and the experiments show that compared to state-of-the-art models, our model performs better.
Title: A Parallel Fusion Graph Convolutional Network for Aspect-Level Sentiment Analysis. Journal: Big Data Research.
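The idea described in the abstract above (two GCNs run in parallel over different graphs, each with a gate that filters aggregated information) can be sketched as follows. Everything specific here is an assumption made for illustration: the mean-style neighborhood aggregation, the sigmoid gate computed from each node's own features, the fusion by concatenation, and all function and parameter names. The abstract does not give the authors' actual formulation.

```python
import math

def matmul(a, b):
    """Plain matrix product for small list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adj(adj):
    """Add self-loops and row-normalize, so each node averages itself and its neighbors."""
    n = len(adj)
    looped = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in looped]

def gated_gcn_layer(adj, feats, w_msg, w_gate):
    """One GCN layer whose ReLU output is scaled by a per-feature sigmoid gate."""
    messages = matmul(matmul(normalize_adj(adj), feats), w_msg)
    gate_logits = matmul(feats, w_gate)  # gate depends on the node's own features
    return [
        [max(0.0, m) / (1.0 + math.exp(-g)) for m, g in zip(m_row, g_row)]
        for m_row, g_row in zip(messages, gate_logits)
    ]

def parallel_fusion(adj_a, adj_b, feats, params_a, params_b):
    """Run two gated GCN branches on different graphs and concatenate per node."""
    h_a = gated_gcn_layer(adj_a, feats, *params_a)
    h_b = gated_gcn_layer(adj_b, feats, *params_b)
    return [row_a + row_b for row_a, row_b in zip(h_a, h_b)]

# Hypothetical 3-token sentence with two different graphs over the same tokens.
adj_syntactic = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
adj_semantic = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

fused = parallel_fusion(
    adj_syntactic, adj_semantic, feats, (identity, identity), (identity, identity)
)
print(fused)
```

Because the sigmoid gate lies strictly between 0 and 1, it can only shrink an aggregated feature, which is one simple way to read "filtering" noise from neighborhood aggregation.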
Pub Date: 2023-05-28 DOI: 10.1016/j.bdr.2023.100381
Yiming Tan, Yongrui Chen, Guilin Qi, Weizhuo Li, Meng Wang
Knowledge Graph-based Multilingual Question Answering (KG-MLQA), one of the essential subtasks of Knowledge Graph-based Question Answering (KGQA), emphasizes that questions in the KGQA task can be expressed in different languages, and aims to bridge the lexical gap between questions and knowledge graph(s). However, existing KG-MLQA work mainly focuses on the semantic parsing of multilingual questions and ignores questions that require integrating information from cross-lingual knowledge graphs (CLKG). This paper extends KG-MLQA to Cross-lingual KG-based multilingual Question Answering (CLKGQA) and constructs the first CLKGQA dataset over multilingual DBpedia, named MLPQ, which contains 300K questions in English, Chinese, and French. We further propose a novel KG sampling algorithm for KG construction, so that MLPQ supports research on different types of methods. To evaluate the dataset, we put forward a general question answering workflow whose core idea is to transform CLKGQA into KG-MLQA. We first use an Entity Alignment (EA) model to merge the CLKG into a single KG, and then obtain the answer to the question with a multi-hop QA model combined with a multilingual pre-training model. By instantiating this QA workflow, we establish two baseline models for MLPQ, one of which uses Google translation to obtain aligned entities, while the other adopts a recent EA model. Experiments show that the baseline models fall short of ideal performance on CLKGQA. Moreover, the availability of our benchmark contributes to the question answering and entity alignment communities.
Title: MLPQ: A Dataset for Path Question Answering over Multilingual Knowledge Graphs. Journal: Big Data Research.
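The workflow described in the abstract above (merge cross-lingual KGs through entity alignment, then answer a path question on the merged graph) can be sketched on toy data. The triples, the alignment dictionary, and the function names are hypothetical; a real system would obtain the alignment from an EA model or translation and would map the question to relations with a multi-hop QA model rather than receive the relation path directly.

```python
def merge_kgs(kg_primary, kg_secondary, alignment):
    """Rewrite aligned entities of the secondary KG and union the triple sets."""
    merged = set(kg_primary)
    for head, relation, tail in kg_secondary:
        merged.add((alignment.get(head, head), relation, alignment.get(tail, tail)))
    return merged

def answer_path_question(kg, start_entity, relation_path):
    """Follow the relations hop by hop and return the entities reached."""
    frontier = {start_entity}
    for relation in relation_path:
        frontier = {t for (h, r, t) in kg if h in frontier and r == relation}
    return frontier

# Toy English and French KGs; neither alone can answer the two-hop question.
kg_en = {("Marie_Curie", "birthPlace", "Warsaw")}
kg_fr = {("Varsovie", "country", "Pologne")}

# Hypothetical output of an entity alignment step (French entity -> English entity).
alignment = {"Varsovie": "Warsaw", "Pologne": "Poland"}

merged = merge_kgs(kg_en, kg_fr, alignment)
answer = answer_path_question(merged, "Marie_Curie", ["birthPlace", "country"])
print(answer)
```

On the English KG alone the second hop finds nothing, which is the kind of question that requires integrating information across cross-lingual KGs.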
Pub Date: 2023-05-28 DOI: 10.1016/j.bdr.2023.100377
Huimin Feng, Ruizhe Ma, Li Yan, Zongmin Ma
Large volumes of time series data are produced by the widespread use of IoT devices and sensors. Time series compression is widely adopted to reduce storage overhead and transport costs. At present, most state-of-the-art approaches focus on univariate time series, so compressing multivariate time series (MTS) remains an important but challenging problem. Traditional MTS compression methods treat each variable individually, ignoring the correlations across variables. This paper proposes a novel MTS prediction method that can be applied to MTS compression to achieve a higher compression ratio. The method extracts the spatial and temporal correlations across multiple variables, achieving more accurate prediction and improving the lossy compression performance of MTS under the prediction-quantization-entropy framework. We use a convolutional neural network (CNN) to extract the temporal features of all variables within the window length. The features generated by the CNN are then transformed, and an image classification algorithm extracts the spatial features of the transformed data. Predictions are made from these spatiotemporal features. To enhance the robustness of our model, we integrate an autoregressive (AR) linear model in parallel with the proposed network. Experimental results demonstrate that our work improves the prediction accuracy and the compression performance of MTS in most cases.
Title: Spatiotemporal Prediction Based on Feature Classification for Multivariate Floating-Point Time Series Lossy Compression. Journal: Big Data Research.
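The prediction-quantization-entropy framework mentioned in the abstract above can be illustrated with a minimal univariate sketch. The predictor here is simply the previously reconstructed value, standing in for the paper's CNN-based spatiotemporal predictor; the uniform quantizer with step `2 * error_bound` and the use of Shannon entropy as a stand-in for an actual entropy coder are likewise assumptions, and all names are hypothetical.

```python
import math
from collections import Counter

def compress(series, error_bound):
    """Quantize prediction residuals so each point is off by at most error_bound."""
    step = 2.0 * error_bound
    codes = []
    previous = 0.0  # decoder-side reconstruction of the previous point
    for value in series:
        prediction = previous  # stand-in predictor: last reconstructed value
        code = round((value - prediction) / step)
        codes.append(code)
        previous = prediction + code * step  # mirror what the decoder will see
    return codes

def decompress(codes, error_bound):
    """Rebuild the series by replaying the same predictor on the integer codes."""
    step = 2.0 * error_bound
    reconstructed = []
    previous = 0.0
    for code in codes:
        previous = previous + code * step
        reconstructed.append(previous)
    return reconstructed

def entropy_bits_per_code(codes):
    """Shannon entropy of the code stream: a lower bound for an ideal entropy coder."""
    total = len(codes)
    return -sum((c / total) * math.log2(c / total) for c in Counter(codes).values())

# Toy sensor readings used only for illustration.
series = [20.0, 20.1, 20.3, 20.2, 20.4, 20.4, 20.5]
error_bound = 0.05
codes = compress(series, error_bound)
restored = decompress(codes, error_bound)
max_error = max(abs(a - b) for a, b in zip(series, restored))
print(codes, max_error, entropy_bits_per_code(codes))
```

A more accurate predictor concentrates the residual codes around zero, which lowers the entropy of the code stream; this is why better prediction translates into a higher compression ratio in this framework.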
Pub Date: 2023-05-01 DOI: 10.1016/j.bdr.2023.100395
Ling Ding, Peng Du, Hai-wei Hou, Jian Zhang, Di Jin, Shifei Ding
Title: Botnet DGA Domain Name Classification Using Transformer Network with Hybrid Embedding. Journal: Big Data Research.