Dual-Path Graph Neural Network with Adaptive Auxiliary Module for Link Prediction
Zhenzhen Yang, Zelong Lin, Yongpeng Yang, Jiaqi Li
Pub Date: 2024-03-25; DOI: 10.1089/big.2023.0130
Link prediction, which has important applications in many fields, predicts the possibility of a link between two nodes in a graph. Link prediction based on Graph Neural Networks (GNNs), which obtains node representations and graph structure through a GNN, has attracted a growing amount of attention recently. However, existing GNN-based link prediction approaches have some shortcomings. On the one hand, because a graph contains different types of nodes, aggregating information from a node's neighbors and learning its representation is a great challenge. On the other hand, the attention mechanism has been an effective instrument for enhancing link prediction performance, but the traditional attention mechanism is always monotonic for query nodes, which limits its influence on link prediction. To address these two problems, a Dual-Path Graph Neural Network (DPGNN) for link prediction is proposed in this study. First, we propose a novel Local Random Features Augmentation for Graph Convolution Network as the baseline of one path. Meanwhile, Graph Attention Network version 2 (GATv2), based on a dynamic attention mechanism, is adopted as the baseline of the other path. We then capture more meaningful node representations and more accurate link features by concatenating the information of these two paths. In addition, we propose an adaptive auxiliary module for better balancing the weight of auxiliary tasks, which brings further benefit to link prediction. Finally, extensive experiments verify the effectiveness and superiority of the proposed DPGNN for link prediction.
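A minimal NumPy sketch may help fix the dual-path idea: one GCN-style path, one GATv2-style dynamic-attention path, with the two outputs concatenated into node codes and a dot-product head scoring a candidate link. The toy graph, layer sizes, and scoring head are illustrative assumptions, not the authors' exact DPGNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_path(A, X, W):
    """One GCN-style layer: symmetrically normalized aggregation, then ReLU."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)

def gatv2_path(A, X, W, a):
    """GATv2-style dynamic attention: score a^T LeakyReLU([Wh_i || Wh_j])."""
    H = X @ W
    A_self = A + np.eye(A.shape[0])            # neighbors plus self-loop
    out = np.zeros_like(H)
    for i in range(H.shape[0]):
        nbrs = np.flatnonzero(A_self[i])
        pair = np.hstack([np.tile(H[i], (len(nbrs), 1)), H[nbrs]])
        e = pair @ a
        e = np.where(e > 0, e, 0.2 * e)        # LeakyReLU
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                   # attention weights per query node
        out[i] = alpha @ H[nbrs]
    return out

# Toy graph: 4 nodes, 8-dim features; each path yields 4-dim codes.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))
Z = np.hstack([gcn_path(A, X, rng.normal(size=(8, 4))),
               gatv2_path(A, X, rng.normal(size=(8, 4)), rng.normal(size=8))])
p = 1.0 / (1.0 + np.exp(-Z[0] @ Z[3]))         # dot-product link score for (0, 3)
print(f"p(link 0-3) = {p:.3f}")
```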
Investigating the Co-Movement and Asymmetric Relationships of Oil Prices on the Shipping Stock Returns: Evidence from Three Shipping-Flagged Companies from Germany, South Korea, and Taiwan
Jumadil Saputra, Kasypi Mokhtar, Anuar Abu Bakar, Siti Marsila Mhd Ruslan
Pub Date: 2024-02-13; DOI: 10.1089/big.2023.0026
In the last 2 years, there has been a significant upswing in oil prices, leading to a decline in economic activity and demand. This trend holds substantial implications for the global economy, particularly within the emerging business landscape. Among the influential risk factors impacting the returns of shipping stocks, none looms larger than the volatility in oil prices. Yet only a limited number of studies have explored the complex relationship between oil price shocks and the dynamics of the liner shipping industry, with a specific focus on uncertainty linkages and potential diversification strategies. This study investigates the co-movements and asymmetric associations between oil prices (specifically, West Texas Intermediate and Brent) and the stock returns of three prominent shipping companies from Germany, South Korea, and Taiwan. The results unequivocally highlight the indispensable role of oil prices in shaping both short-term and long-term shipping stock returns. In addition, the research underscores the statistical significance of exchange rates and interest rates in influencing these returns, with their effects varying across different time horizons. Notably, shipping stock prices exhibit heightened sensitivity to positive movements in oil prices, while exchange rates and interest rates exert contrasting impacts, one being positive and the other negative. These findings collectively illuminate the profound influence of market sentiment regarding crucial economic indicators within the global shipping sector.
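The abstract does not state the exact econometric specification; the following hedged sketch illustrates one standard way to test such asymmetries, decomposing oil-price changes into positive and negative partial sums (as in NARDL-type models) and regressing returns on them alongside exchange-rate and interest-rate proxies. All variable names and the synthetic data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
d_oil = rng.normal(0, 1, T)                    # daily oil-price changes
oil_pos = np.cumsum(np.maximum(d_oil, 0))      # partial sum of increases
oil_neg = np.cumsum(np.minimum(d_oil, 0))      # partial sum of decreases
fx = rng.normal(0, 1, T)                       # exchange-rate proxy
rate = rng.normal(0, 1, T)                     # interest-rate proxy
# Synthetic shipping returns with deliberately asymmetric oil effects.
ret = (0.05 * oil_pos - 0.01 * oil_neg + 0.03 * fx - 0.02 * rate
       + rng.normal(0, 1, T))

X = np.column_stack([np.ones(T), oil_pos, oil_neg, fx, rate])
beta, *_ = np.linalg.lstsq(X, ret, rcond=None)  # OLS estimates
print(dict(zip(["const", "oil+", "oil-", "fx", "rate"], beta.round(3))))
# Unequal coefficients on oil+ vs. oil- indicate an asymmetric response.
```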
An Autoregressive-Based Kalman Filter Approach for Daily PM2.5 Concentration Forecasting in Beijing, China
Xinyue Zhang, Chen Ding, Guizhi Wang
Pub Date: 2024-02-01; Epub Date: 2023-05-03; DOI: 10.1089/big.2022.0082
With the acceleration of urbanization, air pollution, especially PM2.5, has seriously affected human health and reduced people's quality of life. Accurate PM2.5 prediction is significant for environmental protection authorities to take action and develop prevention countermeasures. In this article, an adapted Kalman filter (KF) approach is presented to address the nonlinearity and stochastic uncertainty of time series from which the autoregressive integrated moving average (ARIMA) model suffers. To further improve the accuracy of PM2.5 forecasting, a hybrid model is proposed by introducing an autoregressive (AR) model, where the AR part is used to determine the state-space equation, whereas the KF part is used for state estimation on the PM2.5 concentration series. A modified artificial neural network (ANN), called AR-ANN, is introduced for comparison with the AR-KF model. According to the results, the AR-KF model outperforms the AR-ANN model and the original ARIMA model in prediction accuracy; the AR-ANN obtains a mean absolute error and root mean square error of 10.85 and 15.45, respectively, whereas the ARIMA yields 30.58 and 29.39 on the corresponding metrics. These results indicate that the presented AR-KF model can be adopted for air pollutant concentration prediction.
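The AR-KF construction admits a compact worked example: the AR coefficients define the state-transition matrix of a linear state-space model, and the standard Kalman recursions estimate the latent concentration. The AR order, coefficients, and noise levels below are illustrative assumptions, not the paper's fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.array([0.6, 0.3])                     # assumed AR(2) coefficients
F = np.array([[phi[0], phi[1]], [1.0, 0.0]])   # state transition from AR(2)
H = np.array([[1.0, 0.0]])                     # observe the current value only
Q = np.diag([0.5, 0.0])                        # process-noise covariance
R = np.array([[2.0]])                          # observation-noise covariance

# Synthetic PM2.5-like series from the same AR(2), observed with noise.
y = np.zeros(200)
for t in range(2, 200):
    y[t] = phi[0] * y[t-1] + phi[1] * y[t-2] + rng.normal(0, 0.7)
obs = y + rng.normal(0, 1.4, size=y.shape)

x, P = np.zeros(2), np.eye(2)                  # initial state and covariance
est = []
for z in obs:
    x, P = F @ x, F @ P @ F.T + Q              # predict
    S = H @ P @ H.T + R                        # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    x = x + (K @ (z - H @ x)).ravel()          # update state
    P = (np.eye(2) - K @ H) @ P                # update covariance
    est.append(x[0])

rmse = np.sqrt(np.mean((np.array(est) - y) ** 2))
print(f"filter RMSE vs. clean series: {rmse:.2f}")
```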
Long- and Short-Term Memory Model of Cotton Price Index Volatility Risk Based on Explainable Artificial Intelligence
Huosong Xia, Xiaoyu Hou, Justin Zuopeng Zhang
Pub Date: 2024-02-01; Epub Date: 2023-11-17; DOI: 10.1089/big.2022.0287
Market uncertainty greatly interferes with the decisions and plans of market participants, increasing decision-making risk and compromising the interests of decision-makers. Cotton price index (hereinafter referred to as cotton price) volatility is highly noisy, nonlinear, and stochastic, and it is susceptible to supply and demand, climate, substitutes, and other policy factors, all subject to large uncertainties. To reduce decision risk and provide decision support for policymakers, this article integrates 13 factors affecting cotton price index volatility based on existing research and further divides them into transaction data and interaction data. A long short-term memory (LSTM) model is constructed, and a comparison experiment is implemented to analyze cotton price index volatility. To make the constructed model explainable, explainable artificial intelligence (XAI) techniques are used to perform statistical analysis of the input features. The experimental results show that the LSTM model can accurately analyze the cotton price index fluctuation trend but cannot accurately predict the actual price of cotton, and that transaction data plus interaction data are more sensitive than transaction data alone in analyzing the cotton price fluctuation trend, with a positive effect on the fluctuation analysis. This study can accurately reflect the fluctuation trend of the cotton market; provide a reference to the state, enterprises, and cotton farmers for decision-making; and reduce the risk caused by frequent fluctuations of cotton prices. The analysis of the model using XAI techniques builds decision-makers' confidence in the model.
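As a sketch of the modeling setup, the multivariate series of 13 factors can be windowed into fixed-length sequences and fed to a small LSTM regressor. The window length, layer sizes, and synthetic data below are illustrative assumptions, not the paper's exact configuration; tf.keras stands in for whatever toolkit the authors used.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(3)
series = rng.normal(size=(1000, 13))           # 13 volatility factors per day
target = series @ rng.normal(size=13) + rng.normal(0, 0.1, 1000)

WIN = 30                                       # look-back window (assumed)
X = np.stack([series[t - WIN:t] for t in range(WIN, 1000)])
y = target[WIN:1000]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WIN, 13)),    # sequence of factor vectors
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),                  # next-step index value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
print("training MSE:", model.evaluate(X, y, verbose=0))
```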
Gaussian Adapted Markov Model with Overhauled Fluctuation Analysis-Based Big Data Streaming Model in Cloud
M Ananthi, Annapoorani Gopal, K Ramalakshmi, P Mohan Kumar
Pub Date: 2024-02-01; Epub Date: 2023-10-30; DOI: 10.1089/big.2023.0035
Accurate resource-usage prediction in big data streaming applications remains a complex process. In existing works, various resource scaling techniques have been developed for forecasting resource usage in big data streaming systems. However, baseline streaming mechanisms are limited by inefficient resource scaling, inaccurate forecasting, high latency, and long running times. Therefore, this work develops a new framework, named Gaussian adapted Markov model (GAMM)-overhauled fluctuation analysis (OFA), for efficient big data streaming in cloud systems. The purpose of this work is to efficiently manage time-bounded big data streaming applications with a reduced error rate. In this study, a gating strategy is also used to extract the set of features for obtaining a nonlinear distribution of data and a fast-convergence solution, which are used to perform the fluctuation analysis. Moreover, a layered architecture is developed to simplify resource forecasting in streaming applications. During experimentation, the results of the proposed stream model GAMM-OFA are validated and compared using different measures.
Impact of Cooperative Innovation on the Technological Innovation Performance of High-Tech Firms: A Dual Moderating Effect Model of Big Data Capabilities and Policy Support
Xianglong Li, Qingjin Wang, Renbo Shi, Xueling Wang, Kaiyun Zhang, Xiao Liu
Pub Date: 2024-02-01; DOI: 10.1089/big.2022.0301
The mechanism of cooperative innovation (CI) for high-tech firms aims to improve their technological innovation performance. It effectively integrates the internal and external innovation resources of these firms while reducing the uncertainty of technological innovation and maintaining the firms' comparative advantage in competition. This study used as its sample 322 high-tech firms located in 33 national innovation demonstration bases identified by the Chinese government. We implemented a multiple linear regression to test the impact of CI conducted by these high-tech firms on their technological innovation performance. In addition, the study further examined the moderating effect of two boundary conditions, big data capabilities and policy support (PS), on the main hypotheses. Our study found that high-tech firms carrying out CI can effectively improve their technological innovation performance, with big data capabilities and PS significantly enhancing the degree of this influence. The study reveals the intrinsic mechanism of the impact of CI on the technological innovation performance of high-tech firms, which, to a certain extent, expands the application context of CI and enriches the research perspective on the impact of CI on firms' innovation performance. At the same time, the findings provide insight into how high-tech firms in the digital era can make reasonable use of data empowerment in the process of CI to achieve improved technological innovation performance.
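The dual-moderation test can be illustrated as a regression with interaction terms: the coefficients on the CI-by-moderator products capture the moderating effects. The variable names and synthetic firm data below are assumptions; only the sample size (322) comes from the study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 322                                        # sample size from the study
df = pd.DataFrame({
    "ci": rng.normal(size=n),                  # cooperative innovation
    "bdc": rng.normal(size=n),                 # big data capabilities
    "ps": rng.normal(size=n),                  # policy support
})
# Synthetic performance with built-in moderation effects.
df["perf"] = (0.4 * df.ci + 0.2 * df.ci * df.bdc + 0.15 * df.ci * df.ps
              + rng.normal(0, 1, n))

# ci:bdc and ci:ps are the interaction (moderation) terms.
model = smf.ols("perf ~ ci + bdc + ps + ci:bdc + ci:ps", data=df).fit()
print(model.params.round(3))
```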
Automated Natural Language Processing-Based Supplier Discovery for Financial Services
Mauro Papa, Ioannis Chatzigiannakis, Aris Anagnostopoulos
Pub Date: 2024-02-01; Epub Date: 2023-07-07; DOI: 10.1089/big.2022.0215
Public procurement is viewed as a major market force that can be used to promote innovation and drive the growth of small and medium-sized enterprises. In such cases, procurement system design relies on intermediaries that provide vertical linkages between suppliers and providers of innovative services and products. In this work we propose an innovative methodology for decision support in the process of supplier discovery, which precedes the final supplier selection. We focus on data gathered from community-based sources such as Reddit and Wikidata and avoid any use of historical open procurement datasets, in order to identify small and medium-sized suppliers of innovative products and services that hold very small market shares. We examine a real-world procurement case study from the financial sector focusing on the Financial and Market Data offering and develop an interactive web-based support tool to address specific requirements of the Italian central bank. We demonstrate how a suitable selection of natural language processing models, such as a part-of-speech tagger and a word-embedding model, in combination with a novel named-entity-disambiguation algorithm, can efficiently analyze huge quantities of textual data, increasing the probability of full coverage of the market.
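The core of such a pipeline can be sketched as candidate-mention extraction from community text followed by embedding-based ranking against a seed description of the needed service. The sketch below uses spaCy's en_core_web_md model and NER-based organization mentions as assumed stand-ins for the paper's part-of-speech tagger, word-embedding model, and named-entity-disambiguation algorithm; the posts and company names are invented.

```python
import spacy

nlp = spacy.load("en_core_web_md")             # md model ships static vectors

posts = [
    "Acme Analytics sells a solid market data feed for small funds.",
    "I switched from a spreadsheet to DataBridge for bond pricing data.",
]
seed = nlp("provider of financial and market data services")

candidates = {}
for doc in (nlp(p) for p in posts):
    for ent in doc.ents:                       # candidate supplier mentions
        if ent.label_ == "ORG":
            # Rank by vector similarity to the seed service description.
            candidates[ent.text] = ent.similarity(seed)

for name, score in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```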
A MapReduce-Based Approach for Fast Connected Components Detection from Large-Scale Networks
Sajid Yousuf Bhat, Muhammad Abulaish
Pub Date: 2024-01-29; DOI: 10.1089/big.2022.0264
Owing to the increasing size of real-world networks, processing them using classical techniques has become infeasible. The amount of storage and central processing unit time required for processing large networks is far beyond the capabilities of a high-end computing machine. Moreover, real-world network data are generally distributed in nature because they are collected and stored on distributed platforms. This has popularized the use of MapReduce, a distributed data processing framework, for analyzing real-world network data. Existing MapReduce-based methods for connected components detection mainly struggle to minimize the number of MapReduce rounds and the amount of data generated and forwarded to the subsequent rounds. This article presents an efficient MapReduce-based approach for finding connected components that does not forward the complete set of connected components to the subsequent rounds; instead, it writes them to the Hadoop Distributed File System as soon as they are found, reducing the amount of data forwarded. It also presents an application of the proposed method in contact tracing. The proposed method is evaluated on several network data sets and compared with two state-of-the-art methods. The empirical results reveal that the proposed method performs significantly better and scales to find connected components in large-scale networks.
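The round structure of MapReduce-based connected-components detection can be illustrated with minimum-label propagation: in each round, every edge emits the smaller endpoint label, and every node keeps the minimum label it receives, until labels stabilize. The single-process simulation below is an illustrative sketch of that baseline; the paper's key optimization of writing finished components to HDFS early is not reproduced here.

```python
from collections import defaultdict

edges = [(1, 2), (2, 3), (4, 5), (6, 6)]       # toy undirected graph

# Initialize every node's label to itself.
label = {n: n for e in edges for n in e}

changed = True
while changed:                                  # one loop = one MapReduce round
    changed = False
    # "Map": each edge emits the smaller endpoint label to both endpoints.
    msgs = defaultdict(list)
    for u, v in edges:
        m = min(label[u], label[v])
        msgs[u].append(m)
        msgs[v].append(m)
    # "Reduce": each node keeps the minimum label it received.
    for n, received in msgs.items():
        m = min(received)
        if m < label[n]:
            label[n] = m
            changed = True

components = defaultdict(set)
for n, root in label.items():
    components[root].add(n)
print(dict(components))                        # {1: {1, 2, 3}, 4: {4, 5}, 6: {6}}
```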
Modeling of Machine Learning-Based Extreme Value Theory in Stock Investment Risk Prediction: A Systematic Literature Review
Melina Melina, Sukono, Herlina Napitupulu, Norizan Mohamed
Pub Date: 2024-01-17; DOI: 10.1089/big.2023.0004
The stock market is heavily influenced by global sentiment, which is full of uncertainty and is characterized by extreme values and by linear and nonlinear variables. High-frequency data generally refer to data collected at a very fast rate, at the level of days, hours, minutes, or even seconds. Stock prices fluctuate rapidly, and even to extremes, along with changes in the variables that affect stock fluctuations. Research on investment risk estimation in the stock market that can identify extreme values, handle nonlinearity, remain reliable in multivariate cases, and use high-frequency data is therefore very important. The extreme value theory (EVT) approach can detect extreme values; it is reliable in univariate cases but very complicated in multivariate cases. The purpose of this research was to collect, characterize, and analyze the investment risk estimation literature to identify research gaps. The literature was selected by applying the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and sourced from the Sciencedirect.com and Scopus databases. The search produced 1107 articles at the identification stage, reduced to 236 at the eligibility stage and 90 articles in the included-studies set. The bibliometric networks were visualized using the VOSviewer software, with "VaR" as the main search keyword. The visualization showed that EVT, Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models, and historical simulation are the models most often used to estimate investment risk, whereas application of machine learning (ML)-based investment risk estimation models is low. There has been no research using a combination of EVT and ML to estimate investment risk. The results showed that hybrid models produce better Value-at-Risk (VaR) accuracy under uncertainty and nonlinear conditions. Generally, models use only daily return data as input. Based on these research gaps, a hybrid model framework for estimating risk measures is proposed that combines EVT and ML and uses multivariable, high-frequency data to identify extreme values in the data distribution. The goal is to produce an accurate and flexible estimated risk value against extreme changes and shocks in the stock market. Mathematics Subject Classification: 60G25; 62M20; 6245; 62P05; 91G70.
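For the EVT side of the review, the peaks-over-threshold VaR estimate admits a short worked example: fit a generalized Pareto distribution (GPD) to losses above a high threshold u and plug the fit into the standard POT formula VaR_q = u + (sigma/xi) * (((n/N_u) * (1 - q))^(-xi) - 1). The threshold choice and synthetic heavy-tailed returns below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(5)
losses = -rng.standard_t(df=4, size=5000) * 0.01   # heavy-tailed daily losses

u = np.quantile(losses, 0.95)                      # threshold (assumed 95th pct)
exceed = losses[losses > u] - u                    # excesses over the threshold
xi, _, sigma = genpareto.fit(exceed, floc=0)       # GPD shape and scale

q = 0.99                                           # VaR confidence level
n, n_u = len(losses), len(exceed)
var_q = u + sigma / xi * ((n / n_u * (1 - q)) ** (-xi) - 1)
print(f"99% VaR (loss scale): {var_q:.4f}")
```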