Big Data: Latest Articles

Cloud-Based Advanced Shuffled Frog Leaping Algorithm for Tasks Scheduling.
IF 2.6; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-04-01; Epub Date: 2023-03-03; Pages: 110-126; DOI: 10.1089/big.2022.0095
Dipesh Kumar, Nirupama Mandal, Yugal Kumar

In recent years, online activity has grown steadily, and the volume of data held on cloud servers has grown exponentially with it. Although a variety of cloud-based systems have been developed to enhance the user experience, the worldwide increase in online activity has also increased the data load on those systems. Task scheduling has therefore become very important for maintaining the efficiency and performance of applications hosted on cloud servers. Task scheduling assigns incoming tasks to virtual machines (VMs) according to some algorithm, with the aim of reducing the makespan and the average cost, and many researchers have proposed scheduling algorithms for the cloud computing environment. This article proposes an advanced form of the shuffled frog optimization algorithm, which is based on the behavior of frogs searching for food. The authors introduce a new algorithm that shuffles the positions of frogs within a memeplex to obtain the best result. Using this optimization technique, the cost function of the central processing unit, the makespan, and the fitness function were calculated. The fitness function is the sum of the budget cost function and the makespan. The proposed method reduces both the makespan and the average cost by scheduling tasks to VMs effectively. Finally, the performance of the proposed advanced shuffled frog optimization method is compared with existing task scheduling methods such as the whale optimization-based scheduler (W-Scheduler), sliced particle swarm optimization (SPSO-SA), the inverted ant colony optimization algorithm, and static learning particle swarm optimization (SLPSO-SA) in terms of average cost and makespan. The experiments showed that the proposed algorithm schedules tasks to VMs more effectively than the other scheduling methods, with a makespan of 6, an average cost of 4, and a fitness of 10.
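
The fitness definition in this abstract (budget cost plus makespan) can be illustrated with a minimal Python sketch. The cost model is an assumption: each VM is billed in proportion to its busy time. The names `evaluate_schedule`, `vm_speeds`, and `vm_cost_rates` are hypothetical and do not come from the article. The frog-leaping search itself is not shown; the sketch only scores one candidate assignment.

```python
def evaluate_schedule(task_lengths, vm_speeds, vm_cost_rates, assignment):
    """Score one task-to-VM assignment.

    assignment[i] is the index of the VM that runs task i.
    Returns (makespan, cost, fitness) with fitness = cost + makespan.
    """
    busy_time = [0.0] * len(vm_speeds)
    for task, vm in enumerate(assignment):
        # Assumed execution-time model: task length divided by VM speed.
        busy_time[vm] += task_lengths[task] / vm_speeds[vm]

    # Makespan: the time at which the most heavily loaded VM finishes.
    makespan = max(busy_time)
    # Assumed cost model: each VM charges its rate per unit of busy time.
    cost = sum(t * rate for t, rate in zip(busy_time, vm_cost_rates))
    return makespan, cost, cost + makespan


if __name__ == "__main__":
    result = evaluate_schedule(
        task_lengths=[4, 2, 6],
        vm_speeds=[2, 1],
        vm_cost_rates=[1.0, 0.5],
        assignment=[0, 1, 0],
    )
    print(result)  # (5.0, 6.0, 11.0)
```

A metaheuristic such as the one described would call a function like this on each candidate assignment and keep the one with the lowest fitness. The reported figures (makespan 6, average cost 4, fitness 10) are consistent with this additive form.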

Citations: 0
An Autoregressive-Based Kalman Filter Approach for Daily PM2.5 Concentration Forecasting in Beijing, China.
IF 2.6; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-02-01; Epub Date: 2023-05-03; Pages: 19-29; DOI: 10.1089/big.2022.0082
Xinyue Zhang, Chen Ding, Guizhi Wang

With the acceleration of urbanization, air pollution, especially PM2.5, has seriously affected human health and reduced quality of life. Accurate PM2.5 prediction is important for environmental protection authorities when taking action and developing prevention countermeasures. In this article, an adapted Kalman filter (KF) approach is presented to remove the nonlinearity and stochastic uncertainty of time series from which the autoregressive integrated moving average (ARIMA) model suffers. To further improve the accuracy of PM2.5 forecasting, a hybrid model is proposed by introducing an autoregressive (AR) model: the AR part determines the state-space equation, whereas the KF part performs state estimation on the PM2.5 concentration series. A modified artificial neural network (ANN), called AR-ANN, is introduced for comparison with the AR-KF model. According to the results, the AR-KF model outperforms the AR-ANN model and the original ARIMA model in prediction accuracy; that is, the AR-ANN obtains a mean absolute error of 10.85 and a root mean square error of 15.45, whereas the ARIMA obtains 30.58 and 29.39 on the corresponding metrics. The results therefore indicate that the presented AR-KF model can be adopted for air pollutant concentration prediction.
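
The division of labor described here (an AR model supplies the state equation, a Kalman filter estimates the state) can be sketched in Python for the simplest case. This is a minimal illustration under stated assumptions, not the authors' model: it uses a scalar AR(1) fit by least squares, and the noise variances `process_var` and `obs_var` are hypothetical tuning parameters. The two error metrics named in the abstract are included.

```python
import math


def fit_ar1(series):
    """Return (mean, phi): least-squares AR(1) fit on the mean-centred series."""
    mean = sum(series) / len(series)
    x = [v - mean for v in series]
    num = sum(a * b for a, b in zip(x[1:], x[:-1]))
    den = sum(a * a for a in x[:-1])
    return mean, (num / den if den else 0.0)  # phi = 0 for a constant series


def ar1_kalman_forecast(series, process_var=1.0, obs_var=1.0):
    """One-step-ahead forecasts from a scalar Kalman filter.

    State equation (from the AR part): x_t = phi * x_{t-1} + w_t
    Observation equation:              y_t = mean + x_t + v_t
    """
    mean, phi = fit_ar1(series)
    state, var = 0.0, 1.0  # assumed initial state estimate and variance
    forecasts = []
    for y in series:
        # Predict step: propagate the state through the AR(1) equation.
        pred = phi * state
        pred_var = phi * phi * var + process_var
        forecasts.append(mean + pred)
        # Update step: correct the prediction with the new observation.
        gain = pred_var / (pred_var + obs_var)
        state = pred + gain * ((y - mean) - pred)
        var = (1.0 - gain) * pred_var
    return forecasts


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

Each forecast is made before the corresponding observation is used, so `mae(series, ar1_kalman_forecast(series))` measures one-step-ahead error. A higher-order AR model would replace the scalar state with a vector and the scalar gain with a matrix.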

Citations: 0
Long- and Short-Term Memory Model of Cotton Price Index Volatility Risk Based on Explainable Artificial Intelligence.
IF 2.6; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-02-01; Epub Date: 2023-11-17; Pages: 49-62; DOI: 10.1089/big.2022.0287
Huosong Xia, Xiaoyu Hou, Justin Zuopeng Zhang

Market uncertainty greatly interferes with the decisions and plans of market participants, increasing the risk of decision-making and compromising the interests of decision-makers. Cotton price index (hereinafter referred to as cotton price) volatility is highly noisy, nonlinear, and stochastic, and it is susceptible to supply and demand, climate, substitutes, and other policy factors, all of which are subject to large uncertainties. To reduce decision risk and provide decision support for policymakers, this article integrates 13 factors affecting cotton price index volatility based on existing research and divides them into transaction data and interaction data. A long- and short-term memory (LSTM) model is constructed, and a comparison experiment is carried out to analyze cotton price index volatility. To make the constructed model explainable, explainable artificial intelligence (XAI) techniques are used to perform statistical analysis of the input features. The experimental results show that the LSTM model can accurately analyze the fluctuation trend of the cotton price index but cannot accurately predict the actual price of cotton. Transaction data combined with interaction data are more sensitive than transaction data alone in analyzing the cotton price fluctuation trend and have a positive effect on the analysis. This study can accurately reflect the fluctuation trend of the cotton market, provide a reference for decision-making by the state, enterprises, and cotton farmers, and reduce the risk caused by frequent fluctuation of cotton prices. The analysis of the model using XAI techniques builds the confidence of decision-makers in the model.

Citations: 0
Gaussian Adapted Markov Model with Overhauled Fluctuation Analysis-Based Big Data Streaming Model in Cloud.
IF 2.6; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-02-01; Epub Date: 2023-10-30; Pages: 1-18; DOI: 10.1089/big.2023.0035
M Ananthi, Annapoorani Gopal, K Ramalakshmi, P Mohan Kumar

Accurate prediction of resource usage in big data streaming applications remains a complex task. Existing works have developed various resource scaling techniques for forecasting resource usage in big data streaming systems. However, the baseline streaming mechanisms are limited by inefficient resource scaling, inaccurate forecasting, high latency, and long running times. The proposed work therefore aims to develop a new framework, named Gaussian adapted Markov model (GAMM)-overhauled fluctuation analysis (OFA), for efficient big data streaming in cloud systems. The purpose of this work is to manage time-bounded big data streaming applications efficiently and with a reduced error rate. In this study, a gating strategy is also used to extract the set of features needed to obtain the nonlinear distribution of the data and a fast convergence solution, which are then used to perform the fluctuation analysis. Moreover, a layered architecture is developed to simplify resource forecasting in streaming applications. In the experiments, the results of the proposed stream model GAMM-OFA are validated and compared using different measures.

Citations: 0
Acknowledgment of Reviewers 2023.
IF 4.6; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-02-01; Epub Date: 2023-12-19; Vol. 12(1); Pages: 81-82; DOI: 10.1089/big.2023.29063.ack
Citations: 0
Automated Natural Language Processing-Based Supplier Discovery for Financial Services.
IF 2.6; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-02-01; Epub Date: 2023-07-07; Pages: 30-48; DOI: 10.1089/big.2022.0215
Mauro Papa, Ioannis Chatzigiannakis, Aris Anagnostopoulos

Public procurement is viewed as a major market force that can be used to promote innovation and drive the growth of small and medium-sized enterprises. In such cases, procurement system design relies on intermediaries that provide vertical linkages between suppliers and providers of innovative services and products. In this work we propose an innovative methodology for decision support in the process of supplier discovery, which precedes the final supplier selection. We focus on data gathered from community-based sources such as Reddit and Wikidata, and we avoid any use of historical open procurement datasets, in order to identify small and medium-sized suppliers of innovative products and services that hold very little market share. We examine a real-world procurement case study from the financial sector, focusing on the Financial and Market Data offering, and develop an interactive web-based support tool to address certain requirements of the Italian central bank. We demonstrate how a suitable selection of natural language processing models, such as a part-of-speech tagger and a word-embedding model, in combination with a novel named-entity-disambiguation algorithm, can efficiently analyze huge quantities of textual data, increasing the probability of full coverage of the market.

Citations: 0
Impact of Cooperative Innovation on the Technological Innovation Performance of High-Tech Firms: A Dual Moderating Effect Model of Big Data Capabilities and Policy Support.
IF 2.6; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-02-01; Epub Date: 2023-09-14; Pages: 63-80; DOI: 10.1089/big.2022.0301
Xianglong Li, Qingjin Wang, Renbo Shi, Xueling Wang, Kaiyun Zhang, Xiao Liu

The mechanism of cooperative innovation (CI) for high-tech firms aims to improve their technological innovation performance. It effectively integrates the internal and external innovation resources of these firms while reducing the uncertainty of technological innovation and maintaining the firms' comparative advantage in competition. This study used a sample of 322 high-tech firms located in 33 national innovation demonstration bases identified by the Chinese government. We used multiple linear regression to test the impact of CI conducted by these high-tech firms on the level of their technological innovation performance. In addition, the study examined the moderating effect of two boundary conditions, big data capabilities and policy support (PS), on the main hypotheses. Our study found that high-tech firms carrying out CI can effectively improve their technological innovation performance, with big data capabilities and PS significantly strengthening this influence. The study reveals the intrinsic mechanism by which CI affects the technological innovation performance of high-tech firms, which, to a certain extent, expands the application context of CI and enriches the research perspective on the impact of CI on firms' innovation performance. At the same time, the findings provide insight into how high-tech firms in the digital era can make reasonable use of data empowerment in the process of CI to improve their technological innovation performance.
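
A dual moderating effect of this kind is commonly tested by adding interaction terms to the linear regression. The abstract does not give the authors' specification, so the equation below is a generic sketch. The symbols $\mathrm{TIP}$ (technological innovation performance) and $\mathrm{BDC}$ (big data capabilities) are abbreviations assumed here for illustration, and control variables are omitted.

```latex
\mathrm{TIP}_i = \beta_0 + \beta_1\,\mathrm{CI}_i + \beta_2\,\mathrm{BDC}_i + \beta_3\,\mathrm{PS}_i
               + \beta_4\,(\mathrm{CI}_i \times \mathrm{BDC}_i)
               + \beta_5\,(\mathrm{CI}_i \times \mathrm{PS}_i) + \varepsilon_i ,
\qquad
\frac{\partial\,\mathrm{TIP}_i}{\partial\,\mathrm{CI}_i} = \beta_1 + \beta_4\,\mathrm{BDC}_i + \beta_5\,\mathrm{PS}_i .
```

Under this form, a positive and significant $\beta_4$ or $\beta_5$ means the effect of CI on performance grows with the corresponding moderator, which is the pattern the abstract reports.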

Citations: 0
Large-Scale Estimation and Analysis of Web Users' Mood from Web Search Query and Mobile Sensor Data.
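IF 2.6; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-01-01; Epub Date: 2023-06-02; Pages: 191-209; DOI: 10.1089/big.2022.0211; Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11304759/pdf/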
Wataru Sasaki, Satoki Hamanaka, Satoko Miyahara, Kota Tsubouchi, Jin Nakazawa, Tadashi Okoshi

The ability to estimate the current mood states of web users has considerable potential for realizing user-centric opportune services in pervasive computing. However, it is difficult to determine which data type to use for such estimation and to collect the ground truth of such mood states. Therefore, we built a model to estimate mood states from search-query data in an easy-to-collect and non-invasive manner. We then built a second model to estimate mood states from mobile sensor data and used its output to supplement the ground-truth labels of the model that estimates mood from search queries. This novel two-step model building improved the performance of estimating the mood states of web users. Our system was also deployed in a commercial stack, and large-scale data analysis with more than 11 million users was conducted. We proposed a nationwide mood score, which aggregates the mood values of users across the country. It shows the daily and weekly rhythm of people's moods and explains the ups and downs of moods during the COVID-19 pandemic, which are inversely synchronized with the number of new COVID-19 cases. It detects big news that simultaneously affects the mood states of many users, even at fine-grained time resolution, on the order of hours. In addition, we identified a certain class of advertisements that indicated a clear tendency in the mood of the users who clicked them.

Citations: 0
Computational Efficient Approximations of the Concordance Probability in a Big Data Setting.
IF 2.6 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-01-01 Epub Date: 2023-06-07 DOI: 10.1089/big.2022.0107
Robin Van Oirbeek, Jolien Ponnet, Bart Baesens, Tim Verdonck

Performance measurement is an essential task once a statistical model is created. The area under the receiver operating characteristic curve (AUC) is the most popular measure for evaluating the quality of a binary classifier. In this case, the AUC is equal to the concordance probability, a frequently used measure to evaluate the discriminatory power of the model. Unlike the AUC, the concordance probability can also be extended to the situation with a continuous response variable. Due to the staggering size of data sets nowadays, determining this discriminatory measure requires a tremendous amount of costly computations and is hence immensely time consuming, particularly in the case of a continuous response variable. Therefore, we propose two estimation methods that calculate the concordance probability in a fast and accurate way and that can be applied to both the discrete and continuous settings. Extensive simulation studies show the excellent performance and fast computing times of both estimators. Finally, experiments on two real-life data sets confirm the conclusions of the artificial simulations.
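To make the quantity in this abstract concrete, the following minimal Python sketch computes the concordance probability exactly over all pairs, and also approximates it by sampling random pairs. The sampling shortcut is a generic Monte Carlo illustration and is not one of the two estimators proposed in the article, whose details the abstract does not give. The function names, the half-credit convention for tied predictions, and the toy data are assumptions made for illustration.

```python
import random
from itertools import combinations


def concordance_probability(y, scores):
    """Exact concordance over all pairs with different observed responses (O(n^2))."""
    concordant = 0.0
    comparable = 0
    for (yi, si), (yj, sj) in combinations(zip(y, scores), 2):
        if yi == yj:
            continue  # tied responses carry no ordering information
        comparable += 1
        if si == sj:
            concordant += 0.5  # assumed convention: tied predictions count as half
        elif (si > sj) == (yi > yj):
            concordant += 1.0  # prediction order agrees with response order
    return concordant / comparable if comparable else float("nan")


def sampled_concordance(y, scores, n_pairs=20000, seed=0):
    """Monte Carlo approximation: evaluate only n_pairs randomly drawn pairs."""
    rng = random.Random(seed)
    n = len(y)
    concordant = 0.0
    comparable = 0
    for _ in range(n_pairs):
        i, j = rng.randrange(n), rng.randrange(n)
        if y[i] == y[j]:
            continue  # also skips the case i == j
        comparable += 1
        if scores[i] == scores[j]:
            concordant += 0.5
        elif (scores[i] > scores[j]) == (y[i] > y[j]):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Binary response: the concordance probability coincides with the AUC.
print(concordance_probability([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# Continuous response: the same definition applies unchanged.
print(concordance_probability([1.0, 2.0, 3.0], [0.2, 0.1, 0.9]))
```

The exact version visits every pair, which is what becomes costly on large data sets; the sampled version trades a small amount of variance for a fixed number of pair evaluations.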

Citations: 0
Small Files Problem Resolution via Hierarchical Clustering Algorithm.
IF 2.6 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-01-01 Epub Date: 2023-05-16 DOI: 10.1089/big.2022.0181
Oded Koren, Aviel Shamalov, Nir Perel

The Small Files Problem in the Hadoop Distributed File System (HDFS) is an ongoing challenge that has not yet been solved. However, various approaches have been developed to tackle the obstacles this problem creates. Properly managing the size of blocks in a file system is essential, as it saves memory and computing time and may reduce bottlenecks. In this article, a new approach using a Hierarchical Clustering Algorithm is suggested for dealing with small files. The proposed method identifies the files by their structure and via a special dendrogram analysis, and then recommends which files can be merged. As a simulation, the proposed algorithm was applied to 100 CSV files with different structures, containing 2-4 columns with different data types (integer, decimal, and text). In addition, 20 files that were not CSV files were created to demonstrate that the algorithm only works on CSV files. All data were analyzed via a machine learning hierarchical clustering method, and a dendrogram was created. According to the merge process that was performed, seven files from the dendrogram analysis were chosen as appropriate files to be merged. This reduced the memory space used in HDFS. Furthermore, the results showed that using the suggested algorithm led to efficient file management.
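The idea of grouping files by structure and reading merge candidates off a cluster hierarchy can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the file names, the schema encoding (a tuple of column types per CSV file), the distance function, and the use of single linkage with a distance cutoff are all hypothetical choices.

```python
from itertools import combinations


def schema_distance(a, b):
    """Assumed distance: positions whose column types differ, plus the gap in column count."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def agglomerate(schemas, max_distance=0):
    """Single-linkage agglomerative clustering, stopped once the closest pair of
    clusters is farther apart than max_distance (the 'cut' through the dendrogram).
    Returns the final clusters and the merge history (distance, left, right)."""
    clusters = [[name] for name in sorted(schemas)]
    history = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(schema_distance(schemas[a], schemas[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break  # remaining clusters are too dissimilar to merge
        history.append((d, clusters[i], clusters[j]))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history


def merge_candidates(schemas, max_distance=0):
    """Clusters with more than one file are the recommended merge groups."""
    clusters, _ = agglomerate(schemas, max_distance)
    return sorted(c for c in clusters if len(c) > 1)


# Hypothetical small CSV files described by their column types.
schemas = {
    "a.csv": ("integer", "text"),
    "b.csv": ("integer", "text"),
    "c.csv": ("decimal", "text", "integer"),
    "d.csv": ("integer", "text"),
    "e.csv": ("decimal", "text", "integer"),
    "f.csv": ("text", "decimal", "integer", "text"),
}
print(merge_candidates(schemas))
# [['a.csv', 'b.csv', 'd.csv'], ['c.csv', 'e.csv']]
```

With `max_distance=0` only files with identical structure are grouped, which keeps a merged file's columns consistent; raising the cutoff moves the cut higher in the hierarchy and groups files with increasingly different structures.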

Citations: 0