Big Data Research最新文献

英文中文

Data-Efficient Performance Modeling for Configurable Big Data Frameworks by Reducing Information Overlap Between Training Examples 基于减少训练样本间信息重叠的可配置大数据框架数据高效性能建模

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-28 DOI: 10.1016/j.bdr.2022.100358

Zhiqiang Liu, Xuanhua Shi, Hai Jin

To support the various analysis application of big data, big data processing frameworks are designed to be highly configurable. However, for common users, it is difficult to tailor the configurable frameworks to achieve optimal performance for every application. Recently, many automatic tuning methods are proposed to configure these frameworks. In detail, these methods firstly build a performance prediction model through sampling configurations randomly and measuring the corresponding performance. Then, they conduct heuristic search in the configuration space based on the performance prediction model. For most frameworks, it is too expensive to build the performance model since it needs to measure the performance of large amounts of configurations, which cause too much overhead on data collection. In this paper, we propose a novel data-efficient method to build the performance model with little impact on prediction accuracy. Compared to the traditional methods, the proposed method can reduce the overhead of data collection because it can train the performance model with much less training examples. Specifically, the proposed method can actively sample the important examples according to the dynamic requirement of the performance model during the iterative model updating. Hence, it can make full use of the collected informative data and train the performance model with much less training examples. To sample the important training examples, we employ several virtual performance model to estimate the importance of all candidate configurations efficiently. Experimental results show that our method needs less training examples than traditional methods with little impact on prediction accuracy.

为支持大数据的各种分析应用，大数据处理框架具有高度可配置性。然而，对于普通用户来说，为每个应用程序定制可配置框架以实现最佳性能是很困难的。最近，人们提出了许多自动调优方法来配置这些框架。具体来说，这些方法首先通过随机抽样配置并测量相应的性能来构建性能预测模型。然后，基于性能预测模型在配置空间中进行启发式搜索。对于大多数框架，构建性能模型的成本太高，因为它需要度量大量配置的性能，这会导致数据收集的开销过大。在本文中，我们提出了一种新的数据高效的方法来建立对预测精度影响很小的性能模型。与传统方法相比，该方法可以用更少的训练样本来训练性能模型，从而减少了数据收集的开销。具体而言，该方法可以在迭代模型更新过程中，根据性能模型的动态要求，主动对重要样例进行采样。因此，它可以充分利用收集到的信息数据，用更少的训练样例训练性能模型。为了对重要的训练样本进行采样，我们使用了几个虚拟性能模型来有效地估计所有候选配置的重要性。实验结果表明，该方法所需的训练样本比传统方法少，对预测精度影响较小。

{"title":"Data-Efficient Performance Modeling for Configurable Big Data Frameworks by Reducing Information Overlap Between Training Examples","authors":"Zhiqiang Liu, Xuanhua Shi, Hai Jin","doi":"10.1016/j.bdr.2022.100358","DOIUrl":"https://doi.org/10.1016/j.bdr.2022.100358","url":null,"abstract":"<div><p><span>To support the various analysis application of big data<span>, big data processing<span> frameworks are designed to be highly configurable. However, for common users, it is difficult to tailor the configurable frameworks to achieve optimal performance for every application. Recently, many automatic tuning methods are proposed to configure these frameworks. In detail, these methods firstly build a performance prediction model through sampling configurations randomly and measuring the corresponding performance. Then, they conduct heuristic search in the </span></span></span>configuration space based on the performance prediction model. For most frameworks, it is too expensive to build the performance model since it needs to measure the performance of large amounts of configurations, which cause too much overhead on data collection. In this paper, we propose a novel data-efficient method to build the performance model with little impact on prediction accuracy. Compared to the traditional methods, the proposed method can reduce the overhead of data collection because it can train the performance model with much less training examples. Specifically, the proposed method can actively sample the important examples according to the dynamic requirement of the performance model during the iterative model updating. Hence, it can make full use of the collected informative data and train the performance model with much less training examples. To sample the important training examples, we employ several virtual performance model to estimate the importance of all candidate configurations efficiently. Experimental results show that our method needs less training examples than traditional methods with little impact on prediction accuracy.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":null,"pages":null},"PeriodicalIF":3.3,"publicationDate":"2022-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91599168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Detecting Seasonal Dependencies in Production Lines for Forecast Optimization 为预测优化检测生产线的季节相关性

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-28 DOI: 10.1016/j.bdr.2022.100335

Gerold Hoelzl , Sebastian Soller , Matthias Kranz

Huge amounts of data are produced inside an industrial production plant every minute. This data is getting more accessible by higher network and computing capabilities. This poses an opportunity to apply methods in real time to support the reliability of production machines. In theory every time series, that is currently monitored by for a breach of thresholds, can be extended with a forecast method. Classical approaches, such as ARIMA and Exponential Smoothing can be used for forecasting. To describe the signal and boost the forecast results we use a clustering method to group each unknown data stream in a seasonality class. This seasonality classes can be used for insight into intra and inter group behaviour between machines and add causality to factory wide correlations. We collected 10000 multiple day segments of multiple identical and different machines. We manually hand labelled the data segments for their seasonality pattern to compare and explain the clustering results. Classes, obtained through clustering, are used to adapt each single forecast model for every machine. For the forecast method we could show improved results by selecting the correct seasonality for each data stream.

工业生产工厂每分钟都会产生大量的数据。通过更高的网络和计算能力，这些数据变得更容易访问。这为实时应用方法来支持生产机器的可靠性提供了机会。理论上，每一个时间序列，目前监测的突破阈值，可以扩展与预测方法。经典的方法，如ARIMA和指数平滑可以用于预测。为了描述信号并增强预测结果，我们使用聚类方法将每个未知数据流分组在季节性类中。这种季节性类可以用于洞察机器之间的组内和组间行为，并将因果关系添加到工厂范围的相关性中。我们收集了多个相同和不同机器的10000多个日段。我们手动标记数据段的季节性模式，以比较和解释聚类结果。通过聚类得到的类用于适应每台机器的单个预测模型。对于预测方法，我们可以通过为每个数据流选择正确的季节性来显示改进的结果。

{"title":"Detecting Seasonal Dependencies in Production Lines for Forecast Optimization","authors":"Gerold Hoelzl , Sebastian Soller , Matthias Kranz","doi":"10.1016/j.bdr.2022.100335","DOIUrl":"10.1016/j.bdr.2022.100335","url":null,"abstract":"<div><p>Huge amounts of data are produced inside an industrial production plant every minute. This data is getting more accessible by higher network and computing capabilities. This poses an opportunity to apply methods in real time to support the reliability of production machines. In theory every time series, that is currently monitored by for a breach of thresholds, can be extended with a forecast method. Classical approaches, such as ARIMA and Exponential Smoothing can be used for forecasting. To describe the signal and boost the forecast results we use a clustering method to group each unknown data stream in a seasonality class. This seasonality classes can be used for insight into intra and inter group behaviour between machines and add causality to factory wide correlations. We collected 10000 multiple day segments of multiple identical and different machines. We manually hand labelled the data segments for their seasonality pattern to compare and explain the clustering results. Classes, obtained through clustering, are used to adapt each single forecast model for every machine. For the forecast method we could show improved results by selecting the correct seasonality for each data stream.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":null,"pages":null},"PeriodicalIF":3.3,"publicationDate":"2022-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86009235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Neural Topic Modeling with Deep Mutual Information Estimation 基于深度互信息估计的神经主题建模

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-28 DOI: 10.1016/j.bdr.2022.100344

Kang Xu , Xiaoqiu Lu , Yuan-fang Li , Tongtong Wu , Guilin Qi , Ning Ye , Dong Wang , Zheng Zhou

The emerging neural topic models make topic modeling more easily adaptable and extendable in unsupervised text mining. However, the existing neural topic models are difficult to retain representative information of the documents within the learnt topic representation. Fortunately, Deep Mutual Information Estimation (DMIE), which maximizes the mutual information between input data and the hidden representations to learn a good representation of the input data. DMIE provides a new paradigm for neural topic modeling. In this paper, we propose a neural topic model which incorporates deep mutual information estimation, i.e., Neural Topic Modeling with Deep Mutual Information Estimation (NTM-DMIE). NTM-DMIE is a neural network method for topic learning which maximizes the mutual information between the input documents and their latent topic representation. To learn robust topic representation, we incorporate the discriminator to discriminate negative examples and positive examples via adversarial learning. Moreover, we use both global and local mutual information to preserve the rich information of the input documents in the topic representation. We evaluate NTM-DMIE on several metrics, including accuracy of text clustering, with topic representation, topic uniqueness and topic coherence. Compared to the existing methods, the experimental results show that NTM-DMIE can outperform in all the metrics on the four datasets.

神经主题模型的出现使主题建模在无监督文本挖掘中更容易适应和扩展。然而，现有的神经主题模型很难在学习到的主题表示中保留文档的代表性信息。幸运的是，深度互信息估计(DMIE)最大化了输入数据和隐藏表示之间的互信息，以学习输入数据的良好表示。DMIE为神经主题建模提供了一种新的范式。本文提出了一种融合深度互信息估计的神经主题模型，即NTM-DMIE (neural topic Modeling with deep mutual information estimation)。NTM-DMIE是一种用于主题学习的神经网络方法，它最大限度地提高了输入文档与其潜在主题表示之间的互信息。为了学习稳健的主题表示，我们结合了判别器，通过对抗性学习来区分消极例子和积极例子。此外，我们使用全局互信息和局部互信息来保留主题表示中输入文档的丰富信息。我们从几个指标来评估NTM-DMIE，包括文本聚类的准确性、主题表示、主题唯一性和主题一致性。实验结果表明，与现有方法相比，NTM-DMIE在四种数据集上的所有指标都优于现有方法。

{"title":"Neural Topic Modeling with Deep Mutual Information Estimation","authors":"Kang Xu , Xiaoqiu Lu , Yuan-fang Li , Tongtong Wu , Guilin Qi , Ning Ye , Dong Wang , Zheng Zhou","doi":"10.1016/j.bdr.2022.100344","DOIUrl":"https://doi.org/10.1016/j.bdr.2022.100344","url":null,"abstract":"<div><p><span><span>The emerging neural topic models make topic modeling more easily adaptable and extendable in unsupervised text mining. However, the existing neural topic models are difficult to retain representative information of the documents within the learnt topic representation. Fortunately, Deep Mutual Information Estimation (DMIE), which maximizes the mutual information between input data and the hidden representations to learn a good representation of the input data. DMIE provides a new paradigm for neural topic modeling. In this paper, we propose a neural topic model which incorporates deep mutual information estimation, i.e., Neural Topic Modeling with Deep Mutual Information Estimation (NTM-DMIE). NTM-DMIE is a neural network method for topic learning which maximizes the mutual information between the input documents and their latent topic representation. To learn robust topic representation, we incorporate the </span>discriminator to discriminate negative examples and positive examples via adversarial learning. Moreover, we use both global and local mutual information to preserve the rich information of the input documents in the topic representation. We evaluate NTM-DMIE on several metrics, including accuracy of </span>text clustering, with topic representation, topic uniqueness and topic coherence. Compared to the existing methods, the experimental results show that NTM-DMIE can outperform in all the metrics on the four datasets.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":null,"pages":null},"PeriodicalIF":3.3,"publicationDate":"2022-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91599232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Special Issue on Real-Time Intelligent Systems 实时智能系统特刊

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-28 DOI: 10.1016/j.bdr.2022.100349

Richard Chbeir, Yannis Manolopoulos, Jolanta Mizera-Pietraszko, Spyros Sioutas

引用次数: 0

A Multi-Objective Clustering for Better Data Management in Connected Environment 面向互联环境下数据管理的多目标聚类

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-28 DOI: 10.1016/j.bdr.2022.100347

Sabri Allani , Richard Chbeir , Khouloud Salameh , Elio Mansour , Philippe Arnould

Over the past decade, the rapid increase in connected devices has enabled the emergence of new digital ecosystems to provide new opportunities for monitoring and managing systems to optimize overall performance. With these connected environments, data collection and management become increasingly challenging. A significant number of works in the literature have addressed data collection and management based on different contexts (e.g., mobile ad hoc, Peer-2-Peer, and IoT networks). Today, a wired network uses all of these protocols simultaneously, thus highlighting the need to build a standard data collection and management framework that considers all potential user preferences. For this purpose, multi-objective clustering has been utilized as a promising solution to ensure the stability of connected devices during the collection and management of data. In this paper, we introduce a new multi-objective clustering (MOC) technique based on various criteria for cluster construction and head selection in connected environments. More precisely, the proposed solution is based hypergraphs to represent the connected environment and clusters according to similarities between heterogeneous devices. Then, a cross-sectional hypergraph algorithm is applied to select the cluster heads. Experiments conducted show that our solution outperforms the pioneering literature methods in terms of performance and effectiveness.

在过去的十年中，连接设备的快速增长使得新的数字生态系统的出现为监测和管理系统提供了新的机会，以优化整体性能。在这些相互连接的环境中，数据收集和管理变得越来越具有挑战性。文献中的大量工作已经解决了基于不同环境(例如，移动自组织，对等网络和物联网网络)的数据收集和管理问题。如今，有线网络同时使用所有这些协议，因此需要建立一个考虑所有潜在用户偏好的标准数据收集和管理框架。为此，多目标聚类被用作一种很有前途的解决方案，以确保在数据采集和管理过程中连接设备的稳定性。本文介绍了一种新的多目标聚类(MOC)技术，该技术基于连接环境中聚类构建和头部选择的各种标准。更准确地说，提出的解决方案是基于超图来表示连接的环境和集群，根据异构设备之间的相似性。然后，采用横截面超图算法选择簇头。实验表明，我们的解决方案在性能和有效性方面优于开创性的文献方法。

{"title":"A Multi-Objective Clustering for Better Data Management in Connected Environment","authors":"Sabri Allani , Richard Chbeir , Khouloud Salameh , Elio Mansour , Philippe Arnould","doi":"10.1016/j.bdr.2022.100347","DOIUrl":"https://doi.org/10.1016/j.bdr.2022.100347","url":null,"abstract":"<div><p>Over the past decade, the rapid increase in connected devices has enabled the emergence of new digital ecosystems<span> to provide new opportunities for monitoring and managing systems to optimize overall performance. With these connected environments, data collection and management become increasingly challenging. A significant number of works in the literature have addressed data collection and management based on different contexts (e.g., mobile ad hoc, Peer-2-Peer, and IoT<span> networks). Today, a wired network uses all of these protocols simultaneously, thus highlighting the need to build a standard data collection and management framework that considers all potential user preferences. For this purpose, multi-objective clustering has been utilized as a promising solution to ensure the stability of connected devices during the collection and management of data. In this paper, we introduce a new multi-objective clustering (MOC) technique based on various criteria for cluster construction and head selection in connected environments. More precisely, the proposed solution is based hypergraphs to represent the connected environment and clusters according to similarities between heterogeneous devices. Then, a cross-sectional hypergraph algorithm is applied to select the cluster heads. Experiments conducted show that our solution outperforms the pioneering literature methods in terms of performance and effectiveness.</span></span></p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":null,"pages":null},"PeriodicalIF":3.3,"publicationDate":"2022-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91599166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-01 DOI: 10.1016/j.bdr.2022.100358

Zhiqiang Liu, Xuanhua Shi, Haici Jin

引用次数: 1

Augmented Functional Analysis of Variance (A-fANOVA): Theory and Application to Google Trends for Detecting Differences in Abortion Drugs Queries 方差的增强功能分析(A-fANOVA):理论和应用于谷歌趋势检测流产药物查询的差异

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-01 DOI: 10.1016/j.bdr.2022.100354

Fabrizio Maturo, A. Porreca

引用次数: 4

Data Stream Classification Based on Extreme Learning Machine: A Review 基于极限学习机的数据流分类研究综述

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-01 DOI: 10.1016/j.bdr.2022.100356

Xiulin Zheng, Peipei Li, Xindong Wu

引用次数: 3

Linked Open Government Data to Predict and Explain House Prices: The Case of Scottish Statistics Portal 链接开放政府数据预测和解释房价:苏格兰统计门户的案例

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-10-01 DOI: 10.2139/ssrn.4123599

Areti Karamanou, E. Kalampokis, K. Tarabanis

引用次数: 10

A Multi-Objective Clustering for Better Data Management in Connected Environment 面向互联环境下数据管理的多目标聚类

IF 3.3 3区计算机科学 Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-09-01 DOI: 10.1016/j.bdr.2022.100347

Sabri Allani, R. Chbeir, Khouloud Salameh, Elio Mansour, Philippe Arnould

引用次数: 0

首页上一页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Big Data Research

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀