
Latest publications in Big Data Mining and Analytics

Sampling with prior knowledge for high-dimensional gravitational wave data analysis
IF 13.6 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2021-12-27 DOI: 10.26599/BDMA.2021.9020018
He Wang;Zhoujian Cao;Yue Zhou;Zong-Kuan Guo;Zhixiang Ren
Extracting knowledge from high-dimensional data has been notoriously difficult, primarily due to the so-called "curse of dimensionality" and the complex joint distributions of these dimensions. This is a particularly profound issue for high-dimensional gravitational wave data analysis, where one must conduct Bayesian inference and estimate joint posterior distributions. In this study, we incorporate prior physical knowledge by sampling from desired interim distributions to develop the training dataset. Accordingly, the more relevant regions of the high-dimensional feature space are covered by additional data points, so that the model can learn subtle but important details. We adapt the normalizing flow method to be more expressive and trainable, so that the information can be effectively extracted and represented by the transformation between the prior and target distributions. Once trained, our model takes only approximately 1 s on one V100 GPU to generate thousands of samples for probabilistic inference. The evaluation of our approach confirms the efficacy and efficiency of gravitational wave data inference and points to a promising direction for similar research. The source code, specifications, and detailed procedures are publicly accessible on GitHub.
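The core mechanism described above, drawing samples from a simple prior and pushing them through an invertible transform whose Jacobian corrects the density, can be sketched as follows. This is a minimal illustration with a single affine bijection and hypothetical fixed parameters; the paper's model stacks many learned transforms:

```python
import numpy as np

def sample_flow(n, scale, shift, rng):
    # draw n samples by pushing prior draws through one affine bijection
    z = rng.standard_normal((n, scale.size))   # prior: standard normal
    x = z * scale + shift                      # invertible transform
    # change of variables: log p(x) = log N(z; 0, I) - sum_i log|scale_i|
    log_prior = -0.5 * (z**2 + np.log(2 * np.pi)).sum(axis=1)
    log_px = log_prior - np.log(np.abs(scale)).sum()
    return x, log_px

rng = np.random.default_rng(0)
samples, log_px = sample_flow(5000, scale=np.array([2.0, 0.5]),
                              shift=np.array([1.0, -1.0]), rng=rng)
```

Once the transform is trained, sampling is a single forward pass, which is why thousands of posterior samples can be generated in about a second.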
Citations: 3
Call for papers: Special issue on deep learning and evolutionary computation for satellite imagery
IF 13.6 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2021-12-27 DOI: 10.26599/BDMA.2021.9020025
Satellite images are enormous sources of data that require efficient methods for knowledge discovery. The increased availability of earth data from satellite images presents immense opportunities in various fields. However, the volume and heterogeneity of the data pose serious computational challenges. The development of efficient techniques has the potential to uncover hidden information in these images. This knowledge can be used in various activities related to planning, monitoring, and managing earth resources. Deep learning is widely used for image analysis and processing, and deep-learning-based models can be effectively applied to mining and knowledge discovery from satellite images.
Citations: 0
Call for papers: Special issue on privacy-preserving data mining for artificial intelligence of things
IF 13.6 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2021-12-27 DOI: 10.26599/BDMA.2021.9020026
Artificial Intelligence of Things (AIoT) is booming with the popularization of end devices and advanced machine learning and data processing techniques. An increasing volume of data is collected every second to enable Artificial Intelligence (AI) on the Internet of Things (IoT). This explosion of data brings significant benefits both to intelligent industries that provide predictive services and to research institutes advancing human knowledge in data-intensive fields. To make the best use of the collected data, various data mining techniques have been deployed to extract data patterns. In classic scenarios, the data collected from IoT devices are sent directly to cloud servers for processing, for example to train machine learning models. However, the network between cloud servers and massive numbers of end devices may be unstable due to irregular bursts of traffic, weather, and so on. Therefore, autonomous data mining, self-organized by a group of local devices to maintain ongoing and robust AI services, plays an increasingly important role for critical IoT infrastructures. Privacy issues are a greater concern in this scenario. Data transmitted via autonomous networks are publicly accessible to all internal participants, which increases the risk of exposure. Besides, data mining techniques may reveal sensitive information from the collected data, and attacks such as inference attacks are emerging and evolving to breach sensitive data because of the financial gains involved. Motivated by this, it is essential to devise novel privacy-preserving autonomous data mining solutions for AIoT. In this Special Issue, we aim to gather state-of-the-art advances in privacy-preserving data mining and autonomous data processing solutions for AIoT.
Topics include, but are not limited to, the following:
• Privacy-preserving federated learning for AIoT
• Differentially private machine learning for AIoT
• Personalized privacy-preserving data mining
• Decentralized machine learning paradigms for autonomous data mining using blockchain
• AI-enhanced edge data mining for AIoT
• AI and blockchain empowered privacy-preserving big data analytics for AIoT
• Anomaly detection and inference attack defense for AIoT
• Privacy protection measurement metrics
• Zero trust architectures for privacy protection management
• Privacy-protection data mining and analysis via blockchain-enabled digital twin
Citations: 0
Attention-aware heterogeneous graph neural network
IF 13.6 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2021-08-26 DOI: 10.26599/BDMA.2021.9020008
Jintao Zhang;Quan Xu
Graph Neural Networks (GNNs), a family of powerful tools built on homogeneous networks for learning embedding representations of graph-structured data, have been widely used in various data mining tasks. Applying a GNN to embed a Heterogeneous Information Network (HIN) is a significant challenge, mainly because HINs contain many different types of nodes and many different types of relationships between nodes. An HIN carries rich semantic and structural information, which calls for a specially designed graph neural network. However, existing HIN-based graph neural network models rarely consider the interactive information hidden between the meta-paths of the HIN, resulting in poor node embeddings. In this paper, we propose an Attention-aware Heterogeneous graph Neural Network (AHNN) model to effectively extract useful information from an HIN and use it to learn the embedding representations of nodes. Specifically, we first use node-level attention to aggregate and update the embedding representation of each node, and then concatenate the embedding representations of the node on different meta-paths. Finally, a semantic-level neural network is proposed to extract the feature interaction relationships on different meta-paths and learn the final node embeddings. Experimental results on three widely used datasets showed that the AHNN model significantly outperforms state-of-the-art models.
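The node-level step described above, scoring each neighbor against the target node, softmax-normalizing the scores, aggregating, and then concatenating across meta-paths, can be sketched as below. The dot-product score is an assumption for illustration; the paper's exact attention function may differ:

```python
import numpy as np

def node_level_attention(h, neighbors):
    # score neighbors against the target node (dot-product score, assumed)
    scores = neighbors @ h
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # softmax attention weights
    return alpha @ neighbors                  # weighted neighbor aggregation

rng = np.random.default_rng(1)
h = rng.standard_normal(8)                    # target-node embedding
metapath_a = rng.standard_normal((4, 8))      # neighbors along one meta-path
metapath_b = rng.standard_normal((3, 8))      # neighbors along another
# aggregate per meta-path, then concatenate, as the abstract describes
h_concat = np.concatenate([node_level_attention(h, metapath_a),
                           node_level_attention(h, metapath_b)])
```

The concatenated vector would then feed the semantic-level network that models interactions between meta-paths.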
Citations: 13
Multimodal adaptive identity-recognition algorithm fused with gait perception
IF 13.6 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2021-08-26 DOI: 10.26599/BDMA.2021.9020006
Changjie Wang;Zhihua Li;Benjamin Sarpong
Identity-recognition technologies typically require assistive equipment, yet they suffer from poor recognition accuracy and high cost. To overcome these deficiencies, this paper proposes several gait-based identification algorithms. First, gait information collected from individuals via the triaxial accelerometers of smartphones is preprocessed, and multimodal fusion with existing standard datasets yields a multimodal synthetic dataset. Then, based on the multimodal characteristics of the collected biological gait information, a Convolutional Neural Network based Gait Recognition (CNN-GR) model and a related scheme for the multimodal features are developed. Finally, building on the proposed CNN-GR model and scheme, a single-gait feature identification algorithm for unimodal gait features and a multimodal identification algorithm based on gait feature fusion are proposed. Experimental results show that the proposed algorithms perform well in terms of recognition accuracy, the confusion matrix, and the kappa statistic, and they achieve better recognition scores and robustness than the compared algorithms; thus, the proposed approach holds prominent promise in practice.
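The kappa statistic used in the evaluation measures classifier agreement beyond chance and is computed directly from the confusion matrix; a small sketch with hypothetical 2-class counts:

```python
import numpy as np

def cohens_kappa(confusion):
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    cm = np.asarray(confusion, dtype=float)
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_chance = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
    return (p_observed - p_chance) / (1 - p_chance)

kappa = cohens_kappa([[20, 5], [10, 15]])   # made-up counts: kappa == 0.4
```

Values near 1 indicate agreement well beyond what class frequencies alone would produce, which is why kappa complements raw accuracy here.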
Citations: 9
Intelligent and adaptive web data extraction system using convolutional and long short-term memory deep learning networks
IF 13.6 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2021-08-26 DOI: 10.26599/BDMA.2021.9020012
Sudhir Kumar Patnaik;C. Narendra Babu;Mukul Bhave
Data are crucial to the growth of e-commerce in today's world of highly demanding, hyper-personalized consumer experiences, and they are collected using advanced web scraping technologies. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system built on convolutional and Long Short-Term Memory (LSTM) networks: the You Only Look Once (Yolo) algorithm enables automated web page detection, and the Tesseract LSTM extracts product details, which are detected as images on web pages. This state-of-the-art system needs no core data extraction engine and can therefore adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained on an input dataset of 45 objects or images.
Citations: 14
Coronavirus pandemic analysis through tripartite graph clustering in online social networks
IF 13.6 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2021-08-26 DOI: 10.26599/BDMA.2021.9020010
Xueting Liao;Danyang Zheng;Xiaojun Cao
The COVID-19 pandemic has hit the world hard. Reactions to pandemic-related issues have been pouring into social platforms such as Twitter. Many public officials and governments use Twitter to make policy announcements, and people keep close track of the related information and express their concerns about the policies on Twitter. Deriving important information or knowledge from such Twitter data is beneficial yet challenging. In this paper, we propose a Tripartite Graph Clustering for Pandemic Data Analysis (TGC-PDA) framework that builds on three models and analyses: (1) tripartite graph representation, (2) non-negative matrix factorization with regularization, and (3) sentiment analysis. We collect tweets containing a set of keywords related to the coronavirus pandemic as the ground truth data. Our framework can detect communities of Twitter users and analyze the topics discussed in those communities. Extensive experiments show that the TGC-PDA framework can effectively and efficiently identify the topics and correlations within the Twitter data for monitoring and understanding public opinion, providing policy makers with useful information and statistics for decision making.
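The factorization step in (2) can be sketched with standard multiplicative updates; the L2 penalty is one assumed concrete form of the paper's regularization, and the construction of the tripartite association matrix from tweets is omitted here:

```python
import numpy as np

def regularized_nmf(X, k, lam=0.01, iters=200, seed=0):
    # factor X ~= W @ H with non-negative factors and an L2 penalty (lam)
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    eps = 1e-9                                # guard against division by zero
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + lam * H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + lam * W + eps)
    return W, H

rng = np.random.default_rng(42)
X = rng.random((30, 20))                      # stand-in association matrix
W, H = regularized_nmf(X, k=5)
err = np.linalg.norm(X - W @ H)
```

The learned non-negative factors can then be clustered or thresholded to read off user communities and their dominant topics.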
Citations: 11
LotusSQL: SQL engine for high-performance big data systems
IF 13.6 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2021-08-26 DOI: 10.26599/BDMA.2021.9020009
Xiaohan Li;Bowen Yu;Guanyu Feng;Haojie Wang;Wenguang Chen
In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language (SQL), and it provides convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still suffers from the inefficiency of Spark caused by the Java virtual machine and unnecessary data serialization and deserialization. Adopting native languages such as C++ can help avoid such bottlenecks: benefiting from a bare-metal runtime environment and template usage, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the required programming and debugging effort. In this work, we present LotusSQL, an engine that provides SQL support for dataset abstraction on a native backend, Lotus. We employ a convenient SQL processing framework to handle frontend jobs, and add advanced query optimization technologies to improve the quality of execution plans. On top of the storage design and user interface of the compute engine, LotusSQL implements a set of structured dataset operations with high efficiency and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× in certain queries and outperforms Spark SQL in a standard query benchmark by more than 2× on average.
Citations: 0
A deep-learning prediction model for imbalanced time series data forecasting
IF 13.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2021-08-26 · DOI: 10.26599/BDMA.2021.9020011
Chenyu Hou;Jiawei Wu;Bin Cao;Jing Fan
Time series forecasting has attracted wide attention in recent decades. However, some time series are imbalanced, showing different patterns in special and normal periods, which degrades prediction accuracy for the special periods. In this paper, we aim to develop a unified model that alleviates the imbalance and thus improves prediction accuracy for special periods. This task is challenging for two reasons: (1) the temporal dependency of the series, and (2) the tradeoff between mining similar patterns and distinguishing the different distributions of different periods. To tackle these issues, we propose a self-attention-based time-varying prediction model with a two-stage training strategy. First, we use an encoder-decoder module with a multi-head self-attention mechanism to extract common patterns of the time series. Then, we propose a time-varying optimization module to optimize the results for special periods and eliminate the imbalance. Moreover, we propose reverse distance attention in place of traditional dot-product attention to highlight the importance of similar historical values to the forecast results. Finally, extensive experiments show that our model outperforms other baselines in terms of mean absolute error and mean absolute percentage error.
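The "reverse distance attention" mentioned in the abstract can be read as weighting historical values by the inverse of their distance to the query rather than by a dot product, so that similar history points dominate the forecast. A minimal pure-Python sketch under that interpretation follows; the function name and the exact scoring formula are assumptions, and the paper's formulation may differ:

```python
import math

def reverse_distance_attention(query, keys, values, eps=1e-8):
    # Score each key by the inverse of its Euclidean distance to the query,
    # so historical points similar to the query (small distance) receive
    # large weights -- one plausible reading of "reverse distance attention".
    dists = [math.dist(query, k) for k in keys]
    scores = [1.0 / (d + eps) for d in dists]
    total = sum(scores)
    weights = [s / total for s in scores]          # normalized, sums to 1
    output = sum(w * v for w, v in zip(weights, values))
    return output, weights

# Query close to the first key: that key's value should dominate the output.
query = (1.0, 0.0)
keys = [(1.1, 0.0), (5.0, 5.0), (-3.0, 2.0)]
values = [10.0, 20.0, 30.0]
out, w = reverse_distance_attention(query, keys, values)
```

Dot-product attention, by contrast, can assign large scores to dissimilar but large-magnitude keys; an inverse-distance score directly encodes similarity.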
Chenyu Hou, Jiawei Wu, Bin Cao, and Jing Fan, "A deep-learning prediction model for imbalanced time series data forecasting," Big Data Mining and Analytics, vol. 4, no. 4, pp. 266–278, 2021. DOI: 10.26599/BDMA.2021.9020011.
Citations: 29
Total contents
IF 13.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2021-08-26
Big Data Mining and Analytics, vol. 4, no. 4, pp. I–II, 2021.
Citations: 0