
Latest Publications in Big Data Mining and Analytics

Intelligent and adaptive web data extraction system using convolutional and long short-term memory deep learning networks
IF 13.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2021-08-26 | DOI: 10.26599/BDMA.2021.9020012
Sudhir Kumar Patnaik;C. Narendra Babu;Mukul Bhave
Data are crucial to the growth of e-commerce in today's world of highly demanding, hyper-personalized consumer experiences, and they are collected using advanced web scraping technologies. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system built on convolutional and Long Short-Term Memory (LSTM) networks: the You Only Look Once (YOLO) algorithm enables automated web page detection, and Tesseract's LSTM engine extracts product details, which are detected as images on web pages. This state-of-the-art system needs no core data extraction engine and can therefore adapt to dynamic changes in website layout. Experiments on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained on an input dataset of 45 objects or images.
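As a concrete illustration of the detection-precision metric reported above, the sketch below computes precision from predicted and ground-truth bounding boxes via IoU matching. This is a generic illustration, not the authors' code; the box format, the 0.5 threshold, and the greedy matching strategy are all assumptions.

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def detection_precision(preds, truths, thresh=0.5):
    # fraction of predicted boxes that match a not-yet-claimed
    # ground-truth box with IoU at or above the threshold
    unmatched, tp = list(truths), 0
    for p in preds:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= thresh:
            tp += 1
            unmatched.remove(best)
    return tp / len(preds) if preds else 0.0
```

For example, two predictions against one ground-truth box, where only the first overlaps it, yield a precision of 0.5.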
Big Data Mining and Analytics, vol. 4, no. 4, pp. 279-297. Open-access PDF: https://ieeexplore.ieee.org/iel7/8254253/9523493/09523501.pdf
Citations: 14
Coronavirus pandemic analysis through tripartite graph clustering in online social networks
IF 13.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2021-08-26 | DOI: 10.26599/BDMA.2021.9020010
Xueting Liao;Danyang Zheng;Xiaojun Cao
The COVID-19 pandemic has hit the world hard. Reactions to pandemic-related issues have been pouring into social platforms such as Twitter. Many public officials and governments use Twitter to make policy announcements, and people keep close track of the related information and express their concerns about the policies on Twitter. It is beneficial yet challenging to derive important information or knowledge from such Twitter data. In this paper, we propose a Tripartite Graph Clustering for Pandemic Data Analysis (TGC-PDA) framework that builds on three components: (1) tripartite graph representation, (2) non-negative matrix factorization with regularization, and (3) sentiment analysis. We collect tweets containing a set of keywords related to the coronavirus pandemic as the ground truth data. Our framework can detect communities of Twitter users and analyze the topics discussed in those communities. Extensive experiments show that our TGC-PDA framework can effectively and efficiently identify topics and correlations within Twitter data for monitoring and understanding public opinion, which would provide policy makers with useful information and statistics for decision making.
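The second component, non-negative matrix factorization with regularization, can be sketched with standard multiplicative updates. This is a generic L2-regularized NMF, not the paper's exact formulation; the rank `k`, the penalty `lam`, and the iteration count are illustrative choices.

```python
import numpy as np

def nmf(X, k, lam=0.01, iters=200, seed=0):
    """Regularized NMF: factor a non-negative X (n x m) into W (n x k)
    and H (k x m), minimizing ||X - WH||_F^2 + lam*(||W||^2 + ||H||^2)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W, H = rng.random((n, k)), rng.random((k, m))
    eps = 1e-9  # avoids division by zero in the updates
    for _ in range(iters):
        # multiplicative updates, damped by the L2 penalty term
        H *= (W.T @ X) / (W.T @ W @ H + lam * H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + lam * W + eps)
    return W, H
```

On a tripartite graph, `X` would be one of the non-negative association matrices between the three node types (e.g., users, tweets, and keywords).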
Big Data Mining and Analytics, vol. 4, no. 4, pp. 242-251. Open-access PDF: https://ieeexplore.ieee.org/iel7/8254253/9523493/09523498.pdf
Citations: 11
LotusSQL: SQL engine for high-performance big data systems
IF 13.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2021-08-26 | DOI: 10.26599/BDMA.2021.9020009
Xiaohan Li;Bowen Yu;Guanyu Feng;Haojie Wang;Wenguang Chen
In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language (SQL), and it provides convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still suffers from Spark's inefficiency, which stems from the Java virtual machine and unnecessary data serialization and deserialization. Adopting native languages such as C++ can help avoid such bottlenecks: benefiting from a bare-metal runtime environment and template usage, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the required programming and debugging effort. In this work, we present LotusSQL, an engine that provides SQL support for dataset abstraction on Lotus, a native backend. We employ a convenient SQL processing framework for frontend jobs and add advanced query optimization techniques to improve the quality of execution plans. On top of the compute engine's storage design and user interface, LotusSQL implements a set of efficient structured dataset operations and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× in certain queries and outperforms SparkSQL in a standard query benchmark by more than 2× on average.
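One classic example of the query optimizations such engines apply is predicate pushdown: evaluating filters as close to the scan as possible so later operators process fewer rows. The sketch below is a generic, minimal illustration, not LotusSQL's C++ implementation; the toy rows and operators are hypothetical.

```python
# a "plan" is a chain of generators, consumed lazily from the top
def scan(rows):
    yield from rows

def filter_(pred, child):
    return (r for r in child if pred(r))

def project(cols, child):
    return ({c: r[c] for c in cols} for r in child)

rows = [{"id": i, "price": i * 10} for i in range(100)]

# naive plan: the filter sits above the projection, so "price"
# must be carried through the projection just to be tested
naive = filter_(lambda r: r["price"] > 800,
                project(["id", "price"], scan(rows)))

# pushed-down plan: filter directly above the scan, then project
# only the column the query actually returns
pushed = project(["id"], filter_(lambda r: r["price"] > 800, scan(rows)))
```

Both plans return the same rows, but in the pushed-down plan the projection touches only the rows that survive the filter.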
Big Data Mining and Analytics, vol. 4, no. 4, pp. 252-265. Open-access PDF: https://ieeexplore.ieee.org/iel7/8254253/9523493/09523499.pdf
Citations: 0
A deep-learning prediction model for imbalanced time series data forecasting
IF 13.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2021-08-26 | DOI: 10.26599/BDMA.2021.9020011
Chenyu Hou;Jiawei Wu;Bin Cao;Jing Fan
Time series forecasting has attracted wide attention in recent decades. However, some time series are imbalanced and show different patterns between special and normal periods, which degrades prediction accuracy for the special periods. In this paper, we aim to develop a unified model that alleviates this imbalance and thus improves prediction accuracy for special periods. The task is challenging for two reasons: (1) the temporal dependency of the series, and (2) the tradeoff between mining similar patterns and distinguishing the different distributions of different periods. To tackle these issues, we propose a self-attention-based time-varying prediction model with a two-stage training strategy. First, we use an encoder-decoder module with a multi-head self-attention mechanism to extract the common patterns of the time series. Then, we propose a time-varying optimization module that optimizes the results for special periods and eliminates the imbalance. Moreover, we propose reverse distance attention in place of traditional dot-product attention to highlight the importance of similar historical values to the forecast results. Finally, extensive experiments show that our model outperforms other baselines in terms of mean absolute error and mean absolute percentage error.
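The abstract does not give the exact form of reverse distance attention; one plausible reading replaces dot-product scores with inverse distances, so that historical values close to the query receive the largest weights. A minimal sketch under that assumption:

```python
import numpy as np

def reverse_distance_attention(query, keys, values, eps=1e-6):
    """Weight each historical step inversely to its distance from the
    query value, so similar history dominates the forecast. The inverse
    form 1/(d + eps) is an assumption; softmax(-d) would also qualify."""
    d = np.abs(keys - query)        # distances replace dot products
    w = 1.0 / (d + eps)
    w = w / w.sum()                 # normalize to a probability vector
    return float(w @ values), w
```

With `keys = [1, 5, 9]` and `query = 1`, the first key is (nearly) an exact match, so its value dominates the output.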
Big Data Mining and Analytics, vol. 4, no. 4, pp. 266-278. Open-access PDF: https://ieeexplore.ieee.org/iel7/8254253/9523493/09523500.pdf
Citations: 29
Call for papers: Special issue on intelligent systems and Internet of Things
IF 13.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2021-03-12 | DOI: 10.26599/BDMA.2021.9020007
Big Data Mining and Analytics is an international academic journal sponsored by Tsinghua University and published quarterly. It focuses on technologies that enable and accelerate big data discovery. All published papers are available on the IEEE Xplore Digital Library in open access mode. The journal is indexed and abstracted in Ei Compendex, Scopus, DBLP Computer Science, Google Scholar, and CNKI.
Big Data Mining and Analytics, vol. 4, no. 3, p. 222. Open-access PDF: https://ieeexplore.ieee.org/iel7/8254253/9430128/09430138.pdf
Citations: 0
A multitask multiview neural network for end-to-end aspect-based sentiment analysis
IF 13.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2021-03-12 | DOI: 10.26599/BDMA.2021.9020003
Yong Bie;Yan Yang
Aspect-based sentiment analysis (ABSA) consists of two subtasks: aspect term extraction and aspect sentiment prediction. Existing methods handle the two subtasks one by one in a pipeline manner, which causes problems in both performance and real-world application. This study investigates end-to-end ABSA and proposes a novel multitask multiview network (MTMVN) architecture. Specifically, the architecture takes unified ABSA as the main task, with the two subtasks as auxiliary tasks. Meanwhile, the representation obtained from the branch network of the main task is regarded as the global view, whereas the representations of the two subtasks are considered two local views with different emphases. Through multitask learning, the main task benefits from additional accurate aspect boundary information and sentiment polarity information. By enhancing the correlations between the views in the spirit of multiview learning, the representation of the global view can be optimized to improve the overall performance of the model. Experimental results on three benchmark datasets show that the proposed method outperforms existing pipeline and end-to-end methods, proving the superiority of our MTMVN architecture.
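A minimal sketch of the multitask idea: a joint objective that adds down-weighted auxiliary subtask losses to the main unified-ABSA loss, plus a fusion step that pulls the global view toward the local views. The weighting scheme and the fusion rule are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def joint_loss(main_loss, aux_losses, weights=(0.5, 0.5)):
    # unified-ABSA loss plus down-weighted subtask losses
    # (extraction and sentiment); weights are hypothetical
    return main_loss + sum(w * l for w, l in zip(weights, aux_losses))

def fuse_views(global_view, local_views, alpha=0.3):
    """Blend the global (main-task) representation with the mean of
    the local (subtask) views, a simple stand-in for the paper's
    view-correlation mechanism."""
    return (1 - alpha) * global_view + alpha * np.mean(local_views, axis=0)
```

For example, a main loss of 1.0 with auxiliary losses 0.4 and 0.2 gives a joint loss of 1.3 under the default weights.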
Big Data Mining and Analytics, vol. 4, no. 3, pp. 195-207. Open-access PDF: https://ieeexplore.ieee.org/iel7/8254253/9430128/09430135.pdf
Citations: 23
AIPerf: Automated machine learning as an AI-HPC benchmark
IF 13.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2021-03-12 | DOI: 10.26599/BDMA.2021.9020004
Zhixiang Ren;Yongheng Liu;Tianhui Shi;Lei Xie;Yue Zhou;Jidong Zhai;Youhui Zhang;Yunquan Zhang;Wenguang Chen
The plethora of complex Artificial Intelligence (AI) algorithms and available High-Performance Computing (HPC) power stimulates the expeditious development of AI components with heterogeneous designs. Consequently, the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged. In particular, the de facto HPC benchmark, LINPACK, cannot reflect AI computing power and input/output performance without a representative workload. Current popular AI benchmarks, such as MLPerf, have a fixed problem size and therefore limited scalability. To address these issues, we propose an end-to-end benchmark suite utilizing automated machine learning, which not only represents real AI scenarios but is also auto-adaptively scalable to machines of various scales. We implement the algorithms in a highly parallel and flexible way to ensure efficiency and optimization potential on diverse systems with customizable configurations. We utilize Operations Per Second (OPS), measured in an analytical and systematic approach, as the major metric to quantify AI performance. We evaluate the benchmark's stability and scalability on various systems, from 4 nodes with 32 NVIDIA Tesla T4 GPUs (56.1 Tera-OPS measured) up to 512 nodes with 4096 Huawei Ascend 910 processors (194.53 Peta-OPS measured), and the results show near-linear weak scalability. With a flexible workload and a single metric, AIPerf can easily scale to and rank AI-HPC systems, providing a powerful benchmark suite for the coming supercomputing era.
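Weak-scaling efficiency compares measured throughput at a larger scale against what ideal linear scaling would predict from a smaller run. Note that the two endpoints reported above use different accelerators (NVIDIA T4 vs. Huawei Ascend 910), so their OPS figures are not directly comparable; the numbers in this sketch are therefore hypothetical, same-hardware values.

```python
def weak_scaling_efficiency(nodes_a, ops_a, nodes_b, ops_b):
    """Measured throughput at the larger scale divided by the
    throughput ideal linear scaling would predict (1.0 = perfectly
    linear, values near 1.0 = near-linear weak scalability)."""
    ideal = ops_a * (nodes_b / nodes_a)
    return ops_b / ideal

# hypothetical same-hardware example: 100 OPS on 4 nodes,
# 11 520 OPS on 512 nodes -> 90% of ideal linear scaling
eff = weak_scaling_efficiency(4, 100.0, 512, 11520.0)
```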
Big Data Mining and Analytics, vol. 4, no. 3, pp. 208-220. Open-access PDF: https://ieeexplore.ieee.org/iel7/8254253/9430128/09430136.pdf
Citations: 13
A survey on algorithms for intelligent computing and smart city applications
IF 13.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2021-03-12 | DOI: 10.26599/BDMA.2020.9020029
Zhao Tong;Feng Ye;Ming Yan;Hong Liu;Sunitha Basodi
With the rapid development of human society, the urbanization of the world's population is also progressing rapidly. Urbanization has brought many challenges and problems to the development of cities: the urban population is under excessive pressure, natural resources and energy are increasingly scarce, and environmental pollution is worsening. The original urban model therefore has to change so that people can live in greener, more sustainable cities that provide a more convenient and comfortable living environment. The new urban framework, the smart city, provides excellent opportunities to meet these challenges while solving urban problems. At this stage, many countries are actively responding to calls for smart city development plans. This paper surveys the current state of the smart city. First, it introduces the background of smart city development and gives a brief definition of the smart city concept. Second, it describes the framework of a smart city in accordance with this definition. Finally, various intelligent algorithms that make cities smarter are discussed and analyzed, along with specific examples.
Big Data Mining and Analytics, vol. 4, no. 3, pp. 155-172. Open-access PDF: https://ieeexplore.ieee.org/iel7/8254253/9430128/09430132.pdf
Citations: 49
Improvising personalized travel recommendation system with recency effects
IF 13.6 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2021-03-12 | DOI: 10.26599/BDMA.2020.9020026
Paromita Nitu;Joseph Coelho;Praveen Madiraju
A travel recommendation system based on social media activity provides customized places of interest to accommodate user-specific needs and preferences. In general, a user's inclination towards travel destinations changes over time. In this project, we have analyzed users' Twitter data, along with that of their friends and followers, in a timely fashion to understand recent travel interests. A machine learning classifier identifies tweets relevant to travel, and these travel tweets are then used to generate personalized travel recommendations. Unlike most personalized recommendation systems, our proposed model takes a user's most recent interests into account by incorporating a time-sensitive recency weight into the model. Our proposed model outperforms the existing personalized place-of-interest recommendation model, with an overall accuracy of 75.23%.
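The time-sensitive recency weight can be sketched as an exponential decay over tweet age. The half-life parameter and the aggregation rule below are illustrative assumptions, not the paper's exact model.

```python
def recency_weight(age_days, half_life=30.0):
    """Exponential decay: a tweet half_life days old counts half as
    much as one posted today (half_life is an assumed tuning knob)."""
    return 0.5 ** (age_days / half_life)

def top_destination(tweets):
    """tweets: (destination, relevance, age_days) triples, e.g. from a
    travel-tweet classifier; aggregate recency-weighted relevance per
    destination and return the highest-scoring one."""
    totals = {}
    for dest, rel, age in tweets:
        totals[dest] = totals.get(dest, 0.0) + rel * recency_weight(age)
    return max(totals, key=totals.get)
```

Under this scheme, one fresh high-relevance tweet about a destination can outweigh several older ones about another, which is the "recency effect" the title refers to.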
{"title":"Improvising personalized travel recommendation system with recency effects","authors":"Paromita Nitu;Joseph Coelho;Praveen Madiraju","doi":"10.26599/BDMA.2020.9020026","DOIUrl":"https://doi.org/10.26599/BDMA.2020.9020026","url":null,"abstract":"A travel recommendation system based on social media activity provides a customized place of interest to accommodate user-specific needs and preferences. In general, the user's inclination towards travel destinations is subject to change over time. In this project, we have analyzed users' twitter data, as well as their friends and followers in a timely fashion to understand recent travel interest. A machine learning classifier identifies tweets relevant to travel. The travel tweets are then used to obtain personalized travel recommendations. Unlike most of the personalized recommendation systems, our proposed model takes into account a user's most recent interest by incorporating time-sensitive recency weight into the model. Our proposed model has outperformed the existing personalized place of interest recommendation model, and the overall accuracy is 75.23%.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 3","pages":"139-154"},"PeriodicalIF":13.6,"publicationDate":"2021-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9430128/09430131.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 66
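The abstract above says the model incorporates a "time-sensitive recency weight" so that a user's most recent travel tweets count more than old ones. The paper excerpt does not specify the weight function, so the sketch below assumes a simple exponential half-life decay; the function names, the 30-day half-life, and the `(destination, timestamp)` input format are illustrative assumptions, not the authors' implementation.

```python
import math
from datetime import datetime, timezone

def recency_weight(tweet_time: datetime, now: datetime,
                   half_life_days: float = 30.0) -> float:
    """Exponential decay: a tweet half_life_days old counts half as much as a new one."""
    age_days = (now - tweet_time).total_seconds() / 86400.0
    return 0.5 ** (age_days / half_life_days)

def score_destinations(travel_tweets, now: datetime,
                       half_life_days: float = 30.0):
    """Sum recency-weighted mentions per destination and rank them.

    travel_tweets: iterable of (destination, tweet_time) pairs, e.g. the
    output of a classifier that has already kept only travel-related tweets.
    """
    scores = {}
    for dest, tweet_time in travel_tweets:
        w = recency_weight(tweet_time, now, half_life_days)
        scores[dest] = scores.get(dest, 0.0) + w
    # Highest recency-weighted score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    now = datetime(2021, 3, 1, tzinfo=timezone.utc)
    tweets = [
        ("Paris", datetime(2021, 2, 28, tzinfo=timezone.utc)),   # 1 day old
        ("Paris", datetime(2020, 3, 1, tzinfo=timezone.utc)),    # 1 year old
        ("Tokyo", datetime(2021, 2, 20, tzinfo=timezone.utc)),   # 9 days old
    ]
    print(score_destinations(tweets, now))
```

With this weighting, one year-old mention contributes almost nothing, so a destination mentioned recently outranks one mentioned often long ago, which is the behavior the recency effect is meant to capture.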
Call for papers: Special issue on unlocking genetic diseases by integrating machine learning techniques and medical data
IF 13.6 CAS Zone 1, Computer Science, Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2021-03-12 DOI: 10.26599/BDMA.2021.9020005
Big Data Mining and Analytics is an international academic journal sponsored by Tsinghua University and published quarterly. It features technologies that enable and accelerate big data discovery. All published papers are available on the IEEE Xplore Digital Library in open access mode. The journal is indexed and abstracted in Ei Compendex, Scopus, DBLP Computer Science, Google Scholar, and CNKI.
{"title":"Call for papers: Special issue on unlocking genetic diseases by integrating machine learning techniques and medical data","authors":"","doi":"10.26599/BDMA.2021.9020005","DOIUrl":"https://doi.org/10.26599/BDMA.2021.9020005","url":null,"abstract":"Big Data Mining and Analytics is an international academic journal sponsored by Tsinghua University and published quarterly. It features on technologies to enable and accelerate big data discovery. All of papers published are on the IEEE Xplore Digital Library with the open access mode. The journal is indexed and abstracted in Ei Compendex, Scopus, DBLP Computer Science, Google Scholar, and CNKI.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 3","pages":"221-221"},"PeriodicalIF":13.6,"publicationDate":"2021-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9430128/09430137.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67859135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0