Towards Exascale: Measuring the Energy Footprint of Astrophysics HPC Simulations
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00052
G. Taffoni, M. Katevenis, Renato Panchieri, G. Perna, L. Tornatore, D. Goz, A. Ragagnin, S. Bertocco, I. Coretti, M. Marazakis, Fabien Chaix, Manolis Ploumidis
The increasing amount of data produced in Astronomy by observational studies, together with the size of the theoretical problems to be tackled in the near future, pushes the need for HPC (High Performance Computing) resources towards the "Exascale". The HPC sector is undergoing a profound transition, in which energy efficiency is one of the toughest challenges and one of the main blocking factors on the path to Exascale. Since ideal peak performance is unlikely to be achieved in realistic scenarios, the aim of this work is to give some insight into the energy consumption of contemporary architectures running real scientific applications in an HPC context. We use two state-of-the-art applications from the astrophysical domain, which we optimized to fully exploit the underlying hardware: a direct N-body code and a semi-analytical code for Cosmic Structure formation simulations. For these two applications, we quantitatively evaluate the impact of computation on energy consumption when running on three different systems: one that represents current HPC systems (an Intel-based cluster), one that (possibly) represents the future of HPC systems (a prototype of an Exascale supercomputer), and a micro-cluster based on Arm MPSoCs. We compare the time-to-solution, energy-to-solution and energy delay product (EDP) metrics for different software configurations. Arm-based HPC systems show lower energy consumption, albeit running ≈10 times slower.
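For readers unfamiliar with the third metric, the energy delay product combines the other two. A minimal Python sketch, assuming a simple constant-power model and the common EDP = energy × time weighting (the paper's exact weighting is not restated here):

```python
def energy_metrics(power_watts, runtime_s):
    """Derive the three compared metrics from mean power draw and
    wall-clock runtime. w=1 gives the common E*T form of the EDP
    (w=2 would give the ED2P variant)."""
    energy_j = power_watts * runtime_s   # energy-to-solution (joules)
    edp = energy_j * runtime_s           # energy delay product, w=1
    return {"time_to_solution_s": runtime_s,
            "energy_to_solution_J": energy_j,
            "EDP": edp}

# Illustrative (made-up) numbers: an x86 node vs. a ~10x slower Arm node.
x86 = energy_metrics(power_watts=350.0, runtime_s=100.0)
arm = energy_metrics(power_watts=20.0, runtime_s=1000.0)
# Here the Arm run wins on energy-to-solution (20 kJ vs. 35 kJ) yet loses
# on EDP (2e7 vs. 3.5e6), which is why reporting both is informative.
```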
{"title":"Towards Exascale: Measuring the Energy Footprint of Astrophysics HPC Simulations","authors":"G. Taffoni, M. Katevenis, Renato Panchieri, G. Perna, L. Tornatore, D. Goz, A. Ragagnin, S. Bertocco, I. Coretti, M. Marazakis, Fabien Chaix, Manolis Ploumidis","doi":"10.1109/eScience.2019.00052","DOIUrl":"https://doi.org/10.1109/eScience.2019.00052","url":null,"abstract":"The increasing amount of data produced in Astronomy by observational studies and the size of theoretical problems to be tackled in the next future pushes the need of HPC (High Performance Computing) resources towards the \"Exascale\". The HPC sector is undergoing a profound phase of transition, in which one of the toughest challenges to cope with is the energy efficiency that is one of the main blocking factors to the achievement of \"Exascale\". Since ideal peak-performance is unlikely to be achieved in realistic scenarios, the aim of this work is to give some insights about the energy consumption of contemporary architectures with real scientific applications in a HPC context. We use two state-of-the-art applications from the astrophysical domain, that we optimized in order to fully exploit the underlying hardware: a direct N-body code and a semi-analytical code for Cosmic Structure formation simulations. For these two applications, we quantitatively evaluate the impact of computation on the energy consumption when running on three different systems: one that represents the present of current HPC systems (an Intel-based cluster), one that (possibly) represents the future of HPC systems (a prototype of an Exascale supercomputer) and a micro-cluster based on Arm MPSoC. We provide a comparison of the time-to-solution, energy-to-solution and energy delay product (EDP) metrics, for different software configurations. ARM-based HPC systems have lower energy consumption albeit running ≈10 times slower.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130827004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Research Assistant and AI in eScience
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00059
Dennis Gannon
This paper was solicited as a "vision" talk for the 2019 eScience conference. It is based on my assessment of the role AI will play in eScience in the years ahead. While machine learning methods are already well integrated into computing practice, this paper looks at another area: the role AI will play as an assistant in our daily research work.
{"title":"The Research Assistant and AI in eScience","authors":"Dennis Gannon","doi":"10.1109/eScience.2019.00059","DOIUrl":"https://doi.org/10.1109/eScience.2019.00059","url":null,"abstract":"This paper was solicited as a \"vision\" talk for the 2019 eScience conference. It is based on my assessment of the role AI will have on eScience in the years ahead. While machine learning methods are already being well integrated into computing practice, this paper will look at another area: the role AI will play as an assistant to our daily research work.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123633932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00079
I. Klampanos, F. Magnoni, E. Casarotti, C. Pagé, M. Lindner, A. Ikonomopoulos, V. Karkaletsis, A. Davvetas, A. Gemünd, M. Atkinson, A. Koukourikos, Rosa Filgueira, A. Krause, A. Spinuso, A. Charalambidis
The DARE platform has been designed to help research developers deliver user-facing applications and solutions over diverse underlying e-infrastructures, data and computational contexts. The platform is Cloud-ready and relies on the exposure of APIs that raise the level of abstraction and hide complexity. At its core, the platform implements the cataloguing and execution of fine-grained, Python-based dispel4py workflows as services. Reflection is achieved via a logical knowledge base comprising multiple internal catalogues, registries and semantics, while the platform supports persistent and pervasive data provenance. This paper presents design and implementation aspects of the DARE platform and provides directions for future development.
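As an illustration of the kind of workflow the platform catalogues and executes, here is a minimal dispel4py-style graph following the library's documented ProducerPE/IterativePE pattern; module paths and method names follow the dispel4py tutorial and may differ across versions:

```python
from dispel4py.base import ProducerPE, IterativePE
from dispel4py.workflow_graph import WorkflowGraph

class NumberProducer(ProducerPE):
    """Emits a single value on the default 'output' stream."""
    def __init__(self):
        ProducerPE.__init__(self)
    def _process(self, inputs):
        return 42

class Doubler(IterativePE):
    """Consumes from 'input' and writes the doubled value to 'output'."""
    def __init__(self):
        IterativePE.__init__(self)
    def _process(self, data):
        return 2 * data

graph = WorkflowGraph()
graph.connect(NumberProducer(), 'output', Doubler(), 'input')
# Typically run with one of dispel4py's mappings, e.g.:
#   dispel4py simple <module_name>
```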
{"title":"DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud","authors":"I. Klampanos, F. Magnoni, E. Casarotti, C. Pagé, M. Lindner, A. Ikonomopoulos, V. Karkaletsis, A. Davvetas, A. Gemünd, M. Atkinson, A. Koukourikos, Rosa Filgueira, A. Krause, A. Spinuso, A. Charalambidis","doi":"10.1109/eScience.2019.00079","DOIUrl":"https://doi.org/10.1109/eScience.2019.00079","url":null,"abstract":"The DARE platform has been designed to help research developers deliver user-facing applications and solutions over diverse underlying e-infrastructures, data and computational contexts. The platform is Cloud-ready, and relies on the exposure of APIs, which are suitable for raising the abstraction level and hiding complexity. At its core, the platform implements the cataloguing and execution of fine-grained and Python-based dispel4py workflows as services. Reflection is achieved via a logical knowledge base, comprising multiple internal catalogues, registries and semantics, while it supports persistent and pervasive data provenance. This paper presents design and implementation aspects of the DARE platform, as well as it provides directions for future development.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126462720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced Interactive Parallel Coordinates using Machine Learning and Uncertainty Propagation for Engineering Design
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00045
W. Piotrowski, T. Kipouros, P. Clarkson
The design process of an engineering system requires thorough consideration of varied specifications, each with a potentially large number of dimensions. The sheer volume of data, as well as its complexity, can overwhelm the designer and obscure vital information. Visualisation of big data can mitigate information overload, but static displays can suffer from overplotting. To tackle overplotting and cluttered data, we present an interactive, touch-screen-capable visualisation toolkit that combines Parallel Coordinates and Scatter Plot approaches for managing multidimensional engineering design data. As engineering projects require a multitude of varied software tools to handle the various aspects of the design process, the combined datasets often lack an underlying mathematical model. We address this issue by enhancing our visualisation software with Machine Learning methods, which also facilitate further insights into the data. Furthermore, the various software tools within the engineering design cycle produce information at different levels of fidelity (accuracy and trustworthiness) and at different speeds. The induced uncertainty is modelled in the synthetic dataset and presented in an interactive way. This paper describes a new visualisation software package and demonstrates its functionality on a complex aircraft systems design dataset.
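The core encoding is easy to reproduce. The following small sketch (not the authors' toolkit, which adds interactivity, ML overlays and uncertainty on top) draws a toy design table as parallel coordinates with pandas and matplotlib; the columns are invented stand-ins for real design variables:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Hypothetical multidimensional design points; 'design' is the class
# column used to colour the polylines.
df = pd.DataFrame({
    "mass_kg": [120, 135, 110, 150],
    "drag":    [0.31, 0.28, 0.35, 0.27],
    "cost_k":  [410, 520, 380, 600],
    "design":  ["A", "A", "B", "B"],
})
parallel_coordinates(df, "design")  # one vertical axis per dimension
plt.show()
```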
{"title":"Enhanced Interactive Parallel Coordinates using Machine Learning and Uncertainty Propagation for Engineering Design","authors":"W. Piotrowski, T. Kipouros, P. Clarkson","doi":"10.1109/eScience.2019.00045","DOIUrl":"https://doi.org/10.1109/eScience.2019.00045","url":null,"abstract":"The design process of an engineering system requires thorough consideration of varied specifications, each with potentially large number of dimensions. The sheer volume of data, as well as its complexity, can overwhelm the designer and obscure vital information. Visualisation of big data can mitigate the issue of information overload but static display can suffer from overplotting. To tackle the issue of overplotting and cluttered data, we present an interactive and touch-screen capable visualisation toolkit that combines Parallel Coordinates and Scatter Plot approaches for managing multidimensional engineering design data. As engineering projects require a multitude of varied software to handle the various aspects of the design process, the combined datasets often do not have an underlying mathematical model. We address this issue by enhancing our visualisation software with Machine Learning methods which also facilitate further insights into the data. Furthermore, various software within the engineering design cycle produce information of different level of fidelity (accuracy and trustworthiness), as well as with different speed. The induced uncertainty is also considered and modelled in the synthetic dataset and is also presented in an interactive way. This paper describes a new visualisation software package and demonstrates its functionality on a complex aircraft systems design dataset.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128352225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active Provenance for Data-Intensive Workflows: Engaging Users and Developers
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00077
A. Spinuso, M. Atkinson, F. Magnoni
We present a practical approach to provenance capture in Data-Intensive workflow systems. It provides contextualisation by recording injected domain metadata with the provenance stream, and it offers control over lineage precision by combining automation with specified adaptations. We address provenance tasks such as the extraction of domain metadata, the injection of custom annotations, accuracy, and the integration of records from multiple independent workflows running in distributed contexts. To allow such flexibility, we introduce the concepts of programmable Provenance Types and Provenance Configuration. Provenance Types handle domain contextualisation and allow developers to model lineage patterns by re-defining API methods, composing easy-to-use extensions. Provenance Configuration, instead, enables users of a Data-Intensive workflow execution to prepare it for provenance capture, by configuring the attribution of Provenance Types to components and by specifying grouping into semantic clusters. This enables better searches over the lineage records. Provenance Types and Provenance Configuration are demonstrated in a system being used by computational seismologists. It is based on an extended provenance model, S-PROV.
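To make the "re-defining API methods" idea concrete, here is a hypothetical Python sketch of what a programmable Provenance Type could look like; all class and method names below are illustrative assumptions, not the actual S-PROV or dispel4py API:

```python
class ProvenanceType:
    """Hypothetical base type exposing hooks that developers override
    to shape what is recorded in the provenance stream."""

    def extract_item_metadata(self, data):
        """Override to pull domain metadata from each data item."""
        return {}

    def apply_derivation_rule(self, inputs, output):
        """Override to control lineage precision, i.e. which inputs an
        output is recorded as being derived from."""
        return list(inputs)

class SeismicTraceProvenance(ProvenanceType):
    """Domain contextualisation for seismic traces (illustrative)."""

    def extract_item_metadata(self, data):
        # Record, e.g., the station and channel of each trace.
        return {"station": data.get("station"),
                "channel": data.get("channel")}
```

A Provenance Configuration would then attribute such types to workflow components and group them into semantic clusters, without changing the workflow code itself.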
{"title":"Active Provenance for Data-Intensive Workflows: Engaging Users and Developers","authors":"A. Spinuso, M. Atkinson, F. Magnoni","doi":"10.1109/eScience.2019.00077","DOIUrl":"https://doi.org/10.1109/eScience.2019.00077","url":null,"abstract":"We present a practical approach for provenance capturing in Data-Intensive workflow systems. It provides contextualisation by recording injected domain metadata with the provenance stream. It offers control over lineage precision, combining automation with specified adaptations. We address provenance tasks such as extraction of domain metadata, injection of custom annotations, accuracy and integration of records from multiple independent workflows running in distributed contexts. To allow such flexibility, we introduce the concepts of programmable Provenance Types and Provenance Configuration. Provenance Types handle domain contextualisation and allow developers to model lineage patterns by re-defining API methods, composing easy-to-use extensions. Provenance Configuration, instead, enables users of a Data-Intensive workflow execution o prepare it for provenance capture, by configuring the attribution of Provenance Types to components and by specifying grouping into semantic clusters. This enables better searches over the lineage records. Provenance Types and Provenance Configuration are demonstrated in a system being used by computational seismologists. It is based on an extended provenance model, S-PROV.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121692178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Framework for Model Search Across Multiple Machine Learning Implementations
Pub Date: 2019-08-27 | DOI: 10.1109/eScience.2019.00044
Yoshiki Takahashi, M. Asahara, Kazuyuki Shudo
Several recently devised machine learning (ML) algorithms have shown improved accuracy on various predictive problems. Model searches, which explore candidate ML algorithms and hyperparameter values to find an optimal configuration for the target problem, play a critical role in such improvements. During a model search, data scientists typically use multiple ML implementations to construct several predictive models; however, employing multiple ML implementations takes significant time and effort, owing to the need to learn how to use each of them, prepare input data in several different formats, and compare their outputs. Our proposed framework addresses these issues by providing a simple and unified coding method. It has been designed with two attractive features: i) new machine learning implementations can be added easily via common interfaces between the framework and the ML implementations, and ii) it scales to large model-configuration search spaces via profile-based scheduling. The results of our evaluation indicate that, with our framework, implementers need only write 55-144 lines of code to add a new ML implementation. They also show that ours was the fastest framework for the HIGGS dataset, and the second-fastest for the SECOM dataset.
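A minimal sketch of the common-interface idea in Python: each ML implementation is adapted once behind a shared fit/score interface, after which the search loop is backend-agnostic. All names here are illustrative assumptions, not the paper's actual framework API, and the exhaustive loop stands in for the paper's profile-based scheduling:

```python
from abc import ABC, abstractmethod

class MLBackend(ABC):
    """Common interface each ML implementation is adapted to once."""
    @abstractmethod
    def fit(self, X, y, **hyperparams): ...
    @abstractmethod
    def score(self, X, y) -> float: ...

class SklearnBackend(MLBackend):
    """Adapter for any scikit-learn estimator class."""
    def __init__(self, estimator_cls):
        self.estimator_cls = estimator_cls
        self.model = None
    def fit(self, X, y, **hyperparams):
        self.model = self.estimator_cls(**hyperparams).fit(X, y)
    def score(self, X, y):
        return self.model.score(X, y)

def model_search(backends, configs, X_tr, y_tr, X_te, y_te):
    """Try every (backend, hyperparameter) pair; a real system would
    order these by profiled cost rather than exhaustively."""
    best = (None, None, -float("inf"))
    for name, backend in backends.items():
        for cfg in configs.get(name, [{}]):
            backend.fit(X_tr, y_tr, **cfg)
            acc = backend.score(X_te, y_te)
            if acc > best[2]:
                best = (name, cfg, acc)
    return best
```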
{"title":"A Framework for Model Search Across Multiple Machine Learning Implementations","authors":"Yoshiki Takahashi, M. Asahara, Kazuyuki Shudo","doi":"10.1109/eScience.2019.00044","DOIUrl":"https://doi.org/10.1109/eScience.2019.00044","url":null,"abstract":"Several recently devised machine learning (ML) algorithms have shown improved accuracy for various predictive problems. Model searches, which explore to find an optimal ML algorithm and hyperparameter values for the target problem, play a critical role in such improvements. During a model search, data scientists typically use multiple ML implementations to construct several predictive models; however, it takes significant time and effort to employ multiple ML implementations due to the need to learn how to use them, prepare input data in several different formats, and compare their outputs. Our proposed framework addresses these issues by providing simple and unified coding method. It has been designed with the following two attractive features: i) new machine learning implementations can be added easily via common interfaces between the framework and ML implementations and ii) it can be scaled to handle large model configuration search spaces via profile-based scheduling. The results of our evaluation indicate that, with our framework, implementers need only write 55-144 lines of code to add a new ML implementation. They also show that ours was the fastest framework for the HIGGS dataset, and the second-fastest for the SECOM dataset.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131498751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting Eating Events in Free Living Individuals
Pub Date: 2019-08-14 | DOI: 10.1109/eScience.2019.00090
Jiue-An Yang, Jiayi Wang, Supun Nakandala, Arun Kumar, Marta M. Jankowska
Eating is a health-related behavior that could be intervened upon in the moment using smartphone technology; however, predicting eating events from sensor data is challenging. We evaluate multiple machine learning algorithms for predicting eating and food-purchasing events in a pilot sample of free-living individuals. Data were collected from 81 individuals over one week using accelerometers, GPS devices, and body-worn cameras. The models included raw minute-level features from the sensors as well as engineered features capturing temporal and environmental context. The Gradient Boosting model performed best for predicting eating, and the RBF-SVM model best predicted food purchasing. Time- and context-engineered features were important contributors to predicting both kinds of events. This study provides a promising start in integrating body-worn sensor data, time components, and environmental contextual data into food-related behavior prediction for use in smartphone interventions.
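A minimal scikit-learn sketch of the two best-performing model families named above; this is not the authors' pipeline, and the feature matrix X is assumed to hold the minute-level sensor and context features:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_models():
    # Best for eating events in the paper's evaluation.
    eating_model = GradientBoostingClassifier()
    # Best for food purchasing; SVMs generally want scaled inputs.
    purchase_model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return eating_model, purchase_model

# Usage sketch (X, y_eating assumed):
#   from sklearn.model_selection import cross_val_score
#   eating_model, _ = build_models()
#   cross_val_score(eating_model, X, y_eating, cv=5)
```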
{"title":"Predicting Eating Events in Free Living Individuals","authors":"Jiue-An Yang, Jiayi Wang, Supun Nakandala, Arun Kumar, Marta M. Jankowska","doi":"10.1109/eScience.2019.00090","DOIUrl":"https://doi.org/10.1109/eScience.2019.00090","url":null,"abstract":"Eating is a health-related behavior that could be intervened upon in the moment using smartphone technology, however predicting eating events using sensor data is challenging. We evaluate multiple machine learning algorithms for predicting eating and food purchasing events in a pilot sample of free living individuals. Data was collected with accelerometer, GPS device, and body-worn cameras for a week from 81 individuals. Raw minute-level features from sensors and engineered features including temporal and environmental context were included in the models. The Gradient Boosting model performed best for predicting eating, and the RBF-SVM model best predicted food purchasing. Time and context engineered features were important contributors to predicting eating and food purchasing events. This study provides a promising start in integrating body-worn sensor data, time components, and environmental contextual data into food-related behavior prediction for use in smartphone interventions.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131099242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Quality Issues in Current Nanopublications
Pub Date: 2019-07-29 | DOI: 10.1109/eScience.2019.00069
Imran Asif, Jessica Chen-Burger, A. Gray
Nanopublications are a granular way of publishing scientific claims together with their associated provenance and publication information. More than 10 million nanopublications have been published by a handful of researchers, covering a wide range of topics within the life sciences. We were motivated to replicate an existing analysis of these nanopublications, but then went deeper into their structure. In this paper, we analyse the usage of nanopublications by investigating the distribution of triples in each part, and we discuss the data quality issues that this subsequently revealed. We argue that there is a need for the community to develop a set of guidelines for the modelling of nanopublications.
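The per-part analysis is straightforward to reproduce because a nanopublication is a set of named graphs (head, assertion, provenance, publication info). A sketch using rdflib, assuming a nanopublication serialized as TriG at a hypothetical local path:

```python
from collections import Counter
import rdflib

def triples_per_part(trig_path):
    """Count triples in each named graph of a nanopublication file;
    the per-graph totals show how usage is distributed across parts."""
    ds = rdflib.ConjunctiveGraph()
    ds.parse(trig_path, format="trig")
    counts = Counter()
    for ctx in ds.contexts():          # one context per named graph
        counts[str(ctx.identifier)] = len(ctx)
    return counts

# e.g. triples_per_part("nanopub.trig") ->
#   {...#head: 4, ...#assertion: 3, ...#provenance: 5, ...#pubinfo: 6}
```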
{"title":"Data Quality Issues in Current Nanopublications","authors":"Imran Asif, Jessica Chen-Burger, A. Gray","doi":"10.1109/eScience.2019.00069","DOIUrl":"https://doi.org/10.1109/eScience.2019.00069","url":null,"abstract":"Nanopublications are a granular way of publishing scientific claims together with their associated provenance and publication information. More than 10 million nanopublications have been published by a handful of researchers covering a wide range of topics within the life sciences. We were motivated to replicate an existing analysis of these nanopublications, but then went deeper into the structure of the existing nanopublications. In this paper, we analyse the usage of nanopublications by investigating the distribution of triples in each part and discuss the data quality issues that were subsequently revealed. We argue that there is a need for the community to develop a set of guidelines for the modelling of nanopublications.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123220834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of Pilot Jobs for Apache Spark Applications on HPC Clusters
Pub Date: 2019-05-29 | DOI: 10.1109/eScience.2019.00023
Valérie Hayot-Sasson, T. Glatard
Big Data has become prominent throughout many scientific fields, and as a result, scientific communities have sought out Big Data frameworks to accelerate the processing of their increasingly data-intensive pipelines. However, while scientific communities typically rely on High-Performance Computing (HPC) clusters for the parallelization of their pipelines, many popular Big Data frameworks such as Hadoop and Apache Spark were primarily designed to be executed on dedicated commodity infrastructures. This paper evaluates the benefits of pilot jobs over traditional batch submission for Apache Spark on HPC clusters. Surprisingly, our results show that the speed-up provided by pilot jobs over batch scheduling is moderate to non-existent (0.98 on average) despite the presence of long queuing times. In addition, pilot jobs provide an extra layer of scheduling that complicates debugging and deployment. We conclude that traditional batch scheduling should remain the default strategy to deploy Apache Spark applications on HPC clusters.
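For context, a minimal PySpark application of the kind such benchmarks run. The application code is identical under batch or pilot-job deployment; only the submission layer changes (cluster-specific launch scripts, e.g. an sbatch wrapper around spark-submit, are omitted here):

```python
from pyspark.sql import SparkSession

# The session picks up master/executor settings from the submission
# environment, so the script itself is deployment-agnostic.
spark = SparkSession.builder.appName("pilot-vs-batch-demo").getOrCreate()

# A trivial stage to exercise the executors.
rdd = spark.sparkContext.parallelize(range(10**6))
print(rdd.map(lambda x: x * x).sum())

spark.stop()
```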
{"title":"Evaluation of Pilot Jobs for Apache Spark Applications on HPC Clusters","authors":"Valérie Hayot-Sasson, T. Glatard","doi":"10.1109/eScience.2019.00023","DOIUrl":"https://doi.org/10.1109/eScience.2019.00023","url":null,"abstract":"Big Data has become prominent throughout many scientific fields, and as a result, scientific communities have sought out Big Data frameworks to accelerate the processing of their increasingly data-intensive pipelines. However, while scientific communities typically rely on High-Performance Computing (HPC) clusters for the parallelization of their pipelines, many popular Big Data frameworks such as Hadoop and Apache Spark were primarily designed to be executed on dedicated commodity infrastructures. This paper evaluates the benefits of pilot jobs over traditional batch submission for Apache Spark on HPC clusters. Surprisingly, our results show that the speed-up provided by pilot jobs over batch scheduling is moderate to non-existent (0.98 on average) despite the presence of long queuing times. In addition, pilot jobs provide an extra layer of scheduling that complicates debugging and deployment. We conclude that traditional batch scheduling should remain the default strategy to deploy Apache Spark applications on HPC clusters.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134397757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-model Investigative Exploration of Social Media Data with BOUTIQUE: A Case Study in Public Health
Pub Date: 2019-05-24 | DOI: 10.1109/eScience.2019.00030
Junan Guo, S. Dasgupta, Amarnath Gupta
We present our experience with a data science problem in Public Health, where researchers use social media (Twitter) to determine whether the public shows awareness of HIV prevention measures offered by Public Health campaigns. To help the researchers, we developed an investigative exploration system called BOUTIQUE that allows a user to perform multistep visualization and exploration of data through a dashboard interface. Unique features of BOUTIQUE include its ability to handle heterogeneous types of data provided by a polystore and its ability to use computation as part of the investigative exploration process. In this paper, we present the design of the BOUTIQUE middleware and walk through an investigation process for a real-life problem.
{"title":"Multi-model Investigative Exploration of Social Media Data with BOUTIQUE: A Case Study in Public Health","authors":"Junan Guo, S. Dasgupta, Amarnath Gupta","doi":"10.1109/eScience.2019.00030","DOIUrl":"https://doi.org/10.1109/eScience.2019.00030","url":null,"abstract":"We present our experience with a data science problem in Public Health, where researchers use social media (Twitter) to determine whether the public shows awareness of HIV prevention measures offered by Public Health campaigns. To help the researcher, we develop a investigative exploration system called BOUTIQUE that allows a user to perform a multistep visualization and exploration of data through a dashboard interface. Unique features of BOUTIQUE includes its ability to handle heterogeneous types of data provided by a polystore, and its ability to use computation as part of the investigative exploration process. In this paper, we present the design of the BOUTIQUE middleware and walk through an investigation process for a real-life problem.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121558232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}