Towards Exascale: Measuring the Energy Footprint of Astrophysics HPC Simulations
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00052
G. Taffoni, M. Katevenis, Renato Panchieri, G. Perna, L. Tornatore, D. Goz, A. Ragagnin, S. Bertocco, I. Coretti, M. Marazakis, Fabien Chaix, Manolis Ploumidis
The increasing amount of data produced in Astronomy by observational studies, together with the size of the theoretical problems to be tackled in the near future, pushes the need for HPC (High Performance Computing) resources towards the "Exascale". The HPC sector is undergoing a profound transition, in which energy efficiency is one of the toughest challenges and one of the main blocking factors on the path to Exascale. Since ideal peak performance is unlikely to be achieved in realistic scenarios, the aim of this work is to give some insight into the energy consumption of contemporary architectures running real scientific applications in an HPC context. We use two state-of-the-art applications from the astrophysical domain, which we optimized to fully exploit the underlying hardware: a direct N-body code and a semi-analytical code for Cosmic Structure formation simulations. For these two applications, we quantitatively evaluate the impact of computation on energy consumption when running on three different systems: one that represents current HPC systems (an Intel-based cluster), one that (possibly) represents the future of HPC systems (a prototype of an Exascale supercomputer), and a micro-cluster based on Arm MPSoCs. We compare the time-to-solution, energy-to-solution and energy delay product (EDP) metrics for different software configurations. Arm-based HPC systems show lower energy consumption, albeit running ≈10 times slower.
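For readers unfamiliar with the third metric, the energy delay product combines the other two. A minimal Python sketch, assuming a simple constant-power model and the common EDP = energy × time weighting (the paper's exact weighting is not restated here):

```python
def energy_metrics(power_watts, runtime_s):
    """Derive the three compared metrics from mean power draw and
    wall-clock runtime. w=1 gives the common E*T form of the EDP
    (w=2 would give the ED2P variant)."""
    energy_j = power_watts * runtime_s   # energy-to-solution (joules)
    edp = energy_j * runtime_s           # energy delay product, w=1
    return {"time_to_solution_s": runtime_s,
            "energy_to_solution_J": energy_j,
            "EDP": edp}

# Illustrative (made-up) numbers: an x86 node vs. a ~10x slower Arm node.
x86 = energy_metrics(power_watts=350.0, runtime_s=100.0)
arm = energy_metrics(power_watts=20.0, runtime_s=1000.0)
# Here the Arm run wins on energy-to-solution (20 kJ vs. 35 kJ) yet loses
# on EDP (2e7 vs. 3.5e6), which is why reporting both is informative.
```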
{"title":"Towards Exascale: Measuring the Energy Footprint of Astrophysics HPC Simulations","authors":"G. Taffoni, M. Katevenis, Renato Panchieri, G. Perna, L. Tornatore, D. Goz, A. Ragagnin, S. Bertocco, I. Coretti, M. Marazakis, Fabien Chaix, Manolis Ploumidis","doi":"10.1109/eScience.2019.00052","DOIUrl":"https://doi.org/10.1109/eScience.2019.00052","url":null,"abstract":"The increasing amount of data produced in Astronomy by observational studies and the size of theoretical problems to be tackled in the next future pushes the need of HPC (High Performance Computing) resources towards the \"Exascale\". The HPC sector is undergoing a profound phase of transition, in which one of the toughest challenges to cope with is the energy efficiency that is one of the main blocking factors to the achievement of \"Exascale\". Since ideal peak-performance is unlikely to be achieved in realistic scenarios, the aim of this work is to give some insights about the energy consumption of contemporary architectures with real scientific applications in a HPC context. We use two state-of-the-art applications from the astrophysical domain, that we optimized in order to fully exploit the underlying hardware: a direct N-body code and a semi-analytical code for Cosmic Structure formation simulations. For these two applications, we quantitatively evaluate the impact of computation on the energy consumption when running on three different systems: one that represents the present of current HPC systems (an Intel-based cluster), one that (possibly) represents the future of HPC systems (a prototype of an Exascale supercomputer) and a micro-cluster based on Arm MPSoC. We provide a comparison of the time-to-solution, energy-to-solution and energy delay product (EDP) metrics, for different software configurations. ARM-based HPC systems have lower energy consumption albeit running ≈10 times slower.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130827004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Research Assistant and AI in eScience
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00059
Dennis Gannon
This paper was solicited as a "vision" talk for the 2019 eScience conference. It is based on my assessment of the role AI will play in eScience in the years ahead. While machine learning methods are already well integrated into computing practice, this paper looks at another area: the role AI will play as an assistant in our daily research work.
{"title":"The Research Assistant and AI in eScience","authors":"Dennis Gannon","doi":"10.1109/eScience.2019.00059","DOIUrl":"https://doi.org/10.1109/eScience.2019.00059","url":null,"abstract":"This paper was solicited as a \"vision\" talk for the 2019 eScience conference. It is based on my assessment of the role AI will have on eScience in the years ahead. While machine learning methods are already being well integrated into computing practice, this paper will look at another area: the role AI will play as an assistant to our daily research work.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123633932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00079
I. Klampanos, F. Magnoni, E. Casarotti, C. Pagé, M. Lindner, A. Ikonomopoulos, V. Karkaletsis, A. Davvetas, A. Gemünd, M. Atkinson, A. Koukourikos, Rosa Filgueira, A. Krause, A. Spinuso, A. Charalambidis
The DARE platform has been designed to help research developers deliver user-facing applications and solutions over diverse underlying e-infrastructures, data and computational contexts. The platform is Cloud-ready and relies on the exposure of APIs that raise the level of abstraction and hide complexity. At its core, the platform implements the cataloguing and execution of fine-grained, Python-based dispel4py workflows as services. Reflection is achieved via a logical knowledge base comprising multiple internal catalogues, registries and semantics, while the platform supports persistent and pervasive data provenance. This paper presents design and implementation aspects of the DARE platform and provides directions for future development.
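As an illustration of the kind of workflow the platform catalogues and executes, here is a minimal dispel4py-style graph following the library's documented ProducerPE/IterativePE pattern; module paths and method names follow the dispel4py tutorial and may differ across versions:

```python
from dispel4py.base import ProducerPE, IterativePE
from dispel4py.workflow_graph import WorkflowGraph

class NumberProducer(ProducerPE):
    """Emits a single value on the default 'output' stream."""
    def __init__(self):
        ProducerPE.__init__(self)
    def _process(self, inputs):
        return 42

class Doubler(IterativePE):
    """Consumes from 'input' and writes the doubled value to 'output'."""
    def __init__(self):
        IterativePE.__init__(self)
    def _process(self, data):
        return 2 * data

graph = WorkflowGraph()
graph.connect(NumberProducer(), 'output', Doubler(), 'input')
# Typically run with one of dispel4py's mappings, e.g.:
#   dispel4py simple <module_name>
```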
{"title":"DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud","authors":"I. Klampanos, F. Magnoni, E. Casarotti, C. Pagé, M. Lindner, A. Ikonomopoulos, V. Karkaletsis, A. Davvetas, A. Gemünd, M. Atkinson, A. Koukourikos, Rosa Filgueira, A. Krause, A. Spinuso, A. Charalambidis","doi":"10.1109/eScience.2019.00079","DOIUrl":"https://doi.org/10.1109/eScience.2019.00079","url":null,"abstract":"The DARE platform has been designed to help research developers deliver user-facing applications and solutions over diverse underlying e-infrastructures, data and computational contexts. The platform is Cloud-ready, and relies on the exposure of APIs, which are suitable for raising the abstraction level and hiding complexity. At its core, the platform implements the cataloguing and execution of fine-grained and Python-based dispel4py workflows as services. Reflection is achieved via a logical knowledge base, comprising multiple internal catalogues, registries and semantics, while it supports persistent and pervasive data provenance. This paper presents design and implementation aspects of the DARE platform, as well as it provides directions for future development.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126462720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced Interactive Parallel Coordinates using Machine Learning and Uncertainty Propagation for Engineering Design
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00045
W. Piotrowski, T. Kipouros, P. Clarkson
The design process of an engineering system requires thorough consideration of varied specifications, each with a potentially large number of dimensions. The sheer volume of data, as well as its complexity, can overwhelm the designer and obscure vital information. Visualisation of big data can mitigate information overload, but static displays can suffer from overplotting. To tackle overplotting and cluttered data, we present an interactive, touch-screen-capable visualisation toolkit that combines Parallel Coordinates and Scatter Plot approaches for managing multidimensional engineering design data. As engineering projects require a multitude of varied software tools to handle the various aspects of the design process, the combined datasets often lack an underlying mathematical model. We address this issue by enhancing our visualisation software with Machine Learning methods, which also facilitate further insights into the data. Furthermore, the various software tools within the engineering design cycle produce information at different levels of fidelity (accuracy and trustworthiness) and at different speeds. The induced uncertainty is modelled in the synthetic dataset and presented in an interactive way. This paper describes a new visualisation software package and demonstrates its functionality on a complex aircraft systems design dataset.
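The core encoding is easy to reproduce. The following small sketch (not the authors' toolkit, which adds interactivity, ML overlays and uncertainty on top) draws a toy design table as parallel coordinates with pandas and matplotlib; the columns are invented stand-ins for real design variables:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Hypothetical multidimensional design points; 'design' is the class
# column used to colour the polylines.
df = pd.DataFrame({
    "mass_kg": [120, 135, 110, 150],
    "drag":    [0.31, 0.28, 0.35, 0.27],
    "cost_k":  [410, 520, 380, 600],
    "design":  ["A", "A", "B", "B"],
})
parallel_coordinates(df, "design")  # one vertical axis per dimension
plt.show()
```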
{"title":"Enhanced Interactive Parallel Coordinates using Machine Learning and Uncertainty Propagation for Engineering Design","authors":"W. Piotrowski, T. Kipouros, P. Clarkson","doi":"10.1109/eScience.2019.00045","DOIUrl":"https://doi.org/10.1109/eScience.2019.00045","url":null,"abstract":"The design process of an engineering system requires thorough consideration of varied specifications, each with potentially large number of dimensions. The sheer volume of data, as well as its complexity, can overwhelm the designer and obscure vital information. Visualisation of big data can mitigate the issue of information overload but static display can suffer from overplotting. To tackle the issue of overplotting and cluttered data, we present an interactive and touch-screen capable visualisation toolkit that combines Parallel Coordinates and Scatter Plot approaches for managing multidimensional engineering design data. As engineering projects require a multitude of varied software to handle the various aspects of the design process, the combined datasets often do not have an underlying mathematical model. We address this issue by enhancing our visualisation software with Machine Learning methods which also facilitate further insights into the data. Furthermore, various software within the engineering design cycle produce information of different level of fidelity (accuracy and trustworthiness), as well as with different speed. The induced uncertainty is also considered and modelled in the synthetic dataset and is also presented in an interactive way. This paper describes a new visualisation software package and demonstrates its functionality on a complex aircraft systems design dataset.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128352225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active Provenance for Data-Intensive Workflows: Engaging Users and Developers
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00077
A. Spinuso, M. Atkinson, F. Magnoni
We present a practical approach to provenance capture in Data-Intensive workflow systems. It provides contextualisation by recording injected domain metadata with the provenance stream, and it offers control over lineage precision by combining automation with specified adaptations. We address provenance tasks such as the extraction of domain metadata, the injection of custom annotations, accuracy, and the integration of records from multiple independent workflows running in distributed contexts. To allow such flexibility, we introduce the concepts of programmable Provenance Types and Provenance Configuration. Provenance Types handle domain contextualisation and allow developers to model lineage patterns by re-defining API methods, composing easy-to-use extensions. Provenance Configuration, instead, enables users of a Data-Intensive workflow execution to prepare it for provenance capture, by configuring the attribution of Provenance Types to components and by specifying grouping into semantic clusters. This enables better searches over the lineage records. Provenance Types and Provenance Configuration are demonstrated in a system being used by computational seismologists. It is based on an extended provenance model, S-PROV.
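To make the "re-defining API methods" idea concrete, here is a hypothetical Python sketch of what a programmable Provenance Type could look like; all class and method names below are illustrative assumptions, not the actual S-PROV or dispel4py API:

```python
class ProvenanceType:
    """Hypothetical base type exposing hooks that developers override
    to shape what is recorded in the provenance stream."""

    def extract_item_metadata(self, data):
        """Override to pull domain metadata from each data item."""
        return {}

    def apply_derivation_rule(self, inputs, output):
        """Override to control lineage precision, i.e. which inputs an
        output is recorded as being derived from."""
        return list(inputs)

class SeismicTraceProvenance(ProvenanceType):
    """Domain contextualisation for seismic traces (illustrative)."""

    def extract_item_metadata(self, data):
        # Record, e.g., the station and channel of each trace.
        return {"station": data.get("station"),
                "channel": data.get("channel")}
```

A Provenance Configuration would then attribute such types to workflow components and group them into semantic clusters, without changing the workflow code itself.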
{"title":"Active Provenance for Data-Intensive Workflows: Engaging Users and Developers","authors":"A. Spinuso, M. Atkinson, F. Magnoni","doi":"10.1109/eScience.2019.00077","DOIUrl":"https://doi.org/10.1109/eScience.2019.00077","url":null,"abstract":"We present a practical approach for provenance capturing in Data-Intensive workflow systems. It provides contextualisation by recording injected domain metadata with the provenance stream. It offers control over lineage precision, combining automation with specified adaptations. We address provenance tasks such as extraction of domain metadata, injection of custom annotations, accuracy and integration of records from multiple independent workflows running in distributed contexts. To allow such flexibility, we introduce the concepts of programmable Provenance Types and Provenance Configuration. Provenance Types handle domain contextualisation and allow developers to model lineage patterns by re-defining API methods, composing easy-to-use extensions. Provenance Configuration, instead, enables users of a Data-Intensive workflow execution o prepare it for provenance capture, by configuring the attribution of Provenance Types to components and by specifying grouping into semantic clusters. This enables better searches over the lineage records. Provenance Types and Provenance Configuration are demonstrated in a system being used by computational seismologists. It is based on an extended provenance model, S-PROV.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121692178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Framework for Model Search Across Multiple Machine Learning Implementations
Pub Date: 2019-08-27 | DOI: 10.1109/eScience.2019.00044
Yoshiki Takahashi, M. Asahara, Kazuyuki Shudo
Several recently devised machine learning (ML) algorithms have shown improved accuracy on various predictive problems. Model searches, which explore candidate ML algorithms and hyperparameter values to find an optimal configuration for the target problem, play a critical role in such improvements. During a model search, data scientists typically use multiple ML implementations to construct several predictive models; however, employing multiple ML implementations takes significant time and effort, owing to the need to learn how to use each of them, prepare input data in several different formats, and compare their outputs. Our proposed framework addresses these issues by providing a simple and unified coding method. It has been designed with two attractive features: i) new machine learning implementations can be added easily via common interfaces between the framework and the ML implementations, and ii) it scales to large model-configuration search spaces via profile-based scheduling. The results of our evaluation indicate that, with our framework, implementers need only write 55-144 lines of code to add a new ML implementation. They also show that ours was the fastest framework for the HIGGS dataset, and the second-fastest for the SECOM dataset.
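A minimal sketch of the common-interface idea in Python: each ML implementation is adapted once behind a shared fit/score interface, after which the search loop is backend-agnostic. All names here are illustrative assumptions, not the paper's actual framework API, and the exhaustive loop stands in for the paper's profile-based scheduling:

```python
from abc import ABC, abstractmethod

class MLBackend(ABC):
    """Common interface each ML implementation is adapted to once."""
    @abstractmethod
    def fit(self, X, y, **hyperparams): ...
    @abstractmethod
    def score(self, X, y) -> float: ...

class SklearnBackend(MLBackend):
    """Adapter for any scikit-learn estimator class."""
    def __init__(self, estimator_cls):
        self.estimator_cls = estimator_cls
        self.model = None
    def fit(self, X, y, **hyperparams):
        self.model = self.estimator_cls(**hyperparams).fit(X, y)
    def score(self, X, y):
        return self.model.score(X, y)

def model_search(backends, configs, X_tr, y_tr, X_te, y_te):
    """Try every (backend, hyperparameter) pair; a real system would
    order these by profiled cost rather than exhaustively."""
    best = (None, None, -float("inf"))
    for name, backend in backends.items():
        for cfg in configs.get(name, [{}]):
            backend.fit(X_tr, y_tr, **cfg)
            acc = backend.score(X_te, y_te)
            if acc > best[2]:
                best = (name, cfg, acc)
    return best
```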
{"title":"A Framework for Model Search Across Multiple Machine Learning Implementations","authors":"Yoshiki Takahashi, M. Asahara, Kazuyuki Shudo","doi":"10.1109/eScience.2019.00044","DOIUrl":"https://doi.org/10.1109/eScience.2019.00044","url":null,"abstract":"Several recently devised machine learning (ML) algorithms have shown improved accuracy for various predictive problems. Model searches, which explore to find an optimal ML algorithm and hyperparameter values for the target problem, play a critical role in such improvements. During a model search, data scientists typically use multiple ML implementations to construct several predictive models; however, it takes significant time and effort to employ multiple ML implementations due to the need to learn how to use them, prepare input data in several different formats, and compare their outputs. Our proposed framework addresses these issues by providing simple and unified coding method. It has been designed with the following two attractive features: i) new machine learning implementations can be added easily via common interfaces between the framework and ML implementations and ii) it can be scaled to handle large model configuration search spaces via profile-based scheduling. The results of our evaluation indicate that, with our framework, implementers need only write 55-144 lines of code to add a new ML implementation. They also show that ours was the fastest framework for the HIGGS dataset, and the second-fastest for the SECOM dataset.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131498751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting Eating Events in Free Living Individuals
Pub Date: 2019-08-14 | DOI: 10.1109/eScience.2019.00090
Jiue-An Yang, Jiayi Wang, Supun Nakandala, Arun Kumar, Marta M. Jankowska
Eating is a health-related behavior that could be intervened upon in the moment using smartphone technology; however, predicting eating events from sensor data is challenging. We evaluate multiple machine learning algorithms for predicting eating and food-purchasing events in a pilot sample of free-living individuals. Data were collected from 81 individuals over one week using accelerometers, GPS devices, and body-worn cameras. The models included raw minute-level features from the sensors as well as engineered features capturing temporal and environmental context. The Gradient Boosting model performed best for predicting eating, and the RBF-SVM model best predicted food purchasing. Time- and context-engineered features were important contributors to predicting both kinds of events. This study provides a promising start in integrating body-worn sensor data, time components, and environmental contextual data into food-related behavior prediction for use in smartphone interventions.
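A minimal scikit-learn sketch of the two best-performing model families named above; this is not the authors' pipeline, and the feature matrix X is assumed to hold the minute-level sensor and context features:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_models():
    # Best for eating events in the paper's evaluation.
    eating_model = GradientBoostingClassifier()
    # Best for food purchasing; SVMs generally want scaled inputs.
    purchase_model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return eating_model, purchase_model

# Usage sketch (X, y_eating assumed):
#   from sklearn.model_selection import cross_val_score
#   eating_model, _ = build_models()
#   cross_val_score(eating_model, X, y_eating, cv=5)
```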
{"title":"Predicting Eating Events in Free Living Individuals","authors":"Jiue-An Yang, Jiayi Wang, Supun Nakandala, Arun Kumar, Marta M. Jankowska","doi":"10.1109/eScience.2019.00090","DOIUrl":"https://doi.org/10.1109/eScience.2019.00090","url":null,"abstract":"Eating is a health-related behavior that could be intervened upon in the moment using smartphone technology, however predicting eating events using sensor data is challenging. We evaluate multiple machine learning algorithms for predicting eating and food purchasing events in a pilot sample of free living individuals. Data was collected with accelerometer, GPS device, and body-worn cameras for a week from 81 individuals. Raw minute-level features from sensors and engineered features including temporal and environmental context were included in the models. The Gradient Boosting model performed best for predicting eating, and the RBF-SVM model best predicted food purchasing. Time and context engineered features were important contributors to predicting eating and food purchasing events. This study provides a promising start in integrating body-worn sensor data, time components, and environmental contextual data into food-related behavior prediction for use in smartphone interventions.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131099242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Quality Issues in Current Nanopublications
Pub Date: 2019-07-29 | DOI: 10.1109/eScience.2019.00069
Imran Asif, Jessica Chen-Burger, A. Gray
Nanopublications are a granular way of publishing scientific claims together with their associated provenance and publication information. More than 10 million nanopublications have been published by a handful of researchers, covering a wide range of topics within the life sciences. We were motivated to replicate an existing analysis of these nanopublications, but then went deeper into their structure. In this paper, we analyse the usage of nanopublications by investigating the distribution of triples in each part, and we discuss the data quality issues that this subsequently revealed. We argue that there is a need for the community to develop a set of guidelines for the modelling of nanopublications.
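The per-part analysis is straightforward to reproduce because a nanopublication is a set of named graphs (head, assertion, provenance, publication info). A sketch using rdflib, assuming a nanopublication serialized as TriG at a hypothetical local path:

```python
from collections import Counter
import rdflib

def triples_per_part(trig_path):
    """Count triples in each named graph of a nanopublication file;
    the per-graph totals show how usage is distributed across parts."""
    ds = rdflib.ConjunctiveGraph()
    ds.parse(trig_path, format="trig")
    counts = Counter()
    for ctx in ds.contexts():          # one context per named graph
        counts[str(ctx.identifier)] = len(ctx)
    return counts

# e.g. triples_per_part("nanopub.trig") ->
#   {...#head: 4, ...#assertion: 3, ...#provenance: 5, ...#pubinfo: 6}
```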
{"title":"Data Quality Issues in Current Nanopublications","authors":"Imran Asif, Jessica Chen-Burger, A. Gray","doi":"10.1109/eScience.2019.00069","DOIUrl":"https://doi.org/10.1109/eScience.2019.00069","url":null,"abstract":"Nanopublications are a granular way of publishing scientific claims together with their associated provenance and publication information. More than 10 million nanopublications have been published by a handful of researchers covering a wide range of topics within the life sciences. We were motivated to replicate an existing analysis of these nanopublications, but then went deeper into the structure of the existing nanopublications. In this paper, we analyse the usage of nanopublications by investigating the distribution of triples in each part and discuss the data quality issues that were subsequently revealed. We argue that there is a need for the community to develop a set of guidelines for the modelling of nanopublications.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123220834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of Pilot Jobs for Apache Spark Applications on HPC Clusters
Pub Date: 2019-05-29 | DOI: 10.1109/eScience.2019.00023
Valérie Hayot-Sasson, T. Glatard
Big Data has become prominent throughout many scientific fields, and as a result, scientific communities have sought out Big Data frameworks to accelerate the processing of their increasingly data-intensive pipelines. However, while scientific communities typically rely on High-Performance Computing (HPC) clusters for the parallelization of their pipelines, many popular Big Data frameworks such as Hadoop and Apache Spark were primarily designed to be executed on dedicated commodity infrastructures. This paper evaluates the benefits of pilot jobs over traditional batch submission for Apache Spark on HPC clusters. Surprisingly, our results show that the speed-up provided by pilot jobs over batch scheduling is moderate to non-existent (0.98 on average) despite the presence of long queuing times. In addition, pilot jobs provide an extra layer of scheduling that complicates debugging and deployment. We conclude that traditional batch scheduling should remain the default strategy to deploy Apache Spark applications on HPC clusters.
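For context, a minimal PySpark application of the kind such benchmarks run. The application code is identical under batch or pilot-job deployment; only the submission layer changes (cluster-specific launch scripts, e.g. an sbatch wrapper around spark-submit, are omitted here):

```python
from pyspark.sql import SparkSession

# The session picks up master/executor settings from the submission
# environment, so the script itself is deployment-agnostic.
spark = SparkSession.builder.appName("pilot-vs-batch-demo").getOrCreate()

# A trivial stage to exercise the executors.
rdd = spark.sparkContext.parallelize(range(10**6))
print(rdd.map(lambda x: x * x).sum())

spark.stop()
```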
{"title":"Evaluation of Pilot Jobs for Apache Spark Applications on HPC Clusters","authors":"Valérie Hayot-Sasson, T. Glatard","doi":"10.1109/eScience.2019.00023","DOIUrl":"https://doi.org/10.1109/eScience.2019.00023","url":null,"abstract":"Big Data has become prominent throughout many scientific fields, and as a result, scientific communities have sought out Big Data frameworks to accelerate the processing of their increasingly data-intensive pipelines. However, while scientific communities typically rely on High-Performance Computing (HPC) clusters for the parallelization of their pipelines, many popular Big Data frameworks such as Hadoop and Apache Spark were primarily designed to be executed on dedicated commodity infrastructures. This paper evaluates the benefits of pilot jobs over traditional batch submission for Apache Spark on HPC clusters. Surprisingly, our results show that the speed-up provided by pilot jobs over batch scheduling is moderate to non-existent (0.98 on average) despite the presence of long queuing times. In addition, pilot jobs provide an extra layer of scheduling that complicates debugging and deployment. We conclude that traditional batch scheduling should remain the default strategy to deploy Apache Spark applications on HPC clusters.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134397757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-model Investigative Exploration of Social Media Data with BOUTIQUE: A Case Study in Public Health
Pub Date: 2019-05-24 | DOI: 10.1109/eScience.2019.00030
Junan Guo, S. Dasgupta, Amarnath Gupta
We present our experience with a data science problem in Public Health, where researchers use social media (Twitter) to determine whether the public shows awareness of HIV prevention measures offered by Public Health campaigns. To help the researchers, we developed an investigative exploration system called BOUTIQUE that allows a user to perform multistep visualization and exploration of data through a dashboard interface. Unique features of BOUTIQUE include its ability to handle heterogeneous types of data provided by a polystore and its ability to use computation as part of the investigative exploration process. In this paper, we present the design of the BOUTIQUE middleware and walk through an investigation process for a real-life problem.
{"title":"Multi-model Investigative Exploration of Social Media Data with BOUTIQUE: A Case Study in Public Health","authors":"Junan Guo, S. Dasgupta, Amarnath Gupta","doi":"10.1109/eScience.2019.00030","DOIUrl":"https://doi.org/10.1109/eScience.2019.00030","url":null,"abstract":"We present our experience with a data science problem in Public Health, where researchers use social media (Twitter) to determine whether the public shows awareness of HIV prevention measures offered by Public Health campaigns. To help the researcher, we develop a investigative exploration system called BOUTIQUE that allows a user to perform a multistep visualization and exploration of data through a dashboard interface. Unique features of BOUTIQUE includes its ability to handle heterogeneous types of data provided by a polystore, and its ability to use computation as part of the investigative exploration process. In this paper, we present the design of the BOUTIQUE middleware and walk through an investigation process for a real-life problem.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121558232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}