SciInc: A Container Runtime for Incremental Recomputation
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00040
A. Youngdahl, Dai Hai Ton That, T. Malik
The conduct of reproducible science improves when computations are portable and verifiable. A container runtime provides an isolated environment for running computations and is thus useful for porting applications to new machines. Current container engines, such as LXC and Docker, however, do not track provenance, which is essential for verifying computations. In this paper, we present SciInc, a container runtime that tracks the provenance of computations during container creation. We show how container engines can use audited provenance data for efficient container replay. SciInc observes the inputs to computations and, if they change, propagates the changes, reusing memoized partial computations and data that are identical across the original run and the replay. We chose lightweight data structures for storing the provenance trace so that the container runtime remains shareable and portable. To determine the effectiveness of change propagation and memoization, we compared popular container technologies and incremental recomputation methods using published data analysis experiments.
{"title":"SciInc: A Container Runtime for Incremental Recomputation","authors":"A. Youngdahl, Dai Hai Ton That, T. Malik","doi":"10.1109/eScience.2019.00040","DOIUrl":"https://doi.org/10.1109/eScience.2019.00040","url":null,"abstract":"The conduct of reproducible science improves when computations are portable and verifiable. A container runtime provides an isolated environment for running computations and thus is useful for porting applications on new machines. Current container engines, such as LXC and Docker, however, do not track provenance, which is essential for verifying computations. In this paper, we present SciInc, a container runtime that tracks the provenance of computations during container creation. We show how container engines can use audited provenance data for efficient container replay. SciInc observes inputs to computations, and, if they change, propagates the changes, re-using partially memoized computations and data that are identical across replay and original run. We chose light-weight data structures for storing the provenance trace to maintain the invariant of shareable and portable container runtime. To determine the effectiveness of change propagation and memoization, we compared popular container technology and incremental recomputation methods using published data analysis experiments.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126609091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cyberinfrastructure Center of Excellence Pilot: Connecting Large Facilities Cyberinfrastructure
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00058
Ewa Deelman, Ryan Mitchell, Loïc Pottier, M. Rynge, Erik Scott, K. Vahi, Marina Kogan, Jasmine Mann, Tom Gulbransen, Daniel Allen, David Barlow, A. Mandal, Santiago Bonarrigo, Chris Clark, Leslie Goldman, Tristan Goulden, Phil Harvey, David Hulsander, Steve Jacobs, Christine Laney, Ivan Lobo-Padilla, Jeremy Sampson, Valerio Pascucci, John Staarmann, Steve Stone, Susan Sons, J. Wyngaard, Charles Vardeman, Steve Petruzza, I. Baldin, L. Christopherson
The National Science Foundation's Large Facilities are major, multi-user research facilities that operate and manage sophisticated and diverse research instruments and platforms (e.g., large telescopes, interferometers, distributed sensor arrays) that serve a variety of scientific disciplines, from astronomy and physics to geology and biology and beyond. Large Facilities are increasingly dependent on advanced cyberinfrastructure (i.e., computing, data, and software systems; networking; and associated human capital) to enable the broad delivery and analysis of facility-generated data. These cyberinfrastructure tools enable scientists and the public to gain new insights into fundamental questions about the structure and history of the universe, the world we live in today, and how our environment may change in the coming decades. This paper describes a pilot project that aims to develop a model for a Cyberinfrastructure Center of Excellence (CI CoE) that facilitates community building and knowledge sharing and that disseminates and applies best practices and innovative solutions for facility CI.
{"title":"Cyberinfrastructure Center of Excellence Pilot: Connecting Large Facilities Cyberinfrastructure","authors":"Ewa Deelman, Ryan Mitchell, Loïc Pottier, M. Rynge, Erik Scott, K. Vahi, Marina Kogan, Jasmine Mann, Tom Gulbransen, Daniel Allen, David Barlow, A. Mandal, Santiago Bonarrigo, Chris Clark, Leslie Goldman, Tristan Goulden, Phil Harvey, David Hulsander, Steve Jacobs, Christine Laney, Ivan Lobo-Padilla, Jeremy Sampson, Valerio Pascucci, John Staarmann, Steve Stone, Susan Sons, J. Wyngaard, Charles Vardeman, Steve Petruzza, I. Baldin, L. Christopherson","doi":"10.1109/eScience.2019.00058","DOIUrl":"https://doi.org/10.1109/eScience.2019.00058","url":null,"abstract":"The National Science Foundation's Large Facilities are major, multi-user research facilities that operate and manage sophisticated and diverse research instruments and platforms (e.g., large telescopes, interferometers, distributed sensor arrays) that serve a variety of scientific disciplines, from astronomy and physics to geology and biology and beyond. Large Facilities are increasingly dependent on advanced cyberinfrastructure (i.e., computing, data, and software systems; networking; and associated human capital) to enable the broad delivery and analysis of facility-generated data. These cyberinfrastructure tools enable scientists and the public to gain new insights into fundamental questions about the structure and history of the universe, the world we live in today, and how our environment may change in the coming decades. This paper describes a pilot project that aims to develop a model for a Cyberinfrastructure Center of Excellence (CI CoE) that facilitates community building and knowledge sharing and that disseminates and applies best practices and innovative solutions for facility CI.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128539058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Identification and Process Monitoring for Reproducible Earth Observation Research
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00011
Bernhard Gößwein, Tomasz Miksa, A. Rauber, W. Wagner
Earth observation researchers use specialised computing services for satellite image processing offered by various data backends. The source of data is often the same, for example the Sentinel-2 satellites operated by Copernicus, but how the data are pre-processed, corrected, updated, and later analysed may differ among the backends. Backends often lack mechanisms for data versioning; for example, data corrections are not tracked. Furthermore, the evolving software stack used for data processing remains a black box to researchers. Researchers have no means to identify why executions of the same code deliver different results. This hinders the reproducibility of earth observation experiments. In this paper, we present how the infrastructure of existing earth observation data backends can be modified to support reproducibility. The proposed extensions are based on the recommendations of the Research Data Alliance regarding data identification and on the VFramework for automated process provenance documentation. We implemented these extensions at the Earth Observation Data Centre, a partner in the openEO consortium. We evaluated the solution on a variety of usage scenarios, also providing performance and storage measurements to assess the impact of the modifications. The results indicate that reproducibility can be supported with minimal performance and storage overhead.
{"title":"Data Identification and Process Monitoring for Reproducible Earth Observation Research","authors":"Bernhard Gößwein, Tomasz Miksa, A. Rauber, W. Wagner","doi":"10.1109/eScience.2019.00011","DOIUrl":"https://doi.org/10.1109/eScience.2019.00011","url":null,"abstract":"Earth observation researchers use specialised computing services for satellite image processing offered by various data backends. The source of data is often the same, for example Sentinel-2 satellites operated by Copernicus, but the way how data is pre-processed, corrected, updated, and later analysed may differ among the backends. Backends often lack mechanisms for data versioning, for example, data corrections are not tracked. Furthermore, an evolving software stack used for data processing remains a black box to researchers. Researchers have no means to identify why executions of the same code deliver different results. This hinders reproducibility of earth observation experiments. In this paper, we present how infrastructure of existing earth observation data backends can be modified to support reproducibility. The proposed extensions are based on recommendations of the Research Data Alliance regarding data identification and the VFramework for automated process provenance documentation. We implemented these extensions at the Earth Observation Data Centre, a partner in the openEO consortium. We evaluated the solution on a variety of usage scenarios, providing also performance and storage measures to evaluate the impact of the modifications. The results indicate reproducibility can be supported with minimal performance and storage overhead.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126845969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The International Forest Risk Model (INFORM): A Method for Assessing Supply Chain Deforestation Risk with Imperfect Data
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00009
N. Caithness, Cécile Lachaux, D. Wallom
A method is presented for quantifiably estimating the deforestation risk exposure of agricultural Forest Risk Commodities in commercial supply chains. The model consists of a series of equations applied to end-to-end data representing quantitative descriptors of the supply chain and its effect on deforestation. A robust penalty is included for historical deforestation, together with a corresponding reward for reductions in the rate of deforestation. The INternational FOrest Risk Model (INFORM) is a data analysis method that answers a particular question for any Forest Risk Commodity in a supply chain: what is its cumulative deforestation risk exposure? To illustrate the methodology, a case study is described and calculated for a livestock producer in France who sources soya-based animal feed from Brazil and wishes to document the deforestation risk associated with the product. Building on this example, the future applicability of INFORM within emerging supply-chain transparency initiatives is discussed, including clear shortcomings in the method and how it may also be used to motivate the production of better data by those who may be the subject of its analysis.
{"title":"The International Forest Risk Model (INFORM): A Method for Assessing Supply Chain Deforestation Risk with Imperfect Data","authors":"N. Caithness, Cécile Lachaux, D. Wallom","doi":"10.1109/eScience.2019.00009","DOIUrl":"https://doi.org/10.1109/eScience.2019.00009","url":null,"abstract":"A method for quantifiably estimating the deforestation risk exposure of agricultural Forest Risk Commodities in commercial supply chains is presented. The model consists of a series of equations applied using end-to-end data representing quantitative descriptors of the supply chain and its effect on deforestation. A robust penalty is included for historical deforestation and a corresponding reward for reductions in the rate of deforestation. The INternational FOrest Risk Model (INFORM) is a method for data analysis that answers a particular question for any Forest Risk Commodity in a supply chain: what is its cumulative deforestation risk exposure? To illustrate the methodology a case study of a livestock producer in France who sources soya-based animal feed from Brazil and wishes to document the deforestation risk associated with the product is described and calculated. Building on this example a discussion of the future applicability of INFORM within emerging supply-chain transparency initiatives is made including describing clear shortcomings in the method and how it may also be used to motivate the production of better data by those that may be subject of its analysis.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121692432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active Learning Yields Better Training Data for Scientific Named Entity Recognition
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00021
Roselyne B. Tchoua, Aswathy Ajith, Zhi Hong, Logan T. Ward, K. Chard, Debra J. Audus, Shrayesh Patel, Juan J. de Pablo, Ian T Foster
Despite significant progress in natural language processing, machine learning models require substantial expert-annotated training data to perform well in tasks such as named entity recognition (NER) and entity relation extraction. Furthermore, NER is often more complicated when working with scientific text. For example, in polymer science, chemical structure may be encoded using nonstandard naming conventions, the same concept can be expressed using many different terms (synonymy), and authors may refer to polymers with ad hoc labels. These challenges, which are not unique to polymer science, make it difficult to generate training data, as specialized skills are needed to label text correctly. We previously designed PolyNER, a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based classifier. PolyNER facilitates a labeling process that is otherwise tedious and expensive. Here, we use active learning to obtain more annotations from experts efficiently and improve performance. Our approach requires just five hours of expert time to achieve discrimination capacity comparable to that of a state-of-the-art chemical NER toolkit.
{"title":"Active Learning Yields Better Training Data for Scientific Named Entity Recognition","authors":"Roselyne B. Tchoua, Aswathy Ajith, Zhi Hong, Logan T. Ward, K. Chard, Debra J. Audus, Shrayesh Patel, Juan J. de Pablo, Ian T Foster","doi":"10.1109/eScience.2019.00021","DOIUrl":"https://doi.org/10.1109/eScience.2019.00021","url":null,"abstract":"Despite significant progress in natural language processing, machine learning models require substantial expertannotated training data to perform well in tasks such as named entity recognition (NER) and entity relations extraction. Furthermore, NER is often more complicated when working with scientific text. For example, in polymer science, chemical structure may be encoded using nonstandard naming conventions, the same concept can be expressed using many different terms (synonymy), and authors may refer to polymers with ad-hoc labels. These challenges, which are not unique to polymer science, make it difficult to generate training data, as specialized skills are needed to label text correctly. We have previously designed polyNER, a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based classifier. PolyNER facilitates a labeling process that is otherwise tedious and expensive. Here, we use active learning to efficiently obtain more annotations from experts and improve performance. Our approach requires just five hours of expert time to achieve discrimination capacity comparable to that of a state-of-the-art chemical NER toolkit.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124351486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Expanding Library Resources for Data and Compute-Intensive Education and Research
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00100
S. Labou, Reid Otsuji
As reproducible research tools and skills become increasingly in demand across disciplines, so too does the need for innovative and collaborative training. While some academic departments incorporate software such as R or Python in coursework and research, many students remain reliant on self-teaching to gain the skills needed to work with their data. However, given the growing number of students interested in computational tools and resources for research automation, relying on student self-teaching is not an efficient way to train the next generation of scholars. To address the educational need for computational thinking and learning across academic departments on campus, the UC San Diego Library has been running Software Carpentry workshops (two-day bootcamps that introduce foundational programming concepts and best practices) since 2015. The Library, as a discipline-agnostic entity with a history of serving as a trusted resource for information, has been well positioned to provide training for this new era of research methodology. The core of our success is the collaboration with the growing community of Software and Data Carpentry instructors at UC San Diego with expertise in various research disciplines. Building on this strong partnership and leveraging the Library’s resources and expertise in digital literacy, the campus can better support data-driven and technologically focused education and research.
{"title":"Expanding Library Resources for Data and Compute-Intensive Education and Research","authors":"S. Labou, Reid Otsuji","doi":"10.1109/eScience.2019.00100","DOIUrl":"https://doi.org/10.1109/eScience.2019.00100","url":null,"abstract":"As reproducible research tools and skills become increasingly in-demand across disciplines, so too does the need for innovative and collaborative training. While some academic departments incorporate software like R or Python in coursework and research, many students remain reliant on self-teaching in order to gain the necessary skills to work with their data. However, given the growing number of students interested in computational tools and resources for research automation, relying on student self-teaching and learning is not an efficient method for training the next generation of scholars. To address the educational need for computational thinking and learning across various academic departments on campus, the UC San Diego Library has been running Software Carpentry workshops (two day bootcamps to introduce foundational programming concepts and best practices) since 2015. The Library, as a discipline-agnostic entity with a history of serving as a trusted resource for information, has been well positioned to provide training for this new era of research methodology. The core of our success is the collaboration with the growing community of Software and Data Carpentry instructors at UC San Diego with expertise in various research disciplines. Building on this strong partnership and leveraging the Library’s resources and expertise in digital literacy, the campus can better support data-driven and technologically-focused education and research.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126348052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Encoding in Lossless Prediction-Based Compression Algorithms
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00032
Ugur Çayoglu, Frank Tristram, Jörg Meyer, J. Schröter, T. Kerzenmacher, P. Braesicke, A. Streit
The increase in compute power and the development of sophisticated simulation models with higher-resolution output trigger a need for compression algorithms for scientific data. Several compression algorithms are currently under development. Most of them are prediction-based: each value is predicted, and the residual between the prediction and the true value is saved to disk. Currently, there are two established forms of residual calculation: exclusive-or and numerical difference. In this paper, we summarize both techniques and show their strengths and weaknesses. We show that shifting the prediction and the true value to a binary number with certain properties results in a better compression factor at minimal additional computational cost. This gain in compression factor allows the use of less sophisticated prediction algorithms to achieve higher throughput during compression and decompression. In addition, we introduce a new encoding scheme that achieves a 9% increase in compression factor on average compared to the current state of the art.
{"title":"Data Encoding in Lossless Prediction-Based Compression Algorithms","authors":"Ugur Çayoglu, Frank Tristram, Jörg Meyer, J. Schröter, T. Kerzenmacher, P. Braesicke, A. Streit","doi":"10.1109/eScience.2019.00032","DOIUrl":"https://doi.org/10.1109/eScience.2019.00032","url":null,"abstract":"The increase in compute power and development of sophisticated simulation models with higher resolution output triggers a need for compression algorithms for scientific data. Several compression algorithms are currently under development. Most of these algorithms are using prediction-based compression algorithms, where each value is predicted and the residual between the prediction and true value is saved on disk. Currently there are two established forms of residual calculation: Exclusive-or and numerical difference. In this paper we will summarize both techniques and show their strengths and weaknesses. We will show that shifting the prediction and true value to a binary number with certain properties results in a better compression factor with minimal additional computational costs. This gain in compression factor allows for the usage of less sophisticated prediction algorithms to achieve a higher throughput during compression and decompression. In addition, we will introduce a new encoding scheme to achieve an 9% increase in compression factor on average compared to the current state-of-the-art.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132222961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Photon Propagation using GPUs by the IceCube Neutrino Observatory
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00050
D. Chirkin, J. C. Díaz-Vélez, C. Kopper, A. Olivas, B. Riedel, M. Rongen, D. Schultz, J. Santen
The IceCube Neutrino Observatory is a cubic-kilometer neutrino detector located at the South Pole, designed to detect high-energy astrophysical neutrinos. To thoroughly understand the detected neutrinos and their properties, the detector response to simulated signal and background has to be modeled using Monte Carlo techniques. An integral part of these studies is the optical properties of the ice into which the observatory is built. The propagation of individual photons from particles produced by neutrino interactions in the ice can be greatly accelerated using graphics processing units (GPUs). In this paper, we describe how we perform the photon propagation and how we create a global pool of GPU resources for both production and individual users.
{"title":"Photon Propagation using GPUs by the IceCube Neutrino Observatory","authors":"D. Chirkin, J. C. Díaz-Vélez, C. Kopper, A. Olivas, B. Riedel, M. Rongen, D. Schultz, J. Santen","doi":"10.1109/eScience.2019.00050","DOIUrl":"https://doi.org/10.1109/eScience.2019.00050","url":null,"abstract":"IceCube Neutrino Observatory is a cubic kilometer neutrino detector located at the South Pole designed to detect high-energy astrophysical neutrinos. To thoroughly understand the detected neutrinos and their properties, the detector response to simulated signal and background has to be modeled using Monte Carlo techniques. An integral part of these studies are the optical properties of the ice the observatory is built into. The propagation of individual photons from particles produced by neutrino interactions in the ice can be greatly accelerated using graphics processing units (GPUs). In this paper, we will describe how we perform the photon propagation and create a global pool of GPU resources for both production and individual users.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131788875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Increasing Life Science Resources Re-Usability using Semantic Web Technologies
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00031
Marine Louarn, F. Chatonnet, Xavier Garnier, T. Fest, A. Siegel, O. Dameron
In the life sciences, current standardization and integration efforts are directed towards reference data and knowledge bases. However, the results of original studies are generally provided in non-standardized, study-specific formats. In addition, the formalization of analysis pipelines is often limited to textual descriptions in the methods sections. Both factors impair the reproducibility of results, their maintenance, and their reuse for advancing other studies. Semantic Web technologies have proven effective for facilitating the integration and reuse of reference data and knowledge bases. We thus hypothesize that Semantic Web technologies also facilitate the reproducibility and reuse of life-science studies involving pipelines that compute associations between entities according to intermediary relations and dependencies. To assess this hypothesis, we considered a case study in systems biology (http://regulatorycircuits.org), which provides tissue-specific regulatory interaction networks to elucidate perturbations across complex diseases. Our approach consisted of surveying the complete set of provided supplementary files to reveal the underlying structure between the biological entities described in the data. We relied on this structure and used Semantic Web technologies (i) to integrate the Regulatory Circuits data, and (ii) to formalize the analysis pipeline as SPARQL queries. The result was a dataset of 335,429,988 triples on which two SPARQL queries were sufficient to extract each tissue-specific regulatory network.
{"title":"Increasing Life Science Resources Re-Usability using Semantic Web Technologies","authors":"Marine Louarn, F. Chatonnet, Xavier Garnier, T. Fest, A. Siegel, O. Dameron","doi":"10.1109/eScience.2019.00031","DOIUrl":"https://doi.org/10.1109/eScience.2019.00031","url":null,"abstract":"In life sciences, current standardization and integration efforts are directed towards reference data and knowledge bases. However, original studies results are generally provided in non standardized and specific formats. In addition, the only formalization of analysis pipelines is often limited to textual descriptions in the method sections. Both factors impair the results reproducibility, their maintenance and their reuse for advancing other studies. Semantic Web technologies have proven their efficiency for facilitating the integration and reuse of reference data and knowledge bases. We thus hypothesize that Semantic Web technologies also facilitate reproducibility and reuse of life sciences studies involving pipelines that compute associations between entities according to intermediary relations and dependencies. In order to assess this hypothesis, we considered a case-study in systems biology (http://regulatorycircuits.org), which provides tissue-specific regulatory interaction networks to elucidate perturbations across complex diseases. Our approach consisted in surveying the complete set of provided supplementary files to reveal the underlying structure between the biological entities described in the data. We relied on this structure and used Semantic Web technologies (i) to integrate the Regulatory Circuits data, and (ii) to formalize the analysis pipeline as SPARQL queries. Our result was a 335,429,988 triples dataset on which two SPARQL queries were sufficient to extract each single tissuespecific regulatory network.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131421232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
dislib: Large Scale High Performance Machine Learning in Python
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00018
J. Á. Cid-Fuentes, S. Solà, Pol Álvarez, A. Castro-Ginard, Rosa M. Badia
In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. This can be leveraged in numerous research areas, such as genomics, earth sciences, and astrophysics, to gain valuable insight. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard to use by non-experts, and are difficult to set up in high performance computing clusters. These limitations have prevented scientists from exploiting the full potential of machine learning in their research. In this paper, we present and evaluate dislib, a distributed machine learning library on top of the PyCOMPSs programming model that addresses the limitations of other existing libraries. In our evaluation, we show that dislib can be up to 9 times faster and can process data sets up to 16 times larger than other popular distributed machine learning libraries, such as MLlib. In addition, we show how dislib can be used to reduce the computation time of a real scientific application from 18 hours to 17 minutes.
{"title":"dislib: Large Scale High Performance Machine Learning in Python","authors":"J. '. Cid-Fuentes, S. Solà, Pol Álvarez, A. Castro-Ginard, Rosa M. Badia","doi":"10.1109/eScience.2019.00018","DOIUrl":"https://doi.org/10.1109/eScience.2019.00018","url":null,"abstract":"In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. This can be leveraged in numerous research areas, such as genomics, earth sciences, and astrophysics, to gain valuable insight. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard to use by non-experts, and are difficult to set up in high performance computing clusters. These limitations have prevented scientists to exploit the full potential of machine learning in their research. In this paper, we present and evaluate dislib, a distributed machine learning library on top of PyCOMPSs programming model that addresses the issues of other existing libraries. In our evaluation, we show that dislib can be up to 9 times faster, and can process data sets up to 16 times larger than other popular distributed machine learning libraries, such as MLlib. In addition to this, we also show how dislib can be used to reduce the computation time of a real scientific application from 18 hours to 17 minutes.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115795067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}