Introduction
The current effort towards the digital transformation across multiple scientific domains requires data that is Findable, Accessible, Interoperable and Reusable (FAIR). In addition to the FAIR data, what is required for the application of computational tools, such as Quantitative Structure Activity Relationships (QSARs), is a sufficient data volume and the ability to merge sources into homogeneous digital assets. In the nanosafety domain there is a lack of FAIR available metadata.
Methodology
To address this challenge, we utilized 34 datasets from the nanosafety domain by exploiting the NanoSafety Data Reusability Assessment (NSDRA) framework, which allowed the annotation and assessment of dataset's reusability. From the framework's application results, eight datasets targeting the same endpoint (i.e. numerical cellular viability) were selected, processed and merged to test several hypothesis including universal versus nanogroup-specific QSAR models (metal oxide and nanotubes), and regression versus classification Machine Learning (ML) algorithms.
Results
Universal regression and classification QSARs reached an 0.86 R2 and 0.92 accuracy, respectively, for the test set. Nanogroup-specific regression models reached 0.88 R2 for nanotubes test set followed by metal oxide (0.78). Nanogroup-specific classification models reached 0.99 accuracy for nanotubes test set, followed by metal oxide (0.91). Feature importance revealed different patterns depending on the dataset with common influential features including core size, exposure conditions and toxicological assay.
Even in the case where the available experimental knowledge was merged, the models still failed to correctly predict the outputs of an unseen dataset, revealing the cumbersome conundrum of scientific reproducibility in realistic applications of QSAR for nanosafety. To harness the full potential of computational tools and ensure their long-term applications, embracing FAIR data practices is imperative in driving the development of responsible QSAR models.
Conclusions
This study reveals that the digitalization of nanosafety knowledge in a reproducible manner has a long way towards its successful pragmatic implementation. The workflow carried out in the study shows a promising approach to increase the FAIRness across all the elements of computational studies, from dataset's annotation, selection, merging to FAIR modeling reporting. This has significant implications for future research as it provides an example of how to utilize and report different tools available in the nanosafety knowledge system, while increasing the transparency of the results. One of the main benefits of this workflow is that it promotes data sharing and reuse, which is essential for advancing scientific knowledge by making data and metadata FAIR compliant. In addition, the increased transparen