Sandi Baressi Segota, N. Anđelić, I. Lorencin, J. Musulin, D. Štifanić, Z. Car
Preparation of Simplified Molecular Input Line Entry System Notation Datasets for use in Convolutional Neural Networks
2021 IEEE 21st International Conference on Bioinformatics and Bioengineering (BIBE), published 2021-10-25
DOI: 10.1109/BIBE52308.2021.9635320 (https://doi.org/10.1109/BIBE52308.2021.9635320)
Citations: 3
Abstract
Simplified Molecular Input Line Entry System (SMILES) is a chemical notation that represents chemical structures in a form easily read by computer programs. This allows many techniques, such as Artificial Neural Networks (ANNs), to be applied to SMILES-formatted data. Convolutional Neural Networks (CNNs), designed to work on images or matrix-shaped data, are among the highest-performing ANN types. This paper presents the preparation of a SMILES dataset for use by CNNs. It starts with a brief description of the SMILES format, followed by an explanation of the dataset transformation into an NPY matrix-based format, and gives an example of its use by applying popular CNN architectures to a transformed dataset. The proposed architecture achieves satisfactory results (AUC=0.92), and the transformation algorithm is also satisfactorily fast (0.08 seconds per data point).
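The abstract does not specify how SMILES strings are mapped to matrices, so the following is only a minimal sketch of one common approach, not the authors' algorithm. It assumes a character-level one-hot encoding, zero-padding to the longest string, and NumPy's `np.save` for the NPY output. The function names (`build_vocabulary`, `smiles_to_matrix`, `encode_dataset`), the three sample molecules, and the output file name are hypothetical and chosen for illustration.

```python
import os
import tempfile

import numpy as np


def build_vocabulary(smiles_list):
    """Map every character seen in the dataset to a row index."""
    chars = sorted({ch for smi in smiles_list for ch in smi})
    return {ch: i for i, ch in enumerate(chars)}


def smiles_to_matrix(smiles, vocab, max_len):
    """One-hot encode one SMILES string as a (len(vocab), max_len) matrix.

    Assumption: one column per character; columns past the end of the
    string stay all-zero and act as padding.
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    matrix = np.zeros((len(vocab), max_len), dtype=np.uint8)
    for col, ch in enumerate(smiles):
        matrix[vocab[ch], col] = 1
    return matrix


def encode_dataset(smiles_list):
    """Encode all strings into one array of shape (n_samples, n_chars, max_len)."""
    vocab = build_vocabulary(smiles_list)
    max_len = max(len(smi) for smi in smiles_list)
    data = np.stack([smiles_to_matrix(smi, vocab, max_len) for smi in smiles_list])
    return data, vocab


# Illustrative molecules: ethanol, benzene, acetic acid.
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
data, vocab = encode_dataset(smiles_list)

# Store the encoded dataset in NPY format and read it back.
path = os.path.join(tempfile.mkdtemp(), "smiles_onehot.npy")
np.save(path, data)
restored = np.load(path)
```

A character-level vocabulary splits two-letter atoms such as `Cl` or `Br` into separate symbols, so a real pipeline would likely need a proper SMILES tokenizer. The resulting array can be given a trailing channel axis (for example `data[..., np.newaxis]`) before being passed to an image-style CNN.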