SeisAug: A data augmentation python toolkit
D. Pragnath, G. Srijayanthi, Santosh Kumar, Sumer Chopra
Applied Computing and Geosciences, Volume 25, Article 100232. Published 2025-02-01. DOI: 10.1016/j.acags.2025.100232
https://www.sciencedirect.com/science/article/pii/S259019742500014X
Abstract
A common limitation in applying deep learning and machine learning techniques is the scarcity of labelled data, which can be addressed through data augmentation (DA). SeisAug is a Python DA toolkit that addresses this challenge in seismological studies. DA helps to balance the imbalanced classes of a dataset by creating more examples of under-represented classes. It also significantly mitigates overfitting by increasing the volume of training data and introducing variability, thereby improving the model's performance on unseen data. Given the rapid advancements in deep learning for seismology, SeisAug supports extensibility by generating a substantial amount of data (2–6 times more data), which can aid in developing an indigenous, robust model. This study further demonstrates the role of DA in developing a robust model, using a basic two-class identification model that distinguishes earthquake (signal) from noise (non-earthquake). The model is trained on the original dataset and on datasets augmented 1 and 5 times, and the performance metrics of each are evaluated. The model trained on the 5X augmented dataset significantly outperforms the model trained on the original dataset, with an accuracy of 0.991, AUC of 0.999 and AUC-PR of 0.999, compared with an accuracy of 0.50, AUC of 0.75 and AUC-PR of 0.80. Furthermore, because all code is available on GitHub, the toolkit makes DA techniques easy to apply, enabling end-users to enhance their seismological waveform datasets effectively and overcome the initial drawbacks posed by the scarcity of labelled data.
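To make the idea of waveform augmentation concrete, the following is a minimal sketch in Python with NumPy. It is not SeisAug's API: the function names (add_gaussian_noise, time_shift, scale_amplitude, augment_dataset), the three transforms chosen, and all parameter values are assumptions made for illustration, and the traces are synthetic random data rather than real seismograms. It also assumes that "augmented N times" means appending N augmented copies of every trace to the original set, which the abstract does not spell out.

```python
import numpy as np


def add_gaussian_noise(wave, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)


def time_shift(wave, max_shift, rng):
    """Circularly shift the trace by a random number of samples."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(wave, shift)


def scale_amplitude(wave, low, high, rng):
    """Multiply the trace by a random gain drawn from [low, high)."""
    return wave * rng.uniform(low, high)


def augment_dataset(waves, labels, n_copies, seed=0):
    """Return the originals followed by n_copies augmented variants of each trace.

    Labels are carried over unchanged, since these transforms are assumed
    not to change the class (earthquake vs. noise) of a trace.
    """
    rng = np.random.default_rng(seed)
    all_waves, all_labels = [waves], [labels]
    for _ in range(n_copies):
        batch = []
        for wave in waves:
            noisy = add_gaussian_noise(wave, snr_db=10.0, rng=rng)
            shifted = time_shift(noisy, max_shift=50, rng=rng)
            batch.append(scale_amplitude(shifted, low=0.5, high=1.5, rng=rng))
        all_waves.append(np.stack(batch))
        all_labels.append(labels)
    return np.concatenate(all_waves), np.concatenate(all_labels)


# Synthetic stand-in data: 4 traces of 1000 samples, labels 1 = signal, 0 = noise.
waves = np.random.default_rng(42).normal(size=(4, 1000))
labels = np.array([1, 0, 1, 0])

aug_waves, aug_labels = augment_dataset(waves, labels, n_copies=5)
print(aug_waves.shape)  # (24, 1000): 4 originals + 5 * 4 augmented traces
```

Under this assumed convention, n_copies=1 doubles the dataset and n_copies=5 yields six times the original number of traces. Applying augmentation only to the training split, never to validation or test data, keeps the evaluation metrics comparable across the original and augmented runs.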