Application of Contrastive Multiview Coding in Audio Classification
Milomir Babić, V. Risojevic
2022 21st International Symposium INFOTEH-JAHORINA (INFOTEH), pp. 1-5, published 2022-03-16
DOI: 10.1109/INFOTEH53737.2022.9751326
The emergence of deep learning methods during the last decade has led to a revolution in machine learning and significant improvements in results across various fields. Initially, these methods were based on supervised learning, but as development progressed, the limitations stemming from the dependence on labeled datasets became apparent. Data labeling is an expensive, laborious, and error-prone process that is hard to automate. All of this hinders training, especially in applications where large amounts of labeled data are not available. This motivated the development of various unsupervised methods that aim to exploit the wide availability of unlabeled datasets. These methods replace manual labels with relationships between data samples that can be created automatically. In this paper we examine one such unsupervised method, contrastive multiview coding, and its application to audio classification, by adapting an implementation from the field of digital image processing. We show that this method yields models that can be used for feature extraction, or fine-tuned for different downstream tasks, to achieve results that surpass those obtained through purely supervised learning. We also investigate how the domain and size of the unlabeled dataset, as well as the size of the downstream dataset, affect the results achieved in downstream tasks using frozen and fine-tuned feature extractors.
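The paper's implementation is not reproduced here, but the core idea of contrastive multiview coding — pulling together embeddings of different views of the same sample while pushing apart embeddings of different samples — is typically realized with an InfoNCE-style contrastive loss. The following is a minimal NumPy sketch of such a loss under that assumption; the function name, temperature value, and toy data are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style contrastive loss between two views of the same batch.

    view_a, view_b: (batch, dim) embeddings; row i of each matrix is a view
    of sample i. Matching rows are positive pairs; all other pairings in the
    batch act as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    # Row-wise softmax cross-entropy with the diagonal (positive pair) as target
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy check: correctly aligned views should incur a lower loss than random ones
rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(anchor, anchor + 0.01 * rng.normal(size=(8, 16)))
loss_random = info_nce_loss(anchor, rng.normal(size=(8, 16)))
```

In an audio setting, the two views might be, for example, different augmentations or channels of the same clip, encoded by the same or separate networks; minimizing this loss trains the encoder without any manual labels, after which it can be frozen or fine-tuned for a downstream classifier, as the paper investigates.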