Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction
Jin Jie Sean Yeo, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan
arXiv - EE - Audio and Speech Processing, 2024-09-18
https://doi.org/arxiv-2409.11964
Abstract
In this technical report, we describe the SNTL-NTU team's submission for Task 1,
Data-Efficient Low-Complexity Acoustic Scene Classification, of the Detection
and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Three
systems are introduced to handle training splits of different sizes. For the
small training splits, we reduce the complexity of the provided baseline model
by reducing its number of base channels. We introduce data augmentation in the
form of mixup to increase the diversity of training samples. For the larger
training splits, we use FocusNet to provide confusing-class information to an
ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and
baseline models trained at the original sampling rate of 44.1 kHz. We then use
knowledge distillation to distill the ensemble into the baseline student model.
Trained on the TAU Urban Acoustic Scenes 2022 Mobile development dataset, the
best of the three systems on each split achieved average test accuracies of
62.21%, 59.82%, 56.81%, 53.03%, and 47.97% on the 100%, 50%, 25%, 10%, and 5%
splits, respectively.
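
The abstract names mixup and knowledge distillation but gives no hyperparameters or loss details. The following is a minimal sketch of the two generic techniques in plain Python, not the authors' implementation. The function names, the `alpha` and `temperature` defaults, and the choice to average the softened probabilities of the ensemble members are all assumptions made for illustration.

```python
import math
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.3, rng=random):
    """Blend two examples and their one-hot labels.

    The mixing weight is drawn from Beta(alpha, alpha); alpha=0.3 is an
    assumed value, not one taken from the report.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The teacher distribution is the mean of the softened probabilities of
    every ensemble member (an assumed way of combining the ensemble). The
    result is scaled by temperature**2, as is common in distillation.
    """
    n_members = len(teacher_logits_list)
    member_probs = [softmax(t, temperature) for t in teacher_logits_list]
    teacher = [
        sum(p[i] for p in member_probs) / n_members
        for i in range(len(student_logits))
    ]
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)
    return temperature ** 2 * kl
```

In a training loop, a term like `distillation_loss` would typically be combined with an ordinary cross-entropy term on the ground-truth (or mixed) labels; the weighting between the two is another detail the abstract does not specify.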