Integrated Hierarchical and Flat Classifiers for Food Image Classification using Epistemic Uncertainty
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840761
Vishwesh Pillai, Pranav Mehar, M. Das, Deep Gupta, P. Radeva
The problem of food image recognition is an essential one in today’s context because health conditions such as diabetes, obesity, and heart disease require constant monitoring of a person’s diet. To automate this process, several models are available to recognize food images. Due to the considerable number of unique dishes across cuisines, a traditional flat classifier ceases to perform well. To address this issue, prediction schemes combining flat and hierarchical classifiers have been proposed, with an analysis of epistemic uncertainty used to switch between the classifiers. However, the accuracy of predictions made using epistemic uncertainty data remains considerably low. This paper therefore presents a prediction scheme with three different threshold criteria that help to increase the accuracy of epistemic-uncertainty-based predictions. The performance of the proposed method is demonstrated through several experiments on the MAFood-121 dataset. The experimental results validate the performance of the proposal and show that the proposed threshold criteria increase the overall accuracy of the predictions by correctly classifying the uncertainty distribution of the samples.
{"title":"Integrated Hierarchical and Flat Classifiers for Food Image Classification using Epistemic Uncertainty","authors":"Vishwesh Pillai, Pranav Mehar, M. Das, Deep Gupta, P. Radeva","doi":"10.1109/SPCOM55316.2022.9840761","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840761","url":null,"abstract":"The problem of food image recognition is an essential one in today’s context because health conditions such as diabetes, obesity, and heart disease require constant monitoring of a person’s diet. To automate this process, several models are available to recognize food images. Due to a considerable number of unique food dishes and various cuisines, a traditional flat classifier ceases to perform well. To address this issue, prediction schemes consisting of both flat and hierarchical classifiers, with the analysis of epistemic uncertainty are used to switch between the classifiers. However, the accuracy of the predictions made using epistemic uncertainty data remains considerably low. Therefore, this paper presents a prediction scheme using three different threshold criteria that helps to increase the accuracy of epistemic uncertainty predictions. The performance of the proposed method is demonstrated using several experiments performed on the MAFood-121 dataset. The experimental results validate the proposal performance and show that the proposed threshold criteria help to increase the overall accuracy of the predictions by correctly classifying the uncertainty distribution of the samples.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129322953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigating Synchronized Optical Ballistocardiography vs Electrocardiography for Pathological and Healthy Adults
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840778
Prachee Priyadarshinee, Yixian Tan, Cindy Ming Ying Lin, Christopher Johann Clarke, T. BalamuraliB., Enyi Tan, V. Tan, S. Chai, C. Yeo, Jer-Ming Chen
We investigated the relationship between optical BCG and ECG signals measured simultaneously for the same heartbeat cycles. Despite the long history of BCG, earlier studies compared BCG and ECG features across heartbeat cycles (inter-heartbeat) but not within a heartbeat cycle (intra-heartbeat). The non-invasively derived BCG signal was found to have a remarkable, previously unreported relationship with the arterial pressure signal. We synchronized the two disparate modalities to within an estimated uncertainty of 50-70 ms, which allowed us to compare features within the heart cycle (which may be related to arterial pressure) for one pathological and four healthy subjects lying supine, and found the relationship consistent regardless of breathing condition, gender, and health status. Although not a one-to-one correlation, we show that optical BCG is a convenient, unobtrusive, and complementary modality for monitoring cardiac activity alongside the well-established ECG.
{"title":"Investigating Synchronized Optical Ballistocardiography vs Electrocardiography for Pathological and Healthy Adults","authors":"Prachee Priyadarshinee, Yixian Tan, Cindy Ming Ying Lin, Christopher Johann Clarke, T. BalamuraliB., Enyi Tan, V. Tan, S. Chai, C. Yeo, Jer-Ming Chen","doi":"10.1109/SPCOM55316.2022.9840778","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840778","url":null,"abstract":"We investigated the relationship between optical BCG and ECG signals measured simultaneously for the same heartbeat cycles. Despite the long history of BCG, earlier studies compared BCG and ECG features across large time cycles (inter-heartbeat), but not within the heartbeat cycle (intra-heartbeat). The non-invasively derived BCG signal was found to have a remarkable relationship with the arterial pressure signal, which has not been previously reported. We achieved synchronization of the two disparate modalities to within an estimated uncertainty of 50-70 ms, which allowed us to compare features within the heart cycle (which may be related to the arterial pressure) for one pathological and four healthy subjects lying supine, and found it consistent regardless of their breathing condition, gender and health status. Although not a one-to-one correlation, we show optical BCG proves to be a convenient and an unobtrusive and complementary modality to monitor cardiac activity alongside the well-established ECG.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126870212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Efficient Framework to Automatic Extract EOG Artifacts from Single Channel EEG Recordings
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840849
Murali Krishna Yadavalli, V. K. Pamula
In healthcare applications, portable electroencephalogram (EEG) systems are frequently used to record and process brain signals owing to their ease of use and low cost. The electrooculogram (EOG), the eye-blink signal, is the major high-amplitude, low-frequency artifact, and it misleads disease diagnosis. Hence, there is demand for artifact-removal techniques in portable single-channel EEG devices. This work presents automatic extraction of the EOG artifact by integrating Fluctuation-based Dispersion Entropy (FDispEn) with Singular Spectrum Analysis (SSA) and an Adaptive Noise Canceller (ANC). The proposed model identifies the artifact signal component based on entropy values at different SNRs and removes it with the ANC for better performance. Unlike the existing DWT-, SSA-, and adaptive-SSA-based methods combined with ANC, this method avoids dependency on a threshold to identify the artifact subspace. The proposed method is evaluated on synthetic and real EEG datasets and eliminates the eye-blink artifact while preserving the low-frequency EEG content. The proposed method shows superior performance metrics compared with existing algorithms.
{"title":"An Efficient Framework to Automatic Extract EOG Artifacts from Single Channel EEG Recordings","authors":"Murali Krishna Yadavalli, V. K. Pamula","doi":"10.1109/SPCOM55316.2022.9840849","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840849","url":null,"abstract":"In health care applications portable electroencephalogram (EEG) systems are frequently used to record and process the brain signals due to easy of use and low cost. Electrooculogram (EOG) is the major high amplitude low frequency artifact eye blink signal, which misleads the diagnosis activity of decease. Hence there is demand for artifact remove techniques in portable single EEG devices. In this work presented automatic extraction of EOG artifact by integrating Fluctuation based Dispersion Entropy (FDispEn) with Singular Spectral Analysis (SSA) and Adaptive noise canceller(ANC). The proposed model successfully identifies artifact signal component based on entropy values at different SNR and remove it with ANC for better performance. This method avoid the dependency on threshold to identify artifact subspace unlike previous existed DWT,SSA and Adaptive SSA methods combined with ANC. Proposed method is evaluated on synthetic data and real EEG data set and eliminate eyeblink artifact by preserving the low frequency EEG content. The performance of proposed method shows superiority in performance metrics over existing algorithms.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126111397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sensitivity Analysis of MaskCycleGAN based Voice Conversion for Enhancing Cleft Lip and Palate Speech Recognition
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840769
S. Bhattacharjee, R. Sinha
Cleft lip and palate (CLP) is a congenital disorder that deforms an individual's speech. As a result, such speech is not amenable to speech recognition systems. Existing work on CLP speech enhancement uses CycleGAN-VC-based non-parallel voice conversion. However, CycleGAN-VC cannot capture time-frequency structures, which MaskCycleGAN-VC can, through a module called time-frequency adaptive normalization. It also has the added advantage of performing mel-spectrogram conversion rather than mel-spectrum conversion. Converting CLP speech to normal speech increases intelligibility and thereby allows automatic speech recognition systems to predict the uttered sentences, which matters in day-to-day life as speech recognition devices automate living on a large scale. However, to develop such an assistive technology, it is essential to study the sensitivity of automatic speech recognizers. This work focuses on the sensitivity analysis of a MaskCycleGAN-based voice conversion system under variations of acoustic and gender mismatch.
{"title":"Sensitivity Analysis of MaskCycleGAN based Voice Conversion for Enhancing Cleft Lip and Palate Speech Recognition","authors":"S. Bhattacharjee, R. Sinha","doi":"10.1109/SPCOM55316.2022.9840769","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840769","url":null,"abstract":"Cleft lip and palate speech (CLP) is a congenital disorder which deforms the speech of an individual. As a result their speech is not amenable to the speech recognition systems. The existing work on CLP speech enhancement is by using CycleGAN-VC based non-parallel voice conversion method. However, CycleGAN-VC cannot capture the time-frequency structures which can be done by MaskCycleGAN-VC by application of a module named as time-frequency adaptive normalization. It also has the added advantage of mel-spectrogram conversion rather than mel-spectrum conversion. This voice conversion of a CLP speech to a normal speech increases the intelligibility and thereby allows automatic speech recognition systems to predict the uttered sentences which is necessary in day to day life as speech recognition devices are automatizing living on a large scale. But in order to develop an assistive technology it is very essential to study the sensitivity of automatic speech recognizers. This work focuses on the sensitivity analysis of a MaskCycleGAN based voice conversion system depending on the variation of acoustic and gender mismatch.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132709447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smart Beam Steering with a Slot-Loaded Miniaturized Patch Antenna
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840797
Paramita Saha, J. Das, P. Venkateswaran
In this paper, a smart beam steering technique is investigated with a miniaturized patch antenna at 2.45 GHz. The antenna is miniaturized with an asymmetrically placed slot on the patch that significantly reduces the frequency of operation. It radiates with $23\%$ of the area of a conventional patch ($0.15\lambda \times 0.11\lambda$, where $\lambda$ is the free-space wavelength at 2.45 GHz) at the operating frequency with the same dielectric properties. A $1 \times 8$ antenna array is also designed with a power divider and phase shifter circuit to demonstrate beam steering. In numerical investigations, a maximum steering angle of $22^{\circ}$ is observed in the azimuthal plane.
{"title":"Smart Beam Steering with a Slot-Loaded Miniaturized Patch Antenna","authors":"Paramita Saha, J. Das, P. Venkateswaran","doi":"10.1109/SPCOM55316.2022.9840797","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840797","url":null,"abstract":"In this paper, a smart beam steering technique is investigated with a miniaturized patch antenna at 2.45 GHz. The antenna is miniaturized with an asymmetrically placed slot on the patch that reduces the frequency of operation significantly. It radiates with $23 %$ area $(0.15 lambda X 0.11 lambda, lambda$ is the freespace wavelength at 2.45 GHz) of a conventional patch at the operating frequency with the same dielectric properties. A $1 times 8$ antenna array is also designed with a power divider and phase shifter circuit to demonstrate a beam steering operation. With numerical investigation, a maximum of 220 steering angle is observed in the azimuthal plane.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133299825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extracting Video-Based Breath Signal For Detection of Out-of-breath Speech
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840788
Sibasis Sahoo, S. Dandapat
A cost-effective video-based breath signal extraction method is described in this work. It does not require any sophisticated instrument; instead, it uses devices such as mobile phones, headphones, and computers that are readily available to an individual. To this end, a new database is created containing read-speech utterances and video signals under neutral and post-exercise (out-of-breath) conditions. The breath signals of most speakers exhibit higher strength in both the inhalation and exhalation phases of the breathing cycle under out-of-breath conditions. Additionally, the average duration of the breath cycle decreases under the same condition, a reduction driven mainly by the exhalation phase. The ability of the breath features to distinguish the neutral and out-of-breath classes is verified with support vector machine and logistic regression classifiers. The performance of both classifiers in terms of unweighted average recall and F1-score improved to $\approx 70\%$ after combining the breath features with the baseline MFCC features.
{"title":"Extracting Video-Based Breath Signal For Detection of Out-of-breath Speech","authors":"Sibasis Sahoo, S. Dandapat","doi":"10.1109/SPCOM55316.2022.9840788","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840788","url":null,"abstract":"A cost-effective video signal based breath signal extraction method is described in this work. It does not require any sophisticated instrument; instead uses devices like mobile phones, headphones and computers that are readily available to an individual. For the same, a new database is created having read-speech utterances and video signals under the neutral and the post-exercise (or known as out-of-breath) conditions. The breath signals for most of the speakers exhibit a higher strength for both inhalation and exhalation phases of the breathing cycle under out-of-breath conditions. Additionally, the average duration of the breath cycle decreases for the same condition. The exhalation phase mainly influences the above time reduction. The ability of the breath features for distinguishing the neutral and the out-of-breath class is verified by the support vector machine and the logistic regression classifiers. The performance of both the classifiers in terms of unweighted average recall and Fl-score improved to $approx$ 70% after combining the above breath features with the MFCC baseline features.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134397552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Patch Level Segmentation and Visualization of Capsule Network Inference for Breast Metastases Detection
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840781
Malviya Dutta Richa, Sk. Arif Ahmed, D. P. Dogra, P. Dan
Capsule networks are becoming popular for developing AI-guided medical diagnostic tools. The objective of this paper is to carve out a strategy that solves the dual problems of classification and segmentation of metastatic tissue regions in a single pipeline. To accomplish this, capsule networks with variational Bayes routing are used to classify normal and metastatic tissue regions from breast cancer whole-slide images. Thereafter, a high-level segmentation of the metastatic tissue region is carried out using the classified patches. The results obtained with a set of 75,000 patches show that patch-level segmentation is an efficient method to delineate metastatic regions. From the perspective of end-users, visualization of results plays a significant role in selecting the appropriate method for their applications. Capsule networks mimic the way the human brain works, and clinicians have long demanded that algorithms used for automatic classification of cancer pathology be interpretable; such a method is therefore more acceptable in clinical practice. Efficient region segmentation would aid clinicians in readily demarcating the areas of interest and greatest relevance.
{"title":"Patch Level Segmentation and Visualization of Capsule Network Inference for Breast Metastases Detection","authors":"Malviya Dutta Richa, Sk. Arif Ahmed, D. P. Dogra, P. Dan","doi":"10.1109/SPCOM55316.2022.9840781","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840781","url":null,"abstract":"Capsule Networks are becoming popular for developing AI-guided medical diagnostic tools. The objective of this paper is to carve out a strategy to solve dual problems of classification and segmentation of metastatic tissue regions in one single pipeline. To accomplish this, an attempt has been made in this paper to utilize capsule networks with variational Baye’s routing to classify normal and metastatic tissue regions from breast cancer whole slide images. Thereafter, a high-level segmentation of the metastatic tissue region has been carried out using the classified patches. The results obtained with a set of 75,000 patches show that patch-level segmentation is an efficient method to delineate metastatic regions. In the prospect of end-users, visualization of results plays a significant role in selecting the appropriate method for their applications. Capsule networks mimic the way the human brain works. For long, it has been the demand from clinicians that the algorithms used for the automatic classification of cancer pathology should be interpretable. Thus, in clinical practice, such a method will be more acceptable. The efficient region segmentation would aid clinicians in readily demarcating the area of interest and the area of most relevance.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134398337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smart Device Localization Under α-KMS Fading Environment using Feedback Distance based Gradient Ascent
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840756
Aditya Sing, Ankur Pandey, Sudhir Kumar
In this paper, we propose a novel method for location estimation of smart devices in a generic shadowed $\alpha$-$\kappa$-$\mu$ distribution based $\alpha$-KMS fading environment, which has not hitherto been considered for localization. Most existing path-loss-based methods use only a standard log-normal model for localization; however, fading effects need to be considered to model the Received Signal Strength (RSS) values appropriately. Some localization methods adopt standard fading models such as Rayleigh, Nakagami-m, and Rician, to name a few; however, such assumptions lead to erroneous location estimates. The generic location estimator is applicable to all environments and provides accurate location estimates given correct estimates of the $\alpha$-$\kappa$-$\mu$ parameters. We propose a feedback-induced gradient ascent algorithm based on feedback distance that maximizes the derived log-likelihood of the actual location. The proposed method also addresses the non-convex nature of the maximum likelihood estimator and is computationally efficient. The performance is evaluated on a simulated testbed, and the localization results outperform existing state-of-the-art methods.
{"title":"Smart Device Localization Under α-KMS Fading Environment using Feedback Distance based Gradient Ascent","authors":"Aditya Sing, Ankur Pandey, Sudhir Kumar","doi":"10.1109/SPCOM55316.2022.9840756","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840756","url":null,"abstract":"In this paper, we propose a novel method for location estimation of smart devices considering a generic shadowed $alpha-kappa-mu$ distribution based $alpha$-KMS fading environment, which is not considered for localization hitherto. Most of the existing path loss-based methods utilize a standard log-normal model only for localization; however, fading effects need to be considered to appropriately model the Received Signal Strength (RSS) values. Some of the localization methods utilize standard fading models such as Rayleigh, Nakagami-m, and Rician, to name a few; however, such assumptions lead to erroneous location estimates. The generic location estimator is applicable for all environments and provides accurate location estimates with correct estimates of $alpha-kappa-mu$. We propose a feedback-induced gradient ascent algorithm based on feedback distance that maximizes the derived log-likelihood estimate of the actual location. The proposed method also addresses the non-convex nature of the maximum likelihood estimator and is computationally efficient. The performance is evaluated on a simulated testbed, and the localization results outperform existing state-of-the-art methods.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"349 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134075695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Issues in Sub-Utterance Level Language Identification in a Code Switched Bilingual Scenario
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840813
Jagabandhu Mishra, Joshitha Gandra, Vaishnavi Patil, S. Prasanna
Sub-utterance level language identification (SLID) is the automatic process of recognizing the spoken languages in a code-switched (CS) utterance at the sub-utterance level. The nature of CS utterances suggests that the primary language occupies a significantly larger share of the duration than the secondary. In a CS utterance, a single speaker speaks both languages; hence, the phoneme-level acoustic characteristics (sub-segmental and segmental evidence) of the secondary language are largely biased towards the primary. This leads to the hypothesis that an acoustic-based language identification system trained on CS data may end up with performance biased towards the primary language. This study confirms the hypothesis by examining the confusion matrices of earlier proposed approaches. At the same time, language discrimination can also be done at the suprasegmental level, by capturing language-specific phonemic temporal evidence. Hence, to resolve the biasing issue, this study proposes a wav2vec2-based approach that captures suprasegmental phonemic temporal patterns in the pre-training stage and merges them to capture language-specific suprasegmental evidence in the fine-tuning stage. The experimental results show that the proposed approach resolves the issue to some extent. As the fine-tuning stage uses a discriminative approach, weighted loss and secondary-language augmentation methods can be explored in the future for further performance improvement. Index Terms: Code switched (CS) bilingual speech, Sub-utterance level language identification (SLID), wav2vec2, Deepspeech2.
{"title":"Issues in Sub-Utterance Level Language Identification in a Code Switched Bilingual Scenario","authors":"Jagabandhu Mishra, Joshitha Gandra, Vaishnavi Patil, S. Prasanna","doi":"10.1109/SPCOM55316.2022.9840813","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840813","url":null,"abstract":"Sub-utterance level language identification (SLID) is an automatic process of recognizing the spoken language in a code switched (CS) utterance at the sub-utterance level. The nature of CS utterances suggest the primary language has a significant duration of occurrence over the secondary. In a CS utterance, a single speaker speaks both the languages. Hence the phoneme-level acoustic characteristic (sub-segmental and segmental evidence) of the secondary language is mostly biased towards the primary. This hypothesizes that the acoustic-based language identification system using CS training data may end with a biased performance towards the primary language. This study proves the hypothesis by observing the performance in terms of the confusion matrix of the earlier proposed approaches. At the same time, language discrimination also can be done at the suprasegmental-level, by capturing language-specific phonemic temporal evidence. Hence, to resolve the biasing issue, this study proposes a wav2vec2-based approach, which captures suprasegmental phonemic temporal patterns in the pre-training stage and merges it to capture language-specific suprasegmental evidence in the finetuning stage. The experimental results show the proposed approach is able to resolve the issue to some extent. As the fine-tuning stage uses a discriminative approach, the weighted loss and secondary language augmentation methods can be explored in the future for further performance improvement. Index Terms: Code switched (CS) bilingual speech, Sub-utterance level language identification (SLID), wav2vec2, Deepspeech2.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117255470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EHTNet: Twin-pooled CNN with Empirical Mode Decomposition and Hilbert Spectrum for Acoustic Scene Classification
Pub Date: 2022-07-11 | DOI: 10.1109/SPCOM55316.2022.9840514
Aswathy Madhu, K. Suresh
The objective of Acoustic Scene Classification (ASC) is to assist machines in identifying the unique acoustic characteristics that define an environment. In recent times, Convolutional Neural Networks (CNNs) have contributed significantly to the success of many state-of-the-art frameworks for ASC. The overall accuracy of an ASC framework depends on two factors: the signal representation and the learning model. In this work, we address these two factors as follows. First, we propose a time-frequency representation that employs empirical mode decomposition and the Hilbert spectrum for meaningful characterization of the acoustic signal. Second, we introduce EHTNet, a framework for ASC that uses twin-pooled CNNs for classification and the proposed time-frequency representation to characterize the acoustic signal. Experiments on a benchmark ASC dataset indicate that EHTNet outperforms state-of-the-art approaches for ASC as well as a log-mel-spectrum-based baseline. Specifically, the proposed framework improves the classification accuracy by 91.04% and the F1-score by 93.61% relative to the baseline.
{"title":"EHTNet: Twin-pooled CNN with Empirical Mode Decomposition and Hilbert Spectrum for Acoustic Scene Classification","authors":"Aswathy Madhu, K. Suresh","doi":"10.1109/SPCOM55316.2022.9840514","DOIUrl":"https://doi.org/10.1109/SPCOM55316.2022.9840514","url":null,"abstract":"The objective of Acoustic Scene Classification (ASC) is to assist the machines in identifying the unique acoustic characteristics that define an environment. In recent times, Convolutional Neural Networks (CNNs) have contributed significantly to the success of many state-of-the-art frameworks for ASC. The overall accuracy of the ASC framework depends on two factors: the signal representation and the learning model. In this work, we address these two factors as follows. First, we propose a time-frequency representation that employs empirical mode decomposition and Hilbert spectrum for meaningful characterization of the acoustic signal. Second, we introduce EHTNet, a framework for ASC which utilizes twin-pooled CNNs for classification and the proposed time-frequency representation to characterize the acoustic signal. Experiments on a benchmark dataset in ASC indicate that EHTNet outperforms state-of-the-art approaches for ASC in addition to a log mel spectrum-based baseline. Specifically, the proposed framework improves the classification accuracy by 91.04% and the f1-score by 93.61% as against the baseline.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"33 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120895373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}