Selection of best match keyword using spoken term detection for spoken document indexing
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041589
Kentaro Domoto, T. Utsuro, N. Sawada, H. Nishizaki
This paper presents a novel keyword selection-based spoken document indexing framework that selects the best-matching keyword from query candidates using spoken term detection (STD) for spoken document retrieval. Our method first creates a keyword set containing keywords that are likely to occur in a spoken document. Next, STD is conducted with all the keywords as query terms, yielding a detection result: a set of keywords and their detection intervals in the spoken document. For keywords with competing intervals, we rank them by STD matching cost and select the best one, preferring the longest duration among the competing detections. This final output of the STD process serves as an index word for the spoken document. The proposed framework was evaluated on lecture speeches as spoken documents in an STD task. The results show that our framework is quite effective at preventing false detections and at annotating spoken documents with keyword indices.
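As an illustration of the selection step described above, here is a minimal Python sketch (not the authors' implementation): it groups detections whose intervals overlap, then keeps the one with the lowest matching cost per group, preferring the longer duration on cost ties. The `Detection` structure and all values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    keyword: str
    start: float   # interval start (seconds)
    end: float     # interval end (seconds)
    cost: float    # STD matching cost (lower is better)

def overlaps(a: Detection, b: Detection) -> bool:
    """Two detections compete if their time intervals intersect."""
    return a.start < b.end and b.start < a.end

def select_best(detections: List[Detection]) -> List[Detection]:
    """Group competing detections and keep one winner per group."""
    detections = sorted(detections, key=lambda d: d.start)
    groups, current = [], []
    for d in detections:
        if current and any(overlaps(d, c) for c in current):
            current.append(d)
        else:
            if current:
                groups.append(current)
            current = [d]
    if current:
        groups.append(current)
    # Rank by matching cost; prefer the longer detection on ties.
    return [min(g, key=lambda d: (d.cost, -(d.end - d.start))) for g in groups]

if __name__ == "__main__":
    dets = [Detection("signal", 1.0, 1.6, 0.21),
            Detection("signals", 1.0, 1.8, 0.21),
            Detection("sign", 1.1, 1.4, 0.35)]
    for d in select_best(dets):
        print(d.keyword, d.start, d.end, d.cost)  # keeps "signals"
```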
{"title":"Selection of best match keyword using spoken term detection for spoken document indexing","authors":"Kentaro Domoto, T. Utsuro, N. Sawada, H. Nishizaki","doi":"10.1109/APSIPA.2014.7041589","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041589","url":null,"abstract":"This paper presents a novel keyword selection-based spoken document-indexing framework that selects the best match keyword from query candidates using spoken term detection (STD) for spoken document retrieval. Our method comprises creating a keyword set including keywords that are likely to be in a spoken document. Next, an STD is conducted for all the keywords as query terms for STD; then, the detection result, a set of each keyword and its detection intervals in the spoken document, is obtained. For the keywords that have competitive intervals, we rank them based on the matching cost of STD and select the best one with the longest duration among competitive detections. This is the final output of STD process and serves as an index word for the spoken document. The proposed framework was evaluated on lecture speeches as spoken documents in an STD task. The results show that our framework was quite effective for preventing false detection errors and in annotating keyword indices to spoken documents.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"215 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123869677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-learning-based signal decomposition for multimedia applications: A review and comparative study
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041778
Li-Wei Kang, C. Yeh, Duan-Yu Chen, Chia-Tsung Lin
Decomposition of a signal (e.g., an image or video) into multiple semantic components has been an active research topic for various image/video processing applications, such as image/video denoising, enhancement, and inpainting. In this paper, we present a survey of signal decomposition frameworks based on the use of sparsity and morphological diversity in signal mixtures, together with their applications in multimedia. First, we analyze existing MCA (morphological component analysis) based image decomposition frameworks and their applications, and explore the potential limitations of these approaches for image denoising. Then, we discuss our recently proposed self-learning-based image decomposition framework and its applications to several image/video denoising tasks, including single-image rain streak removal, denoising, deblocking, and joint super-resolution and deblocking for highly compressed images/videos. Exploiting the sparse representation and morphological diversity of image signals, the proposed framework first learns an over-complete dictionary from the high-frequency part of an input image for reconstruction purposes. An unsupervised or supervised clustering technique is then applied to the dictionary atoms to identify the morphological component corresponding to the noise pattern of interest (e.g., rain streaks, blocking artifacts, or Gaussian noise). Unlike prior learning-based approaches, our method needs no training data collected in advance and requires no image priors. Our experimental results confirm the effectiveness and robustness of the proposed framework, which has been shown to outperform state-of-the-art approaches.
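The pipeline can be sketched with off-the-shelf tools. The following is a rough approximation, not the authors' code: a Gaussian low-pass stands in for the low/high-frequency split, scikit-learn's `MiniBatchDictionaryLearning` learns the over-complete dictionary from high-frequency patches, and K-means plays the role of the unsupervised atom clustering.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import KMeans
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

def decompose(image: np.ndarray, n_atoms: int = 128, patch_size: int = 8):
    """Learn an over-complete dictionary from the high-frequency part of a
    grayscale image and split its atoms into two candidate components."""
    # Low/high-frequency split; a Gaussian filter is used here purely for
    # simplicity, the framework is agnostic to the particular filter.
    low = gaussian_filter(image, sigma=2.0)
    high = image - low
    patches = extract_patches_2d(high, (patch_size, patch_size),
                                 max_patches=2000, random_state=0)
    X = patches.reshape(len(patches), -1)
    X -= X.mean(axis=1, keepdims=True)      # remove per-patch DC component
    dico = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=1.0,
                                       random_state=0).fit(X)
    # Unsupervised clustering of atoms; the cluster matching the noise
    # pattern of interest (e.g., rain streaks) would then be discarded.
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(dico.components_)
    return low, high, dico.components_, labels
```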
{"title":"Self-learning-based signal decomposition for multimedia applications: A review and comparative study","authors":"Li-Wei Kang, C. Yeh, Duan-Yu Chen, Chia-Tsung Lin","doi":"10.1109/APSIPA.2014.7041778","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041778","url":null,"abstract":"Decomposition of a signal (e.g., image or video) into multiple semantic components has been an effective research topic for various image/video processing applications, such as image/video denoising, enhancement, and inpainting. In this paper, we present a survey of signal decomposition frameworks based on the uses of sparsity and morphological diversity in signal mixtures and its applications in multimedia. First, we analyze existing MCA (morphological component analysis) based image decomposition frameworks with their applications and explore the potential limitations of these approaches for image denoising. Then, we discuss our recently proposed self-learning based image decomposition framework with its applications to several image/video denoising tasks, including single image rain streak removal, denoising, deblocking, joint super-resolution and deblocking for a highly compressed image/video. By advancing sparse representation and morphological diversity of image signals, the proposed framework first learns an over-complete dictionary from the high frequency part of an input image for reconstruction purposes. An unsupervised or supervised clustering technique is applied to the dictionary atoms for identifying the morphological component corresponding to the noise pattern of interest (e.g., rain streaks, blocking artifacts, or Gaussian noises). Different from prior learning-based approaches, our method does not need to collect training data in advance and no image priors are required. Our experimental results have confirmed the effectiveness and robustness of the proposed framework, which has been shown to outperform state-of-the-art approaches.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116767970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R-cube: A dialogue agent for restaurant recommendation and reservation
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041732
Seokhwan Kim, Rafael E. Banchs
This paper describes a hybrid dialogue system for restaurant recommendation and reservation. The proposed system combines rule-based and data-driven components within a flexible architecture that aims to reduce error propagation along the different steps of the dialogue management and processing pipeline. The system implements three basic subsystems for restaurant recommendation, selection, and booking, which share the same system architecture and processing components. The specific system described here operates on a data collection covering Singapore's F&B industry, but it can easily be adapted to any other city or location by simply replacing the data collection.
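A minimal sketch of the hybrid rule-based/data-driven idea follows; the handlers and responses are invented for illustration and do not reflect the actual R-cube modules.

```python
from typing import Callable, Optional

def rule_based(utterance: str) -> Optional[str]:
    """Deterministic rule: handle what it can, decline the rest."""
    if "book" in utterance.lower():
        return "Which restaurant and what time?"
    return None  # no rule fired

def data_driven(utterance: str) -> str:
    """Placeholder for a statistical component trained on the F&B data."""
    return "Based on your preferences, I recommend these restaurants: ..."

def respond(utterance: str, rules: Callable = rule_based,
            fallback: Callable = data_driven) -> str:
    # Try the rule first and fall back to the data-driven component, so an
    # error in one stage does not silently propagate through the pipeline.
    return rules(utterance) or fallback(utterance)

print(respond("I want to book a table"))
```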
{"title":"R-cube: A dialogue agent for restaurant recommendation and reservation","authors":"Seokhwan Kim, Rafael E. Banchs","doi":"10.1109/APSIPA.2014.7041732","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041732","url":null,"abstract":"This paper describes a hybrid dialogue system for restaurant recommendation and reservation. The proposed system combines rule-based and data-driven components by using a flexible architecture aiming at diminishing error propagation along the different steps of the dialogue management and processing pipeline. The system implements three basic subsystems for restaurant recommendation, selection and booking, which leverage on the same system architecture and processing components. The specific system described here operates with a data collection of Singapore's F&B industry but it can be easily adapted to any other city or location by simply replacing the used data collection.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122845742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-time depth map generation using hybrid multi-view cameras
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041683
Yunseok Song, Dong-Won Shin, Eunsang Ko, Yo-Sung Ho
In this paper, we present a hybrid multi-view camera system for real-time depth generation. We set up eight color cameras and three depth cameras. For simple test scenarios, we capture a single object in a blue-screen studio. The objective is depth map generation at the eight color viewpoints. Due to hardware limitations, the depth cameras produce low-resolution images (176×144). We therefore warp the depth data to the color camera views (1280×720) and then apply filtering. Joint bilateral filtering (JBF) is used to exploit range and spatial weights, taking the color data into account as well. Simulation results show depth generation at 13 frames per second (fps) when treating the eight images as a single frame. When the proposed method is executed on one computer per depth camera, the speed can become three times faster. We have thus achieved real-time depth generation using a hybrid multi-view camera system.
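Joint bilateral filtering itself is standard; below is a straightforward (unoptimized) reference implementation of the filtering step, with spatial weights from pixel distance and range weights from the registered color image. A real-time system like the one described would need a vectorized or GPU implementation.

```python
import numpy as np

def joint_bilateral_filter(depth, color, radius=5, sigma_s=3.0, sigma_r=0.1):
    """Refine a warped depth map: spatial weights from pixel distance, range
    weights from the registered color image (float RGB in [0, 1])."""
    h, w = depth.shape
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))
    d = np.pad(depth, radius, mode="edge")
    c = np.pad(color, ((radius, radius), (radius, radius), (0, 0)),
               mode="edge")
    out = np.empty_like(depth)
    for y in range(h):
        for x in range(w):
            dwin = d[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            cwin = c[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            diff = cwin - c[y + radius, x + radius]  # color difference
            rng = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma_r ** 2))
            wgt = spatial * rng
            out[y, x] = np.sum(wgt * dwin) / np.sum(wgt)
    return out
```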
{"title":"Real-time depth map generation using hybrid multi-view cameras","authors":"Yunseok Song, Dong-Won Shin, Eunsang Ko, Yo-Sung Ho","doi":"10.1109/APSIPA.2014.7041683","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041683","url":null,"abstract":"In this paper, we present a hybrid multi-view camera system for real-time depth generation. We set up eight color cameras and three depth cameras. For simple test scenarios, we capture a single object at a blue-screen studio. The objective is depth map generation at eight color viewpoints. Due to hardware limitations, depth cameras produce low resolution images, i.e., 176×144. Thus, we warp the depth data to the color cameras views (1280×720) and then execute filtering. Joint bilateral filtering (JBF) is used to exploit range and spatial weights, considering color data as well. Simulation results exhibit depth generation of 13 frames per second (fps) when treating eight images as a single frame. When the proposed method is executed on a computer per depth camera basis, the speed can become three times faster. Thus, we have successfully achieved real-time depth generation using a hybrid multi-view camera system.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131539563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust anchorperson detection based on audio streams using a hybrid I-vector and DNN system
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041717
Yun-Fan Chang, Payton Lin, Shao-Hua Cheng, Kai-Hsuan Chan, Y. Zeng, Chia-Wei Liao, Wen-Tsung Chang, Y. Wang, Yu Tsao
Anchorperson segment detection enables efficient video content indexing for information retrieval. Anchorperson detection based on audio analysis has gained popularity due to its lower computational complexity and satisfactory performance. This paper presents a robust framework using a hybrid I-vector and deep neural network (DNN) system to perform anchorperson detection on the audio streams of video content. The proposed system first applies I-vector extraction to obtain speaker identity features from the audio data. With these speaker identity features, a DNN classifier is then used to verify the claimed anchorperson identity. In addition, subspace feature normalization (SFN) is incorporated into the hybrid system for robust feature extraction, compensating for the audio mismatch caused by recording devices. An anchorperson verification experiment was conducted to evaluate the equal error rate (EER) of the proposed hybrid system. Experimental results demonstrate that the proposed system outperforms the state-of-the-art hybrid I-vector and support vector machine (SVM) system, and that integrating SFN further enhances the system by effectively compensating for audio mismatch in anchorperson detection tasks.
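The verification stage can be approximated as follows. Real i-vector extraction requires a trained total-variability model (e.g., from Kaldi) and is out of scope, so this sketch uses synthetic 400-dimensional vectors, and plain standardization stands in for the paper's subspace feature normalization.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
anchor = rng.normal(0.5, 1.0, (200, 400))    # target anchorperson i-vectors
others = rng.normal(-0.5, 1.0, (200, 400))   # other speakers / impostors
X = np.vstack([anchor, others])
y = np.array([1] * 200 + [0] * 200)

# StandardScaler is a crude placeholder for the paper's SFN step; the DNN
# then verifies the claimed anchorperson identity from the features.
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(256, 256),
                                  max_iter=300, random_state=0))
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]   # verification scores for EER analysis
```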
{"title":"Robust anchorperson detection based on audio streams using a hybrid I-vector and DNN system","authors":"Yun-Fan Chang, Payton Lin, Shao-Hua Cheng, Kai-Hsuan Chan, Y. Zeng, Chia-Wei Liao, Wen-Tsung Chang, Y. Wang, Yu Tsao","doi":"10.1109/APSIPA.2014.7041717","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041717","url":null,"abstract":"Anchorperson segment detection enables efficient video content indexing for information retrieval. Anchorperson detection based on audio analysis has gained popularity due to lower computational complexity and satisfactory performance. This paper presents a robust framework using a hybrid I-vector and deep neural network (DNN) system to perform anchorperson detection based on audio streams of video content. The proposed system first applies I-vector to extract speaker identity features from the audio data. With the extracted speaker identity features, a DNN classifier is then used to verify the claimed anchorperson identity. In addition, subspace feature normalization (SFN) is incorporated into the hybrid system for robust feature extraction to compensate the audio mismatch issues caused by recording devices. An anchorperson verification experiment was conducted to evaluate the equal error rate (EER) of the proposed hybrid system. Experimental results demonstrate that the proposed system outperforms the state-of-the-art hybrid I-vector and support vector machine (SVM) system. Moreover, the proposed system was further enhanced by integrating SFN to effectively compensate the audio mismatch issues in anchorperson detection tasks.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"62 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131540250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison the training methods of neural network for English and Thai character recognition
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041795
A. Saenthon, Natchanon Sukkhadamrongrak
Optical character recognition (OCR) is currently applied in many fields, such as reading office letters and reading serial numbers on industrial parts. Most manufacturers focus on the processing time and accuracy of the inspection process. OCR systems typically use a neural network to recognize fonts and compute a matching score, and neural networks can be trained with many different techniques, each of which affects processing time and accuracy. This paper therefore compares neural network training methods for recognizing both Thai and English characters. The experimental results report the error and processing time of each training technique.
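The kind of comparison the paper reports can be reproduced in miniature: train the same network with different solvers and record error and training time. The digits dataset below is a stand-in for the Thai/English character images, which are not publicly available.

```python
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)   # stand-in for character images
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for solver in ("sgd", "adam", "lbfgs"):   # three common training methods
    clf = MLPClassifier(hidden_layer_sizes=(64,), solver=solver,
                        max_iter=500, random_state=0)
    t0 = time.time()
    clf.fit(Xtr, ytr)
    print(f"{solver}: error={1 - clf.score(Xte, yte):.3f}, "
          f"time={time.time() - t0:.2f}s")
```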
{"title":"Comparison the training methods of neural network for English and Thai character recognition","authors":"A. Saenthon, Natchanon Sukkhadamrongrak","doi":"10.1109/APSIPA.2014.7041795","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041795","url":null,"abstract":"Currently, the optical character recognition (OCR) is applied in many fields such as reading the office letter and to read the serial on parts of industrial. The most manufacturing focus the processing time and accuracy for inspection process. The learning method of the optical character recognition is used a neural network to recognize the fonts and correlation the matching value. The neural network has many learning techniques which each technique impact to the processing time and accuracy. Therefore, this paper studies to comparisons a suitable procedure of training in neural network for recognizing both Thai and English characters. The experiment results show the comparing values of error and processing time of each training technique.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"304 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132744084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Redefining self-similarity in natural images for denoising using graph signal gradient
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041627
Jiahao Pang, Gene Cheung, Wei Hu, O. Au
Image denoising is the most basic inverse imaging problem. As an under-determined problem, it crucially depends on appropriate image priors for regularization. Recently proposed priors for image denoising include: i) the graph Laplacian regularizer, where a given pixel patch is assumed to be smooth in the graph-signal domain; and ii) the self-similarity prior, where image patches are assumed to recur throughout a natural image in non-local spatial regions. As our first contribution, we demonstrate that the graph Laplacian regularizer converges to a continuous-time functional counterpart, and that careful selection of its features can lead to a discriminant signal prior. As our second contribution, we redefine patch self-similarity in terms of patch gradients and argue that the new definition results in a more accurate estimate of the graph Laplacian matrix, and thus better image denoising performance. Experiments show that our algorithm, based on the graph Laplacian regularizer and gradient-based self-similarity, can outperform non-local means (NLM) denoising by up to 1.4 dB in PSNR.
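The graph Laplacian regularizer admits a closed-form denoiser: minimizing ||x - y||² + λ·xᵀLx gives x* = (I + λL)⁻¹y. The sketch below builds a 4-connected patch graph with edge weights from pixel gradients (a simplified stand-in for the paper's gradient-based self-similarity) and applies that closed form; the parameters are illustrative.

```python
import numpy as np

def denoise_patch(noisy_patch, lam=0.5, sigma=0.1):
    """Solve x* = argmin_x ||x - y||^2 + lam * x^T L x, i.e.
    x* = (I + lam * L)^{-1} y, on a 4-connected pixel graph whose edge
    weights are computed from pixel-gradient differences."""
    y = noisy_patch.ravel()
    h, w = noisy_patch.shape
    gy, gx = np.gradient(noisy_patch)
    feats = np.stack([gx.ravel(), gy.ravel()], axis=1)
    n = h * w
    W = np.zeros((n, n))
    for i in range(h):
        for j in range(w):
            p = i * w + j
            for di, dj in ((0, 1), (1, 0)):      # right and down neighbors
                if i + di < h and j + dj < w:
                    q = (i + di) * w + (j + dj)
                    d2 = np.sum((feats[p] - feats[q]) ** 2)
                    W[p, q] = W[q, p] = np.exp(-d2 / (2 * sigma ** 2))
    L = np.diag(W.sum(axis=1)) - W               # combinatorial Laplacian
    return np.linalg.solve(np.eye(n) + lam * L, y).reshape(h, w)
```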
{"title":"Redefining self-similarity in natural images for denoising using graph signal gradient","authors":"Jiahao Pang, Gene Cheung, Wei Hu, O. Au","doi":"10.1109/APSIPA.2014.7041627","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041627","url":null,"abstract":"Image denoising is the most basic inverse imaging problem. As an under-determined problem, appropriate definition of image priors to regularize the problem is crucial. Among recent proposed priors for image denoising are: i) graph Laplacian regularizer where a given pixel patch is assumed to be smooth in the graph-signal domain; and ii) self-similarity prior where image patches are assumed to recur throughout a natural image in non-local spatial regions. In our first contribution, we demonstrate that the graph Laplacian regularizer converges to a continuous time functional counterpart, and careful selection of its features can lead to a discriminant signal prior. In our second contribution, we redefine patch self-similarity in terms of patch gradients and argue that the new definition results in a more accurate estimate of the graph Laplacian matrix, and thus better image denoising performance. Experiments show that our designed algorithm based on graph Laplacian regularizer and gradient-based self-similarity can outperform non-local means (NLM) denoising by up to 1.4 dB in PSNR.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"35 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114117807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recursive neural network paraphrase identification for example-based dialog retrieval
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041777
Lasguido Nio, S. Sakti, Graham Neubig, T. Toda, Satoshi Nakamura
An example-based dialog model often requires a large data collection to achieve good performance. Moreover, when handling out-of-vocabulary (OOV) database queries, this approach is weak and handles interactions between the words in a sentence poorly. In this work, we try to overcome these problems by using recursive neural network paraphrase identification to improve the robustness of example-based dialog response retrieval. We model our dialog-pair database and the user input query with distributed word representations, and employ recursive autoencoders and dynamic pooling to determine whether two sentences of arbitrary length have the same meaning. The distributed representations have the potential to improve the handling of OOV cases, and the recursive structure can reduce confusion in example matching.
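The dynamic-pooling step, which maps a variable-size word-similarity matrix to a fixed-size feature for a classifier, can be sketched as follows; the random vectors stand in for embeddings that a trained recursive autoencoder would produce over parse trees, and the pooled size is an assumption.

```python
import numpy as np

def dynamic_pool(dist: np.ndarray, size: int = 4) -> np.ndarray:
    """Min-pool a variable-size word-distance matrix to size x size
    (min, because smaller distance means a better word match); assumes
    both sentences have at least `size` words."""
    rows = np.array_split(np.arange(dist.shape[0]), size)
    cols = np.array_split(np.arange(dist.shape[1]), size)
    pooled = np.empty((size, size))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = dist[np.ix_(r, c)].min()
    return pooled

rng = np.random.default_rng(1)
s1 = rng.normal(size=(5, 50))    # 5-word user query, 50-dim embeddings
s2 = rng.normal(size=(9, 50))    # 9-word candidate from the dialog database
dist = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=2)
features = dynamic_pool(dist).ravel()   # fixed-length classifier input
```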
{"title":"Recursive neural network paraphrase identification for example-based dialog retrieval","authors":"Lasguido Nio, S. Sakti, Graham Neubig, T. Toda, Satoshi Nakamura","doi":"10.1109/APSIPA.2014.7041777","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041777","url":null,"abstract":"An example-based dialog model often require a lot of data collections to achieve a good performance. However, when it comes on handling an out of vocabulary (OOV) database queries, this approach resulting in weakness and inadequate handling of interactions between words in the sentence. In this work, we try to overcome this problem by utilizing recursive neural network paraphrase identification to improve the robustness of example-based dialog response retrieval. We model our dialog-pair database and user input query with distributed word representations, and employ recursive autoencoders and dynamic pooling to determine whether two sentences with arbitrary length have the same meaning. The distributed representations have the potential to improve handling of OOV cases, and the recursive structure can reduce confusion in example matching.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114629805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-agent ad hoc team partitioning by observing and modeling single-agent performance
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041644
Etkin Baris Ozgul, Somchaya Liemhetcharat, K. H. Low
Multi-agent research has focused on finding the optimal team for a task. Many approaches assume that the performance of the agents is known a priori. We are interested in ad hoc teams, where the agents' algorithms and performance are initially unknown. We focus on modeling the performance of single agents through observation in training environments, and on using the learned models to partition a new environment for a multi-agent team. The goal is to minimize the number of agents used while maintaining a performance threshold for the multi-agent team. We contribute a novel model that learns an agent's performance from observations, and a partitioning algorithm that minimizes the team size. We evaluate our algorithms in simulation and show the efficacy of our learned model and partitioning algorithm.
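A greedy caricature of the partitioning idea is sketched below: grow a region while the learned single-agent performance model predicts performance above the threshold, and start a new region (one more agent) when it drops below. The performance model here is a made-up placeholder, not the learned model from the paper.

```python
def partition(cells, perf, threshold):
    """Grow contiguous regions while the single-agent performance model
    stays above the threshold; each region is assigned one agent."""
    regions, current = [], []
    for cell in cells:
        if perf(current + [cell]) >= threshold:
            current.append(cell)
        else:
            if current:
                regions.append(current)
            current = [cell]
    if current:
        regions.append(current)
    return regions   # len(regions) = number of agents used

# Made-up performance model: an agent handles up to 4 cells adequately.
perf = lambda region: 1.0 if len(region) <= 4 else 0.5
print(partition(list(range(10)), perf, threshold=0.9))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]] -> a team of 3 agents
```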
{"title":"Multi-agent ad hoc team partitioning by observing and modeling single-agent performance","authors":"Etkin Baris Ozgul, Somchaya Liemhetcharat, K. H. Low","doi":"10.1109/APSIPA.2014.7041644","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041644","url":null,"abstract":"Multi-agent research has focused on finding the optimal team for a task. Many approaches assume that the performance of the agents are known a priori. We are interested in ad hoc teams, where the agents' algorithms and performance are initially unknown. We focus on the task of modeling the performance of single agents through observation in training environments, and using the learned models to partition a new environment for a multi-agent team. The goal is to minimize the number of agents used, while maintaining a performance threshold of the multi-agent team. We contribute a novel model to learn the agent's performance through observations, and a partitioning algorithm that minimizes the team size. We evaluate our algorithms in simulation, and show the efficacy of our learn model and partitioning algorithm.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114247026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spectral-temporal receptive fields and MFCC balanced feature extraction for noisy speech recognition
Pub Date: 2014-12-01 | DOI: 10.1109/APSIPA.2014.7041624
Jia-Ching Wang, Chang-Hong Lin, En-Ting Chen, P. Chang
This paper proposes a new set of acoustic features based on spectral-temporal receptive fields (STRFs). The STRF is an analysis method for studying the physiological model of the mammalian auditory system in the spectral-temporal domain. It has two different parts: the rate (in Hz), which represents the temporal response, and the scale (in cycles/octave), which represents the spectral response. From the obtained STRF, we derive an effective acoustic feature. First, the energy of each scale is calculated from the STRF. The logarithm is then applied to the scale energies. Finally, the discrete cosine transform is applied to generate the proposed STRF feature. In our experiments, we combine the proposed STRF feature with conventional Mel-frequency cepstral coefficients (MFCCs) to verify its effectiveness. In a noise-free environment, the proposed feature increases the recognition rate by 17.48%; in noisy environments, the increase ranges from 5% to 12%.
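The feature construction is concrete enough to sketch directly: per-scale energy, logarithm, then DCT. The shape of the STRF response tensor and the coefficient count below are assumptions, not values from the paper.

```python
import numpy as np
from scipy.fft import dct

def strf_feature(strf: np.ndarray, n_coeffs: int = 13) -> np.ndarray:
    """Collapse an STRF response (n_scales, n_rates, n_frames) into a
    cepstrum-like vector: per-scale energy -> log -> DCT."""
    scale_energy = np.sum(np.abs(strf) ** 2, axis=(1, 2))  # energy per scale
    log_energy = np.log(scale_energy + 1e-10)              # log compression
    return dct(log_energy, type=2, norm="ortho")[:n_coeffs]

rng = np.random.default_rng(0)
toy = rng.random((24, 20, 100))    # 24 scales, 20 rates, 100 time frames
print(strf_feature(toy).shape)     # (13,)
```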
{"title":"Spectral-temporal receptive fields and MFCC balanced feature extraction for noisy speech recognition","authors":"Jia-Ching Wang, Chang-Hong Lin, En-Ting Chen, P. Chang","doi":"10.1109/APSIPA.2014.7041624","DOIUrl":"https://doi.org/10.1109/APSIPA.2014.7041624","url":null,"abstract":"This paper aims to propose a new set of acoustic features based on spectral-temporal receptive fields (STRFs). The STRF is an analysis method for studying physiological model of the mammalian auditory system in spectral-temporal domain. It has two different parts: one is the rate (in Hz) which represents the temporal response and the other is the scale (in cycle/octave) which represents the spectral response. With the obtained STRF, we propose an effective acoustic feature. First, the energy of each scale is calculated from the STRF. The logarithmic operation is then imposed on the scale energies. Finally, the discrete Cosine transform is applied to generate the proposed STRF feature. In our experiments, we combine the proposed STRF feature with conventional Mel frequency cepstral coefficients (MFCCs) to verify its effectiveness. In a noise-free environment, the proposed feature can increase the recognition rate by 17.48%. Moreover, the increase in the recognition rate ranges from 5% to 12% in noisy environments.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114445257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}