Events Detection for an Audio-Based Surveillance System
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521669
C. Clavel, T. Ehrette, G. Richard
The present research deals with audio event detection in noisy environments for a multimedia surveillance application. In surveillance and homeland security, most systems that aim to automatically detect abnormal situations rely only on visual cues, while in some situations it may be easier to detect a given event using audio information. This is in particular the case for the class of sounds considered in this paper: sounds produced by gun shots. The automatic shot detection system presented is based on a novelty detection approach, which offers a solution for detecting abnormal audio events in continuous audio recordings of public places. We specifically focus on the robustness of the detection under variable and adverse conditions and on the reduction of the false rejection rate, which is particularly important in surveillance applications. In particular, we take advantage of the potential similarity between the acoustic signatures of the different types of weapons by building a hierarchical classification system.
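The abstract does not spell out the features or the novelty model; the following is a minimal sketch of the general novelty-detection idea, assuming MFCC-like per-frame feature vectors and a Gaussian mixture model of "normal" ambient sound, with low-likelihood frames flagged as abnormal events. It is not the authors' system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for per-frame feature vectors (e.g., MFCC-like), shape (n_frames, n_dims).
normal_train = rng.normal(0.0, 1.0, size=(2000, 13))           # background recordings
test_stream = np.vstack([rng.normal(0.0, 1.0, size=(200, 13)),
                         rng.normal(4.0, 1.5, size=(5, 13))])   # a few impulsive outliers

# Model only the "normal" class; abnormality = low likelihood under this model.
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(normal_train)

# Threshold set from the training scores; a permissive threshold trades extra
# false alarms for a lower false rejection rate.
threshold = np.percentile(gmm.score_samples(normal_train), 0.5)

abnormal = gmm.score_samples(test_stream) < threshold
print("frames flagged as abnormal:", np.flatnonzero(abnormal))
```

Lowering the threshold further reduces false rejections at the cost of more false alarms, mirroring the trade-off emphasized in the abstract.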
{"title":"Events Detection for an Audio-Based Surveillance System","authors":"C. Clavel, T. Ehrette, G. Richard","doi":"10.1109/ICME.2005.1521669","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521669","url":null,"abstract":"The present research deals with audio events detection in noisy environments for a multimedia surveillance application. In surveillance or homeland security most of the systems aiming to automatically detect abnormal situations are only based on visual clues while, in some situations, it may be easier to detect a given event using the audio information. This is in particular the case for the class of sounds considered in this paper, sounds produced by gun shots. The automatic shot detection system presented is based on a novelty detection approach which offers a solution to detect abnormality (abnormal audio events) in continuous audio recordings of public places. We specifically focus on the robustness of the detection against variable and adverse conditions and the reduction of the false rejection rate which is particularly important in surveillance applications. In particular, we take advantage of potential similarity between the acoustic signatures of the different types of weapons by building a hierarchical classification system","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114164888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video quality analysis for an automated video capturing and editing system for conversation scenes
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521514
Takashi Nishizaki, R. Ogata, Yuichi Kameda, Yoshinari Ohta, Yuichi Nakamura
This paper introduces video quality analysis for automated video capture and editing. Previously, we proposed an automated video capture and editing system for conversation scenes. In the capture phase, our system not only produces concurrent video streams with multiple pan-tilt-zoom cameras but also recognizes "conversation states", i.e., who is speaking, when someone is nodding, and so on. Since the conversation states drive the automated editing phase, it is important to clarify how the recognition rate of these conversation attributes affects the quality of the videos our editing system produces. In the present study, we analyzed the relationship between the recognition rate of conversation states and the quality of the resultant videos through subjective evaluation experiments. The quality scores of the resultant videos were almost the same as in the best case, in which recognition was done manually, and the recognition rate of our capture system was therefore sufficient.
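As an illustration of how recognized conversation states could drive editing, the following hypothetical rule-based shot selector cuts to the current speaker's camera and to a nodding listener for reaction shots; the actual editing rules of the system are not described in the abstract.

```python
def select_shot(states, overview_cam="overview"):
    """states: list of dicts like {"t": 3, "speaker": "A", "nodding": ["B"]}."""
    shots = []
    for s in states:
        if s.get("speaker") and s.get("nodding"):
            cam = "cam_" + s["nodding"][0]        # show a listener's reaction
        elif s.get("speaker"):
            cam = "cam_" + s["speaker"]           # show the current speaker
        else:
            cam = overview_cam                    # nobody speaking: wide shot
        shots.append((s["t"], cam))
    return shots

print(select_shot([{"t": 0, "speaker": "A", "nodding": []},
                   {"t": 1, "speaker": "A", "nodding": ["B"]},
                   {"t": 2, "speaker": None, "nodding": []}]))
# [(0, 'cam_A'), (1, 'cam_B'), (2, 'overview')]
```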
{"title":"Video quality analysis for an automated video capturing and editing system for conversation scenes","authors":"Takashi Nishizaki, R. Ogata, Yuichi Kameda, Yoshinari Ohta, Yuichi Nakamura","doi":"10.1109/ICME.2005.1521514","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521514","url":null,"abstract":"This paper introduces video quality analysis for automated video capture and editing. Previously, we proposed an automated video capture and editing system for conversation scenes. In the capture phase, our system not only produces concurrent video streams with multiple pan-tilt-zoom cameras but also recognizes \"conversation states\" i.e., who is speaking, when someone is nodding, etc. As it is necessary to know the conversation states for the automated editing phase, it is important to clarify how the recognition rate of the conversation attributes affects our editing system with regard to the quality of the resultant videos. In the present study, we analyzed the relationship between the recognition rate of conversation states and the quality of resultant videos through subjective evaluation experiments. The quality scores of the resultant videos were almost the same as the best case in which recognition was done manually, and the recognition rate of our capture system was therefore sufficient.","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121717382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Multimedian Concert-Video Browser
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521732
Y. V. Houten, S. U. Naci, Bauke Freiburg, R. Eggermont, Sander Schuurman, Danny Hollander, J. Reitsma, Maurice Markslag, Justin Kniest, Mattina Veenstra, A. Hanjalic
The MultimediaN concert-video browser demonstrates a video interaction environment for efficiently browsing video recordings of pop, rock and other music concerts. The exhibition shows the current state of the project, which aims to deliver an advanced concert-video browser in 2007. Three demos are provided: 1) a high-level content analysis methodology for modeling the "experience" of the concert at its different stages and for automatically detecting and identifying semantically coherent temporal segments in concert videos; 2) a general-purpose video editor that associates semantic descriptions with the video segments using both manual and automatic inputs, together with a video browser that applies ideas from information foraging theory and demonstrates patch-based video browsing; 3) the Fabplayer, specifically designed for patch-based browsing of concert videos by a dedicated user group, making use of the results of automatic concert-video segmentation.
{"title":"The Multimedian Concert-Video Browser","authors":"Y. V. Houten, S. U. Naci, Bauke Freiburg, R. Eggermont, Sander Schuurman, Danny Hollander, J. Reitsma, Maurice Markslag, Justin Kniest, Mattina Veenstra, A. Hanjalic","doi":"10.1109/ICME.2005.1521732","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521732","url":null,"abstract":"The MultimediaN concert-video browser demonstrates a video interaction environment for efficiently browsing video registrations of pop, rock and other music concerts. The exhibition displays the current state of the project for developing an advanced concert-video browser in 2007. Three demos are provided: 1) a high-level content analysis methodology for modeling the \"experience\" of the concert at its different stages, and for automatically detecting and identifying semantically coherent temporal segments in concert videos, 2) a general-purpose video editor that associates semantic descriptions with the video segments using both manual and automatic inputs, and a video browser that applies ideas from information foraging theory and demonstrates patch-based video browsing, 3) the Fabplayer, specifically designed for patch-based browsing of concert videos by a dedicated user-group, making use of the results of automatic concert-video segmentation","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121848207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive hierarchical multi-class SVM classifier for texture-based image classification
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521640
Song Liu, Haoran Yi, L. Chia, D. Rajan
In this paper, we present a new classification scheme based on support vector machines (SVM), together with a new texture feature called the texture correlogram, for high-level image classification. The SVM classifier was originally designed to solve only binary classification problems. In order to deal with multiple classes, we present a new method that dynamically builds a hierarchical structure from the training dataset. The texture correlogram is designed to capture spatial distribution information. Experimental results demonstrate that the proposed classification scheme and texture feature are effective for high-level image classification, and that the proposed scheme is more efficient than the other schemes while achieving almost the same classification accuracy. Another advantage of the proposed scheme is that the underlying hierarchical structure of the SVM classification tree reveals the interclass relationships among the different classes.
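A sketch of how such a hierarchical multi-class SVM could be built dynamically from training data: class centroids are clustered into two groups at each node and a binary SVM separates the groups, recursively. The clustering criterion and the RBF kernel below are assumptions for illustration, and the texture correlogram feature itself is not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_tree(X, y):
    classes = np.unique(y)
    if len(classes) == 1:
        return {"label": classes[0]}
    # Split the set of classes into two groups by clustering their centroids.
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    side = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(centroids)
    left, right = classes[side == 0], classes[side == 1]
    group = np.isin(y, right).astype(int)          # 0 = left group, 1 = right group
    clf = SVC(kernel="rbf", gamma="scale").fit(X, group)
    return {"clf": clf,
            "left": build_tree(X[group == 0], y[group == 0]),
            "right": build_tree(X[group == 1], y[group == 1])}

def predict_one(node, x):
    while "label" not in node:
        node = node["right"] if node["clf"].predict(x[None])[0] else node["left"]
    return node["label"]

# Toy data: 4 classes with different means in a 16-dimensional feature space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(50, 16)) for m in (0, 3, 6, 9)])
y = np.repeat(np.arange(4), 50)
tree = build_tree(X, y)
print(predict_one(tree, X[120]))   # expected: class 2
```

Classifying a sample then requires only one binary SVM evaluation per tree level, which is where the efficiency gain over one-vs-one or one-vs-rest schemes comes from.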
{"title":"Adaptive hierarchical multi-class SVM classifier for texture-based image classification","authors":"Song Liu, Haoran Yi, L. Chia, D. Rajan","doi":"10.1109/ICME.2005.1521640","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521640","url":null,"abstract":"In this paper, we present a new classification scheme based on support vector machines (SVM) and a new texture feature, called texture correlogram, for high-level image classification. Originally, SVM classifier is designed for solving only binary classification problem. In order to deal with multiple classes, we present a new method to dynamically build up a hierarchical structure from the training dataset. The texture correlogram is designed to capture spatial distribution information. Experimental results demonstrate that the proposed classification scheme and texture feature are effective for high-level image classification task and the proposed classification scheme is more efficient than the other schemes while achieving almost the same classification accuracy. Another advantage of the proposed scheme is that the underlying hierarchical structure of the SVM classification tree manifests the interclass relationships among different classes.","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121599729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic mobile sports highlights
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521504
K. Wan, Xin Yan, Changsheng Xu
We report on our development of a real-time system that delivers sports video highlights of a live game to mobile videophones over existing GPRS networks. To facilitate real-time analysis, a circular buffer receives live video data, from which simple audio/visual features are computed to detect highlight-worthy segments according to an a priori decision scheme. A separate module runs algorithms to insert content into the highlight for mobile advertising. The system is now under trial over new 3G networks.
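A toy sketch of the buffering and decision idea, assuming per-second audio energy and motion features and a simple threshold rule; the deployed system's features and decision scheme are not detailed in the abstract.

```python
from collections import deque

class HighlightDetector:
    def __init__(self, window_seconds=30, energy_thresh=0.7, motion_thresh=0.6):
        self.buf = deque(maxlen=window_seconds)   # circular buffer of recent features
        self.energy_thresh = energy_thresh
        self.motion_thresh = motion_thresh

    def push(self, t, audio_energy, motion):
        """Feed one second of features; return (start, end) if a highlight fires."""
        self.buf.append((t, audio_energy, motion))
        recent = list(self.buf)[-5:]              # decide over the last 5 seconds
        loud = sum(e for _, e, _ in recent) / len(recent) > self.energy_thresh
        busy = sum(m for _, _, m in recent) / len(recent) > self.motion_thresh
        if loud and busy and len(self.buf) == self.buf.maxlen:
            return (self.buf[0][0], self.buf[-1][0])   # clip the buffered window
        return None

det = HighlightDetector()
for t in range(60):
    hit = det.push(t, audio_energy=0.9 if t > 40 else 0.2, motion=0.8 if t > 40 else 0.3)
    if hit:
        print("highlight candidate:", hit)
        break
```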
{"title":"Automatic mobile sports highlights","authors":"K. Wan, Xin Yan, Changsheng Xu","doi":"10.1109/ICME.2005.1521504","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521504","url":null,"abstract":"We report on our development of a real-time system to deliver sports video highlights of a live game to mobile videophones over existing GPRS networks. To facilitate real-time analysis, a circular buffer receives live video data from which simple audio/visual features are computed to detect for highlight-worthiness according to a priori decision scheme. A separate module runs algorithms to insert content into the highlight for mobile advertising. The system is now under trial over new 3G networks.","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116781231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Efficient Architecture for Lifting-Based Forward and Inverse Discrete Wavelet Transform
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521548
S. Aroutchelvame, K. Raahemifar
In this research, an architecture that performs both the forward and the inverse lifting-based discrete wavelet transform (DWT) is proposed. The architecture reduces the hardware requirement by exploiting the redundancy in the arithmetic operations involved in the DWT computation, and it does not require any extra memory to store intermediate results. It consists of a predict module, an update module, an address generation module, a control unit, and a set of registers that establish data communication between the predict and update modules. Symmetric extension of images at the boundaries, as specified in JPEG2000, has been incorporated into the architecture to reduce boundary distortion. The architecture has been described in VHDL at the RTL level and simulated successfully in the ModelSim simulation environment.
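For reference, the arithmetic carried out by the predict and update steps is the reversible 5/3 lifting scheme of JPEG2000 with symmetric boundary extension. The following software sketch illustrates those lifting equations and perfect reconstruction; it is not a model of the proposed VHDL architecture.

```python
def sym(i, n):
    """Whole-sample symmetric index extension: ..., 2, 1 | 0, 1, ..., n-1 | n-2, ..."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i

def dwt53_forward(x):
    """Reversible 5/3 lifting: predict gives d (high-pass), update gives s (low-pass)."""
    n = len(x)
    d = [x[2 * k + 1] - ((x[sym(2 * k, n)] + x[sym(2 * k + 2, n)]) >> 1)
         for k in range(n // 2)]                                   # predict step
    s = [x[2 * k] + ((d[max(k - 1, 0)] + d[k] + 2) >> 2)
         for k in range(n // 2)]                                   # update step
    return s, d

def dwt53_inverse(s, d):
    n = 2 * len(s)
    x = [0] * n
    for k in range(n // 2):                                        # undo update
        x[2 * k] = s[k] - ((d[max(k - 1, 0)] + d[k] + 2) >> 2)
    for k in range(n // 2):                                        # undo predict
        x[2 * k + 1] = d[k] + ((x[2 * k] + x[sym(2 * k + 2, n)]) >> 1)
    return x

sig = [5, 7, 3, 9, 2, 6, 4, 8]
lo, hi = dwt53_forward(sig)
assert dwt53_inverse(lo, hi) == sig          # perfect reconstruction
print("low-pass:", lo, "high-pass:", hi)
```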
{"title":"An Efficient Architecture for Lifting-Based Forward and Inverse Discrete Wavelet Transform","authors":"S. Aroutchelvame, K. Raahemifar","doi":"10.1109/ICME.2005.1521548","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521548","url":null,"abstract":"In this research, an architecture that performs both forward and inverse lifting-based discrete wavelet transform is proposed. The proposed architecture reduces the hardware requirement by exploiting the redundancy in the arithmetic operation involved in DWT computation. The proposed architecture does not require any extra memory to store intermediate results. The proposed architecture consists of predict module, update module, address generation module, control unit and a set of registers to establish data communication between predict and update modules. The symmetrical extension of images at the boundary to reduce distorted images has been incorporated in our proposed architecture as mentioned in JPEG2000. This architecture has been described in VHDL at the RTL level and simulated successfully using ModelSim simulation environment","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125182223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Probabilistic Framework for TV-News Stories Detection and Classification
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521680
F. Colace, P. Foggia, G. Percannella
In this paper we address the problem of partitioning news videos into stories and of classifying the stories according to a predefined set of categories. In particular, we propose a multi-level probabilistic framework based on the hidden Markov model and Bayesian network paradigms for the segmentation and classification phases, respectively. The whole analysis exploits information extracted from the video and audio tracks using superimposed text recognition, speaker identification, speech transcription, and anchor detection. The system was tested on a database of Italian news videos and the results are very promising.
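A minimal sketch of the segmentation idea: shot-level observations are decoded with a two-state HMM ("anchor" vs. "report") using the Viterbi algorithm, and story boundaries are placed where the decoded state enters the anchor state. The states, probabilities and observation symbols below are illustrative assumptions, not the paper's model, and the Bayesian-network classification stage is omitted.

```python
import numpy as np

states = ["anchor", "report"]
start = np.log([0.6, 0.4])
trans = np.log([[0.7, 0.3],      # anchor -> anchor/report
                [0.2, 0.8]])     # report -> anchor/report
# Observation symbols per shot: 0 = anchorperson face detected, 1 = field footage.
emit = np.log([[0.9, 0.1],
               [0.2, 0.8]])

def viterbi(obs):
    n, k = len(obs), len(states)
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = start + emit[:, obs[0]]
    for t in range(1, n):
        for j in range(k):
            prev = score[t - 1] + trans[:, j]
            back[t, j] = int(np.argmax(prev))
            score[t, j] = prev[back[t, j]] + emit[j, obs[t]]
    path = [int(np.argmax(score[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[i] for i in reversed(path)]

shots = [0, 1, 1, 1, 0, 1, 1, 0, 1]
decoded = viterbi(shots)
boundaries = [t for t, s in enumerate(decoded)
              if s == "anchor" and (t == 0 or decoded[t - 1] != "anchor")]
print(decoded)
print("story boundaries at shots:", boundaries)
```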
{"title":"A Probabilistic Framework for TV-News Stories Detection and Classification","authors":"F. Colace, P. Foggia, G. Percannella","doi":"10.1109/ICME.2005.1521680","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521680","url":null,"abstract":"In this paper we face the problem of partitioning the news videos into stories, and of their classification according to a predefined set of categories. In particular, we propose to employ a multi-level probabilistic framework based on the hidden Markov models and the Bayesian networks paradigms for the segmentation and the classification phases, respectively. The whole analysis is carried out exploiting information extracted from the video and the audio tracks using techniques of superimposed text recognition, speaker identification, speech transcription, anchor detection. The system was tested on a database of Italian news videos and the results are very promising","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122433903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Object-Based Audio Streaming Over Error-Prone Channels
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521410
Stuart K. Marks, R. González
This paper investigates the benefits of streaming autonomous audio objects, rather than encoded audio frames, over error-prone channels. Due to the nature of autonomous audio objects, such a scheme is error resilient and has a fine-grained scalable bitrate, and it has the additional benefit of being able to disguise packet loss in the reconstructed signal. The paper proposes object-packing algorithms that are shown to disguise the presence of long bursts of packet loss, removing the need for complex error-concealment schemes at the decoder.
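An illustrative sketch of the packing idea: interleaving objects across packets so that a burst of lost packets removes only some of the objects of any given frame, which the surviving objects can mask. The round-robin scheme below is an assumed stand-in; the paper's actual object-packing algorithms may differ.

```python
def interleave_pack(objects, n_packets):
    """objects: list of (frame, object_id); distribute them round-robin over packets."""
    packets = [[] for _ in range(n_packets)]
    for k, obj in enumerate(objects):
        packets[k % n_packets].append(obj)
    return packets

def surviving_objects(packets, lost):
    return [obj for i, p in enumerate(packets) if i not in lost for obj in p]

# 4 frames x 4 audio objects each, packed into 8 packets.
objs = [(f, o) for f in range(4) for o in range(4)]
packets = interleave_pack(objs, 8)
kept = surviving_objects(packets, lost={2, 3, 4})   # a burst of 3 lost packets
per_frame = {f: sum(1 for g, _ in kept if g == f) for f in range(4)}
print(per_frame)   # every frame keeps some objects instead of one frame vanishing
```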
{"title":"Object-Based Audio Streaming Over Error-Prone Channels","authors":"Stuart K. Marks, R. González","doi":"10.1109/ICME.2005.1521410","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521410","url":null,"abstract":"This paper investigates the benefits of streaming autonomous audio objects over error-prone channels instead of encoded audio frames. Due to the nature of autonomous audio objects such a scheme is error resilient and has a fine-grain scalable bitrate, but also has the additional benefit of being able to disguise packet loss in the reconstructed signal. This paper proposes object-packing algorithms, which will be shown to be able to disguise the presence of long bursts of packet loss, removing the need for complex error-concealment schemes at the decoder","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121792368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Modal Video Concept Extraction Using Co-Training
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521473
Rong Yan, M. Naphade
For large-scale automatic semantic video characterization, it is necessary to learn and model a large number of semantic concepts. A major obstacle is the insufficiency of labeled training samples. Semi-supervised learning algorithms such as co-training may help by incorporating a large amount of unlabeled data, allowing the redundant information across views to improve learning performance. Although co-training has been successfully applied in several domains, it has not previously been used to detect video concepts. In this paper, we extend co-training to the domain of video concept detection and investigate different co-training strategies as well as their effects on detection accuracy. We demonstrate performance following the guidelines of the TRECVID '03 semantic concept extraction task.
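A bare-bones co-training sketch under the usual two-view assumption (for example, a visual-feature view and a text/ASR-feature view per shot). The classifiers, confidence rule and number of rounds are illustrative choices, not the strategies evaluated in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(Xa, Xb, y, labeled, rounds=5, per_round=10):
    """y holds known labels at `labeled` indices and -1 elsewhere."""
    labeled = set(labeled)
    clf_a = clf_b = None
    for _ in range(rounds):
        idx = sorted(labeled)
        clf_a = LogisticRegression(max_iter=1000).fit(Xa[idx], y[idx])
        clf_b = LogisticRegression(max_iter=1000).fit(Xb[idx], y[idx])
        # Each view pseudo-labels the unlabeled samples it is most confident about.
        for clf, X in ((clf_a, Xa), (clf_b, Xb)):
            pool = np.array(sorted(set(range(len(y))) - labeled))
            if len(pool) == 0:
                break
            conf = clf.predict_proba(X[pool]).max(axis=1)
            picked = pool[np.argsort(-conf)[:per_round]]
            y[picked] = clf.predict(X[picked])        # pseudo-labels, possibly noisy
            labeled.update(picked.tolist())
    return clf_a, clf_b

rng = np.random.default_rng(0)
n = 400
truth = rng.integers(0, 2, size=n)
Xa = rng.normal(truth[:, None] * 2.0, 1.0, size=(n, 20))   # e.g. visual-feature view
Xb = rng.normal(truth[:, None] * 2.0, 1.0, size=(n, 30))   # e.g. text/ASR-feature view
y = np.full(n, -1)
y[:20] = truth[:20]                                        # only 20 labeled samples
clf_a, clf_b = co_train(Xa, Xb, y, labeled=range(20))
print("view-A accuracy:", (clf_a.predict(Xa) == truth).mean())
```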
{"title":"Multi-Modal Video Concept Extraction Using Co-Training","authors":"Rong Yan, M. Naphade","doi":"10.1109/ICME.2005.1521473","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521473","url":null,"abstract":"For large scale automatic semantic video characterization, it is necessary to learn and model a large number of semantic concepts. A major obstacle to this is the insufficiency of labeled training samples. Semi-supervised learning algorithms such as co-training may help by incorporating a large amount of unlabeled data, which allows the redundant information across views to improve the learning performance. Although co-training has been successfully applied in several domains, it has not been used to detect video concepts before. In this paper, we extend co-training to the domain of video concept detection and investigate different strategies of co-training as well as their effects to the detection accuracy. We demonstrate performance based on the guideline of the TRECVID '03 semantic concept extraction task","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121875614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive local context suppression of multiple cues for salient visual attention detection
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521431
Yiqun Hu, D. Rajan, L. Chia
Visual attention is obtained by determining contrasts of low-level features, or attention cues, such as intensity and color. We propose a new texture attention cue that is shown to be more effective for images in which the salient object regions and the background have similar visual characteristics. Current visual attention models do not consider local contextual information when highlighting attention regions. We therefore also propose a feature combination strategy that suppresses saliency based on context information and is effective in determining the true attention region. We compare our approach with other visual attention models using a novel average discrimination ratio measure.
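A toy sketch of cue combination with context-based suppression: each normalized cue map is down-weighted when its activation is diffuse and kept when it has a compact peak. The peakedness-based weighting and the local-variance texture cue are illustrative assumptions, not the paper's exact suppression scheme or texture attention cue.

```python
import numpy as np

def normalise(m):
    m = m - m.min()
    return m / (m.max() + 1e-9)

def local_variance(img, k=3):
    """Crude texture cue: variance over a k x k neighbourhood (borders left at 0)."""
    h, w = img.shape
    out = np.zeros_like(img)
    r = k // 2
    for y in range(r, h - r):
        for x in range(r, w - r):
            out[y, x] = img[y - r:y + r + 1, x - r:x + r + 1].var()
    return out

def combine(maps):
    weights = []
    for m in maps:
        m = normalise(m)
        # Peakedness: gap between the global maximum and the mean activation;
        # diffuse maps (small gap) get suppressed in the combination.
        weights.append((m.max() - m.mean()) ** 2)
    weights = np.array(weights) / (sum(weights) + 1e-9)
    return normalise(sum(w * normalise(m) for w, m in zip(weights, maps)))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
img[20:30, 20:30] += 1.5                       # a brighter, textured candidate region
intensity = img
texture = local_variance(img)
saliency = combine([intensity, texture])
print("peak of saliency map at:", np.unravel_index(saliency.argmax(), saliency.shape))
```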
{"title":"Adaptive local context suppression of multiple cues for salient visual attention detection","authors":"Yiqun Hu, D. Rajan, L. Chia","doi":"10.1109/ICME.2005.1521431","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521431","url":null,"abstract":"Visual attention is obtained through determination of contrasts of low level features or attention cues like intensity, color etc. We propose a new texture attention cue that is shown to be more effective for images where the salient object regions and background have similar visual characteristics. Current visual attention models do not consider local contextual information to highlight attention regions. We also propose a feature combination strategy by suppressing saliency based on context information that is effective in determining the true attention region. We compare our approach with other visual attention models using a novel average discrimination ratio measure.","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127678959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}