Speech-Based Visual Concept Learning Using WordNet
Xiaodan Song, Ching-Yung Lin, Ming-Ting Sun
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521627
Modeling visual concepts with supervised or unsupervised machine learning is becoming increasingly important for video semantic indexing, retrieval, and filtering applications. Videos naturally contain multimodal data such as audio, speech, visual content, and text, which are combined to infer the overall semantic concepts. In the literature, however, most research has been conducted within a single domain. In this paper we propose an unsupervised technique that uses WordNet to build context-independent keyword lists for modeling desired visual concepts. Furthermore, we propose an extended speech-based visual concept (ESVC) model that reorders and extends these keyword lists by supervised learning based on multimodal annotation. Experimental results show that the context-independent models achieve performance comparable to conventional supervised learning algorithms, and that the ESVC model improves on a state-of-the-art speech-based video concept detection algorithm by about 53% and 28.4% on two testing subsets of the TRECVID 2003 corpus.
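The keyword-list construction can be sketched as a walk over lexical relations. This is a minimal illustration only: a tiny hand-coded relation map stands in for WordNet (in practice one would query synsets and hyponym links through an interface such as NLTK's wordnet corpus reader), and the `LEXICON` entries are invented for the example.

```python
# Hypothetical miniature lexicon standing in for WordNet:
# concept -> (synonyms, hyponyms)
LEXICON = {
    "vehicle": (["conveyance"], ["car", "truck", "boat"]),
    "car": (["auto", "automobile"], ["taxi", "jeep"]),
    "boat": (["watercraft"], ["canoe"]),
}

def build_keyword_list(seed, max_depth=2):
    """Collect synonyms and hyponyms reachable from `seed` up to max_depth."""
    keywords, frontier = {seed}, [(seed, 0)]
    while frontier:
        term, depth = frontier.pop()
        if depth >= max_depth or term not in LEXICON:
            continue
        synonyms, hyponyms = LEXICON[term]
        for related in synonyms + hyponyms:
            if related not in keywords:
                keywords.add(related)
                frontier.append((related, depth + 1))
    return sorted(keywords)

print(build_keyword_list("vehicle"))
```

A real system would then score these context-independent keywords against speech transcripts; the ESVC step of the paper reorders the list with supervised multimodal annotation.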
Reliable video communication with multi-path streaming using MDC
I. Lee, L. Guan
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521522
Video streaming demands high data rates and imposes hard delay constraints, which raises several challenges on today's packet-based, best-effort Internet. In this paper, we propose an efficient multiple-description coding (MDC) technique based on video frame sub-sampling and cubic-spline interpolation to provide spatial diversity, such that no additional buffering delay or storage is required. We also analyze the frame dropping rate due to packet loss and drifting error under the multi-path streaming environment.
Supporting rights checking in an MPEG-21 Digital Item Processing environment
F. D. Keukelaere, T. DeMartini, Jeroen Bekaert, R. Walle
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521608
Within the world of multimedia, the new MPEG-21 standard is currently under development. The purpose of this new standard is to create an open framework for multimedia delivery and consumption. MPEG-21 masters the multitude of content and metadata types by standardizing the declaration of digital items in an XML-based format. In addition to standardizing the declaration of digital items, MPEG-21 also standardizes Digital Item Processing, which enables the declaration of suggested uses of digital items. The Rights Expression Language and Rights Data Dictionary parts of MPEG-21 enable the declaration of the rights (permitted interactions) users are given to digital items. In this paper, we describe how rights checking can be realized in an environment in which interactions with digital items are declared through Digital Item Processing. We demonstrate how rights checking can be done when "critical" digital item base operations are called, and how rights context information can be gathered by tracking during the execution of digital item methods.
Evaluating keypoint methods for content-based copyright protection of digital images
Larry Huston, R. Sukthankar, Yan Ke
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521614
This paper evaluates the effectiveness of keypoint methods for content-based protection of digital images. These methods identify a set of "distinctive" regions (termed keypoints) in an image and encode them using descriptors that are robust to expected image transformations. To determine whether a particular image was derived from a protected image, the keypoints for both images are generated and their descriptors matched. We describe a comprehensive set of experiments examining how keypoint methods cope with three real-world challenges: (1) loss of keypoints due to cropping; (2) matching failures caused by approximate nearest-neighbor indexing schemes; and (3) degraded descriptors due to significant image distortions. While keypoint methods perform very well in general, this paper identifies cases where their accuracy degrades.
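The match-and-decide pipeline can be sketched with a nearest-neighbour ratio test over descriptors. Everything concrete here is an assumption for illustration: descriptors are toy 4-D vectors rather than SIFT features, the matcher is exact brute force rather than the approximate indexing the paper studies, and the thresholds are arbitrary.

```python
import math

def ratio_matches(query, protected, ratio=0.8):
    """Indices of query descriptors whose best match clearly beats the 2nd best."""
    matched = []
    for qi, q in enumerate(query):
        d = sorted(math.dist(q, p) for p in protected)
        if len(d) >= 2 and d[0] < ratio * d[1]:     # Lowe-style ratio test
            matched.append(qi)
    return matched

def is_derived(query, protected, min_matches=2):
    """Declare derivation when enough distinctive matches survive."""
    return len(ratio_matches(query, protected)) >= min_matches

protected = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)]
derived   = [(0.9, 0.1, 0, 0), (0, 1.1, 0, 0)]   # mildly distorted copies
unrelated = [(5, 5, 5, 5)]
print(is_derived(derived, protected), is_derived(unrelated, protected))
```

The paper's three failure modes map directly onto this sketch: cropping removes entries from `query`, approximate indexing can return the wrong neighbour in the ratio test, and heavy distortion pushes `d[0]` past the ratio threshold.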
Hybrid speaker tracking in an automated lecture room
Cha Zhang, Y. Rui, Li-wei He, M. Wallick
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521365
We present a hybrid speaker tracking scheme based on a single pan/tilt/zoom (PTZ) camera in an automated lecture capturing system. Because the camera's video resolution is higher than the required output resolution, we frame the output video as a sub-region of the camera's input video. This allows us to track the speaker both digitally and mechanically: digital tracking has the advantage of being smooth, while mechanical tracking can cover a wide area, so hybrid tracking combines the benefits of both. In addition, we present an intelligent pan/zoom selection scheme to improve the aesthetics of the captured lecture scene.
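The hybrid policy can be sketched in one dimension: slide the output crop across the sensor for small speaker motion, and command the mechanical head only when the crop would leave the sensor. The resolutions, the 1-D geometry, and the simple clamping rule are illustrative assumptions, not the paper's parameters.

```python
SENSOR_W, CROP_W = 1280, 640    # assumed sensor and output widths (pixels)

def track_step(crop_x, cam_pan, speaker_x):
    """Center the crop on speaker_x; fall back to mechanical pan at the edges."""
    desired = speaker_x - CROP_W // 2
    if 0 <= desired <= SENSOR_W - CROP_W:
        return desired, cam_pan                     # smooth digital pan suffices
    # clamp the crop and hand the residual motion to the mechanical head
    clamped = min(max(desired, 0), SENSOR_W - CROP_W)
    return clamped, cam_pan + (desired - clamped)

crop, pan = 320, 0
crop, pan = track_step(crop, pan, speaker_x=700)    # stays digital
crop, pan = track_step(crop, pan, speaker_x=1200)   # mechanical pan kicks in
print(crop, pan)
```

Keeping motion digital whenever possible is what makes the output smooth; the mechanical move is reserved for the large excursions a fixed crop cannot cover.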
Proactive Energy Optimization Algorithms for Wavelet-Based Video Codecs on Power-Aware Processors
V. Akella, M. Schaar, W. Kao
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521486
We propose a systematic technique for characterizing the workload of a video decoder at a given time and transforming the shape of that workload to optimize the utilization of a critical resource without increasing the distortion incurred in the process. We call this approach proactive resource management. We illustrate our techniques on the problem of minimizing the energy consumed while decoding a video sequence on a programmable processor that supports multiple voltages and frequencies. We evaluate two heuristics for the underlying optimization problem, which yield 50% to 92% improvements in energy savings compared to techniques that do not use dynamic adaptation.
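The voltage/frequency selection underlying such schemes can be sketched as follows: per frame, pick the lowest operating point that still meets the decoding deadline, with dynamic energy scaling roughly as V² times the cycle count. The operating points, workloads, and deadline below are invented for illustration and are not the paper's heuristics or numbers.

```python
# Assumed (voltage in V, frequency in MHz) pairs, sorted by performance
OPERATING_POINTS = [(0.9, 200), (1.1, 400), (1.3, 600)]

def pick_point(cycles_m, deadline_ms):
    """Lowest point whose frequency decodes `cycles_m` Mcycles in time."""
    for v, f in OPERATING_POINTS:
        if cycles_m / f * 1000.0 <= deadline_ms:
            return v, f
    return OPERATING_POINTS[-1]                 # best effort if infeasible

def energy(workloads_m, deadline_ms=33.0):
    """Relative dynamic energy (~ V^2 * cycles) over a frame sequence."""
    return sum(pick_point(w, deadline_ms)[0] ** 2 * w for w in workloads_m)

frames = [4.0, 10.0, 18.0]                      # Mcycles per frame (made up)
adaptive = energy(frames)
fixed = sum(1.3 ** 2 * w for w in frames)       # always run at the top point
print(round(1 - adaptive / fixed, 2))           # fractional energy saving
```

The paper's contribution goes beyond this greedy per-frame rule: it reshapes the workload itself so the frequency schedule has more slack to exploit.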
WA-TV: Webifying and Augmenting Broadcast Content for Next-Generation Storage TV
H. Miyamori, Qiang Ma, Katsumi Tanaka
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521716
A method is proposed for viewing broadcast content that converts TV programs into Web content and integrates the results with complementary information retrieved from the Internet. Converting programs into Web pages lets viewers skim a program for an overview and easily explore particular scenes. Integrating complementary information lets programs be viewed efficiently as value-added content. An intuitive, user-friendly browsing interface lets the user easily change the level of detail displayed for the integrated information by zooming. Preliminary testing of a prototype system for next-generation storage TV, "WA-TV", validated the proposed method.
An audio spread-spectrum data hiding system with an informed embedding strategy adapted to a Wiener filtering based receiver
C. Baras, N. Moreau
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521598
One application of audio data hiding and watermarking systems is to use the audio signal as a transmission channel for binary information. Such a system should ensure reliable, robust transmission under various channel perturbations while keeping the computational cost low enough for real-time applications. In this paper, we present a hybrid spread-spectrum data hiding system that combines two reference systems from the state of the art: one based on a real-time receiver, and one based on an informed embedding strategy with maximized robustness to additive perturbations. Experimental results assess the efficiency of the system in terms of (1) transmission reliability, which is significantly improved over the reference systems, and (2) computational cost, which makes real-time reception feasible for broadcast applications with off-line embedding.
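The bare spread-spectrum channel at the heart of such systems can be sketched as follows: each bit modulates a pseudo-noise carrier added to the host signal at low amplitude, and the receiver correlates against the same carrier. The Wiener-filtering receiver and the informed embedding strategy that distinguish the paper are deliberately omitted; all signals and parameters here are synthetic.

```python
import random

def pn_sequence(length, seed=7):
    """Pseudo-noise carrier of +/-1 chips, reproducible from the seed."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def embed(host, bit, carrier, alpha=0.5):
    """Add the carrier, signed by the bit, at low amplitude alpha."""
    s = 1.0 if bit else -1.0
    return [h + alpha * s * c for h, c in zip(host, carrier)]

def detect(received, carrier):
    """Correlate with the carrier; the sign of the sum recovers the bit."""
    return sum(r * c for r, c in zip(received, carrier)) > 0.0

carrier = pn_sequence(256)
rng = random.Random(1)
host = [rng.uniform(-1, 1) for _ in range(256)]     # stand-in for audio
print(detect(embed(host, True, carrier), carrier),
      detect(embed(host, False, carrier), carrier))
```

The correlation gain (alpha times the carrier length) dominates the host's own correlation with the carrier, which is why even a weak, inaudible embedding decodes reliably; the paper's informed embedding pushes this margin further for a fixed perceptual budget.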
A HMM-Embedded Unsupervised Learning to Musical Event Detection
Sheng Gao, Yongwei Zhu
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521428
In this paper, an HMM-embedded unsupervised learning approach is proposed to detect music events by grouping similar segments of the music signal. The approach clusters segments based on the similarity of their spectral as well as temporal structures, which is not easily achieved with traditional similarity measures. Together with a Bayesian information criterion, the approach obtains a suitable event set that regularizes the complexity of the model structure; the natural product is a set of music events modeled by HMMs. Our experimental analyses show that the detected musical events are more perceptually meaningful and more consistent than those from KL-distance based clustering, and the learned events match better with our experience in spectrogram reading. The approach is further evaluated on a music identification task: the identification error rate is reduced to 1.57%, a 56.3% relative error rate reduction compared with a system trained using the KL-distance clustering method.
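The role BIC plays in choosing the event-set size can be sketched on a much smaller problem: score each candidate cluster count by model likelihood minus a complexity penalty and keep the best. The paper applies this to HMM-modeled segments; here a toy 1-D k-means with a spherical-Gaussian likelihood stands in, and the data and parameter counting are illustrative assumptions.

```python
import math, statistics

def kmeans_1d(data, k, iters=20):
    """Tiny 1-D k-means; initial centers spread over the sorted data."""
    centers = sorted(data)[::max(1, len(data) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        centers = [statistics.mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

def bic(data, k):
    """Gaussian log-likelihood minus the usual (params/2)*log(N) penalty."""
    centers, groups = kmeans_1d(data, k)
    sse = sum((x - c) ** 2 for g, c in zip(groups, centers) for x in g)
    var = max(sse / len(data), 1e-9)
    loglik = -0.5 * len(data) * (math.log(2 * math.pi * var) + 1)
    n_params = 2 * k                  # assumed: one mean + one variance term each
    return loglik - 0.5 * n_params * math.log(len(data))

data = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9, 9.8, 10.0, 10.2]   # three obvious groups
best_k = max((1, 2, 3, 4), key=lambda k: bic(data, k))
print(best_k)
```

Splitting beyond the true structure keeps raising the likelihood only marginally while the penalty grows, so the criterion settles on the natural number of groups, which is exactly how it regularizes the event-set size in the paper.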
Speech-adaptive layered G.729 coder for loss concealments of real-time voice over IP
B. Sat, B. Wah
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521637
In this paper, we propose a speech-adaptive layered-coding (LC) scheme for concealing losses of real-time CELP-coded speech transmitted over IP networks. Based on the ITU G.729 CS-ACELP codec operating at 8 kbps, we design a loss-robust speech-adaptive codec at the same bit rate. Our scheme employs LC with redundant packetization in order to conceal losses and adapt to dynamic loss conditions, characterized by the loss rate and the degree of burstiness, while maintaining an acceptable end-to-end delay. By protecting only the most important excitation parameters of each frame according to its speech type, our approach makes more efficient use of the bit budget. The scheme delivers good-quality speech with a level of protection similar to full replication under medium loss rates, provides speech quality similar to standard G.729 under very low loss rates, and outperforms both at low-to-medium loss rates.
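The redundant-packetization idea can be sketched independently of the codec: each packet piggybacks a coarse (base-layer) copy of the previous frame, so an isolated loss is concealed from the next packet. Frame payloads are just labels here; nothing of the G.729 parameter selection or the speech-type adaptation is modeled.

```python
def packetize(frames):
    """Packet i = (full frame i, redundant base-layer copy of frame i-1)."""
    return [(f, frames[i - 1] + "-base" if i else None)
            for i, f in enumerate(frames)]

def receive(packets, lost):
    """Rebuild the stream, concealing lost packets from piggybacked copies."""
    out = []
    for i, (full, _) in enumerate(packets):
        if i not in lost:
            out.append(full)
        elif i + 1 < len(packets) and i + 1 not in lost:
            out.append(packets[i + 1][1])       # coarse, but better than a gap
        else:
            out.append(None)                    # burst loss: frame truly dropped
    return out

pkts = packetize(["f0", "f1", "f2", "f3"])
print(receive(pkts, lost={1}))
```

An isolated loss degrades one frame to its base layer, while back-to-back losses still drop a frame, which is why the scheme adapts the amount of redundancy to the measured loss rate and degree of burstiness.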