Speech-Based Visual Concept Learning Using WordNet
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521627
Xiaodan Song, Ching-Yung Lin, Ming-Ting Sun
Modeling visual concepts using supervised or unsupervised machine learning approaches is becoming increasingly important for video semantic indexing, retrieval, and filtering applications. Videos naturally include multimodality data such as audio, speech, visual and text, which can be combined to infer the overall semantic concepts. In the literature, however, most research has been conducted within a single domain. In this paper, we propose an unsupervised technique that builds context-independent keyword lists for modeling desired visual concepts using WordNet. Furthermore, we propose an extended speech-based visual concept (ESVC) model that reorders and extends these keyword lists by supervised learning based on multimodality annotation. Experimental results show that the context-independent models achieve performance comparable to conventional supervised learning algorithms, and that the ESVC model achieves about 53% and 28.4% improvement on two testing subsets of the TRECVID 2003 corpus over a state-of-the-art speech-based video concept detection algorithm.
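As a rough illustration of the keyword-list idea (a minimal sketch, not the authors' implementation; the example concept "airplane" and the synonym/hyponym expansion depth are assumptions), one might expand a visual concept into speech-transcript keywords with NLTK's WordNet interface:

```python
# Minimal sketch: expand a visual concept into a context-independent keyword
# list using WordNet synonyms and hyponyms.
# Assumes NLTK with the WordNet corpus installed: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def concept_keywords(concept, max_depth=1):
    """Collect lemma names from the concept's noun synsets and their hyponyms."""
    keywords = set()
    frontier = wn.synsets(concept, pos=wn.NOUN)
    for _ in range(max_depth + 1):
        next_frontier = []
        for syn in frontier:
            for lemma in syn.lemma_names():
                keywords.add(lemma.replace('_', ' ').lower())
            next_frontier.extend(syn.hyponyms())
        frontier = next_frontier
    return sorted(keywords)

# Keywords that might signal the visual concept "airplane" in speech transcripts.
print(concept_keywords('airplane'))
```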
{"title":"Speech-Based Visual Concept Learning Using Wordnet","authors":"Xiaodan Song, Ching-Yung Lin, Ming-Ting Sun","doi":"10.1109/ICME.2005.1521627","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521627","url":null,"abstract":"Modeling visual concepts using supervised or unsupervised machine learning approaches are becoming increasing important for video semantic indexing, retrieval, and filtering applications. Naturally, videos include multimodality data such as audio, speech, visual and text, which are combined to infer therein the overall semantic concepts. However, in the literature, most researches were conducted within only one single domain. In this paper we propose an unsupervised technique that builds context-independent keyword lists for desired visual concept modeling using WordNet. Furthermore, we propose an extended speech-based visual concept (ESVC) model to reorder and extend the above keyword lists by supervised learning based on multimodality annotation. Experimental results show that the context-independent models can achieve comparable performance compared to conventional supervised learning algorithms, and the ESVC model achieves about 53% and 28.4% improvement in two testing subsets of the TRECVID 2003 corpus over a state-of-the-art speech-based video concept detection algorithm","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132683726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliable video communication with multi-path streaming using MDC
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521522
I. Lee, L. Guan
Video streaming demands high data rates and imposes hard delay constraints, which raises several challenges on today's packet-based, best-effort Internet. In this paper, we propose an efficient multiple-description coding (MDC) technique based on video frame sub-sampling and cubic-spline interpolation that provides spatial diversity without requiring additional buffering delay or storage. We also analyze the frame dropping rate due to packet loss and drifting error under a multi-path streaming environment.
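To make the spatial-diversity idea concrete, here is a hedged sketch (not the paper's exact coder): a frame is split into two descriptions by even/odd rows, and a lost description is approximated by cubic-spline interpolation along each column.

```python
# Sketch of the general MDC idea under the stated assumptions.
import numpy as np
from scipy.interpolate import CubicSpline

def split_descriptions(frame):
    """frame: 2-D numpy array (grayscale). Returns (even-row, odd-row) descriptions."""
    return frame[0::2, :], frame[1::2, :]

def reconstruct_from_even(even_rows, height):
    """Interpolate the missing odd rows from the even ones, column by column."""
    known_y = np.arange(0, height, 2)          # row indices we still have
    all_y = np.arange(height)
    out = np.empty((height, even_rows.shape[1]))
    for col in range(even_rows.shape[1]):
        spline = CubicSpline(known_y, even_rows[:, col])
        out[:, col] = spline(all_y)
    return np.clip(out, 0, 255)

frame = np.random.randint(0, 256, size=(480, 640)).astype(float)
d0, d1 = split_descriptions(frame)
approx = reconstruct_from_even(d0, frame.shape[0])   # used when description 1 is lost
```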
{"title":"Reliable video communication with multi-path streaming using MDC","authors":"I. Lee, L. Guan","doi":"10.1109/ICME.2005.1521522","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521522","url":null,"abstract":"Video streaming demands high data rates and hard delay constraints, and it raises several challenges on today's packet-based and best-effort Internet. In this paper, we propose an efficient multiple-description coding (MDC) technique based on video frame sub-sampling and cubic-spline interpolation to provide spatial diversity, such that no additional buffering delay or storage is required. The frame dropping rate due to packet loss and drifting error under the multi-path streaming environment is analyzed in this paper.","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132753187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Supporting rights checking in an MPEG-21 Digital Item Processing environment
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521608
F. D. Keukelaere, T. DeMartini, Jeroen Bekaert, R. Walle
Within the world of multimedia, the new MPEG-21 standard is currently under development. Its purpose is to create an open framework for multimedia delivery and consumption. MPEG-21 manages the multitude of content and metadata types by standardizing the declaration of digital items in an XML-based format. In addition to standardizing the declaration of digital items, MPEG-21 also standardizes digital item processing, which enables the declaration of suggested uses of digital items. The rights expression language and rights data dictionary parts of MPEG-21 enable the declaration of what rights (permitted interactions) Users are given to digital items. In this paper, we describe how rights checking can be realized in an environment in which interactions with digital items are declared through digital item processing. We demonstrate how rights checking can be performed when "critical" digital item base operations are called, and how rights context information can be gathered by tracking during the execution of digital item methods.
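The following sketch is purely illustrative; the license model, class, and function names are hypothetical and do not reflect MPEG-21 REL syntax or any DIP API. It only shows the shape of gating a "critical" base operation on a rights check while collecting rights-context information during method execution.

```python
# Hypothetical sketch of a rights-checked critical operation.
class RightsContext:
    def __init__(self, user, licenses):
        self.user = user
        self.licenses = licenses      # e.g. {("alice", "play", "item-42")}
        self.trace = []               # rights-context information gathered by tracking

    def check(self, operation, item_id):
        self.trace.append((operation, item_id))
        if (self.user, operation, item_id) not in self.licenses:
            raise PermissionError(f"{self.user} may not {operation} {item_id}")

def play_item(ctx, item_id):
    ctx.check("play", item_id)        # rights check before the critical base operation
    print(f"playing {item_id}")

ctx = RightsContext("alice", {("alice", "play", "item-42")})
play_item(ctx, "item-42")
```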
{"title":"Supporting rights checking in an MPEG-21 Digital Item Processing environment","authors":"F. D. Keukelaere, T. DeMartini, Jeroen Bekaert, R. Walle","doi":"10.1109/ICME.2005.1521608","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521608","url":null,"abstract":"Within the world of multimedia, the new MPEG-21 standard is currently under development. The purpose of this new standard is to create an open framework for multimedia delivery and consumption. MPEG-21 mastered the multitude of types of content and metadata by standardizing the declaration of digital items in an XML based format. In addition to standardizing, the declaration of digital items MPEG-21 also standardizes digital item processing, which enables the declaration of suggested uses of digital items. The rights expression language and the rights data dictionary parts of MPEG-21 enable the declaration of what rights (permitted interactions) Users are given to digital items. In this paper, we describe how rights checking can be realized in an environment in which interactions with digital items are declared through digital item processing. We demonstrate how rights checking can be done when \"critical\" digital item base operations are called and how rights context information can be gathered by tracking during the execution of digital item methods","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132781702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating keypoint methods for content-based copyright protection of digital images
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521614
Larry Huston, R. Sukthankar, Yan Ke
This paper evaluates the effectiveness of keypoint methods for content-based protection of digital images. These methods identify a set of "distinctive" regions (termed keypoints) in an image and encode them using descriptors that are robust to expected image transformations. To determine whether particular images were derived from a protected image, the keypoints for both images are generated and their descriptors matched. We describe a comprehensive set of experiments to examine how keypoint methods cope with three real-world challenges: (1) loss of keypoints due to cropping; (2) matching failures caused by approximate nearest-neighbor indexing schemes; (3) degraded descriptors due to significant image distortions. While keypoint methods perform very well in general, this paper identifies cases where the accuracy of such methods degrades.
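A minimal sketch of such a keypoint check (assuming OpenCV 4.4+ with SIFT available; the match count threshold and ratio are placeholders, not the paper's settings): detect keypoints in both images, match descriptors with a ratio test, and flag a possible derivation when enough matches survive.

```python
import cv2

def looks_derived(protected_path, query_path, min_matches=20, ratio=0.75):
    sift = cv2.SIFT_create()
    img1 = cv2.imread(protected_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    _, des1 = sift.detectAndCompute(img1, None)   # keypoint descriptors of protected image
    _, des2 = sift.detectAndCompute(img2, None)   # keypoint descriptors of query image
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]   # Lowe's ratio test
    return len(good) >= min_matches
```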
{"title":"Evaluating keypoint methods for content-based copyright protection of digital images","authors":"Larry Huston, R. Sukthankar, Yan Ke","doi":"10.1109/ICME.2005.1521614","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521614","url":null,"abstract":"This paper evaluates the effectiveness of keypoint methods for content-based protection of digital images. These methods identify a set of \"distinctive\" regions (termed keypoints) in an image and encode them using descriptors that are robust to expected image transformations. To determine whether particular images were derived from a protected image, the keypoints for both images are generated and their descriptors matched. We describe a comprehensive set of experiments to examine how keypoint methods cope with three real-world challenges: (1) loss of keypoints due to cropping; (2) matching failures caused by approximate nearest-neighbor indexing schemes; (3) degraded descriptors due to significant image distortions. While keypoint methods perform very well in general, this paper identifies cases where the accuracy of such methods degrades.","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132985470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hybrid speaker tracking in an automated lecture room
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521365
Cha Zhang, Y. Rui, Li-wei He, M. Wallick
We present a hybrid speaker tracking scheme based on a single pan/tilt/zoom (PTZ) camera in an automated lecture capturing system. Because the camera's video resolution is higher than the required output resolution, we frame the output video as a sub-region of the camera's input video. This allows us to track the speaker both digitally and mechanically: digital tracking has the advantage of being smooth, while mechanical tracking can cover a wide area, so hybrid tracking combines the benefits of both. In addition, we present an intelligent pan/zoom selection scheme that improves the aesthetics of the lecture scene.
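A hedged sketch of the digital half of such a scheme (resolutions, thresholds, and the pan trigger are assumptions, not the authors' values): crop the output window around the speaker inside the high-resolution frame, and request a mechanical pan when the window runs into the edge of the captured frame.

```python
def frame_output(speaker_x, speaker_y, cam_w=1920, cam_h=1080, out_w=960, out_h=540):
    left = speaker_x - out_w // 2
    top = speaker_y - out_h // 2
    # Digital tracking: clamp the output crop inside the captured frame.
    left_c = min(max(left, 0), cam_w - out_w)
    top_c = min(max(top, 0), cam_h - out_h)
    # If clamping moved the window far, the speaker is near the frame edge:
    # ask the PTZ head to pan/tilt mechanically instead of cropping further.
    need_mechanical = abs(left - left_c) > out_w // 4 or abs(top - top_c) > out_h // 4
    return (left_c, top_c, out_w, out_h), need_mechanical

crop, pan_needed = frame_output(speaker_x=1700, speaker_y=540)
```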
{"title":"Hybrid speaker tracking in an automated lecture room","authors":"Cha Zhang, Y. Rui, Li-wei He, M. Wallick","doi":"10.1109/ICME.2005.1521365","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521365","url":null,"abstract":"We present a hybrid speaker tracking scheme based on a single pan/tilt/zoom (PTZ) camera in an automated lecture capturing system. Given that the camera's video resolution is higher than the required output resolution, we frame the output video as a sub-region of the camera's input video. This allows us to track the speaker both digitally and mechanically. Digital tracking has the advantage of being smooth, and mechanical tracking can cover a wide area. The hybrid tracking achieves the benefits of both worlds. In addition to hybrid tracking, we present an intelligent pan/zoom selection scheme to improve the aestheticity of the lecture scene.","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133259502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proactive Energy Optimization Algorithms for Wavelet-Based Video Codecs on Power-Aware Processors
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521486
V. Akella, M. Schaar, W. Kao
We propose a systematic technique for characterizing the workload of a video decoder at a given time and transforming the shape of the workload to optimize the utilization of a critical resource without increasing the distortion incurred in the process. We call our approach proactive resource management. We illustrate our techniques by addressing the problem of minimizing energy consumption while decoding a video sequence on a programmable processor that supports multiple voltages and frequencies. We evaluate two heuristics for the underlying optimization problem that yield 50% to 92% improvements in energy savings compared to techniques that do not use dynamic adaptation.
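A toy sketch of the underlying voltage/frequency selection (the operating points and workload model are assumptions, not the paper's heuristics): choose the lowest frequency that still decodes the frame before its deadline, since switching energy per cycle scales roughly with the square of the supply voltage.

```python
OPERATING_POINTS = [            # (frequency in MHz, voltage in V), hypothetical values
    (200, 0.9), (400, 1.0), (600, 1.1), (800, 1.3),
]

def choose_operating_point(cycles_needed, deadline_ms):
    for freq_mhz, volt in OPERATING_POINTS:              # lowest frequency first
        decode_time_ms = cycles_needed / (freq_mhz * 1e3)  # 1 MHz = 1e3 cycles per ms
        if decode_time_ms <= deadline_ms:
            return freq_mhz, volt                        # slowest point meeting the deadline
    return OPERATING_POINTS[-1]                          # run flat out if nothing fits

print(choose_operating_point(cycles_needed=9e6, deadline_ms=33.3))  # -> (400, 1.0)
```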
{"title":"Proactive Energy Optimization Algorithms for Wavelet-Based Video Codecs on Power-Aware Processors","authors":"V. Akella, M. Schaar, W. Kao","doi":"10.1109/ICME.2005.1521486","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521486","url":null,"abstract":"We propose a systematic technique for characterizing the workload of a video decoder at a given time and transforming the shape of the workload to optimize the utilization of a critical resource without compromising the distortion incurred in the process. We call our approach proactive resource management. We will illustrate our techniques by addressing the problem of minimizing the energy consumption during decoding a video sequence on a programmable processor that supports multiple voltages and frequencies. We evaluate two different heuristics for the underlying optimization problem that result in 50% to 92% improvements in energy savings compared to techniques that do not use dynamic adaptation","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131882778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WA-TV: Webifying and Augmenting Broadcast Content for Next-Generation Storage TV
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521716
H. Miyamori, Qiang Ma, Katsumi Tanaka
A method is proposed for viewing broadcast content that converts TV programs into Web content and integrates the results with complementary information retrieved from the Internet. Converting the programs into Web pages enables them to be skimmed for an overview and particular scenes to be easily explored. Integrating complementary information enables the programs to be viewed efficiently with value-added content. An intuitive, user-friendly browsing interface lets the user easily change the level of detail displayed for the integrated information by zooming. Preliminary testing of a prototype system for next-generation storage TV, "WA-TV", validated the approach taken by the proposed method.
{"title":"WA-TV: Webifying and Augmenting Broadcast Content for Next-Generation Storage TV","authors":"H. Miyamori, Qiang Ma, Katsumi Tanaka","doi":"10.1109/ICME.2005.1521716","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521716","url":null,"abstract":"A method is proposed for viewing broadcast content that converts TV programs into Web content and integrates the results with complementary information retrieved using the Internet. Converting the programs into Web pages enables the programs to be skimmed over to get an overview and for particular scenes to be easily explored. Integrating complementary information enables the programs to be viewed efficiently with value-added content. An intuitive, user-friendly browsing interface enables the user to easily changing the level of detail displayed for the integrated information by zooming. Preliminary testing of a prototype system for next-generation storage TV, \"WA-TV\", validated the approach taken by the proposed method","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133093034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What happens in films?
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521357
A. Salway, Andrew Vassiliou, K. Ahmad
This paper aims to contribute to the analysis and description of semantic video content by investigating what actions are important in films. We apply a corpus analysis method to identify frequently occurring phrases in texts that describe films: screenplays and audio description. Frequent words and statistically significant collocations of these words are identified in the screenplays of 75 films and in the audio description of 45 films. Phrases such as 'looks at', 'turns to', 'smiles at' and various collocations of 'door' were found to be common. We argue that these phrases occur frequently because they describe actions that are important story-telling elements of filmed narrative. We discuss how this knowledge helps the development of systems to structure semantic video content.
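A minimal sketch of this kind of corpus analysis (the screenplay file and cut-offs are placeholders, not the paper's corpus) using NLTK's collocation finder: count word bigrams and rank statistically significant collocations with a log-likelihood measure.

```python
# Assumes NLTK with the tokenizer models installed: nltk.download('punkt')
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

with open("screenplay.txt", encoding="utf-8") as fh:   # hypothetical corpus file
    tokens = nltk.word_tokenize(fh.read().lower())

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)                            # ignore rare word pairs
measures = BigramAssocMeasures()
top_collocations = finder.nbest(measures.likelihood_ratio, 20)
print(top_collocations)                                # e.g. ('looks', 'at'), ('turns', 'to')
```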
{"title":"What happens in films?","authors":"A. Salway, Andrew Vassiliou, K. Ahmad","doi":"10.1109/ICME.2005.1521357","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521357","url":null,"abstract":"This paper aims to contribute to the analysis and description of semantic video content by investigating what actions are important in films. We apply a corpus analysis method to identify frequently occurring phrases in texts that describe films-screenplays and audio description. Frequent words and statistically significant collocations of these words are identified in screenplays of 75 films and in audio description of 45 films. Phrases such as 'looks at', 'turns to', 'smiles at' and various collocations of 'door' were found to be common. We argue that these phrases occur frequently because they describe actions that are important story-telling elements for filmed narrative. We discuss how this knowledge helps the development of systems to structure semantic video content.","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114085062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint Image Halftoning and Watermarking in High-Resolution Digital Form
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521478
Chao-Yong Hsu, Chun-Shien Lu
Existing halftone image watermarking methods embed a watermark bit in a halftone dot, which corresponds to a single pixel, to generate a stego halftone image. This one-to-one mapping, however, is not consistent with the one-to-many strategy used by current high-resolution devices, such as computer printers and screens, where one pixel is first expanded into many dots and a halftoning process is then applied to generate a halftone image. Furthermore, electronic paper or smart paper that produces high-resolution digital files cannot be protected by traditional halftone watermarking methods. In view of these facts, we present a high-resolution halftone watermarking scheme that addresses the aforementioned problems. The characteristics of our scheme are: (i) a high-resolution halftoning process that employs a one-to-many mapping strategy; (ii) a many-to-one inverse halftoning process that generates gray-scale images of good quality; and (iii) halftone image watermarking that can be conducted directly on gray-scale rather than halftone images to achieve better robustness.
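A sketch of the one-to-many expansion alone, with no watermark embedding and not the paper's algorithm: each gray-scale pixel becomes a 2x2 block of binary dots via ordered dithering with a Bayer threshold matrix.

```python
import numpy as np

BAYER_2x2 = (np.array([[0, 2], [3, 1]]) + 0.5) / 4.0   # thresholds in (0, 1)

def halftone_one_to_many(gray):
    """gray: 2-D array with values in [0, 255]. Returns a 2x-resolution binary image."""
    h, w = gray.shape
    expanded = np.repeat(np.repeat(gray / 255.0, 2, axis=0), 2, axis=1)  # one pixel -> 2x2 dots
    thresholds = np.tile(BAYER_2x2, (h, w))
    return (expanded > thresholds).astype(np.uint8)

gray = np.linspace(0, 255, 64 * 64).reshape(64, 64)
dots = halftone_one_to_many(gray)        # 128x128 binary halftone
```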
{"title":"Joint Image Halftoning and Watermarking in High-Resolution Digital Form","authors":"Chao-Yong Hsu, Chun-Shien Lu","doi":"10.1109/ICME.2005.1521478","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521478","url":null,"abstract":"The existing halftone image watermarking methods were proposed to embed a watermark bit in a halftone dot, which corresponds to a pixel, to generate stego halftone image. This one-to-one mapping, however, is not consistent with the one-to-many strategy that is used by current high-resolution devices, such as computer printers and screens, where one pixel is first expanded into many dots and then a halftoning processing is employed to generate a halftone image. Furthermore, electronic paper or smart paper that produces high-resolution digital files cannot be protected by the traditional halftone watermarking methods. In view of these facts, we present a high-resolution halftone watermarking scheme to deal with the aforementioned problems. The characteristics of our scheme include: (i) a high-resolution halftoning process that employs a one-to-many mapping strategy is proposed; (ii) a many-to-one inverse halftoning process is proposed to generate gray-scale images of good quality; and (iii) halftone image watermarking can be directly conducted on gray-scale instead of halftone images to achieve better robustness","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115900364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video quality classification based home video segmentation
Pub Date: 2005-07-06 | DOI: 10.1109/ICME.2005.1521399
Si Wu, Yu-Fei Ma, HongJiang Zhang
Home videos often contain abnormal camera motions, such as camera shaking and irregular camera movements, which degrade visual quality. Removing bad-quality segments and automatically stabilizing shaky ones are necessary steps for home video archiving. In this paper, we propose a novel segmentation algorithm for home video based on video quality classification. According to three important properties of motion (speed, direction, and acceleration), the effects caused by camera motion are classified into four categories, blurred, shaky, inconsistent, and stable, using support vector machines (SVMs). Based on this classification, a multi-scale sliding window is employed to parse the video sequence into segments along the time axis, and each segment is labeled with one of the camera motion effects. The effectiveness of the proposed approach has been validated by extensive experiments.
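A hedged sketch of the classification step (the feature set and training examples are stand-ins, not the paper's motion descriptors) using scikit-learn's SVM: each window of motion features is assigned one of the four quality categories.

```python
import numpy as np
from sklearn.svm import SVC

# Each row: [mean speed, direction variance, mean acceleration] for one window (toy values).
X_train = np.array([[0.1, 0.02, 0.01],    # stable
                    [2.5, 1.50, 0.90],    # shaky
                    [1.2, 0.10, 0.70],    # inconsistent
                    [3.0, 0.05, 0.05]])   # blurred (fast but smooth pan)
y_train = ["stable", "shaky", "inconsistent", "blurred"]

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

windows = np.array([[0.2, 0.03, 0.02], [2.8, 1.2, 1.0]])
print(clf.predict(windows))              # e.g. ['stable' 'shaky']
```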
{"title":"Video quality classification based home video segmentation","authors":"Si Wu, Yu-Fei Ma, HongJiang Zhang","doi":"10.1109/ICME.2005.1521399","DOIUrl":"https://doi.org/10.1109/ICME.2005.1521399","url":null,"abstract":"Home videos often have some abnormal camera motions, such as camera shaking and irregular camera motions, which cause the degradation of visual quality. To remove bad quality segments and automatic stabilize shaky ones are necessary steps for home video archiving. In this paper, we proposed a novel segmentation algorithm for home video based on video quality classification. According to three important properties of motion, speed, direction, and acceleration, the effects caused by camera motion are classified into four categories: blurred, shaky, inconsistent and stable using support vector machines (SVMs). Based on the classification, a multi-scale sliding window is employed to parse video sequence into different segments along time axis, and each of these segments is labeled as one of camera motion effects. The effectiveness of the proposed approach has been validated by extensive experiments.","PeriodicalId":244360,"journal":{"name":"2005 IEEE International Conference on Multimedia and Expo","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115139447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}