A robust on-the-fly pitch (OTFP) estimation algorithm
S. Sood, A. Krishnamurthy
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027591

Pitch detection, or fundamental frequency (f0) estimation, is a classical research topic that has been studied extensively for many years. Pitch estimation by embedding the speech signal into a multidimensional state space is a relatively recent technique, and the YIN pitch detection algorithm [1] has recently been cited as an improvement over other standard pitch estimators. This paper presents a unifying view of these existing and seemingly disparate techniques. The unified view enables robust formulations of some existing definitions and helps explain the limitations of the classical approaches in use. The idea is applied to robust on-the-fly pitch (OTFP) detection, and comparison with the robust YIN pitch detector yields encouraging results. The on-the-fly setting imposes the constraint that pitch or aperiodicity estimates from past or future speech frames may not be used in a post-processing stage; under this constraint, OTFP outperforms the YIN estimator.
Affinity relation discovery in image database clustering and content-based retrieval
M. Shyu, Shu-Ching Chen, Min Chen, Chengcui Zhang
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027614

In this paper, we propose a unified framework, called Markov Model Mediator (MMM), to facilitate image database clustering and to improve query performance. The MMM framework consists of two hierarchical levels: local MMMs and integrated MMMs, which model the affinity relations among the images within a single image database and within a set of image databases, respectively, via an effective data mining process. The effectiveness and efficiency of the MMM framework for database clustering and image retrieval are demonstrated over a set of image databases containing various numbers of images with different dimensions and concept categories.
A multiple watermarking algorithm based on CDMA technique
F. Zou, Zhengding Lu, H. Ling
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027629

This paper proposes a multiple watermarking algorithm based on the code division multiple access (CDMA) technique. Before the watermark is embedded, each user uses his private key as a seed to generate an address code with a pseudorandom noise distribution. Each watermark is modulated onto a carrier signal with its corresponding address code, and these carrier signals are added to the host media (e.g., image, video, or audio). During watermark detection, each user extracts his watermark from the detected media using the same address code, by computing the correlation coefficient between the address code and the watermarked vector of the detected media. Each user can embed and extract his watermark independently of the others. Experimental results show that the scheme embeds and extracts watermarks independently for each user and supports multiple watermarking.
Semantic-aware automatic video editing
S. Bocconi
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027753

One of the challenges of multimedia applications is to provide user-tailored access to information encoded in different media. In particular, previous research has not yet fully explored how to automatically compose different video segments according to a communicative goal. We propose a rhetoric-based method to support the selection and automatic editing of user-requested content from video footage. The method is applied to the domain of video documentaries to create biased sequences about a user-selected subject.
Between context-aware media capture and multimedia content analysis: where do we find the promised land?
Susanne CJ Boll, D. Bulterman, R. Jain, Tat-Seng Chua, R. Lienhart, L. Wilcox, Marc Davis, S. Venkatesh
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027727

Various issues related to multimedia information retrieval and media access are discussed. Feasible solutions for automatic signal-based analysis of media content are analyzed. The extent of user involvement in the content creation process is emphasized. The applications driving the creation and usage of context and metadata are also elaborated.
BiReality: mutually-immersive telepresence
N. Jouppi, Subu Iyer, Stan Thomas, A. Mitchell
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027725

BiReality (a.k.a. Mutually-Immersive Telepresence) uses a teleoperated robotic surrogate to provide an immersive telepresence system for face-to-face interactions. Our goal is to recreate, to the greatest extent practical, the sensory experience relevant to face-to-face interaction, both for the user and for the people at the remote location, as if the user were actually present there. Our system provides a 360-degree surround immersive audio and visual experience for both the user and the remote participants, and streams eight 704x480 MPEG-2 coded videos totaling 20 Mb/s. The system preserves gaze and eye contact, presents local and remote participants to each other at life size, and preserves the user's head height at the remote location. Initial user experiences are presented.
Automatic replay generation for soccer video broadcasting
Jinjun Wang, Changsheng Xu, Chng Eng Siong, K. Wan, Q. Tian
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027535

While most current approaches for sports video analysis are based on broadcast video, in this paper we present a novel approach for highlight detection and automatic replay generation for soccer videos taken by the main camera. This research is important because soccer highlight detection and replay generation from a live game is currently a labor-intensive process. A robust multi-level, multi-model event detection framework is proposed to detect events and event boundaries from the video taken by the main camera. The framework explores the available analysis cues, using a mid-level representation to bridge the gap between low-level features and high-level events. The event detection results and the mid-level representation are used to generate replays, which are automatically inserted into the video. Experimental results are promising, and the generated replays are comparable with those produced by broadcast professionals.
3D reconstruction and enrichment of broadcast soccer video
Xinguo Yu, Xin Yan, Tze Sen Hay, H. Leong
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027586

Reconstructing sports video for various purposes has recently become a trend. This paper presents a 3D reconstruction and enrichment system that not only reconstructs broadcast soccer video but also enriches the reconstructed video with music and illustrations of the video contents. The system can reconstruct not only the goalmouth scene but also the midfield scene, which existing systems cannot reconstruct. To quickly find the feature points for calibrating the camera, we propose a fast algorithm to detect the lines in the goalmouth scene, and we use the algorithm proposed in our previous papers to detect the partial ellipses in the midfield scene. The reconstruction is conducted on several video sequences of the two scenes. The reconstructed videos eliminate ball deformation and unnecessary camera changes by smoothing the camera parameters. The system also serves as an experimental platform for our project to reconstruct an ongoing soccer game in real time.
N.A.G.: network auralization for Gnutella
Jason Freeman
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027567

N.A.G. (Network Auralization for Gnutella) is interactive software art designed to actively involve a lay public without musical training in a creative musical experience. Users enter search keywords, and the software looks for matching music files on the Gnutella peer-to-peer file-sharing network. As it downloads music, it plays an audio collage whose structure is based on the relative download rates of the files.
Towards auto-documentary: tracking the evolution of news stories
P. D. Sahin, Jia-Yu Pan, D. Forsyth
MULTIMEDIA '04, 2004. doi:10.1145/1027527.1027719

News videos constitute an important source of information for tracking and documenting important events. In these videos, news stories are often accompanied by short video shots that tend to be repeated over the course of the event. Automatic detection of such repetitions is essential for creating auto-documentaries and for alleviating the limitations of traditional textual topic detection methods. In this paper, we propose novel methods for detecting and tracking the evolution of news stories over time. The proposed method exploits both visual cues and textual information to summarize evolving news stories. Experiments are carried out on the TREC-VID data set, consisting of 120 hours of news videos from two different channels.