Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5662059
Movement recognition exploiting multi-view information
Alexandros Iosifidis, N. Nikolaidis, I. Pitas
In this paper, a novel view-invariant movement recognition method is presented. A multi-camera setup is used to capture the movement from different observation angles. The position of each camera with respect to the subject's body is identified by a procedure based on morphological operations and the proportions of the human body. Binary body masks from the frames of all cameras, consistently arranged through this procedure, are concatenated to produce a so-called multi-view binary mask. These masks are rescaled and vectorized to create feature vectors in the input space. Fuzzy vector quantization is performed to associate input feature vectors with movement representations, and linear discriminant analysis is used to map movements into a low-dimensional discriminant feature space. Experimental results show that the method achieves very satisfactory recognition rates.
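As a rough illustration of the fuzzy vector quantization and discriminant mapping steps described above, the following sketch computes fuzzy memberships of vectorized masks against a codebook and projects them with LDA. The codebook size, fuzziness parameter and use of scikit-learn are assumptions for illustration, not the authors' configuration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fuzzy_memberships(X, codebook, m=2.0, eps=1e-12):
    """Fuzzy c-means style memberships of each input vector to each codevector."""
    # X: (n_samples, d) vectorized multi-view binary masks; codebook: (k, d)
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2) + eps  # (n, k)
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # rows sum to 1

# Hypothetical data: 200 vectorized masks of dimension 1024, 32 codevectors, 5 movement classes
rng = np.random.default_rng(0)
X = rng.random((200, 1024))
codebook = rng.random((32, 1024))
y = rng.integers(0, 5, 200)

U = fuzzy_memberships(X, codebook)                    # movement representations
lda = LinearDiscriminantAnalysis(n_components=4).fit(U, y)
Z = lda.transform(U)                                  # low-dimensional discriminant features
```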
Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5662075
Multimodal speech recognition of a person with articulation disorders using AAM and MAF
Chikoto Miyamoto, Yuto Komai, T. Takiguchi, Y. Ariki, I. Li
We investigated the speech recognition of a person with articulation disorders resulting from athetoid cerebral palsy. The articulation of speech tends to become unstable due to strain on speech-related muscles, which degrades speech recognition performance. Therefore, we use multiple acoustic frames (MAF) as the acoustic feature to address this problem. Further, in real environments, current speech recognition systems do not perform sufficiently well because of noise. In addition to acoustic features, visual features are used to increase noise robustness in real environments. However, recognition problems arise from the tendency of people with cerebral palsy to move their head erratically. We investigate a pose-robust audio-visual speech recognition method that uses an Active Appearance Model (AAM) to address this problem for people with articulation disorders resulting from athetoid cerebral palsy. AAMs are used for face tracking to extract pose-robust facial feature points. The method's effectiveness is confirmed by word recognition experiments on noisy speech of a person with articulation disorders.
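A minimal sketch of the multiple-acoustic-frames idea, stacking each frame with its temporal neighbours into one feature vector; the MFCC dimensionality and context width are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def stack_frames(features, context=2):
    """Concatenate each frame with its +/- context neighbours (edges padded by repetition)."""
    # features: (n_frames, n_coeffs), e.g. MFCCs
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(features)] for i in range(2 * context + 1)])

mfcc = np.random.randn(300, 13)       # hypothetical 13-dim MFCCs for one utterance
maf = stack_frames(mfcc, context=2)   # (300, 65): 5 consecutive frames per feature vector
```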
Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5662000
The Iteration-Tuned Dictionary for sparse representations
J. Zepeda, C. Guillemot, Ewa Kijak
We introduce a new dictionary structure for sparse representations that is better adapted to the pursuit algorithms used in practical scenarios. The new structure, which we call an Iteration-Tuned Dictionary (ITD), consists of a set of dictionaries, each associated with a single iteration index of a pursuit algorithm. In this work we first adapt pursuit decompositions to ITD structures and then introduce a training algorithm for constructing ITDs. The training algorithm applies K-means to the (i-1)-th residuals of the training set to produce the i-th dictionary of the ITD structure. In the results section we compare our algorithm against a state-of-the-art dictionary training scheme and show that our method produces sparse representations yielding better signal approximations at the same sparsity level.
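The training loop described above can be sketched as follows: each iteration clusters the current residuals to form that iteration's dictionary, then removes one matching-pursuit-style atom per signal. The use of scikit-learn's KMeans, unit-norm atoms and a single atom per iteration are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_itd(X, n_atoms=64, n_iters=5, seed=0):
    """Train an Iteration-Tuned Dictionary: one dictionary per pursuit iteration,
    obtained by clustering the residuals left by the previous iterations."""
    residuals, dictionaries = X.copy(), []
    for _ in range(n_iters):
        km = KMeans(n_clusters=n_atoms, n_init=4, random_state=seed).fit(residuals)
        D = km.cluster_centers_
        D = D / (np.linalg.norm(D, axis=1, keepdims=True) + 1e-12)   # unit-norm atoms
        dictionaries.append(D)
        # one matching-pursuit step per signal with this iteration's dictionary
        corr = residuals @ D.T                        # (n_signals, n_atoms)
        best = np.abs(corr).argmax(axis=1)
        coeffs = corr[np.arange(len(X)), best]
        residuals = residuals - coeffs[:, None] * D[best]
    return dictionaries

X = np.random.randn(1000, 64)      # hypothetical training signals
itd = train_itd(X)                 # list of 5 iteration-specific dictionaries
```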
Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5661992
Considering security and robustness constraints for watermark-based Tardos fingerprinting
B. Mathon, P. Bas, François Cayre, B. Macq
This article is a theoretical study of binary Tardos fingerprinting codes embedded using watermarking schemes. Our approach is derived from [1] and encompasses both security and robustness constraints. We assume here that the coalition has estimated the symbols of the fingerprinting code by means of a security attack, the quality of the estimation depending on the security of the watermarking scheme. Taking into account the fact that the coalition can make estimation errors, we update the Worst Case Attack, which minimises the mutual information between the sequence of one colluder and the pirated sequence forged by the coalition. After comparing the achievable rates of the previous and the proposed Worst Case Attack as a function of the estimation error, we conclude this analysis by comparing the robustness of non-secure embedding schemes with that of secure ones. We show that, for low probabilities of error during the decoding stage (e.g. highly robust watermarking schemes), security makes it possible to increase the achievable rate of the fingerprinting scheme.
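For context, a minimal sketch of a standard binary Tardos code (not the paper's secure-embedding analysis): biases drawn from the arcsine-like density, per-user fingerprints drawn from those biases, and a commonly used symmetric accusation score. The code length, cutoff and toy collusion strategy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_users = 2048, 50                       # code length and number of users (illustrative)

# Biases p_i from the arcsine-like density of Tardos codes (cutoff t keeps p away from 0 and 1)
t = 1.0 / 300.0
u = rng.uniform(np.arcsin(np.sqrt(t)), np.arcsin(np.sqrt(1 - t)), size=m)
p = np.sin(u) ** 2

X = (rng.random((n_users, m)) < p).astype(np.int8)   # user fingerprints

def accusation_scores(y, X, p):
    """Symmetric Tardos score of every user against a pirated sequence y."""
    g1 = np.sqrt((1 - p) / p)               # reward for agreeing with y at a '1' position
    g0 = -np.sqrt(p / (1 - p))              # penalty for disagreeing at a '1' position
    agree = (X == y)
    return np.where(agree, np.where(y == 1, g1, -g0),
                    np.where(y == 1, g0, -g1)).sum(axis=1)

pirates = X[:2]
y = pirates.max(axis=0)                     # toy collusion: output 1 wherever either pirate has a 1
scores = accusation_scores(y, X, p)         # the two colluders should rank near the top
```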
Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5662060
Generation of see-through baseball movie from multi-camera views
Takanori Hashimoto, Yuko Uematsu, H. Saito
This paper presents a method for generating new-viewpoint movies of a baseball game. One of the most interesting viewpoints in a baseball game is from behind the catcher. If only one camera is placed behind the catcher, however, the view is occluded by the umpire and the catcher. In this paper, we propose a method for generating a see-through movie captured from behind the catcher by recovering the pitcher's appearance with multiple cameras, so that the obstacles (catcher and umpire) can be virtually removed from the movie. Our method consists of three processes: recovering the pitcher's appearance by homography, detecting obstacles by graph cut, and projecting the ball's trajectory. To demonstrate the effectiveness of our method, in the experiments we generate a see-through movie by applying our method to multi-camera footage taken in a real baseball stadium. In the resulting see-through movie, the pitcher appears through the catcher and umpire.
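The appearance-recovery step rests on homographies between camera views; the OpenCV sketch below warps a side view onto the main view's image plane and composites it where an obstacle mask marks occlusion. The point correspondences, synthetic frames and mask are hypothetical stand-ins, not the paper's pipeline.

```python
import cv2
import numpy as np

# Hypothetical corresponding points on the (roughly planar) pitcher region in each view.
pts_side = np.float32([[412, 230], [660, 225], [655, 540], [418, 545]])
pts_main = np.float32([[350, 300], [540, 295], [535, 610], [355, 615]])

# Stand-ins for real frames from the side camera and the behind-the-catcher camera.
side_view = np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8)
main_view = np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8)
obstacle_mask = np.zeros((720, 1280), dtype=np.uint8)
obstacle_mask[250:650, 500:780] = 255            # hypothetical catcher/umpire region

H, _ = cv2.findHomography(pts_side, pts_main)    # plane-to-plane mapping between the views
h, w = main_view.shape[:2]
warped = cv2.warpPerspective(side_view, H, (w, h))

# Paste the warped pitcher pixels only where the obstacle mask marks occlusion.
see_through = np.where(obstacle_mask[..., None] > 0, warped, main_view)
```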
Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5662025
Bit allocation and encoded view selection for optimal multiview image representation
Gene Cheung, V. Velisavljevic
Novel coding tools have recently been proposed to encode the texture and depth maps of multiview images, exploiting inter-view correlations, for depth-image-based rendering (DIBR). However, the important associated bit allocation problem for DIBR remains open: for chosen view coding and synthesis tools, how should bits be allocated among texture and depth maps across encoded views so that the fidelity of a set of V views reconstructed at the decoder is maximized for a fixed bitrate budget? In this paper, we present an optimization strategy that selects a subset of the texture and depth maps of the original V views for encoding at appropriate quantization levels, so that at the decoder the combined quality of decoded views (using encoded texture maps) and synthesized views (using encoded texture and depth maps of neighboring views) is maximized. We show that, using a monotonicity property, the complexity of our strategy can be greatly reduced. Experiments show that our strategy achieves up to 0.83 dB gain in PSNR over a heuristic scheme that encodes only the texture maps of all V views at constant quantization levels. Further, computation can be reduced by up to 66% compared with a full parameter search.
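A toy sketch of the kind of search involved: choose one quantization level per encoded view to maximize total fidelity under a bitrate budget. The rate/distortion numbers are synthetic, and the brute-force product search merely stands in for the paper's monotonicity-pruned strategy.

```python
import itertools

# Hypothetical per-view (rate_kbit, psnr_dB) tables, one entry per quantization level.
rd_tables = [
    [(120, 33.1), (220, 35.4), (400, 37.2)],   # view 0
    [(110, 32.8), (200, 35.0), (380, 36.9)],   # view 1
    [(130, 33.5), (240, 35.8), (420, 37.5)],   # view 2
]
budget = 700  # kbit

best = None
for levels in itertools.product(*(range(len(t)) for t in rd_tables)):
    rate = sum(rd_tables[v][q][0] for v, q in enumerate(levels))
    psnr = sum(rd_tables[v][q][1] for v, q in enumerate(levels))
    if rate <= budget and (best is None or psnr > best[0]):
        best = (psnr, levels, rate)

print(best)   # highest summed PSNR, chosen level per view, and total rate within the budget
```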
Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5662024
Efficient MV prediction for zonal search in video transcoding
S. Marcelino, S. Faria, P. Assunção, S. Moiron, M. Ghanbari
This paper proposes a method to efficiently derive motion vector predictions for zonal-search motion re-estimation in fast video transcoders. The motion information extracted from the incoming video stream is processed to generate accurate motion vector predictions for transcoding with reduced complexity. Our results demonstrate that the motion vector predictions computed by the proposed method outperform those generated by the highly efficient EPZS (Enhanced Predictive Zonal Search) algorithm in H.264/AVC transcoders. The computational complexity is reduced by up to 59.6% at negligible cost in R-D performance. The proposed method can be useful in multimedia systems and applications using any type of transcoder, such as transrating and/or spatial resolution downsizing.
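A minimal sketch of deriving a starting prediction from the incoming stream's motion vectors, using the component-wise median of spatial neighbours; this illustrates the general idea of reusing decoded motion information rather than the paper's exact mapping.

```python
import numpy as np

def median_mv_predictor(incoming_mvs, r, c):
    """Component-wise median of the left, top and top-right neighbours' motion vectors."""
    neighbours = []
    for dr, dc in ((0, -1), (-1, 0), (-1, 1)):       # left, top, top-right blocks
        rr, cc = r + dr, c + dc
        if 0 <= rr < incoming_mvs.shape[0] and 0 <= cc < incoming_mvs.shape[1]:
            neighbours.append(incoming_mvs[rr, cc])
    if not neighbours:
        return np.zeros(2)
    return np.median(np.stack(neighbours), axis=0)

mv_field = np.random.randint(-8, 9, size=(30, 40, 2))   # hypothetical decoded MV field (rows, cols, [dx, dy])
pred = median_mv_predictor(mv_field, r=10, c=12)         # starting point for the zonal search
```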
Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5662048
Generalized multiscale seam carving
David D. Conger, Mrityunjay Kumar, H. Radha
With the abundance and variety of display devices, novel image resizing techniques have become increasingly desirable. Content-aware image resizing (retargeting) techniques have been proposed that improve over traditional techniques such as cropping and resampling. In particular, seam carving has gained attention as an effective solution, using simple filters to detect and preserve the high-energy areas of an image. Yet it could be made more robust to a wider variety of image types. To facilitate such improvement, we recast seam carving in a more general framework, in the context of filter banks. This enables improved filter design and leads to a multiscale model that addresses the problem of the scale of image features. We have found that our generalized multiscale model improves on the existing seam carving method for a variety of images.
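For reference, a compact sketch of the baseline single-scale seam carving step that the paper generalizes: a simple gradient-based energy map followed by dynamic programming for the minimum-energy vertical seam. The energy function and image are illustrative, not the paper's filter-bank design.

```python
import numpy as np

def min_vertical_seam(gray):
    """Return the column index of the minimum-energy vertical seam in each row."""
    gy, gx = np.gradient(gray.astype(float))
    energy = np.abs(gx) + np.abs(gy)                 # simple gradient-magnitude energy

    h, w = energy.shape
    cost = energy.copy()
    for r in range(1, h):
        left = np.r_[np.inf, cost[r - 1, :-1]]
        up = cost[r - 1]
        right = np.r_[cost[r - 1, 1:], np.inf]
        cost[r] += np.minimum(np.minimum(left, up), right)

    seam = np.empty(h, dtype=int)
    seam[-1] = int(cost[-1].argmin())
    for r in range(h - 2, -1, -1):
        c = seam[r + 1]
        lo, hi = max(c - 1, 0), min(c + 2, w)
        seam[r] = lo + int(cost[r, lo:hi].argmin())
    return seam

img = np.random.rand(120, 160)       # hypothetical grayscale image
seam = min_vertical_seam(img)        # one column per row; removing it shrinks the width by 1
```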
Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5662021
Optimal mode switching for multi-hypothesis motion compensated prediction
Ramdas Satyan, F. Labeau, K. Rose
Transmission of compressed video over unreliable networks is vulnerable to errors and error propagation. Multi-hypothesis motion compensated prediction (MHMCP), originally developed to improve compression efficiency, has been shown to have good error resilience properties. In this paper we improve the overall performance of MHMCP in packet loss scenarios by performing optimal mode switching within a rate-distortion framework. The approach builds on the recursive optimal per-pixel estimate (ROPE), which is extended by re-deriving the recursion formulas for the more complex MHMCP setting so as to achieve an accurate estimate of the end-to-end distortion. Simulation results show significant performance gains over the standard MHMCP scheme and demonstrate the importance of effective mode decisions. We also show results in comparison with conventional ROPE.
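The ROPE idea that the paper extends can be illustrated with the single-hypothesis first/second-moment recursion for one inter-coded pixel. The packet-loss model and the concealment rule (copy the co-located pixel of the previous frame) are the usual textbook assumptions here, not the paper's multi-hypothesis derivation.

```python
def rope_inter_pixel(p, resid, m1_ref, m2_ref, m1_prev, m2_prev):
    """First/second moments of the decoder-reconstructed value of one inter-coded pixel.

    p               : packet loss probability
    resid           : quantized prediction residual sent by the encoder
    m1_ref, m2_ref  : moments of the motion-compensated reference pixel at the decoder
    m1_prev, m2_prev: moments of the co-located pixel in the previous decoded frame (concealment)
    """
    # Packet received: the decoder adds the residual to its (error-prone) reference pixel.
    m1_rx = resid + m1_ref
    m2_rx = resid ** 2 + 2 * resid * m1_ref + m2_ref
    # Packet lost: the decoder conceals by copying the co-located pixel of the previous frame.
    m1 = (1 - p) * m1_rx + p * m1_prev
    m2 = (1 - p) * m2_rx + p * m2_prev
    return m1, m2

# Expected decoder-side distortion for an encoder-reconstructed value x_enc:
#   E[(x_enc - x_dec)^2] = x_enc**2 - 2 * x_enc * m1 + m2
```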
Pub Date: 2010-12-10  DOI: 10.1109/MMSP.2010.5662015
Unsupervised detection of multimodal clusters in edited recordings
Alfred Dielmann
Edited video recordings, such as talk shows and sitcoms, often include Audio-Visual clusters: frequent repetitions of closely related acoustic and visual content. For example, during a political debate, every time a given participant holds the conversational floor, his or her voice tends to co-occur with camera views (i.e. shots) showing his or her portrait. Unlike previous Audio-Visual clustering work, this paper proposes an unsupervised approach that detects Audio-Visual clusters without making assumptions about the recording content, such as the presence of specific participant voices or faces. Sequences of audio and shot clusters are automatically identified using unsupervised audio diarization and shot segmentation techniques. Audio-Visual clusters are then formed by ranking the co-occurrences between these two segmentations and selecting those that significantly exceed chance. Numerical experiments performed on a collection of 70 political debates, comprising more than 43 hours of live edited recordings, show that the automatically extracted Audio-Visual clusters match the ground-truth annotation well, achieving high purity.
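A minimal sketch of the co-occurrence ranking step: measure how long each (audio cluster, shot cluster) pair overlaps and compare it with the overlap expected by chance under independence. The segment lists and the significance ratio threshold are illustrative assumptions, not the paper's statistics.

```python
from collections import defaultdict

def cooccurrence_clusters(audio_segs, shot_segs, total, min_ratio=1.3):
    """audio_segs / shot_segs: lists of (start, end, cluster_id); total: recording length in seconds.
    Returns (audio_id, shot_id) pairs whose overlap exceeds `min_ratio` times chance."""
    overlap = defaultdict(float)
    dur_a, dur_s = defaultdict(float), defaultdict(float)
    for a0, a1, ca in audio_segs:
        dur_a[ca] += a1 - a0
        for s0, s1, cs in shot_segs:
            overlap[(ca, cs)] += max(0.0, min(a1, s1) - max(a0, s0))
    for s0, s1, cs in shot_segs:
        dur_s[cs] += s1 - s0
    pairs = []
    for (ca, cs), obs in overlap.items():
        expected = dur_a[ca] * dur_s[cs] / total      # overlap expected under independence
        if expected > 0 and obs / expected >= min_ratio:
            pairs.append((ca, cs))
    return pairs

audio = [(0, 30, "spk1"), (30, 55, "spk2"), (55, 90, "spk1")]      # hypothetical diarization output
shots = [(0, 28, "closeupA"), (28, 57, "closeupB"), (57, 90, "closeupA")]
print(cooccurrence_clusters(audio, shots, total=90))
# [('spk1', 'closeupA'), ('spk2', 'closeupB')]
```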