In this paper, a generic audio identification system is introduced to identify advertisements and songs in radio broadcast streams using automatically acquired segmental units. A new fingerprinting method based on ALISP data-driven segmentation is presented, along with a modified BLAST algorithm for fast, approximate matching of ALISP sequences. To detect commercials and songs, ALISP transcriptions of a reference library of commercials and songs are compared to the transcriptions of the test radio stream using the Levenshtein distance. The system is described and evaluated on broadcast audio streams from 12 French radio stations. For advertisement identification, a mean precision of 100% was achieved with a corresponding recall of 98%; for music identification, a mean precision of 100% was achieved with a corresponding recall of 95%.
{"title":"A Generic Audio Identification System for Radio Broadcast Monitoring Based on Data-Driven Segmentation","authors":"H. Khemiri, D. Petrovska-Delacrétaz, G. Chollet","doi":"10.1109/ISM.2012.87","DOIUrl":"https://doi.org/10.1109/ISM.2012.87","url":null,"abstract":"In this paper, a generic audio identification system is introduced to identify advertisements and songs in radio broadcast streams using automatically acquired segmental units. A new fingerprinting method based on ALISP data-driven segmentation is presented. A modified BLAST algorithm is also proposed for fast and approximate matching of ALISP sequences. To detect commercials and songs, ALISP transcriptions of references composed of large library of commercials and songs, are compared to the transcriptions of the test radio stream using Levenshtein distance. The system is described and evaluated on broadcast audio streams from 12 French radio stations. For advertisement identification, a mean precision rate of 100% with the corresponding recall value of 98% were achieved. For music identification, a mean precision rate of 100% with the corresponding recall value of 95% were achieved.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123909783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most modern consumer cameras are capable of video capture, but their spatial resolution is generally lower than that of still images. The spatial resolution of video can be enhanced with a hybrid camera system that combines information from high-resolution still images with low-resolution video frames in a process known as super-resolution. Because this process is computationally intensive, we propose a camera system that uses the ITU-standardized spatial and temporal information measures SI and TI as camera parameters to determine, during capture, whether super-resolution processing would increase perceived quality. Experimental results show that the difference between these two measures can be used to determine the feasibility of super-resolution processing.
{"title":"Spatial and Temporal Information as Camera Parameters for Super-resolution Video","authors":"Jussi Tarvainen, M. Nuutinen, P. Oittinen","doi":"10.1109/ISM.2012.63","DOIUrl":"https://doi.org/10.1109/ISM.2012.63","url":null,"abstract":"Most modern consumer cameras are capable of video capture, but their spatial resolution is generally lower than that of still images. The spatial resolution of videos can be enhanced with a hybrid camera system that combines information from high-resolution still images with low-resolution video frames in a process known as super-resolution. As this process is computationally intensive, we propose a camera system that uses the spatial and temporal information measures SI and TI standardized by ITU as camera parameters to determine during capture whether super-resolution processing would result in an increase in perceived quality. Experimental results show that the difference of these two measures can be used to determine the feasibility of super-resolution processing.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128426177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, super-resolution techniques in the field of computer vision have been studied in earnest, owing to their potential applicability in a variety of fields. In this paper, we propose a single-image super-resolution approach using a Gaussian Mixture Model (GMM) and Partial Least Squares (PLS) regression. A GMM-based super-resolution technique is shown to be more efficient than previously known techniques, such as sparse-coding-based techniques, but the GMM-based conversion may result in overfitting. In this paper, an effective technique for preventing overfitting, which combines PLS regression with a GMM, is proposed. The conversion function is constructed using the input image and its self-reduced image, and the high-resolution image is obtained by applying the conversion function to the enlarged input image without any external database. We confirmed the effectiveness of the proposed method through experiments.
{"title":"Super-resolution Using GMM and PLS Regression","authors":"Y. Ogawa, Takahiro Hori, T. Takiguchi, Y. Ariki","doi":"10.1109/ISM.2012.62","DOIUrl":"https://doi.org/10.1109/ISM.2012.62","url":null,"abstract":"In recent years, super-resolution techniques in the field of computer vision have been studied in earnest owing to the potential applicability of such technology in a variety of fields. In this paper, we propose a single-image, super-resolution approach using a Gaussian Mixture Model (GMM) and Partial Least Squares (PLS) regression. A GMM-based super-resolution technique is shown to be more efficient than previously known techniques, such as sparse-coding-based techniques. But the GMM-based conversion may result in over fitting. In this paper, an effective technique for preventing over fitting, which combines PLS regression with a GMM, is proposed. The conversion function is constructed using the input image and its self-reduction image. The high-resolution image is obtained by applying the conversion function to the enlarged input image without any outside database. We confirmed the effectiveness of this proposed method through our experiments.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"85 5-6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124330437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a caption recognition method for multi-color characters on complex backgrounds. Caption characters enable efficient search over large amounts of recorded TV programs. In caption character recognition, the caption appearance section and its area are extracted, and the character strokes are extracted from the area and recognized. This paper focuses on caption character stroke extraction and recognition for multi-color characters on complex backgrounds, which is very difficult for conventional methods. The proposed method extracts decomposed binary images from the input color caption image by color clustering. Character candidates composed of combinations of connected components are then extracted using recognition certainty. Finally, characters are selected by a beyond-color dynamic programming method that weights recognition certainty and character alignment. In an evaluation on one-line multi-color character strings on complex backgrounds, a large improvement was achieved over a conventional technique that can recognize only single-color characters on complex background images.
{"title":"A Study on Caption Recognition for Multi-color Characters on Complex Background","authors":"Yutaka Katsuyama, A. Minagawa, Y. Hotta, Jun Sun, S. Omachi","doi":"10.1109/ISM.2012.83","DOIUrl":"https://doi.org/10.1109/ISM.2012.83","url":null,"abstract":"We propose a caption recognition method for multicolor characters on complex background. Caption characters are used for an efficient search on a large amount of recorded TV programs. In the caption character recognition, the caption appearance section and the area is extracted, the character strokes are extracted from the area, and recognized. This paper focuses on caption character strokes extraction and recognition for multi-color characters on complex background which is a very difficult task for the conventional methods. The proposed method extracts decomposed binary images from input color caption image by color clustering. Then character candidates that are composed of combination of connect components are extracted by using recognition certainty. Finally, characters are selected by beyond-color Dynamic Programming method in which weight on recognition certainty and character alignment are used. In the character recognition evaluation of one-line multi-color character string on a complex background, a great improvement was achieved from a conventional technique that can recognize only one-color characters on complex background image.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128198165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Face recognition is one of the most promising and successful applications of image analysis and understanding, with applications including biometric identification, gaze estimation, emotion recognition, and human-computer interfaces. A closed system trained to recognize only a predetermined set of faces quickly becomes obsolete. In this paper, we describe a demo, running on a Samsung tablet, that uses face detection and recognition algorithms to recognize actors and actresses in movies. We also present a method that allows the user to interact with the system during training while watching video: new faces are tracked and trained into new face classifiers as the video continues to play, and the face database is updated dynamically.
{"title":"Automatic Actor Recognition for Video Services on Mobile Devices","authors":"Lai-Tee Cheok, Sol Yee Heo, Donato Mitrani, Anshuman Tewari","doi":"10.1109/ISM.2012.80","DOIUrl":"https://doi.org/10.1109/ISM.2012.80","url":null,"abstract":"Face recognition is one of the most promising and successful applications of image analysis and understanding. Applications include biometrics identification, gaze estimation, emotion recognition, human computer interface, among others. A closed system trained to recognize only a predetermined number of faces will become obsolete very easily. In this paper, we describe a demo that we have developed using face detection and recognition algorithms for recognizing actors/actresses in movies. The demo runs on a Samsung tablet to recognize actors/actresses in the video. We also present our proposed method that allows user to interact with the system during training while watching video. New faces are tracked and trained into new face classifiers as video is continuously playing and the face database is updated dynamically.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128517878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a new class of visual text features that are based on text in camera-phone images. A robust text detection algorithm locates individual text lines and feeds them to a recognition engine. From the recognized characters, we generate the visual text features in a way that resembles image features: we calculate their location, scale, orientation, and a descriptor that captures the character and word information. We apply visual text features to image matching. To disambiguate false matches, we developed a word-distance matching method. Our experiments with images that contain text show that the new visual-text-feature-based image matching pipeline performs on par with or better than a conventional image-feature-based pipeline while requiring less than 10 bits per feature, 4.5× smaller than state-of-the-art visual feature descriptors.
{"title":"Visual Text Features for Image Matching","authors":"Sam S. Tsai, Huizhong Chen, David M. Chen, Vasu Parameswaran, R. Grzeszczuk, B. Girod","doi":"10.1109/ISM.2012.84","DOIUrl":"https://doi.org/10.1109/ISM.2012.84","url":null,"abstract":"We present a new class of visual text features that are based on text in camera phone images. A robust text detection algorithm locates individual text lines and feeds them to a recognition engine. From the recognized characters, we generate the visual text features in a way that resembles image features. We calculate their location, scale, orientation, and a descriptor that describes the character and word information. We apply visual text features to image matching. To disambiguate false matches, we developed a word-distance matching method. Our experiments with image that contain text show that the new visual text feature based image matching pipeline performs on par or better than a conventional image feature based pipeline while requiring less than 10 bits per feature. This is 4.5× smaller than state-of-the-art visual feature descriptors.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128526670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, many local-descriptor-based approaches have been proposed for human activity recognition, and they perform well on challenging datasets. However, most of these approaches are computationally intensive, extract irrelevant background features, and fail to capture global temporal information. We propose to overcome these issues by introducing a compact and robust motion space that can be used to extract both spatial and temporal aspects of activities using local descriptors. We present the Speed Adapted Motion History Image Space (SAMHIS), which employs a variant of the Motion History Image for representing motion. This space alleviates both the self-occlusion and the speed-related issues associated with different kinds of motion. We then show, using a standard bag-of-visual-words model, that extracting appearance-based local descriptors from this space is very effective for recognizing activity. Our approach yields promising results on the KTH and Weizmann datasets.
{"title":"SAMHIS: A Robust Motion Space for Human Activity Recognition","authors":"S. Raghuraman, B. Prabhakaran","doi":"10.1109/ISM.2012.75","DOIUrl":"https://doi.org/10.1109/ISM.2012.75","url":null,"abstract":"In recent years, many local descriptor based approaches have been proposed for human activity recognition, which perform well on challenging datasets. However, most of these approaches are computationally intensive, extract irrelevant background features and fail to capture global temporal information. We propose to overcome these issues by introducing a compact and robust motion space that can be used to extract both spatial and temporal aspects of activities using local descriptors. We present Speed Adapted Motion History Image Space (SAMHIS) that employs a variant of Motion History Image for representing motion. This space alleviates both self-occlusion as well as the speed-related issues associated with different kinds of motion. We go on to show using a standard bag of visual words model that extracting appearance based local descriptors from this space is very effective for recognizing activity. Our approach yields promising results on the KTH and Weizmann dataset.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128695270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Battery-powered mobile devices suffer from significant power consumption by the WiFi network interface during video calls. Our study proposes an adaptive RTP packet transmission scheme for multimedia traffic that exploits the dynamic Power Save Mode (PSM). By merging outbound packet delivery timing with inbound packet reception and estimating each delay component along the packet processing and transmission path, each client meets the stringent end-to-end latency requirements for packets while creating longer sleep intervals. As an added benefit, the scheme involves no cross-layer communication overhead, since the interface state transitions are completely transparent to the application. Experimental results show that 28.53% energy savings on the WiFi interface can be achieved while maintaining satisfactory application performance.
{"title":"Energy Conservation in 802.11 WLAN for Mobile Video Calls","authors":"Haiyang Ma, Roger Zimmermann","doi":"10.1109/ISM.2012.57","DOIUrl":"https://doi.org/10.1109/ISM.2012.57","url":null,"abstract":"Battery powered mobile devices suffer from significant power consumption of the WiFi network interface during video calls. By utilizing the dynamic Power Save Mode (PSM), our study proposes an adaptive RTP packet transmission scheme for multimedia traffic. By merging the outbound packet delivery timing with inbound packet reception and estimating each delay component along the packet processing and transmission path, each client manages to meet the stringent end-to-end latency for packets while creating longer sleep intervals. As a benefit it involves no cross-layer communication overhead as the interface state transitions are completely transparent to the application. The experimental results show that 28.53% energy savings on the WiFi interface can be achieved while maintaining satisfactory application performance.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133451964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Color information is an important feature for many vision algorithms, including color correction, image retrieval, and tracking. In this paper, we study the limitations of color measurement accuracy and explore how this information can be used to improve the performance of color correction. In particular, we show that a strong correlation exists between the error in hue measurements on the one hand and saturation and intensity on the other. We introduce the notion of color strength, a combination of saturation and intensity information that determines when hue information in a scene is reliable. We verify the predictive capability of this model on two different datasets with ground-truth color information. Further, we show how color strength information can be used to significantly improve color correction accuracy on the 11K-image real-world SFU gray ball dataset.
{"title":"Exploiting Color Strength to Improve Color Correction","authors":"L. Brown, A. Datta, Sharath Pankanti","doi":"10.1109/ISM.2012.43","DOIUrl":"https://doi.org/10.1109/ISM.2012.43","url":null,"abstract":"Color information is an important feature for many vision algorithms including color correction, image retrieval and tracking. In this paper, we study the limitations of color measurement accuracy and explore how this information can be used to improve the performance of color correction. In particular, we show that a strong correlation exists between the error in hue measurements on one hand and saturation and intensity on the other hand. We introduce the notion of color strength, which is a combination of saturation and intensity information to determine when hue information in a scene is reliable. We verify the predictive capability of this model on two different datasets with ground truth color information. Further, we show how color strength information can be used to significantly improve color correction accuracy for the 11K real-world SFU gray ball dataset.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122236862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mind Mapping is a well-known note-taking technique that is known to encourage learning and studying. Mind Mapping can also be a very good way to present knowledge and concepts in visual form. Unfortunately, there is no reliable automated tool that can generate Mind Maps from natural-language text. This paper fills this gap by developing the first evaluated automated system that takes text input and generates a Mind Map visualization from it. The system can also visualize large text documents as multi-level Mind Maps, in which a high-level Mind Map node can be expanded into child Mind Maps. The proposed approach involves understanding the input text and converting it into an intermediate Detailed Meaning Representation (DMR). The DMR is then visualized with one of two proposed approaches, single-level or multi-level, the latter being convenient for larger texts. The Mind Maps generated by both approaches were evaluated in human-subject experiments performed on Amazon Mechanical Turk with various parameter settings.
{"title":"English2MindMap: An Automated System for MindMap Generation from English Text","authors":"Mohamed Elhoseiny, A. Elgammal","doi":"10.1109/ISM.2012.103","DOIUrl":"https://doi.org/10.1109/ISM.2012.103","url":null,"abstract":"Mind Mapping is a well-known technique used in note taking and is known to encourage learning and studying. Besides, Mind Mapping can be a very good way to present knowledge and concepts in a visual form. Unfortunately there is no reliable automated tool that can generate Mind Maps from Natural Language text. This paper fills in this gap by developing the first evaluated automated system that takes a text input and generates a Mind Map visualization out of it. The system also could visualize large text documents in multilevel Mind Maps in which a high level Mind Map node could be expanded into child Mind Maps. The proposed approach involves understanding of the input text converting it into intermediate Detailed Meaning Representation (DMR). The DMR is then visualized with two proposed approaches, Single level or Multiple levels which is convenient for larger text. The generated Mind Maps from both approaches were evaluated based on Human Subject experiments performed on Amazon Mechanical Turk with various parameter settings.","PeriodicalId":282528,"journal":{"name":"2012 IEEE International Symposium on Multimedia","volume":"186 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131852704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}