Christos Tzelepis, Nikolaos Gkalelis, V. Mezaris, Y. Kompatsiaris
In this paper, a new method that exploits related videos for the problem of event detection is proposed, where related videos are videos that are closely but not fully associated with the event of interest. In particular, the Weighted Margin SVM formulation is modified so that related class observations can be effectively incorporated in the optimization problem. The resulting Relevance Degree SVM is especially useful in problems where only a limited number of training observations is provided, e.g., for the EK10Ex subtask of TRECVID MED, where only ten positive and ten related samples are provided for the training of a complex event detector. Experimental results on the TRECVID MED 2011 dataset verify the effectiveness of the proposed method.
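The abstract does not reproduce the modified optimization problem. As a sketch only, following the Weighted Margin SVM literature rather than the authors' published formulation, a per-sample relevance degree $v_i \in (0,1]$ can replace the fixed unit margin, with $v_i = 1$ for ordinary positive and negative samples and a smaller value for related videos:

\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \ \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge v_i - \xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,N.

Under this reading, related videos are labeled positive but demand a smaller margin, so they pull the decision boundary less strongly than the ten true positives.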
{"title":"Improving event detection using related videos and relevance degree support vector machines","authors":"Christos Tzelepis, Nikolaos Gkalelis, V. Mezaris, Y. Kompatsiaris","doi":"10.1145/2502081.2502176","DOIUrl":"https://doi.org/10.1145/2502081.2502176","url":null,"abstract":"In this paper, a new method that exploits related videos for the problem of event detection is proposed, where related videos are videos that are closely but not fully associated with the event of interest. In particular, the Weighted Margin SVM formulation is modified so that related class observations can be effectively incorporated in the optimization problem. The resulting Relevance Degree SVM is especially useful in problems where only a limited number of training observations is provided, e.g., for the EK10Ex subtask of TRECVID MED, where only ten positive and ten related samples are provided for the training of a complex event detector. Experimental results on the TRECVID MED 2011 dataset verify the effectiveness of the proposed method.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74316584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empowered by advances in information technology, such as social media networks, digital libraries, and mobile computing, ever-increasing amounts of multimedia data are being created. As a key technology for addressing the resulting information overload, multimedia recommendation systems have received a lot of attention from both industry and academia. This course aims to 1) provide a detailed review of the state of the art in multimedia recommendation; 2) analyze key technical challenges in developing and evaluating next-generation multimedia recommendation systems from different perspectives; and 3) offer some predictions about the road that lies ahead.
{"title":"Towards next generation multimedia recommendation systems","authors":"Jialie Shen, Xiansheng Hua, Emre Sargin","doi":"10.1145/2502081.2502233","DOIUrl":"https://doi.org/10.1145/2502081.2502233","url":null,"abstract":"Empowered by advances in information technology, such as social media network, digital library and mobile computing, there emerges an ever-increasing amounts of multimedia data. As the key technology to address the problem of information overload, multimedia recommendation system has been received a lot of attentions from both industry and academia. This course aims to 1) provide a series of detailed review of state-of-the-art in multimedia recommendation; 2) analyze key technical challenges in developing and evaluating next generation multimedia recommendation systems from different perspectives and 3) give some predictions about the road lies ahead of us.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75414604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we address the issue of mapping between gesture and sound in interactive music systems. Our approach, which we call mapping by demonstration, aims at learning the mapping from examples provided by users while they interact with the system. We propose a general framework for modeling gesture--sound sequences based on a probabilistic, multimodal, and hierarchical model. Two orthogonal modeling aspects are detailed, and we describe planned research directions to improve and evaluate the proposed models.
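The model equations are not given in this summary. As an illustration only, one common baseline in the mapping-by-demonstration literature is Gaussian mixture regression: fit a joint density over concatenated gesture and sound features from the demonstrations, then condition on gesture at performance time. The dimensions and synthetic data below are assumptions for the sketch, not the author's model:

import numpy as np
from sklearn.mixture import GaussianMixture

# Joint demonstration data: each row is [gesture features | sound features].
# D_g, D_s, and the synthetic data are illustrative assumptions.
D_g, D_s = 4, 2
rng = np.random.default_rng(0)
X_joint = rng.normal(size=(500, D_g + D_s))

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(X_joint)

def predict_sound(g):
    """Condition the joint mixture on a gesture vector g: weight each
    component by its responsibility under the gesture marginal, then
    average the per-component conditional means of the sound block."""
    mu, S, w = gmm.means_, gmm.covariances_, gmm.weights_
    resp = np.empty(len(w))
    cond = np.empty((len(w), D_s))
    for k in range(len(w)):
        mu_g, mu_s = mu[k, :D_g], mu[k, D_g:]
        S_gg, S_sg = S[k, :D_g, :D_g], S[k, D_g:, :D_g]
        diff = g - mu_g
        inv = np.linalg.inv(S_gg)
        resp[k] = w[k] * np.exp(-0.5 * diff @ inv @ diff) \
            / np.sqrt(np.linalg.det(2.0 * np.pi * S_gg))
        cond[k] = mu_s + S_sg @ inv @ diff  # conditional mean of sound | gesture
    resp /= resp.sum()
    return resp @ cond

print(predict_sound(np.zeros(D_g)))  # expected sound features for one frame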
{"title":"Gesture--sound mapping by demonstration in interactive music systems","authors":"Jules Françoise","doi":"10.1145/2502081.2502214","DOIUrl":"https://doi.org/10.1145/2502081.2502214","url":null,"abstract":"In this paper we address the issue of mapping between gesture and sound in interactive music systems. Our approach, we call mapping by demonstration, aims at learning the mapping from examples provided by users while interacting with the system. We propose a general framework for modeling gesture--sound sequences based on a probabilistic, multimodal and hierarchical model. Two orthogonal modeling aspects are detailed and we describe planned research directions to improve and evaluate the proposed models.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75037136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimedia data are now created at a macro, public scale as well as at an individual, personal scale. Distributed multimedia streams (e.g., images, microblogs, and sensor readings) have recently been combined to understand spatio-temporal phenomena like epidemic spreads, seasonal patterns, and political situations, while personal data (from mobile sensors and quantified-self technologies) are now being used to identify user behavior, intent, affect, social connections, health, gaze, and interest level in real time. An effective combination of the two types of data can revolutionize applications ranging from healthcare to mobility, product recommendation, and content delivery. Building systems at this intersection can lead to better-orchestrated media systems that may also improve users' social, emotional, and physical well-being. For example, users trapped in risky hurricane situations can receive personalized evacuation instructions based on their health, mobility parameters, and distance to the nearest shelter. This workshop brings together researchers interested in exploring novel techniques that combine multiple streams at different scales (macro and micro) to understand and react to each user's needs.
{"title":"Summary abstract for the 1st ACM international workshop on personal data meets distributed multimedia","authors":"V. Singh, Tat-Seng Chua, R. Jain, A. Pentland","doi":"10.1145/2502081.2503836","DOIUrl":"https://doi.org/10.1145/2502081.2503836","url":null,"abstract":"Multimedia data are now created at a macro, public scale as well as individual personal scale. While distributed multimedia streams (e.g. images, microblogs, and sensor readings) have recently been combined to understand multiple spatio-temporal phenomena like epidemic spreads, seasonal patterns, and political situations; personal data (via mobile sensors, quantified-self technologies) are now being used to identify user behavior, intent, affect, social connections, health, gaze, and interest level in real time. An effective combination of the two types of data can revolutionize multiple applications ranging from healthcare, to mobility, to product recommendation, to content delivery. Building systems at this intersection can lead to better orchestrated media systems that may also improve users' social, emotional and physical well-being. For example, users trapped in risky hurricane situations can receive personalized evacuation instructions based on their health, mobility parameters, and distance to nearest shelter. This workshop bring together researchers interested in exploring novel techniques that combine multiple streams at different scales (macro and micro) to understand and react to each user's needs.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72674770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xin Zhao, Xue Li, C. Pang, Xiaofeng Zhu, Quan Z. Sheng
Online human gesture recognition has a wide range of applications in computer vision, especially in human-computer interaction. The recent introduction of cost-effective depth cameras has spurred a new wave of research on body-movement gesture recognition. However, there are two major challenges: i) how to continuously recognize gestures from unsegmented streams, and ii) how to differentiate different styles of the same gesture from other types of gestures. In this paper, we address both problems with a new, effective and efficient feature extraction method that uses a dynamic matching approach to construct a feature vector for each frame, increasing sensitivity to features that distinguish different gestures while decreasing sensitivity to variations among gestures within the same class. Comprehensive experiments on the MSRC-12 Kinect Gesture and MSR-Action3D datasets demonstrate superior performance over state-of-the-art approaches.
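The summary does not spell out the dynamic matching step. One plausible reading, sketched below under that assumption, represents each frame by its matching scores against one template sequence per gesture class, so the per-frame feature is discriminative across classes while tolerant of style variation within a class. All dimensions and the template bank are illustrative:

import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic dynamic-time-warping distance between two pose sequences."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def frame_feature(window, templates):
    """Per-frame feature vector: matching scores of the recent window of
    skeleton frames against one exemplar per gesture class."""
    return np.array([dtw_distance(window, t) for t in templates])

# Illustrative setup: 20 joints x 3 coordinates = 60 values per frame.
rng = np.random.default_rng(0)
templates = [rng.normal(size=(30, 60)) for _ in range(12)]  # one per class
stream = rng.normal(size=(200, 60))                         # unsegmented stream

window_len = 30
for t in range(window_len, len(stream) + 1):
    f = frame_feature(stream[t - window_len:t], templates)
    pred = int(np.argmin(f))  # a trained per-frame classifier would go here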
{"title":"Online human gesture recognition from motion data streams","authors":"Xin Zhao, Xue Li, C. Pang, Xiaofeng Zhu, Quan Z. Sheng","doi":"10.1145/2502081.2502103","DOIUrl":"https://doi.org/10.1145/2502081.2502103","url":null,"abstract":"Online human gesture recognition has a wide range of applications in computer vision, especially in human-computer interaction applications. Recent introduction of cost-effective depth cameras brings on a new trend of research on body-movement gesture recognition. However, there are two major challenges: i) how to continuously recognize gestures from unsegmented streams, and ii) how to differentiate different styles of a same gesture from other types of gestures. In this paper, we solve these two problems with a new effective and efficient feature extraction method that uses a dynamic matching approach to construct a feature vector for each frame and improves sensitivity to the features of different gestures and decreases sensitivity to the features of gestures within the same class. Our comprehensive experiments on MSRC-12 Kinect Gesture and MSR-Action3D datasets have demonstrated a superior performance than the stat-of-the-art approaches.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77742372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aadhar Jain, A. Arefin, Raoul Rivas, Chien-Nan Chen, K. Nahrstedt
Being able to detect and recognize human activities is essential for 3D collaborative applications, enabling efficient quality-of-service provisioning and device management. A broad range of research has been devoted to analyzing media data to identify human activity, which requires knowledge of the data format and application-specific coding technique, as well as computationally expensive image analysis. In this paper, we propose a human activity detection technique based on application-generated metadata and related system metadata. Our approach does not depend on a specific data format or coding technique. We evaluate our algorithm with different cyber-physical setups and show that it can achieve very high accuracy (above 97%) with a good learning model.
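The abstract does not list the metadata fields or the learning model. A minimal sketch, assuming hypothetical features such as frame rate, outgoing bandwidth, CPU load, and active stream count, with a random forest standing in for the unspecified classifier:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical per-window application/system metadata:
# [frame_rate, send_bandwidth_kbps, cpu_load, active_streams].
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))       # stand-in for logged metadata windows
y = rng.integers(0, 3, size=400)    # e.g., sitting / gesturing / walking

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))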
{"title":"3D teleimmersive activity classification based on application-system metadata","authors":"Aadhar Jain, A. Arefin, Raoul Rivas, Chien-Nan Chen, K. Nahrstedt","doi":"10.1145/2502081.2502194","DOIUrl":"https://doi.org/10.1145/2502081.2502194","url":null,"abstract":"Being able to detect and recognize human activities is essential for 3D collaborative applications for efficient quality of service provisioning and device management. A broad range of research has been devoted to analyze media data to identify human activity, which requires the knowledge of data format, application-specific coding technique and computationally expensive image analysis. In this paper, we propose a human activity detection technique based on application generated metadata and related system metadata. Our approach does not depend on specific data format or coding technique. We evaluate our algorithm with different cyber-physical setups, and show that we can achieve very high accuracy (above 97%) by using a good learning model.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81570412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Wagner, F. Lingenfelser, Tobias Baur, Ionut Damian, Felix Kistler, E. André
Automatic detection and interpretation of social signals carried by voice, gestures, facial expressions, etc., will play a key role in next-generation interfaces, as it paves the way towards more intuitive and natural human-computer interaction. This paper introduces Social Signal Interpretation (SSI), a framework for real-time recognition of social signals. SSI supports a large range of sensor devices and filter and feature algorithms, as well as machine learning and pattern recognition tools. It encourages developers to add new components using SSI's C++ API, but also addresses front-end users by offering an XML interface for building pipelines in a text editor. SSI is freely available under the GPL at http://openssi.net.
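SSI's actual interfaces are its C++ API and XML pipeline descriptions, which are not reproduced here; the fragment below is only a conceptual analogue, in Python, of the sensor-filter-consumer pipeline pattern the framework is built around:

from typing import Callable, Iterable, List

def run_pipeline(sensor: Iterable[float],
                 filters: List[Callable[[float], float]],
                 consumer: Callable[[float], None]) -> None:
    """Push each incoming sample through the filter chain, then hand the
    transformed sample to the consumer (e.g., a classifier or logger)."""
    for sample in sensor:
        for f in filters:
            sample = f(sample)
        consumer(sample)

# Toy usage: scale then quantize a stream of sensor readings.
run_pipeline(sensor=[0.13, 0.92, 0.41],
             filters=[lambda x: 2.0 * x, lambda x: round(x, 1)],
             consumer=print)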
{"title":"The social signal interpretation (SSI) framework: multimodal signal processing and recognition in real-time","authors":"J. Wagner, F. Lingenfelser, Tobias Baur, Ionut Damian, Felix Kistler, E. André","doi":"10.1145/2502081.2502223","DOIUrl":"https://doi.org/10.1145/2502081.2502223","url":null,"abstract":"Automatic detection and interpretation of social signals carried by voice, gestures, mimics, etc. will play a key-role for next-generation interfaces as it paves the way towards a more intuitive and natural human-computer interaction. The paper at hand introduces Social Signal Interpretation (SSI), a framework for real-time recognition of social signals. SSI supports a large range of sensor devices, filter and feature algorithms, as well as, machine learning and pattern recognition tools. It encourages developers to add new components using SSI's C++ API, but also addresses front end users by offering an XML interface to build pipelines with a text editor. SSI is freely available under GPL at http://openssi.net.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77019658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Common techniques represent images by quantizing local descriptors and summarizing their distribution in a histogram. In this paper we propose to employ a parametric description instead and compare its capabilities with those of histogram-based approaches. We use the multivariate Gaussian distribution, applied over SIFT descriptors extracted by dense sampling on a spatial pyramid. Each distribution is converted to a high-dimensional descriptor by concatenating the mean vector and the projection of the covariance matrix onto the Euclidean space tangent to the Riemannian manifold. Experiments on Caltech-101 and ImageCLEF2011 are performed using a Stochastic Gradient Descent solver, which can handle large-scale datasets and high-dimensional feature spaces.
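The tangent-space projection described here is commonly computed with the matrix logarithm of the covariance (the log-Euclidean map). A minimal sketch, assuming 128-dimensional SIFT-like local descriptors and a small regularizer to keep the covariance positive definite:

import numpy as np
from scipy.linalg import logm

def gaussian_descriptor(local_descs, eps=1e-6):
    """Mean vector concatenated with the upper triangle of log(covariance),
    i.e. a log-Euclidean projection of the Gaussian onto the tangent space."""
    mu = local_descs.mean(axis=0)
    cov = np.cov(local_descs, rowvar=False) + eps * np.eye(local_descs.shape[1])
    log_cov = logm(cov).real
    iu = np.triu_indices_from(log_cov)
    # Off-diagonal entries appear once in the upper triangle but twice in the
    # matrix, so scale them by sqrt(2) to preserve the Frobenius inner product.
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return np.concatenate([mu, scale * log_cov[iu]])

# Illustrative: 500 SIFT-like descriptors of dimension 128 from one region.
rng = np.random.default_rng(0)
descs = rng.normal(size=(500, 128))
print(gaussian_descriptor(descs).shape)  # 128 + 128*129/2 = 8384 dimensions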
{"title":"Modeling local descriptors with multivariate gaussians for object and scene recognition","authors":"G. Serra, C. Grana, M. Manfredi, R. Cucchiara","doi":"10.1145/2502081.2502185","DOIUrl":"https://doi.org/10.1145/2502081.2502185","url":null,"abstract":"Common techniques represent images by quantizing local descriptors and summarizing their distribution in a histogram. In this paper we propose to employ a parametric description and compare its capabilities to histogram based approaches. We use the multivariate Gaussian distribution, applied over the SIFT descriptors, extracted with dense sampling on a spatial pyramid. Every distribution is converted to a high-dimensional descriptor, by concatenating the mean vector and the projection of the covariance matrix on the Euclidean space tangent to the Riemannian manifold. Experiments on Caltech-101 and ImageCLEF2011 are performed using the Stochastic Gradient Descent solver, which allows to deal with large scale datasets and high dimensional feature spaces.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81150872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual data is exploding! 500 billion consumer photos are taken each year worldwide (633 million per year in NYC alone), and 120 hours of new video are uploaded to YouTube every minute. The explosion of digital multimedia data is creating a valuable open source of insights. However, the unconstrained nature of 'image/video in the wild' makes automated computer-based analysis very challenging. Furthermore, the most interesting content in multimedia files is often complex in nature, reflecting a diversity of human behaviors, scenes, activities, and events. To address these challenges, this tutorial provides a unified overview of two emerging techniques, semantic modeling and massive-scale visual recognition, with the goal of both introducing people from different backgrounds to this exciting field and reviewing state-of-the-art research in the new computational era.
{"title":"Massive-scale multimedia semantic modeling","authors":"John R. Smith, Liangliang Cao","doi":"10.1145/2502081.2502235","DOIUrl":"https://doi.org/10.1145/2502081.2502235","url":null,"abstract":"Visual data is exploding! 500 billion consumer photos are taken each year world-wide, 633 million photos taken per year in NYC alone. 120 new video-hours are uploaded on YouTube per minute. The explosion of digital multimedia data is creating a valuable open source for insights. However, the unconstrained nature of 'image/video in the wild' makes it very challenging for automated computer-based analysis. Furthermore, the most interesting content in the multimedia files is often complex in nature reflecting a diversity of human behaviors, scenes, activities and events. To address these challenges, this tutorial will provide a unified overview of the two emerging techniques: Semantic modeling and Massive scale visual recognition, with a goal of both introducing people from different backgrounds to this exciting field and reviewing state of the art research in the new computational era.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72864189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With advances in pattern recognition and multimedia computing, it has become possible to analyze human behavior via multimodal sensors, at different time scales and at different levels of interaction and interpretation. This ability opens up enormous possibilities for multimedia and multimodal interaction, with the potential to endow computers with a capacity to attribute meaning to users' attitudes, preferences, personality, social relationships, etc., as well as to understand what people are doing, the activities they have been engaged in, and their routines and lifestyles. This workshop gathers researchers dealing with the problem of modeling human behavior under its multiple facets, with particular attention to interactions in arts, creativity, entertainment, and edutainment.
{"title":"Fourth international workshop on human behavior understanding (HBU 2013)","authors":"A. A. Salah, H. Hung, O. Aran, H. Gunes","doi":"10.1145/2502081.2503830","DOIUrl":"https://doi.org/10.1145/2502081.2503830","url":null,"abstract":"With advances in pattern recognition and multimedia computing, it became possible to analyze human behavior via multimodal sensors, at different time-scales and at different levels of interaction and interpretation. This ability opens up enormous possibilities for multimedia and multimodal interaction, with a potential of endowing the computers with a capacity to attribute meaning to users' attitudes, preferences, personality, social relationships, etc., as well as to understand what people are doing, the activities they have been engaged in, their routines and lifestyles. This workshop gathers researchers dealing with the problem of modeling human behavior under its multiple facets with particular attention to interactions in arts, creativity, entertainment and edutainment.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80103154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}