Pushing Image Recognition in the Real World: Towards Recognizing Millions of Entities
Xiansheng Hua. WISMM '14. DOI: https://doi.org/10.1145/2661714.2661716

Building a system that can recognize "what," "who," and "where" from arbitrary images has motivated researchers in the computer vision, multimedia, and machine learning communities for decades. Significant progress has been made in recent years based on distributed computation and deep neural network techniques. However, it is still very challenging to realize a general-purpose, real-world image recognition engine with reasonable recognition accuracy, semantic coverage, and recognition speed.

In this talk, we will first review the current status of this area, analyze the difficulties, and discuss potential solutions. We will then introduce two promising schemes for attacking this challenge: (1) learning millions of concepts from search engine click logs, and (2) recognizing whatever you want without data labeling. The first work builds large-scale recognition models by mining search engine click logs. We will discuss the challenges in training data selection and model selection, and introduce efficient and scalable approaches for model training and prediction. The second work aims at building image recognition engines for any set of entities without using any human-labeled training data, which helps generalize image recognition to a wide range of semantic concepts. We will present the automatic training data generation steps and discuss techniques for improving recognition accuracy by effectively leveraging massive amounts of Internet data. We will also introduce different parallelization strategies for different computation tasks, which guarantee the efficiency and scalability of the entire system. Finally, we will discuss possible directions for pushing image recognition into the real world.
Storytelling with Big Multimedia Data: Keynote Talk
R. Jain. WISMM '14. DOI: https://doi.org/10.1145/2661714.2661715

Big data is increasingly multimedia data. Storytelling is one of the oldest and most popular human activities. Since the early days of human existence, storytelling has been used as a means of simple communication as well as a medium for entertainment, education, cultural preservation, and instilling moral values through examples. A story is a presentation of experiences related to events; events and their experiences are selected to communicate the intent of the story compellingly. The art of storytelling has always had a close relationship to the technology of its time. A good story considers the message and the audience, and then selects appropriate events and related experiential media and information to weave a compelling and engaging account of those events.

Storytelling and technology form a virtuous cycle, intertwined and synergistic. Historically, the two have evolved together and are likely to continue doing so in the near future. Most events of interest occur in the physical world and must be captured using sensors. A single sensor is usually inadequate to capture the diverse aspects of an event; hence multiple sensors and media are used both to capture an event and to present its experiences so the event can be re-experienced. We now have diverse sensors to capture an event in all its detail and to select what will be compelling in storytelling.

A good story is the result of many activities: collecting data, analyzing it, selecting the events and experiences relevant to the message, and presenting this material compellingly. All of these activities are active research areas in multimedia big data. We discuss different forms of storytelling as they have evolved and the role of technology in the different stages of storytelling. We believe we now have powerful tools and technologies to make the art of storytelling truly effective. In this presentation we will pose challenges for multimedia researchers whose solutions could make storytelling highly effective and compelling.
Large-Scale Aerial Image Categorization by Multi-Task Discriminative Topologies Discovery
Yingjie Xia, Luming Zhang, Suhua Tang. WISMM '14. DOI: https://doi.org/10.1145/2661714.2661718

Categorizing the millions of aerial images on Google Maps quickly and accurately is a useful technique in multimedia applications. Existing methods cannot handle this task effectively for two reasons: (1) it is challenging to build a real-time image categorization system, as some geo-aware apps update more than 20 aerial images per second; and (2) the topologies of aerial images are key to distinguishing their categories, yet they cannot be encoded by generic visual descriptors. To solve these two problems, we propose an efficient aerial image categorization system that mines the discriminative topologies of aerial images under a multi-task learning framework. Specifically, we first construct a region adjacency graph (RAG) that describes the topology of each aerial image, so that aerial image categorization can be formulated as RAG-to-RAG matching. Drawing on graph theory, RAG-to-RAG matching is conducted by comparing the graphs' respective graphlets (i.e., small subgraphs). Because the number of graphlets is huge, a multi-task feature selection algorithm is derived to discover topologies that are jointly discriminative across multiple categories. The discovered topologies are used to extract the discriminative graphlets. Finally, these graphlets are integrated into an AdaBoost model for predicting aerial image categories. Experiments show that our approach is competitive with several existing recognition models. Furthermore, over 24 aerial images are categorized per second, showing that our system is ready for real-world applications.
Social Popularity Score: Predicting Numbers of Views, Comments, and Favorites of Social Photos Using Only Annotations
T. Yamasaki, Shumpei Sano, K. Aizawa. WISMM '14. DOI: https://doi.org/10.1145/2661714.2661722

In this paper, we propose an algorithm that predicts the social popularity (i.e., the numbers of views, comments, and favorites) of content on social networking services using only text annotations. Instead of analyzing image/video content, we estimate social popularity from a combination of the weight vectors obtained from support vector regression (SVR) and tag frequency. Because the algorithm uses text annotations rather than image/video features, its computational cost is small; as a result, we can estimate social popularity more efficiently than previously proposed methods. Furthermore, the tags that significantly affect social popularity can be extracted with our algorithm. Our experiments used one million photos from the social networking website Flickr, and the results showed a high correlation between actual social popularity and the values estimated by our algorithm. Moreover, the proposed algorithm achieves high accuracy in classifying content as popular or unpopular.