{"title":"What is happening in a still picture?","authors":"Piji Li, Jun Ma","doi":"10.1109/ACPR.2011.6166555","DOIUrl":null,"url":null,"abstract":"We consider the problem of generating concise sentences to describe still pictures automatically. We treat objects in images (nouns in sentences) as hidden information of actions (verbs). Therefore, the sentence generation problem can be transformed into action detection and scene classification problems. We employ Latent Multiple Kernel Learning (L-MKL) to learn the action detectors from “Exemplarlets”, and utilize MKL to learn the scene classifiers. The image features employed include distribution of edges, dense visual words and feature descriptors at different levels of spatial pyramid. For a new image we can detect the action using a sliding-window detector learnt via L-MKL, predict the scene the action happened in and build haction, scenei tuples. Finally, these tuples will be translated into concise sentences according to previously defined grammar template. We show both the classification and sentence generating results on our newly collected dataset of six actions as well as demonstrate improved performance over existing methods.","PeriodicalId":287232,"journal":{"name":"The First Asian Conference on Pattern Recognition","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The First Asian Conference on Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ACPR.2011.6166555","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16
Abstract
We consider the problem of generating concise sentences to describe still pictures automatically. We treat objects in images (nouns in sentences) as hidden information of actions (verbs). Therefore, the sentence generation problem can be transformed into action detection and scene classification problems. We employ Latent Multiple Kernel Learning (L-MKL) to learn the action detectors from “Exemplarlets”, and utilize MKL to learn the scene classifiers. The image features employed include distribution of edges, dense visual words and feature descriptors at different levels of spatial pyramid. For a new image we can detect the action using a sliding-window detector learnt via L-MKL, predict the scene the action happened in and build haction, scenei tuples. Finally, these tuples will be translated into concise sentences according to previously defined grammar template. We show both the classification and sentence generating results on our newly collected dataset of six actions as well as demonstrate improved performance over existing methods.