Jian Jiao, Hong Lu, Zijian Wang, Wenqiang Zhang, Lizhe Qi
Sea-sky-line detection is an important research topic in the field of object detection and tracking at sea. We propose a method based on L0 gradient smoothing and bimodal histogram analysis to improve the robustness and accuracy of sea-sky-line detection. The proposed method relies mainly on the brightness difference between the sea region and the sky region in the image. First, we use L0 gradient smoothing to eliminate discrete noise in the image and achieve modularity of brightness. Unlike previous methods, we apply diagonal dividing to obtain the brightness thresholds for the sky and sea regions. The thresholds are then used for bimodal histogram analysis, which helps to obtain the brightness near the sea-sky-line and narrow the detection region. After the detection region is narrowed, the sea-sky-line in the image is extracted by a linear fitting method. To evaluate the performance of the proposed method, we manually construct a dataset of 40,000 images taken in five scenes, and we mark the ground-truth position of the sea-sky-line in each image. Extensive experiments on the dataset demonstrate that our method substantially outperforms state-of-the-art methods.
{"title":"L0 Gradient Smoothing and Bimodal Histogram Analysis: A Robust Method for Sea-sky-line Detection","authors":"Jian Jiao, Hong Lu, Zijian Wang, Wenqiang Zhang, Lizhe Qi","doi":"10.1145/3338533.3366554","DOIUrl":"https://doi.org/10.1145/3338533.3366554","url":null,"abstract":"Sea-sky-line detection is an important research topic in the field of object detection and tracking on the sea. We propose an L0 gradient smoothing and bimodal histogram analysis based method to improve the robustness and accuracy of sea-sky-line detection. The proposed method mainly depends on the brightness difference between the sea region and the sky region in the image. First, we use L0 gradient smoothing to eliminate discrete noise in the image and achieve the modularity of brightness. Differing from previous methods, diagonal dividing is applied to obtain the brightness thresholds for the sky and sea regions. Then the thresholds are used for bimodal histogram analysis which helps to obtain the brightness near the sea-sky-line and narrow the detection region. After narrowing the detection region, the sea-sky-line in the image is extracted by a linear fitting method. To evaluate the performance of the proposed method, we manually construct an dataset which includes 40, 000 images taken in five scenes. Moreover, we also mark the corresponding ground-truth positions of sea-sky-line in each of the images. 
Extensive experiments on the dataset demonstrate that our method outperforms the state-of-the-art methods tremendously.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115566065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
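The final step of the sea-sky-line method above, extracting the line by linear fitting, can be sketched as follows. The abstract does not say which fitting method is used, so ordinary least squares is an assumption here, and the function name fit_line and the candidate points are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to (x, y) points.

    Assumption: the abstract only says "a linear fitting method";
    least squares is used here purely for illustration.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical boundary candidates (column, row) from the narrowed region.
candidates = [(0, 100.0), (100, 150.0), (200, 200.0), (300, 250.0)]
slope, intercept = fit_line(candidates)
```

In practice the candidate points would come from the region that the bimodal histogram analysis leaves after narrowing; a robust fit may be preferable when outliers remain.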
{"title":"Session details: Brave New Idea","authors":"Rongrong Ji","doi":"10.1145/3379194","DOIUrl":"https://doi.org/10.1145/3379194","url":null,"abstract":"","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114255350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yiwei Zhang, Xueting Wang, Yoshiaki Sakai, T. Yamasaki
In this paper, we propose a new measure to estimate the similarity between brands via the posts of brands' followers on social network services (SNS). Our method was developed to explore the brands that customers are likely to purchase jointly. Nowadays, brands use social media for targeted advertising because influencing users' preferences can greatly affect sales trends. We assume that SNS data allows us to make quantitative comparisons between brands. Our proposed algorithm analyzes the daily photos and hashtags posted by each brand's followers. By clustering them and converting them to histograms, we can calculate the similarity between brands. We evaluated our proposed algorithm with purchase logs, credit card information, and questionnaire answers. The experimental results show that the purchase data maintained by a mall or a credit card company can predict co-purchases very well, but not customers' willingness to buy products of new brands. In contrast, our method can predict users' interest in brands with a correlation value over 0.53, which is fairly high considering that such interest is highly subjective and individual-dependent.
{"title":"Measuring Similarity between Brands using Followers' Post in Social Media","authors":"Yiwei Zhang, Xueting Wang, Yoshiaki Sakai, T. Yamasaki","doi":"10.1145/3338533.3366600","DOIUrl":"https://doi.org/10.1145/3338533.3366600","url":null,"abstract":"In this paper, we propose a new measure to estimate the similarity between brands via posts of brands' followers on social network services (SNS). Our method was developed with the intention of exploring the brands that customers are likely to jointly purchase. Nowadays, brands use social media for targeted advertising because influencing users' preferences can greatly affect the trends in sales. We assume that data on SNS allows us to make quantitative comparisons between brands. Our proposed algorithm analyzes the daily photos and hashtags posted by each brand's followers. By clustering them and converting them to histograms, we can calculate the similarity between brands. We evaluated our proposed algorithm with purchase logs, credit card information, and answers to the questionnaires. The experimental results show that the purchase data maintained by a mall or a credit card company can predict the co-purchase very well, but not the customer's willingness to buy products of new brands. 
On the other hand, our method can predict the users' interest on brands with a correlation value over 0.53, which is pretty high considering that such interest to brands are high subjective and individual dependent.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129483418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
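The cluster-then-histogram comparison described in the brand-similarity abstract can be sketched as below. The abstract does not state which similarity function is applied to the histograms, so cosine similarity is an assumption, and the cluster assignments and function names are hypothetical.

```python
import math
from collections import Counter


def brand_histogram(cluster_ids, num_clusters):
    """Normalized histogram of the cluster IDs of one brand's follower posts."""
    if not cluster_ids:
        raise ValueError("brand has no posts")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return [counts.get(k, 0) / total for k in range(num_clusters)]


def cosine_similarity(h1, h2):
    """Cosine similarity between two histograms (assumed metric)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm1 = math.sqrt(sum(a * a for a in h1))
    norm2 = math.sqrt(sum(b * b for b in h2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical cluster assignments of posts by followers of three brands.
brand_a = brand_histogram([0, 0, 1, 2], num_clusters=3)
brand_b = brand_histogram([0, 1, 1, 2], num_clusters=3)
brand_c = brand_histogram([2, 2, 2, 2], num_clusters=3)

sim_ab = cosine_similarity(brand_a, brand_b)
sim_ac = cosine_similarity(brand_a, brand_c)
```

Brands whose followers post similar content fall into the same clusters, so their histograms overlap and the similarity is higher (here brand A is closer to brand B than to brand C).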
With the growth of the urban population, crowd analysis has become an important and necessary task in the field of computer vision. The goal of crowd counting, a subfield of crowd analysis, is to count the number of people in an image or in a zone of a picture. Owing to problems such as heavy occlusion, perspective distortion, and variations in luminous intensity, crowd counting remains extremely challenging. Recent state-of-the-art approaches are mainly built on convolutional neural networks that generate density maps. In this work, the Multi-Dilation Network (MDNet) is proposed to solve the problem of crowd counting in congested scenes. MDNet is made up of two parts: a VGG-16-based front end for feature extraction and a back end containing multi-dilation blocks to generate density maps. In particular, each multi-dilation block has four branches that collect features of different sizes. By using dilated convolutions, the multi-dilation block can obtain varied features while the maximum kernel size remains 3 × 3. Experiments on two challenging crowd counting datasets, UCF_CC_50 and ShanghaiTech, show that the proposed MDNet achieves better performance than other state-of-the-art methods, with a lower mean absolute error and mean squared error. Compared with networks whose multi-scale blocks adopt larger kernels to extract features, MDNet still achieves competitive performance with fewer model parameters.
{"title":"Multi-Dilation Network for Crowd Counting","authors":"Shuheng Wang, Hanli Wang, Qinyu Li","doi":"10.1145/3338533.3366687","DOIUrl":"https://doi.org/10.1145/3338533.3366687","url":null,"abstract":"With the growth of urban population, crowd analysis has become an important and necessary task in the field of computer vision. The goal of crowd counting, which is a subfield of crowd analysis, is to count the number of people in an image or a zone of a picture. Due to the problems like heavy occlusions, perspective and luminous intensity variations, it is still extremely challenging to achieve crowd counting. Recent state-of-the-art approaches are mainly designed with convolutional neural networks to generate density maps. In this work, Multi-Dilation Network (MDNet) is proposed to solve the problem of crowd counting in congested scenes. The MDNet is made up of two parts: a VGG-16 based front end for feature extraction and a back end containing multi-dilation blocks to generate density maps. Especially, a multi-dilation block has four branches which are used to collect features in different sizes. By using dilated convolutional operations, the multi-dilation block could obtain various features while the maximum kernel size is still 3 x 3. The experiments on two challenging crowd counting datasets, UCF_CC_50 and ShanghaiTech, have shown that the proposed MDNet achieves better performances than other state-of-the-art methods, with a lower mean absolute error and mean squared error. 
Comparing to the network with multi-scale blocks which adopt larger kernels to extract features, MDNet still gains competitive performances with fewer model parameters.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128933108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
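The claim that dilated 3 × 3 kernels cover larger areas with fewer parameters can be checked with a little arithmetic. A kernel of size k with dilation d spans k + (k - 1)(d - 1) pixels per side while keeping k × k weights. The dilation rates and channel counts below are assumptions for illustration; the abstract does not list the rates used in the four branches.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length of the area covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(kernel_size, in_channels, out_channels):
    """Number of weights in a 2D convolution layer (bias excluded)."""
    return kernel_size * kernel_size * in_channels * out_channels


# Assumed dilation rates for the four branches (not given in the abstract).
ASSUMED_DILATIONS = (1, 2, 3, 4)
spans = [effective_kernel_size(3, d) for d in ASSUMED_DILATIONS]

# A dilated 3x3 layer versus a dense 9x9 layer covering the same span,
# with a hypothetical 64 input and 64 output channels.
dilated_weights = conv_weight_count(3, 64, 64)
dense_weights = conv_weight_count(9, 64, 64)
```

Under these assumptions the four branches span 3, 5, 7, and 9 pixels per side, and the dilated layer needs one ninth of the weights of a dense 9 × 9 layer with the same span.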
A. Man, Yuanyuan Pu, Dan Xu, Wenhua Qian, Zhengpeng Zhao, Qiuxia Yang
Sentiment analysis is an interesting and challenging task. Researchers mostly pay attention to single-modal (image or text) emotion recognition, and less attention is paid to the joint analysis of multi-modal data. Most existing multi-modal sentiment analysis algorithms that incorporate an attention mechanism focus only on local areas of images and ignore the emotional information provided by the global features of the image. Motivated by this, we propose a novel multi-modal sentiment analysis model that attends to both the local attentive features and the global contextual features of the image; a novel feature fusion mechanism then fuses the features from the different modalities. In the proposed model, a convolutional neural network (CNN) extracts the region maps of images and an attention mechanism produces the attention coefficients, a CNN with fewer hidden layers extracts the global feature, and a long short-term memory (LSTM) model extracts the textual feature. Finally, a tensor fusion network (TFN) fuses all features from the different modalities. Extensive experiments are conducted on both weakly labeled and manually labeled datasets, and the results demonstrate the superiority of the proposed method.
{"title":"Multi-Feature Fusion for Multimodal Attentive Sentiment Analysis","authors":"A. Man, Yuanyuan Pu, Dan Xu, Wenhua Qian, Zhengpeng Zhao, Qiuxia Yang","doi":"10.1145/3338533.3366591","DOIUrl":"https://doi.org/10.1145/3338533.3366591","url":null,"abstract":"Sentiment analysis has been an interesting and challenging task, researchers mostly pay attention to single-modal (image or text) emotion recognition, less attention is paid to joint analysis of multi-modal data. Most existing multi-modal sentiment analysis algorithms combined with attention mechanism focus only on local area of images, ignore the emotional information provided by the global features of the image. Motivated by the research status quo, in this paper, we proposed a novel multi-modal sentiment analysis model, which focuses on local attentive feature also on the global contextual feature from image, then a novel feature fusion mechanism is utilized to fuse features from different modal. In our proposed model, we use a convolutional neural network (CNN) to extract the region maps of images, and use the attention mechanism to acquire attention coefficient, then use a CNN with fewer hidden layers to extract the global feature, a long-short term memory model (LSTM) is utilized to extract textual feature. Finally, a tensor fusion network (TFN) is utilized to fuse all features from different modal. 
Extensive experiments are conducted on both weakly labeled and manually labeled datasets, and the results demonstrate the superiority of the proposed method.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124404743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
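The fusion step in the sentiment model above relies on a tensor fusion network. In the usual formulation of tensor fusion, each feature vector is extended with a constant 1 and the outer product of the extended vectors is taken, so the result contains unimodal, bimodal, and trimodal interaction terms. The sketch below assumes three inputs (local attentive, global, and textual features) with tiny hypothetical dimensions; it shows only the outer product, not the layers that follow it.

```python
from itertools import product


def tensor_fusion(*features):
    """Flattened outer product of feature vectors, each extended with a 1."""
    extended = [list(f) + [1.0] for f in features]
    fused = []
    for combo in product(*extended):
        value = 1.0
        for component in combo:
            value *= component
        fused.append(value)
    return fused


# Hypothetical feature vectors; real ones would come from the CNNs and LSTM.
local_feat = [0.2, 0.5]
global_feat = [0.1]
text_feat = [0.3, 0.4, 0.9]

fused = tensor_fusion(local_feat, global_feat, text_feat)
```

The fused vector has (2 + 1)(1 + 1)(3 + 1) = 24 entries. The entry where both other modalities contribute their constant 1 reproduces the raw local feature, which is how the unimodal information survives the fusion.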
Zhuangzi Li, Feng Dai, N. Zhang, Lei Wang, Ziyu Xue
Recent convolutional neural networks (CNNs) have shown promising performance on image retrieval owing to their powerful feature extraction capability. However, the potential relations among feature maps are not effectively exploited in previous CNNs, resulting in inaccurate feature representations. To address this issue, we mine channel-wise feature relations through a matching strategy that adaptively highlights informative features. In this paper, we propose a novel representative feature matching network (RFMN) for image hashing retrieval. Specifically, we propose a novel representative feature matching block (RFMB) that matches feature maps with their representative one, so that the significance of each feature map can be exploited according to the matching similarity. In addition, we present an innovative pooling layer based on representative feature matching that relates pooled features to unpooled features, so as to highlight the pooled features that retain more valuable information. Extensive experiments show that our approach improves the average results of a conventional residual network by more than 2.6% on CIFAR-10 and 1.4% on the NUS-WIDE dataset, while achieving state-of-the-art performance.
{"title":"Representative Feature Matching Network for Image Retrieval","authors":"Zhuangzi Li, Feng Dai, N. Zhang, Lei Wang, Ziyu Xue","doi":"10.1145/3338533.3366596","DOIUrl":"https://doi.org/10.1145/3338533.3366596","url":null,"abstract":"Recent convolutional neural network (CNNs) have shown promising performance on image retrieval due to the powerful feature extraction capability. However, the potential relations of feature maps are not effectively exploited in the before CNNs, resulting in inaccurate feature representations. To address this issue, we excavate feature channel-wise realtions by a matching strategy to adaptively highlight informative features. In this paper, we propose a novel representative feature matching network (RFMN) for image hashing retrieval. Specifically, we propose a novel representative feature matching block (RFMB) that can match feature maps with their representative one. So, the significance of each feature map can be exploited according to the matching similarity. In addition, we also present an innovative pooling layer based on the representative feature matching to build relations of pooled features with unpooled features, so as to highlight the pooled features retained more valuable information. 
Extensive experiments show that our approach can promote the average results of conventional residual network more than 2.6% on Cifar-10 and 1.4% on NUS-WIDE dataset, meanwhile achieve the state-of-the-art performance.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123700344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The task of multi-label image classification is to predict a set of proper labels for an input image. To this end, it is necessary to strengthen the association between the labels and the image regions and to utilize the relationships among the labels. In this paper, we propose a novel framework for multi-label image classification that uses an attention mechanism and a Graph Convolutional Network (GCN) simultaneously. The attention mechanism can focus on specific target regions while ignoring surrounding irrelevant information, thereby enhancing the association of the labels with the image regions. By constructing a directed graph over the labels, the GCN can learn the relationships among the labels from a global perspective and map this label graph to a set of inter-dependent object classifiers. The framework first uses ResNet to extract features while using the attention mechanism to generate attention maps for all labels and obtain weighted features. The GCN then uses the weighted fusion of the ResNet output and the attention features to perform classification. Experimental results show that both the attention mechanism and the GCN effectively improve classification performance, and that the proposed framework is competitive with state-of-the-art methods.
{"title":"Multi-Label Image Classification with Attention Mechanism and Graph Convolutional Networks","authors":"Quanling Meng, Weigang Zhang","doi":"10.1145/3338533.3366589","DOIUrl":"https://doi.org/10.1145/3338533.3366589","url":null,"abstract":"The task of multi-label image classification is to predict a set of proper labels for an input image. To this end, it is necessary to strengthen the association between the labels and the image regions, and utilize the relationship between the labels. In this paper, we propose a novel framework for multi-label image classification, which uses attention mechanism and Graph Convolutional Network (GCN) simultaneously. The attention mechanism can focus on specific target regions while ignoring other useless information around, thereby enhancing the association of the labels with the image regions. By constructing a directed graph over the labels, GCN can learn the relationship between the labels from a global perspective and map this label graph to a set of inter-dependent object classifiers. The framework first uses ResNet to extract features while using attention mechanism to generate attention maps for all labels and obtain weighted features. GCN uses weighted fusion features from the output of the resnet and attention mechanism to achieve classification. 
Experimental results show that both the attention mechanism and GCN can effectively improve the classification performance, and the proposed framework is competitive with the state-of-the-art methods.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114197568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
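The way a GCN propagates information over a directed label graph, as described in the multi-label classification abstract, can be sketched with a single layer of the common form H' = ReLU(Â H W). The abstract does not say how the adjacency matrix is normalized, so row normalization is an assumption, and the three-label graph, embeddings, and weights are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def row_normalize(adj):
    """Divide each row by its sum so neighbor contributions average out."""
    normalized = []
    for row in adj:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(normalized_adj @ features @ weights)."""
    propagated = matmul(row_normalize(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]


# Hypothetical directed graph over 3 labels, self-loops on the diagonal.
adj = [[1, 1, 0],
       [0, 1, 0],
       [1, 0, 1]]
label_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 2.0]]

classifiers = gcn_layer(adj, label_embeddings, weights)
```

Each output row can be read as a classifier vector for one label; because a row mixes the embeddings of the labels it points to, the resulting classifiers are inter-dependent rather than learned in isolation.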
Mengtian Li, Jie Guo, Xiufen Cui, Rui Pan, Yanwen Guo, Chenchen Wang, Piaopiao Yu, Fei Pan
In this paper, we propose a learning-based method to estimate high dynamic range (HDR) indoor illumination from a single low dynamic range (LDR) photograph with a limited field of view. Because indoor illumination is extremely complex and virtually impossible to reconstruct perfectly, we encode the environmental illumination with Spherical Gaussian (SG) functions whose center directions and bandwidths are fixed, and allow only the weights to vary. An end-to-end convolutional neural network (CNN) is designed and trained to model the complex relationship between a photograph and its illumination represented by SG functions. Moreover, we employ a masked L2 loss instead of a naive L2 loss to avoid losing high-frequency information, and propose a glossy loss to improve the rendering quality. Our experiments demonstrate that the proposed approach outperforms state-of-the-art methods both qualitatively and quantitatively.
{"title":"Deep Spherical Gaussian Illumination Estimation for Indoor Scene","authors":"Mengtian Li, Jie Guo, Xiufen Cui, Rui Pan, Yanwen Guo, Chenchen Wang, Piaopiao Yu, Fei Pan","doi":"10.1145/3338533.3366562","DOIUrl":"https://doi.org/10.1145/3338533.3366562","url":null,"abstract":"In this paper, we propose a learning-based method to estimate high dynamic range (HDR) indoor illumination from only a single low dynamic range (LDR) photograph of limited field-of-view. Considering the extreme complexity of indoor illumination that is virtually impossible to reconstruct perfectly, we choose to encode the environmental illumination in Spherical Gaussian (SG) functions with fixed centering directions and bandwidth and only allow the weights vary. An end-to-end convolutional neural network (CNN) is designed and trained to build the complex relationship between a photograph and its illumination represented by SG functions. Moreover, we employ a masked L2 loss instead of naive L2 loss to avoid the loss of high frequency information, and propose a glossy loss to improve the rendering quality. Our experiments demonstrate that the proposed approach outperforms the state-of-the-arts both qualitatively and quantitatively.","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130413372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
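The Spherical Gaussian encoding in the illumination abstract can be illustrated with the standard SG form, in which a lobe with unit center direction ξ, bandwidth λ, and weight w contributes w · exp(λ(v · ξ − 1)) in direction v. With centers and bandwidth fixed, the weights are the only unknowns a network has to predict. The two lobes, the bandwidth, and the scalar weights below are hypothetical; in practice the weights would likely be per color channel.

```python
import math


def sg_radiance(direction, centers, bandwidth, weights):
    """Sum of Spherical Gaussian lobes evaluated at a unit direction.

    Each lobe contributes weight * exp(bandwidth * (dot(direction, center) - 1)).
    Centers and bandwidth are fixed; only the weights vary.
    """
    total = 0.0
    for center, weight in zip(centers, weights):
        cosine = sum(d * c for d, c in zip(direction, center))
        total += weight * math.exp(bandwidth * (cosine - 1.0))
    return total


# Hypothetical setup: two lobes pointing up and down, shared bandwidth.
centers = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
bandwidth = 8.0
weights = [2.0, 0.5]  # the quantities a network would predict

up = sg_radiance((0.0, 0.0, 1.0), centers, bandwidth, weights)
side = sg_radiance((1.0, 0.0, 0.0), centers, bandwidth, weights)
```

Looking straight along a lobe's center returns almost exactly that lobe's weight, and the contribution falls off smoothly as the direction moves away, at a rate set by the bandwidth.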