Overtourism negatively affects tourist sites in many ways. One of the most serious problems is environmental damage, such as littering, caused by an excess of visitors. Improving this situation requires changing people's mindset to be more environmentally aware. In particular, if we can identify people who are comparatively aware of the environmental issues caused by overtourism, we can promote eco-friendly behavior more effectively. However, a person's awareness is inherently difficult to grasp. To address this challenge, we introduce a new task, Detecting Focus of Posts about Tourism: given users' SNS posts, consisting of pictures and comments about tourist sites, classify them by the focus of the post, which reflects such awareness. Once posts are classified, the results reveal tendencies in users' awareness, allowing us to discern which users are conscious of environmental issues at tourist sites. Specifically, we define four labels for the focus of SNS posts about tourist sites and, based on these labels, build an evaluation dataset. We present experimental results for the classification task with a CNN classifier for pictures and an LSTM classifier for comments, which serve as baselines for the task.
{"title":"Classification of multimedia SNS posts about tourist sites based on their focus toward predicting eco-friendly users","authors":"Naoto Kashiwagi, Tokinori Suzuki, Jounghun Lee, Daisuke Ikeda","doi":"10.1145/3444685.3446272","DOIUrl":"https://doi.org/10.1145/3444685.3446272","url":null,"abstract":"Overtourism has had a negative impact on various things at tourist sites. One of the most serious problems is environmental issues, such as littering, caused by too many visitors to tourist sites. It is important to change people's mindset to be more environmentally aware in order to improve such situation. In particular, if we can find people with comparatively high awareness about environmental issues for overtourism, we will be able to work effectively to promote eco-friendly behavior for people. However, grasping a person's awareness is inherently difficult. For this challenge, we introduce a new task, called Detecting Focus of Posts about Tourism, which is given users' posts of pictures and comment on SNSs about tourist sites, to classify them into types of their focuses based on such awareness. Once we classify such posts, we can see its result showing tendencies of users awareness and so we can discern awareness of the users for environmental issues at tourist sites. Specifically, we define four labels on focus of SNS posts about tourist sites. Based on these labels, we create an evaluation dataset. We present experimental results of the classification task with a CNN classifier for pictures or an LSTM classifier for comments, which will be baselines for the task.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122473620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fang Zhou, Bei Yin, Zanxia Jin, Heran Wu, Dongyang Zhang
Text-based Visual Question Answering (VQA) usually requires analyzing and understanding the text in a picture to answer the given question correctly. In this paper, we propose a generic text-based VQA method with a Knowledge Base (KB), which performs text-based search on the text obtained by optical character recognition (OCR) in images, constructs task-oriented knowledge information, and integrates it into existing models. Because image scenes are complex, OCR accuracy is limited and words often contain individual incorrect characters, resulting in inaccurate text information; with the help of the KB, the correct words can be recovered and added to the image text information. Moreover, the knowledge information constructed with the KB better explains the image content, allowing the model to fully understand the image and find the appropriate textual answer. Experimental results on the TextVQA dataset show that our method improves accuracy, with a maximum gain of 39.2%.
{"title":"Text-based visual question answering with knowledge base","authors":"Fang Zhou, Bei Yin, Zanxia Jin, Heran Wu, Dongyang Zhang","doi":"10.1145/3444685.3446306","DOIUrl":"https://doi.org/10.1145/3444685.3446306","url":null,"abstract":"Text-based Visual Question Answering(VQA) usually needs to analyze and understand the text in a picture to give a correct answer for the given question. In this paper, a generic Text-based VQA with Knowledge Base (KB) is proposed, which performs text-based search on text information obtained by optical character recognition (OCR) in images, constructs task-oriented knowledge information and integrates it into the existing models. Due to the complexity of the image scene, the accuracy of OCR is not very high, and there are often cases where the words have individual character that is incorrect, resulting in inaccurate text information; here, some correct words can be found with help of KB, and the correct image text information can be added. Moreover, the knowledge information constructed with KB can better explain the image information, allowing the model to fully understand the image and find the appropriate text answer. The experimental results on the TextVQA dataset show that our method improves the accuracy, and the maximum increment is 39.2%.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127336729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human action recognition is an active research area in computer vision. To address the lack of spatial multi-scale information in human action recognition, we present a novel framework that recognizes human actions from depth video sequences using multi-scale Laplacian pyramid depth motion images (LP-DMI). Each depth frame is projected onto three orthogonal Cartesian planes. For each of the three views, we generate depth motion images (DMI) and construct Laplacian pyramids as structured multi-scale feature maps, which enhance the multi-scale dynamic information of motions and reduce redundant static information of the human body. We further extract a multi-granularity descriptor, LP-DMI-HOG, to provide more discriminative features. Finally, we use an extreme learning machine (ELM) for action classification. Extensive experiments on the public MSRAction3D dataset show that our method outperforms state-of-the-art benchmarks.
{"title":"A multi-scale human action recognition method based on Laplacian pyramid depth motion images","authors":"Chang Li, Qian Huang, Xing Li, Qianhan Wu","doi":"10.1145/3444685.3446284","DOIUrl":"https://doi.org/10.1145/3444685.3446284","url":null,"abstract":"Human action recognition is an active research area in computer vision. Aiming at the lack of spatial muti-scale information for human action recognition, we present a novel framework to recognize human actions from depth video sequences using multi-scale Laplacian pyramid depth motion images (LP-DMI). Each depth frame is projected onto three orthogonal Cartesian planes. Under three views, we generate depth motion images (DMI) and construct Laplacian pyramids as structured multi-scale feature maps which enhances multi-scale dynamic information of motions and reduces redundant static information in human bodies. We further extract the multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features. Finally, we utilize extreme learning machine (ELM) for action classification. Through extensive experiments on the public MSRAction3D datasets, we prove that our method outperforms state-of-the-art benchmarks.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125891284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jun Liang, Haosheng Chen, Kaiwen Du, Yan Yan, Hanzi Wang
Video object detection is challenging due to appearance deterioration in video frames; object features extracted from different frames of a video are therefore deteriorated to varying degrees. Some current state-of-the-art methods enhance the deteriorated object features in a reference frame by aggregating undeteriorated object features extracted from other frames, based simply on a learned appearance relation among object features. In this paper, we propose a novel intra-inter semantic aggregation method (ISA) that learns more effective intra and inter relations for semantically aggregating object features. Specifically, ISA first introduces an intra semantic aggregation module (Intra-SAM) that enhances deteriorated spatial features based on the learned intra relation among features at different positions of an individual object. It then presents an inter semantic aggregation module (Inter-SAM) that enhances deteriorated object features in the temporal domain based on the learned inter relation among object features. By leveraging Intra-SAM and Inter-SAM, the proposed ISA generates discriminative features from the novel perspective of intra-inter semantic aggregation for robust video object detection. We conduct extensive experiments on the ImageNet VID dataset to evaluate ISA. It obtains 84.5% mAP with ResNet-101 and 85.2% mAP with ResNeXt-101, achieving superior performance compared with several state-of-the-art video object detectors.
{"title":"Learning intra-inter semantic aggregation for video object detection","authors":"Jun Liang, Haosheng Chen, Kaiwen Du, Yan Yan, Hanzi Wang","doi":"10.1145/3444685.3446273","DOIUrl":"https://doi.org/10.1145/3444685.3446273","url":null,"abstract":"Video object detection is a challenging task due to the appearance deterioration problems in video frames. Thus, object features extracted from different frames of a video are usually deteriorated in varying degrees. Currently, some state-of-the-art methods enhance the deteriorated object features in a reference frame by aggregating the undeteriorated object features extracted from other frames, simply based on their learned appearance relation among object features. In this paper, we propose a novel intra-inter semantic aggregation method (ISA) to learn more effective intra and inter relations for semantically aggregating object features. Specifically, in the proposed ISA, we first introduce an intra semantic aggregation module (Intra-SAM) to enhance the deteriorated spatial features based on the learned intra relation among the features at different positions of an individual object. Then, we present an inter semantic aggregation module (Inter-SAM) to enhance the deteriorated object features in the temporal domain based on the learned inter relation among object features. As a result, by leveraging Intra-SAM and Inter-SAM, the proposed ISA can generate discriminative features from the novel perspective of intra-inter semantic aggregation for robust video object detection. We conduct extensive experiments on the ImageNet VID dataset to evaluate ISA. The proposed ISA obtains 84.5% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, and it achieves superior performance compared with several state-of-the-art video object detectors.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121570640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a system for large-scale image retrieval on everyday scenes with common objects. Our system leverages advances in deep learning and natural language processing (NLP) for improved understanding of images by capturing the relationships between the objects within an image. As a result, a user can retrieve highly relevant images and obtain suggestions for similar image queries to further explore the repository. Each image in the repository is processed (using deep learning) to obtain its most probable captions and objects. The captions are parsed into tree structures using NLP techniques, then stored and indexed in a database system. When a query image is posed, an optimized tree-pattern query is executed by the database system to obtain candidate matches, which are then ranked using the tree-edit distance between the tree structures to output the top-k matches. Word embeddings and Bloom filters are used to obtain similar image queries. By clicking the suggested similar image queries, a user can intuitively explore the repository.
{"title":"A large-scale image retrieval system for everyday scenes","authors":"Arun Zachariah, Mohamed Gharibi, P. Rao","doi":"10.1145/3444685.3446253","DOIUrl":"https://doi.org/10.1145/3444685.3446253","url":null,"abstract":"We present a system for large-scale image retrieval on everyday scenes with common objects. Our system leverages advances in deep learning and natural language processing (NLP) for improved understanding of images by capturing the relationships between the objects within an image. As a result, a user can retrieve highly relevant images and obtain suggestions for similar image queries to further explore the repository. Each image in the repository is processed (using deep learning) to obtain the most probable captions and objects in it. The captions are parsed into tree structures using NLP techniques, and stored and indexed in a database system. When a query image is posed, an optimized tree-pattern query is executed by the database system to obtain candidate matches, which are then ranked using tree-edit distance of the tree structures to output the top-k matches. Word embeddings and Bloom filters are used to obtain similar image queries. By clicking the suggested similar image queries, a user can intuitively explore the repository.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114602369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, with the development of electronic medical record (EMR) systems, it has become possible to mine patients' clinical data to improve the quality of medical care. After a treatment engine learns knowledge from EMR data, it can automatically recommend the next stage of prescriptions and provide treatment guidelines for doctors and patients. However, this task is challenged by the multi-modality of EMR data. To predict the next stage of treatment prescriptions more effectively using multimodal information and the connections between modalities, we propose a cross-modal shared-specific feature complementary generation and attention fusion algorithm. In the feature extraction stage, specific and shared information are obtained through a shared-specific feature extraction network. To capture the correlation between modalities, we propose a sorting network. In the multimodal feature fusion stage, we use an attention fusion network to assign different weights to the multimodal features at different stages, obtaining a better patient representation. Considering the redundancy between modality-specific and shared information, we introduce a complementary feature learning strategy, including modality adaptation for shared features, project adversarial learning for specific features, and reconstruction enhancement. Experimental results on the real EMR dataset MIMIC-III demonstrate the superiority of the method and the effectiveness of each component.
{"title":"A treatment engine by multimodal EMR data","authors":"Zhaomeng Huang, Liyan Zhang, Xu Xu","doi":"10.1145/3444685.3446254","DOIUrl":"https://doi.org/10.1145/3444685.3446254","url":null,"abstract":"In recent years, with the development of electronic medical record (EMR) systems, it has become possible to mine patient clinical data to improve medical care quality. After the treatment engine learns knowledge from the EMR data, it can automatically recommend the next stage of prescriptions and provide treatment guidelines for doctors and patients. However, this task is always challenged by the multi-modality of EMR data. To more effectively predict the next stage of treatment prescription by using multimodal information and the connection between the modalities, we propose a cross-modal shared-specific feature complementary generation and attention fusion algorithm. In the feature extraction stage, specific information and shared information are obtained through a shared-specific feature extraction network. To obtain the correlation between the modalities, we propose a sorting network. We use the attention fusion network in the multimodal feature fusion stage to give different multimodal features at different stages with different weights to obtain a more prepared patient representation. Considering the redundant information of specific modal information and shared modal information, we introduce a complementary feature learning strategy, including modality adaptation for shared features, project adversarial learning for specific features, and reconstruction enhancement. The experimental results on the real EMR data set MIMIC-III prove its superiority and each part's effectiveness.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126534710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huan-Hua Chang, Wen-Cheng Chen, Wan-Lun Tsai, Min-Chun Hu, W. Chu
Learning basketball tactics in a virtual reality environment requires real-time feedback to improve realism and interactivity; for example, the virtual defender should move immediately according to the player's movement. In this paper, we propose an autoregressive generative model for basketball defensive trajectory generation. To learn the continuous Gaussian distribution of player positions, we adopt a differentiable sampling process to sample candidate locations with a standard-deviation loss, which preserves the diversity of the trajectories. Furthermore, we design several additional loss functions based on basketball domain knowledge to make the generated trajectories match real game situations. Experimental results show that the proposed method outperforms previous work across different evaluation metrics.
{"title":"An autoregressive generation model for producing instant basketball defensive trajectory","authors":"Huan-Hua Chang, Wen-Cheng Chen, Wan-Lun Tsai, Min-Chun Hu, W. Chu","doi":"10.1145/3444685.3446300","DOIUrl":"https://doi.org/10.1145/3444685.3446300","url":null,"abstract":"Learning basketball tactic via virtual reality environment requires real-time feedback to improve the realism and interactivity. For example, the virtual defender should move immediately according to the player's movement. In this paper, we proposed an autoregressive generative model for basketball defensive trajectory generation. To learn the continuous Gaussian distribution of player position, we adopt a differentiable sampling process to sample the candidate location with a standard deviation loss, which can preserve the diversity of the trajectories. Furthermore, we design several additional loss functions based on the domain knowledge of basketball to make the generated trajectories match the real situation in basketball games. The experimental results show that the proposed method can achieve better performance than previous works in terms of different evaluation metrics.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128957901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing research shows that viewing conditions have a significant impact on visual perception when media are viewed on mobile screens. This raises two issues for visual saliency that we need to address: how saliency models perform under mobile conditions, and how to account for mobile conditions when designing a saliency model. To investigate the performance of saliency models in a mobile environment, we collect eye fixations under four typical mobile conditions as the mobile ground truth in this work. To account for mobile conditions when designing a saliency model, we treat viewing factors and visual stimuli as two modalities and propose a cross-modal deep learning architecture for visual attention prediction. Experimental results demonstrate that the model that considers mobile viewing factors often outperforms models without such consideration.
{"title":"Cross-modal learning for saliency prediction in mobile environment","authors":"Dakai Ren, X. Wen, Xiao-Yang Liu, Shuai Huang, Jiazhong Chen","doi":"10.1145/3444685.3446304","DOIUrl":"https://doi.org/10.1145/3444685.3446304","url":null,"abstract":"The existing researches reveal that a significant impact is introduced by viewing conditions for visual perception when viewing media on mobile screens. This brings two issues in the area of visual saliency that we need to address: how the saliency models perform in mobile conditions, and how to consider the mobile conditions when designing a saliency model. To investigate the performance of saliency models in mobile environment, eye fixations in four typical mobile conditions are collected as the mobile ground truth in this work. To consider the mobile conditions when designing a saliency model, we combine viewing factors and visual stimuli as two modalities, and a cross-modal based deep learning architecture is proposed for visual attention prediction. Experimental results demonstrate the model with the consideration of mobile viewing factors often outperforms the models without such consideration.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"31 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127980894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yingjiao Pei, Zhongyuan Wang, Heling Chen, Baojin Huang, Weiping Tu
With the development of the Internet, multimedia data has grown exponentially. The demand for video organization, summarization, and retrieval keeps increasing, and scene detection plays an essential role in these tasks. Existing shot clustering algorithms for scene detection usually treat the temporal shot sequence as unconstrained data. Graph-based scene detection methods can locate scene boundaries by taking the temporal relations among shots into account, but most of them rely only on low-level features to decide whether connected shot pairs are similar. Optimized algorithms that consider the temporal order of shots or combine multi-modal features introduce parameter-tuning difficulties and computational burden. In this paper, we propose a novel temporal clustering method based on a graph convolution network and the link transitivity of shot nodes, without complicated steps or prior parameter settings such as the number of clusters. In particular, the graph convolution network predicts the link probability of node pairs that are close in the temporal sequence. The shots are then clustered into scene segments by merging all predicted links. Experimental results on the BBC and OVSD datasets show that our approach is more robust and effective than the comparison methods in terms of F1-score.
{"title":"Video scene detection based on link prediction using graph convolution network","authors":"Yingjiao Pei, Zhongyuan Wang, Heling Chen, Baojin Huang, Weiping Tu","doi":"10.1145/3444685.3446293","DOIUrl":"https://doi.org/10.1145/3444685.3446293","url":null,"abstract":"With the development of the Internet, multimedia data grows by an exponential level. The demand for video organization, summarization and retrieval has been increasing where scene detection plays an essential role. Existing shot clustering algorithms for scene detection usually treat temporal shot sequence as unconstrained data. The graph based scene detection methods can locate the scene boundaries by taking the temporal relation among shots into account, while most of them only rely on low-level features to determine whether the connected shot pairs are similar or not. The optimized algorithms considering temporal sequence of shots or combining multi-modal features will bring parameter trouble and computational burden. In this paper, we propose a novel temporal clustering method based on graph convolution network and the link transitivity of shot nodes, without involving complicated steps and prior parameter setting such as the number of clusters. In particular, the graph convolution network is used to predict the link possibility of node pairs that are close in temporal sequence. The shots are then clustered into scene segments by merging all possible links. Experimental results on BBC and OVSD datasets show that our approach is more robust and effective than the comparison methods in terms of F1-score.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130964252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lexuan Sun, Xueliang Liu, Zhenzhen Hu, Richang Hong
Image dehazing is a fundamental task in computer vision and multimedia, and it typically faces challenges from two aspects: i) the uneven distribution of arbitrary haze and ii) the distortion of image pixels caused by haze. In this paper, we propose an end-to-end trainable framework, named Weighted-Fusion Network with Poly-Scale Convolution (WFN-PSC), to address these issues. The proposed method is built on Poly-Scale Convolution (PSConv), which extracts image features at different scales without upsampling or downsampling and thus avoids image distortion. Beyond this, we design spatial and channel weighted-fusion modules that make the WFN-PSC model focus on the hard-to-dehaze parts of the image along two dimensions. Specifically, we design three Part Architectures followed by a channel weighted-fusion module; each Part Architecture consists of three PSConv residual blocks and a spatial weighted-fusion module. Experiments on the benchmark demonstrate the dehazing effectiveness of the proposed method. Furthermore, since image dehazing is a low-level vision task, we evaluate the dehazed images on an object detection task, and the results show that the proposed method can serve as an effective pre-processing step for high-level computer vision tasks.
{"title":"WFN-PSC: weighted-fusion network with poly-scale convolution for image dehazing","authors":"Lexuan Sun, Xueliang Liu, Zhenzhen Hu, Richang Hong","doi":"10.1145/3444685.3446292","DOIUrl":"https://doi.org/10.1145/3444685.3446292","url":null,"abstract":"Image dehazing is a fundamental task for the computer vision and multimedia and usually in the face of the challenge from two aspects, i) the uneven distribution of arbitrary haze and ii) the distortion of image pixels caused by the hazed image. In this paper, we propose an end-to-end trainable framework, named Weighted-Fusion Network with Poly-Scale Convolution (WFN-PSC), to address these dehazing issues. The proposed method is designed based on the Poly-Scale Convolution (PSConv). It can extract the image feature from different scales without upsampling and downsampled, which avoids the image distortion. Beyond this, we design the spatial and channel weighted-fusion modules to make the WFN-PSC model focus on the hard dehazing parts of image from two dimensions. Specifically, we design three Part Architectures followed by the channel weighted-fusion module. Each Part Architecture consists of three PSConv residual blocks and a spatial weighted-fusion module. The experiments on the benchmark demonstrate the dehazing effectiveness of the proposed method. Furthermore, considering that image dehazing is a low-level task in the computer vision, we evaluate the dehazed image on the object detection task and the results show that the proposed method can be a good pre-processing to assist the high-level computer vision task.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127935984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}