Deep models are vulnerable to catastrophic forgetting when fine-tuned on new data. Popular distillation-based methods usually neglect the relations between data samples and may eventually forget essential structural knowledge. To address these shortcomings, we propose an incremental learning framework based on structural graph knowledge distillation that preserves both the positions of samples and their relations. First, a memory knowledge graph (MKG) is generated to fully characterize the structural knowledge of historical tasks. Second, we develop a graph interpolation mechanism to enrich the domain of knowledge and alleviate the inter-class sample imbalance issue. Third, we introduce structural graph knowledge distillation to transfer the knowledge of historical tasks. Comprehensive experiments on three datasets validate the proposed method.
{"title":"Structural Knowledge Organization and Transfer for Class-Incremental Learning","authors":"Yu Liu, Xiaopeng Hong, Xiaoyu Tao, Songlin Dong, Jingang Shi, Yihong Gong","doi":"10.1145/3469877.3490598","DOIUrl":"https://doi.org/10.1145/3469877.3490598","url":null,"abstract":"Deep models are vulnerable to catastrophic forgetting when fine-tuned on new data. Popular distillation-based methods usually neglect the relations between data samples and may eventually forget essential structural knowledge. To solve these shortcomings, we propose a structural graph knowledge distillation based incremental learning framework to preserve both the positions of samples and their relations. Firstly, a memory knowledge graph (MKG) is generated to fully characterize the structural knowledge of historical tasks. Secondly, we develop a graph interpolation mechanism to enrich the domain of knowledge and alleviate the inter-class sample imbalance issue. Thirdly, we introduce structural graph knowledge distillation to transfer the knowledge of historical tasks. Comprehensive experiments on three datasets validate the proposed method.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114107435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
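The distinction the abstract above draws between preserving sample positions and preserving their relations can be illustrated with a small sketch. This is not the paper's MKG or its distillation loss; it is a hypothetical toy in which "position" is an embedding vector, "relation" is the Euclidean distance between two embeddings, and the function names are invented for illustration.

```python
import math

def position_loss(old_embs, new_embs):
    """Mean squared displacement of each sample between old and new embeddings."""
    total = 0.0
    for old, new in zip(old_embs, new_embs):
        total += sum((o - n) ** 2 for o, n in zip(old, new))
    return total / len(old_embs)

def pairwise_distances(embs):
    """Euclidean distance for every unordered pair of samples (toy graph edges)."""
    return [math.dist(embs[i], embs[j])
            for i in range(len(embs)) for j in range(i + 1, len(embs))]

def relation_loss(old_embs, new_embs):
    """Mean squared change in pairwise distances between old and new embeddings."""
    old_d = pairwise_distances(old_embs)
    new_d = pairwise_distances(new_embs)
    return sum((a - b) ** 2 for a, b in zip(old_d, new_d)) / len(old_d)

# Assumed toy embeddings from an "old" model.
old = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
# Translating every sample moves positions but keeps all relations intact.
shifted = [(x + 1.0, y + 1.0) for x, y in old]
# Shrinking the space changes both positions and relations.
shrunk = [(0.5 * x, 0.5 * y) for x, y in old]

print(position_loss(old, shifted), relation_loss(old, shifted))
print(position_loss(old, shrunk), relation_loss(old, shrunk))
```

In this toy, a pure translation is penalized only by the position term, while a rescaling is penalized by both, which is one way to see why a method might track the two kinds of knowledge separately.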
In this paper, we introduce a new recipe dataset, MIRecipe (Multimedia-Instructional Recipe). It has both text and image data for every cooking step, whereas conventional recipe datasets contain only final dish images and/or images for only some of the steps. It consists of 26,725 recipes, which include 239,973 steps in total. The recognition of ingredients in images associated with cooking steps poses a new challenge: since ingredients are processed during cooking, the appearance of the same ingredient is very different in the beginning and finishing stages of cooking. General object recognition methods, which assume a constant appearance of objects, do not perform well for such objects. To solve this problem, we propose two stage-aware techniques: stage-wise model learning, which trains a separate model for each stage, and stage-aware curriculum learning, which starts with the training data from the beginning stage and proceeds to the later stages. An experiment on our dataset shows that our method achieves higher accuracy than a model trained on all the data without considering the stages. Our dataset is available at our GitHub repository.
{"title":"MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients","authors":"Yixin Zhang, Yoko Yamakata, Keishi Tajima","doi":"10.1145/3469877.3490596","DOIUrl":"https://doi.org/10.1145/3469877.3490596","url":null,"abstract":"In this paper, we introduce a new recipe dataset MIRecipe (Multimedia-Instructional Recipe). It has both text and image data for every cooking step, while the conventional recipe datasets only contain final dish images, and/or images only for some of the steps. It consists of 26,725 recipes, which include 239,973 steps in total. The recognition of ingredients in images associated with cooking steps poses a new challenge: Since ingredients are processed during cooking, the appearance of the same ingredient is very different in the beginning and finishing stages of the cooking. The general object recognition methods, which assume the constant appearance of objects, do not perform well for such objects. To solve the problem, we propose two stage-aware techniques: stage-wise model learning, which trains a separate model for each stage, and stage-aware curriculum learning, which starts with the training data from the beginning stage and proceeds to the later stages. Our experiment with our dataset shows that our method achieves higher accuracy than the model trained using all the data without considering the stages. 
Our dataset is available at our GitHub repository.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121379082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
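The stage-aware ordering described in the abstract above can be sketched as a data schedule. This is a hypothetical illustration, not the authors' training code: the `(stage_index, example)` sample format, the function name, and the choice to let the training pool grow cumulatively are all assumptions, since the abstract does not say whether earlier stages are kept when later ones are added.

```python
from collections import defaultdict

def stage_curriculum(samples, cumulative=True):
    """Yield (stage, training_pool) from the earliest stage to the latest.

    `samples` is an iterable of (stage_index, example) pairs (assumed format).
    With cumulative=True the pool keeps earlier stages; with cumulative=False
    each pool holds a single stage only.
    """
    by_stage = defaultdict(list)
    for stage, example in samples:
        by_stage[stage].append(example)

    pool = []
    for stage in sorted(by_stage):
        # Build a fresh list each time so previously yielded pools are not mutated.
        pool = pool + by_stage[stage] if cumulative else list(by_stage[stage])
        yield stage, pool

# Made-up examples standing in for step images labelled with a cooking stage.
samples = [
    (0, "raw onion"),
    (2, "caramelized onion"),
    (1, "sliced onion"),
    (0, "raw carrot"),
]

for stage, pool in stage_curriculum(samples):
    print(stage, pool)
```

With `cumulative=False`, each pool contains one stage only, which is the kind of split that training a separate model per stage would need.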
This research presents the results of applying deep neural networks to the identification of pure pigments in heritage artwork, namely paintings. Our paper applies an innovative three-branch deep learning model to maximise the correct identification of pure pigments. The proposed model combines the feature maps obtained from hyperspectral images through multiple convolutional neural networks with numerical hyperspectral metric data with respect to a set of reference reflectances. The predicted pure pigments are accurate, as confirmed through the use of analytical techniques. The model presented outperformed the compared counterparts and is deemed to be an important direction, not only in terms of utilisation of hyperspectral data and concrete pigment data in heritage analysis, but also in the application of deep learning in other fields.
{"title":"Convolutional Neural Network-Based Pure Paint Pigment Identification Using Hyperspectral Images","authors":"Ailin Chen, R. Jesus, M. Vilarigues","doi":"10.1145/3469877.3495641","DOIUrl":"https://doi.org/10.1145/3469877.3495641","url":null,"abstract":"This research presents the results of the implementation of deep learning neural networks in the identification of pure pigments of heritage artwork, namely paintings. Our paper applies an innovative three-branch deep learning model to maximise the correct identification of pure pigments. The model proposed combines the feature maps obtained from hyperspectral images through multiple convolutional neural networks, and numerical, hyperspectral metric data with respect to a set of reference reflectances. The results obtained exhibit an accurate representation of the pure predicted pigments which are confirmed through the use of analytical techniques. The model presented outperformed the compared counterparts and is deemed to be an important direction, not only in terms of utilisation of hyperspectral data and concrete pigment data in heritage analysis, but also in the application of deep learning in other fields.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128015404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Referring Expression Comprehension (REC) is the task of grounding the object referred to by a language expression. Previous one-stage REC methods usually use a single language feature vector to represent the whole query for grounding, and perform no reasoning between different objects despite the rich relation cues contained in the language expression, which lowers their grounding accuracy. Additionally, these methods mostly use feature pyramid networks for multi-scale visual object feature extraction but ground on different feature layers separately, neglecting the connections between objects at different scales. To address these problems, we propose a novel one-stage REC method, the Entity Relation Fusion Network (ERFN), which locates the referred object by relation-guided reasoning over different objects. In ERFN, instead of grounding objects at each layer separately, we propose a Language Guided Multi-Scale Fusion (LGMSF) model that uses language to guide the fusion of representations of objects at different scales into one feature map. To model connections between different objects, we design a Relation Guided Feature Fusion (RGFF) model that extracts entities in the language expression to enhance the referred entity feature in the visual object feature map, and further extracts relations to guide object feature fusion based on the self-attention mechanism. Experimental results show that our method is competitive with state-of-the-art one-stage and two-stage REC methods while still running inference in real time.
{"title":"Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension","authors":"Hang Yu, Weixin Li, Jiankai Li, Ye Du","doi":"10.1145/3469877.3490592","DOIUrl":"https://doi.org/10.1145/3469877.3490592","url":null,"abstract":"Referring Expression Comprehension (REC) is the task of grounding object which is referred by the language expression. Previous one-stage REC methods usually use one single language feature vector to represent the whole query for grounding and no reasoning between different objects is performed despite the rich relation cues of objects contained in the language expression, which depresses their grounding accuracy. Additionally, these methods mostly use the feature pyramid networks for multi-scale visual object feature extraction but ground on different feature layers separately, neglecting the connections between objects with different scales. To address these problems, we propose a novel one-stage REC method, i.e. the Entity Relation Fusion Network (ERFN) to locate referred object by relation guided reasoning on different objects. In ERFN, instead of grounding objects at each layer separately, we propose a Language Guided Multi-Scale Fusion (LGMSF) model to utilize language to guide the fusion of representations of objects with different scales into one feature map.For modeling connections between different objects, we design a Relation Guided Feature Fusion (RGFF) model that extracts entities in the language expression to enhance the referred entity feature in the visual object feature map, and further extracts relations to guide object feature fusion based on the self-attention mechanism. 
Experimental results show that our method is competitive with the state-of-the-art one-stage and two-stage REC methods, and can also keep inferring in real time.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133439636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jing-Fen Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei
Digital image manipulations have been heavily abused to spread misinformation. Despite the great effort devoted by the research community, prior works are mostly performance-driven, i.e., they optimize performance using standard/heavy networks designed for semantic classification. A thorough understanding of fake image detection models is still missing. This paper studies the essential ingredients of a good fake image detection model by profiling the best-performing architectures. Specifically, we conduct a thorough analysis of a massive number of detection models and observe how performance is affected by different patterns of network structure. Our key findings include: 1) with the same computational budget, flat network structures (e.g., large kernel sizes, wide connections) perform better than commonly used deep networks; 2) operations in shallow layers deserve more computational capacity to balance performance against computational cost. These findings sketch a general profile for essential models of fake image detection, which shows clear differences from that for semantic classification. Furthermore, based on our analysis, we propose a new Depth-Separable Search Space (DSS) for fake image detection. Compared to state-of-the-art methods, our model achieves competitive performance while saving more than 50% of the parameters.
{"title":"Flat and Shallow: Understanding Fake Image Detection Models by Architecture Profiling","authors":"Jing-Fen Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei","doi":"10.1145/3469877.3490566","DOIUrl":"https://doi.org/10.1145/3469877.3490566","url":null,"abstract":"Digital image manipulations have been heavily abused to spread misinformation. Despite the great efforts dedicated in research community, prior works are mostly performance-driven, i.e., optimizing performances using standard/heavy networks designed for semantic classification. A thorough understanding for fake images detection models is still missing. This paper studies the essential ingredients for a good fake image detection model, by profiling the best-performing architectures. Specifically, we conduct a thorough analysis on a massive number of detection models, and observe how the performances are affected by different patterns of network structure. Our key findings include: 1) with the same computational budget, flat network structures (e.g., large kernel sizes, wide connections) perform better than commonly used deep networks; 2) operations in shallow layers deserve more computational capacities to trade-off performance and computational cost. These findings sketch a general profile for essential models of fake image detection, which show clear differences with those for semantic classification. Furthermore, based on our analysis, we propose a new Depth-Separable Search Space (DSS) for fake image detection. 
Compared to state-of-the-art methods, our model achieves competitive performance while saving more than 50% parameters.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134628278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study aims to find a suitable method for generating time-series data, such as video clips or avatar motions, from text stating multiple events. This paper addresses the generation of variable-length time-series data considering the order and variable duration of the events stated in the text. Although a variant of Mean Squared Error (MSE) is a common training loss, it considers only the gap between elements of the ground-truth (GT) data and the generated data at the same time step. Thus, variants of MSE are unsuitable for the task at hand: the loss may be large for generated and GT data with the same order of events if the times of each event do not overlap. To solve the problem, we propose a Dynamic Time Warping-Like method for Variable-Length data (DTWL-VL), which determines the corresponding elements of the GT and the generated data, allowing for the time difference between them, and brings them closer. We compared DTWL-VL, a variant of MSE, and an existing method for time-series data generation that considers the time difference between the corresponding parts of the GT and generated data. Since the existing method is aimed at generating fixed-length data, we extended it to generate variable-length time-series data. We conducted experiments using a dataset prepared for this study. Both DTWL-VL and the existing method outperformed the MSE variant. Moreover, although the existing method outperformed DTWL-VL under certain settings, DTWL-VL required a shorter training period.
{"title":"Generation of Variable-Length Time Series from Text using Dynamic Time Warping-Based Method","authors":"Ayaka Ideno, Yusuke Mukuta, Tatsuya Harada","doi":"10.1145/3469877.3495644","DOIUrl":"https://doi.org/10.1145/3469877.3495644","url":null,"abstract":"This study is aimed at finding a suitable method for generating time-series data such as video clips or avatar motions from text stating multiple events. This paper addresses the generation of variable-length time-series data considering the order and variable duration of events stated in the text. Although the use of the variant of Mean Squared Error (MSE) is a common means of training, only the gap between the element of ground-truth (GT) data and generated data at the same time are considered. Thus, variants of MSE are unsuitable for the task at hand because the loss may not be small for the generated and GT data with the same order of events if the time for each event does not overlap. To solve the problem, we propose a Dynamic Time Warping-Like method for Variable-Length data (DTWL-VL), which determines the corresponding elements of the GT and the generated data, allowing for the time difference between them, and makes them closer. We compared DTWL-VL, a variant of MSE, and an existing method for time-series data generation which considers the time difference between the corresponding part in the GT and generated data. Since the existing method is aimed at generating fixed-length data, we extend the method for generating variable-length time-series data. We conducted experiments using a dataset prepared for this study. Both DTWL-VL and the existing methods outperformed the MSE variant. 
Moreover, although the existing method outperformed DTWL-VL under certain settings, DTWL-VL required a smaller training period.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124849591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
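The motivation in the abstract above, that a same-time-step loss penalizes a correctly ordered but time-shifted sequence, can be seen with classic dynamic time warping. The sketch below is the textbook DTW recurrence on 1-D sequences, not the paper's DTWL-VL loss; the squared-difference cost and the toy sequences are assumptions made for illustration.

```python
from math import inf

def dtw_distance(a, b):
    """Classic DTW cost between two 1-D sequences, using squared difference."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: advance in a, advance in b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def pointwise_sq_error(a, b):
    """Sum of squared differences at identical time steps (MSE-style comparison)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

gt = [0, 0, 1, 1, 0, 0]           # event occurs at steps 2-3
shifted = [0, 1, 1, 0, 0, 0]      # same event, one step earlier
longer = [0, 0, 0, 1, 1, 1, 0]    # same event order, different length

print(pointwise_sq_error(gt, shifted), dtw_distance(gt, shifted))
print(dtw_distance(gt, longer))
```

The same-time-step error is nonzero for the shifted sequence, whereas DTW can align the matching elements across time, and it also accepts sequences of different lengths without padding.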
Federico Becattini, Xuemeng Song, C. Baecchi, S. Fang, C. Ferrari, Liqiang Nie, A. del Bimbo
In this paper, we are interested in understanding how customers perceive fashion recommendations, in particular when observing a proposed combination of garments that compose an outfit. Automatically understanding how a suggested item is perceived, without any kind of active engagement, is an essential building block for interactive applications. We propose a pixel-landmark mutual enhanced framework for implicit preference estimation, named PLM-IPE, which is capable of inferring the user’s implicit preferences by exploiting visual cues, without any active or conscious engagement. PLM-IPE consists of three key modules: a pixel-based estimator, a landmark-based estimator, and mutual learning based optimization. The first two modules capture the implicit reaction of the user at the pixel level and the landmark level, respectively. The last module transfers knowledge between the two parallel estimators. For evaluation, we collected a real-world dataset, named SentiGarment, which contains 3,345 facial reaction videos paired with suggested outfits and human-labeled reaction scores. Extensive experiments show the superiority of our model over state-of-the-art approaches.
{"title":"PLM-IPE: A Pixel-Landmark Mutual Enhanced Framework for Implicit Preference Estimation","authors":"Federico Becattini, Xuemeng Song, C. Baecchi, S. Fang, C. Ferrari, Liqiang Nie, A. del Bimbo","doi":"10.1145/3469877.3490621","DOIUrl":"https://doi.org/10.1145/3469877.3490621","url":null,"abstract":"In this paper, we are interested in understanding how customers perceive fashion recommendations, in particular when observing a proposed combination of garments to compose an outfit. Automatically understanding how a suggested item is perceived, without any kind of active engagement, is in fact an essential block to achieve interactive applications. We propose a pixel-landmark mutual enhanced framework for implicit preference estimation, named PLM-IPE, which is capable of inferring the user’s implicit preferences exploiting visual cues, without any active or conscious engagement. PLM-IPE consists of three key modules: pixel-based estimator, landmark-based estimator and mutual learning based optimization. The former two modules work on capturing the implicit reaction of the user from the pixel level and landmark level, respectively. The last module serves to transfer knowledge between the two parallel estimators. Towards evaluation, we collected a real-world dataset, named SentiGarment, which contains 3,345 facial reaction videos paired with suggested outfits and human labeled reaction scores. 
Extensive experiments show the superiority of our model over state-of-the-art approaches.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123722148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose a two-stage multi-branch semantic learning network (MSLN) that generates an image from a textual description by taking into account global and local textual semantics. The first stage generates a coarse-grained image based on the sentence features. In the second stage, a multi-branch fine-grained generation model is constructed to inject the sentence-level and word-level semantics into two coarse-grained images by global and local attention modules, which generate global and local fine-grained image textures, respectively. In particular, we devise a channel fusion module (CFM) to fuse the global and local fine-grained features in the multi-branch fine-grained stage and generate the output image. Extensive experiments conducted on the CUB-200 and Oxford-102 datasets demonstrate the superior performance of the proposed method (e.g., FID is reduced from 16.09 to 14.43 on CUB-200).
{"title":"Multi-branch Semantic Learning Network for Text-to-Image Synthesis","authors":"Jiading Ling, Xingcai Wu, Zhenguo Yang, Xudong Mao, Qing Li, Wenyin Liu","doi":"10.1145/3469877.3490567","DOIUrl":"https://doi.org/10.1145/3469877.3490567","url":null,"abstract":"In this paper, we propose a multi-branch semantic learning network (MSLN) to generate image according to textual description by taking into account global and local textual semantics, which consists of two stages. The first stage generates a coarse-grained image based on the sentence features. In the second stage, a multi-branch fine-grained generation model is constructed to inject the sentence-level and word-level semantics into two coarse-grained images by global and local attention modules, which generate global and local fine-grained image textures, respectively. In particular, we devise a channel fusion module (CFM) to fuse the global and local fine-grained features in the multi-branch fine-grained stage and generate the output image. Extensive experiments conducted on the CUB-200 dataset and Oxford-102 dataset demonstrate the superior performance of the proposed method. (e.g., FID is reduced from 16.09 to 14.43 on CUB-200).","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"333 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124302043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the emergence of various online platforms, associating different platforms is playing an increasingly important role in many applications. Cross-platform recommendation aims to improve recommendation accuracy by associating information from different platforms. Existing methods do not fully exploit high-order nonlinear connectivity information in cross-domain recommendation scenarios and suffer from the domain-incompatibility problem. In this paper, we propose an end-to-end convolution and fusion model for cross-platform recommendation (CFCR). The proposed CFCR model utilizes Graph Convolution Networks (GCN) to extract user and item features on graphs from different platforms, and fuses cross-platform information through a Multimodal AutoEncoder (MAE) with common latent user features. As a result, the high-order connectivity information is preserved to the greatest extent and domain-invariant user representations are obtained automatically. The domain-incompatible information is discarded automatically to avoid disrupting the cross-platform association. Extensive experiments with the proposed CFCR model on a real-world dataset demonstrate its advantages over existing cross-platform recommendation methods in terms of various evaluation metrics.
{"title":"CFCR: A Convolution and Fusion Model for Cross-platform Recommendation","authors":"Shengze Yu, Xin Wang, Wenwu Zhu","doi":"10.1145/3469877.3495639","DOIUrl":"https://doi.org/10.1145/3469877.3495639","url":null,"abstract":"With the emergence of various online platforms, associating different platforms is playing an increasingly important role in many applications. Cross-platform recommendation aims to improve recommendation accuracy through associating information from different platforms. Existing methods do not fully exploit high-order nonlinear connectivity information in cross-domain recommendation scenario and suffer from domain-incompatibility problem. In this paper, we propose an end-to-end convolution and fusion model for cross-platform recommendation (CFCR). The proposed CFCR model utilizes Graph Convolution Networks (GCN) to extract user and item features on graphs from different platforms, and fuses cross-platform information by Multimodal AutoEncoder (MAE) with common latent user features. Therefore, the high-order connectivity information is preserved to the most extent and domain-invariant user representations are automatically obtained. The domain-incompatible information is spontaneously discarded to avoid messing up the cross-platform association. 
Extensive experiments for the proposed CFCR model on real-world dataset demonstrate its advantages over existing cross-platform recommendation methods in terms of various evaluation metrics.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125183782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhichao Fu, Tianlong Ma, Liang Xue, Yingbin Zheng, Hao Ye, Liang He
We perform fast single image super-resolution with flexible magnification for natural images. A novel coarse-to-fine super-resolution framework is developed for the magnification that is factorized into a maximum integer component and the quotient. Specifically, our framework is embedded with a light-weight upscale network for super-resolution with the integer scale factor, followed by the fine-grained network to guide interpolation on feature maps as well as to generate the super-resolved image. Compared with the previous flexible magnification super-resolution approaches, the proposed framework achieves a tradeoff between computational complexity and performance. We conduct experiments using the coarse-to-fine framework on the standard benchmarks and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
{"title":"A Coarse-to-fine Approach for Fast Super-Resolution with Flexible Magnification","authors":"Zhichao Fu, Tianlong Ma, Liang Xue, Yingbin Zheng, Hao Ye, Liang He","doi":"10.1145/3469877.3490564","DOIUrl":"https://doi.org/10.1145/3469877.3490564","url":null,"abstract":"We perform fast single image super-resolution with flexible magnification for natural images. A novel coarse-to-fine super-resolution framework is developed for the magnification that is factorized into a maximum integer component and the quotient. Specifically, our framework is embedded with a light-weight upscale network for super-resolution with the integer scale factor, followed by the fine-grained network to guide interpolation on feature maps as well as to generate the super-resolved image. Compared with the previous flexible magnification super-resolution approaches, the proposed framework achieves a tradeoff between computational complexity and performance. We conduct experiments using the coarse-to-fine framework on the standard benchmarks and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126495930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
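The factorization of the magnification mentioned in the abstract above can be written out as a small worked example. This is one reading of the phrase "a maximum integer component and the quotient", namely the floor of the scale factor and the scale divided by that floor; the abstract does not give the formula, so the function below is a hypothetical sketch rather than the authors' definition.

```python
import math

def factorize_scale(scale):
    """Split a magnification s >= 1 into (k, r) with k = floor(s) and r = s / k.

    Under this assumed reading, k would be handled by an integer-scale upscaling
    step and the residual r (always in [1, 2)) by a finer interpolation step.
    """
    if scale < 1:
        raise ValueError("scale must be at least 1")
    integer_part = math.floor(scale)
    residual = scale / integer_part
    return integer_part, residual

for s in (2.0, 3.5, 1.7):
    k, r = factorize_scale(s)
    # k * r recovers the requested magnification up to floating-point rounding.
    print(s, k, r, k * r)
```

Because the residual factor stays below 2 under this reading, the fine stage only ever has to handle a small non-integer rescaling regardless of how large the requested magnification is.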