Fabric Defect Detection via Unsupervised Neural Networks
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859266
Kuan-Hsien Liu, Song-Jie Chen, Ching-Hsiang Chiu, Tsung-Jung Liu
Surface defect detection is a necessary process for quality control in industry. Popular neural-network-based defect detection systems usually require a large number of defect samples for training, and annotating and cleaning that data takes considerable manpower. This is a time-consuming process that makes the whole system less effective. In this paper, a deep-neural-network-based model for fabric surface defect detection is proposed that uses only positive (clean) samples for training. Since the proposed model does not require collecting negative (defective) samples for learning, the deployment time of the whole system is greatly reduced. In our experiments, the TensorRT-optimized model runs at 250 FPS on an RTX 3080 with a detection accuracy of 99%, which makes it suitable for production lines with real-time requirements.
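As a rough illustration of the positive-sample-only idea described above (the paper's actual architecture and threshold are not given here), a common approach is to train a small autoencoder on clean fabric patches and flag patches with high reconstruction error at inference time. The sketch below assumes this generic setup.

```python
# Minimal sketch (PyTorch): train a convolutional autoencoder on clean fabric
# patches only, then flag patches whose reconstruction error exceeds a threshold
# as defective. Architecture and threshold are illustrative assumptions.
import torch
import torch.nn as nn

class PatchAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def defect_score(model, patches):
    """Per-patch anomaly score: mean squared reconstruction error."""
    with torch.no_grad():
        recon = model(patches)
    return torch.mean((recon - patches) ** 2, dim=(1, 2, 3))  # one score per patch

# Training uses only clean samples; at test time a patch is flagged as defective
# if defect_score(...) exceeds a threshold calibrated on held-out clean data.
```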
{"title":"Fabric Defect Detection VIA Unsupervised Neural Networks","authors":"Kuan-Hsien Liu, Song-Jie Chen, Ching-Hsiang Chiu, Tsung-Jung Liu","doi":"10.1109/ICMEW56448.2022.9859266","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859266","url":null,"abstract":"Surface defect detection is a necessary process for quality control in the industry. Currently, popular neural network based defect detection systems usually need to use a large number of defect samples for training, and it takes a lot of manpower to make marks and clean the subsequent data. This is a time-consuming process, and it makes the whole system less effective. In this paper, a deep neural network based model for fabric surface defect detection is proposed and it only uses positive clean samples for training. Since the proposed model does not collect negative defective samples for learning, the landing time of whole system is greatly reduced. In the experiment, we use RTX3080 in the TensorRT model with 250 FPS, and the detection accuracy is 99%, which is suitable for production lines with real time requirements.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128584412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-Local Spatiotemporal Correlation Attention for Action Recognition
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859314
Manh-Hung Ha, O. Chen
To perceive human actions well, it is favorable to consider only the useful cues from human and scene context during recognition. The building blocks of typical Deep Neural Networks (DNNs) compute local-neighborhood correlations in the spatial and temporal domains individually. In this work, we develop a DNN consisting of a 3D convolutional neural network, a Non-Local SpatioTemporal Correlation Attention (NSTCA) module, and a classifier to retrieve meaningful semantic context for effective action identification. In particular, the proposed NSTCA module extracts advantageous visual cues from both spatial and temporal features via transposed feature-correlation computations rather than separate spatial and temporal attention computations. In the experiments, a traffic-police dataset was used for analysis and comparison. The results show that the proposed DNN achieves an average accuracy of 98.2%, which is superior to that of conventional DNNs. The DNN proposed herein can therefore be widely applied to discern various actions of subjects in video scenes.
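For context, a generic non-local attention block over flattened spatio-temporal positions looks like the sketch below; the exact NSTCA formulation (transposed feature correlations) is not reproduced here, so this is only an assumed baseline form of the mechanism.

```python
# Minimal sketch (PyTorch) of a generic non-local attention block over all
# spatio-temporal positions of a video feature map; NSTCA itself is assumed
# to differ in the details of its correlation computation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalSpatioTemporal(nn.Module):
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv3d(channels, reduced, 1)
        self.phi = nn.Conv3d(channels, reduced, 1)
        self.g = nn.Conv3d(channels, reduced, 1)
        self.out = nn.Conv3d(reduced, channels, 1)

    def forward(self, x):                                   # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2)                         # (B, C', THW)
        k = self.phi(x).flatten(2)                           # (B, C', THW)
        v = self.g(x).flatten(2)                             # (B, C', THW)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)      # (B, THW, THW) correlations
        y = (v @ attn.transpose(1, 2)).view(b, -1, t, h, w)  # aggregate values per position
        return x + self.out(y)                               # residual connection
```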
{"title":"Non-Local Spatiotemporal Correlation Attention for Action Recognition","authors":"Manh-Hung Ha, O. Chen","doi":"10.1109/ICMEW56448.2022.9859314","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859314","url":null,"abstract":"To well perceive human actions, it may be favorable only to consider useful clues of human and scene context during the recognition process. Deep Neural Networks (DNNs) used to build up blocks associate with local neighborhood correlation computations at spatial and temporal domains individually. In this work, we develop a DNN which consists of a 3D convolutional neural network, Non-Local SpatioTemporal Correlation Attention (NSTCA) module, and classifier to retrieve meaningful semantic context for effective action identification. Particularly, the proposed NSTCA module extracts advantageous visual clues of both spatial and temporal features via transposed feature correlation computations rather than individual spatial and temporal attention computations. In the experiments, the dataset of traffic police was fulfilled for analysis and comparison. The experimental outcome exhibits that the proposed DNN obtains an average accuracy of 98.2% which is superior to those from the conventional DNNs. Therefore, the DNN proposed herein can be widely applied to discern various actions of subjects in video scenes.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121711189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intelligent Warning System Monitoring Vehicle Surrounding and Driver’s Behavior
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859326
Tomoya Sawada, Mitsuki Nakamura
A driving assistance system can warn drivers of danger and plays an important role in avoiding serious accidents. However, few works consider bidirectional interaction between the system and its users. In this paper, we propose a novel system, the Intelligent Warning System (IWS), that warns drivers at an appropriate time and warning level according to the surrounding environment and the driver’s behavior. The contributions of IWS include the following two factors: 1) a lightweight object detection method that sets an appropriate warning level depending on the potential risk of surrounding objects; 2) a time-series learning method for the driver’s facial orientation that sets an appropriate warning timing depending on the driver’s behavior, with user-friendly interaction. Experimental results suggest that subjects want to use IWS in their daily driving and notice the difference in its warning style, which adapts to their behavior, especially for safety confirmation of approaching objects.
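One simple way to combine the two factors above, object risk and recent driver attention, into a warning decision is sketched below. The risk table, thresholds, and gaze encoding are hypothetical illustrations, not the paper's values.

```python
# Illustrative sketch only: combine per-object risk with the driver's recent
# facial-orientation history to pick a warning level. All values are assumed.
from collections import deque

RISK = {"pedestrian": 3, "cyclist": 3, "vehicle": 2, "static_obstacle": 1}  # assumed risk table

def warning_level(detections, gaze_history, window=15):
    """detections: list of (label, distance_m); gaze_history: deque of 'road'/'away'."""
    risk = max((RISK.get(label, 0) / max(dist, 1.0) for label, dist in detections), default=0.0)
    recent = list(gaze_history)[-window:]
    inattentive = recent.count("away") / max(len(recent), 1)
    score = risk * (1.0 + inattentive)   # warn earlier/stronger when the driver looks away
    if score > 1.5:
        return "urgent"
    if score > 0.5:
        return "caution"
    return "none"

gaze = deque(["road"] * 10 + ["away"] * 5, maxlen=30)
print(warning_level([("pedestrian", 4.0), ("vehicle", 20.0)], gaze))  # -> "caution"
```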
{"title":"Intelligentwarning System Monitoring Vehicle Surrounding and Driver’s Behavior","authors":"Tomoya Sawada, Mitsuki Nakamura","doi":"10.1109/ICMEW56448.2022.9859326","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859326","url":null,"abstract":"A driving assistance system can warn drivers to danger and plays an important role in avoiding serious accidents. However, there are few works considering bidirectional interaction between the system and users. In this paper, we propose a novel system named Intelligent Warning System (IWS) that can warn drivers with appropriate timing and warning level according to surrounding environment and drivers’ behavior. A contribution of IWS includes following two factors: 1) A light-weight object detection method for setting an appropriate warning level depends on potential risks of surrounding objects. 2) A time-series learning method of driver’s facial orientation for setting an appropriate warning timing depends on driver’s behavior with user-friendly interaction. Experimental results suggest that subjects want to use IWS for their daily driving and they realize the difference of its warning style that is adapted by their behaviors, especially for safety confirmation of approaching objects.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129845476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FAIVConf: Face Enhancement for AI-Based Video Conference with Low Bit-Rate
Pub Date: 2022-07-08 | DOI: 10.1109/ICMEW56448.2022.9859370
Z. Li, Sheng-fu Lin, Shan Liu, Songnan Li, Xue Lin, Wei Wang, Wei Jiang
Recently, high-quality video conferencing with fewer transmission bits has become a popular and challenging problem. We propose FAIVConf, a video compression framework designed specifically for video conferencing, based on effective neural human-face generation techniques. FAIVConf brings together several designs to improve robustness in real video-conference scenarios: face swapping to avoid artifacts in background animation; facial blurring to decrease the transmission bit-rate while maintaining the quality of the extracted facial landmarks; and dynamic source update for face-view interpolation to accommodate a large range of head poses. Our method achieves a significant bit-rate reduction for video conferencing and gives much better visual quality at the same bit-rate than the H.264 and H.265 coding schemes.
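As a rough illustration of the facial-blurring idea (spend fewer bits on face texture once the landmarks have been extracted from the sharp frame), one could blur the detected face region before conventional encoding. The ROI and sigma below are hypothetical; this is not the FAIVConf pipeline itself.

```python
# Illustrative sketch: blur the face ROI of a frame before it goes to a
# conventional encoder, assuming facial landmarks were already extracted.
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_face_region(frame, roi, sigma=4.0):
    """frame: HxWx3 uint8 array; roi: (top, left, bottom, right) of the detected face."""
    top, left, bottom, right = roi
    out = frame.astype(np.float32)
    out[top:bottom, left:right] = gaussian_filter(
        out[top:bottom, left:right], sigma=(sigma, sigma, 0)  # no blur across color channels
    )
    return out.clip(0, 255).astype(np.uint8)

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
blurred = blur_face_region(frame, roi=(120, 260, 280, 380))
```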
{"title":"FAIVconf: Face Enhancement for AI-Based Video Conference with Low Bit-Rate","authors":"Z. Li, Sheng-fu Lin, Shan Liu, Songnan Li, Xue Lin, Wei Wang, Wei Jiang","doi":"10.1109/ICMEW56448.2022.9859370","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859370","url":null,"abstract":"Recently, high-quality video conferencing with fewer transmission bits becomes a very hot and challenging problem. We propose FAIVConf, a specially designed video compression framework for video conferencing, based on the effective neural human face generation techniques. FAIVConf brings together several designs to improve the system robustness in real video conference scenarios: face swapping to avoid artifacts in background animation; facial blurring to decrease transmission bit-rate and maintain quality of extracted facial landmarks; and dynamic source update for face view interpolation to accommodate a large range of head poses. Our method achieves significant bit-rate reduction in video conference and gives much better visual quality under the same bit-rate compared with H.264 and H.265 coding schemes.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127609594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Package Theft Detection from Smart Home Security Cameras
Pub Date: 2022-05-24 | DOI: 10.1109/ICMEW56448.2022.9859522
Hung-Min Hsu, Xinyu Yuan, Baohua Zhu, Zhongwei Cheng, Lin Chen
Package theft detection has been a challenging task, mainly due to the lack of training data and the wide variety of package theft cases in reality. In this paper, we propose a new Global and Local Fusion Package Theft Detection Embedding (GLF-PTDE) framework that generates a package-theft score for each segment within a video to fulfill real-world requirements for package theft detection. Moreover, we construct a novel Package Theft Detection dataset to facilitate research on this task. Our method achieves 80% AUC on the newly proposed dataset, showing the effectiveness of the proposed GLF-PTDE framework and its robustness to different real scenes.
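The general global-and-local fusion idea, fusing a clip-level embedding with a region-level embedding and scoring each segment, can be sketched as follows. The feature extractors, dimensions, and head are placeholders, not the GLF-PTDE architecture itself.

```python
# Minimal sketch (PyTorch): fuse a global (whole-clip) feature with a local
# (e.g., person/package crop) feature and produce a theft score per segment.
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    def __init__(self, global_dim=512, local_dim=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(global_dim + local_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),      # theft score in [0, 1] per segment
        )

    def forward(self, global_feat, local_feat):
        return self.head(torch.cat([global_feat, local_feat], dim=-1)).squeeze(-1)

scorer = SegmentScorer()
scores = scorer(torch.randn(8, 512), torch.randn(8, 256))  # 8 video segments -> 8 scores
```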
{"title":"Package Theft Detection from Smart Home Security Cameras","authors":"Hung-Min Hsu, Xinyu Yuan, Baohua Zhu, Zhongwei Cheng, Lin Chen","doi":"10.1109/ICMEW56448.2022.9859522","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859522","url":null,"abstract":"Package theft detection has been a challenging task mainly due to lack of training data and a wide variety of package theft cases in reality. In this paper, we propose a new Global and Local Fusion Package Theft Detection Embedding (GLF-PTDE) framework to generate package theft scores for each segment within a video to fulfill the real-world requirements on package theft detection. Moreover, we construct a novel Package Theft Detection dataset to facilitate the research on this task. Our method achieves 80% AUC performance on the newly proposed dataset, showing the effectiveness of the proposed GLF-PTDE framework and its robustness in different real scenes for package theft detection.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131947903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Perceptual Evaluation on Audio-Visual Dataset of 360 Content
Pub Date: 2022-05-16 | DOI: 10.1109/ICMEW56448.2022.9859426
R. F. Fela, Andréas Pastor, P. Callet, N. Zacharov, Toinon Vigier, Søren Forchhammer
To open up new possibilities for assessing the multimodal perceptual quality of omnidirectional media formats, we propose a novel open-source 360 audiovisual (AV) quality dataset. The dataset consists of high-quality 360 video clips in equirectangular projection (ERP) format and higher-order (4th-order) ambisonics, along with subjective scores. Three subjective quality experiments were conducted for audio, video, and AV, with the procedures detailed in this paper. Using the data from the subjective tests, we demonstrate that this dataset can be used to quantify perceived audio, video, and audiovisual quality. The diversity and discriminability of the subjective scores are also analyzed. Finally, we investigate how our dataset correlates with various objective quality metrics for audio and video. The results imply that the proposed dataset can benefit future studies on multimodal quality evaluation of 360 content.
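The correlation analysis mentioned above is typically reported as Pearson (PLCC) and Spearman (SROCC) correlation between subjective mean opinion scores and an objective metric's outputs; a small sketch with placeholder numbers follows.

```python
# Sketch of a subjective-vs-objective correlation analysis. The score values
# below are placeholders, not values from the proposed dataset.
import numpy as np
from scipy.stats import pearsonr, spearmanr

mos = np.array([4.2, 3.1, 2.5, 4.8, 3.9, 1.7])           # subjective mean opinion scores per clip
metric = np.array([38.1, 33.4, 30.2, 41.0, 36.5, 27.8])  # e.g., a PSNR-like objective metric

plcc, _ = pearsonr(metric, mos)    # linear correlation (PLCC)
srocc, _ = spearmanr(metric, mos)  # rank-order correlation (SROCC)
print(f"PLCC={plcc:.3f}, SROCC={srocc:.3f}")
```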
{"title":"Perceptual Evaluation on Audio-Visual Dataset of 360 Content","authors":"R. F. Fela, Andréas Pastor, P. Callet, N. Zacharov, Toinon Vigier, Søren Forchhammer","doi":"10.1109/ICMEW56448.2022.9859426","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859426","url":null,"abstract":"To open up new possibilities to assess the multimodal perceptual quality of omnidirectional media formats, we proposed a novel open source 360 audiovisual (AV) quality dataset. The dataset consists of high-quality 360 video clips in equirectangular (ERP) format and higher-order ambisonic (4th order) along with the subjective scores. Three subjective quality experiments were conducted for audio, video, and AV with the procedures detailed in this paper. Using the data from subjective tests, we demonstrated that this dataset can be used to quantify perceived audio, video, and audiovisual quality. The diversity and discriminability of subjective scores were also analyzed. Finally, we investigated how our dataset correlates with various objective quality metrics of audio and video. Evidence from the results of this study implies that the proposed dataset can benefit future studies on multimodal quality evaluation of 360 content.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128121105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PAMI-AD: An Activity Detector Exploiting Part-Attention and Motion Information in Surveillance Videos
Pub Date: 2022-03-08 | DOI: 10.1109/ICMEW56448.2022.9859481
Yunhao Du, Zhihang Tong, Jun-Jun Wan, Binyu Zhang, Yanyun Zhao
Activity detection in surveillance videos is a challenging task due to small objects, complex activity categories, the untrimmed nature of the videos, etc. Existing methods are generally limited in performance by inaccurate proposals, poor classifiers, or inadequate post-processing. In this work, we propose a comprehensive and effective activity detection system for person-centered and vehicle-centered activities in untrimmed surveillance videos. It consists of four modules: an object localizer, a proposal filter, an activity classifier, and an activity refiner. For person-centered activities, a novel part-attention mechanism is proposed to explore detailed features in different body parts. For vehicle-centered activities, we propose a localization masking method to jointly encode motion and foreground attention features. We conduct experiments on the large-scale activity detection dataset VIRAT and achieve the best results for both groups of activities. Furthermore, our team won 1st place in the TRECVID 2021 ActEV challenge.
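In generic form, a part-attention mechanism learns a weight per body-part feature and pools the weighted parts into a single person descriptor; the sketch below assumes that generic form, with the part definition and dimensions as placeholders rather than the PAMI-AD module itself.

```python
# Minimal sketch (PyTorch) of a generic part-attention pooling layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAttention(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # one attention logit per part

    def forward(self, part_feats):            # (B, num_parts, feat_dim), e.g., head/torso/arms/legs
        weights = F.softmax(self.score(part_feats).squeeze(-1), dim=1)   # (B, num_parts)
        return (weights.unsqueeze(-1) * part_feats).sum(dim=1)           # (B, feat_dim)

attn = PartAttention()
person_feat = attn(torch.randn(4, 6, 256))    # 4 people, 6 parts each -> 4 descriptors
```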
{"title":"PAMI-AD: An Activity Detector Exploiting Part-Attention and Motion Information in Surveillance Videos","authors":"Yunhao Du, Zhihang Tong, Jun-Jun Wan, Binyu Zhang, Yanyun Zhao","doi":"10.1109/ICMEW56448.2022.9859481","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859481","url":null,"abstract":"Activity detection in surveillance videos is a challenging task caused by small objects, complex activity categories, its untrimmed nature, etc. Existing methods are generally limited in performance due to inaccurate proposals, poor classifiers or inadequate post-processing method. In this work, we propose a comprehensive and effective activity detection system in untrimmed surveillance videos for person-centered and vehicle-centered activities. It consists of four modules, i.e., object localizer, proposal filter, activity classifier and activity refiner. For person-centered activities, a novel part-attention mechanism is proposed to explore detailed features in different body parts. As for vehicle-centered activities, we propose a localization masking method to jointly encode motion and foreground attention features. We conduct experiments on the large-scale activity detection datasets VIRAT, and achieve the best results for both groups of activities. Furthermore, our team won the 1st place in the TRECVID 2021 ActEV challenge.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132774801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised Severely Deformed Mesh Reconstruction (DMR) From A Single-View Image for Longline Fishing
Pub Date: 2022-01-23 | DOI: 10.1109/ICMEW56448.2022.9859312
J. Mei, Jingxiang Yu, S. Romain, Craig S. Rose, Kelsey Magrane, Graeme LeeSon, Jenq-Neng Hwang
Much progress has been made in the supervised learning of 3D reconstruction of rigid objects from multi-view images or video. However, it is more challenging to reconstruct severely deformed objects from a single-view RGB image in an unsupervised manner. Training-based methods, such as category-specific training, have been shown to successfully reconstruct rigid objects and slightly deformed objects like birds from a single-view image. However, they cannot effectively handle severely deformed objects, nor can they be applied to some real-world downstream tasks, because of the inconsistent semantic meaning of vertices, which is crucial for defining the adopted 3D templates of the objects to be reconstructed. In this work, we introduce a template-based method to infer 3D shapes from a single-view image and apply the reconstructed mesh to a downstream task, i.e., absolute length measurement. Without using 3D ground truth, our method faithfully reconstructs 3D meshes and achieves state-of-the-art accuracy in a length-measurement task on a severely deformed fish dataset.
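Because the template gives vertices a consistent semantic meaning, the downstream length measurement can be as simple as summing distances along a fixed snout-to-tail chain of vertex indices and scaling to absolute units; the indices and scale factor below are hypothetical.

```python
# Sketch of an absolute length measurement on a reconstructed mesh, assuming a
# fixed, semantically consistent chain of vertex indices along the fish midline.
import numpy as np

def chain_length(vertices, chain_idx, scale=1.0):
    """vertices: (V, 3) reconstructed mesh; chain_idx: ordered vertex ids from snout to tail."""
    pts = vertices[chain_idx]                            # (K, 3) midline points
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # per-segment lengths
    return scale * seg.sum()                             # scale converts mesh units to, e.g., meters

verts = np.random.rand(500, 3)
length_m = chain_length(verts, chain_idx=[0, 42, 87, 133, 210, 499], scale=0.37)
```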
{"title":"Unsupervised Severely Deformed Mesh Reconstruction (DMR) From A Single-View Image for Longline Fishing","authors":"J. Mei, Jingxiang Yu, S. Romain, Craig S. Rose, Kelsey Magrane, Graeme LeeSon, Jenq-Neng Hwang","doi":"10.1109/ICMEW56448.2022.9859312","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859312","url":null,"abstract":"Much progress has been made in the supervised learning of 3D reconstruction of rigid objects from multi-view images or a video. However, it is more challenging to reconstruct severely deformed objects from a single-view RGB image in an unsupervised manner. Training-based methods, such as specific category-level training, have been shown to successfully reconstruct rigid objects and slightly deformed objects like birds from a single-view image. However, they cannot effectively handle severely deformed objects and neither can be applied to some downstream tasks in the real world due to the inconsistent semantic meaning of vertices, which are crucial in defining the adopted 3D templates of objects to be reconstructed. In this work, we introduce a template-based method to infer 3D shapes from a single-view image and apply the reconstructed mesh to a downstream task, i.e., absolute length measurement. Without using 3D ground truth, our method faithfully reconstructs 3D meshes and achieves state-of-the-art accuracy in a length measurement task on a severely deformed fish dataset.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130505732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual-Neighborhood Deep Fusion Network for Point Cloud Analysis
Pub Date: 2021-08-20 | DOI: 10.1109/ICMEW56448.2022.9859382
Guoquan Xu, Hezhi Cao, Yifan Zhang, Jianwei Wan, Ke Xu, Yanxin Ma
Recently, deep neural networks have made remarkable achievements in 3D point cloud analysis. However, current shape descriptors are inadequate for capturing information thoroughly. To handle this problem, a feature representation learning method named Dual-Neighborhood Deep Fusion Network (DNDFN) is proposed to serve as an improved point cloud encoder for point cloud analysis. Specifically, the traditional local neighborhood ignores long-distance dependencies, and DNDFN utilizes an adaptive key-neighborhood replenishment mechanism to overcome this limitation. Furthermore, because the transmission of information between points depends on the unique latent relationship between them, a convolution for capturing this relationship is proposed. Extensive experiments on existing benchmarks, especially non-idealized datasets, verify the effectiveness of DNDFN, and DNDFN achieves state-of-the-art results.
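In spirit, the dual neighborhood pairs a conventional local k-NN neighborhood in coordinate space with a set of long-range "key" neighbors chosen by feature similarity. The selection rule below is an assumption for illustration, not the exact DNDFN mechanism.

```python
# Minimal sketch: for each point, return (a) its local k-NN in coordinate space
# and (b) long-range neighbors picked by feature similarity outside that radius.
import numpy as np

def dual_neighborhood(xyz, feats, k_local=16, k_key=8):
    """xyz: (N, 3) coordinates; feats: (N, C) point features."""
    d_xyz = np.linalg.norm(xyz[:, None] - xyz[None, :], axis=-1)    # (N, N) spatial distances
    local_idx = np.argsort(d_xyz, axis=1)[:, 1:k_local + 1]          # nearest neighbors, excl. self

    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                                                     # (N, N) cosine similarity
    local_radius = np.sort(d_xyz, axis=1)[:, [k_local]]               # per-point local radius
    sim[d_xyz <= local_radius] = -np.inf                              # keep only long-range candidates
    key_idx = np.argsort(-sim, axis=1)[:, :k_key]                     # most similar distant points
    return local_idx, key_idx

xyz = np.random.rand(1024, 3)
feats = np.random.rand(1024, 32)
local_idx, key_idx = dual_neighborhood(xyz, feats)
```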
{"title":"Dual-Neighborhood Deep Fusion Network for Point Cloud Analysis","authors":"Guoquan Xu, Hezhi Cao, Yifan Zhang, Jianwei Wan, Ke Xu, Yanxin Ma","doi":"10.1109/ICMEW56448.2022.9859382","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859382","url":null,"abstract":"Recently, deep neural networks have made remarkable achievements in 3D point cloud analysis. However, the current shape descriptors are inadequate for capturing the information thoroughly. To handle this problem, a feature representation learning method, named Dual-Neighborhood Deep Fusion Network (DNDFN), is proposed to serve as an improved point cloud encoder for the task of point cloud analysis. Specifically, the traditional local neighborhood ignores the long-distance dependency and DNDFN utilizes an adaptive key neighborhood replenishment mechanism to overcome the limitation. Furthermore, the transmission of information between points depends on the unique potential relationship between them, so a convolution for capturing the relationship is proposed. Extensive experiments on existing benchmarks especially non-idealized datasets verify the effectiveness of DNDFN and DNDFN achieves the state of the arts.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125286570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Deep Drift-Diffusion Model for Image Aesthetic Score Distribution Prediction
Pub Date: 2020-08-30 | DOI: 10.1109/ICMEW56448.2022.9859450
Xin Jin, Xiqiao Li, Heng Huang, Xiaodong Li, Xinghui Zhou
The task of aesthetic quality assessment is complicated by its subjectivity. In recent years, the target representation of image aesthetic quality has changed from a one-dimensional binary classification label or numerical score to a multi-dimensional score distribution. Current methods straightforwardly regress the ground-truth score distributions. However, the subjectivity of aesthetics is not taken into account; that is to say, human psychological processes are not considered, which limits performance on the task. In this paper, we propose a Deep Drift-Diffusion (DDD) model, inspired by psychology, to predict aesthetic score distributions from images. The DDD model describes the psychological process of aesthetic perception instead of traditionally modelling only the results of the assessment. We use deep convolutional neural networks to regress the parameters of the drift-diffusion model. Experimental results on large-scale aesthetic image datasets reveal that our novel DDD model is simple but efficient and outperforms state-of-the-art methods in aesthetic score-distribution prediction. Besides, different psychological processes can also be predicted by our model. Our work applies the drift-diffusion psychological model to score-distribution prediction for visual aesthetics and has the potential to inspire more attention to modelling the psychological process of aesthetic perception.
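For readers unfamiliar with the drift-diffusion model, the sketch below simulates the underlying evidence-accumulation process whose parameters (drift rate, boundary, noise) a CNN could regress per image. How the paper maps these parameters onto a multi-bin score distribution is not reproduced here; this only illustrates the stochastic process itself.

```python
# Sketch of a two-boundary drift-diffusion simulation: evidence accumulates with
# a drift plus Gaussian noise until it hits +boundary or -boundary.
import numpy as np

def upper_choice_prob(drift, boundary, noise=1.0, dt=0.01, n_trials=2000, max_steps=10000, seed=0):
    """Fraction of simulated trials whose evidence hits +boundary before -boundary."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_trials)
    done = np.zeros(n_trials, dtype=bool)
    hit_upper = np.zeros(n_trials, dtype=bool)
    for _ in range(max_steps):
        active = ~done
        x[active] += drift * dt + noise * np.sqrt(dt) * rng.standard_normal(active.sum())
        hit_upper |= active & (x >= boundary)
        done |= np.abs(x) >= boundary
        if done.all():
            break
    return hit_upper.mean()

print(upper_choice_prob(drift=0.8, boundary=1.0))   # > 0.5 because the drift is positive
```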
{"title":"A Deep Drift-Diffusion Model for Image Aesthetic Score Distribution Prediction","authors":"Xin Jin, Xiqiao Li, Heng Huang, Xiaodong Li, Xinghui Zhou","doi":"10.1109/ICMEW56448.2022.9859450","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859450","url":null,"abstract":"The task of aesthetic quality assessment is complicated due to its subjectivity. In recent years, the target representation of image aesthetic quality has changed from a one-dimensional binary classification label or numerical score to a multi-dimensional score distribution. According to current methods, the ground truth score distributions are straightforwardly regressed. However, the subjectivity of aesthetics is not taken into account, that is to say, the psychological processes of human beings are not taken into consideration, which limits the performance of the task. In this paper, we propose a Deep Drift-Diffusion (DDD) model inspired by psychologists to predict aesthetic score distribution from images. The DDD model can describe the psychological process of aesthetic perception instead of traditional modelling of the results of assessment. We use deep convolution neural networks to regress the parameters of the drift-diffusion model. The experimental results in large scale aesthetic image datasets reveal that our novel DDD model is simple but efficient, which outperforms the state-of-the-art methods in aesthetic score distribution prediction. Besides, different psychological processes can also be predicted by our model. Our work applies drift-diffusion psychological model into score distribution prediction of visual aesthetics, and has the potential of inspiring more attentions to model the psychology process of aesthetic perception.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123971935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}