MMSP 2020 TOC
Pub Date: 2020-09-21 | DOI: 10.1109/mmsp48831.2020.9287087
Motion JPEG Decoding via Iterative Thresholding and Motion-Compensated Deflickering
Pub Date: 2020-09-21 | DOI: 10.1109/MMSP48831.2020.9287147
E. Belyaev, Linlin Bie, J. Korhonen
This paper studies the problem of decoding video sequences compressed by Motion JPEG (M-JPEG) at the best possible perceived video quality. We cast M-JPEG decoding as signal recovery from incomplete measurements, as known from compressive sensing. We take all quantized nonzero Discrete Cosine Transform (DCT) coefficients as measurements and treat the remaining zero coefficients as data to be recovered. The output video is reconstructed via an iterative thresholding algorithm in which Video Block Matching and 4-D filtering (VBM4D) serves as the thresholding operator. To reduce non-linearities in the measurements caused by JPEG quantization, we propose applying spatio-temporal pre-filtering before measurement calculation and recovery. Since temporal inconsistencies of the residual coding artifacts lead to strong flickering in the recovered video, we also propose applying a motion-compensated deflickering filter as a post-filter. Experimental results show that the proposed approach provides a 0.44–0.51 dB average improvement in Peak Signal-to-Noise Ratio (PSNR), as well as a lower flickering level, compared to the state-of-the-art method based on Coefficient Graph Laplacians (COGL). We have also conducted a subjective comparison study, indicating that the proposed approach outperforms state-of-the-art methods in terms of subjective video quality.
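As an illustration of the recovery loop described above, the following Python sketch keeps the nonzero blockwise DCT coefficients fixed as measurements and re-estimates the zeroed ones by alternating a denoising step with a projection back onto the measurements. A plain Gaussian filter stands in for VBM4D (which is not available as a standard package), and the block size, iteration count, and filter strength are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.fft import dctn, idctn
from scipy.ndimage import gaussian_filter

def blockwise_dct(img, b=8):
    """8x8 blockwise DCT; returns an array of shape (H/b, W/b, b, b)."""
    h, w = img.shape
    blocks = img.reshape(h // b, b, w // b, b).swapaxes(1, 2)
    return dctn(blocks, axes=(-2, -1), norm="ortho")

def blockwise_idct(coeffs, b=8):
    blocks = idctn(coeffs, axes=(-2, -1), norm="ortho")
    h, w = blocks.shape[0] * b, blocks.shape[1] * b
    return blocks.swapaxes(1, 2).reshape(h, w)

def recover_frame(decoded, n_iter=20, sigma=1.0):
    """decoded: conventionally decoded frame (H, W), H and W multiples of 8."""
    measured = blockwise_dct(decoded)
    mask = np.abs(measured) > 1e-6      # nonzero coefficients act as measurements
    x = decoded.astype(float).copy()
    for _ in range(n_iter):
        x = gaussian_filter(x, sigma)             # denoising step (VBM4D in the paper)
        coeffs = blockwise_dct(x)
        coeffs[mask] = measured[mask]             # project back onto the measurements
        x = blockwise_idct(coeffs)
    return x

recovered = recover_frame(np.random.default_rng(0).random((64, 64)))
```

In a real decoder the mask and measured values would come from the quantized bitstream coefficients rather than from re-transforming the decoded pixels, but the alternation of denoising and measurement projection is the same.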
{"title":"Motion JPEG Decoding via Iterative Thresholding and Motion-Compensated Deflickering","authors":"E. Belyaev, Linlin Bie, J. Korhonen","doi":"10.1109/MMSP48831.2020.9287147","DOIUrl":"https://doi.org/10.1109/MMSP48831.2020.9287147","url":null,"abstract":"This paper studies the problem of decoding video sequences compressed by Motion JPEG (M-JPEG) at the best possible perceived video quality. We consider decoding of M-JPEG video as signal recovery from incomplete measurements known in compressive sensing. We take all quantized nonzero Discrete Cosine Transform (DCT) coefficients as measurements and the remaining zero coefficients as data that should be recovered. The output video is reconstructed via iterative thresholding algorithm, where Video Block Matching and 4-D filtering (VBM4D) is used as thresholding operator. To reduce non-linearities in the measurements caused by the quantization in JPEG, we propose to apply spatio-temporal pre-filtering before measurements calculation and recovery. Since temporal inconsistencies of the residual coding artifacts lead to strong flickering in recovered video, we also propose to apply motion-compensated deflickering filter as a post-filter. Experimental results show that the proposed approach provides 0.44–0.51 dB average improvement in Peak Signal to Noise Ratio (PSNR), as well as lower flickering level compared to the state-of-the-art method based on Coefficient Graph Laplacians (COGL). We have also conducted a subjective comparison study, indicating that the proposed approach outperforms state-of-the-art methods in terms of subjective video quality.","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130513961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video Coding for Machines with Feature-Based Rate-Distortion Optimization
Pub Date: 2020-09-21 | DOI: 10.1109/MMSP48831.2020.9287136
Kristian Fischer, Fabian Brand, Christian Herglotz, A. Kaup
Common state-of-the-art video codecs are optimized to deliver a low bitrate at a certain quality for the final human observer, which is achieved by rate-distortion optimization (RDO). However, with the steady improvement of neural networks solving computer vision tasks, more and more multimedia data is no longer observed by humans but analyzed directly by neural networks. In this paper, we propose a standard-compliant feature-based RDO (FRDO) designed to increase coding performance when the decoded frame is analyzed by a neural network in a video coding for machines scenario. To that end, we replace the pixel-based distortion metrics in the conventional RDO of VTM-8.0 with distortion metrics calculated in the feature space created by the first layers of a neural network. In several tests with the segmentation network Mask R-CNN and single images from the Cityscapes dataset, we compare the proposed FRDO and its hybrid version HFRDO, using different distortion measures in the feature space, against the conventional RDO. With HFRDO, up to 5.49% bitrate can be saved compared to the VTM-8.0 implementation in terms of Bjøntegaard Delta Rate, using the weighted average precision as quality metric. Additionally, allowing the encoder to vary the quantization parameter results in coding gains for the proposed HFRDO of up to 9.95% compared to conventional VTM.
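The core change to RDO, replacing pixel distortion with a distortion computed in the feature space of a network's first layers, can be sketched as follows. The tiny randomly initialized convolution stack below only stands in for the Mask R-CNN stem used in the paper, and the candidate rates and lambda are placeholder values an encoder would supply.

```python
import torch
import torch.nn as nn

feature_stem = nn.Sequential(            # stand-in for the first layers of a detector
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
).eval()

def feature_distortion(orig, recon):
    """MSE between feature maps of original and reconstructed blocks (N, 3, H, W)."""
    with torch.no_grad():
        return torch.mean((feature_stem(orig) - feature_stem(recon)) ** 2).item()

def rd_cost(orig, recon, rate_bits, lam):
    """Feature-based RD cost J = D_feat + lambda * R for one candidate mode."""
    return feature_distortion(orig, recon) + lam * rate_bits

# Example: pick the candidate reconstruction with the lowest feature-based cost.
orig = torch.rand(1, 3, 64, 64)
candidates = [(orig + 0.05 * torch.randn_like(orig), 900),   # (reconstruction, rate in bits)
              (orig + 0.10 * torch.randn_like(orig), 400)]
best = min(candidates, key=lambda c: rd_cost(orig, c[0], c[1], lam=1e-4))
```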
{"title":"Video Coding for Machines with Feature-Based Rate-Distortion Optimization","authors":"Kristian Fischer, Fabian Brand, Christian Herglotz, A. Kaup","doi":"10.1109/MMSP48831.2020.9287136","DOIUrl":"https://doi.org/10.1109/MMSP48831.2020.9287136","url":null,"abstract":"Common state-of-the-art video codecs are optimized to deliver a low bitrate by providing a certain quality for the final human observer, which is achieved by rate-distortion optimization (RDO). But, with the steady improvement of neural networks solving computer vision tasks, more and more multimedia data is not observed by humans anymore, but directly analyzed by neural networks. In this paper, we propose a standard-compliant feature-based RDO (FRDO) that is designed to increase the coding performance, when the decoded frame is analyzed by a neural network in a video coding for machine scenario. To that extent, we replace the pixel-based distortion metrics in conventional RDO of VTM-8.0 with distortion metrics calculated in the feature space created by the first layers of a neural network. Throughout several tests with the segmentation network Mask R-CNN and single images from the Cityscapes dataset, we compare the proposed FRDO and its hybrid version HFRDO with different distortion measures in the feature space against the conventional RDO. With HFRDO, up to 5.49% bitrate can be saved compared to the VTM-8.0 implementation in terms of Bjøntegaard Delta Rate and using the weighted average precision as quality metric. Additionally, allowing the encoder to vary the quantization parameter results in coding gains for the proposed HFRDO of up 9.95% compared to conventional VTM.","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130530238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Merging of MOS of Large Image Databases for No-reference Image Visual Quality Assessment
Pub Date: 2020-09-21 | DOI: 10.1109/MMSP48831.2020.9287141
Aki Kaipio, Mykola Ponomarenko, K. Egiazarian
Large specialized image databases are used for training no-reference image visual quality metrics. For the images in these databases, mean opinion scores (MOS) are obtained experimentally by collecting the judgments of many observers. The MOS of a given image reflects the averaged human perception of its visual quality. Each database has its own unknown MOS scale, which depends on the unique content of the database. For training no-reference metrics based on convolutional networks, usually only one selected database is used, because all MOS values at the input of the training loss function should be on the same scale. In this paper, a simple and effective method is proposed for merging several large databases into one database, transforming their MOS onto a common scale. The accuracy of the proposed method is analyzed. The merged MOS is used for practical training of a no-reference metric, and a comparative analysis shows the improved effectiveness of this training.
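The abstract does not spell out the merging procedure, so the snippet below only illustrates the general idea of bringing two MOS scales onto one: assuming a small set of anchor images rated in both databases, a linear mapping is fitted by least squares and applied to one database's scores. The anchor values are hypothetical and the actual method in the paper may differ.

```python
import numpy as np

def fit_scale_map(mos_a_anchor, mos_b_anchor):
    """Fit MOS_B ~ a * MOS_A + b on anchor images rated in both databases."""
    a, b = np.polyfit(mos_a_anchor, mos_b_anchor, deg=1)
    return lambda mos: a * np.asarray(mos) + b

# Anchor ratings (hypothetical values for illustration only).
mos_a_anchor = np.array([2.1, 3.4, 4.0, 5.2, 6.8])       # scale of database A
mos_b_anchor = np.array([30.0, 45.0, 52.0, 66.0, 85.0])  # scale of database B

to_b_scale = fit_scale_map(mos_a_anchor, mos_b_anchor)
merged_mos_a = to_b_scale([2.5, 4.7, 6.1])   # database A MOS mapped onto B's scale
```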
{"title":"Merging of MOS of Large Image Databases for No-reference Image Visual Quality Assessment","authors":"Aki Kaipio, Mykola Ponomarenko, K. Egiazarian","doi":"10.1109/MMSP48831.2020.9287141","DOIUrl":"https://doi.org/10.1109/MMSP48831.2020.9287141","url":null,"abstract":"For training of no-reference image visual quality metrics large specialized image databases are used. For images of the databases mean opinion scores (MOS) are experimentally obtained collecting judgments of many observers. MOS of a given image reflects an averaged human perception of visual quality of the image. Each database has its own unknown scale of MOS values depending on unique content of the database. For training of no-reference metrics based on convolutional networks usually only one selected database is used, because all MOS values on input of training loss function should be in the same scale. In this paper, a simple and effective method of merging of several large databases into one database with transforming of their MOS into one scale is proposed. Accuracy of the proposed method is analyzed. Merged MOS is used for practical training of no-reference metric. Better effectiveness of the training is shown in comparative analysis.","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132441975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audio-Fingerprinting via Dictionary Learning
Pub Date: 2020-09-21 | DOI: 10.1109/MMSP48831.2020.9287073
Christina Saravanos, D. Ampeliotis, K. Berberidis
In recent years, several successful schemes have been proposed to solve the song identification problem. These techniques construct a signal's audio fingerprint either by employing conventional signal processing techniques or by computing its sparse representation in the time-frequency domain. This paper proposes a new audio-fingerprinting scheme which constructs a unique and concise representation of an audio signal by applying a dictionary, learnt here via the well-known K-SVD algorithm applied to a song database. The promising experimental results suggest that the proposed approach not only performed rather well in identifying the signal content of several audio clips, even in cases where this content had been distorted by noise, but also surpassed the recognition rate of a Shazam-based paradigm.
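A rough sketch of the fingerprinting pipeline implied above: learn a dictionary over time-frequency patches of the reference songs, sparse-code a query clip over it, and match by comparing sparse codes. scikit-learn's DictionaryLearning is used here as a stand-in for K-SVD (which scikit-learn does not implement), and the random patches, sparsity level, and dictionary size are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
ref_patches = rng.standard_normal((200, 64))   # stand-in for time-frequency patches

dl = DictionaryLearning(n_components=32, transform_algorithm="omp",
                        transform_n_nonzero_coefs=5, random_state=0)
codes_ref = dl.fit_transform(ref_patches)      # sparse codes act as fingerprints

def fingerprint(patch, dictionary):
    return sparse_encode(patch.reshape(1, -1), dictionary,
                         algorithm="omp", n_nonzero_coefs=5)[0]

query = ref_patches[17] + 0.1 * rng.standard_normal(64)   # noisy query clip
code_q = fingerprint(query, dl.components_)
best_match = int(np.argmin(np.linalg.norm(codes_ref - code_q, axis=1)))
```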
{"title":"Audio-Fingerprinting via Dictionary Learning","authors":"Christina Saravanos, D. Ampeliotis, K. Berberidis","doi":"10.1109/MMSP48831.2020.9287073","DOIUrl":"https://doi.org/10.1109/MMSP48831.2020.9287073","url":null,"abstract":"In recent years, several successful schemes have been proposed to solve the song identification problem. These techniques aim to construct a signal’s audio-fingerprint by either employing conventional signal processing techniques or by computing its sparse representation in the time-frequency domain. This paper proposes a new audio-fingerprinting scheme which is able to construct a unique and concise representation of an audio signal by applying a dictionary, which is learnt here via the well-known K-SVD algorithm applied on a song database. The promising results which emerged while conducting the experiments suggested that, not only the proposed approach preformed rather well in its attempt to identify the signal content of several audio clips –even in cases this content had been distorted by noise - but also surpassed the recognition rate of a Shazam-based paradigm.","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115435908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Low Complexity Long Short-Term Memory Based Voice Activity Detection
Pub Date: 2020-09-21 | DOI: 10.1109/MMSP48831.2020.9287142
Ruiting Yang, Jie Liu, Xiang Deng, Zhuochao Zheng
Voice Activity Detection (VAD) plays an important role in audio processing, but it remains a common challenge when a voice signal is corrupted by strong and transient noise. In this paper, an accurate and causal VAD module using a long short-term memory (LSTM) deep neural network is proposed. A set of features including Gammatone cepstral coefficients (GTCC) and selected spectral features is used. The low-complexity structure allows it to be easily implemented in speech processing algorithms and applications. After carefully pre-processing and labeling the collected training data as speech or non-speech and training the LSTM network, experiments show that the proposed VAD is able to distinguish speech from different types of noisy backgrounds effectively. Its robustness against changes, including varying frame length, moving speech sources, and speaking in different languages, is further investigated.
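A minimal causal LSTM VAD in the spirit of the description above: per-frame feature vectors (GTCC plus spectral features in the paper; their extraction is omitted here) pass through a unidirectional LSTM and a sigmoid head that outputs a per-frame speech probability. The 32-dimensional feature vector and the layer sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LstmVad(nn.Module):
    def __init__(self, n_feats=32, hidden=48):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)  # causal: no lookahead
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats):                 # feats: (batch, frames, n_feats)
        out, _ = self.lstm(feats)
        return torch.sigmoid(self.head(out)).squeeze(-1)   # (batch, frames)

vad = LstmVad()
frames = torch.randn(1, 100, 32)              # 100 frames of dummy features
speech_prob = vad(frames)                     # threshold e.g. at 0.5 for a decision
```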
{"title":"A Low Complexity Long Short-Term Memory Based Voice Activity Detection","authors":"Ruiting Yang, Jie Liu, Xiang Deng, Zhuochao Zheng","doi":"10.1109/MMSP48831.2020.9287142","DOIUrl":"https://doi.org/10.1109/MMSP48831.2020.9287142","url":null,"abstract":"Voice Activity Detection (VAD) plays an important role in audio processing, but it is also a common challenge when a voice signal is corrupted with strong and transient noise. In this paper, an accurate and causal VAD module using a long short-term memory (LSTM) deep neural network is proposed. A set of features including Gammatone cepstral coefficients (GTCC) and selected spectral features are used. The low complex structure allows it can be easily implemented in speech processing algorithms and applications. With carefully pre-processing and labeling the collected training data in the classes of speech or non-speech and training on the LSTM net, experiments show the proposed VAD is able to distinguish speech from different types of noisy background effectively. Its robustness against changes including varying frame length, moving speech sources and speaking in different languages, are further investigated.","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123968631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Verification of Blur and Sharpness Metrics for No-reference Image Visual Quality Assessment
Pub Date: 2020-09-21 | DOI: 10.1109/MMSP48831.2020.9287110
Sheyda Ghanbaralizadeh Bahnemiri, Mykola Ponomarenko, K. Egiazarian
Natural images may contain regions with different levels of blur affecting image visual quality. No-reference image visual quality metrics should be able to effectively evaluate both blur and sharpness levels in a given image. In this paper, we propose a large image database, BlurSet, to verify this ability. BlurSet contains 5000 grayscale images of size 128×128 pixels with different levels of Gaussian blur and unsharp masking. For each image, a scalar value indicating its level of blur or sharpness is provided. Several image quality assessment criteria are presented to evaluate how well a given metric can estimate the level of blur/sharpness on BlurSet. An extensive comparative analysis of different no-reference metrics is carried out. Reachable levels of the quality criteria are evaluated using the proposed blur/sharpness convolutional neural network (BSCNN).
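To make the construction of such a database concrete, the sketch below generates blur/sharpness variants of a grayscale patch: negative levels apply Gaussian blur and positive levels apply unsharp masking. The exact parameterization used to build BlurSet is not given in the abstract, so this mapping from the scalar level to filter settings is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_sharp_variant(patch, level):
    """patch: 2-D grayscale array in [0, 1]; level < 0 blurs, level > 0 sharpens."""
    if level < 0:
        return gaussian_filter(patch, sigma=-level)
    if level > 0:
        low = gaussian_filter(patch, sigma=1.0)
        return np.clip(patch + level * (patch - low), 0.0, 1.0)  # unsharp mask
    return patch

patch = np.random.default_rng(0).random((128, 128))
samples = {lvl: blur_sharp_variant(patch, lvl) for lvl in (-2.0, -0.5, 0.0, 0.5, 2.0)}
```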
{"title":"On Verification of Blur and Sharpness Metrics for No-reference Image Visual Quality Assessment","authors":"Sheyda Ghanbaralizadeh Bahnemiri, Mykola Ponomarenko, K. Egiazarian","doi":"10.1109/MMSP48831.2020.9287110","DOIUrl":"https://doi.org/10.1109/MMSP48831.2020.9287110","url":null,"abstract":"Natural images may contain regions with different levels of blur affecting image visual quality. No-reference image visual quality metrics should be able to effectively evaluate both blur and sharpness levels on a given image. In this paper, we propose a large image database BlurSet to verify this ability. BlurSet contains 5000 grayscale images of size 128×128 pixels with different levels of Gaussian blur and unsharp mask. For each image, a scalar value indicating the level of blur and the level of sharpness is provided. Several image quality assessment criteria are presented to evaluate how a given metric can estimate the level of blur/sharpness on BlurSet. An extensive comparative analysis of different no-reference metrics is carried out. Reachable levels of the quality criteria are evaluated using the proposed blur/sharpness convolutional neural network (BSCNN).","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124937997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eye Movement State Trajectory Estimator based on Ancestor Sampling
Pub Date: 2020-09-21 | DOI: 10.1109/MMSP48831.2020.9287155
S. Malladi, J. Mukhopadhyay, M. Larabi, S. Chaudhury
Human gaze dynamics mainly concern the sequence of occurrence of three eye movements: fixations, saccades, and microsaccades. In this paper, we relate these three states to the velocities of eye movements. We build a state trajectory estimator based on ancestor sampling (STEAS) model, which captures the features of the human temporal gaze pattern to identify the kind of visual stimuli. We used a gaze dataset of 72 viewers watching 60 video clips, equally split into four visual categories. Uniformly sampled velocity vectors from the training set are used to find the best suitable parameters of the proposed statistical model. The optimized model is then used for both gaze data classification and video retrieval on the test set. We observed a classification accuracy of 93.265% and a mean reciprocal rank of 0.888 for video retrieval on the test set. Hence, this model can be used for viewer-independent video indexing, providing viewers an easier way to navigate through content.
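The paper's ancestor-sampling trajectory estimator is not reproduced here; the snippet below only illustrates the preprocessing intuition that velocities separate the eye-movement states, using a simple velocity threshold to label samples as fixation or saccade. The sampling rate, threshold, and synthetic gaze trace are assumptions for illustration.

```python
import numpy as np

def gaze_velocity(x_deg, y_deg, fs=250.0):
    """Angular speed (deg/s) from gaze positions in degrees sampled at fs Hz."""
    vx, vy = np.gradient(x_deg) * fs, np.gradient(y_deg) * fs
    return np.hypot(vx, vy)

def label_states(speed, saccade_thresh=30.0):
    """0 = fixation (including the microsaccade range), 1 = saccade."""
    return (speed > saccade_thresh).astype(int)

t = np.linspace(0, 1, 250)
x = np.where(t < 0.5, 1.0, 6.0) + 0.05 * np.random.default_rng(0).standard_normal(250)
y = np.zeros_like(x)
states = label_states(gaze_velocity(x, y))   # per-sample fixation/saccade labels
```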
{"title":"Eye Movement State Trajectory Estimator based on Ancestor Sampling","authors":"S. Malladi, J. Mukhopadhyay, M. Larabi, S. Chaudhury","doi":"10.1109/MMSP48831.2020.9287155","DOIUrl":"https://doi.org/10.1109/MMSP48831.2020.9287155","url":null,"abstract":"Human gaze dynamics mainly concern about the sequence of the occurrence of three eye movements: fixations, saccades, and microsaccades. In this paper, we correlate them as three different states to velocities of eye movements. We build a state trajectory estimator based on ancestor sampling (ST EAS) model, which captures the features of the human temporal gaze pattern to identify the kind of visual stimuli. We used a gaze dataset of 72 viewers watching 60 video clips which are equally split into four visual categories. Uniformly sampled velocity vectors from the training set, are used to find the best suitable parameters of the proposed statistical model. Then, the optimized model is used for both gaze data classification and video retrieval on the test set. We observed 93.265% of classification accuracy and a mean reciprocal rank of 0.888 for video retrieval on the test set. Hence, this model can be used for viewer independent video indexing for providing viewers an easier way to navigate through the contents.","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131813970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Coarse Representation of Frames Oriented Video Coding By Leveraging Cuboidal Partitioning of Image Data
Pub Date: 2020-09-21 | DOI: 10.1109/MMSP48831.2020.9287138
Ashek Ahmmed, M. Paul, Manzur Murshed, D. Taubman
Video coding algorithms attempt to minimize the significant commonality that exists within a video sequence. Each new video coding standard contains tools that can perform this task more efficiently than its predecessors. In this work, we form a coarse representation of the current frame by minimizing commonality within that frame while preserving its important structural properties. The building blocks of this coarse representation are rectangular regions called cuboids, which are computationally simple and have a compact description. We then propose to employ the coarse frame as an additional source for predictive coding of the current frame. Experimental results show an improvement in bit rate savings over a reference HEVC codec, with a minor increase in codec computational complexity.
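As an illustration of a cuboid-based coarse representation, the sketch below greedily splits a frame into rectangular regions, replaces each region by its mean, and always applies the split that removes the most squared error. The paper's actual partitioning criterion and cuboid budget may differ; this is only a plausible reading of the idea.

```python
import numpy as np

def best_split(region):
    """Return (gain, axis, pos) of the split that most reduces squared error."""
    base = ((region - region.mean()) ** 2).sum()
    best = (0.0, None, None)
    for axis in (0, 1):
        for pos in range(1, region.shape[axis]):
            a, b = np.split(region, [pos], axis=axis)
            err = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
            if base - err > best[0]:
                best = (base - err, axis, pos)
    return best

def cuboid_frame(frame, n_cuboids=16):
    """Coarse frame where each of up to n_cuboids rectangles is its mean value."""
    regions = [(0, 0, frame.astype(float))]          # (top, left, pixel block)
    while len(regions) < n_cuboids:
        gains = [best_split(r[2]) for r in regions]
        i = int(np.argmax([g[0] for g in gains]))
        gain, axis, pos = gains[i]
        if axis is None:                             # nothing left worth splitting
            break
        top, left, reg = regions.pop(i)
        a, b = np.split(reg, [pos], axis=axis)
        off = (pos, 0) if axis == 0 else (0, pos)
        regions += [(top, left, a), (top + off[0], left + off[1], b)]
    coarse = np.empty_like(frame, dtype=float)
    for top, left, reg in regions:
        coarse[top:top + reg.shape[0], left:left + reg.shape[1]] = reg.mean()
    return coarse

coarse = cuboid_frame(np.random.default_rng(0).random((32, 32)), n_cuboids=16)
```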
{"title":"A Coarse Representation of Frames Oriented Video Coding By Leveraging Cuboidal Partitioning of Image Data","authors":"Ashek Ahmmed, M. Paul, Manzur Murshed, D. Taubman","doi":"10.1109/MMSP48831.2020.9287138","DOIUrl":"https://doi.org/10.1109/MMSP48831.2020.9287138","url":null,"abstract":"Video coding algorithms attempt to minimize the significant commonality that exists within a video sequence. Each new video coding standard contains tools that can perform this task more efficiently compared to its predecessors. In this work, we form a coarse representation of the current frame by minimizing commonality within that frame while preserving important structural properties of the frame. The building blocks of this coarse representation are rectangular regions called cuboids, which are computationally simple and has a compact description. Then we propose to employ the coarse frame as an additional source for predictive coding of the current frame. Experimental results show an improvement in bit rate savings over a reference codec for HEVC, with minor increase in the codec computational complexity.","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132078071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MMSP 2020 Breaker Page
Pub Date: 2020-09-21 | DOI: 10.1109/mmsp48831.2020.9287118