Just Dance: detection of human body reenactment fake videos
Pub Date: 2024-08-14 | DOI: 10.1186/s13640-024-00635-2
Omran Alamayreh, Carmelo Fascella, Sara Mandelli, Benedetta Tondi, Paolo Bestagini, Mauro Barni
In the last few years, research on the detection of AI-generated videos has focused almost exclusively on detecting facial manipulations known as deepfakes. Much less attention has been paid to the detection of artificial non-facial fake videos. In this paper, we address a new forensic task, namely, the detection of fake videos of human body reenactment. To this end, we consider videos generated by the “Everybody Dance Now” framework. To accomplish our task, we have constructed and released a novel dataset of fake videos of this kind, referred to as the FakeDance dataset. Additionally, we propose two forgery detectors to study the detectability of FakeDance-style videos. The first exploits spatial–temporal clues of a given video by means of hand-crafted descriptors, whereas the second is an end-to-end detector based on Convolutional Neural Networks (CNNs) trained specifically for this task. Both detectors have their own peculiarities and strengths, working well in different operative scenarios. We believe that our proposed dataset, together with the two detectors, will contribute to research on the detection of non-facial fake videos generated by means of AI.
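To make the first detector's idea concrete, below is a minimal Python sketch of a hand-crafted spatial-temporal descriptor feeding a binary real/fake classifier; the frame statistics, the SVM, and the dummy data are illustrative assumptions, not the descriptors actually used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

def spatiotemporal_descriptor(frames):
    """Toy hand-crafted descriptor: statistics of inter-frame residuals.

    frames: (T, H, W) grayscale video as a float array.
    """
    residuals = np.abs(np.diff(frames, axis=0))   # temporal differences between frames
    return np.array([
        residuals.mean(),                         # average temporal activity
        residuals.std(),                          # variability of the motion energy
        np.percentile(residuals, 95),             # strength of the largest changes
        frames.std(),                             # overall spatial contrast
    ])

# Dummy training data: random arrays standing in for real and fake clips.
rng = np.random.default_rng(0)
videos = [rng.random((30, 64, 64)) for _ in range(40)]
labels = [0] * 20 + [1] * 20                      # 0 = real, 1 = fake (placeholder labels)

X = np.stack([spatiotemporal_descriptor(v) for v in videos])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:5]))
```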
{"title":"Just Dance: detection of human body reenactment fake videos","authors":"Omran Alamayreh, Carmelo Fascella, Sara Mandelli, Benedetta Tondi, Paolo Bestagini, Mauro Barni","doi":"10.1186/s13640-024-00635-2","DOIUrl":"https://doi.org/10.1186/s13640-024-00635-2","url":null,"abstract":"<p>In the last few years, research on the detection of AI-generated videos has focused exclusively on detecting facial manipulations known as deepfakes. Much less attention has been paid to the detection of artificial non-facial fake videos. In this paper, we address a new forensic task, namely, the detection of fake videos of human body reenactment. To this purpose, we consider videos generated by the “Everybody Dance Now” framework. To accomplish our task, we have constructed and released a novel dataset of fake videos of this kind, referred to as FakeDance dataset. Additionally, we propose two forgery detectors to study the detectability of FakeDance kind of videos. The first one exploits spatial–temporal clues of a given video by means of hand-crafted descriptors, whereas the second detector is an end-to-end detector based on Convolutional Neural Networks (CNNs) trained on purpose. Both detectors have their peculiarities and strengths, working well in different operative scenarios. We believe that our proposed dataset together with the two detectors will contribute to the research on the detection of non-facial fake videos generated by means of AI.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"27 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PointPCA: point cloud objective quality assessment using PCA-based descriptors
Pub Date: 2024-08-09 | DOI: 10.1186/s13640-024-00626-3
Evangelos Alexiou, Xuemei Zhou, Irene Viola, Pablo Cesar
Point clouds are a prominent solution for representing 3D photo-realistic content in immersive applications. As with other imaging modalities, quality predictions for point cloud contents are vital for a wide range of applications, enabling trade-off optimizations between data quality and data size in every processing step from acquisition to rendering. In this work, we focus on use cases that consider human end-users consuming point cloud contents and, hence, we concentrate on visual quality metrics. In particular, we propose a set of perceptually relevant descriptors based on principal component analysis (PCA) decomposition, which is applied to both geometry and texture data for full-reference point cloud quality assessment. Statistical features are derived from these descriptors to characterize local shape and appearance properties for both a reference and a distorted point cloud. The extracted statistical features are subsequently compared to provide corresponding predictions of visual quality for the distorted point cloud. As part of our method, a learning-based approach is proposed to fuse these individual predictors into a unified perceptual score. We validate the accuracy of the individual predictors, as well as the unified quality scores obtained after regression against subjectively annotated datasets, showing that our metric outperforms state-of-the-art solutions. Insights regarding design decisions are provided through exploratory studies, evaluating the performance of our metric under different parameter configurations, attribute domains, color spaces, and regression models. A software implementation of the proposed metric is made available at the following link: https://github.com/cwi-dis/pointpca.
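As a rough illustration of PCA-based local descriptors for full-reference quality assessment, the sketch below computes eigenvalue features of k-nearest-neighbour covariances and compares their pooled statistics between a reference and a distorted cloud; the feature set, pooling, and learned fusion used by PointPCA are not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree

def pca_features(points, k=16):
    """Per-point eigenvalue features of the k-nearest-neighbour covariance matrices."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    feats = []
    for nbrs in idx:
        w = np.sort(np.linalg.eigvalsh(np.cov(points[nbrs].T)))[::-1]  # l1 >= l2 >= l3
        w = w / (w.sum() + 1e-12)
        feats.append([w[0], w[1], w[2], w[0] - w[1], w[1] - w[2]])     # crude shape cues
    return np.asarray(feats)

rng = np.random.default_rng(1)
reference = rng.random((500, 3))
distorted = reference + rng.normal(scale=0.01, size=reference.shape)   # toy distortion

# Compare pooled feature statistics as a crude full-reference quality predictor
# (smaller difference = geometrically more similar).
diff = np.abs(pca_features(reference).mean(axis=0) - pca_features(distorted).mean(axis=0))
print(diff.sum())
```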
{"title":"PointPCA: point cloud objective quality assessment using PCA-based descriptors","authors":"Evangelos Alexiou, Xuemei Zhou, Irene Viola, Pablo Cesar","doi":"10.1186/s13640-024-00626-3","DOIUrl":"https://doi.org/10.1186/s13640-024-00626-3","url":null,"abstract":"<p>Point clouds denote a prominent solution for the representation of 3D photo-realistic content in immersive applications. Similarly to other imaging modalities, quality predictions for point cloud contents are vital for a wide range of applications, enabling trade-off optimizations between data quality and data size in every processing step from acquisition to rendering. In this work, we focus on use cases that consider human end-users consuming point cloud contents and, hence, we concentrate on visual quality metrics. In particular, we propose a set of perceptually relevant descriptors based on principal component analysis (PCA) decomposition, which is applied to both geometry and texture data for full-reference point cloud quality assessment. Statistical features are derived from these descriptors to characterize local shape and appearance properties for both a reference and a distorted point cloud. The extracted statistical features are subsequently compared to provide corresponding predictions of visual quality for the distorted point cloud. As part of our method, a learning-based approach is proposed to fuse these individual predictors to a unified perceptual score. We validate the accuracy of the individual predictors, as well as the unified quality scores obtained after regression against subjectively annotated datasets, showing that our metric outperforms state-of-the-art solutions. Insights regarding design decisions are provided through exploratory studies, evaluating the performance of our metric under different parameter configurations, attribute domains, color spaces, and regression models. A software implementation of the proposed metric is made available at the following link: https://github.com/cwi-dis/pointpca.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"79 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compressed point cloud classification with point-based edge sampling
Pub Date: 2024-08-07 | DOI: 10.1186/s13640-024-00637-0
Zhe Luo, Wenjing Jia, Stuart Perry
3D point cloud data, as an immersive, detailed data source, has been increasingly used in numerous applications. To deal with the computational and storage challenges of this data, it needs to be compressed before transmission, storage, and processing, especially in real-time systems. Instead of decoding the compressed data stream and subsequently conducting downstream tasks on the decompressed data, analyzing point clouds directly in their compressed domain has attracted great interest. In this paper, we dive into the realm of compressed point cloud classification (CPCC), aiming to achieve high point cloud classification accuracy in a bitrate-saving way by ensuring the bit stream contains a high degree of representative information of the point cloud. Edge information is one of the most important and representative attributes of the point cloud because it can display the outlines or main shapes. However, extracting edge points or information from point cloud models is challenging due to their irregularity and sparsity. To address this challenge, we adopt an advanced edge-sampling method that enhances existing state-of-the-art (SOTA) point cloud edge-sampling techniques based on attention mechanisms, and consequently develop a novel CPCC method, “CPCC-PES”, that focuses on the point cloud’s edge information. Results obtained on the benchmark ModelNet40 dataset show that our model achieves a better rate-accuracy trade-off than SOTA works. Specifically, our method achieves over 90% Top-1 Accuracy with a mere 0.08 bits-per-point (bpp), marking a remarkable reduction of over 96% in BD-bitrate compared with specialized codecs. This means that our method only consumes 20% of the bitrate of other SOTA works while maintaining comparable accuracy. Furthermore, we propose a new evaluation metric named BD-Top-1 Accuracy to evaluate the trade-off between bitrate and Top-1 Accuracy for future CPCC research.
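The snippet below is a simplified stand-in for edge sampling: points are scored by local surface variation and the highest-scoring fraction is kept; the attention-based edge sampling of CPCC-PES is considerably more elaborate, so treat this only as a sketch of the general idea.

```python
import numpy as np
from scipy.spatial import cKDTree

def edge_sample(points, keep_ratio=0.2, k=16):
    """Keep the points with the highest local surface variation (a crude edge proxy)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    variation = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        eigvals = np.linalg.eigvalsh(np.cov(points[nbrs].T))   # ascending eigenvalues
        variation[i] = eigvals[0] / (eigvals.sum() + 1e-12)    # large near creases/corners
    n_keep = max(1, int(keep_ratio * len(points)))
    return points[np.argsort(variation)[-n_keep:]]

rng = np.random.default_rng(2)
points = rng.random((2000, 3))
points[:, 2] = np.round(points[:, 2])           # squash onto two parallel planes (toy shape)
print(edge_sample(points).shape)                # the retained "edge" subset
```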
{"title":"Compressed point cloud classification with point-based edge sampling","authors":"Zhe Luo, Wenjing Jia, Stuart Perry","doi":"10.1186/s13640-024-00637-0","DOIUrl":"https://doi.org/10.1186/s13640-024-00637-0","url":null,"abstract":"<p>3D point cloud data, as an immersive detailed data source, has been increasingly used in numerous applications. To deal with the computational and storage challenges of this data, it needs to be compressed before transmission, storage, and processing, especially in real-time systems. Instead of decoding the compressed data stream and subsequently conducting downstream tasks on the decompressed data, analyzing point clouds directly in their compressed domain has attracted great interest. In this paper, we dive into the realm of compressed point cloud classification (CPCC), aiming to achieve high point cloud classification accuracy in a bitrate-saving way by ensuring the bit stream contains a high degree of representative information of the point cloud. Edge information is one of the most important and representative attributes of the point cloud because it can display the outlines or main shapes. However, extracting edge points or information from point cloud models is challenging due to their irregularity and sparsity. To address this challenge, we adopt an advanced edge-sampling method that enhances existing state-of-the-art (SOTA) point cloud edge-sampling techniques based on attention mechanisms and consequently develop a novel CPCC method “CPCC-PES” that focuses on point cloud’s edge information. The result obtained on the benchmark ModelNet40 dataset shows that our model has superior rate-accuracy trade-off performance than SOTA works. Specifically, our method achieves over 90% Top-1 Accuracy with a mere 0.08 bits-per-point (bpp), marking a remarkable over 96% reduction in BD-bitrate compared with specialized codecs. This means that our method only consumes 20% of the bitrate of other SOTA works while maintaining comparable accuracy. Furthermore, we propose a new evaluation metric named BD-Top-1 Accuracy to evaluate the trade-off performance between bitrate and Top-1 Accuracy for future CPCC research.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"28 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of the use of box size priors for 6D plane segment tracking from point clouds with applications in cargo packing
Pub Date: 2024-08-06 | DOI: 10.1186/s13640-024-00636-1
Guillermo A. Camacho-Muñoz, Sandra Esperanza Nope Rodríguez, Humberto Loaiza-Correa, João Paulo Silva do Monte Lima, Rafael Alves Roberto
This paper addresses the problem of 6D pose tracking of plane segments from point clouds acquired from a mobile camera. This is motivated by manual packing operations, where an opportunity exists to enhance performance by aiding operators with instructions based on augmented reality. The approach uses point clouds as input, owing to their advantages for extracting geometric information relevant to estimating the 6D pose of rigid objects. The proposed algorithm begins with a RANSAC fitting stage on the raw point cloud. It then implements strategies to compute the 2D size and 6D pose of plane segments from geometric analysis of the fitted point cloud. Redundant detections are combined using a new quality factor that predicts point cloud mapping density and allows the selection of the most accurate detection. The algorithm is designed for dynamic scenes, employing a novel particle concept in the point cloud space to track detections’ validity over time. A variant of the algorithm uses box size priors (available in most packing operations) to filter out irrelevant detections. The impact of this prior knowledge is evaluated through an experimental design that compares the performance of a plane segment tracking system, considering variations in the tracking algorithm and camera speed (onboard the packing operator). The tracking algorithm varies at two levels: algorithm \(A_{wpk}\), which integrates prior knowledge of box sizes, and algorithm \(A_{woutpk}\), which assumes ignorance of box properties. Camera speed is evaluated at low and high speeds. Results indicate improvements in precision and F1-score associated with using the \(A_{wpk}\) algorithm, and consistent performance across both velocities. These results confirm that the performance of a tracking system in a real-life, complex scenario is enhanced by including prior knowledge of the elements in the scene. The proposed algorithm is limited to tracking plane segments of boxes fully supported on surfaces parallel to the ground plane and not stacked. Future works are proposed to include strategies to resolve this limitation.
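A minimal NumPy version of the kind of RANSAC plane-fitting stage the algorithm starts from is sketched below; the threshold, iteration count, and toy data are illustrative assumptions, and the subsequent 2D-size/6D-pose analysis is not shown.

```python
import numpy as np

def ransac_plane(points, n_iter=200, threshold=0.01, seed=0):
    """Fit a plane n.x + d = 0 (unit normal n) to 3D points via RANSAC."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.array([], dtype=int)
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                               # degenerate (collinear) sample
            continue
        normal = normal / norm
        d = -normal @ sample[0]
        inliers = np.flatnonzero(np.abs(points @ normal + d) < threshold)
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (normal, d), inliers
    return best_model, best_inliers

# Toy box face: a noisy plane at z = 0.5 plus scattered outliers.
rng = np.random.default_rng(3)
face = np.column_stack([rng.random((500, 2)), 0.5 + rng.normal(scale=0.002, size=500)])
clutter = rng.random((100, 3))
(normal, d), inliers = ransac_plane(np.vstack([face, clutter]))
print(normal, d, len(inliers))
```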
{"title":"Evaluation of the use of box size priors for 6D plane segment tracking from point clouds with applications in cargo packing","authors":"Guillermo A. Camacho-Muñoz, Sandra Esperanza Nope Rodríguez, Humberto Loaiza-Correa, João Paulo Silva do Monte Lima, Rafael Alves Roberto","doi":"10.1186/s13640-024-00636-1","DOIUrl":"https://doi.org/10.1186/s13640-024-00636-1","url":null,"abstract":"<p>This paper addresses the problem of 6D pose tracking of plane segments from point clouds acquired from a mobile camera. This is motivated by manual packing operations, where an opportunity exists to enhance performance, aiding operators with instructions based on augmented reality. The approach uses as input point clouds, by its advantages for extracting geometric information relevant to estimating the 6D pose of rigid objects. The proposed algorithm begins with a RANSAC fitting stage on the raw point cloud. It then implements strategies to compute the 2D size and 6D pose of plane segments from geometric analysis of the fitted point cloud. Redundant detections are combined using a new quality factor that predicts point cloud mapping density and allows the selection of the most accurate detection. The algorithm is designed for dynamic scenes, employing a novel particle concept in the point cloud space to track detections’ validity over time. A variant of the algorithm uses box size priors (available in most packing operations) to filter out irrelevant detections. The impact of this prior knowledge is evaluated through an experimental design that compares the performance of a plane segment tracking system, considering variations in the tracking algorithm and camera speed (onboard the packing operator). The tracking algorithm varies at two levels: algorithm (<span>(A_{wpk})</span>), which integrates prior knowledge of box sizes, and algorithm (<span>(A_{woutpk})</span>), which assumes ignorance of box properties. Camera speed is evaluated at low and high speeds. Results indicate increments in the precision and F1-score associated with using the <span>(A_{wpk})</span> algorithm and consistent performance across both velocities. These results confirm the enhancement of the performance of a tracking system in a real-life and complex scenario by including previous knowledge of the elements in the scene. The proposed algorithm is limited to tracking plane segments of boxes fully supported on surfaces parallel to the ground plane and not stacked. Future works are proposed to include strategies to resolve this limitation.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"19 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Remote expert viewing, laboratory tests or objective metrics: which one(s) to trust?
Pub Date: 2024-06-17 | DOI: 10.1186/s13640-024-00630-7
Mathias Wien, Joel Jung
We present a study on the validity of quality assessment in the context of the development of visual media coding schemes. The work is motivated by the need for reliable means for decision-taking in standardization efforts of MPEG and JVET, i.e., the adoption or rejection of coding tools during the development process of the coding standard. The study includes results considering three means: objective quality metrics, remote expert viewing, which is a method designed in the context of MPEG standardization, and formal laboratory visual evaluation. The focus of this work is on the comparison of pairs of coded video sequences, e.g., a proposed change and an anchor scheme at a given rate point. An aggregation of performance measurements across multiple rate points, such as the Bjøntegaard Delta rate, is out of the scope of this paper. The paper details the test setup for the subjective assessment methods and the objective quality metrics under consideration. The results of the three approaches are reviewed, analyzed, and compared with respect to their suitability for the decision-taking task. The study indicates that, subject to the chosen test content and test protocols, the results of remote expert viewing using a forced-choice scale can be considered more discriminatory than the results of naïve viewers in the laboratory tests. The results further show that, in general, the well-established quality metrics, such as PSNR, SSIM, or MS-SSIM, exhibit a high rate of correct decision-making when their results are compared with both types of viewing tests. Among the learning-based metrics, VMAF and AVQT appear to be most robust. For the development process of a coding standard, the selection of the most suitable means must be guided by the context, where a small number of carefully selected objective metrics, in combination with viewing tests for unclear cases, appears recommendable.
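The toy snippet below illustrates the pairwise decision-taking setting: for each (anchor, proposal) pair, an objective metric's preference is compared with a viewing-test preference and the rate of agreement is counted; PSNR, the synthetic images, and the hard-coded subjective outcome are placeholders, not the paper's test material or protocol.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(4)
correct, n_pairs = 0, 20
for _ in range(n_pairs):
    ref = rng.integers(0, 256, (64, 64)).astype(float)
    anchor = np.clip(ref + rng.normal(scale=6, size=ref.shape), 0, 255)    # anchor coding
    proposal = np.clip(ref + rng.normal(scale=4, size=ref.shape), 0, 255)  # proposed tool
    metric_prefers_proposal = psnr(ref, proposal) > psnr(ref, anchor)
    viewing_test_prefers_proposal = True   # stand-in for a forced-choice viewing-test outcome
    correct += int(metric_prefers_proposal == viewing_test_prefers_proposal)
print(f"correct decisions: {correct}/{n_pairs}")
```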
{"title":"Remote expert viewing, laboratory tests or objective metrics: which one(s) to trust?","authors":"Mathias Wien, Joel Jung","doi":"10.1186/s13640-024-00630-7","DOIUrl":"https://doi.org/10.1186/s13640-024-00630-7","url":null,"abstract":"<p>We present a study on the validity of quality assessment in the context of the development of visual media coding schemes. The work is motivated by the need for reliable means for decision-taking in standardization efforts of MPEG and JVET, i.e., the adoption or rejection of coding tools during the development process of the coding standard. The study includes results considering three means: objective quality metrics, remote expert viewing, which is a method designed in the context of MPEG standardization, and formal laboratory visual evaluation. The focus of this work is on the comparison of pairs of coded video sequences, e.g., a proposed change and an anchor scheme at a given rate point. An aggregation of performance measurements across multiple rate points, such as the Bjøntegaard Delta rate, is out of the scope of this paper. The paper details the test setup for the subjective assessment methods and the objective quality metrics under consideration. The results of the three approaches are reviewed, analyzed, and compared with respect to their suitability for the decision-taking task. The study indicates that, subject to the chosen test content and test protocols, the results of remote expert viewing using a forced-choice scale can be considered more discriminatory than the results of naïve viewers in the laboratory tests. The results further that, in general, the well-established quality metrics, such as PSNR, SSIM, or MS-SSIM, exhibit a high rate of correct decision-making when their results are compared with both types of viewing tests. Among the learning-based metrics, VMAF and AVQT appear to be most robust. For the development process of a coding standard, the selection of the most suitable means must be guided by the context, where a small number of carefully selected objective metrics, in combination with viewing tests for unclear cases, appears recommendable.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"135 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141550115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Impact of LiDAR point cloud compression on 3D object detection evaluated on the KITTI dataset
Pub Date: 2024-06-17 | DOI: 10.1186/s13640-024-00633-4
Nuno A. B. Martins, Luís A. da Silva Cruz, Fernando Lopes
The rapid growth in the amount of generated 3D data, particularly in the form of Light Detection And Ranging (LiDAR) point clouds (PCs), poses very significant challenges in terms of data storage, transmission, and processing. Point cloud (PC) representation of 3D visual information has proven to be a very flexible format, with applications ranging from multimedia immersive communication to machine vision tasks in the robotics and autonomous driving domains. In this paper, we investigate the performance of four reference 3D object detection techniques when the input PCs are compressed with varying levels of degradation. Compression is performed using two MPEG standard coders based on 2D projections and octree decomposition, as well as two coding methods based on Deep Learning (DL). For the DL coding methods, we used a Joint Photographic Experts Group (JPEG) reference PC coder, which we adapted to accept LiDAR PCs in both Cartesian and cylindrical coordinate systems. The detection performance of the four reference 3D object detection methods was evaluated using both pre-trained models and models specifically trained on degraded PCs reconstructed from compressed representations. It is shown that LiDAR PCs can be compressed down to 6 bits per point with no significant degradation in object detection precision. Furthermore, employing specifically trained detection models improves the detection capabilities even at compression rates as low as 2 bits per point. These results show that LiDAR PCs can be coded to enable efficient storage and transmission without significant object detection performance loss.
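As a rough picture of how coarser rate points degrade a LiDAR cloud, the sketch below uniformly quantizes coordinates at decreasing bit depths; the actual MPEG and learning-based codecs evaluated in the paper are far more sophisticated, and the bit depths here are per axis, purely for illustration.

```python
import numpy as np

def quantize_point_cloud(points, bits):
    """Uniformly quantize coordinates to `bits` per axis within the cloud's bounding box."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    levels = 2 ** bits - 1
    grid = np.round((points - lo) / (hi - lo + 1e-12) * levels)   # integer grid coordinates
    return np.unique(grid, axis=0) / levels * (hi - lo) + lo      # merge duplicates, dequantize

rng = np.random.default_rng(5)
sweep = rng.normal(scale=20.0, size=(10000, 3))                   # stand-in for a LiDAR sweep
for bits in (10, 8, 6, 4):
    rec = quantize_point_cloud(sweep, bits)
    print(bits, "bits per axis ->", len(rec), "points kept")
```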
{"title":"Impact of LiDAR point cloud compression on 3D object detection evaluated on the KITTI dataset","authors":"Nuno A. B. Martins, Luís A. da Silva Cruz, Fernando Lopes","doi":"10.1186/s13640-024-00633-4","DOIUrl":"https://doi.org/10.1186/s13640-024-00633-4","url":null,"abstract":"<p>The rapid growth on the amount of generated 3D data, particularly in the form of Light Detection And Ranging (LiDAR) point clouds (PCs), poses very significant challenges in terms of data storage, transmission, and processing. Point cloud (PC) representation of 3D visual information has shown to be a very flexible format with many applications ranging from multimedia immersive communication to machine vision tasks in the robotics and autonomous driving domains. In this paper, we investigate the performance of four reference 3D object detection techniques, when the input PCs are compressed with varying levels of degradation. Compression is performed using two MPEG standard coders based on 2D projections and octree decomposition, as well as two coding methods based on Deep Learning (DL). For the DL coding methods, we used a Joint Photographic Experts Group (JPEG) reference PC coder, that we adapted to accept LiDAR PCs in both Cartesian and cylindrical coordinate systems. The detection performance of the four reference 3D object detection methods was evaluated using both pre-trained models and models specifically trained using degraded PCs reconstructed from compressed representations. It is shown that LiDAR PCs can be compressed down to 6 bits per point with no significant degradation on the object detection precision. Furthermore, employing specifically trained detection models improves the detection capabilities even at compression rates as low as 2 bits per point. These results show that LiDAR PCs can be coded to enable efficient storage and transmission, without significant object detection performance loss.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"51 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141550114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive bridge model for compressed domain point cloud classification
Pub Date: 2024-06-08 | DOI: 10.1186/s13640-024-00631-6
Abdelrahman Seleem, André F. R. Guarda, Nuno M. M. Rodrigues, Fernando Pereira
The recent adoption of deep learning-based models for the processing and coding of multimedia signals has brought noticeable gains in performance, which have established deep learning-based solutions as the uncontested state-of-the-art both for computer vision tasks targeting machine consumption and, more recently, for coding applications targeting human visualization. Traditionally, applications requiring both coding and computer vision processing require first decoding the bitstream and then applying the computer vision methods to the decompressed multimedia signals. However, the adoption of deep learning-based solutions enables the use of compressed domain computer vision processing, with gains in performance and computational complexity over the decompressed domain approach. For point clouds (PCs), these gains have been demonstrated in the single available compressed domain computer vision processing solution, named Compressed Domain PC Classifier, which processes JPEG Pleno PC coding (PCC) compressed streams using a PC classifier largely compatible with the state-of-the-art spatial domain PointGrid classifier. However, the available Compressed Domain PC Classifier presents strong limitations by imposing a single, specific input size associated with specific JPEG Pleno PCC configurations; this limits the compression performance, as these configurations are not ideal for all PCs due to their different characteristics, notably density. To overcome these limitations, this paper proposes the first Adaptive Compressed Domain PC Classifier solution, which includes a novel adaptive bridge model that allows processing the JPEG Pleno PCC encoded bit streams using different coding configurations, now maximizing the compression efficiency. Experimental results show that the novel Adaptive Compressed Domain PC Classifier allows JPEG PCC to achieve better compression performance by not imposing a single, specific coding configuration for all PCs, regardless of their different characteristics. Moreover, the added adaptability achieves slightly better PC classification performance than the previous Compressed Domain PC Classifier, and substantially better PC classification performance (with a lower number of weights) than the PointGrid PC classifier working in the decompressed domain.
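One way to picture an adaptive bridge is as a layer that maps latent tensors of varying size onto the fixed input expected by a classifier; the PyTorch sketch below uses adaptive average pooling for this and is only an assumption about the general idea, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class AdaptiveBridge(nn.Module):
    """Maps variable-size 3D latent blocks to a fixed-size input for a downstream classifier."""
    def __init__(self, channels=32, target=8, n_classes=40):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(target)          # any DxHxW -> target^3
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * target ** 3, n_classes),
        )

    def forward(self, latent):
        return self.classifier(self.pool(latent))

bridge = AdaptiveBridge()
for size in (6, 10, 14):                          # latents from different coding configurations
    latent = torch.randn(1, 32, size, size, size)
    print(size, bridge(latent).shape)             # always (1, 40)
```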
{"title":"Adaptive bridge model for compressed domain point cloud classification","authors":"Abdelrahman Seleem, André F. R. Guarda, Nuno M. M. Rodrigues, Fernando Pereira","doi":"10.1186/s13640-024-00631-6","DOIUrl":"https://doi.org/10.1186/s13640-024-00631-6","url":null,"abstract":"<p>The recent adoption of deep learning-based models for the processing and coding of multimedia signals has brought noticeable gains in performance, which have established deep learning-based solutions as the uncontested state-of-the-art both for computer vision tasks, targeting machine consumption, as well as, more recently, coding applications, targeting human visualization. Traditionally, applications requiring both coding and computer vision processing require first decoding the bitstream and then applying the computer vision methods to the decompressed multimedia signals. However, the adoption of deep learning-based solutions enables the use of compressed domain computer vision processing, with gains in performance and computational complexity over the decompressed domain approach. For point clouds (PCs), these gains have been demonstrated in the single available compressed domain computer vision processing solution, named Compressed Domain PC Classifier, which processes JPEG Pleno PC coding (PCC) compressed streams using a PC classifier largely compatible with the state-of-the-art spatial domain PointGrid classifier. However, the available Compressed Domain PC Classifier presents strong limitations by imposing a single, specific input size which is associated to specific JPEG Pleno PCC configurations; this limits the compression performance as these configurations are not ideal for all PCs due to their different characteristics, notably density. To overcome these limitations, this paper proposes the first Adaptive Compressed Domain PC Classifier solution which includes a novel adaptive bridge model that allows to process the JPEG Pleno PCC encoded bit streams using different coding configurations, now maximizing the compression efficiency. Experimental results show that the novel Adaptive Compressed Domain PC Classifier allows JPEG PCC to achieve better compression performance by not imposing a single, specific coding configuration for all PCs, regardless of its different characteristics. Moreover, the added adaptability power can achieve slightly better PC classification performance than the previous Compressed Domain PC Classifier and largely better PC classification performance (and lower number of weights) than the PointGrid PC classifier working in the decompressed domain.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"15 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning-based light field imaging: an overview
Pub Date: 2024-05-30 | DOI: 10.1186/s13640-024-00628-1
Saeed Mahmoudpour, Carla Pagliari, Peter Schelkens
Conventional photography can only provide a two-dimensional image of the scene, whereas emerging imaging modalities such as light fields enable the representation of higher-dimensional visual information by capturing light rays from different directions. Light fields provide immersive experiences and a sense of presence in the scene, and can enhance different vision tasks. Hence, research into light field processing methods has become increasingly popular. It does, however, come at the cost of higher data volume and computational complexity. With the growing deployment of machine learning and deep architectures in image processing applications, a paradigm shift toward learning-based approaches has also been observed in the design of light field processing methods. Various learning-based approaches have been developed to process the high volume of light field data efficiently for different vision tasks while improving performance. Taking into account the diversity of light field vision tasks and the deployed learning-based frameworks, it is necessary to survey the scattered learning-based works in the domain to gain insight into the current trends and challenges. This paper aims to review the existing learning-based solutions for light field imaging and to summarize the most promising frameworks. Moreover, evaluation methods and available light field datasets are highlighted. Lastly, the review concludes with a brief outlook on future research directions.
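As a small illustration of the higher-dimensional representation mentioned above, a light field can be stored as a 4D (plus colour) array from which sub-aperture views and epipolar-plane images are simple slices; the array shape below is an arbitrary example, not tied to any particular dataset.

```python
import numpy as np

# Toy 4D light field: 9x9 angular positions, each a 32x32 RGB sub-aperture image.
rng = np.random.default_rng(6)
light_field = rng.random((9, 9, 32, 32, 3))     # (u, v, s, t, color)

center_view = light_field[4, 4]                 # one sub-aperture image (fixed angular position)
epi_slice = light_field[4, :, 16, :, :]         # epipolar-plane image: one angular, one spatial axis
print(center_view.shape, epi_slice.shape)
```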
{"title":"Learning-based light field imaging: an overview","authors":"Saeed Mahmoudpour, Carla Pagliari, Peter Schelkens","doi":"10.1186/s13640-024-00628-1","DOIUrl":"https://doi.org/10.1186/s13640-024-00628-1","url":null,"abstract":"<p>Conventional photography can only provide a two-dimensional image of the scene, whereas emerging imaging modalities such as light field enable the representation of higher dimensional visual information by capturing light rays from different directions. Light fields provide immersive experiences, a sense of presence in the scene, and can enhance different vision tasks. Hence, research into light field processing methods has become increasingly popular. It does, however, come at the cost of higher data volume and computational complexity. With the growing deployment of machine-learning and deep architectures in image processing applications, a paradigm shift toward learning-based approaches has also been observed in the design of light field processing methods. Various learning-based approaches are developed to process the high volume of light field data efficiently for different vision tasks while improving performance. Taking into account the diversity of light field vision tasks and the deployed learning-based frameworks, it is necessary to survey the scattered learning-based works in the domain to gain insight into the current trends and challenges. This paper aims to review the existing learning-based solutions for light field imaging and to summarize the most promising frameworks. Moreover, evaluation methods and available light field datasets are highlighted. Lastly, the review concludes with a brief outlook for future research directions.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"41 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141189137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-automated computer vision-based tracking of multiple industrial entities: a framework and dataset creation approach
Pub Date: 2024-03-22 | DOI: 10.1186/s13640-024-00623-6
This contribution presents the TOMIE framework (Tracking Of Multiple Industrial Entities), a framework for the continuous tracking of industrial entities (e.g., pallets, crates, barrels) over a camera network (six RGB cameras in this example). The framework makes use of multiple sensors, data pipelines, and data annotation procedures, and is described in detail in this contribution. With the vision of a fully automated tracking system for industrial entities in mind, it enables researchers to efficiently capture high-quality data in an industrial setting. Using this framework, an image dataset, the TOMIE dataset, is created, which at the same time is used to gauge the framework’s validity. This dataset contains annotation files for 112,860 frames and 640,936 entity instances captured by six cameras observing a large indoor space. It out-scales comparable datasets by a factor of four and is made up of scenarios drawn from industrial applications in the warehousing sector. Three tracking algorithms, namely ByteTrack, Bot-Sort, and SiamMOT, are applied to this dataset, serving as a proof-of-concept and providing tracking results that are comparable to the state of the art.
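The sketch below shows what a per-frame, per-camera annotation record for such a multi-camera tracking dataset could look like and how it might be aggregated; the field names and file layout are assumptions for illustration, not the published TOMIE format.

```python
import json
from collections import Counter

# Hypothetical records: one entry per entity instance visible in a given camera frame.
annotations = [
    {"camera": "cam_0", "frame": 17, "track_id": 3, "class": "pallet", "bbox": [120, 80, 220, 190]},
    {"camera": "cam_0", "frame": 18, "track_id": 3, "class": "pallet", "bbox": [124, 81, 224, 191]},
    {"camera": "cam_5", "frame": 17, "track_id": 9, "class": "barrel", "bbox": [400, 300, 450, 380]},
]

with open("annotations_sample.json", "w") as f:
    json.dump(annotations, f, indent=2)

with open("annotations_sample.json") as f:
    loaded = json.load(f)
print(Counter(a["class"] for a in loaded))      # instance counts per entity class
```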
{"title":"Semi-automated computer vision-based tracking of multiple industrial entities: a framework and dataset creation approach","authors":"","doi":"10.1186/s13640-024-00623-6","DOIUrl":"https://doi.org/10.1186/s13640-024-00623-6","url":null,"abstract":"<h3>Abstract</h3> <p>This contribution presents the TOMIE framework (Tracking Of Multiple Industrial Entities), a framework for the continuous tracking of industrial entities (e.g., pallets, crates, barrels) over a network of, in this example, six RGB cameras. This framework makes use of multiple sensors, data pipelines, and data annotation procedures, and is described in detail in this contribution. With the vision of a fully automated tracking system for industrial entities in mind, it enables researchers to efficiently capture high-quality data in an industrial setting. Using this framework, an image dataset, the TOMIE dataset, is created, which at the same time is used to gauge the framework’s validity. This dataset contains annotation files for 112,860 frames and 640,936 entity instances that are captured from a set of six cameras that perceive a large indoor space. This dataset out-scales comparable datasets by a factor of four and is made up of scenarios, drawn from industrial applications from the sector of warehousing. Three tracking algorithms, namely ByteTrack, Bot-Sort, and SiamMOT, are applied to this dataset, serving as a proof-of-concept and providing tracking results that are comparable to the state of the art.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"3 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140198850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast CU size decision and intra-prediction mode decision method for H.266/VVC
Pub Date: 2024-03-18 | DOI: 10.1186/s13640-024-00622-7
H.266/Versatile Video Coding (VVC) is the most recent video coding standard developed by the Joint Video Experts Team (JVET). It introduces the quad-tree with nested multi-type tree (QTMT) partitioning architecture, which improves the compression performance of H.266/VVC. Moreover, H.266/VVC contains a greater number of intra-prediction modes than H.265/High Efficiency Video Coding (HEVC), totalling 67. However, these features greatly increase the coding computational complexity. To cope with these issues, a fast intra-coding unit (CU) size decision method and a fast intra-prediction mode decision method are proposed in this paper. Specifically, trained Support Vector Machine (SVM) classifier models are utilized to determine the CU partition mode in the fast CU size decision scheme. Furthermore, the number of intra-prediction modes added to the rate-distortion optimization (RDO) mode set is reduced in a fast intra-prediction mode decision scheme based on an improved search step. Simulation results illustrate that the proposed overall algorithm can reduce encoding runtime by 55.24% with a negligible BDBR increase.
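A toy version of the SVM-based CU split decision is sketched below: simple block statistics serve as features for a binary split/no-split classifier; the actual features, CU sizes, labels, and training protocol of the proposed method are not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

def block_features(block):
    """Cheap texture cues for a luma block: variance and mean absolute gradients."""
    gy, gx = np.gradient(block.astype(float))
    return [block.var(), np.abs(gx).mean(), np.abs(gy).mean()]

rng = np.random.default_rng(7)
features, labels = [], []
for _ in range(200):
    smooth = rng.random() < 0.5
    block = (np.full((32, 32), 128.0) + rng.normal(scale=2, size=(32, 32)) if smooth
             else rng.integers(0, 256, (32, 32)).astype(float))
    features.append(block_features(block))
    labels.append(0 if smooth else 1)            # 0 = do not split, 1 = split (toy ground truth)

clf = SVC(kernel="rbf").fit(features, labels)
print(clf.predict(features[:5]))
```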
{"title":"Fast CU size decision and intra-prediction mode decision method for H.266/VVC","authors":"","doi":"10.1186/s13640-024-00622-7","DOIUrl":"https://doi.org/10.1186/s13640-024-00622-7","url":null,"abstract":"<h3>Abstract</h3> <p>H.266/Versatile Video Coding (VVC) is the most recent video coding standard developed by the Joint Video Experts Team (JVET). The quad-tree with nested multi-type tree (QTMT) architecture that improves the compression performance of H.266/VVC is introduced. Moreover, H.266/VVC contains a greater number of intra-prediction modes than H.265/High Efficiency Video Coding (HEVC), totalling 67. However, these lead to extremely the coding computational complexity. To cope with the above issues, a fast intra-coding unit (CU) size decision method and a fast intra-prediction mode decision method are proposed in this paper. Specifically, the trained Support Vector Machine (SVM) classifier models are utilized for determining CU partition mode in a fast CU size decision scheme. Furthermore, the quantity of intra-prediction modes added to the RDO mode set decreases in a fast intra-prediction mode decision scheme based on the improved search step. Simulation results illustrate that the proposed overall algorithm can decrease 55.24% encoding runtime with negligible BDBR.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"123 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140172567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}