Guest Editorial: Special issue on media convergence and intelligent technology in the metaverse

IF 7.3; Zone 2 (Computer Science); JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE); CAAI Transactions on Intelligence Technology; Pub Date: 2023-06-08; DOI: 10.1049/cit2.12250

Siwei Ma, Maoguo Gong, Guojun Qi, Yun Tie, Ivan Lee, Bo Li, Cong Jin

CAAI Transactions on Intelligence Technology, Volume 8, Issue 2, pages 285-287. Publication type: Journal Article. Full text: https://onlinelibrary.wiley.com/doi/10.1049/cit2.12250 (PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12250)

Citation count: 0

Abstract

The metaverse is a new type of Internet application and social form that integrates a variety of new technologies, including artificial intelligence, digital twins, blockchain, cloud computing, virtual reality, robotics, brain-computer interfaces, and 5G. Media convergence technology is a systematic and comprehensive discipline that applies the theories and methods of modern science and technology to media innovation, covering multimedia creation, production, communication, service, consumption, and reproduction. The emergence of new technologies such as deep learning, distributed computing, and extended reality has promoted media integration in the metaverse, and these technologies are key factors driving the current transformation of the Internet into the metaverse.

This Special Issue aims to collect research on the application of media convergence and intelligent technology in the metaverse, focussing on: the theory and technology of intelligent multimedia content generation based on deep learning; intelligent recommendation algorithms for media content with privacy protection at their core; prediction models of multimedia communication based on big data analysis; immersive experience technology (VR/AR) in the metaverse and multimedia communication; ultrahigh-definition video transmission and storage resource allocation algorithms for the 5G/6G mobile Internet; and neural network-based media content encryption algorithms. Original research and review articles are welcome.

The first article defines a comprehensive information loss that considers both the suppression of records and the relationship between sensitive attributes [1]. A heuristic method is used to find the anonymity scheme with the lowest comprehensive information loss. The experimental results verify the practicality of the proposed data publishing method with multiple sensitive attributes. Compared with previous methods, the proposed method can guarantee information utility.

The second article addresses the poor segmentation performance of existing models on imbalanced data sets with small-scale samples by designing a bilateral U-Net network model with a spatial attention mechanism [2]. The model uses the lightweight MobileNetV2 as the backbone network for hierarchical feature extraction and proposes an Attentive Pyramid Spatial Attention (APSA) module which, compared to the Attenuated Spatial Pyramid module, can increase the receptive field and enhance the information. Finally, it adds a context fusion prediction branch that fuses high-semantic and low-semantic prediction results. The model effectively improves segmentation accuracy on small data sets. Experimental results on the CamVid data set show that, compared with some existing semantic segmentation networks, the algorithm achieves a better segmentation effect and accuracy, with an mIOU of 75.85%. Moreover, to verify the generality of the model and the effectiveness of the APSA module, experiments were conducted on the VOC 2012 data set, where the APSA module improved mIOU by about 12.2%.
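For readers unfamiliar with the mIOU figures quoted above, the following is a minimal sketch of how mean intersection-over-union is commonly computed from flattened label maps. The function name `mean_iou` and the toy data are illustrative assumptions, not taken from the article, and averaging only over classes that appear in either labelling is one of several conventions in use.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union, averaged over classes that occur
    in the ground truth or the prediction (an assumed convention)."""
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        # Pixels where either map assigns class c.
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical toy example: six pixels, three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(truth, pred, num_classes=3)
print(f"mIoU = {score:.4f}")  # per-class IoUs are 1/3, 2/3 and 1/2
```

Published benchmarks such as CamVid and VOC 2012 usually accumulate the intersections and unions over the whole test set before dividing, so per-image averages like this toy example are not directly comparable to the reported numbers.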

The third article concerns the dendritic neural model (DNM), which mimics the non-linearity of synapses in the human brain to simulate the information processing mechanisms and procedures of neurons [3]. This enhances the understanding of biological nervous systems and the applicability of the model in various fields. However, the existing DNM suffers from high complexity and limited generalisation capability. To address these issues, the authors propose a DNM pruning method with dendrite layer significance constraints. The method not only evaluates the significance of dendrite layers but also concentrates the significance in the trained model on a few dendrite layers, allowing low-significance dendrite layers to be removed. Simulation experiments on six UCI datasets demonstrate that the method surpasses existing pruning methods in terms of network size and generalisation performance.

The fourth article proposes a semantic and emotion-based dual latent variable generation model (Dual-LVG) for dialog systems, which can generate appropriate emotional responses without an emotional dictionary [4]. Unlike previous work, the conditional variational auto-encoder (CVAE) adopts the standard transformer structure. Dual-LVG then regularises the CVAE latent space by introducing a dual latent space of semantics and emotion. The content diversity and emotional accuracy of the generated responses are improved by learning emotion and semantic features separately. Moreover, an average attention mechanism is adopted to better extract semantic features at the sequence level, and a semi-supervised attention mechanism is used in the decoding step to strengthen the fusion of emotional features in the model. Experimental results show that Dual-LVG can generate different content by controlling emotional factors.

The fifth article proposes RDDCNN, which contains three blocks: a deformable block (DB), an enhanced block (EB) and a residual block (RB) [5]. The DB can extract more representative noise features via a deformable learnable kernel and a stacked convolutional architecture, according to the relations of surrounding pixels. The EB can facilitate contextual interaction through a dilated convolution and a novel combination of convolutional layers, batch normalisation (BN) and ReLU, which can enhance the learning ability of the proposed RDDCNN. To address the long-term dependency problem, the RB is used to enhance the memory ability of shallow layers on deep layers and to construct a clean image. In addition, the authors implement a blind denoising model. Experimental results demonstrate that the denoising model outperforms popular denoising methods in terms of qualitative and quantitative analysis.
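The role of dilated convolution in widening context can be illustrated with a short receptive-field calculation. This is a sketch under stated assumptions: the layer configurations below are hypothetical and are not the RDDCNN architecture; the helper names `effective_kernel` and `receptive_field` are introduced here for illustration only, and all layers are assumed to use stride 1.

```python
def effective_kernel(kernel_size, dilation):
    """Span covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions.

    layers: list of (kernel_size, dilation) pairs, input to output.
    """
    rf = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by (effective kernel - 1).
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf


# Hypothetical stacks of three 3x3 layers (not taken from the article).
plain = receptive_field([(3, 1), (3, 1), (3, 1)])
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])
print(plain, dilated)  # prints "7 15"
```

With the same number of weights per layer, the dilated stack sees a 15-pixel-wide context instead of 7, which is the sense in which dilation helps contextual interaction without adding parameters.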

The sixth article presents a framework emphasising the requirements engineering (RE) activities of structuring deep learning (DL) development with a transformation problem frame and analysing important data assumptions based on the framed physical phenomena [6]. The framework then links the RE activities through metamorphic relations (MRs) to quantitatively assess the DL solutions. A case study on MSDGC's CSO predictions demonstrates the applicability and viability of the framework. In particular, the authors show the appropriateness of the MRs derived from RE activities as well as the ways in which the MRs should be operated. The framework also offers insights into the strengths and weaknesses of three RNN implementations: LSTM, GRU, and IndRNN.

The seventh article enhances the performance of an end-to-end music separation algorithm by improving the network structure [7]. The main contributions are as follows: (1) a more reasonable densely connected U-Net is designed to capture the long-term characteristics of music, such as the main melody and tone; (2) on this basis, multi-head attention and a dual-path transformer are introduced in the separation module. Channel attention units are applied recursively on the feature map of each layer of the network, enabling the network to perform long-sequence separation. Experimental results show that, after the introduction of channel attention, the proposed algorithm achieves a stable improvement over the baseline system. On the MUSDB18 dataset, the average score of the separated audio exceeds that of the current best-performing music separation algorithm based on the time-frequency (T-F) domain.

The eighth article proposes a deep learning method that uses generic head-related transfer function (HRTF) amplitudes and anthropometric parameters as input features for individual HRTF generation [8]. Using fully convolutional neural networks, the key anthropometric parameters and the generic HRTF amplitudes were used to predict each individual HRTF amplitude spectrum in the full-space directions, and the interaural time delay (ITD) was predicted by a transformer module. In the amplitude prediction model, an attention mechanism was adopted to better capture the relationship between HRTF amplitude spectra at two distinct directions with large angle differences in space. Finally, with the minimum phase model, the predicted amplitude spectra and ITDs were used to obtain a set of individual head-related impulse responses. Besides the separate training of the HRTF amplitude and ITD generation models, their joint training was also considered and evaluated. The root-mean-square error and the log-spectral distortion were selected as objective metrics to evaluate the performance. Subjective experiments further showed that the auditory source localisation performance of the proposed method was better than that of other methods in most cases.
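The two objective metrics named above can be sketched as follows. This is a minimal illustration assuming the common definitions (root-mean-square error over samples, and log-spectral distortion as the root-mean-square of the dB difference between two magnitude spectra); the article may use a variant, and the function names and toy spectra here are hypothetical.

```python
import math


def rmse(reference, estimate):
    """Root-mean-square error between two equal-length sequences."""
    n = len(reference)
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(reference, estimate)) / n)


def log_spectral_distortion(ref_mag, est_mag):
    """Log-spectral distortion in dB between two magnitude spectra.

    Both inputs are linear amplitudes per frequency bin and must be > 0.
    """
    squared_db = [(20.0 * math.log10(r / e)) ** 2 for r, e in zip(ref_mag, est_mag)]
    return math.sqrt(sum(squared_db) / len(squared_db))


# Hypothetical 4-bin spectra: the estimate is 6 dB too high in one bin.
ref = [1.0, 0.5, 0.25, 0.125]
est = [1.0, 0.5, 0.5, 0.125]
lsd = log_spectral_distortion(ref, est)
err = rmse(ref, est)
print(f"LSD = {lsd:.3f} dB, RMSE = {err:.3f}")
```

Because the log-spectral distortion works on a dB scale, it weights errors in quiet frequency bins as heavily as errors in loud ones, which is why it is often reported alongside a linear-scale error such as RMSE.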

The ninth article utilises the Skinned Multi-Person Linear (SMPL) model and proposes a method using the Skeleton-aware Implicit Function (SIF) [9]. To alleviate broken or disembodied body parts, the proposed skeleton-aware structure prior builds skeleton awareness into an implicit function and consists of a bone-guided sampling strategy and a skeleton-relative encoding strategy. To deal with missing details and depth ambiguity, the authors' body-guided pixel-aligned feature exploits the SMPL model to enhance 2D normal and depth semantic features, and the proposed feature aggregation uses the extra geometry-aware prior to enable a more plausible merging with less noisy geometry. Additionally, SIF is adapted to RGB-D input, and experimental results show that SIF outperforms state-of-the-art methods on challenging datasets from Twindom and Thuman3.0.

The tenth article presents an approach based on Media Convergence and Graph convolution Encoder Clustering (MCGEC) for traditional Chinese medicine (TCM) clinical data [10]. It feeds modal information and graph structure from media information into a multi-modal graph convolution encoder to obtain a media feature representation learnt from multiple modalities. MCGEC captures latent information from various modalities by fusion and optimises the feature representations and network architecture with learnt clustering labels. The experiment is conducted on real-world multi-modal TCM clinical data, including images and text. On these data, MCGEC achieves better clustering results than generic single-modal clustering methods and more advanced current multi-modal clustering methods. Integrating multimedia features into clustering algorithms offers significant benefits compared to approaches that simply concatenate features from different modalities. The approach provides practical technical support for multi-modal clustering in the TCM field incorporating multimedia features.

Overall, the accepted articles cover a wide spectrum of problems, providing readers with a perspective on the underlying problems in both breadth and depth. We would like to thank all the authors and reviewers again for their contributions.




Source journal

CAAI Transactions on Intelligence Technology (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 11.00

Self-citation rate: 3.90%

Articles published: 134

Review time: 35 weeks

Journal description: CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI) providing research which is openly accessible to read and share worldwide.