Siwei Ma, Maoguo Gong, Guojun Qi, Yun Tie, Ivan Lee, Bo Li, Cong Jin
{"title":"客座编辑:元宇宙中的媒体融合和智能技术特刊","authors":"Siwei Ma, Maoguo Gong, Guojun Qi, Yun Tie, Ivan Lee, Bo Li, Cong Jin","doi":"10.1049/cit2.12250","DOIUrl":null,"url":null,"abstract":"<p>The metaverse is a new type of Internet application and social form that integrates a variety of new technologies, including artificial intelligence, digital twins, block chain, cloud computing, virtual reality, robots, with brain-computer interfaces, and 5G. Media convergence technology is a systematic and comprehensive discipline that applies the theories and methods of modern science and technology to the development of media innovation, mainly including multimedia creation, production, communication, service, consumption, reproduction and so on. The emergence of new technologies, such as deep learning, distributed computing, and extended reality has promoted the development of media integration in the metaverse, and these technologies are the key factors that promote the current transformation of the Internet to the metaverse.</p><p>This Special Issue aims to collect research on the application of media convergence and intelligent technology in the metaverse, focussing on the theory and technology of intelligent generation of multimedia content based on deep learning, the intelligent recommendation algorithm of media content with privacy protection as the core, the prediction model of multimedia communication based on big data analysis, and the immersive experience technology (VR/AR) in metaverse and multimedia communication, 5G/6G mobile Internet ultrahigh-definition video transmission and storage resource allocation algorithm, neural network-based media content encryption algorithm. Original research and review articles are welcome.</p><p>The first article defines comprehensive information loss that considers both the suppression of records and the relationship between sensitive attributes [<span>1</span>]. 
A heuristic method is leveraged to discover the optimal anonymity scheme that has the lowest comprehensive information loss. The experimental results verify the practice of the proposed data publishing method with multiple sensitive attributes. The proposed method can guarantee information utility when compared with previous ones.</p><p>The second article aims at the problem that the existing models have a poor segmentation effect on imbalanced data sets with small-scale samples, a bilateral U-Net network model with a spatial attention mechanism is designed [<span>2</span>]. The model uses the lightweight MobileNetV2 as the backbone network for feature hierarchical extraction and proposes an Attentive Pyramid Spatial Attention (APSA) module compared to the Attenuated Spatial Pyramid module, which can increase the receptive field and enhance the information, and finally adds the context fusion prediction branch that fuses high-semantic and low-semantic prediction results, and the model effectively improves the segmentation accuracy of small data sets. The experimental results on the CamVid data set show that compared with some existing semantic segmentation networks, the algorithm has a better segmentation effect and segmentation accuracy, and its mIOU reaches 75.85%. Moreover, to verify the generality of the model and the effectiveness of the APSA module, experiments were conducted on the VOC 2012 data set, and the APSA module improved mIOU by about 12.2%.</p><p>The third article proposed the dendritic neural model (DNM) mimics the non-linearity of synapses in the human brain to simulate the information processing mechanisms and procedures of neurons [<span>3</span>]. This enhances the understanding of biological nervous systems and the applicability of the model in various fields. However, the existing DNM suffers from high complexity and limited generalisation capability. 
To address these issues, we propose a DNM pruning method with dendrite layer significance constraints. Our method not only evaluates the significance of dendrite layers but also allocates the significance of a few dendrite layers in the trained model to a few dendrite layers, allowing the removal of low-significance dendrite layers. The simulation experiments on six UCI datasets demonstrate that our method surpasses existing pruning methods in terms of network size and generalisation performance.</p><p>The fourth article proposes a semantic and emotion-based dual latent variable generation model (Dual-LVG) for dialog systems, which is able to generate appropriate emotional responses without an emotional dictionary [<span>4</span>]. Different from previous work, the conditional variational auto-encoder (CVAE) adopts the standard transformer structure. Then, Dual-LVG regularises the CVAE latent space by introducing a dual latent space of semantics and emotion. The content diversity and emotional accuracy of the generated responses are improved by learning emotion and semantic features respectively. Moreover, the average attention mechanism is adopted to better extract semantic features at the sequence level, and the semi-supervised attention mechanism is used in the decoding step to strengthen the fusion of emotional features of the model. Experimental results show that Dual-LVG can successfully achieve the effect of generating different content by controlling emotional factors.</p><p>The fifth article proposes RDDCNN contains three blocks: a deformable block (DB), an enhanced block (EB) and a residual block (RB) [<span>5</span>]. The DB can extract more representative noise features via a deformable learnable kernel and stacked convolutional architecture, according to relations of surrounding pixels. 
The EB can facilitate contextual interaction through a dilated convolution and a novel combination of convolutional layers, batch normalisation (BN) and ReLU, which can enhance the learning ability of the proposed RDDCNN. To address long-term dependency problem, the RB is used to enhance the memory ability of shallow layer on deep layers and construct a clean image. Besides, we implement a blind denoising model. Experimental results demonstrate that our denoising model outperforms popular denoising methods in terms of qualitative and quantitative analysis.</p><p>The six article presents a framework emphasising RE activities of structuring the DL development with a transformation problem frame and analysing important data assumptions based on the framed physical phenomena [<span>6</span>]. Our framework then links the RE activities through MRs to quantitatively assess the DL solutions. Our case study on MSDGC's CSO predictions demonstrates the applicability and viability of our framework. In particular, we show the appropriateness of the MRs derived from RE activities as well as the ways that the MRs shall be operated. Our framework also helps offer insights into the strengths and weaknesses of three RNN implementations: LSTM, GRU, and IndRNN.</p><p>The seventh article presents the performance of the end-to-end music separation algorithm is enhanced by improving the network structure [<span>7</span>]. Our main contributions include the following: (1) A more reasonable densely connected U-Net is designed to capture the long-term characteristics of music, such as main melody, tone and so on. (2) On this basis, the multi-head attention and dual-path transformer are introduced in the separation module. Channel attention units are applied recursively on the feature map of each layer of the network, enabling the network to perform long-sequence separation. 
Experimental results show that after the introduction of the channel attention, the performance of the proposed algorithm has a stable improvement compared with the baseline system. On the MUSDB18 dataset, the average score of the separated audio exceeds that of the current best-performing music separation algorithm based on the time-frequency domain (T-F domain).</p><p>The eighth article proposes a deep learning method with generic HRTF amplitudes and anthropometric parameters as input features for individual HRTF generation [<span>8</span>]. By designing fully convolutional neural networks, the key anthropometric parameters and the generic HRTF amplitudes were used to predict each individual HRTF amplitude spectrum in the full-space directions, and the interaural time delay (ITD) was predicted by the transformer module. In the amplitude prediction model, the attention mechanism was adopted to better capture the relationship of HRTF amplitude spectra at two distinctive directions with large angle differences in space. Finally, with the minimum phase model, the predicted amplitude spectrum and ITDs were used to obtain a set of individual head-related impulse responses. Besides the separate training of the HRTF amplitude and ITD generation models, their joint training was also considered and evaluated. The root-mean-square error and the log-spectral distortion were selected as objective measurement metrics to evaluate the performance. Subjective experiments further showed that the auditory source localisation performance of the proposed method was better than other methods in most cases.</p><p>The ninth article utilises the skinned multi-personlinear (SMPL) model and propose a method using the Skeleton-aware Implicit Function (SIF) [<span>9</span>]. 
To alleviate the broken or disembodied body parts, the proposed skeleton-aware structure prior makes the skeleton awareness into an implicit function, which consists of a bone-guided sampling strategy and a skeleton-relative encoding strategy. To deal with the missing details and depth ambiguity problems, the authors' body-guided pixel-aligned feature exploits the SMPL to enhance 2D normal and depth semantic features, and the proposed feature aggregation uses the extra geometry-aware prior to enabling a more plausible merging with less noisy geometry. Additionally, SIF is also adapted to the RGB-D input, and experimental results show that SIF outperforms the state-of-the-arts methods on challenging datasets from Twindom and Thuman3.0.</p><p>The 10th article presents an approach based on Media Convergence and Graph convolution Encoder Clustering (MCGEC) for TCM clinical data [<span>10</span>]. It feeds modal information and graph structure from media information into a multi-modal graph convolution encoder to obtain the media feature representation learnt from multiple modalities. MCGEC captures latent information from various modalities by fusion and optimises the feature representations and network architecture with learnt clustering labels. The experiment is conducted on real-world multi-modal TCM clinical data, including information like images and text. MCGEC has improved clustering results compared to the generic single-modal clustering methods and the current more advanced multi-modal clustering methods. MCGEC applied to TCM clinical datasets can achieve better results. Integrating multimedia features into clustering algorithms offers significant benefits compared to single-modal clustering approaches that simply concatenate features from different modalities. 
It provides practical technical support for multi-modal clustering in the TCM field incorporating multimedia features.</p><p>Overall, the articles accepted cover a wide spectrum of problem providing readers with a perspective on the underlying problem in both breadth and depth. We would like to thank all the authors and reviewers again for their contributions.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"8 2","pages":"285-287"},"PeriodicalIF":8.4000,"publicationDate":"2023-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12250","citationCount":"0","resultStr":"{\"title\":\"Guest Editorial: Special issue on media convergence and intelligent technology in the metaverse\",\"authors\":\"Siwei Ma, Maoguo Gong, Guojun Qi, Yun Tie, Ivan Lee, Bo Li, Cong Jin\",\"doi\":\"10.1049/cit2.12250\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The metaverse is a new type of Internet application and social form that integrates a variety of new technologies, including artificial intelligence, digital twins, block chain, cloud computing, virtual reality, robots, with brain-computer interfaces, and 5G. Media convergence technology is a systematic and comprehensive discipline that applies the theories and methods of modern science and technology to the development of media innovation, mainly including multimedia creation, production, communication, service, consumption, reproduction and so on. 
The emergence of new technologies, such as deep learning, distributed computing, and extended reality has promoted the development of media integration in the metaverse, and these technologies are the key factors that promote the current transformation of the Internet to the metaverse.</p><p>This Special Issue aims to collect research on the application of media convergence and intelligent technology in the metaverse, focussing on the theory and technology of intelligent generation of multimedia content based on deep learning, the intelligent recommendation algorithm of media content with privacy protection as the core, the prediction model of multimedia communication based on big data analysis, and the immersive experience technology (VR/AR) in metaverse and multimedia communication, 5G/6G mobile Internet ultrahigh-definition video transmission and storage resource allocation algorithm, neural network-based media content encryption algorithm. Original research and review articles are welcome.</p><p>The first article defines comprehensive information loss that considers both the suppression of records and the relationship between sensitive attributes [<span>1</span>]. A heuristic method is leveraged to discover the optimal anonymity scheme that has the lowest comprehensive information loss. The experimental results verify the practice of the proposed data publishing method with multiple sensitive attributes. The proposed method can guarantee information utility when compared with previous ones.</p><p>The second article aims at the problem that the existing models have a poor segmentation effect on imbalanced data sets with small-scale samples, a bilateral U-Net network model with a spatial attention mechanism is designed [<span>2</span>]. 
The model uses the lightweight MobileNetV2 as the backbone network for feature hierarchical extraction and proposes an Attentive Pyramid Spatial Attention (APSA) module compared to the Attenuated Spatial Pyramid module, which can increase the receptive field and enhance the information, and finally adds the context fusion prediction branch that fuses high-semantic and low-semantic prediction results, and the model effectively improves the segmentation accuracy of small data sets. The experimental results on the CamVid data set show that compared with some existing semantic segmentation networks, the algorithm has a better segmentation effect and segmentation accuracy, and its mIOU reaches 75.85%. Moreover, to verify the generality of the model and the effectiveness of the APSA module, experiments were conducted on the VOC 2012 data set, and the APSA module improved mIOU by about 12.2%.</p><p>The third article proposed the dendritic neural model (DNM) mimics the non-linearity of synapses in the human brain to simulate the information processing mechanisms and procedures of neurons [<span>3</span>]. This enhances the understanding of biological nervous systems and the applicability of the model in various fields. However, the existing DNM suffers from high complexity and limited generalisation capability. To address these issues, we propose a DNM pruning method with dendrite layer significance constraints. Our method not only evaluates the significance of dendrite layers but also allocates the significance of a few dendrite layers in the trained model to a few dendrite layers, allowing the removal of low-significance dendrite layers. 
The simulation experiments on six UCI datasets demonstrate that our method surpasses existing pruning methods in terms of network size and generalisation performance.</p><p>The fourth article proposes a semantic and emotion-based dual latent variable generation model (Dual-LVG) for dialog systems, which is able to generate appropriate emotional responses without an emotional dictionary [<span>4</span>]. Different from previous work, the conditional variational auto-encoder (CVAE) adopts the standard transformer structure. Then, Dual-LVG regularises the CVAE latent space by introducing a dual latent space of semantics and emotion. The content diversity and emotional accuracy of the generated responses are improved by learning emotion and semantic features respectively. Moreover, the average attention mechanism is adopted to better extract semantic features at the sequence level, and the semi-supervised attention mechanism is used in the decoding step to strengthen the fusion of emotional features of the model. Experimental results show that Dual-LVG can successfully achieve the effect of generating different content by controlling emotional factors.</p><p>The fifth article proposes RDDCNN contains three blocks: a deformable block (DB), an enhanced block (EB) and a residual block (RB) [<span>5</span>]. The DB can extract more representative noise features via a deformable learnable kernel and stacked convolutional architecture, according to relations of surrounding pixels. The EB can facilitate contextual interaction through a dilated convolution and a novel combination of convolutional layers, batch normalisation (BN) and ReLU, which can enhance the learning ability of the proposed RDDCNN. To address long-term dependency problem, the RB is used to enhance the memory ability of shallow layer on deep layers and construct a clean image. Besides, we implement a blind denoising model. 
Experimental results demonstrate that our denoising model outperforms popular denoising methods in terms of qualitative and quantitative analysis.</p><p>The six article presents a framework emphasising RE activities of structuring the DL development with a transformation problem frame and analysing important data assumptions based on the framed physical phenomena [<span>6</span>]. Our framework then links the RE activities through MRs to quantitatively assess the DL solutions. Our case study on MSDGC's CSO predictions demonstrates the applicability and viability of our framework. In particular, we show the appropriateness of the MRs derived from RE activities as well as the ways that the MRs shall be operated. Our framework also helps offer insights into the strengths and weaknesses of three RNN implementations: LSTM, GRU, and IndRNN.</p><p>The seventh article presents the performance of the end-to-end music separation algorithm is enhanced by improving the network structure [<span>7</span>]. Our main contributions include the following: (1) A more reasonable densely connected U-Net is designed to capture the long-term characteristics of music, such as main melody, tone and so on. (2) On this basis, the multi-head attention and dual-path transformer are introduced in the separation module. Channel attention units are applied recursively on the feature map of each layer of the network, enabling the network to perform long-sequence separation. Experimental results show that after the introduction of the channel attention, the performance of the proposed algorithm has a stable improvement compared with the baseline system. 
On the MUSDB18 dataset, the average score of the separated audio exceeds that of the current best-performing music separation algorithm based on the time-frequency domain (T-F domain).</p><p>The eighth article proposes a deep learning method with generic HRTF amplitudes and anthropometric parameters as input features for individual HRTF generation [<span>8</span>]. By designing fully convolutional neural networks, the key anthropometric parameters and the generic HRTF amplitudes were used to predict each individual HRTF amplitude spectrum in the full-space directions, and the interaural time delay (ITD) was predicted by the transformer module. In the amplitude prediction model, the attention mechanism was adopted to better capture the relationship of HRTF amplitude spectra at two distinctive directions with large angle differences in space. Finally, with the minimum phase model, the predicted amplitude spectrum and ITDs were used to obtain a set of individual head-related impulse responses. Besides the separate training of the HRTF amplitude and ITD generation models, their joint training was also considered and evaluated. The root-mean-square error and the log-spectral distortion were selected as objective measurement metrics to evaluate the performance. Subjective experiments further showed that the auditory source localisation performance of the proposed method was better than other methods in most cases.</p><p>The ninth article utilises the skinned multi-personlinear (SMPL) model and propose a method using the Skeleton-aware Implicit Function (SIF) [<span>9</span>]. To alleviate the broken or disembodied body parts, the proposed skeleton-aware structure prior makes the skeleton awareness into an implicit function, which consists of a bone-guided sampling strategy and a skeleton-relative encoding strategy. 
To deal with the missing details and depth ambiguity problems, the authors' body-guided pixel-aligned feature exploits the SMPL to enhance 2D normal and depth semantic features, and the proposed feature aggregation uses the extra geometry-aware prior to enabling a more plausible merging with less noisy geometry. Additionally, SIF is also adapted to the RGB-D input, and experimental results show that SIF outperforms the state-of-the-arts methods on challenging datasets from Twindom and Thuman3.0.</p><p>The 10th article presents an approach based on Media Convergence and Graph convolution Encoder Clustering (MCGEC) for TCM clinical data [<span>10</span>]. It feeds modal information and graph structure from media information into a multi-modal graph convolution encoder to obtain the media feature representation learnt from multiple modalities. MCGEC captures latent information from various modalities by fusion and optimises the feature representations and network architecture with learnt clustering labels. The experiment is conducted on real-world multi-modal TCM clinical data, including information like images and text. MCGEC has improved clustering results compared to the generic single-modal clustering methods and the current more advanced multi-modal clustering methods. MCGEC applied to TCM clinical datasets can achieve better results. Integrating multimedia features into clustering algorithms offers significant benefits compared to single-modal clustering approaches that simply concatenate features from different modalities. It provides practical technical support for multi-modal clustering in the TCM field incorporating multimedia features.</p><p>Overall, the articles accepted cover a wide spectrum of problem providing readers with a perspective on the underlying problem in both breadth and depth. 
We would like to thank all the authors and reviewers again for their contributions.</p>\",\"PeriodicalId\":46211,\"journal\":{\"name\":\"CAAI Transactions on Intelligence Technology\",\"volume\":\"8 2\",\"pages\":\"285-287\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2023-06-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12250\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"CAAI Transactions on Intelligence Technology\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1049/cit2.12250\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"CAAI Transactions on Intelligence Technology","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/cit2.12250","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Guest Editorial: Special issue on media convergence and intelligent technology in the metaverse
The metaverse is a new type of Internet application and social form that integrates a variety of new technologies, including artificial intelligence, digital twins, blockchain, cloud computing, virtual reality, robotics, brain-computer interfaces, and 5G. Media convergence technology is a systematic, comprehensive discipline that applies the theories and methods of modern science and technology to media innovation, spanning multimedia creation, production, communication, service, consumption, and reproduction. The emergence of new technologies such as deep learning, distributed computing, and extended reality has promoted the development of media convergence in the metaverse, and these technologies are key drivers of the Internet's current transformation towards the metaverse.
This Special Issue collects research on the application of media convergence and intelligent technology in the metaverse, focussing on the theory and technology of deep-learning-based intelligent generation of multimedia content, privacy-preserving intelligent recommendation algorithms for media content, prediction models of multimedia communication based on big data analysis, immersive experience technology (VR/AR) in the metaverse and multimedia communication, resource allocation algorithms for ultra-high-definition video transmission and storage over 5G/6G mobile Internet, and neural-network-based media content encryption algorithms. Both original research and review articles were welcome.
The first article defines a comprehensive information loss that accounts for both the suppression of records and the relationships between sensitive attributes [1]. A heuristic method is used to discover the optimal anonymity scheme, namely the one with the lowest comprehensive information loss. The experimental results verify the practicality of the proposed data publishing method for multiple sensitive attributes and show that it preserves more information utility than previous methods.
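The trade-off involved, suppressing whole records versus generalising attribute values, can be illustrated with a toy loss function. This is a hedged sketch: the function name, the alpha weighting, and the normalisation are illustrative assumptions, not the definition given in [1].

```python
def comprehensive_loss(n_total, n_suppressed, gen_widths, domain_sizes, alpha=0.5):
    """Illustrative blend of suppression loss and generalisation loss."""
    # Suppression loss: fraction of records removed outright.
    suppression_loss = n_suppressed / n_total
    # Generalisation loss: interval width normalised by attribute domain size,
    # averaged over the quasi-identifier attributes.
    gen_loss = sum(w / d for w, d in zip(gen_widths, domain_sizes)) / len(gen_widths)
    # alpha trades off the two components.
    return alpha * suppression_loss + (1 - alpha) * gen_loss

# 10 of 100 records suppressed; two quasi-identifiers generalised to
# intervals covering 2/10 and 5/50 of their respective domains.
print(comprehensive_loss(100, 10, [2, 5], [10, 50]))
```

A heuristic search, as in the article, would then compare candidate anonymity schemes by this score and keep the minimiser.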
The second article addresses the poor segmentation performance of existing models on imbalanced data sets with small-scale samples by designing a bilateral U-Net model with a spatial attention mechanism [2]. The model uses the lightweight MobileNetV2 as the backbone network for hierarchical feature extraction and proposes an Attentive Pyramid Spatial Attention (APSA) module which, compared with the Attenuated Spatial Pyramid module, increases the receptive field and enhances the information. A context fusion prediction branch that fuses high-semantic and low-semantic prediction results is then added, and the model effectively improves segmentation accuracy on small data sets. Experimental results on the CamVid data set show that, compared with some existing semantic segmentation networks, the algorithm achieves better segmentation quality and accuracy, with an mIOU of 75.85%. Moreover, to verify the generality of the model and the effectiveness of the APSA module, experiments were conducted on the VOC 2012 data set, where the APSA module improved mIOU by about 12.2%.
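The mIOU figure quoted here is the mean intersection-over-union across classes. As a small self-contained illustration (not the authors' code), it can be computed from predicted and ground-truth label maps as follows:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes; classes absent from
    both prediction and ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class c does not occur at all
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 6-pixel example with 3 classes.
pred   = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, 3))
```

Because each class contributes equally regardless of its pixel count, mIOU is a sensible headline metric for the imbalanced data sets the article targets.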
The third article builds on the dendritic neural model (DNM), which mimics the non-linearity of synapses in the human brain to simulate the information processing mechanisms and procedures of neurons [3]. This enhances both the understanding of biological nervous systems and the applicability of the model in various fields. However, the existing DNM suffers from high complexity and limited generalisation capability. To address these issues, the authors propose a DNM pruning method with dendrite-layer significance constraints. The method not only evaluates the significance of dendrite layers but also concentrates the significance of the trained model into a few dendrite layers, allowing low-significance dendrite layers to be removed. Simulation experiments on six UCI datasets demonstrate that the method surpasses existing pruning methods in terms of network size and generalisation performance.
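The core pruning idea, scoring dendrite layers and discarding the least significant ones, can be sketched as below. The sum-of-absolute-weights score and the keep ratio are illustrative assumptions; the article's significance constraint is more involved than this [3].

```python
def prune_by_significance(layer_weights, keep_ratio=0.5):
    """Rank dendrite layers by a simple significance score (sum of
    absolute synaptic weights -- an illustrative proxy) and return the
    sorted indices of the layers to keep."""
    scores = [sum(abs(w) for w in layer) for layer in layer_weights]
    k = max(1, int(len(layer_weights) * keep_ratio))
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(keep)

# Four dendrite layers; the second and fourth carry near-zero weight.
layers = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [0.0, 0.1]]
print(prune_by_significance(layers, keep_ratio=0.5))
```

The retained indices identify the high-significance layers; the rest would be removed before fine-tuning, shrinking network size as the article reports.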
The fourth article proposes a semantic- and emotion-based dual latent variable generation model (Dual-LVG) for dialogue systems, which can generate appropriate emotional responses without an emotional dictionary [4]. Unlike previous work, the conditional variational auto-encoder (CVAE) adopts the standard transformer structure. Dual-LVG then regularises the CVAE latent space by introducing a dual latent space of semantics and emotion; the content diversity and emotional accuracy of the generated responses are improved by learning emotional and semantic features separately. Moreover, an average attention mechanism is adopted to better extract semantic features at the sequence level, and a semi-supervised attention mechanism is used in the decoding step to strengthen the fusion of the model's emotional features. Experimental results show that Dual-LVG can generate different content by controlling emotional factors.
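The dual latent space can be pictured as two independently reparameterised samples, one semantic and one emotional, concatenated before decoding. A minimal numpy sketch with illustrative dimensions, making no claim to match the Dual-LVG implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_var):
    """Standard CVAE reparameterisation trick: z = mu + sigma * eps."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(len(mu))

# Semantic and emotion codes are sampled from separate latent spaces
# (dimensions 4 and 2 here are arbitrary) and concatenated for the decoder.
z_sem = reparameterise(np.zeros(4), np.zeros(4))
z_emo = reparameterise(np.ones(2), -np.ones(2))
z = np.concatenate([z_sem, z_emo])
print(z.shape)
```

Keeping the two codes separate is what lets the model vary the emotion component while holding the semantic component fixed, producing different content under controlled emotional factors.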
The fifth article proposes RDDCNN, which contains three blocks: a deformable block (DB), an enhanced block (EB) and a residual block (RB) [5]. The DB extracts more representative noise features via a deformable learnable kernel and a stacked convolutional architecture, according to the relations of surrounding pixels. The EB facilitates contextual interaction through a dilated convolution and a novel combination of convolutional layers, batch normalisation (BN) and ReLU, which enhances the learning ability of RDDCNN. To address the long-term dependency problem, the RB is used to enhance the memory of shallow layers in deep layers and to construct a clean image. The authors also implement a blind denoising model. Experimental results demonstrate that the denoising model outperforms popular denoising methods in both qualitative and quantitative analysis.
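The claim that a dilated convolution widens context cheaply is easy to verify: for stride-1 convolutions, each layer adds (k - 1) * d pixels to the receptive field. A short sketch (the five-layer configuration is illustrative, not RDDCNN's actual one):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 convolutions: each layer
    with kernel size k and dilation d adds (k - 1) * d."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Five 3x3 layers, one dilated by 4, versus five plain 3x3 layers.
print(receptive_field([3] * 5, [1, 1, 4, 1, 1]))
print(receptive_field([3] * 5, [1] * 5))
```

A single dilated layer thus buys substantially more context than an extra plain layer would, at the same parameter cost.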
The sixth article presents a framework emphasising requirements engineering (RE) activities: structuring deep learning (DL) development with a transformation problem frame and analysing important data assumptions based on the framed physical phenomena [6]. The framework then links the RE activities through metamorphic relations (MRs) to quantitatively assess DL solutions. A case study on MSDGC's combined sewer overflow (CSO) predictions demonstrates the applicability and viability of the framework; in particular, it shows the appropriateness of the MRs derived from RE activities as well as the ways the MRs should be operated. The framework also offers insights into the strengths and weaknesses of three RNN implementations: LSTM, GRU, and IndRNN.
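Metamorphic relations are checked by a generic loop: transform each input, rerun the model, and test whether the pair of outputs satisfies the relation. The harness below is a hedged sketch; the toy model and the monotonicity relation are invented examples, not the MRs used for the CSO study [6].

```python
def check_metamorphic_relation(model, inputs, transform, relation):
    """Return the inputs for which (model(x), model(transform(x)))
    violates the given relation."""
    violations = []
    for x in inputs:
        if not relation(model(x), model(transform(x))):
            violations.append(x)
    return violations

# Hypothetical MR: doubling the scalar input of a monotone model must
# not decrease its output.
model = lambda x: 2 * x + 1
violations = check_metamorphic_relation(
    model, inputs=[0, 1, 5],
    transform=lambda x: 2 * x,
    relation=lambda y0, y1: y1 >= y0)
print(violations)
```

The fraction of violated cases gives the quantitative assessment the framework uses to compare DL solutions without needing ground-truth labels for the transformed inputs.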
The seventh article shows how the performance of an end-to-end music separation algorithm can be enhanced by improving the network structure [7]. The main contributions are as follows: (1) a more reasonable densely connected U-Net is designed to capture long-term characteristics of music, such as the main melody and tone; (2) on this basis, multi-head attention and a dual-path transformer are introduced in the separation module. Channel attention units are applied recursively to the feature map of each layer of the network, enabling the network to perform long-sequence separation. Experimental results show that, after the introduction of channel attention, the proposed algorithm improves consistently over the baseline system. On the MUSDB18 dataset, the average score of the separated audio exceeds that of the current best-performing music separation algorithm based on the time-frequency (T-F) domain.
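A channel attention unit of the squeeze-and-excitation flavour gates each channel by a weight derived from its global statistics. The sketch below uses fixed toy projection matrices in place of trained parameters and only illustrates the mechanism, not the article's actual unit:

```python
import numpy as np

def channel_attention(feats, reduction=2):
    """Gate each channel of a (channels, time) feature map. The two
    projection matrices are fixed toy values standing in for trained
    parameters."""
    c, _ = feats.shape
    squeeze = feats.mean(axis=1)                      # global average pool, (c,)
    w1 = np.full((c // reduction, c), 1.0 / c)        # bottleneck down-projection
    w2 = np.full((c, c // reduction), reduction / c)  # up-projection
    hidden = np.maximum(w1 @ squeeze, 0.0)            # ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))       # sigmoid gate per channel
    return feats * gate[:, None]

out = channel_attention(np.ones((4, 3)))
print(out.shape)
```

Because the gate depends only on per-channel averages, the unit's cost is independent of sequence length, which is what makes recursive application across layers affordable for long sequences.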
The eighth article proposes a deep learning method with generic HRTF amplitudes and anthropometric parameters as input features for individual HRTF generation [8]. By designing fully convolutional neural networks, the key anthropometric parameters and the generic HRTF amplitudes were used to predict each individual HRTF amplitude spectrum in the full-space directions, and the interaural time delay (ITD) was predicted by a transformer module. In the amplitude prediction model, an attention mechanism was adopted to better capture the relationship between HRTF amplitude spectra at two distinct directions with large angular differences in space. Finally, with the minimum-phase model, the predicted amplitude spectra and ITDs were used to obtain a set of individual head-related impulse responses. In addition to the separate training of the HRTF amplitude and ITD generation models, their joint training was also considered and evaluated. The root-mean-square error and the log-spectral distortion were selected as objective metrics to evaluate the performance. Subjective experiments further showed that the auditory source localisation performance of the proposed method was better than that of other methods in most cases.
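The minimum-phase model used in the final step has a standard construction: a magnitude spectrum is converted to a minimum-phase impulse response via the real cepstrum. A minimal NumPy sketch of that standard construction follows (the article's own pipeline additionally applies the predicted ITDs as interaural delays, which is omitted here):

```python
import numpy as np

def minimum_phase_ir(mag):
    """Minimum-phase impulse response from a magnitude spectrum via
    real-cepstrum folding. `mag` is a full-length, even-N FFT magnitude
    (Hermitian-symmetric, as for a real signal)."""
    n = len(mag)
    cep = np.fft.ifft(np.log(np.maximum(mag, 1e-12))).real  # real cepstrum
    fold = np.zeros(n)
    fold[0] = cep[0]
    fold[1:n // 2] = 2.0 * cep[1:n // 2]   # fold negative quefrencies
    fold[n // 2] = cep[n // 2]
    # Exponentiating the folded cepstrum yields the minimum-phase spectrum.
    return np.fft.ifft(np.exp(np.fft.fft(fold))).real
```

By construction, the magnitude response of the returned impulse response matches the input magnitude, while the phase is the unique minimum-phase counterpart.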
The ninth article utilises the skinned multi-person linear (SMPL) model and proposes a method using the Skeleton-aware Implicit Function (SIF) [9]. To alleviate broken or disembodied body parts, the proposed skeleton-aware structure prior builds skeleton awareness into an implicit function, which consists of a bone-guided sampling strategy and a skeleton-relative encoding strategy. To deal with the missing-detail and depth-ambiguity problems, the authors' body-guided pixel-aligned feature exploits the SMPL to enhance 2D normal and depth semantic features, and the proposed feature aggregation uses an extra geometry-aware prior to enable a more plausible merging with less noisy geometry. Additionally, SIF is also adapted to RGB-D input, and experimental results show that SIF outperforms state-of-the-art methods on challenging datasets from Twindom and Thuman3.0.
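A skeleton-relative encoding typically conditions the implicit function on how a query point sits relative to the bones. As an illustrative sketch only, assuming the simplest such feature, the point-to-bone-segment distance, rather than the article's full encoding:

```python
import numpy as np

def point_to_bone_distance(p, a, b):
    """Distance from query point `p` to the bone segment (a, b): the
    kind of skeleton-relative feature an implicit occupancy function
    can be conditioned on to keep body parts attached to the skeleton."""
    ab, ap = b - a, p - a
    t = np.clip(ap @ ab / (ab @ ab), 0.0, 1.0)  # projection clamped to segment
    return np.linalg.norm(p - (a + t * ab))
```

Concatenating such distances (one per bone) with pixel-aligned image features gives the implicit function an explicit notion of the skeleton at every sampled point.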
The tenth article presents an approach based on Media Convergence and Graph convolution Encoder Clustering (MCGEC) for TCM clinical data [10]. It feeds modal information and the graph structure from media information into a multi-modal graph convolution encoder to obtain a media feature representation learnt from multiple modalities. MCGEC captures latent information from the various modalities by fusion and optimises the feature representations and network architecture with the learnt clustering labels. Experiments are conducted on real-world multi-modal TCM clinical data, including images and text. MCGEC improves clustering results over both generic single-modal clustering methods and more advanced multi-modal clustering methods. Integrating multimedia features into the clustering algorithm offers significant benefits compared with single-modal approaches that simply concatenate features from different modalities, and provides practical technical support for multi-modal clustering in the TCM field.
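The graph convolution encoder at the heart of MCGEC is built from standard graph-convolution layers. The sketch below shows one such layer (symmetric normalisation with self-loops, propagation, then a linear map with ReLU) as a minimal NumPy illustration of the building block, not the MCGEC encoder itself; the weight matrix is an assumption:

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: add self-loops, symmetrically
    normalise the adjacency, propagate node features along edges,
    then apply a linear map and ReLU."""
    a_hat = adj + np.eye(adj.shape[0])             # self-loops keep own features
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # D^{-1/2}
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ weight, 0.0)
```

Stacking such layers over a patient-similarity graph lets each node's representation absorb information from its neighbours, which is what allows the encoder to fuse features from multiple modalities before clustering.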
Overall, the accepted articles cover a wide spectrum of problems, providing readers with a perspective on the underlying issues in both breadth and depth. We would like to thank all the authors and reviewers again for their contributions.
Journal introduction:
CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI), providing research that is openly accessible to read and share worldwide.