Neural image compression (NIC) has shown remarkable rate-distortion (R-D) efficiency. However, the considerable computational and spatial complexity of most NIC methods presents deployment challenges on resource-constrained devices. We introduce a lightweight neural image compression framework designed to efficiently process both local and global information. In this framework, a convolutional branch extracts local information, whereas a frequency-domain branch extracts global information. To capture global information without the high computational cost of dense pixel operations such as attention mechanisms, the Fourier transform is employed, allowing global information to be manipulated in the frequency domain. Additionally, we employ feature shift operations to acquire large receptive fields at no computational cost, circumventing the need for large-kernel convolution. Our framework achieves a superior balance between rate-distortion performance and complexity. On test sets of varying resolutions, our method not only achieves R-D performance on par with versatile video coding (VVC) intra and other state-of-the-art (SOTA) NIC methods but also has the lowest computational requirements, at approximately 200 KMACs/pixel. The code will be available at https://github.com/baoyu2020/SFNIC.
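The two complexity-saving ideas in this abstract, global mixing in the frequency domain and zero-cost feature shifts, can be illustrated with a minimal NumPy sketch. The shapes, the identity filter, and the four-way channel split below are illustrative assumptions, not the actual SFNIC layers:

```python
import numpy as np

def frequency_mix(feat, weight):
    """Global mixing in the frequency domain: a pointwise product of the
    2-D FFT of a feature map with a per-frequency filter touches every
    spatial position at O(HW log HW) cost, unlike dense attention's
    O((HW)^2) pairwise interactions."""
    spec = np.fft.rfft2(feat, axes=(-2, -1))          # (C, H, W//2 + 1), complex
    spec *= weight                                    # learnable per-frequency gain
    return np.fft.irfft2(spec, s=feat.shape[-2:], axes=(-2, -1))

def feature_shift(feat):
    """Shift channel groups by one pixel in four directions; this costs no
    multiply-accumulate operations yet enlarges the receptive field of the
    following layer."""
    out = feat.copy()
    c = feat.shape[0] // 4
    out[0 * c:1 * c] = np.roll(feat[0 * c:1 * c],  1, axis=-1)   # right
    out[1 * c:2 * c] = np.roll(feat[1 * c:2 * c], -1, axis=-1)   # left
    out[2 * c:3 * c] = np.roll(feat[2 * c:3 * c],  1, axis=-2)   # down
    out[3 * c:4 * c] = np.roll(feat[3 * c:4 * c], -1, axis=-2)   # up
    return out

feat = np.random.rand(8, 16, 16)
weight = np.ones((8, 16, 9))      # identity filter: output should equal input
mixed = frequency_mix(feat, weight)
shifted = feature_shift(feat)
```

With an all-ones filter the frequency branch is an identity map, which is a convenient sanity check that the transform pair is lossless; a learned `weight` would instead reweight frequencies to mix global information.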
"SFNIC: Hybrid Spatial-Frequency Information for Lightweight Neural Image Compression" by Youneng Bao, Wen Tan, Mu Li, Jiacong Chen, Qingyu Mao and Yongsheng Liang. CAAI Transactions on Intelligence Technology, vol. 10, no. 6, pp. 1717–1730, 2025. DOI: 10.1049/cit2.70034.
<p>The integration of artificial intelligence (AI) into medical robotics has emerged as a cornerstone of modern healthcare, driving transformative advancements in precision, adaptability and patient outcomes. Although computational tools have long supported diagnostic processes, their role is evolving beyond passive assistance towards active collaboration in therapeutic decision-making. In this paradigm, knowledge-driven deep learning systems are redefining possibilities, enabling robots to interpret complex data, adapt to dynamic clinical environments and execute tasks with human-like contextual awareness.</p><p>The purpose of this special issue is to showcase the latest developments in the application of AI technology in medical robots. The topics include, but are not limited to, passive data adaptation, force feedback tracking, image processing and diagnosis, surgical navigation and exoskeleton systems. These studies cover a wide range of application scenarios for medical robots, with the ultimate goal of maximising AI autonomy.</p><p>We received 31 submissions from around the world and, after a rigorous peer review process, selected 9 papers for publication. The selected papers cover a variety of fascinating research topics, each achieving key breakthroughs in its field. We believe these papers offer guidance for their research areas and can help researchers better understand current trends. Sincere thanks to the authors who chose our platform and to all the staff who assisted in publishing these papers.</p><p>In the article ‘Model adaptation via credible local context representation’, Tang et al. pointed out that conventional model transfer techniques require labelled source data, which makes them inapplicable in privacy-sensitive medical domains.
To address these critical problems of source-free domain adaptation (SFDA), they proposed a credible local context representation (CLCR) method that significantly enhances model generalisation through geometric structure mining in feature space. The method constructs a two-stage learning framework: in the pretraining stage of the source model, a data-enhanced mutual-information regularisation term is introduced to strengthen the model's learning of sample-discriminative features; in the target-domain adaptation phase, a deep-space fixed-step walking strategy dynamically captures the locally credible contextual features of each target sample and uses them as pseudo-labels for semantic fusion. Experiments on the three benchmark datasets Office-31, Office-Home and VisDA show that CLCR achieves an average accuracy of 89.2% over 12 cross-domain tasks, 3.1% higher than the best existing SFDA method, and it even surpasses some domain adaptation methods that require access to source data. This work provides a new approach to resolving the privacy-performance conflict in cross-institutional model transfer in healthcare, and its context-discovery mechanism is of general significance for unsupervised representation learning.</p>
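The core idea of pseudo-labelling from a sample's local context in feature space can be sketched generically: let each target sample's nearest neighbours vote on its label via the classifier's soft predictions. This is a minimal k-NN illustration of the concept, not the authors' exact CLCR algorithm:

```python
import numpy as np

def knn_pseudo_labels(features, class_probs, k=3):
    """Average the classifier's probabilities over each sample's k nearest
    neighbours in feature space (its 'local context') and take the argmax
    as a context-refined pseudo-label."""
    # pairwise squared Euclidean distances between all samples
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)              # a sample is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]       # indices of the k nearest neighbours
    fused = class_probs[idx].mean(axis=1)     # semantic fusion over the context
    return fused.argmax(axis=1)

# two well-separated clusters; neighbourhood voting recovers clean labels
np.random.seed(0)
feats = np.vstack([np.random.randn(20, 4), np.random.randn(20, 4) + 8.0])
probs = np.vstack([np.tile([0.9, 0.1], (20, 1)), np.tile([0.1, 0.9], (20, 1))])
labels = knn_pseudo_labels(feats, probs, k=3)
```

Because neighbours in feature space almost always share a class here, the fused pseudo-labels match the cluster structure even when individual predictions are noisy.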
In the article ‘Human–robot collaborative approach for uncertain surface scanning’, Zhao et al. introduce a human–robot collaboration framework for scanning surfaces of uncertain geometry that combines teleoperation with adaptive force control. The system lets the operator guide the scanning trajectory remotely while an admittance controller maintains a constant contact force through real-time stiffness adjustment, achieving ±1 N tracking accuracy on surfaces of unknown stiffness. Automatic tool reorientation is triggered when the angular deviation exceeds 5°, and friction-compensated force sensing ensures perpendicular alignment. Experimental validation with a simulated ultrasound probe showed a 63% workload reduction compared with pure teleoperation and successful handling of sponge and spring-supported phantoms. The hybrid control architecture decouples human guidance from robot compliance, allowing simultaneous x–y-axis motion control and z-axis force regulation without prior environment modelling. By coupling human intuition with robotic precision, the approach is particularly valuable for medical scanning applications that require safe tissue interaction.

In the study ‘AESR3D: a 3D over-complete autoencoder for trabecular CT super-resolution’, Zhang et al. propose AESR3D, a 3D over-complete autoencoder framework that addresses the limitations of osteoporosis diagnosis by enhancing low-resolution trabecular CT scans. The current reliance on bone mineral density (BMD) overlooks the microstructural deterioration that is critical to biomechanical strength. AESR3D combines a CNN–transformer hybrid architecture with dual-task regularisation, jointly optimising super-resolution reconstruction and low-resolution restoration to prevent overfitting while recovering structural detail. The model achieves state-of-the-art performance (SSIM: 0.996) and shows strong agreement with high-resolution ground truth on trabecular metrics (ICC = 0.917). By integrating unsupervised k-means segmentation, it can precisely visualise bone microstructure without labelled data. AESR3D surpasses existing medical/natural-image SR methods, bridging micro-CT research and clinical CT applications and offering a non-invasive tool for enhanced osteoporosis assessment with improved diagnostic accuracy in bone quality evaluation.

In the paper ‘Segmentation versus detection: development and evaluation of deep learning models for PIRADS lesion localisation on biparametric prostate MRI’, Min et al. address the key challenge of automated prostate cancer detection in biparametric MRI (bp-MRI) through a rigorous comparison of segmentation (nnUNet) and object-detection (nnDetection) deep learning approaches. Prostate cancer, a leading cause of death in men, demands precise early diagnosis, yet MRI interpretation remains radiologist-dependent and time-intensive. The authors introduce novel lesion-level sensitivity and precision metrics that overcome the limitations of conventional voxel-wise evaluation and propose ensemble methods that combine the strengths of the two models. The results show that nnDetection achieves higher lesion-level sensitivity (80.78% vs. 60.40% at three false positives for PIRADS ≥ 3 lesions), whereas nnUNet is more accurate at the voxel level (DSC 0.46 vs. 0.35). Ensemble techniques further improve performance, reaching 82.24% lesion-level sensitivity, underscoring their potential to balance detection robustness and spatial accuracy. Validated on an external dataset, the framework highlights the clinical feasibility of combining the segmentation and detection paradigms, particularly for MRI-guided biopsies that demand high sensitivity. This work advances computer-aided diagnosis by bridging a methodological gap and providing metrics aligned with clinical priorities, offering a scalable route to improved prostate cancer management through AI-driven lesion localisation.

In the paper ‘Needle detection and localisation for robot-assisted subretinal injection using deep learning’, Zhou et al. address the key challenge of precise needle detection and localisation in robot-assisted subretinal injection, a high-risk ophthalmic procedure requiring micron-level precision. Using microscope-integrated optical coherence tomography (MI-OCT), the authors propose a robust framework combining ROI cropping with deep learning to overcome the limitations of manual needle tracking caused by tissue deformation and specular noise. Five convolutional neural network architectures were evaluated; the best-performing model (network II) achieved a 100% detection success rate on ex vivo pig eyes and localised needle segments with an Intersection-over-Union of 0.55. By analysing bounding-box edges, the method estimates depth with an error below 10 μm, which is critical for navigating fragile retinal layers. Integrating adjacent OCT scans enhances spatial context awareness, outperforming methods based on geometric features. This work advances intraoperative image-guided robotics by enabling real-time, deformation-resistant needle tracking, potentially reducing surgical risk in gene therapy and the treatment of subretinal haemorrhage. The validated framework fills a key gap in ophthalmic robotics, offering a path towards safer and more precise robotic interventions in retinal surgery.

In the paper ‘Automatic feature point extraction method for pelvic surfaces based on PointMLP_RegNet’, Kou et al. note that in robot-assisted fracture reduction, accurately extracting anatomical landmarks from complex pelvic structures is essential for improving 3D/3D registration accuracy. To address the challenges of manual and conventional automated methods, the study introduces PointMLP_RegNet, a deep learning framework based on PointMLP that predicts the spatial coordinates of ten pelvic landmarks by replacing the classification layer with a regression module. Trained on a clinical dataset of 40 CT-reconstructed point clouds augmented by downsampling, translation, rotation and noise injection, the model demonstrated robust performance under leave-one-out cross-validation. The results show sub-5-mm accuracy for all landmarks, with 80% of errors below 4 mm, surpassing PointNet++ and PointNet in accuracy (20%–30% lower mean error) while maintaining superior computational efficiency (0.688 M parameters). By automating feature extraction, the method minimises human variability, streamlines intraoperative registration and improves the reliability of surgical planning. This innovation bridges a technical gap in pelvic fracture robotics, offers a scalable solution for clinical application and underscores the transformative potential of customised deep learning architectures in orthopaedic navigation systems. Gao et al.
"Guest Editorial: Special Issue on AI Technologies and Applications in Medical Robots" by Xiaozhi Qi, Zhongliang Jiang, Ying Hu and Jianwei Zhang. CAAI Transactions on Intelligence Technology, vol. 10, no. 3, pp. 635–637, 2025. DOI: 10.1049/cit2.70019.
Helong Yu, Jiayao Zhao, Chun Guang Bi, Lei Shi, Huiling Chen
The 100-kernel weight of corn seed is a crucial metric for assessing corn quality, and current measurement practice mostly involves manually counting kernels and then weighing them on a balance, which is labour-intensive and time-consuming. To address the low efficiency of measuring the 100-kernel weight of corn seeds, this study proposes a measurement method based on deep learning and machine vision. High-contrast camera technology was used to capture image data of corn seeds, and the feature extraction network of the YOLOv5 model was improved by incorporating the MobileNetV3 network structure. The novel model employs depthwise separable convolution to decrease parameters and computational load, incorporates a linear bottleneck and inverted residual structure to enhance efficiency, introduces an SE attention mechanism for direct learning of channel-wise features and updates the activation function. Algorithms and experiments were then designed to calculate the 100-kernel weight from the model's output. The results revealed that the enhanced model achieved an accuracy of 90.1%, a recall of 91.3% and a mAP (mean average precision) of 92.2%. While meeting production requirements, the model significantly reduces the number of parameters compared to alternative models, to 50% of the original model. In an applied study on measuring the 100-kernel weight of corn seeds, counting accuracy reached a remarkable 97.18%, while weight measurement accuracy reached 94.2%. This study achieves both efficient and precise measurement of the 100-kernel weight of maize seeds, presenting a novel perspective on the study of maize seed weight.
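The parameter saving from depthwise separable convolution, the MobileNetV3 building block mentioned above, is easy to verify by counting weights. The layer sizes below are illustrative, not taken from the paper:

```python
def standard_conv_params(k, c_in, c_out):
    """Weight count of a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """A k x k depthwise convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 64, 128)        # 9 * 64 * 128 = 73,728
sep = depthwise_separable_params(3, 64, 128)  # 576 + 8,192 = 8,768
ratio = sep / std                             # roughly an 8x reduction
```

The reduction factor approaches 1/k² + 1/c_out for large channel counts, which is why swapping standard convolutions for depthwise separable ones shrinks a detector substantially at modest accuracy cost.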
"A Lightweight YOLOv5 Target Detection Model and Its Application to the Measurement of 100-Kernel Weight of Corn Seeds" by Helong Yu, Jiayao Zhao, Chun Guang Bi, Lei Shi and Huiling Chen. CAAI Transactions on Intelligence Technology, vol. 10, no. 5, pp. 1521–1534, 2025. DOI: 10.1049/cit2.70031.
Tibetan medical named entity recognition (Tibetan MNER) involves extracting specific types of medical entities from unstructured Tibetan medical texts and provides important data support for work related to Tibetan medicine. However, existing Tibetan MNER methods often struggle to comprehensively capture multi-level semantic information, failing to sufficiently extract multi-granularity features and to effectively filter out irrelevant information, which ultimately impacts the accuracy of entity recognition. This paper proposes an improved embedding representation method called syllable–word–sentence embedding. By leveraging features at different granularities and using un-scaled dot-product attention to focus on key features for feature fusion, the syllable–word–sentence embedding is integrated into the transformer, enhancing the specificity and diversity of feature representations. The model leverages multi-level and multi-granularity semantic information, thereby improving the performance of Tibetan MNER. We evaluate the proposed model on datasets from various domains. The results indicate that the model effectively identifies three types of entities in the Tibetan news dataset we constructed, achieving an F1 score of 93.59%, an improvement of 1.24% over the vanilla FLAT. Additionally, results on the Tibetan medical dataset we developed show that it is effective in identifying five kinds of medical entities, with an F1 score of 71.39%, a 1.34% improvement over the vanilla FLAT.
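Un-scaled dot-product attention, the fusion mechanism named in the abstract, differs from standard attention only in omitting the 1/√d divisor, which sharpens the softmax over the granularity scores. A generic NumPy sketch follows; the per-granularity feature vectors and the choice of the word-level feature as query are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def unscaled_attention(query, keys, values):
    """Dot-product attention WITHOUT the usual 1/sqrt(d) scaling: raw
    similarity scores are softmax-normalised into fusion weights over the
    granularities, and the values are averaged accordingly."""
    scores = keys @ query              # one score per granularity
    weights = softmax(scores)
    return weights @ values, weights

np.random.seed(1)
d = 8
syllable, word, sentence = np.random.randn(3, d)   # one feature per granularity
keys = np.stack([syllable, word, sentence])
fused, weights = unscaled_attention(word, keys, keys)
```

With the word-level feature as query, its self-similarity score dominates, so the fusion weights favour the word granularity while still mixing in syllable- and sentence-level information.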
"Tibetan Medical Named Entity Recognition Based on Syllable-Word-Sentence Embedding Transformer" by Jin Zhang, Ziyue Zhang, Lobsang Yeshi, Dorje Tashi, Xiangshi Wang, Yuqing Cai, Yongbin Yu, Xiangxiang Wang, Nyima Tashi and Gadeng Luosang. CAAI Transactions on Intelligence Technology, vol. 10, no. 4, pp. 1148–1158, 2025. DOI: 10.1049/cit2.70029.
Sicheng Wang, Milutin N. Nikolić, Tin Lun Lam, Qing Gao, Runwei Ding, Tianwei Zhang
Visual perception is critical in robotic operations, particularly in collaborative and autonomous robot systems. Through efficient visual systems, robots can acquire and process environmental information in real-time, recognise objects, assess spatial relationships, and make adaptive decisions. This review aims to provide a comprehensive overview of the latest advancements in the field of vision as applied to robotic perception, focusing primarily on visual applications in the areas of object perception, self-perception, human–robot collaboration, and multi-robot collaboration. By summarising the current state of development and analysing the challenges and opportunities that remain in these areas, this paper offers a thorough examination of the integration of visual perception with operational robotics. It further inspires future research and drives the application and development of visual perception across various robotic domains, enabling operational robots to better adapt to complex environments and reliably accomplish tasks.
"Robot Manipulation Based on Embodied Visual Perception: A Survey" by Sicheng Wang, Milutin N. Nikolić, Tin Lun Lam, Qing Gao, Runwei Ding and Tianwei Zhang. CAAI Transactions on Intelligence Technology, vol. 10, no. 4, pp. 945–958, 2025. DOI: 10.1049/cit2.70022.
Hee Jun Lee, Yang Sok Kim, Won Seok Lee, In Hyeok Choi, Choong Kwon Lee
Selecting appropriate tourist attractions to visit in real time is an important problem for travellers. Since recommenders proactively suggest items based on user preference, they are a promising solution to this problem. Travellers visit tourist attractions sequentially while considering multiple attributes at the same time, so it is desirable to account for this when developing recommenders for tourist attractions. Building on GRU4REC, we propose RNN-based sequence-aware recommenders (RNN-SARs) that use multiple sequence datasets to train the recommender model, named multi-RNN-SARs, in two variants: concatenate-RNN-SARs and parallel-RNN-SARs. To evaluate multi-RNN-SARs, we compared the hit rate (HR) and mean reciprocal rank (MRR) of the item-based collaborative filtering recommender (item-CFR), an RNN-SAR trained on a single-sequence dataset (basic-RNN-SAR), multi-RNN-SARs and state-of-the-art SARs on a real-world travel dataset. Our research shows that multi-RNN-SARs perform significantly better than item-CFR. Not all multi-RNN-SARs outperform basic-RNN-SAR, but the best multi-RNN-SAR achieves performance comparable to that of the state-of-the-art algorithms. These results highlight the importance of using multiple sequence datasets in RNN-SARs and of choosing appropriate sequence datasets and learning methods when implementing multi-RNN-SARs in practice.
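The two evaluation metrics used above, hit rate (HR@k) and mean reciprocal rank (MRR@k), follow standard definitions and can be computed as below. This is a generic implementation with made-up toy data, not the authors' evaluation code:

```python
def hr_and_mrr(recommended_lists, ground_truths, k=10):
    """HR@k: fraction of cases where the true next item appears in the
    top-k recommendations. MRR@k: mean of 1/rank of the true item within
    the top-k list (contributing 0 when absent)."""
    hits, reciprocal_rank_sum = 0, 0.0
    for recs, truth in zip(recommended_lists, ground_truths):
        top_k = recs[:k]
        if truth in top_k:
            hits += 1
            reciprocal_rank_sum += 1.0 / (top_k.index(truth) + 1)
    n = len(ground_truths)
    return hits / n, reciprocal_rank_sum / n

# three test cases: hit at rank 2, miss, hit at rank 1
recs = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
truth = ["b", "q", "m"]
hr, mrr = hr_and_mrr(recs, truth, k=3)   # HR = 2/3, MRR = (1/2 + 1) / 3 = 0.5
```

HR rewards any top-k hit equally, whereas MRR additionally rewards placing the true attraction near the top, which is why the two metrics are reported together.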
"RNN-Based Sequence-Aware Recommenders for Tourist Attractions" by Hee Jun Lee, Yang Sok Kim, Won Seok Lee, In Hyeok Choi and Choong Kwon Lee. CAAI Transactions on Intelligence Technology, vol. 10, no. 4, pp. 1077–1088, 2025. DOI: 10.1049/cit2.70027.
Ziwei Fan, Zhiwen Yu, Kaixiang Yang, Wuxing Chen, Xiaoqing Liu, Guojie Li, Xianling Yang, C. L. Philip Chen
Ensemble learning, a pivotal branch of machine learning, amalgamates multiple base models to enhance the overall performance of predictive models, capitalising on the diversity and collective wisdom of the ensemble to surpass individual models and mitigate overfitting. In this review, a four-layer research framework is established for ensemble learning, offering a comprehensive and structured review of the field from bottom to top. The survey commences by introducing fundamental ensemble learning techniques, including bagging, boosting and stacking, while also exploring ensemble diversity. Deep ensemble learning and semi-supervised ensemble learning are then studied in detail. Furthermore, the use of ensemble learning techniques to navigate challenging datasets, such as imbalanced and high-dimensional data, is discussed. The application of ensemble learning techniques across various research domains, including healthcare, transportation, finance, manufacturing and the Internet, is also examined. The survey concludes by discussing challenges intrinsic to ensemble learning.
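Bagging, the first fundamental technique listed, can be illustrated in a few lines: train base models on bootstrap resamples of the data and aggregate their predictions by majority vote. A toy threshold classifier stands in for a real base learner here; the 1-D dataset is invented for illustration:

```python
import random

def train_stump(sample):
    """Toy base learner: threshold halfway between the two class means.
    Returns None for degenerate resamples containing only one class."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        return None
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def bagging_predict(data, x, n_models=15, seed=42):
    """Train n_models stumps on bootstrap resamples (sampling with
    replacement) and combine their predictions by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # bootstrap resample
        thr = train_stump(sample)
        if thr is not None:
            votes.append(1 if x > thr else 0)
    return 1 if sum(votes) > len(votes) / 2 else 0

# 1-D toy data: class 0 clusters near 0, class 1 clusters near 10
data = [(v, 0) for v in (0.0, 1.0, 1.5, 2.0)] + [(v, 1) for v in (9.0, 9.5, 10.0, 11.0)]
pred_low = bagging_predict(data, 1.0)    # expect class 0
pred_high = bagging_predict(data, 10.0)  # expect class 1
```

Each resample yields a slightly different threshold; voting across these diverse base models is what stabilises the prediction, which is the mechanism the review's discussion of ensemble diversity builds on.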
{"title":"Diverse Models, United Goal: A Comprehensive Survey of Ensemble Learning","authors":"Ziwei Fan, Zhiwen Yu, Kaixiang Yang, Wuxing Chen, Xiaoqing Liu, Guojie Li, Xianling Yang, C. L. Philip Chen","doi":"10.1049/cit2.70030","DOIUrl":"10.1049/cit2.70030","url":null,"abstract":"<p>Ensemble learning, a pivotal branch of machine learning, amalgamates multiple base models to enhance the overarching performance of predictive models, capitalising on the diversity and collective wisdom of the ensemble to surpass individual models and mitigate overfitting. In this review, a four-layer research framework is established for the research of ensemble learning, which can offer a comprehensive and structured review of ensemble learning from bottom to top. Firstly, this survey commences by introducing fundamental ensemble learning techniques, including bagging, boosting, and stacking, while also exploring the ensemble's diversity. Then, deep ensemble learning and semi-supervised ensemble learning are studied in detail. Furthermore, the utilisation of ensemble learning techniques to navigate challenging datasets, such as imbalanced and high-dimensional data, is discussed. The application of ensemble learning techniques across various research domains, including healthcare, transportation, finance, manufacturing, and the Internet, is also examined. 
The survey concludes by discussing challenges intrinsic to ensemble learning.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"10 4","pages":"959-982"},"PeriodicalIF":7.3,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.70030","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
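The bagging technique the survey introduces can be illustrated with a minimal sketch: bootstrap-resample the training set, fit a weak learner on each resample, and aggregate predictions by majority vote. The decision-stump learner and the toy one-dimensional dataset below are illustrative assumptions, not material from the survey.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a resample of the same size as data, with replacement."""
    return [rng.choice(data) for _ in data]

class Stump:
    """1-D threshold classifier: predicts 1 if x >= threshold."""
    def fit(self, data):
        # brute-force the threshold that minimises training error
        best_err, self.t = len(data) + 1, 0.0
        for t in sorted({x for x, _ in data}):
            err = sum((1 if x >= t else 0) != y for x, y in data)
            if err < best_err:
                best_err, self.t = err, t
        return self
    def predict(self, x):
        return 1 if x >= self.t else 0

def bagging_predict(models, x):
    """Aggregate the base models' votes by majority."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
models = [Stump().fit(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagging_predict(models, 0.05), bagging_predict(models, 0.95))  # → 0 1
```

Each stump alone is a weak learner; the vote over 25 bootstrap replicas is the diversity-driven improvement the abstract describes.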
Najmeh Sadat Jaddi, Mohammad Saniee Abadeh, Niousha Bagheri Khoulenjani, Salwani Abdullah, MohammadMahdi Ariannejad, Mohd Zakree Ahmad Nazri, Fatemeh Alvankarian
Prediction of the age of an individual is possible using the changing pattern of DNA methylation with age. In this paper, an age prediction approach for multivariate regression problems using DNA methylation data is developed. Specifically, a convolutional neural network (CNN)-based model optimised by a genetic algorithm (GA) is presented. This paper contributes to enhancing age prediction as a regression problem by using a union of two CNNs and exchanging knowledge between them. This re-starts the training process from a possibly higher-quality point in different iterations and, consequently, potentially yields better results at each iteration. The proposed method, called the cooperative deep neural network (Co-DeepNet), is tested on two types of age prediction problems. Sixteen datasets containing 1899 healthy blood samples and nine datasets containing 2395 diseased blood samples are employed to examine the method's efficiency. The mean absolute deviation (MAD) is 1.49 and 3.61 years for the training and testing data, respectively, on the healthy data. The diseased blood data show MAD results of 3.81 and 5.43 years for the training and testing data, respectively. The results of the Co-DeepNet are compared with six methods proposed in previous studies and a single CNN using four prediction accuracy measurements (R2, MAD, MSE and RMSE). The effectiveness of the Co-DeepNet and the superiority of its results are demonstrated through statistical analysis.
{"title":"Co-DeepNet: A Cooperative Convolutional Neural Network for DNA Methylation-Based Age Prediction","authors":"Najmeh Sadat Jaddi, Mohammad Saniee Abadeh, Niousha Bagheri Khoulenjani, Salwani Abdullah, MohammadMahdi Ariannejad, Mohd Zakree Ahmad Nazri, Fatemeh Alvankarian","doi":"10.1049/cit2.70026","DOIUrl":"10.1049/cit2.70026","url":null,"abstract":"<p>Prediction of the age of each individual is possible using the changing pattern of DNA methylation with age. In this paper an age prediction approach to work out multivariate regression problems using DNA methylation data is developed. In this research study a convolutional neural network (CNN)-based model optimised by the genetic algorithm (GA) is addressed. This paper contributes to enhancing age prediction as a regression problem using a union of two CNNs and exchanging knowledge between them. This specifically re-starts the training process from a possibly higher-quality point in different iterations and, consequently, potentially yields better results at each iteration. The method proposed, which is called cooperative deep neural network (Co-DeepNet), is tested on two types of age prediction problems. Sixteen datasets containing 1899 healthy blood samples and nine datasets containing 2395 diseased blood samples are employed to examine the method's efficiency. As a result, the mean absolute deviation (MAD) is 1.49 and 3.61 years for training and testing data, respectively, when the healthy data is tested. The diseased blood data show MAD results of 3.81 and 5.43 years for training and testing data, respectively. The results of the Co-DeepNet are compared with six other methods proposed in previous studies and a single CNN using four prediction accuracy measurements (<i>R</i><sup>2</sup>, MAD, MSE and RMSE). 
The effectiveness of the Co-DeepNet and superiority of its results is proved through the statistical analysis.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"10 4","pages":"1118-1134"},"PeriodicalIF":7.3,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.70026","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
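The MAD and R² accuracy measurements used to evaluate Co-DeepNet are standard regression metrics; a minimal sketch with hypothetical age values (not the paper's data):

```python
def mean_absolute_deviation(y_true, y_pred):
    """Mean absolute deviation between chronological and predicted ages (years)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# hypothetical true vs. predicted ages for illustration only
ages_true = [25, 40, 33, 58]
ages_pred = [27, 38, 35, 55]
print(mean_absolute_deviation(ages_true, ages_pred))  # → 2.25
```

A lower MAD and an R² closer to 1 both indicate predictions closer to the chronological ages.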
Nonlinear transforms have significantly advanced learned image compression (LIC), particularly through residual blocks. This transform enhances the nonlinear expression ability and obtains a compact feature representation by enlarging the receptive field, which determines how the convolution process extracts features in a high-dimensional feature space. However, its functionality is restricted to the spatial dimension and network depth, limiting further improvements in network performance due to insufficient information interaction and representation. Crucially, the potential of the high-dimensional feature space in the channel dimension and the exploration of network width/resolution remain largely untapped. In this paper, we consider nonlinear transforms from the perspective of feature space, defining high-dimensional feature spaces in different dimensions and investigating their specific effects. Firstly, we introduce dimension-increasing and dimension-decreasing transforms in both the channel and spatial dimensions to obtain a high-dimensional feature space and achieve better feature extraction. Secondly, we design a channel-spatial fusion residual transform (CSR), which incorporates multi-dimensional transforms for a more effective representation. Furthermore, we simplify the proposed fusion transform to obtain a slim architecture (CSR-sm), balancing network complexity and compression performance. Finally, we build the overall network with stacked CSR transforms to achieve better compression and reconstruction. Experimental results demonstrate that the proposed method achieves superior rate-distortion performance compared with existing LIC methods and traditional codecs. Specifically, our proposed method achieves a 9.38% BD-rate reduction over VVC on the Kodak dataset.
{"title":"Exploring High Dimensional Feature Space With Channel-Spatial Nonlinear Transforms for Learned Image Compression","authors":"Wen Tan, Fanyang Meng, Chao Li, Youneng Bao, Yongsheng Liang","doi":"10.1049/cit2.70025","DOIUrl":"10.1049/cit2.70025","url":null,"abstract":"<p>Nonlinear transforms have significantly advanced learned image compression (LIC), particularly using residual blocks. This transform enhances the nonlinear expression ability and obtains compact feature representation by enlarging the receptive field, which indicates how the convolution process extracts features in a high dimensional feature space. However, its functionality is restricted to the spatial dimension and network depth, limiting further improvements in network performance due to insufficient information interaction and representation. Crucially, the potential of high dimensional feature space in the channel dimension and the exploration of network width/resolution remain largely untapped. In this paper, we consider nonlinear transforms from the perspective of feature space, defining high-dimensional feature spaces in different dimensions and investigating the specific effects. Firstly, we introduce the dimension increasing and decreasing transforms in both channel and spatial dimensions to obtain high dimensional feature space and achieve better feature extraction. Secondly, we design a channel-spatial fusion residual transform (CSR), which incorporates multi-dimensional transforms for a more effective representation. Furthermore, we simplify the proposed fusion transform to obtain a slim architecture (CSR-sm), balancing network complexity and compression performance. Finally, we build the overall network with stacked CSR transforms to achieve better compression and reconstruction. Experimental results demonstrate that the proposed method can achieve superior rate-distortion performance compared to the existing LIC methods and traditional codecs. 
Specifically, our proposed method achieves 9.38% BD-rate reduction over VVC on Kodak dataset.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"10 4","pages":"1235-1253"},"PeriodicalIF":7.3,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.70025","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144910121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
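The dimension-increasing/decreasing transforms described above can be sketched in NumPy: a channel transform expands C channels to 2C, applies a nonlinearity in the wider space, projects back, and adds a residual, while a space-to-depth reshape trades spatial resolution for channels. The toy sizes, random weights, and function names are illustrative assumptions, not the paper's CSR architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_transform(x, w_up, w_down):
    """Channel dimension-increasing/decreasing transform (1x1 convolutions):
    expand C -> 2C, apply a nonlinearity in the wider space, project back
    to C, and add the residual connection."""
    h = np.einsum('chw,cd->dhw', x, w_up)    # dimension increase: C -> 2C
    h = np.maximum(h, 0.0)                   # nonlinearity in the wider space
    h = np.einsum('dhw,dc->chw', h, w_down)  # dimension decrease: 2C -> C
    return x + h                             # residual connection

def space_to_depth(x, r):
    """Spatial dimension-decreasing transform: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    return (x.reshape(c, h // r, r, w // r, r)
             .transpose(0, 2, 4, 1, 3)
             .reshape(c * r * r, h // r, w // r))

C, H, W = 4, 8, 8
x = rng.standard_normal((C, H, W))
y = channel_transform(x,
                      rng.standard_normal((C, 2 * C)) / C,
                      rng.standard_normal((2 * C, C)) / (2 * C))
z = space_to_depth(x, 2)
print(y.shape, z.shape)  # → (4, 8, 8) (16, 4, 4)
```

The channel path preserves the tensor shape (so residual blocks stack), while space-to-depth moves information between the spatial and channel dimensions without discarding it.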
Recently, numerous estimation issues have been solved owing to developments in data-driven artificial neural networks (ANNs) and graph neural networks (GNNs). The primary limitation of previous methodologies has been their dependence on data that can be structured in a grid format. However, physiological recordings often exhibit irregular and unordered patterns, posing a significant challenge to conceptualising them as matrices. As a result, GNNs, which comprise interactive nodes connected by edges whose weights are defined by anatomical junctions or temporal relationships, have received considerable attention for their ability to leverage implicit data that exists in a biological system. Additionally, our study incorporates a structural GNN to effectively differentiate between degrees of infection in the left and right hemispheres of the brain. Subsequently, demographic data are included, and a multi-task learning architecture is devised, integrating classification and regression tasks. The trials used an authentic dataset of 800 brain X-ray images, consisting of 560 moderate cases and 240 severe cases. Based on empirical evidence, our methodology demonstrates superior classification performance, surpassing the comparison methods with a notable 92.27% area under the curve and a correlation coefficient of 0.62.
{"title":"Intelligent Medical Diagnosis Model Based on Graph Neural Networks for Medical Images","authors":"Ashutosh Sharma, Amit Sharma, Kai Guo","doi":"10.1049/cit2.70020","DOIUrl":"10.1049/cit2.70020","url":null,"abstract":"<p>Recently, numerous estimation issues have been solved due to the developments in data-driven artificial neural networks (ANN) and graph neural networks (GNN). The primary limitation of previous methodologies has been the dependence on data that can be structured in a grid format. However, physiological recordings often exhibit irregular and unordered patterns, posing a significant challenge in conceptualising them as matrices. As a result, GNNs which comprise interactive nodes connected by edges whose weights are defined by anatomical junctions or temporal relationships have received a lot of consideration by leveraging implicit data that exists in a biological system. Additionally, our study incorporates a structural GNN to effectively differentiate between different degrees of infection in both the left and right hemispheres of the brain. Subsequently, demographic data are included, and a multi-task learning architecture is devised, integrating classification and regression tasks. The trials used an authentic dataset, including 800 brain x-ray pictures, consisting of 560 instances classified as moderate cases and 240 instances classified as severe cases. 
Based on empirical evidence, our methodology demonstrates superior performance in classification, surpassing other comparison methods with a notable achievement of 92.27% in terms of area under the curve as well as a correlation coefficient of 0.62.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"10 4","pages":"1201-1216"},"PeriodicalIF":7.3,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.70020","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
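A single graph-convolution step of the kind GNNs build on can be sketched as symmetrically normalised neighbourhood aggregation followed by a linear map; the toy 4-node graph and identity weight matrix below are illustrative assumptions, not the paper's structural GNN.

```python
import numpy as np

def gcn_layer(a, x, w):
    """One graph-convolution step: add self-loops, symmetrically normalise
    the adjacency matrix, aggregate neighbour features, then apply a linear
    map and a ReLU nonlinearity."""
    a_hat = a + np.eye(a.shape[0])                       # self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ x @ w, 0.0)

# toy chain graph: 4 nodes (e.g. brain-region patches), 3 features each
a = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.arange(12, dtype=float).reshape(4, 3)
h = gcn_layer(a, x, np.eye(3))
print(h.shape)  # → (4, 3)
```

Each row of the output mixes a node's own features with its neighbours', weighted by the edge structure, which is what lets the model exploit irregular, non-grid relationships.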