Prompting Robotic Modalities (PRM): A structured architecture for centralizing language models in complex systems

Impact Factor: 6.2 · Zone 2 (Computer Science) · JCR Q1 (Computer Science, Theory & Methods) · Journal: Future Generation Computer Systems-The International Journal of Escience · Pub Date: 2025-01-31 · DOI: 10.1016/j.future.2025.107723

Bilel Benjdira, Anis Koubaa, Anas M. Ali

Future Generation Computer Systems-The International Journal of Escience, Volume 166, Article 107723. https://www.sciencedirect.com/science/article/pii/S0167739X25000184

Citations: 0

Abstract

Despite significant advancements in robotics and AI, existing systems often struggle to integrate diverse modalities (e.g., image, sound, actuator data) into a unified framework, resulting in fragmented architectures that limit adaptability, scalability, and explainability. To address these gaps, this paper introduces Prompting Robotic Modalities (PRM), a novel architecture that centralizes language models for controlling and managing complex systems through natural language. In PRM, each system modality (e.g., image, sound, actuator) is handled independently by a Modality Language Model (MLM), while a central Task Modality, powered by a Large Language Model (LLM), orchestrates complex tasks using information from the MLMs. Each MLM is trained on datasets that pair modality-specific data with rich textual descriptions, enabling intuitive, language-based interaction. We validate PRM with two main contributions: (1) ROSGPT_Vision, a new open-source ROS 2 package (available at https://github.com/bilel-bj/ROSGPT_Vision) for visual modality tasks, achieving up to 66% classification accuracy in driver-focus monitoring—surpassing other tested models in its category; and (2) CarMate, a driver-distraction detection application that significantly reduces development time and cost by allowing rapid adaptation to new monitoring tasks via simple prompt adjustments. In addition, we develop a Navigation Language Model (NLM) that converts free-form human language orders into detailed ROS commands, underscoring PRM’s modality-agnostic adaptability. Experimental results demonstrate that PRM simplifies system development, outperforms baseline vision-language approaches in specialized tasks (e.g., driver monitoring), reduces complexity through prompt engineering rather than extensive coding, and enhances explainability via natural-language-based diagnostics. 
Hence, PRM lays a promising foundation for next-generation complex and robotic systems by integrating advanced language model capabilities at their core, making them more adaptable to new environments, cost-effective, and user-friendly.
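The division of labor described in the abstract (one Modality Language Model per modality, each reporting in text to a central LLM-driven Task Modality) can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use the ROSGPT_Vision API: the class `TaskModality`, the stub functions, and the prompts are all hypothetical names chosen for illustration, and the "models" are plain functions standing in for real MLMs and an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical type: a modality language model (MLM) is modeled as a callable
# that takes raw modality data plus a text prompt and returns a description.
ModalityModel = Callable[[object, str], str]


@dataclass
class TaskModality:
    """Central coordinator (assumed design): collects one text description
    per modality, then hands the combined text to an LLM for a decision."""
    mlms: Dict[str, ModalityModel]
    prompts: Dict[str, str]
    llm: Callable[[str], str]

    def step(self, inputs: Dict[str, object], task: str) -> str:
        descriptions = []
        for name in sorted(inputs):
            # Each modality is handled independently by its own MLM.
            text = self.mlms[name](inputs[name], self.prompts[name])
            descriptions.append(f"[{name}] {text}")
        query = f"Task: {task}\n" + "\n".join(descriptions)
        return self.llm(query)


# Stub models standing in for real MLMs and a real LLM (illustration only).
def fake_image_mlm(frame: object, prompt: str) -> str:
    return f"{prompt} -> driver is looking at a phone"


def fake_sound_mlm(clip: object, prompt: str) -> str:
    return f"{prompt} -> loud music"


def fake_llm(query: str) -> str:
    return "ALERT" if "phone" in query else "OK"


def build_demo() -> TaskModality:
    return TaskModality(
        mlms={"image": fake_image_mlm, "sound": fake_sound_mlm},
        prompts={
            "image": "Describe the driver's focus",
            "sound": "Describe cabin audio",
        },
        llm=fake_llm,
    )


if __name__ == "__main__":
    tm = build_demo()
    print(tm.step({"image": b"", "sound": b""}, "monitor driver"))
```

In this sketch, adapting the system to a new monitoring task means editing an entry in `prompts` rather than changing code, which mirrors the prompt-based adaptation the abstract attributes to CarMate.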


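Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 19.90

Self-citation rate: 2.70%

Articles published: 376

Review time: 10.6 months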

About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.
