{"title":"Prompting Robotic Modalities (PRM): A structured architecture for centralizing language models in complex systems","authors":"Bilel Benjdira, Anis Koubaa, Anas M. Ali","doi":"10.1016/j.future.2025.107723","DOIUrl":null,"url":null,"abstract":"<div><div>Despite significant advancements in robotics and AI, existing systems often struggle to integrate diverse modalities (e.g., image, sound, actuator data) into a unified framework, resulting in fragmented architectures that limit adaptability, scalability, and explainability. To address these gaps, this paper introduces Prompting Robotic Modalities (PRM), a novel architecture that centralizes language models for controlling and managing complex systems through natural language. In PRM, each system modality (e.g., image, sound, actuator) is handled independently by a Modality Language Model (MLM), while a central Task Modality, powered by a Large Language Model (LLM), orchestrates complex tasks using information from the MLMs. Each MLM is trained on datasets that pair modality-specific data with rich textual descriptions, enabling intuitive, language-based interaction. We validate PRM with two main contributions: (1) ROSGPT_Vision, a new open-source ROS 2 package (available at <span><span>https://github.com/bilel-bj/ROSGPT_Vision</span><svg><path></path></svg></span>) for visual modality tasks, achieving up to 66% classification accuracy in driver-focus monitoring—surpassing other tested models in its category; and (2) CarMate, a driver-distraction detection application that significantly reduces development time and cost by allowing rapid adaptation to new monitoring tasks via simple prompt adjustments. In addition, we develop a Navigation Language Model (NLM) that converts free-form human language orders into detailed ROS commands, underscoring PRM’s modality-agnostic adaptability. Experimental results demonstrate that PRM simplifies system development, outperforms baseline vision-language approaches in specialized tasks (e.g., driver monitoring), reduces complexity through prompt engineering rather than extensive coding, and enhances explainability via natural-language-based diagnostics. Hence, PRM lays a promising foundation for next-generation complex and robotic systems by integrating advanced language model capabilities at their core, making them more adaptable to new environments, cost-effective, and user-friendly.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"166 ","pages":"Article 107723"},"PeriodicalIF":6.2000,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X25000184","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 0
Abstract
Despite significant advancements in robotics and AI, existing systems often struggle to integrate diverse modalities (e.g., image, sound, actuator data) into a unified framework, resulting in fragmented architectures that limit adaptability, scalability, and explainability. To address these gaps, this paper introduces Prompting Robotic Modalities (PRM), a novel architecture that centralizes language models for controlling and managing complex systems through natural language. In PRM, each system modality (e.g., image, sound, actuator) is handled independently by a Modality Language Model (MLM), while a central Task Modality, powered by a Large Language Model (LLM), orchestrates complex tasks using information from the MLMs. Each MLM is trained on datasets that pair modality-specific data with rich textual descriptions, enabling intuitive, language-based interaction. We validate PRM with two main contributions: (1) ROSGPT_Vision, a new open-source ROS 2 package (available at https://github.com/bilel-bj/ROSGPT_Vision) for visual modality tasks, achieving up to 66% classification accuracy in driver-focus monitoring—surpassing other tested models in its category; and (2) CarMate, a driver-distraction detection application that significantly reduces development time and cost by allowing rapid adaptation to new monitoring tasks via simple prompt adjustments. In addition, we develop a Navigation Language Model (NLM) that converts free-form human language orders into detailed ROS commands, underscoring PRM’s modality-agnostic adaptability. Experimental results demonstrate that PRM simplifies system development, outperforms baseline vision-language approaches in specialized tasks (e.g., driver monitoring), reduces complexity through prompt engineering rather than extensive coding, and enhances explainability via natural-language-based diagnostics. Hence, PRM lays a promising foundation for next-generation complex and robotic systems by integrating advanced language model capabilities at their core, making them more adaptable to new environments, cost-effective, and user-friendly.
期刊介绍:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.