LLM as OS (llmao), Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem
Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, Yongfeng Zhang
arXiv:2312.03815, 2023-12-06

This paper envisions a revolutionary AIOS-Agent ecosystem in which a Large Language Model (LLM) serves as the (Artificial) Intelligent Operating System (IOS, or AIOS), an operating system "with soul". On this foundation, a diverse range of LLM-based AI Agent Applications (Agents, or AAPs) are developed, enriching the AIOS-Agent ecosystem and signaling a paradigm shift away from the traditional OS-APP ecosystem. We envision that the LLM's impact will not be limited to the AI application level; rather, it will in turn revolutionize the design and implementation of computer systems, architecture, software, and programming languages, characterized by several main concepts: LLM as OS (system level), Agents as Applications (application level), Natural Language as Programming Interface (user level), and Tools as Devices/Libraries (hardware/middleware level).
Robust Resource Partitioning Approach for ARINC 653 RTOS
Vitaly Cheptsov, Alexey Khoroshilov
arXiv:2312.01436, 2023-12-03

Modern airborne operating systems implement the robust time and resource partitioning imposed by standards for aerospace and airborne-embedded software systems, such as ARINC 653. While these standards provide a considerable number of design choices with regard to resource partitioning at the architectural and API levels (such as isolated memory spaces between application partitions, predefined resource configuration, and unidirectional ports with limited queue and message sizes for inter-partition communication), they do not specify how an operating system should implement them in software. Furthermore, they often set only the minimal level of required guarantees, for example in terms of memory permissions, and disregard the hardware state of the art, which can now provide considerably stronger guarantees at no extra cost. In this paper we present an architecture for robust resource partitioning in ARINC 653 real-time operating systems based on a completely static MMU configuration. The architecture was implemented on different types of airborne hardware, including platforms with TLB-based and page-table-based MMUs. Key benefits of the proposed approach include minimised run-time overhead and simpler verification of the memory subsystem.
MaxMem: Colocation and Performance for Big Data Applications on Tiered Main Memory Servers
Amanda Raybuck (The University of Texas at Austin), Wei Zhang (Microsoft), Kayvan Mansoorshahi (The University of Texas at Austin), Aditya K. Kamath (University of Washington), Mattan Erez (The University of Texas at Austin), Simon Peter (University of Washington)
arXiv:2312.00647, 2023-12-01

We present MaxMem, a tiered main memory management system that aims to maximize Big Data application colocation and performance. MaxMem uses an application-agnostic and lightweight memory occupancy control mechanism based on fast memory miss ratios to provide application QoS under increasing colocation. By relying on memory access sampling and binning to quickly identify per-process memory heat gradients, MaxMem maximizes performance for many applications sharing tiered main memory simultaneously. MaxMem is designed as a user-space memory manager so that it is easily modifiable and extensible, without complex kernel code development. On a system with tiered main memory consisting of DRAM and Intel Optane persistent memory modules, our evaluation with a Big Data key-value store in dynamic colocation scenarios confirms that MaxMem provides 11% better throughput and up to 80% lower 99th percentile latency than HeMem, and 38% better throughput and an order of magnitude lower 99th percentile latency than Linux AutoNUMA.
Cascade: A Platform for Delay-Sensitive Edge Intelligence
Weijia Song, Thiago Garrett, Yuting Yang, Mingzhao Liu, Edward Tremel, Lorenzo Rosa, Andrea Merlina, Roman Vitenberg, Ken Birman
arXiv:2311.17329, 2023-11-29

Interactive intelligent computing applications are increasingly prevalent, creating a need for AI/ML platforms optimized to reduce per-event latency while maintaining high throughput and efficient resource management. Yet many intelligent applications run on AI/ML platforms that optimize for high throughput even at the cost of high tail latency. Cascade is a new AI/ML hosting platform intended to untangle this puzzle. Innovations include a legacy-friendly storage layer that moves data with minimal copying and a "fast path" that collocates data and computation to maximize responsiveness. Our evaluation shows that Cascade reduces latency by orders of magnitude with no loss of throughput.
Trace-enabled Timing Model Synthesis for ROS2-based Autonomous Applications
Hazem Abaza, Debayan Roy, Shiqing Fan, Selma Saidi, Antonios Motakis
arXiv:2311.13333, 2023-11-22

Autonomous applications are typically developed over Robot Operating System 2.0 (ROS2), even in time-critical systems like automotive. Recent years have seen increased interest in developing model-based timing analysis and schedule optimization approaches for ROS2-based applications. To complement these approaches, we propose a tracing and measurement framework to obtain timing models of ROS2-based applications. It offers a tracer based on the extended Berkeley Packet Filter (eBPF) that probes different functions in the ROS2 middleware and reads their arguments or return values to reason about the data flow in applications. It combines event traces from ROS2 and the operating system to generate a directed acyclic graph showing ROS2 callbacks, the precedence relations between them, and their timing attributes. While being compatible with existing analyses, we also show how to model (i) message synchronization, e.g., in sensor fusion, and (ii) service requests from multiple clients, e.g., in motion planning. Even in real-world scenarios where the application code is confidential and formal models are unavailable, our framework still enables the application of existing analysis and optimization techniques.
Memory Management Strategies for an Internet of Things System
Ana-Maria Comeagă, Iuliana Marin
arXiv:2311.10458, 2023-11-17

The rise of the Internet has brought about significant changes in our lives, and the rapid expansion of the Internet of Things (IoT) is poised to have an even more substantial impact by connecting a wide range of devices across various application domains. IoT devices, especially low-end ones, are constrained by limited memory and processing capabilities, necessitating efficient memory management within IoT operating systems. This paper delves into the importance of memory management in IoT systems, with a primary focus on the design and configuration of such systems, as well as the scalability and performance of scene management. Effective memory management is critical for optimizing resource usage, responsiveness, and adaptability as the IoT ecosystem continues to grow. The study offers insights into memory allocation, scene execution, memory reduction, and system scalability within the context of an IoT system, ultimately highlighting the vital role that memory management plays in facilitating a seamless and efficient IoT experience.
Telescope: Telemetry at Terabyte Scale
Alan Nair, Sandeep Kumar, Aravinda Prasad, Andy Rudoff, Sreenivas Subramoney
arXiv:2311.10275, 2023-11-17

Data-hungry applications that require terabytes of memory have become widespread in recent years. To meet the memory needs of these applications, data centers are embracing tiered memory architectures with near and far memory tiers. Precise, efficient, and timely identification of hot and cold data and their placement in appropriate tiers is critical for performance in such systems. Unfortunately, the existing state-of-the-art telemetry techniques for hot and cold data detection are ineffective at the terabyte scale.

We propose Telescope, a novel technique that profiles different levels of the application's page table tree for fast and efficient identification of hot and cold data. Telescope is based on the observation that, for a memory- and TLB-intensive workload, higher levels of a page table tree are also frequently accessed during a hardware page table walk. Hence, the hotness of the higher levels of the page table tree essentially captures the hotness of its subtrees or address space sub-regions at a coarser granularity. We exploit this insight to quickly converge on even a few megabytes of hot data and efficiently identify several gigabytes of cold data in terabyte-scale applications. Importantly, such a technique can seamlessly scale to petabyte-scale applications.

Telescope's telemetry achieves 90%+ precision and recall at just 0.009% single CPU utilization for microbenchmarks with a 5 TB memory footprint. Memory tiering based on Telescope results in 5.6% to 34% throughput improvement for real-world benchmarks with a 1-2 TB memory footprint compared to other state-of-the-art telemetry techniques.
Nahida: In-Band Distributed Tracing with eBPF
Wanqi Yang, Pengfei Chen, Kai Liu, Huxing Zhang
arXiv:2311.09032, 2023-11-15

Microservices are commonly used in modern cloud-native applications to achieve agility. However, the complexity of service dependencies in large-scale microservices systems can lead to anomaly propagation, making fault troubleshooting a challenge. To address this issue, distributed tracing systems have been proposed to trace complete request execution paths, enabling developers to troubleshoot anomalous services. However, existing distributed tracing systems have limitations such as invasive instrumentation, trace loss, or inaccurate trace correlation. To overcome these limitations, we propose Nahida, a new tracing system based on eBPF (extended Berkeley Packet Filter) that can track complete requests in the kernel without intrusion, regardless of programming language or implementation. Our evaluation shows that Nahida tracks over 92% of requests with stable accuracy, even under high concurrency of user requests, whereas state-of-the-art non-invasive approaches cannot track any of the requests. Importantly, Nahida can track requests served by multi-threaded applications, which none of the existing invasive tracing systems (which instrument tracing code into libraries) can handle. Moreover, the overhead introduced by Nahida is negligible, increasing service latency by only 1.55%-2.1%. Overall, Nahida provides an effective and non-invasive solution for distributed tracing.
HAL 9000: Skynet's Risk Manager
Tadeu Freitas, Mário Neto, Inês Dutra, João Soares, Manuel Correia, Rolando Martins
arXiv:2311.09449, 2023-11-15

Intrusion Tolerant Systems (ITSs) are a necessary component of cyber services and infrastructures. Moreover, since cyberattacks span a multi-domain attack surface, a similarly broad defensive approach should be applied: an evolving multi-disciplinary solution that combines ITS, cybersecurity, and Artificial Intelligence (AI). With the increased popularity of AI solutions, driven by Big Data use cases and decision-support and automation scenarios, new opportunities have emerged to apply Machine Learning (ML) algorithms, notably to empower ITSs. Using ML algorithms, an ITS can augment its intrusion tolerance capability by learning from previous attacks and from known vulnerabilities. As such, this work's contribution is twofold: (1) an ITS architecture (Skynet) that builds on the state of the art and incorporates new components to increase its intrusion tolerance capability and its adaptability to new adversaries; (2) an improved Risk Manager design (HAL 9000) that leverages AI to improve ITSs by automatically assessing OS risks to intrusions and advising on safer configurations. One reason intrusions succeed is bad configuration or slow adaptation to new threats, which can be caused by systems' dependency on human intervention. A key characteristic of the Skynet and HAL 9000 designs is the removal of human intervention; being fully automated lowers the chance of successful intrusions caused by human error. Our experiments using Skynet show that HAL is able to choose configurations that are 15% safer than those of the state-of-the-art risk manager.
bpftime: userspace eBPF Runtime for Uprobe, Syscall and Kernel-User Interactions
Yusheng Zheng, Tong Yu, Yiwei Yang, Yanpeng Hu, XiaoZheng Lai, Andrew Quinn
arXiv:2311.07923, 2023-11-14

In kernel-centric operations, the uprobe component of eBPF frequently encounters performance bottlenecks, largely attributed to the overhead borne by context switches. Transitioning eBPF operations to user space bypasses these hindrances, thereby optimizing performance. This also enhances configurability and obviates the need for root access or privileges required by kernel eBPF, subsequently minimizing the kernel attack surface.

This paper introduces bpftime, a novel user-space eBPF runtime that leverages binary rewriting to implement uprobe and syscall hook capabilities. Through bpftime, userspace uprobes achieve a 10x speed enhancement compared to their kernel counterparts, without requiring dual context switches. Additionally, the runtime facilitates the programmatic hooking of syscalls within a process, both safely and efficiently. Bpftime can be seamlessly attached to any running process, without requiring a restart or manual recompilation. Our implementation also extends to interprocess eBPF maps in shared memory, catering to summary aggregation or control-plane communication requirements. Compatibility with existing eBPF toolchains such as clang and libbpf is maintained, not only simplifying the development of user-space eBPF without necessitating any modifications but also supporting CO-RE through BTF. Through bpftime, we not only enhance uprobe performance but also extend the versatility and user-friendliness of the eBPF runtime in user space, paving the way for more efficient and secure kernel operations.