{"title":"面向未来智慧城市的可扩展计算系统","authors":"Ike Nassi","doi":"10.1049/smc2.12026","DOIUrl":null,"url":null,"abstract":"<p>I will discuss each in turn, but first, a bias. Scalable cities are first and foremost about people, not about computers or computing. Of course, these days, computing infrastructure is important, but we should never lose sight of the prime directive. The more time and effort we spend on computing infrastructure, the less we can spend on enriching people's lives.</p><p>With that out of the way, let me address the issues raised above. First, in terms of ease of use, we could mean the ease with which clients interact with the computer system. That is not what I am referring to. Rather, I am referring to the fact that the scalable city is an enterprise, and like all enterprises, it is most likely running standard third party software packages, and has been doing so for a long time. There is a lot of software inertia present in this model. The last thing I would encourage is to require a lot of software modifications to existing software particularly on a fixed inflexible schedule. Of course, as technologies and new ideas emerge, it is important to be able to integrate these with an existing computing base, but large-scale rewriting of existing software must not be mandated or encouraged. New aspects of the smart cities' technology base must be introduced gradually with a clear cost/benefit analysis. The computer systems chosen to run the smart cities must be capable of running both old and new software without modification. It is also important that the infrastructure used in implementing a smart city not be locked into a single vendor. Using standard servers, standard networks, and standard software is, again, highly desirable.</p><p>Second, let me address the need for scalable computing. Needs change as smart cities evolve. It would be very desirable to preserve investments in computing infrastructure by allowing that infrastructure to support more computing over time without having to invest in the latest shiny new hardware offering. Further, investments that allow an existing hardware technology base to grow and evolve, without having to rewrite software are highly desirable. It would be even better if the system itself can automatically expand and contract due to the demand placed on it, month to month, week to week, day to day, or even at finer levels of granularity. This is well within the state of the art.</p><p>You might think I am talking about ‘the cloud’. While I do not rule it out, using the cloud has a high potential for locking in customers, as discussed earlier. This is not only true for the hardware that is used, but also the reliance on a set of software packages that only run in single branded cloud vendor's environment can be disadvantageous, since ultimately, the cost of switching away from one vendor to another can be very high or even practically impossible. The marginal costs of using a single cloud vendor can be very high over time due to the vendor's increasing infrastructure costs that often are directly passed along to satisfy shareholder expectations.</p><p>The third point I wanted to make has to do with reliability. If the smart city is going to rely on its smart city infrastructure, it must be highly reliable and highly available. You might think this comes for free. After all, aren’t servers getting more and more reliable? In short, it is becoming increasingly apparent that the answer to this is ‘no’. In fact, it is the opposite. 
As we include more memory in these servers, and increase the density of semiconductors, reliability is decreasing. Part of this has to do with process geometries, part of this has to do with higher utilisation, and part of it has to do with the ability of semiconductors to monitor their own behaviour but not take corrective action when disruptive events are anticipated. When heavily loaded hardware servers fail, the ‘blast radius’ can become very problematic. Restarting a server can be very expensive in terms of downtime, particularly as the amount of memory in a server increases. Further, it may also take time to get the performance of the server back to an acceptable level (e.g., re-warming caches). The ability to dynamically scale also has an impact on reliability and security.</p><p>From a security standpoint, we know that it is important to apply security upgrades on a regular basis, but if it means taking down multiple running systems to upgrade components, this will often not happen according to the desirable fixed schedule. Fortunately, there are solutions to this problem as well. We now can detect a very high percentage of anticipated potential hardware errors, like correctable error correcting codes (ECC) errors predicting non-correctable ECC errors, increasing error rates in network interface cards (NICs), rising temperatures indicating fan failures, and deal with them without having to take mission-critical systems offline.</p><p>The dynamic scaling ability allows us to not only take a hardware system offline for repair, but also allows us to add additional capacity when a system becomes overloaded (and then revert when it becomes underloaded). These abilities are all possible and desirable. Further, from an economic standpoint to preserve the investment in computing infrastructure it is very advantageous to not require that a system be overprovisioned just to meet some hypothetical peak demand. It is very advantageous to only use as much computing infrastructure as needed, and only when needed. This is not only true for hardware and energy investments, but also for investments in software licences, which are often correlated with hardware capabilities.</p><p>I personally believe that distributed virtual machines offer the potential to satisfy all the needs I have mentioned. What is a distributed virtual machine? It is a virtual machine that runs on a dedicated cluster of cooperating physical servers interconnected by a standard network, like Ethernet.</p><p>To an operating system, it looks exactly like a single physical server, but it is not. Each physical server runs a piece of software called a hyperkernel. When powering up, each hyperkernel instance takes an inventory of all the processors, all the memory, all the networks, and all the storage on each physical server. Then, the hyperkernel instances exchange this inventory information, and use it to create a single virtual machine. One processor boots a standard operating system, which sees all the combined resources of all the physical servers. The operating system does not even know it is running on a cluster. (It is like a dream: how do you know whether you are dreaming or not?) No modifications to the operating system need to be made, and no modifications to any applications need to be made. Further, the virtual resources like guest virtual processors and guest virtual memory can migrate under automatic control by machine learning algorithms and system performance introspection. 
So, the first goal of simplicity can thus be achieved.</p><p>Scalability is achieved by the cooperating hyperkernels implementing the ability to add and subtract physical servers dynamically as needed. This can be explicit, under operator control, or under programmatic control by some oversight software that tracks performance usage information. Thus, the second goal is achieved.</p><p>Reliability is achieved in a very innovative way. The various hyperkernels monitor things like dynamic random access memory error rates, temperature fluctuations, NIC error rates and the like. When an impending problem is detected, there is sufficient time to take corrective action. For example, when a problem is detected on physical server <i>n,</i> the hyperkernels on all the other physical servers are told not to send any active guest physical pages or guest processors to <i>n.</i> An additional physical server may be added to the cluster to maintain previous performance levels. In other words, <i>n</i> is <i>quarantined</i>. Physical server <i>n</i> is directed to <i>evict</i> all active guest physical pages and guest virtual processors to other physical servers. When this is complete, physical server <i>n</i> can be removed for repair. A similar process can be used for upgrades of hardware or firmware. All this is done without having to modify or restart the operating system, which is unaware that any of this is taking place. Thus, the third goal, reliability, is achieved.</p><p>All this can be achieved with competitive performance using technology available today.</p><p>Authors declare no conflict of interest.</p>","PeriodicalId":34740,"journal":{"name":"IET Smart Cities","volume":"4 2","pages":"79-80"},"PeriodicalIF":2.1000,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/smc2.12026","citationCount":"0","resultStr":"{\"title\":\"Scalable computing systems for future smart cities\",\"authors\":\"Ike Nassi\",\"doi\":\"10.1049/smc2.12026\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>I will discuss each in turn, but first, a bias. Scalable cities are first and foremost about people, not about computers or computing. Of course, these days, computing infrastructure is important, but we should never lose sight of the prime directive. The more time and effort we spend on computing infrastructure, the less we can spend on enriching people's lives.</p><p>With that out of the way, let me address the issues raised above. First, in terms of ease of use, we could mean the ease with which clients interact with the computer system. That is not what I am referring to. Rather, I am referring to the fact that the scalable city is an enterprise, and like all enterprises, it is most likely running standard third party software packages, and has been doing so for a long time. There is a lot of software inertia present in this model. The last thing I would encourage is to require a lot of software modifications to existing software particularly on a fixed inflexible schedule. Of course, as technologies and new ideas emerge, it is important to be able to integrate these with an existing computing base, but large-scale rewriting of existing software must not be mandated or encouraged. New aspects of the smart cities' technology base must be introduced gradually with a clear cost/benefit analysis. 
The computer systems chosen to run the smart cities must be capable of running both old and new software without modification. It is also important that the infrastructure used in implementing a smart city not be locked into a single vendor. Using standard servers, standard networks, and standard software is, again, highly desirable.</p><p>Second, let me address the need for scalable computing. Needs change as smart cities evolve. It would be very desirable to preserve investments in computing infrastructure by allowing that infrastructure to support more computing over time without having to invest in the latest shiny new hardware offering. Further, investments that allow an existing hardware technology base to grow and evolve, without having to rewrite software are highly desirable. It would be even better if the system itself can automatically expand and contract due to the demand placed on it, month to month, week to week, day to day, or even at finer levels of granularity. This is well within the state of the art.</p><p>You might think I am talking about ‘the cloud’. While I do not rule it out, using the cloud has a high potential for locking in customers, as discussed earlier. This is not only true for the hardware that is used, but also the reliance on a set of software packages that only run in single branded cloud vendor's environment can be disadvantageous, since ultimately, the cost of switching away from one vendor to another can be very high or even practically impossible. The marginal costs of using a single cloud vendor can be very high over time due to the vendor's increasing infrastructure costs that often are directly passed along to satisfy shareholder expectations.</p><p>The third point I wanted to make has to do with reliability. If the smart city is going to rely on its smart city infrastructure, it must be highly reliable and highly available. You might think this comes for free. After all, aren’t servers getting more and more reliable? In short, it is becoming increasingly apparent that the answer to this is ‘no’. In fact, it is the opposite. As we include more memory in these servers, and increase the density of semiconductors, reliability is decreasing. Part of this has to do with process geometries, part of this has to do with higher utilisation, and part of it has to do with the ability of semiconductors to monitor their own behaviour but not take corrective action when disruptive events are anticipated. When heavily loaded hardware servers fail, the ‘blast radius’ can become very problematic. Restarting a server can be very expensive in terms of downtime, particularly as the amount of memory in a server increases. Further, it may also take time to get the performance of the server back to an acceptable level (e.g., re-warming caches). The ability to dynamically scale also has an impact on reliability and security.</p><p>From a security standpoint, we know that it is important to apply security upgrades on a regular basis, but if it means taking down multiple running systems to upgrade components, this will often not happen according to the desirable fixed schedule. Fortunately, there are solutions to this problem as well. 
We now can detect a very high percentage of anticipated potential hardware errors, like correctable error correcting codes (ECC) errors predicting non-correctable ECC errors, increasing error rates in network interface cards (NICs), rising temperatures indicating fan failures, and deal with them without having to take mission-critical systems offline.</p><p>The dynamic scaling ability allows us to not only take a hardware system offline for repair, but also allows us to add additional capacity when a system becomes overloaded (and then revert when it becomes underloaded). These abilities are all possible and desirable. Further, from an economic standpoint to preserve the investment in computing infrastructure it is very advantageous to not require that a system be overprovisioned just to meet some hypothetical peak demand. It is very advantageous to only use as much computing infrastructure as needed, and only when needed. This is not only true for hardware and energy investments, but also for investments in software licences, which are often correlated with hardware capabilities.</p><p>I personally believe that distributed virtual machines offer the potential to satisfy all the needs I have mentioned. What is a distributed virtual machine? It is a virtual machine that runs on a dedicated cluster of cooperating physical servers interconnected by a standard network, like Ethernet.</p><p>To an operating system, it looks exactly like a single physical server, but it is not. Each physical server runs a piece of software called a hyperkernel. When powering up, each hyperkernel instance takes an inventory of all the processors, all the memory, all the networks, and all the storage on each physical server. Then, the hyperkernel instances exchange this inventory information, and use it to create a single virtual machine. One processor boots a standard operating system, which sees all the combined resources of all the physical servers. The operating system does not even know it is running on a cluster. (It is like a dream: how do you know whether you are dreaming or not?) No modifications to the operating system need to be made, and no modifications to any applications need to be made. Further, the virtual resources like guest virtual processors and guest virtual memory can migrate under automatic control by machine learning algorithms and system performance introspection. So, the first goal of simplicity can thus be achieved.</p><p>Scalability is achieved by the cooperating hyperkernels implementing the ability to add and subtract physical servers dynamically as needed. This can be explicit, under operator control, or under programmatic control by some oversight software that tracks performance usage information. Thus, the second goal is achieved.</p><p>Reliability is achieved in a very innovative way. The various hyperkernels monitor things like dynamic random access memory error rates, temperature fluctuations, NIC error rates and the like. When an impending problem is detected, there is sufficient time to take corrective action. For example, when a problem is detected on physical server <i>n,</i> the hyperkernels on all the other physical servers are told not to send any active guest physical pages or guest processors to <i>n.</i> An additional physical server may be added to the cluster to maintain previous performance levels. In other words, <i>n</i> is <i>quarantined</i>. 
Physical server <i>n</i> is directed to <i>evict</i> all active guest physical pages and guest virtual processors to other physical servers. When this is complete, physical server <i>n</i> can be removed for repair. A similar process can be used for upgrades of hardware or firmware. All this is done without having to modify or restart the operating system, which is unaware that any of this is taking place. Thus, the third goal, reliability, is achieved.</p><p>All this can be achieved with competitive performance using technology available today.</p><p>Authors declare no conflict of interest.</p>\",\"PeriodicalId\":34740,\"journal\":{\"name\":\"IET Smart Cities\",\"volume\":\"4 2\",\"pages\":\"79-80\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2022-03-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/smc2.12026\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IET Smart Cities\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1049/smc2.12026\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Smart Cities","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/smc2.12026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Scalable computing systems for future smart cities
I will discuss each of the issues raised above (ease of use, scalability, and reliability) in turn, but first, a bias. Scalable cities are first and foremost about people, not about computers or computing. Of course, these days, computing infrastructure is important, but we should never lose sight of the prime directive: the more time and effort we spend on computing infrastructure, the less we can spend on enriching people's lives.
With that out of the way, let me address the issues raised above. First, consider ease of use. By this I do not mean the ease with which clients interact with the computer system. Rather, I mean that the scalable city is an enterprise, and like all enterprises, it is most likely running standard third-party software packages, and has been doing so for a long time. There is a great deal of software inertia in this model. The last thing I would encourage is requiring extensive modifications to existing software, particularly on a fixed, inflexible schedule. Of course, as technologies and new ideas emerge, it is important to be able to integrate them with the existing computing base, but large-scale rewriting of existing software must not be mandated or encouraged. New aspects of a smart city's technology base must be introduced gradually, with a clear cost/benefit analysis. The computer systems chosen to run a smart city must be capable of running both old and new software without modification. It is also important that the infrastructure used in implementing a smart city not be locked into a single vendor. Using standard servers, standard networks, and standard software is, again, highly desirable.
Second, let me address the need for scalable computing. Needs change as smart cities evolve. It would be very desirable to preserve investments in computing infrastructure by allowing that infrastructure to support more computing over time, without having to invest in the latest shiny new hardware offering. Further, investments that allow an existing hardware technology base to grow and evolve without having to rewrite software are highly desirable. It would be even better if the system itself could automatically expand and contract in response to the demand placed on it, month to month, week to week, day to day, or even at finer levels of granularity. This is well within the state of the art.
You might think I am talking about ‘the cloud’. While I do not rule it out, using the cloud has a high potential for locking in customers, as discussed earlier. This is true not only of the hardware that is used; reliance on a set of software packages that run only in a single branded cloud vendor's environment can also be disadvantageous, since ultimately the cost of switching from one vendor to another can be very high, and switching can even be practically impossible. The marginal costs of using a single cloud vendor can also grow very high over time, as the vendor's rising infrastructure costs are often passed directly along to customers to satisfy shareholder expectations.
The third point I wanted to make has to do with reliability. If the smart city is going to rely on its smart city infrastructure, that infrastructure must be highly reliable and highly available. You might think this comes for free. After all, aren’t servers getting more and more reliable? In short, it is becoming increasingly apparent that the answer is ‘no’. In fact, the opposite is true. As we put more memory into these servers and increase the density of semiconductors, reliability is decreasing. Part of this has to do with process geometries, part with higher utilisation, and part with the fact that semiconductors can monitor their own behaviour yet take no corrective action when disruptive events are anticipated. When heavily loaded hardware servers fail, the ‘blast radius’ can become very problematic. Restarting a server can be very expensive in terms of downtime, particularly as the amount of memory in a server increases. Further, it may also take time to bring the performance of the server back to an acceptable level (e.g., re-warming caches). The ability to scale dynamically also has an impact on reliability and security.
From a security standpoint, we know that it is important to apply security upgrades on a regular basis, but if that means taking down multiple running systems to upgrade components, it will often not happen on the desired fixed schedule. Fortunately, there are solutions to this problem as well. We can now detect a very high percentage of anticipated hardware errors, such as correctable error-correcting code (ECC) errors that predict non-correctable ECC errors, increasing error rates in network interface cards (NICs), and rising temperatures indicating fan failures, and deal with them without having to take mission-critical systems offline.
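To make this concrete, here is a minimal sketch in Python of the kind of telemetry-based health check described above. The metric names, thresholds, and the Telemetry class are illustrative assumptions, not values or interfaces taken from any particular product; a real monitor would read hardware counters directly.

```python
# Illustrative sketch only: a simplified predictive health check.
# Metric names and thresholds are assumptions made for this example.
from dataclasses import dataclass

@dataclass
class Telemetry:
    correctable_ecc_per_hour: float  # rising correctable ECC errors often precede uncorrectable ones
    nic_error_rate: float            # NIC errors per million packets
    temperature_c: float             # chassis or CPU temperature

# Thresholds below are purely illustrative, not vendor-recommended values.
ECC_LIMIT = 50.0
NIC_ERROR_LIMIT = 10.0
TEMP_LIMIT_C = 80.0

def impending_failures(t: Telemetry) -> list[str]:
    """Return the warning conditions that merit proactive action."""
    warnings = []
    if t.correctable_ecc_per_hour > ECC_LIMIT:
        warnings.append("memory: correctable ECC rate predicts uncorrectable errors")
    if t.nic_error_rate > NIC_ERROR_LIMIT:
        warnings.append("network: NIC error rate is climbing")
    if t.temperature_c > TEMP_LIMIT_C:
        warnings.append("thermal: temperature suggests a failing fan")
    return warnings

if __name__ == "__main__":
    sample = Telemetry(correctable_ecc_per_hour=120.0, nic_error_rate=2.0, temperature_c=71.0)
    for w in impending_failures(sample):
        print("WARN:", w)  # a scheduler could quarantine the server instead of printing
```

The point is that each condition can be flagged well before it becomes a hard failure, leaving time to act without taking the system down.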
The dynamic scaling ability not only allows us to take a hardware system offline for repair, but also allows us to add capacity when a system becomes overloaded (and then revert when it becomes underloaded). These abilities are all possible and desirable. Further, from an economic standpoint, to preserve the investment in computing infrastructure, it is very advantageous not to require that a system be overprovisioned just to meet some hypothetical peak demand. It is very advantageous to use only as much computing infrastructure as needed, and only when needed. This is true not only for hardware and energy investments, but also for investments in software licences, which are often correlated with hardware capabilities.
I personally believe that distributed virtual machines offer the potential to satisfy all the needs I have mentioned. What is a distributed virtual machine? It is a virtual machine that runs on a dedicated cluster of cooperating physical servers interconnected by a standard network, like Ethernet.
To an operating system, it looks exactly like a single physical server, but it is not. Each physical server runs a piece of software called a hyperkernel. When powering up, each hyperkernel instance takes an inventory of all the processors, memory, networks, and storage on its own physical server. The hyperkernel instances then exchange this inventory information and use it to create a single virtual machine. One processor boots a standard operating system, which sees the combined resources of all the physical servers. The operating system does not even know it is running on a cluster. (It is like a dream: how do you know whether you are dreaming or not?) No modifications need to be made to the operating system, and none to any applications. Further, virtual resources such as guest virtual processors and guest virtual memory can migrate under automatic control, guided by machine learning algorithms and system performance introspection. Thus, the first goal, simplicity, is achieved.
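The inventory-and-merge step can be illustrated with a toy sketch. The data model and function names below are invented for this example and do not represent any real hyperkernel's internals; they simply show how per-server inventories combine into the single machine the guest operating system sees.

```python
# Schematic sketch of the inventory-and-merge idea. A real hyperkernel runs
# below the operating system, not as a Python program; names here are invented.
from dataclasses import dataclass

@dataclass
class ServerInventory:
    hostname: str
    cpus: int          # physical cores
    memory_gib: int
    nics: int
    storage_tib: int

def merge_inventories(inventories: list[ServerInventory]) -> dict:
    """Combine per-server inventories into the single virtual machine
    that the guest operating system will boot against."""
    return {
        "virtual_cpus": sum(s.cpus for s in inventories),
        "virtual_memory_gib": sum(s.memory_gib for s in inventories),
        "virtual_nics": sum(s.nics for s in inventories),
        "virtual_storage_tib": sum(s.storage_tib for s in inventories),
        "members": [s.hostname for s in inventories],
    }

if __name__ == "__main__":
    cluster = [
        ServerInventory("node1", cpus=32, memory_gib=512, nics=2, storage_tib=4),
        ServerInventory("node2", cpus=32, memory_gib=512, nics=2, storage_tib=4),
        ServerInventory("node3", cpus=32, memory_gib=512, nics=2, storage_tib=4),
    ]
    # The guest OS sees only this combined view and never sees the cluster.
    print(merge_inventories(cluster))
```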
Scalability is achieved by the cooperating hyperkernels, which implement the ability to add and remove physical servers dynamically as needed. This can be done explicitly, under operator control, or programmatically, by oversight software that tracks performance and usage information. Thus, the second goal is achieved.
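A minimal sketch of such oversight logic is shown below. The watermarks and the decision interface are assumptions chosen for illustration; the point is only that a simple utilisation-driven policy can decide when to grow or shrink the cluster.

```python
# Sketch of an oversight policy that grows or shrinks the cluster.
# Thresholds are illustrative assumptions, not part of any real product API.

HIGH_WATERMARK = 0.80   # add capacity above 80% sustained utilisation
LOW_WATERMARK = 0.30    # shed capacity below 30%, keeping a minimum cluster size
MIN_SERVERS = 2

def scaling_decision(utilisation: float, active_servers: int) -> str:
    """Return 'add', 'remove', or 'hold' for the current measurement window."""
    if utilisation > HIGH_WATERMARK:
        return "add"
    if utilisation < LOW_WATERMARK and active_servers > MIN_SERVERS:
        return "remove"
    return "hold"

# Example: a heavily loaded three-server cluster triggers an expansion request.
assert scaling_decision(0.92, active_servers=3) == "add"
assert scaling_decision(0.20, active_servers=3) == "remove"
assert scaling_decision(0.20, active_servers=2) == "hold"
```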
Reliability is achieved in a very innovative way. The various hyperkernels monitor things like dynamic random access memory error rates, temperature fluctuations, NIC error rates and the like. When an impending problem is detected, there is sufficient time to take corrective action. For example, when a problem is detected on physical server n, the hyperkernels on all the other physical servers are told not to send any active guest physical pages or guest virtual processors to n. An additional physical server may be added to the cluster to maintain previous performance levels. In other words, n is quarantined. Physical server n is then directed to evict all of its active guest physical pages and guest virtual processors to other physical servers. When this is complete, physical server n can be removed for repair. A similar process can be used for hardware or firmware upgrades. All of this is done without having to modify or restart the operating system, which is unaware that any of it is taking place. Thus, the third goal, reliability, is achieved.
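The quarantine-and-evict sequence can be summarised in a short sketch. The classes and method names below are hypothetical stand-ins, not the real hyperkernel mechanism, which operates beneath the guest operating system; the sketch only captures the ordering of the steps described above.

```python
# Sketch of the quarantine-and-evict sequence. All classes and method names
# are invented for illustration; the real mechanism lives in the hyperkernels.

class Server:
    def __init__(self, name):
        self.name = name
        self.blocked_targets = set()   # servers this node must not migrate state to
        self.guest_pages = 1000        # stand-in for active guest pages and vCPUs

    def block_migrations_to(self, other):
        self.blocked_targets.add(other.name)

    def evict_all_guest_state(self, targets):
        # Spread this server's guest state across the healthy members.
        share = self.guest_pages // len(targets)
        for t in targets:
            t.guest_pages += share
        self.guest_pages = 0

class Cluster:
    def __init__(self, members):
        self.members = list(members)

    def retire_server(self, failing, spare=None):
        """Drain a suspect physical server without disturbing the guest OS."""
        # 1. Quarantine: no other node may send guest pages or vCPUs to it.
        for s in self.members:
            if s is not failing:
                s.block_migrations_to(failing)
        # 2. Optionally add a replacement to preserve aggregate capacity.
        if spare is not None:
            spare.block_migrations_to(failing)
            self.members.append(spare)
        # 3. Evict the failing node's guest state onto the remaining members.
        failing.evict_all_guest_state([s for s in self.members if s is not failing])
        # 4. The drained node can now be removed for repair or a firmware upgrade.
        self.members.remove(failing)

cluster = Cluster([Server("n1"), Server("n2"), Server("n3")])
suspect = cluster.members[2]
cluster.retire_server(suspect, spare=Server("n4"))
print([(s.name, s.guest_pages) for s in cluster.members])  # suspect drained, n4 absorbing load
```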
All this can be achieved with competitive performance using technology available today.