Design of Monitoring Tools for Data Centre Downtime Reduction

Journal: International Journal of Emerging Trends in Engineering Research
Publication date: 2023-11-10
Publication type: Journal Article
DOI: 10.30534/ijeter/2023/0211112023 (https://doi.org/10.30534/ijeter/2023/0211112023)

Abstract

This paper presents a new monitoring tool and event management method for data centre compute, network and storage infrastructure based on node event processing. Highly classified data centres must not only maintain the highest level of operational reliability and availability; fast, specific event identification and rectification is also important, because together these improve the availability of resources. The new method assigns a tree node to each element of the data centre resources, and a given node provides information about the compute, network and storage file system configuration. Its major advantage, in our case where a large number of heterogeneous computers are present, is that it helps monitor all elements of the compute resources and provides information for alerting the associated work centres before any of the identified events occurs. By monitoring the state of the systems and informing the concerned work centres a priori, it reduces errors and operating costs of the data centre physical infrastructure while improving operational efficiency. The results show that the use of tree nodes significantly reduces the number of unexpected events, the time needed to identify the main event, and the maintenance response time to events. Through event entity processing, multilayer nodes have a significant impact on the efficient operation of data centre physical infrastructure.

This paper also presents the design and development of two customised dashboards that monitor the compute, storage and network elements of the heterogeneous data centre for uptime maintenance and optimal performance. The dashboards are designed around the nature of the tasks carried out and the resource requirements of the various work centres in the data centre. The first dashboard displays dynamically created icons for each of the compute resources in the data centre. On clicking any icon, complete details of the corresponding server are fetched, showing its status, usage, configuration and available resources. A colour scheme is followed in which the icon is displayed green if the server is healthy, orange if the server is facing a resource crunch (disk, memory, etc.), and red if the server is not reachable. The dashboard GUI refreshes every 5 min (configurable), displaying the latest status details of the servers in the data centre.

The second dashboard monitors the storage, cloud and network infrastructure components. It collects data from different elements of the storage, i.e. Meta Data Servers, storage, core and edge switches, etc., and processes the collected data into a customised format for display. It delivers details such as the availability of storage Meta Data Servers, switches and file systems, disk space capacity monitoring, file system backup status, monitoring of the hierarchical storage including the tape library, and the availability of the production ESXi host cluster. The GUI is updated with new requirements to further fine-tune monitoring operations and reduce manual intervention.
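
The abstract does not give implementation details, so the following is only a minimal sketch of how a tree of nodes and the green/orange/red colour scheme might fit together. The `Node` class, the `alerts` helper, the 90% usage threshold and the rule that a parent node takes the worst status of its children are all assumptions made for illustration, not the authors' design.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Status(IntEnum):
    """Icon colours named in the abstract, ordered by severity."""
    GREEN = 0   # server is healthy
    ORANGE = 1  # server faces a resource crunch (disk, memory, etc.)
    RED = 2     # server is not reachable


# Assumed threshold: the abstract does not say when usage counts as a crunch.
CRUNCH_THRESHOLD_PERCENT = 90.0


@dataclass
class Node:
    """One tree node per data centre element (hypothetical structure)."""
    name: str
    reachable: bool = True
    disk_used_percent: Optional[float] = None
    memory_used_percent: Optional[float] = None
    children: list["Node"] = field(default_factory=list)

    def own_status(self) -> Status:
        """Classify this element alone, ignoring its children."""
        if not self.reachable:
            return Status.RED
        usage = [u for u in (self.disk_used_percent, self.memory_used_percent)
                 if u is not None]
        if any(u >= CRUNCH_THRESHOLD_PERCENT for u in usage):
            return Status.ORANGE
        return Status.GREEN

    def status(self) -> Status:
        """Assumed roll-up rule: a node is as bad as its worst descendant."""
        return max([self.own_status()] + [c.status() for c in self.children])


def alerts(node: Node, path: str = "") -> list[tuple[str, Status]]:
    """List every non-green element together with its path from the root."""
    here = f"{path}/{node.name}"
    own = node.own_status()
    found = [(here, own)] if own != Status.GREEN else []
    for child in node.children:
        found.extend(alerts(child, here))
    return found


# Example tree with made-up element names and readings.
root = Node("datacentre", children=[
    Node("compute", children=[
        Node("server-01", disk_used_percent=40.0, memory_used_percent=55.0),
        Node("server-02", disk_used_percent=95.0, memory_used_percent=30.0),
    ]),
    Node("storage", children=[Node("mds-01", reachable=False)]),
])

print(root.status().name)  # prints "RED"
for element_path, element_status in alerts(root):
    print(element_path, element_status.name)
```

With a roll-up rule like this, a single red or orange icon near the root would point an operator to the affected subtree, and the path returned by `alerts` identifies the specific element. A real dashboard would fill in the readings from collected data on each refresh rather than from hard-coded values.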