Design of Monitoring Tools for Data Centre Downtime Reduction

International Journal of Emerging Trends in Engineering Research (Q2, Engineering). Pub Date: 2023-11-10. DOI: 10.30534/ijeter/2023/0211112023

Abstract

This paper presents a new monitoring tool and event management method for data centre compute, network and storage infrastructure, based on node event processing. Highly classified data centres must not only be maintained at the highest level of operational reliability and availability; fast, specific event identification and rectification, which together improve the availability of resources, are also important. The new method uses a tree node for each element of the data centre resources and provides information about the compute, network and storage file system configuration in a specific node. Its major advantage is that, in our case where a large number of heterogeneous computers are present, it helps monitor all elements of the compute resources and provides information for alerting the associated work centres before any of the identified events occur. By monitoring the state of the systems and informing the concerned work centres a priori, it lowers errors and operating costs in the data centre physical infrastructure while improving operational efficiency. The results show that the use of tree nodes significantly reduces the number of unexpected events, the time needed to identify the main event, and the maintenance response time to events. Through event entity processing, multilayer nodes have a significant impact on the efficient operation of data centre physical infrastructure.

This paper presents the design and development of two customised dashboards that monitor the compute, storage and network elements of the heterogeneous data centre for uptime maintenance and optimal performance. The dashboards are designed with the nature of the tasks carried out and the resource requirements of the various work centres in the data centre in view. One dashboard displays a dynamically created icon for each compute resource in the data centre. Clicking any icon fetches the complete details of the corresponding server, showing its status, usage, configuration and available resources. A colour scheme is followed in which the icon is displayed green if the server is healthy, orange if it is facing a resource crunch (disk, memory, etc.), and red if it is not reachable. The dashboard GUI refreshes every 5 min (configurable), displaying the latest status details of the servers in the data centre. The second dashboard monitors the storage, cloud and network infrastructure components. It collects data from different elements of the storage, i.e. Meta Data Servers, storage, core and edge switches, etc., and processes the collected data into a customised format for display. It reports details such as the availability of storage Meta Data Servers, switches and file systems; disk space capacity; file system backup status; the status of the hierarchical storage, including the tape library; and the availability of the production ESXi host cluster. The GUI is updated with new requirements to further fine-tune monitoring operations and reduce manual intervention.
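The colour scheme and the tree-node idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The 90% usage threshold, the names `Status`, `classify` and `Node`, the sample host names, and the rule that a parent node shows the worst status found in its subtree are all assumptions made for illustration; only the three colours and the 5 min refresh interval come from the abstract.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Status(IntEnum):
    GREEN = 0   # server is healthy
    ORANGE = 1  # resource crunch (disk, memory, etc.)
    RED = 2     # server is not reachable


# Assumed threshold: the abstract does not state one.
USAGE_THRESHOLD = 0.90
# Refresh interval from the abstract (5 min, configurable).
REFRESH_SECONDS = 5 * 60


def classify(reachable: bool, disk_used: float, mem_used: float) -> Status:
    """Map one server's metrics (fractions in [0, 1]) to an icon colour."""
    if not reachable:
        return Status.RED
    if disk_used >= USAGE_THRESHOLD or mem_used >= USAGE_THRESHOLD:
        return Status.ORANGE
    return Status.GREEN


@dataclass
class Node:
    """One element of the data centre, arranged in a tree."""
    name: str
    status: Status = Status.GREEN
    children: List["Node"] = field(default_factory=list)

    def rollup(self) -> Status:
        """Worst status in this subtree (an assumed aggregation rule)."""
        return max([self.status] + [c.rollup() for c in self.children])


# Hypothetical hosts and metrics, for illustration only.
web01 = Node("web01", classify(True, 0.42, 0.55))
db01 = Node("db01", classify(True, 0.95, 0.60))
compute = Node("compute", children=[web01, db01])
root = Node("data-centre", children=[compute, Node("storage")])

print(root.rollup().name)  # ORANGE
```

Ordering the statuses as integers lets a parent node take the maximum over its children, so a single unreachable or resource-constrained server becomes visible at every level above it. A real dashboard would recompute the tree once per refresh interval.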
