Scoring and thresholding for availability

S. Heisig;J. R. M. Hosking
IBM Systems Journal, vol. 47, no. 4, pp. 665-666, 2008. DOI: 10.1147/SJ.2008.5386513. https://ieeexplore.ieee.org/document/5386513/
Citations: 1

Abstract

As the capacity of hardware systems has grown and workload consolidation has taken place, the volume of performance metrics and diagnostic data streams has outscaled the capability of people to handle these systems using traditional methods. As work of different types (such as database, batch, and Web processing), each in its own monitoring silo, runs concurrently on a single image (operating system instance), both the complexity and the business consequences of a single image failure have increased. This paper presents two techniques for generating actionable information out of the overwhelming amount of performance and diagnostic data available to human analysts. Failure scoring is used to identify high-risk failure events that may be obscured in the myriad system events. This replaces human expertise in scanning tens of thousands of records per day and results in a short, prioritized list for action by systems staff. Adaptive thresholding is used to drive predictive and descriptive machine-learning-based modeling to isolate and identify misbehaving processes and transactions. The attraction of this technique is that it does not require human intervention and can be reapplied continually, resulting in models that are not brittle. Both techniques reduce the quantity and increase the relevance of data available for programmatic and human processes.
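The abstract does not say how the adaptive thresholds are computed. As a rough illustration of the general idea only, the sketch below recomputes a percentile threshold from a sliding window of recent observations, so the cutoff follows the workload instead of being fixed by hand. The function name `adaptive_flags`, the window size, and the percentile are assumptions made for this example and are not taken from the paper.

```python
from collections import deque
from statistics import quantiles


def adaptive_flags(values, window=20, pct=95):
    """Flag each value that exceeds a percentile of the recent history.

    Hypothetical helper for illustration; not the authors' method.
    """
    history = deque(maxlen=window)  # sliding window of recent observations
    flags = []
    for v in values:
        if len(history) >= 2:
            # quantiles(n=100) returns 99 cut points; index pct-1 is the
            # pct-th percentile. 'inclusive' never extrapolates past the max.
            threshold = quantiles(history, n=100, method="inclusive")[pct - 1]
            flags.append(v > threshold)
        else:
            flags.append(False)  # too little history to set a threshold
        history.append(v)
    return flags


if __name__ == "__main__":
    # A flat metric with a single spike at position 30.
    metric = [10.0] * 30 + [50.0] + [10.0] * 5
    flagged = [i for i, f in enumerate(adaptive_flags(metric)) if f]
    print(flagged)
```

Because the threshold is rebuilt from recent data at every step, it can be reapplied continually without manual retuning, which is the property the abstract emphasizes. A production system would likely need per-metric windows and some handling of seasonality, neither of which is shown here.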