### Managing Incidents and Problems

> Again, if the campaign is protracted, the resources of the State will not be equal to the strain.
>
> Sun Tzu

The management of issues and problems is critical to creating a highly scalable platform or system. This chapter describes the bare minimum processes that all companies must have to help correctly resolve production incidents and minimize the rate at which they reoccur. Recurring incidents are the enemy of scalability. Each time we allow an incident with the same root cause to recur in our production environments, we steal time away from our teams that would be better used developing systems and features that maximize shareholder value. This theft of engineering time runs counter to our scalability goals, as we are increasing the cost of producing our service or product when the goal of scalability is to produce more with less.

Our past performance is the best indicator we have of our future performance, and our past performance is best described by the incidents we’ve experienced and the underlying problems causing those incidents. To the extent that we currently have problems scaling our systems to meet end-user demand, or concerns about our ability to scale these systems in the future, our recent incidents and problems are very likely great indications of our current and future limitations. By defining appropriate processes to capture and resolve incidents and problems, we can significantly improve our ability to scale. Failing to recognize and resolve our past failures means a failure to learn from our past mistakes in architecture, engineering, and operations.
Failing to recognize past mistakes and learn from them with the intent of ensuring that we do not repeat them is disastrous in any field or discipline. For that reason, we’ve dedicated a chapter to incident and problem management.

Throughout this chapter, we will rely upon the United Kingdom’s Office of Government Commerce (OGC) Information Technology Infrastructure Library (ITIL) for definitions of certain words and processes. The ITIL and the Control Objectives for Information and related Technology (COBIT) created by the Information Systems Audit and Control Association are the two most commonly used frameworks for developing and maturing processes related to managing the software, systems, and organizations within information technology. This chapter is not meant to be a comprehensive review or endorsement of either the ITIL or COBIT. Rather, we try to summarize some of the most important aspects of the parts of these systems as they relate to managing incidents and their associated problems and identify the portions that you absolutely must have regardless of the size or complexity of your organization or company.

Whether you are a large company expecting to complete a full implementation of either the ITIL or COBIT or a small company looking for a fast and lean process to help identify and eliminate recurring scalability related issues, the following are absolutely necessary:

* Recognize the difference between incidents and problems and track them accordingly.
* Follow an incident management life cycle (such as DRIER, identified shortly) to properly catalog, close, report on, and track incidents.
* Develop a problem management tracking system and life cycle to ensure you are appropriately closing and reacting to scalability related problems.
* Implement a daily incident and problem review to support your incident and problem management processes.
* Implement a quarterly incident review to learn from past mistakes and help identify issues repeatedly impacting your ability to scale.
* Implement a robust postmortem process to get to the heart of all problems.

#### What Is an Incident?

The ITIL definition of an incident is “Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service.” That definition has a bit of “government speak” in it.
Let’s give it a more easily understood meaning of “Any event that reduces the quality of our service.” An incident here then could be a downtime related event, an event that causes slowness in response time to end users, or an event that causes incorrect or unexpected results to be returned to end users.

Issue management, as defined by the ITIL, is “to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.” Thus, management of an issue really becomes the management of the impact of the issue. We love this definition and love the approach as it separates cause from impact. We want to resolve an issue as quickly as possible, but that does not necessarily mean understanding its root cause. Therefore, rapidly resolving an incident is critical to the perception of scale, as once a scalability related incident occurs, it starts to cause the perception (and of course the reality) of a lack of scalability.

Now that we understand that an incident is an unwanted event in our system that impacts our availability or service levels and that incident management has to do with the timely and cost-effective resolution of incidents to force the system into perceived normal behavior, let’s discuss problems and problem management.

#### What Is a Problem?

The ITIL defines a problem as “the unknown cause of one or more incidents, often identified as a result of multiple similar incidents.” The ITIL further defines a “known error” as an identified root cause of a problem.
Finally, “The objective of Problem Management is to minimize the impact of problems on the organization.”

Again, we can see the purposeful separation of events (incidents) and their causes (problems). This simple separation of definition in incident and problem helps us in our everyday lives by forcing us to think about their resolution differently. If for every incident we attempt to find root cause before restoring service, we will very likely have lower availability than if we separate the restoration of service from the identification of cause. Furthermore, the skills necessary to restore service and manage a system back to proper operation may very well be different from those necessary to identify root cause of any given incident. If that is the case, serializing the two processes not only wastes engineering time but further destroys shareholder value.

Take, for example, the case that a Web site makes use of a monolithic database structure and is unavailable in the event that the database fails. This Web site has a database failure where the database simply crashes and all processes running the database die and produce varying core files during its peak traffic period from 11 AM to 1 PM. One very conservative approach to this problem may be to say that you never restart your database until you know why it failed. This could take hours and maybe even days while you go through log and core files and bring in your database vendor to help you analyze everything. The intent is obvious: you don’t want to cause any data corruption in restarting the database.

But most databases these days can recover from nearly any crash without significant data hazards. A quick examination could tell you that no processes are running, that you have several core and log files, and that a restart of the database may actually help you understand what type of problem you are experiencing. Maybe you start up the database and run a few quick “health checks” like the insertion and updating of some dummy data to verify that things are likely to work well, then put the database back into service. Obviously, this approach, assuming the database will restart, is likely to result in less downtime associated with scalability related events than serializing the management of the problem (identifying root cause) and the management of the incident (restoration of service).

We’ve just highlighted a very real conflict between these two processes that we’ll address later in this chapter. Specifically, this problem is that incident management (the restoration of service) and problem management (the identification and resolution of root cause) are often in conflict with each other. The rapid restoration of service often conflicts with the forensic data gathering necessary for problem management. Maybe the restart of servers or services causes the destruction of critical data. We’ll discuss how to handle this later. For now, recognize that there is a benefit in thinking about the differences in actions for the restoration of service and the resolution of problems.
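
The quick “health checks” mentioned in the database example could look something like the sketch below. It is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for whatever database you actually run, and the `health_check` table and the `post_restart_health_check` function are hypothetical names, not part of any product.

```python
import sqlite3


def post_restart_health_check(conn: sqlite3.Connection) -> bool:
    """Insert, update, read back, and delete one dummy row.

    Returns True only if every step succeeds. The health_check table
    is a hypothetical scratch table used for this probe alone.
    """
    try:
        cur = conn.cursor()
        cur.execute(
            "CREATE TABLE IF NOT EXISTS health_check "
            "(id INTEGER PRIMARY KEY, note TEXT)"
        )
        # Insertion of dummy data.
        cur.execute("INSERT INTO health_check (note) VALUES (?)", ("probe",))
        row_id = cur.lastrowid
        # Update of the same dummy row.
        cur.execute(
            "UPDATE health_check SET note = ? WHERE id = ?",
            ("probe-updated", row_id),
        )
        # Read the row back to confirm the write is visible.
        cur.execute("SELECT note FROM health_check WHERE id = ?", (row_id,))
        row = cur.fetchone()
        ok = row is not None and row[0] == "probe-updated"
        # Clean up so the probe leaves no data behind.
        cur.execute("DELETE FROM health_check WHERE id = ?", (row_id,))
        conn.commit()
        return ok
    except sqlite3.Error:
        # Any database error means the check failed.
        return False


# An in-memory database stands in for the restarted production database.
conn = sqlite3.connect(":memory:")
healthy = post_restart_health_check(conn)
print("return database to service" if healthy else "keep database out of service")
```

A check like this says nothing about root cause; it only supports the decision to restore service while problem management continues separately.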

#### The Components of Incident Management

The ITIL defines the activities essential to the incident management process as

* Incident detection and recording
* Classification and initial support
* Investigation and diagnosis
* Resolution and recovery
* Incident closure
* Incident ownership, monitoring, tracking, and communication

Implicit to this list is an ordering such that nothing can happen before incident detection, classification comes before investigation and diagnosis, resolution and recovery must happen only after initial investigation, and so on. We completely agree with this list of necessary actions, but if you are not an organization strictly governed by the OGC and you do not require any OGC related certification, there are some simple changes you can make to this order that will speed issue recovery. First, we wish to create our own simplified definitions of the preceding activities.

Incident detection and recording is the activity of identifying that there is an incident affecting users or the operation of the system and then recording it. Both of these are very important, and many companies have quite a bit they can do to make both actions better and faster. Incident detection is all about the monitoring of your systems. Do you have customer experience monitors in place to identify problems before the first customer complaint?
Do they measure the same things customers do? It is very important in our experience to perform actual customer transactions within your system and measure them over time both for the expected results (are they returning the right data?) and for the expected response times (are they operating as quickly as you would expect?).

##### A Framework for Maturing Monitoring

Far too often, we see clients attempting to implement monitoring solutions intended to tell them the root cause of any potential problem they might be facing. This sounds great, but this monitoring panacea rarely works and the failures are largely attributed to two issues:

* The systems they are attempting to monitor aren’t designed to be monitored.
* The company does not approach monitoring in a planned, methodical, evolutionary (or iterative) fashion.

You should not expect a monitoring system (or incident identification system) to correctly identify the faults within your platform if you did not design your platform to be monitored. The best designed systems build the monitoring and notification of incidents into their code and systems. As an example, world class real-time monitoring solutions have the capability to log the times and errors for each internal call to a service. Here, the service may be a call to a data store or another Web service that exposes account information, and so on. The resulting times, rates, and types of errors might be plotted in real time in a statistical process control (SPC) chart with out-of-bound conditions highlighted as an alert on some sort of monitoring panel.
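
As a rough illustration of the out-of-bound check behind such a chart, the sketch below derives control limits from a baseline of response times and flags recent samples that fall outside them. It is a simplification under stated assumptions: the limits are the baseline mean plus or minus three sample standard deviations (production SPC charts often estimate spread differently, for example from moving ranges), and the function names and the millisecond figures are made up for the example.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits: baseline mean +/- sigmas * sample std dev."""
    center = mean(baseline)
    spread = stdev(baseline)
    # Response times cannot be negative, so clamp the lower limit at zero.
    return max(0.0, center - sigmas * spread), center + sigmas * spread


def out_of_bounds(samples, limits):
    """Return (index, value) pairs for samples outside the control limits."""
    lower, upper = limits
    return [(i, v) for i, v in enumerate(samples) if v < lower or v > upper]


# Hypothetical response times (ms) logged for one internal service call.
baseline_ms = [102, 98, 105, 99, 101, 97, 103, 100, 104, 96]
limits = control_limits(baseline_ms)

recent_ms = [101, 99, 180, 102]
alerts = out_of_bounds(recent_ms, limits)
for index, value in alerts:
    print(f"ALERT: sample {index} took {value} ms, outside {limits}")
```

The same check could be applied to error rates per call type; only the series being measured changes.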

Designing a system to be monitored is necessary but not sufficient to identify and resolve incidents quickly. You also need a system that identifies issues from the perspective of your customer and helps to identify the underlying system causing that problem.

Far too many companies bypass the step of monitoring their systems from a customer perspective. Build or incorporate a real-time system that interacts with your platform in the same fashion as your customers and performs the most critical transactions. Throw an alert when the system is outside of internally generated service levels for response time and availability.

Next, implement something to help identify which system is causing the incident. In the ideal world, you will have developed a fault isolative architecture to create failure domains that will isolate failures and help you determine where the fault is occurring (we discuss failure domains and fault isolative architectures in Chapter 21, Creating Fault Isolative Architectural Structures). Failing that, you need monitoring that can help indicate the rough areas of concern. These are typically aggregated system statistics such as load, CPU, or memory utilization.

Note that our first step here is not only issue identification but also the recording of the issues. Many companies that correctly identify issues don’t immediately record them before taking other actions or don’t have systems implemented that will record the problems.
The best answer is to have an automated system that will immediately record the issue and its timestamp, leaving operators free to handle the rest of the process.

The ITIL identifies classification and initial support as the next step, but we believe that in many companies this can really just be the step of “getting the right people involved.” Classification is an activity that, in our estimation, can happen in hindsight, after the issue is resolved.

Investigation and diagnosis is followed by resolution and recovery. Put simply, these are the steps of identifying what has failed and then taking the appropriate steps to put that service back into proper working order. As an example, they may be the steps that determine that application server 5 is not responding (investigation and diagnosis), at which point we immediately attempt a reboot (a resolution step) and the system recovers (recovery).

We often recommend an easily remembered acronym when implementing incident management (see Figure 8.1).
Our acronym, although not supported by the ITIL, supports ITIL implementations and for smaller companies can be adopted with or without an ITIL implementation. The acronym is DRIER and it stands for

* Detect an incident through monitoring or customer contact
* Report the incident, or log it into the system responsible for tracking all incidents, failures, etc.
* Investigate the incident to determine what should be done
* Escalate the incident if not solved in a timely fashion
* Resolve the incident by restoring end-user functionality and log all information for follow up

![](https://blog.baidu-google.com/usr/uploads/2024/06/104555334.png)

In developing DRIER, we’ve attempted to make it easier for our clients to understand how issue management can be effectively implemented. Note that although we’ve removed the classification of issues from our acronym, we still expect that these activities are being performed in order to develop data from the system and help inform other processes. We recommend that the classification of issues happen within the Daily Incident Management meeting identified later in this chapter.

#### The Components of Problem Management

The ITIL defined components of problem management are a little more difficult to navigate than those for incident management. The ITIL definitions define a number of processes that control other processes, and in anything but a large organization this can be a bit cumbersome. We attempt to highlight steps that will help in the resolution of problems within this section without the deep treatment of all of the supporting processes.
Remember that problems are the causes of incidents and as such, within the context of scalability, they are likely to be the reasons you are not scaling to meet end customer demand, are not scaling cost effectively, or will not scale easily in the future.

Problems in our model start concurrent with an issue and last until the root cause of an incident is identified. As such, most problems last longer than most incidents, though a problem can be the cause of many incidents.

Just as with incidents, we need a type of workflow that supports our problem resolutions. We need a system or place to keep all of the open problems and ensure that they can be associated with the incidents they cause. We also need to be able to track these problems to closure, identified in the ideal world by the fix being applied to whatever system is experiencing the incident. Our reasoning for this definition of “closure” is that a problem exists until it no longer causes incidents. This meaning is the one holding the most value for our shareholders, as they have a very high expectation of us and our teams in the maximization of their value.

In our mind, problems are either small enough to be handled by a single person or large enough that they require a team to resolve them. Both are similar in that the workflow and closure criteria remain the same, but they differ in the amount of involvement both from individual contributors and management.
Small problems can be handed to a single person, and when ready for closure, they can go through whatever QA and testing criteria are appropriate and then be validated as closed by the appropriate management or owner of the system experiencing the incidents and problems.

Larger problems are more complex and need specialized processes to help ensure rapid resolution. A large problem may be the subject of a postmortem (described later in this chapter), which in turn will drive a number of investigative or resolution action items to individuals. The outcome of these action items should be reviewed on a periodic basis, either by a dedicated team of project managers responsible for problem resolution, by a manager with responsibility for tracking problem resolution, or within the confines of a meeting dedicated to handling incident tracking and problem resolution such as our recommended daily incident meeting.

#### Resolving Conflicts Between Incident and Problem Management

We previously mentioned an obvious and very real tension between incident management and problem management. Very often, it is the case that the actions necessary to restore a system to service will potentially destroy evidence necessary to determine root cause (problem resolution). Our experience is that incident resolution (the restoration of service) should always trump root cause identification unless an incident has a high frequency of recurrence without root cause and problem resolution.

That said, we also believe it is important to have thought your approach through before you are in the position of needing to make calls on when to restore service and when to continue root cause analysis. We have some suggestions:

* Determine what needs to be collected by the system before system restoration.
* Determine how long you are willing to collect diagnostic information before restoration.
* Determine how many times you will allow a system to fail before you decide that root cause analysis is more important than system restoration.
* Determine who should make the decision as to when systems should be restored if there is a conflict (who is the R and who is the A).

If an incident occurs and you don’t get a good root cause from it during the problem management process, it is wise to determine the preceding for that incident, in addition to ensuring that you clearly identify all of the people who should be involved the next time the incident happens so that you get better diagnostics about the incident more quickly.

#### Incident and Problem Life Cycles

There is an implied life cycle and relationship between incidents and problems. An incident is open or ongoing until the system is restored. This restoration of the system may cause the incident to be closed in some life cycles, or it may move the incident to “resolved” in other life cycles.
Problems are related to incidents and are likely opened at the time that an incident happens, potentially “resolved” after root cause is determined, and “closed” after the problem is corrected and verified within the production environment. Depending upon your approach, incidents might be closed after service is restored, or several incidents associated with a single problem might not be finally closed until their associated problems are fixed.

Regardless of what words you associate to the life cycles, we often recommend the following simple phases be tracked in order to collect good data about incidents, problems, and what they cost you in production:

![](https://blog.baidu-google.com/usr/uploads/2024/06/3949723508.png)

Our approach here is to ensure that incidents remain open until the problems that cause them have root causes identified and fixed in the production environment. Note these life cycles don’t address the other data we like to see associated with incidents and problems, such as the classifications we recommend adding in the Daily Incident Meeting that follows.

We recommend against reopening incidents as it makes it more difficult to query your incident tracking system to identify how often an incident reoccurs. That said, having a way to “reopen” a problem is useful as long as you can determine how often you reopen the problem. Having a problem reoccur after it was thought to be closed is an indication that you are not truly finding root cause and is an important data point to any organization.
Consistent failure to correctly identify root cause results in continued incidents and is disastrous to your scalability initiatives, as it steals time away from your organization, causes repeated failures for your customers, and is dilutive to shareholder wealth and all other initiatives having to do with high availability and an appropriate quality of service for your end users.

#### Implementing the Daily Incident Meeting

We previously discussed the Daily Incident Meeting or Daily Incident Management Meeting. This is a meeting and process we encourage all of our clients to use and adopt as quickly as possible. This meeting occurs daily in most high transaction and rapid growth companies and serves to tie together the incident management process and the problem management process.

All incidents from the previous day are reviewed during this meeting to assign ownership of problem management to an individual, or if necessary a group. The frequency with which a problem occurs as well as its resulting impact serves to prioritize the problems to be root caused and fixed. We recommend that incidents be given classifications meaningful to the company within this meeting. Classifications may include severity, systems affected, customers affected, and so on. Ultimately, the classification system employed should be meaningful in future reviews of incidents to determine impact and areas of the system causing the company the greatest pain. This last point is especially important to identify scalability related issues throughout the system.
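
To make the prioritization step concrete, here is a minimal sketch of how classified incidents could be rolled up into a ranked list of problems for the meeting. The record fields, the severity scale, the identifiers, and the ranking rule (total downtime first, then incident count) are all assumptions chosen for illustration; your own classifications and weighting will differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: str
    problem_id: str        # the problem this incident is associated with
    subsystem: str         # example classification: system affected
    severity: int          # hypothetical scale, 1 = most severe
    downtime_minutes: int  # impact inherited by the problem


def prioritize_problems(incidents):
    """Rank problems by aggregate downtime, then by number of incidents."""
    totals = defaultdict(lambda: {"incidents": 0, "downtime_minutes": 0})
    for inc in incidents:
        totals[inc.problem_id]["incidents"] += 1
        totals[inc.problem_id]["downtime_minutes"] += inc.downtime_minutes
    return sorted(
        totals.items(),
        key=lambda item: (item[1]["downtime_minutes"], item[1]["incidents"]),
        reverse=True,
    )


# Made-up incidents from the previous day.
yesterday = [
    Incident("INC-1", "PRB-db", "database", 1, 45),
    Incident("INC-2", "PRB-login", "login application", 2, 10),
    Incident("INC-3", "PRB-db", "database", 1, 30),
    Incident("INC-4", "PRB-storage", "storage", 3, 10),
    Incident("INC-5", "PRB-login", "login application", 2, 5),
]

ranked = prioritize_problems(yesterday)
for problem_id, impact in ranked:
    print(problem_id, impact)
```

Most ticketing systems can produce an equivalent report directly; the point is only that problems inherit frequency and impact from their incidents.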

Additionally, the open problems are reviewed. Open problems are problems associated with incidents that may be in the open or identified state but not completely closed (problem not root caused and fixed in the production environment). The problems are reviewed to ensure that they are prioritized appropriately, that progress is being made in identifying their cause, and that no help is required by the owners assigned the problems. It may not be possible to review all problems in a single day; if that is the case, a rotating review of problems should start with the highest priority problems (those with the greatest impact) being reviewed most frequently. Problems should also be classified in this meeting in a manner consistent with business need and indicative of type of problem (e.g., internal versus vendor-related), subsystem (e.g., storage, server, database, login application, buying application, and so on) and type of impact (e.g., scalability, availability, response time, and so on). This last classification is especially important to be able to pull out meaningful data to help inform our scale efforts in processes and meetings described later in this portion of the book. Problems should inherit the impact determined by their incidents, including the aggregate downtime, response time issues, and so on.

Let’s pause to review the amount of workflow we’ve discussed thus far in this section.
We’ve identified the need to associate incidents with systems and other classifications, the need to associate problems with incidents and still more classifications, and the need to review data over time. Furthermore, owners need to be assigned at least to problems and potentially to incidents, and status needs to be maintained for everything. Most readers have probably figured out that a system to aid in this collection of information would be really useful. Most open source and third-party “problem ticketing” solutions have a majority of this functionality enabled with some small configuration right out of the box. We don’t think you should wait to implement an incident management process, a problem management process, and a daily meeting until you have a tracking system. However, it will certainly help if you work to put a tracking system in place shortly after the implementation of these processes.

#### Implementing the Quarterly Incident Review

No set of incident and problem management processes would be complete without a process of reviewing their effectiveness and ensuring that they are successful in eliminating recurring incidents and problems.

We mentioned earlier in “Incident and Problem Life Cycles” that you may find yourself incorrectly identifying root cause for some problems. This is almost guaranteed to happen to you at some point and you need to have a way of determining when it is happening. Is the same person incorrectly identifying root cause? This may require some coaching of the individual, a change in the person’s responsibilities, or the removal of the person from the organization. Is the same subsystem consistently being misdiagnosed?
If so, perhaps you have insufficient training or documentation on how the system really behaves. Are you consistently having problems with a single partner or vendor? If so, you may need to implement a vendor scorecard process or give the vendor other performance-related feedback.

Additionally, to ensure that your scalability efforts are applied to the right systems, you need to review past system performance and evaluate the frequency and impact of past events on a per-system or per-subsystem basis. This evaluation helps to inform the prioritization of future architectural work and becomes an input to processes such as the Headroom Process or 10x process that we describe in Chapter 11, Determining Headroom for Applications.

The output of the quarterly incident review also gives you the data that you need to define the business case for scalability investments, as described in Chapter 6, Making the Business Case. Being able to show the business where you are going to put your effort and why, prioritized by impact, is a powerful way of securing the resources necessary to run your systems and maximize shareholder wealth. Furthermore, using that data to tell the story of how your efforts are resulting in fewer scalability-associated outages and response time issues makes the case that past investments are paying dividends and helps give you the credibility you need to continue doing a good job.
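
The per-subsystem evaluation of frequency and impact described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Incident` record, its field names, and the sample data are hypothetical and not taken from any particular ticketing product, and ranking by total downtime first is just one reasonable ordering.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record, as it might be exported from a ticket system."""
    subsystem: str         # e.g. "storage", "database", "login application"
    impact_type: str       # e.g. "scalability", "availability", "response time"
    downtime_minutes: int  # customer-facing impact attributed to this incident


def summarize_by_subsystem(incidents):
    """Return (subsystem, incident_count, total_downtime) rows, worst first."""
    counts = defaultdict(int)
    downtime = defaultdict(int)
    for incident in incidents:
        counts[incident.subsystem] += 1
        downtime[incident.subsystem] += incident.downtime_minutes
    rows = [(name, counts[name], downtime[name]) for name in counts]
    # Rank by total impact first, then by frequency (an assumed ordering).
    return sorted(rows, key=lambda row: (row[2], row[1]), reverse=True)


# Made-up sample data standing in for one quarter of incidents.
quarter = [
    Incident("storage", "availability", 45),
    Incident("database", "scalability", 20),
    Incident("storage", "availability", 30),
    Incident("login application", "response time", 10),
    Incident("database", "scalability", 35),
]

for name, count, minutes in summarize_by_subsystem(quarter):
    print(f"{name}: {count} incidents, {minutes} minutes of impact")
```

A summary like this is only as good as the classifications assigned in the daily meeting, which is one reason to categorize incidents and problems consistently.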

#### The Postmortem Process

Earlier in this chapter, we identified that some large problems require a special approach to help resolve them. Most often, these large problems will require a cross-functional brainstorming meeting, often referred to as a postmortem or after action review meeting. Although the postmortem meeting is valuable in helping to identify root cause for a problem, if run properly, it can also help identify issues related to process and training. It should not be used as a forum for finger pointing.

The first step in developing a postmortem process is to determine how large an incident or group of incidents must be before a postmortem is required. Postmortems are very useful but costly events, as you are taking several people away from their assigned tasks and putting them on the special duty of helping to determine what failed and what can work better within a system, process, or organization. Thinking back to our section on metrics and measurements within Chapter 5, Management 101, the use of people in a postmortem reduces our engineering efficiency metric because they spend hours away from creating product and scaling systems. We want to use the postmortem on items that have hugely and negatively impacted us, but not on every single incident we face (unless those incidents are all large).
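
One way to make the size threshold explicit is to write it down as a small policy check. The sketch below is an assumption-laden illustration: `PostmortemPolicy` and `requires_postmortem` are hypothetical names, and the 10% and 15-minute values simply echo the example policy used in the case study later in this chapter rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostmortemPolicy:
    """Hypothetical thresholds; pick values that fit your own business."""
    min_customer_percent: int  # incident must affect MORE than this share
    min_duration_minutes: int  # and last AT LEAST this long


def requires_postmortem(customer_percent, duration_minutes, policy):
    """Return True when an incident is large enough to justify a postmortem."""
    return (customer_percent > policy.min_customer_percent
            and duration_minutes >= policy.min_duration_minutes)


# Example values borrowed from the case study, not a general rule.
policy = PostmortemPolicy(min_customer_percent=10, min_duration_minutes=15)

print(requires_postmortem(25, 30, policy))  # large and long enough
print(requires_postmortem(25, 5, policy))   # too short
print(requires_postmortem(10, 60, policy))  # exactly 10% is not "more than 10%"
```

Keeping the threshold in one place makes it easy to revisit if postmortems turn out to be held too often or too rarely.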

The input to the postmortem process is a timeline that includes data and timestamps leading up to the end-user incident, the time of the actual customer incident, all times and actions taken during the incident, and everything that happened up until the time of the postmortem. Ideally, all actions and their associated timestamps have been logged in a system during the restoration of the service, and all other actions have been logged either in the same system or in other places to cover what has been done to collect diagnostics and fix the root cause. Logs should be parsed to grab all meaningful data leading up to the incident, with timestamps associated with the collected data.

The attendees of the postmortem should consist of a cross-functional team from software engineering, systems administration, database administration, network engineering, operations, and any other technical organization that could have valuable input, such as capacity planning. A manager who is trained in facilitating meetings and who also has some technical background should be assigned to run the meeting. Figure 8.2 introduces the process that the team should cover during the postmortem meeting.

![](https://blog.baidu-google.com/usr/uploads/2024/06/3128743415.png)

The first step in the postmortem process is to cover the initial timeline and ensure that it is complete. We call this the Timeline Phase of the postmortem.
Attendees of the postmortem might identify that critical dates, times, and actions are missing. For instance, the team might identify that an alert from an application was thrown and not acted upon two hours before the first item identified in the initial incident timeline. Note that during this phase of the process only times, actions, and events should be recorded. No problems or issues should be identified or debated.

The next step in the postmortem meeting is to cover the timeline and identify issues, mistakes, problems, or areas where additional data would be useful. We call this phase the Issue Phase. Each of these areas is logged as an issue, but no discussion of how to fix an issue takes place until the entire timeline has been discussed and all of the issues have been identified. The facilitator of the meeting needs to ensure that she is creating an environment in which all ideas and concerns over what might be issues are encouraged without concern for retribution or retaliation. She should also ensure that no reference to ownership is made. For instance, it is inappropriate in a postmortem to say, “John ran the wrong command there.” Instead, the reference should be “at 10:12 AM, command A was incorrectly issued.” Ownership can be identified later by management if there is an issue with someone violating company policy, repeatedly making the same mistake, or simply needing some coaching. The postmortem is not meant to be a public flogging of an individual, and if it is used as such a forum, the efficacy of the process will be destroyed.
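
The Timeline and Issue Phases can be pictured as two simple data structures: a chronologically merged list of events, and a list of issues that point back at events without naming a person. The following is a minimal sketch only; `TimelineEntry`, `Issue`, `build_timeline`, the source labels, and the sample date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    source: str       # hypothetical label for where the entry was logged
    description: str  # what happened, stated without naming a person


@dataclass(frozen=True)
class Issue:
    entry: TimelineEntry  # the timeline event the issue refers to
    note: str             # what went wrong; no owner and no fix yet


def build_timeline(*sources):
    """Timeline Phase: merge entries from several logs into chronological order."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry.timestamp)


day = (2024, 6, 4)  # arbitrary date used only for this illustration
application_log = [
    TimelineEntry(datetime(*day, 8, 0), "application log",
                  "Application alert thrown; no action taken"),
]
incident_tickets = [
    TimelineEntry(datetime(*day, 10, 12), "incident ticket",
                  "Command A was incorrectly issued"),
    TimelineEntry(datetime(*day, 10, 0), "incident ticket",
                  "First customer contact logged"),
]

timeline = build_timeline(application_log, incident_tickets)
for entry in timeline:
    print(f"{entry.timestamp:%H:%M} [{entry.source}] {entry.description}")

# Issue Phase: record issues against timeline entries, still without actions.
issues = [Issue(timeline[0], "Alert was not acted upon for two hours")]
```

Leaving any owner field out of `Issue` mirrors the facilitator's rule that ownership is not discussed during the meeting.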

After a first run through the timeline is made and an issues list generated, a second pass through the timeline should be made with an eye toward whether actions were taken in a timely manner. For instance, let’s consider the case where a system starts to exhibit high CPU utilization at 10 AM and no action is taken. At noon, customers start to complain of slow response times. A second pass through the timeline might result in someone pointing out that the early indication of high CPU utilization might be correlated with the slow response times later, and an issue is generated to record that observation.

After a complete list of issues is generated from at least one and preferably two passes through the timeline, we are ready to begin creating the task list. This final phase is called the Action Phase of the postmortem. The task list is generated from the issues list, with at least one task identified for each issue. Where the team can’t agree upon a specific action to fix an issue, an analysis task can be created to identify additional tasks to solve the issue.

After the task list is created, owners should be assigned for each task. Where necessary, use the RASCI methodology outlined earlier to clearly identify who is responsible for completing the task and who is the Approver of the task, and so on.
Attempt to use the SMART criteria for the tasks, making them specific, measurable, aggressive/attainable, realistic, and timely. Though initially intended for goals, the SMART acronym can also help ensure that we are putting time limits on our tasks. Ideally, these items are logged into a problem management system or database for future follow-up.

#### Putting It All Together

Putting the components of incident management, problem management, the daily incident management meeting, and the quarterly incident review together, along with a well-defined postmortem process and a system to track and report on all incidents and problems, will give us a good foundation for identifying, reporting, prioritizing, and taking action against past scalability issues.

Any given incident will follow our DRIER process of detecting the issue, reporting upon the issue, investigating the issue, escalating the issue, and resolving the issue. The issue is immediately entered into a system we’ve developed to track incidents and problems. Investigation leads to a set of immediate actions and, if help is needed, we escalate according to our escalation processes. Resolving the issue changes the issue status to “resolved” but does not close the incident until root cause is identified and fixed within our production environment.

The problem is assigned to an individual or organization during our daily status review unless it is of the size that it needs immediate assignment.
During that daily review, we also review incidents and their status from the previous day and the high-priority problems that remain open in the system. We also validate the closure of problems and assign categories for both incidents and problems in our daily meeting.

Problems, when assigned, get worked in priority order by the team or individual assigned to them. After root cause is determined, the problem moves to the “identified” status, and when fixed and validated in production, it moves to “closed.” Large problems go through a well-defined postmortem process with the focus being the identification of all possible issues within the process and technology stacks. Problems are tracked within the same system and reviewed in our daily meeting.

Quarterly, we review incidents and problems to determine whether our processes are correctly closing problems and to determine the most common occurrences of incidents. This data is collected and used to prioritize architectural, organizational, and process changes to aid in our mission of increasing scalability. Additional data is collected to determine where we are doing well in reducing scale-related problems and to help create the business case for scale initiatives.

Let’s look at how Johnny and his team employ these processes by following an incident through its life cycle. Late one Tuesday evening, the network operations center is notified of an event by the customer support organization that is characterized by several customers complaining that they cannot get access to some professional training documents within the HRM product offering.
The issue has now been detected within the DRIER process, so the operations team logs (or reports) it into the incident management system. The company uses an open source ticket system to track incidents through their life cycle. The network operations team can’t immediately identify the root cause of the problem, even though they suspect a recent change, so, pursuant to company policy for an incident of this size (medium), after investigating for 10 minutes they escalate to the next level of support. The team logs all of their investigations and opens up a chat room for communication as well as a phone line/conference bridge for coordination.

Level two support, consisting of software engineers, systems administrators, and database administrators, works the problem for the next 20 minutes and identifies that a network attached storage device containing the training documents identified by the complaining customers has had several training documents renamed. The team identifies the appropriate names for the documents and changes the names. Working with customer support, the team determines that the problem is resolved by renaming the documents. They close the incident knowing that the problem will remain open until after the morning incident review.

The next morning, during the daily incident review, as Johnny Fixer is reviewing the previous day’s problems and all open issues with his team, he determines that the size of the previous night’s document incident is large enough to demand a postmortem.
Johnny requires postmortems for any incident impacting more than 10% of his customer base for 15 minutes or more. He assigns ownership for running the postmortem to his infrastructure and operations manager, Tom Harde.

Tom and his team generate an initial timeline for the postmortem from items in the incident management system that were logged by the operations team and the team attempting to resolve the problem. Additionally, they identify that there were application errors being thrown two hours prior to the first customer contact and that customer support did not contact the operations center for two hours after the first customer contact was received. Additionally, they find several changes logged against the network attached storage device in question. They schedule the postmortem with members of the level two support team, the teams logging the changes, and customer support representatives.

Stepping through the postmortem process, the team covers the timeline. Several members attempt to jump to adding issues, but Tom focuses the team initially on completing the timeline. Several data points are added to the timeline before moving along to the next part of the postmortem. During the second phase of the postmortem, Tom and the team identify issues. Again, team members attempt to jump to actions, but Tom focuses them on just identifying issues. The delay between the first alerts from the software and the first customer contact, and the delay from first customer contact to first report, are included in the issues. The team also identifies a process issue with one of the changes that caused the files to be improperly changed.
In the next phase of the postmortem, they identify actions and owners.

One month later, during Johnny Fixer’s quarterly incident review, Johnny notes with his team that the issues with files apparently missing on the network attached storage devices happen at least twice a quarter and sometimes even more than that. Although several root causes have been identified, the problem continues to happen. Johnny assigns Tom to look into the issue and starts to track it again in the morning incident reviews with the hope of finding the true root cause.

#### Conclusion

We have focused on one of the most important processes within any technology organization: the process of resolving, tracking, and reporting on incidents and problems. We learned that incident resolution and problem management should be thought of as two separate and sometimes competing processes. We also discussed the need for some sort of system to help us manage the relationships and the data associated with these processes.

We gave examples of how a few simple meetings can help meld the incident and problem management processes. The daily incident management meeting helps manage incident and problem resolution and status, whereas the quarterly incident review helps to create a continual process improvement cycle. Finally, we discussed supportive processes such as the postmortem process to help drive major problem resolution.

##### Key Points

* Incidents are issues in our production environment, and incident management is the process focused on timely and cost-effective restoration of service in the production environment.
* Problems are the cause of incidents, and problem management is the process focused on determining the root cause of and correcting problems.
* Incidents can be managed using the acronym DRIER, standing for detect, report, investigate, escalate, and resolve.
* There is a natural tension between incident and problem management. Rapid restoration of service may cause some forensic information to be lost that would otherwise be useful in problem management. Thinking through how much time should be allowed to collect data and what data should be collected will help ease this tension for any given incident.
* Incidents and problems should have defined life cycles. An example is for an incident to be open, resolved, and closed, whereas a problem is open, identified, and closed.
* A daily incident meeting should be organized to review incidents and problem status, assign owners, and assign meaningful business categorizations.
* A quarterly incident review should look back at past incidents and problems in order to validate proper first-time closure of problems and thematically analyze both problems and incidents to help prioritize scalability-related architecture, process, and organization work.
* The postmortem process is a brainstorming process used for large incidents and problems to help drive closure and identify supporting tasks.
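
The example life cycles named in the key points (open, resolved, closed for incidents; open, identified, closed for problems) can be enforced by a tracking system as a small state machine. The sketch below is illustrative only: the dictionary layout and the `advance` helper are assumptions, and it deliberately allows only forward moves, whereas a real system might also need a way to reopen a problem whose root cause was misidentified.

```python
# Allowed next states for each status, following the chapter's example life cycles.
INCIDENT_LIFE_CYCLE = {
    "open": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}
PROBLEM_LIFE_CYCLE = {
    "open": {"identified"},
    "identified": {"closed"},
    "closed": set(),
}


def advance(life_cycle, current, target):
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in life_cycle.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


# Service is restored: the incident is resolved, but the problem stays open.
incident_status = advance(INCIDENT_LIFE_CYCLE, "open", "resolved")

# Root cause is determined, then the fix is validated in production.
problem_status = advance(PROBLEM_LIFE_CYCLE, "open", "identified")
problem_status = advance(PROBLEM_LIFE_CYCLE, problem_status, "closed")

print(incident_status, problem_status)
```

Rejecting a jump straight from “open” to “closed” is the point of the exercise: a problem should not be closed before its root cause has been identified.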