### Chapter 9 Managing Crisis and Escalations

> There is no instance of a country having benefitted from prolonged warfare.
> -- Sun Tzu

A crisis is an incident on steroids. If not handled properly and if approached in the same fashion you would approach smaller incidents, a crisis will drive your customers away and tear your organization and company apart. Crisis situations, if handled properly, including ensuring that you learn from them and that they never happen again, can redefine a company and help set it on the right track. Assuming that the crisis was not a result of a gross lack of judgment and assuming that the company lives through it, it can serve to galvanize the company and become a source of strength. In this chapter, we discuss how to handle major crises and more specifically crises related to scalability. We will show you how to go up, over, around, or if necessary through the brick wall of scale.

#### What Is a Crisis?

We prefer the medical definitions of crisis from Merriam-Webster’s dictionary: “the turning point for better or worse in an acute disease or fever” and “a paroxysmal attack of pain, distress or disordered function.” Wow! In our experience, these two definitions define a crisis of scale better than any we know.

The first definition offered by Merriam-Webster is our favorite as it is so true of our personal experiences. A crisis can be both cathartic and galvanizing.
It can be your Nietzsche event, allowing you to rise from the ashes like the mythical Phoenix. It can in one fell swoop fix many of the things we described in Chapter 6, Making the Business Case, and force the company to focus on scale. Ideally, you will have gotten the company interested in fixing the problems before the crisis occurs. More importantly, we hope you’ve led ethically and not landed at this crisis to prove a point to peers or management, as that would be the epitome of the wrong thing to do! But if you’ve arrived here and take the right actions, you can become significantly better.

The actual definition of a crisis that is relevant to your business is based on business impact and impact to the competitive landscape. It might be the case that a 30- to 60-minute failure that starts between 1 AM and 1:30 AM is not really a crisis situation for your company, whereas a three-minute failure at noon is a major crisis. Your business may be such that you make 30% of your annual revenue during the three weeks surrounding Christmas. As such, downtime during this three-week period may be an order of magnitude more costly than downtime during the remainder of the year. In this case, a crisis situation for you may be any downtime between the first and third weeks of December, whereas at any other point during the year you are willing to tolerate 30-minute outages. Your business may rely upon a data warehouse supporting hundreds of analysts between the hours of 8 AM and 7 PM, with nearly no usage after 7 PM and during weekends. A crisis for you in this case may be any outage during working hours that would idle the expensive time of your analysts.
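
One way to make such a threshold explicit is to write it down as a small policy table. The sketch below is only an illustration under stated assumptions: the names `CrisisWindow`, `POLICY`, `DEFAULT_TOLERATED`, and `is_crisis` are hypothetical, and the one-minute peak tolerance is an invented value chosen so that the three-minute outage at noon from the text counts as a crisis while a 30-minute outage at 1 AM does not.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class CrisisWindow:
    """A recurring daily window with its own outage tolerance (hypothetical policy)."""
    name: str
    start: time               # window start, inclusive
    end: time                 # window end, exclusive
    tolerated: timedelta      # longest outage that is NOT treated as a crisis
    weekdays_only: bool = False


# Illustrative values only; every business has to pick its own.
POLICY = [
    CrisisWindow("working hours", time(8, 0), time(19, 0),
                 timedelta(minutes=1), weekdays_only=True),
]
DEFAULT_TOLERATED = timedelta(minutes=30)   # applies outside every window


def is_crisis(started_at: datetime, duration: timedelta) -> bool:
    """Return True when an outage exceeds the tolerance for the time it started."""
    for window in POLICY:
        if window.weekdays_only and started_at.weekday() >= 5:
            continue  # Saturday or Sunday: this window does not apply
        if window.start <= started_at.time() < window.end:
            return duration > window.tolerated
    return duration > DEFAULT_TOLERATED


# A three-minute outage at noon on a Tuesday crosses the threshold;
# a 30-minute outage at 1 AM the same day does not.
print(is_crisis(datetime(2024, 3, 5, 12, 0), timedelta(minutes=3)))
print(is_crisis(datetime(2024, 3, 5, 1, 0), timedelta(minutes=30)))
```

The sketch classifies by start time only. A real policy would probably also account for outages that span a window boundary and for seasonal windows such as the weeks around Christmas.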

That’s not to say that all crises are equal, and obviously not everything should be treated as a crisis. Certainly, a brownout of activity on your Web site for three minutes Monday through Friday during “prime time” (peak utilization) is more of a crisis than a single 30-minute event during relatively low user activity levels. Our point here is that after you determine what the crisis threshold is for your company, everything that exceeds that should be treated the same way. Losing a leg is absolutely worse than losing a finger, but both require immediate medical attention. The same is true with crises; after the predefined crisis threshold is passed, they should all be approached the same way.

You may recall from Chapter 8, Managing Incidents and Problems, that recurring problems (those problems that occur more than once) rob you of time and therefore destroy your ability to scale your services and scale your organization. Crises also ruin scale as they steal even more resources; and allowing the root cause of a crisis to surface more than once will not only steal vast resources and keep you from scaling your organization and services, it also has the possibility of destroying your business.

#### Why Differentiate a Crisis from Any Other Incident?

You can’t treat a crisis as any normal incident because it won’t treat you the way an incident would treat you. This is the time to pull out all the stops, both during the crisis and after it ends. This is the time to fix the problem faster than you’ve ever fixed a problem before, and then continue working the real root causes to remove every bit of cholesterol that has clogged your scalability arteries and caused this technology heart attack. When the operation is done, it’s time to change your life, including the technical, process, and organizational equivalent of exercise, diet, and discipline, to ensure that you never have a crisis again.

Although we are generally believers that there is a point at which adding resources to a project has diminishing returns, in a crisis, you are looking for the shortest possible time to resolution rather than the efficiency or return on those resources. While in a crisis, it is not the time to think about future product delivery as such thoughts and their resulting actions will only increase the duration of the crisis. As a matter of fact, you need to lead by example and be at the scene of the crisis for as long as it is humanly possible and eliminate all other distractions from your schedule. Every minute that the crisis continues is another minute that destroys shareholder value.

Your job is to stop the crisis from causing a negative trend within your business. If you can’t fix it quickly by getting enough people on the problem to ensure that you have appropriate coverage, two things are going to happen.
The first is that the crisis will perpetuate: Events will happen again and again and you will lose customers, revenue, and maybe your business. The second is that in allowing the crisis to continue to take precious time out of your organization over a prolonged period, you will eventually lose traction on other projects anyway. The very thing you were trying to avoid by not putting “all hands on deck” happens anyway and you allowed the problem to go on longer than necessary.

#### How Crises Can Change a Company

Perhaps you now agree that not all incidents are created equal and that some incidents, through their duration or frequency, can actually kill a company. You may be wondering how any of that can be good. How is it possible that something that bad might actually turn out to benefit the company?

The answer is that it only benefits the company if the crisis, or series of crises, serves to change the direction, culture, organization, processes, and technology of the company. It’s not like you are going to wake up three days after the crisis and everything will magically be better. As a matter of fact, the resolution of the crisis is going to pale in comparison to the blood, sweat, and tears you will shed trying to change everything. But the crisis or series of crises can serve as the catalyst for change. It will serve to focus shareholders, directors, executives, and managers on the horrors of failing to meet the scalability needs of the company.

Again, we can’t urge you enough to manage and lead in such a way that such a crisis can be avoided. The pain of such an event is incredible and it can cost shareholders millions (or more) in market capitalization. Steering a company toward a crisis as a method of changing culture is like putting a gun to your head to solve a headache. It’s just not the right thing to do.

##### The eBay Scalability Crisis

As proof that a crisis can change a company, consider eBay in 1999. In its early days, eBay was the darling of the Internet and up to the summer of 1999, few if any companies had experienced its exponential growth in users, revenue, and profits. Through the summer of 1999, eBay experienced many outages including a 20-plus hour outage in June of 1999. These outages were at least partially responsible for the reduction in stock price from a high in the mid $20s the week of April 26, 1999, to a low of $10.42 the week of August 2, 1999.

The cause of the outages isn’t really as important as what happened within the company after the outages. Additional executives were brought in to ensure that the engineering organization, the engineering processes, and the technology they produced could scale to the demand placed on them by the eBay community. Initially, additional capital was deployed to purchase systems and equipment (though eBay was successful in actually lowering both its technology expense and capital on an absolute basis well into 2001).
Processes were put in place to help the company design systems that were more scalable, and the engineering team was augmented with engineers experienced in high availability and scalable designs and architectures. Most importantly, the company created a culture of scalability. The lessons from the summer of pain are still discussed at eBay, and scalability has become part of eBay’s DNA.

eBay continued to experience crises from time to time, but these crises were smaller in terms of their impact and shorter in terms of their duration as compared to the summer of 1999. The culture of scalability netted architectural changes, people changes, and process changes. One such change was eBay’s focus on managing each and every crisis in the fashion described in this chapter.

#### Order Out of Chaos

Bringing in and managing several different organizations within a crisis situation is difficult at best. Most organizations have their own unique subculture and oftentimes, even within a technology organization, those subcultures don’t even truly speak the same language. It is entirely possible that an application developer will use terms with which a systems engineer is not familiar, and vice versa.

Moreover, if not managed, the attendance of many people and multiple organizations within a crisis situation will create chaos. This chaos will feed on itself, creating a vicious cycle that can actually prolong the crisis or, worse yet, aggravate the damage done in the crisis through someone taking an ill-advised action.
Indeed, if you cannot effectively manage the force you throw at a crisis, you are better off using fewer people.

Your company may have a crisis management process that consists of both phone and chat (instant messaging or IRC) communications. If you listen on the phone or follow the chat session, you are very likely to see an unguided set of discussions and statements as different people and organizations go about troubleshooting or trying different activities in the hopes of finding something that will work. You may have questions asked that go unanswered or requests to try something that go without authorization. You might as well be witnessing a grade school recess, with different groups of children running around doing different things with absolutely no coordination of effort. But a crisis situation isn’t a recess; it’s a war, and in war such a lack of coordination results in an increase in the rate of friendly casualties through “friendly fire.” In a technology crisis, these friendly casualties are manifested through prolonged outages, lost data, and increased customer impact.

What you really want to see in such a situation is some level of control applied to the chaos. Rather than a grade school recess, you hope to see a high school football game. Don’t get us wrong, you aren’t going to see an NFL style performance, but you do hope that you witness a group of professionals being led with confidence to identify a path to restoration and a path to identification of root cause.

Different groups should have specific objectives and guidelines unique to their expertise. There should be an expectation that they are reporting their progress clearly and succinctly at regular time intervals. Hypotheses should be generated, quickly debated, and either prioritized for analysis or eliminated as good initial candidates. These hypotheses should then be quickly restated as the tasks necessary to determine validity and handed out to the appropriate groups to work them, with times for results clearly communicated.

Someone on the call or in the crisis resolution meeting should be in charge, and that someone should be able to paint an accurate picture of the impact, what has been tried, the best hypotheses being considered and the tasks associated with those hypotheses, and the timeline for completion of the current set of actions, as well as the development of the next set of actions. Other members should be managers of the technical teams assembled to help solve the crisis and one of the experienced (described in organizations as senior, principal, or lead) technical people from each manager’s teams. We will now describe these roles and positions in greater detail. Other engineers should be gathered in organizational or cross-functional groups to deeply investigate domain areas or services within the platform undergoing a crisis.

##### The Role of the “Problem Manager”

The preceding paragraphs have been leading up to a position definition.
We can think of lots of names for such a position: outage commander, problem manager, incident manager, crisis commando, crisis manager, issue manager, and from the military, battle captain. Whatever you call the person, you had better have someone capable of taking charge on the phone. Unfortunately, not everyone can fill this kind of a role. We aren’t arguing that you need to hire someone just to manage your major production incidents to resolution, though if you have enough of them you might consider that; rather, ensure you have at least one person on your staff who has the skills to manage such a chaotic environment.

The characteristics of someone capable of successfully managing chaotic environments are rather unique. As with leadership, some people are born with them and some people nurture them over time. The person absolutely needs to be technically literate but not necessarily the most technical person in the room. He should be able to use his technical base to form questions and evaluate answers relevant to the crisis at hand. He does not need to be the chief problem solver, but he needs to effectively manage the process of the chief problem solvers gathered within the crisis. The person also needs to be incredibly calm “inside” but be persuasive “outside.” This might mean that he has the type of presence to which people naturally are attracted or it may mean that he isn’t afraid to yell to get people’s attention within the room or on the conference call.

The crisis manager needs to be able to speak and think in business terms.
She needs to be conversant enough with the business model to make decisions in the absence of higher guidance on when to force incident resolution over attempting to collect data that might be destroyed and would be useful in problem resolution (remember the differences in definitions from Chapter 8). The crisis manager also needs to be able to create succinct, business-relevant summaries from the technical chaos that is going on around her in order to keep the remainder of the business informed.

In the absence of administrative help to document everything said or done during the crisis, the crisis manager is responsible for ensuring that the actions and discussions are represented in a written state for future analysis. This means that the crisis manager will need to keep a history of the crisis as well as help ensure that others are keeping histories to be merged. A shared chat room with timestamps enabled is an excellent choice for this.

In terms of Star Trek characters and financial gurus, the person is 1/3 Scotty, 1/3 Captain Kirk, and 1/3 Warren Buffett. He is 1/3 engineer, 1/3 manager, and 1/3 business manager. He has a combat arms military background, an M.B.A., and a Ph.D. in some engineering discipline. Hopefully, by now, we’ve indicated how difficult it is to find someone with the experience, charisma, and business acumen to perform such a function. To make the task even harder, when you find the person, she probably isn’t going to want the job as it is a bottomless pool of stress. You will either need to incent the person with the right merit-based performance package or you will need to clearly articulate how it is that they have a future beyond managing crises in your organization.
However you approach it, if you are lucky enough to be successful in finding such an individual, you should do everything possible to keep him or her for the “long term.”

Although we flippantly suggested the M.B.A., Ph.D., and military combat arms background, we were only half kidding. Such people actually do exist! As we mentioned earlier, the military has a role that they put such people in to manage their battles or what most of us would view as crises. The military combat arms branches attract many leaders and managers who thrive on chaos and are trained and have the personalities to handle such environments. Although not all former military officers have the right personalities, the percentage within this class of individual who have the right personalities is significantly higher than in the rest of the general population. Moreover, they have life experiences consistent with your needs and specialized training on how to handle such situations. Finally, as a group, they tend to be highly educated, with many of them having at least one and sometimes multiple graduate degrees. Ideally, you would want one who has been out of the military for a while and running engineering teams to give him the proper experience.

##### The Role of Team Managers

Within a crisis situation, a team manager is responsible for passing along action items to her teams and reporting progress, ideas, hypotheses, and summaries back to the crisis manager. Depending upon the type of organization, the team manager may also be the “senior” or “lead” engineer on the call for her discipline or domain.

A team manager functioning solely in a management capacity is expected to manage his team through the crisis resolution process. A majority of his team is going to be somewhere other than the crisis resolution (or “war”) room or on a call other than the crisis resolution call if a phone is being used. This means that the team manager must communicate with and monitor the progress of his team as well as interact with the crisis manager. Although this may sound odd, the hierarchical structure with multiple communication channels is exactly what gives this process so much scale. This structured hierarchy affects scale in the following way: If every manager can communicate and control 10 or more subordinate managers or individual contributors, the capability in terms of manpower grows by one or more orders of magnitude. The alternative is to have everyone communicating in a single room or in a single channel, which obviously doesn’t scale well as communication becomes difficult and coordination of people becomes near impossible. People and teams would quickly drown each other out in their debates, discussions, and chatter. Very little would get done in such a crowded environment.
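
The scale argument can be checked with a little arithmetic. The sketch below assumes a uniform span of control of 10 at every level, which the text gives only as a rough figure. The function names are hypothetical, and counting pairwise links is a common rough proxy for communication overhead rather than anything the authors prescribe.

```python
def reachable_people(span: int, levels: int) -> int:
    """People coordinated below one crisis manager: span + span^2 + ... + span^levels."""
    return sum(span ** level for level in range(1, levels + 1))


def pairwise_links(people: int) -> int:
    """Distinct person-to-person conversations possible in one shared channel."""
    return people * (people - 1) // 2


# One level of managers, each directing 10 contributors on a separate channel.
print(reachable_people(10, 1))   # 10 team managers on the crisis call
print(reachable_people(10, 2))   # 110 people in total: 10 managers plus 100 contributors
print(pairwise_links(110))       # 5995 possible conversations if all 110 shared one channel
print(pairwise_links(11))        # 55 within any single channel of a leader plus 10 people
```

With the hierarchy, no channel holds more than 11 people, and only the managers listen to two channels at once. That is the sense in which each added level buys roughly another order of magnitude of manpower.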

Furthermore, this approach to having managers listen and communicate on two channels has been very effective for many years in the military. Company commanders listen to and interact with their battalion commanders on one channel and issue orders and respond to multiple platoon leaders on another channel (the company commander is at the upper-left of Figure 9.1). The platoon leaders then do the same with their platoons; each platoon leader speaks to multiple squads on a frequency dedicated to the platoon in question (see the center of Figure 9.1 speaking to squads shown in upper-right). So although it may seem a bit awkward to have someone listening to two different calls or being in a room while issuing directions over the phone or in a chat room, the concept has worked well in the military since the advent of the radio and we have employed it successfully in several companies. It is not uncommon for military pilots to listen to four different radios at one time while flying the aircraft: two tactical channels and two air traffic control channels.

##### The Role of Engineering Leads

The role of a senior engineering professional on the phone can be filled by a deeply technical manager.
Each engineering discipline or engineering team necessary to resolve the crisis should have someone capable of both managing that team and answering technical questions within the higher level crisis management team. This person is the lead individual investigator for her domain experience on the crisis management call and is responsible for helping the higher-level team vet information, clear and prioritize hypotheses, and so on. This person can also be on both the calls of the organization she represents and the crisis management call or conference, but her primary responsibility is to interact with the other senior engineers and the crisis manager to help formulate appropriate actions to end the crisis.

##### The Role of Individual Contributors

Individual contributors within the teams assigned to the crisis management call or conference communicate on separate chat and phone conferences or reside in separate conference rooms. They are responsible for generating and running down leads within their teams and work with the lead or senior engineer and their manager on the crisis management team. Here, an individual contributor isn’t just responsible for doing work assigned by the crisis management team. The individual contributor and his teams are additionally responsible for brainstorming potential problems causing the incident, communicating them, generating hypotheses, and quickly proving or disproving those hypotheses. The teams should be able to communicate with the other domains’ teams either through the crisis management team or directly. All status, however, should be communicated to the team manager who is responsible for communicating it to the crisis management team.
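
The hypothesis workflow described in this section (generate, prioritize, assign to a team, prove or disprove, report status) can be kept in a very small structure. The following is a minimal sketch under assumptions, not a tool the authors describe: the `Hypothesis` class, the `Status` values, and the `open_by_priority` helper are all hypothetical names, and the database example is invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    OPEN = "open"
    PROVEN = "proven"
    DISPROVEN = "disproven"


@dataclass
class Hypothesis:
    summary: str
    owner_team: str                 # team whose manager reports status upward
    priority: int                   # 1 = investigate first
    status: Status = Status.OPEN
    history: list = field(default_factory=list)   # (UTC timestamp, note) pairs

    def record(self, note, status=None, now=None):
        """Append a timestamped note and optionally change the status."""
        stamp = now or datetime.now(timezone.utc)
        self.history.append((stamp, note))
        if status is not None:
            self.status = status


def open_by_priority(board):
    """Hypotheses still under investigation, most urgent first."""
    return sorted((h for h in board if h.status is Status.OPEN),
                  key=lambda h: h.priority)


board = [
    Hypothesis("Login code regression", "application", priority=2),
    Hypothesis("Database connection pool exhausted", "database", priority=1),
]
board[0].record("Login code has not changed in months", Status.DISPROVEN)
for h in open_by_priority(board):
    print(h.priority, h.owner_team, h.summary)
```

Keeping the timestamped notes on each hypothesis gives the team manager a ready-made status report and leaves a record for later analysis.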

##### Communications and Control

Shared communication channels are a must for effective and rapid crisis resolution. Ideally, the teams are moved to be located near each other at the beginning of a crisis. That means that the lead crisis management team is in the same room and that each of the individual teams supporting the crisis resolution effort is located together to facilitate rapid brainstorming, hypothesis resolution, distribution of work, and status reporting. Too often, however, crises happen when people are away from work; because of this, both synchronous voice communication conferences (such as conference bridges on a phone) and asynchronous chat rooms should be employed.

The voice channel should be used to issue commands, stop harmful activity, and gain the attention of the appropriate team. It is absolutely essential that someone from each of the teams be on the crisis resolution voice channel and be capable of controlling her team. In many cases, two representatives, the manager and the senior (or lead) engineer, should be present from each team on such a call. This is the command and control channel in the absence of everyone being in the same room. All shots are called from here, and it serves as the temporary change control authority and system for the company.
The authority to do anything other than perform nondestructive “read” activities like investigating logs is first “OK’d” within this voice channel or conference room to ensure that two activities do not compete with each other and either cause system damage or result in an inability to determine what action “fixed” the system.

The chat or IRC channel is used to document all conversations and easily pass around commands to be executed so that time isn’t wasted in communication. Commands that are passed around can be cut and pasted for accuracy. Additionally, the timestamps within the IRC or chat can be used in follow-up postmortems. The crisis manager is responsible for ensuring that he is not only putting his notes in the chat room and writing his decisions in the chat room for clarification, but also for ensuring that status updates, summaries, hypotheses, and associated actions are put into the chat room.

It is absolutely essential in our minds that both the synchronous voice and asynchronous chat channels are open and available for any crisis. The asynchronous nature of chat allows activities to go on without interruption and allows individuals to monitor overall group activities between the tasks within their own assigned duties. Through this asynchronous method, scale is achieved while the voice allows for immediate command and control of different groups for immediate activities. Should everyone be in one room, there is no need for a phone call or conference call other than to facilitate experts who might not be on site and updates for the business managers.
But even with everyone in one room, a chat room should be opened and shared by all parties. In the case where a command is misunderstood, it can be buddy checked by all other crisis participants and even “cut and pasted” into the shared chat room for validation. The chat room allows actual system or application results to be shared in real time with the remainder of the group, and an immediate log with timestamps is generated when such results are cut and pasted into the chat.

#### The War Room

Phone conferences are a poor but sometimes necessary substitute for the “war room” or crisis conference room we had previously mentioned. So much more can be communicated when people are in a room together, as body language and facial expressions can actually be meaningful in a discussion. How many times have you heard someone say something, but when you read or look at the person’s face you realize he is not convinced of the validity of his statement? That isn’t to say that the person is lying, but rather that he is passing along something that he does not wholly believe. For instance, someone might say, “The team believes that the problem could be with the login code,” but she has a scowl on her face that shows that something is wrong. A phone conversation would not pick that up, but you have the presence of mind in person to say, “What’s wrong, Sue?” Sue might answer that she doesn’t believe it’s possible given that the login code hasn’t changed in months, which may lower the priority for investigation. Sue might also respond by saying, “We just changed that damn thing yesterday,” which would increase the prioritization for investigation.
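
Whether the team sits in a war room or on a conference bridge, the chat log and the change-control rule described under Communications and Control can be sketched in code. The following is a minimal illustration only: the `CrisisChannel` class, its method names, and the example commands are hypothetical and are not taken from any real chat or IRC tool. It shows timestamped entries that can later feed a postmortem, and a gate that lets “read” activities proceed while anything else must first be OK’d.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumption: only nondestructive "read" work (e.g. investigating logs) is pre-approved.
READ_ONLY_KINDS = {"read"}


@dataclass
class CrisisChannel:
    """Hypothetical shared chat log that doubles as temporary change control."""
    log: list = field(default_factory=list)
    approved: set = field(default_factory=set)

    def post(self, author, text, now=None):
        """Append a timestamped entry; the timestamps can be reused in the postmortem."""
        stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
        entry = f"{stamp} <{author}> {text}"
        self.log.append(entry)
        return entry

    def approve(self, crisis_manager, command):
        """Record that the crisis manager OK'd an exact command string."""
        self.approved.add(command)
        self.post(crisis_manager, f"APPROVED: {command}")

    def may_run(self, kind, command):
        """Reads are always allowed; any other activity needs prior approval."""
        return kind in READ_ONLY_KINDS or command in self.approved
```

Because approval is keyed on the exact command text, the command that gets executed is the one that was pasted into the chat, which mirrors the cut-and-paste habit described above.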

In the ideal case, the war room is equipped with phones, a shared desk, terminals capable of accessing systems that might be involved in the crisis, plenty of work space, projectors capable of displaying key operating metrics or any person’s terminal, and lots of whiteboard space. Although the inclusion of a whiteboard might initially appear to be at odds with the need to log everything in a chat room, it actually supports chat activities by allowing graphics, symbols, and ideas best expressed in pictures to be drawn quickly and shared. Then, such things can be reduced to words and placed in chat, or a picture of the whiteboard can be taken and sent to the chat members. Many new whiteboards even have systems capable of reducing their contents to pictures immediately. Should you have an operations center, the war room should be close to it to allow easy access from one area to the next.

You may think that creating such a war room would be a very expensive proposition. “We can’t possibly afford to dedicate space to a crisis,” you might say. Our answer is that the war room need not be expensive or dedicated to crisis situations. It simply needs to give priority to any crisis, and as such any conference room equipped with at least one and preferably two or more phone lines will do.
Individual managers can use cell phones to communicate with their teams if need be, but in this case, you should consider the inclusion of low-cost cell phone chargers within the room. There are lots of low-cost whiteboard options available, including special paint that “acts” like a whiteboard and is easily cleanable, and windows make a fine whiteboard in a pinch.

Moreover, the war room is useful for the “ride along” situation we described in Chapter 6. If you want to make a good case for why you should invest in creating a scalable organization, scalable processes, and a scalable technology platform, invite some business executives into a well-run war room to witness the work necessary to fix scale problems that result in a crisis. One word of caution here: If you can’t run a crisis well and make order out of its chaos, do not invite people into the conference. Instead, focus your time on finding a leader and manager who can run such a crisis and then invite other executives into it.
##### Tips for a Successful War Room

A good war room has the following:

* Plenty of whiteboard space
* Computers and monitors with access to the production systems and real-time data
* A projector for sharing information
* Phones for communication to teams outside the war room
* Access to IRC or chat
* Workspace for the number of people who will occupy the room

War rooms tend to get loud, and the crisis manager must maintain control within the room to ensure that communication is concise and effective. Brainstorming can and should be used, but limit communication during discussion to one individual at a time.

#### Escalations

Escalations during crisis events are critical for several reasons. The first and most obvious is that the company’s job in maximizing shareholder value is to ensure that this value isn’t destroyed in these events. As such, the CTO, CEO, and other execs need to hear quickly of issues that are likely to take significant time or have significant negative customer impact. In a public company, it’s all that much more important that the senior execs know what is going on, as shareholders demand that they know about such things, and it is possible that public facing statements will need to be made. Moreover, executives have a better chance at helping to marshal all of the resources necessary to bring a crisis to resolution, including customer communications, vendor and partner relationships, and so on.
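
One way to make “hear quickly” concrete is to key escalation to elapsed time rather than to anyone’s optimism about the fix. The sketch below is an illustration under stated assumptions: the thresholds, the `ESCALATION_POLICY` table, and the `due_escalations` helper are hypothetical, and the real values should come from a policy your senior executives have agreed to in advance.

```python
from datetime import timedelta

# Hypothetical policy: (time elapsed since the crisis started, roles to bring in).
# The minute values are placeholders, not recommendations.
ESCALATION_POLICY = [
    (timedelta(minutes=0), ["crisis manager"]),
    (timedelta(minutes=15), ["engineering managers", "CTO"]),
    (timedelta(minutes=30), ["CEO"]),
]


def due_escalations(elapsed, already_notified):
    """Return roles whose threshold has passed and who have not been notified yet."""
    due = []
    for threshold, roles in ESCALATION_POLICY:
        if elapsed >= threshold:
            for role in roles:
                if role not in already_notified and role not in due:
                    due.append(role)
    return due
```

For example, 20 minutes into a crisis with only the crisis manager notified, `due_escalations(timedelta(minutes=20), {"crisis manager"})` returns the engineering managers and the CTO. The function deliberately takes no input about how close the team believes it is to a fix.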

The natural tendency for engineering teams is to feel that they can solve the problem without outside help or help from their management teams. That may be true, but solving the problem isn’t enough; it needs to be resolved in the quickest and most cost-effective way possible. Often, that will require more than the engineering team can muster on their own, especially if third-party providers are at all to blame for some of the incident. Moreover, communication throughout the company is important, as your systems are either supporting critical portions of the company or, in the case of Web companies, they are the company. Someone needs to communicate to shareholders, partners, customers, and maybe even the press. That job is best handled by people who aren’t involved in fighting the fire.

Think through your escalation policies and get buy-in from senior executives before you have a major crisis. It is the crisis manager’s job to adhere to those escalation policies and get the right people involved at the time defined in the policies, regardless of how quickly the problem is likely to be solved after the escalation.

#### Status Communications

Status communications should happen at predefined intervals throughout the crisis and should be posted or communicated in a somewhat secure fashion such that the organizations needing information on resolution time can get the information they need to take the appropriate actions. Status is different than escalation.
Escalation is made to bring in additional help as time drags on during a crisis, and status communications are made to keep people informed. Using the RASCI framework, you escalate to Rs, As, Ss, and Cs, and you post status communication to Is.

A status should include start time, a general update of actions since the start time, and the expected resolution time if known. This resolution time is important for several reasons. Maybe you support a manufacturing center and the manufacturing manager needs to know if she should send home her hourly employees. Potentially, you provide sales or customer support software in a SaaS fashion, and those companies need to be able to figure out what to do with their sales and customer support staff.

Your crisis process should clearly define who is responsible for communicating to whom, but it is the crisis manager’s job to ensure that the timeline for communications is followed and that the appropriate communicators are properly informed. A sample status email is shown in Figure 9.2.

#### Crises Postmortems

Just as a crisis is an incident on steroids, so is a crisis postmortem a juiced-up postmortem. Treat this postmortem with extra special care. Bring in people outside of technology because you never know where you are going to get advice critical to making the whole process better. Remember, the systems that you helped create and manage have just caused a huge problem for a lot of people. This isn’t the time to get defensive; this is the time to be reborn.
This is the meeting that will fulfill or destroy the process of turning around your team, setting up the right culture, and fixing your processes.

Absolutely everything should be evaluated. The very first crisis postmortem is referred to as the “master postmortem,” and its primary task is to identify subordinate postmortems. It is not to resolve or identify all of the issues leading to the incident; it is meant to identify the areas for which subordinate postmortems should be responsible. You might have postmortems focused on technology, process, and organization failures. You might have several postmortems on technology covering different aspects, plus one on your communication process, one on your crisis management process, and one on why certain organizations didn’t contribute appropriately early on in the postmortem.

Follow the same timeline process as the postmortem described in Chapter 8, but focus on creating other postmortems and tracking them to completion. The same timeline should be used, but rather than identifying tasks and owners, you should identify subordinate postmortems and leaders associated with them. You should still assign dates as you normally would, but rather than tracking these in the morning incident meeting, you should set up a weekly recurring meeting to track progress. It is critically important that executives lead from the front and be at these weekly meetings. Again, we need to change our culture or, should we have the right culture, ensure that it is properly supported through this process.
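
The tracking described above can be sketched as a small data structure. This is a minimal illustration, assuming a hypothetical `SubordinatePostmortem` record and `weekly_review` helper; the themes, leader names, and dates in the usage note are invented for the example. It shows the master postmortem’s output (subordinate postmortems with leaders and dates) and what the weekly meeting would look at.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SubordinatePostmortem:
    """One follow-up postmortem identified by the master postmortem (hypothetical record)."""
    theme: str          # e.g. technology, process, organization
    leader: str         # the person accountable for driving it to completion
    due: date           # date assigned in the master postmortem
    done: bool = False


def weekly_review(postmortems, today):
    """Split open subordinate postmortems into overdue and on-track themes."""
    open_items = [p for p in postmortems if not p.done]
    overdue = [p.theme for p in open_items if p.due < today]
    on_track = [p.theme for p in open_items if p.due >= today]
    return overdue, on_track
```

With three subordinate postmortems, one already complete and one past its date, the weekly meeting would see the overdue theme first and could ask its leader for a new, achievable date.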
#### Crises Follow-up and Communication

Just as you had a communication plan during your crisis, so must you have a communication plan until all postmortems are complete and all problems identified and solved. Keep all members of the RASCI chart updated and allow them to update their organizations and constituents. This is a time to be completely transparent. Explain, in business terms, everything that went wrong and provide aggressive but achievable dates in your action plan to resolve all problems. Follow up with communication in your staff meeting, your boss’ staff meeting, and/or the company board meeting. Communicate with everyone else via email or whatever communication channel is appropriate for your company. For very large events where morale might be impacted, consider using a company all hands meeting followed by weekly updates via email or on a blog.

##### A Note on Customer Apologies

When you communicate to your customers, buck the recent trend of apologizing without actually apologizing and try sincerity. Actually mean that you are sorry that you disrupted their businesses, their work, and their lives! Too many companies use the passive voice, point the fingers in other directions, or otherwise misdirect customers as to true root cause. If you find yourself writing something like “Can’tScale, Inc. experienced a brief 6-hour downtime last week and we apologize for any inconvenience that this may have caused you,” stop right there and try again.
Try the first person “I” instead of “we,” drop the “may” and “brief,” try acknowledging that you messed up what your customers were planning on doing with your application, and try getting this posted immediately, not “last week.”

It is very likely that you have significantly negatively impacted your customers. Moreover, this negative customer impact is not likely to have been the fault of the customer. Acknowledge your mistakes and be clear as to what you are going to do to ensure that it does not happen again. Your customers will appreciate it, and assuming that you can make good on your promises, you are more likely to have a happy and satisfied customer.

#### Conclusion

We’ve discussed how not every incident is created equal and how some incidents require significantly more time to truly identify and solve all of the underlying problems. We call these incidents crises, and you should have a plan to handle them from inception to end. We define the end of this crisis management process as the point at which all problems identified through postmortems have been resolved.

We discussed the roles of the technology team in responding to, resolving, and handling the problem management aspects of a crisis. These roles include the problem manager/crisis manager, engineering managers, senior engineers/lead engineers, and individual contributor engineers from each of the technology organizations.

We explained the four types of communication necessary in crisis resolution and closure, including internal communications, escalations, and status reports during and after the crisis. We also discussed some handy tools for crisis resolution such as conference bridges, chat rooms, and the war room concept.

##### Key Points

* Crises are incidents on steroids and can either make your company stronger or kill your business. Crisis, if not managed aggressively, will destroy your ability to scale your customers, your organization, and your technology platform and services.
* To resolve crises as quickly and cost effectively as possible, you must contain the chaos with some measure of order.
* The leaders most effective in crises are calm on the inside but are capable of forcing and maintaining order through those crises. They must have business acumen and technical experience and be calm leaders under pressure.
* The crisis resolution team consists of the crisis manager, engineering managers, and senior engineers. In addition, teams of engineers reporting to the engineering managers are employed.
* The role of the crisis manager is to maintain order and follow the crisis resolution, escalation, and communication processes.
* The role of the engineering manager is to manage her team and provide status to the crisis resolution team.
* The role of the senior engineer from each engineering team is to help the crisis resolution team create and vet hypotheses regarding cause and help determine rapid resolution approaches.
* The role of the individual contributor engineer is to participate in his team and identify rapid resolution approaches, create and evaluate hypotheses on cause, and provide status to his manager on the crisis resolution team.
* Communication between crisis resolution team members should happen face to face in a crisis resolution or war room; or when face-to-face communication isn’t available, the team should use a conference bridge on a phone. A chat room should also be employed.
* War rooms, ideally adjacent to operations centers, should be developed to help resolve crisis situations.
* Escalations and status communications should be defined during a crisis. After a crisis, the crisis process should define status updates at periodic intervals until all root causes are identified and fixed.
* Crisis postmortems should be strict and employed to identify and manage a series of follow-ups on postmortems that thematically attack all issues identified in the master postmortem.