The Art of Scalability, Chapter 16: Determining Risk

Hence in the wise leader’s plans, considerations of advantage and disadvantage will be blended together.—Sun Tzu

In the previous 15 chapters, we have often mentioned risk management or suggested that you analyze the amount of risk, but we have not given you a detailed explanation of what we mean by these phrases and terminology. This chapter is all about how to determine the amount of risk in a feature, release, bug fix, configuration change, or other technology-related action. Managing risk is one of the most fundamentally important aspects of increasing and maintaining availability and scalability. To manage risk, you first must know how to calculate risk and determine how much risk exists in some action or lack of action.

In this chapter, we will first discuss why risk plays such a large part in scalability. This discussion will build upon all the other times so far in the book that we have mentioned risk management and its importance. After we have clearly articulated the importance of risk management, we will discuss how to measure the amount of risk and finally how to manage the overall risk in a system. Here, the use of system means not only the application, but the entire product development life cycle, technology organization, and all the processes that make these up. There are many different ways of calculating the amount of risk, and we will cover some of the best ones that we have seen, including the pros and cons of each method.

At the end of this chapter, you will have a much better grasp of risk and understand how to determine the amount of risk involved in something, as well as how to manage the overall level of risk that the business is willing to take. These are fundamental skills that need to exist in the organization at almost every level to ensure that the scalability of the system is not impaired by improper decisions and behaviors.

Importance of Risk Management to Scale

Why is the ability to manage risk so important to scalability? The answer to this question lies in the fact that business is inherently a risky endeavor. For example, there is the risk that customers will not want the products that you offer, that the tradeoffs made between speed and quality will exceed what customers tolerate, that steps skipped for cost savings will result in catastrophic failure, that the business model will never work, and so on. To be in business, at least for any amount of time, you must be able to identify and balance the risks with the rewards. It is risky to demo your newest untested product for a potential huge customer, but if it works that customer might sign up; is that a risk worth taking? The capability to balance risk and reward is essential to survive as a business, especially a startup. This balance of risk and reward is exactly what entrepreneurs do every day and what technologists in companies must do. Pushing the new release has inherent risks, but it should also have expected rewards. Knowing how to determine the amount of risk that exists allows you to solve this risk–reward equation and make the right decisions about when to take the risk in return for rewards.

If risk is an inherent part of any business, especially a hyper-growth SaaS Web 2.0 company, are the successful companies necessarily great at managing risk? The answer is probably not, but they probably either have someone who innately manages risk or they have been extremely lucky so far and will likely run out of luck at some point. There are certain people who can naturally feel and manage risk; we’ll talk more about this in the section of this chapter about ways to measure risk. These people may have developed this skill from years of working around technology and having an acute sense of when things are likely to go wrong. They also might just have an inborn ability to sense risk. It’s great if you have someone like this, but even in that case you want the rest of the organization to be able to identify risk and not have to rely on a single individual as the human risk meter. Remember that singletons, especially if that singleton is a person, do not scale well. If you are one of the lucky organizations that have been successful without any focus on or understanding of risk, you should be even more worried. You could argue that risk demonstrates a Markov property, meaning that the future states are determined by the present state and are independent of past states. We would argue that risk is cumulative to some degree, perhaps with an exponential decay but still additive. A risky event today can result in failures in the future, either because of direct correlation, such as today’s change breaking something else in the future, or via indirect methods, such as an increased risk tolerance by the organization leading to riskier behaviors in the future. Either way, actions can have near- and long-term consequences.
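One way to picture "cumulative, with an exponential decay but still additive" is to discount each past risk score by its age and sum the results. The sketch below is only an illustration of that idea: the function name, the `(day, score)` event format, and the seven-day half-life are assumptions made here, not values given in the chapter.

```python
import math


def decayed_overall_risk(events, now, half_life_days=7.0):
    """Sum past risk scores, each discounted by how long ago it occurred.

    events: iterable of (day, score) pairs, where day is a number such as
            days since some fixed start (hypothetical format).
    half_life_days: assumed tuning knob; a score counts half as much
            after this many days.
    """
    decay_rate = math.log(2) / half_life_days
    return sum(
        score * math.exp(-decay_rate * (now - day))
        for day, score in events
        if day <= now  # ignore actions that have not happened yet
    )


# Hypothetical history: three actions with risk scores 27, 9, and 50.
history = [(0, 27), (3, 9), (7, 50)]
print(round(decayed_overall_risk(history, now=7), 1))
```

A shorter half-life makes the model behave more like the Markov view, where only recent actions matter; a longer one weights history more heavily.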

Because risk management is important to scalability, we need to understand the components and steps of the risk management process. We’ll cover this in more detail in this chapter, but a high-level overview of the risk management process entails, first and foremost, determining the risk of a particular action as accurately as possible. There are many ways to go about trying to accurately determine risk, some more involved than others and some often more accurate than others. The important thing is to select the right process for your organization, which means balancing the rigor and required accuracy against what makes sense for your organization. After the amount of risk has been determined or estimated, you must actively manage the amount of risk both acutely and overall within the system. Acute risk is the amount of risk associated with a particular action, such as changing a configuration on a server. Overall risk is the amount that is cumulative within the system because of all the actions that have taken place over the previous days, weeks, or possibly even months.

Measuring Risk

The first step in being able to manage risk is the ability to determine, as accurately as necessary, what amount of risk is involved in a particular action. The reason we use the term necessary and not possible is that you may be able to more accurately determine risk, but it might not be necessary given the current state of your product or your organization. For example, a product in beta, where customers are expecting some glitches, may dictate that a sophisticated risk assessment is not necessary and that a cursory analysis is sufficient at this point. There are many different ways to analyze, assess, or estimate risk. The more of these that are in your tool belt, the more likely you will use the most appropriate one for the appropriate time and activity. We are going to cover three methods of determining risk. With each of these, we will discuss the advantages and disadvantages as well as the accuracy.

The first assessment method is the gut feel method. This is when someone, either because of his position, such as VP of operations, or because of his innate ability to feel risk, is given the job of making go/no-go decisions on actions. As we mentioned earlier, some people inherently have this ability and it is great to have someone like this in the organization. However, we would caution you on two very important concerns. First, does this person really have the ability to understand risk at a subconscious level or do you just wish he did? In other words, have you tracked this person’s accuracy? If you haven’t, you should before you consider this as anything more than a method of guessing. Second, if indeed this person has some degree of accuracy with regard to determining risk, you do not want your organization to be dependent on one person. You need multiple people in your organization to understand how to assess risk. Ideally, everyone in the organization is familiar with the significance of risk and the methodologies that exist for assessing and managing it.

As an example of the gut feel method, let us say that the VP of operations, Tom Harde, for our fictitious company AllScale is revered for his ability to make on-the-spot decisions about problems and go/no-go decisions. As far as anyone can remember, his decisions have never been questioned and have always been correct; at least that is what the team recalls. The team has just finished preparing a release to go to production for the HRM application and has asked Tom for permission to push the code this evening. This Wednesday evening between 10 PM and midnight has in the past been designated as a maintenance window; not for down time, but because of the low traffic, it is a suitable time to perform the higher risk actions. Tonight, there is a database split taking place during this window that has already been planned and approved. Tom decides, without explanation to the engineering team, that they cannot push code tonight and should expect to be allowed to push it the next night. The team accepts this decision because even though they are engineers and skeptical by nature, no one has ever questioned a go/no-go decision from Tom. Later that night, the database split goes disastrously wrong and the team is forced to work late into the morning rolling back the changes. The engineering team hears about this in the morning and is very glad for the no-go decision last night.

The advantage of the gut feel method of risk assessment is that it is very fast. A true expert who fundamentally understands the amount of risk inherent in certain tasks can make decisions in a matter of a few seconds. One disadvantage of the gut feel method is, as we discussed, that the person might not have this ability but may be fooled into thinking he does because of a few key saves. Another disadvantage is that this method is rarely replicable. People tend to develop this ability over years of working in the industry and honing their expertise; it is not something that can be taught in an hour-long class. A further disadvantage of this method is that it leaves a lot of decision making up to the whim of one person as opposed to a team or group that can question each other’s data and conclusions. The accuracy of this method is highly variable depending on the person, the action, and a host of other variables. This week a person might be very good at assessing the risk and next week strike out completely.

The second method that we are going to cover is the traffic light method. In this method, you determine the risk of an action by breaking down the action into the smallest components and assigning a risk level to them of green, yellow, or red. The smallest component could be a feature in a release or a configuration change in a list of maintenance steps; the granularity depends on several factors, including the time available and the amount of practice the team has in performing these assessments. After each component has been assigned a color of risk, there are two ways of arriving at the overall risk of the action. The first method is to assign a risk value to each color, count the number of each color, and multiply the count by the risk value. Then, sum these multiplied values and divide by the total count of items or actions. Whatever risk value this is closest to gets assigned the overall color. Figure 16.1 depicts the risk rating of three features that provides a cumulative risk of the overall release.

The assessment of risk for the individual items in the action, release, or maintenance is done by someone very familiar with the low-level component. They decide on green, yellow, or red by analyzing various factors such as the difficulty of the task, the amount of effort required for the task (the more effort, generally the higher the risk), the interaction of this component with others (the more connected or centralized this item is, the higher the risk), and so on. Table 16.1 shows some of the most common attributes and their associated risk factors that can be used by engineers or other experts to gauge the risk of a particular feature or granular item in the overall list.

Traffic Light Release Example

Mike Softe, VP of engineering at AllScale, has adopted the traffic light method of risk assessment. He has decided to assign numeric equivalents to the colors, assigning 1 to green, 3 to yellow, and 9 to red. Mike knows that he could use any arbitrary scale but he prefers this one because it causes higher risk items to dramatically stand out, which is very conservative. Mike is producing a risk assessment for an upcoming release for the HRM application. He has four items that are green, two that are yellow, and one that is red; the math for calculating the overall risk number/color is depicted in Figure 16.2.

Therefore, our total risk for the HRM release is 2.7, which is closest to 3 or yellow. We’ll discuss what Mike should do with this number or color in the next section when we talk about how to manage risk. For now, we are satisfied that Mike has performed some level of risk assessment on the action.
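The arithmetic behind Mike's result can be sketched in a few lines. This is a minimal illustration of the weighted-average calculation described above, using the 1/3/9 scale from the example; the function and variable names are made up here for illustration.

```python
# Numeric equivalents Mike assigned to each color.
RISK_VALUES = {"green": 1, "yellow": 3, "red": 9}


def traffic_light_risk(colors):
    """Return (average score, overall color) for a list of item colors."""
    # Summing the value of every item equals count-per-color times value.
    score = sum(RISK_VALUES[c] for c in colors) / len(colors)
    # The overall color is the one whose risk value is closest to the score.
    overall = min(RISK_VALUES, key=lambda c: abs(RISK_VALUES[c] - score))
    return score, overall


# The HRM release: four green items, two yellow, one red.
release = ["green"] * 4 + ["yellow"] * 2 + ["red"]
score, color = traffic_light_risk(release)
print(f"{score:.1f} -> {color}")  # (4*1 + 2*3 + 1*9) / 7 = 2.7 -> yellow
```

Note how a single red item pulls the average of a mostly green release up to yellow, which is the conservative behavior Mike wanted from the 1/3/9 scale.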

One large advantage of the traffic light method is that it begins to become methodical, which implies that it is repeatable, able to be documented, and able to be trained. Many people can conduct the risk assessment so you are no longer dependent on a single individual. Again, because many people can perform the assessment, there can be discussion about the decisions that people arrive at, and as a group they can decide whether someone’s argument has merit. One disadvantage of this method is that it takes more time than the gut feel method and it is an extra step in the process. Another disadvantage is that it relies on each expert to choose which attributes she will use to assess the risk of individual components. Because of this possible variance among the experts, the accuracy of this risk assessment is mediocre. If the experts are very knowledgeable and have a clear understanding of what constitutes risky attributes for their particular area, this method can be fairly accurate. If they do not have a clear understanding about what attributes are important to look at when performing the assessment, the risk level may be off quite a bit. We will see in the next risk assessment methodology how this potential variance is fixed, allowing the assessments to be more accurate.

The third method of assessing the amount of risk in a particular action is known as the Failure Mode and Effects Analysis (FMEA). This methodology was originally developed for use by the military in the late 1940s.1 Since then, it has been used in a multitude of industries including automotive, manufacturing, aerospace, and software development. The method of performing the assessment is similar to the traffic light method in that components are broken up into the smallest parts that can be assessed for risk; for a release, this could be features, tasks, or modules. Each of these components is then identified with one or more possible failure modes. Each failure mode has an effect that describes the impact if this particular failure occurred.

For example, a signup feature may fail by not storing the new user’s information properly in the database or by assigning the wrong set of privileges to the new user or a variety of other failure scenarios. The effect would be the user not being registered or having the ability to see data she was not authorized to see. These failure scenarios are scored on three factors: likelihood of failure, severity of that failure, and the ability to detect if that failure occurs. Again, we choose to use a scoring scale of 1, 3, and 9 because it allows us to be very conservative and differentiate items with high risk factors well above those with medium or low risks. The likelihood of failure is essentially the probability of this particular failure scenario coming true. The severity of the failure is the total impact to the customer and the business if this occurs. This can be in monetary terms or in reputation (good will) or any other business-related measurement. The ability to detect the failure rates whether you will be likely to notice the failure if it occurs. As you can imagine, a very likely failure that has disastrous consequences and is practically undetectable is the worst possible combination of all three.

After the individual failure modes and effects have been scored, the scores are multiplied to provide a Total Risk Score that is equal to the Likelihood Score × Severity Score × Ability to Detect Score. This score shows the overall risk that a particular component has within the overall action. The next step in the FMEA process is to determine mitigation steps that you can perform or put in place that will lower the risk of a particular factor. For instance, if a component of a feature had a very high ability to detect score, meaning that it would be hard to notice if the event occurred, the team might decide ahead of time to write some queries to check the database every hour post-release for signs of this failure, such as missing data or wrong data. This mitigation step has a lowering effect on this risk factor of the component, and the assessment should then indicate what the risk was lowered to.

In Table 16.2, there are two features that the AllScale team is planning on releasing as part of its HRM application. One is a new signup flow for its customers and the other is changing to a new credit card processor. Each of the features has several failure modes identified. Walking through one as an example, let’s look at the Credit Card Payment feature and focus on the Credit Card billed incorrectly failure mode with the effect of either a payment too large or too small being charged to the card. The engineering expert, Sam Codur, has ranked this as very unlikely to occur, probably because Mike Softe, VP of engineering at AllScale, has ensured that this feature received extensive code review and quality assurance testing due to the fact that it was dealing with credit cards. Sam gave the failure mode a 1 for likelihood. Sam also scored this failure mode as having disastrous severity, giving it a 9. This seems reasonable because a wrongly billed credit card would result in customers being very upset, charge backs, which cost money, and probably refunds, which cost more money. Should this failure occur, Sam feels that it will be somewhat hard to detect but not impossible, so he gave it a score of 3. The Total Risk Score for this failure mode is 27, arrived at by multiplying 1 × 3 × 9. Sam also identified the fact that if this new payment processor were rolled out in beta for a limited customer set, the severity would be much lower because only a few select customers would be impacted and if anything went wrong the overall monetary and publicity amounts would be limited. If this remediation action is taken, the severity would be lowered to a 3 and the Revised Risk Score would be only a 9, much better than before.

1. Procedure for performing a failure mode effect and criticality analysis. November 9, 1949. United States Military Procedure, MIL-P-1629.
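Sam's scoring of the billing failure mode can be reproduced with a small sketch. The class and field names below are hypothetical and chosen only for illustration; the scores (1, 9, 3, and the revised severity of 3) come from the example above.

```python
from dataclasses import dataclass, replace

SCALE = (1, 3, 9)  # the conservative scoring scale used in this chapter


@dataclass(frozen=True)
class FailureMode:
    name: str
    likelihood: int     # 9 = very likely to occur
    severity: int       # 9 = disastrous impact
    detectability: int  # 9 = practically undetectable

    def __post_init__(self):
        for value in (self.likelihood, self.severity, self.detectability):
            if value not in SCALE:
                raise ValueError(f"score {value} is not on the scale {SCALE}")

    @property
    def total_risk(self):
        # Total Risk Score = Likelihood x Severity x Ability to Detect
        return self.likelihood * self.severity * self.detectability


billed_wrong = FailureMode(
    "Credit card billed incorrectly", likelihood=1, severity=9, detectability=3
)
print(billed_wrong.total_risk)  # 27

# Remediation: a limited beta rollout lowers severity from 9 to 3.
after_beta = replace(billed_wrong, severity=3)
print(after_beta.total_risk)  # 9
```

Keeping the original and revised failure modes as separate values makes it easy to record both the Total Risk Score and the Revised Risk Score side by side, as Table 16.2 does.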

The advantage of the FMEA as a risk assessment process is that it is very methodical, which allows it to be documented, trained, evaluated, and modified. Another advantage is the accuracy. Especially over time, as your team becomes better at identifying failure scenarios and accurately assessing the risk, this will become the most accurate way for you to determine risk. The disadvantage of the FMEA method is that it takes time and thought. The more time and effort put into it, the better and more accurate the results. This method is very similar to test-driven development. Failure modes can often be determined up front from the specification, and the more you identify, the better understanding you will have of the feature and how it should be designed to minimize the risk of these failures.

As we will discuss in the next section, these scores, especially ones from an FMEA, can be used to manage the amount of risk in a system across any time interval or in any one release/action. The next step in the risk assessment is to have someone or a team of people review the assessment for accuracy and to question any decision. This is the great part about using a methodical approach such as the FMEA: Everyone can be trained on it and thus can police each other to ensure the highest quality assessment is performed. The last step in the assessment process is to revisit the assessment after the action has taken place to see how accurate you and the experts were in determining the right failure modes and in assessing their factors. If a problem arose that was not identified as possible, have that expert review the situation in detail and provide a reason this was not identified ahead of time and a warning to other experts to watch out for this type of failure.

Risk Assessment Steps

If you are planning on using any methodical approach to risk assessment, these are the steps for a proper risk assessment. These steps are appropriate for the traffic light method or the FMEA method that were discussed:

1. Determine the proper level of granularity to assess the risk.

2. Choose a method that you can reproduce.

3. Train the individuals who will be performing the risk assessment.

4. Have someone review each assessment, or have a team review the entire assessment.

5. Choose an appropriate scoring scale (1, 3, 9) that takes into account how conservative you need to be.

6. Review the risk assessments after the action, release, or maintenance has occurred to determine how good the risk assessment was at identifying the types of failures as well as how likely, severe, and detectable they were.

Whether you are using the traffic light method, the FMEA, or another risk assessment methodology, be sure to follow these steps to ensure a successful risk assessment that can be used in the overall management of risk.

Managing Risk

As we discussed earlier in this chapter, we fundamentally believe that risk is cumulative. As you take more risky actions or pile on risky changes, there will come a point where the risk is realized and there will be problems in the system. In our practice at AKF Partners, we teach our clients to manage both acute and overall risk in a system. The acute risk is how much risk exists from a single change or combination of changes in a release. The overall level of risk comes from the accumulation of risk over hours, days, or weeks of performing risky actions on the system. Either type of risk, acute or overall, can result in a crisis scenario in the system. We will discuss how to manage both these types of risk to ensure you are making good decisions about what should and what should not be allowed to change within your system at any given point in time.

Acute risk is managed by monitoring the risk assessments performed on proposed changes to the system such as releases. You may want to establish ahead of time some limits to the amount of risk that any one concurrent action can have or that you are willing to allow at a particular time of day or customer volume. For instance, you may decide that any single action that contains a risk above 50 points, as calculated through the FMEA methodology, must be remediated below this amount or split into two separate actions. Or, you may want only actions below 25 points taking place on the system before midnight; everything higher must occur after midnight. Even though this is a discussion about the acute risk of a single action, this too is cumulative in that the more risky items contained in an action, the higher the likelihood of a problem and the more difficult the detection or determination of the cause because so many things changed.

As a thought experiment, imagine a release with one feature that has two failure modes identified compared to a release with 50 features, each with two or more failure modes. First, it is far more likely for a problem to occur because of the number of opportunities. As an analogy, consider flipping 50 pennies at the same time. While each coin has an independent probability of landing on heads, you are far more likely to see at least one head among 50 flips than in a single flip. Second, with 50 features, the likelihood of changes affecting each other or touching the same component, class, or method in an unexpected way is higher. Therefore, both from a cumulative opportunity as well as from a cumulative probability of negative interactions, there is an increased likelihood of a problem occurring. If a problem arises after these releases, it is also a lot easier to determine the cause of the problem when the release contains one feature than when it contains 50, assuming that all the features are somewhat proportional in complexity and size.
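The coin analogy can be made concrete: if each of n independent items fails with probability p, the chance that at least one fails is 1 - (1 - p)^n. The sketch below assumes independence, which the paragraph above notes is optimistic because features also interact, and the 2% per-feature failure probability is a made-up figure for illustration only.

```python
def prob_at_least_one(p, n):
    """Probability that at least one of n independent events occurs,
    when each occurs with probability p."""
    return 1 - (1 - p) ** n


# 50 pennies: at least one head is a near certainty.
print(prob_at_least_one(0.5, 50))

# Hypothetical 2% chance of a problem per feature:
print(f"{prob_at_least_one(0.02, 1):.0%}")   # one feature
print(f"{prob_at_least_one(0.02, 50):.0%}")  # fifty features
```

Under these assumptions, a release of 50 features is more likely than not to contain at least one problem even though each feature alone looks safe.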

For managing acute risk, we recommend that you create a chart such as the one in Table 16.3 that outlines all the rules and associated risk levels that are acceptable. This way, it is clear cut. You should also decide on an exceptions policy, such as requiring that anything outside of these rules be approved by the VP of engineering and the VP of operations, or by the CTO alone.

为了管理急性风险，我们建议您确定一个图表，例如表 16.3 中的图表，其中概述了所有规则和可接受的相关风险级别。这样一来，就一目了然了。您还应该决定例外政策，例如这些规则之外的任何内容都必须由工程副总裁和运营副总裁或首席技术官单独批准。

For managing the overall risk amount, there are two factors that can cause issues. The first is the cumulative amount of changes that have taken place in the system and the corresponding increase in the amount of risk associated with each of these changes. Just as we discussed in the earlier section on acute risks, combinations of actions can have unwanted interactions as well. The more releases or database splits or configuration changes that are made, the more likely one will cause a problem or the interaction of them will cause a problem. If the development team has been working in a development environment with a single database and two days before the release the database is split into a master and read host, it's pretty likely that the next release is going to have a problem unless there has been a ton of coordination and remediation work done.

Table 16.3 Acute Risk Management Rules

Rules                    Risk Level
New feature release      < 150 pts
Bug fix release          < 50 pts
6 AM – 10 PM             < 25 pts
10 PM – 6 AM             < 200 pts
Maintenance patches      < 25 pts
Configuration changes    < 50 pts

对于管理总体风险量，有两个因素可能会导致问题。第一个是系统中发生的累积变化量，以及与这些变化相关的风险量的相应增加。正如我们在前面关于急性风险的部分中讨论的那样，操作的组合也可能会产生不需要的交互。发布、数据库拆分或配置更改做得越多，其中某一项或它们之间的交互导致问题的可能性就越大。如果开发团队一直在只有单个数据库的开发环境中工作，而在发布前两天数据库被拆分为主库和只读库，那么下一个版本很可能会出现问题，除非已经完成了大量的协调和补救工作。
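The rules in Table 16.3 lend themselves to an automated check before an action is scheduled. The sketch below is one possible encoding, not the authors' implementation: the function and key names are hypothetical, and it assumes that an action must satisfy both its action-type rule and the time-of-day rule, which the table itself does not spell out.

```python
from datetime import time

# Limits from Table 16.3, in FMEA points. An action must score below its limit.
ACTION_LIMITS = {
    "new_feature_release": 150,
    "bug_fix_release": 50,
    "maintenance_patch": 25,
    "configuration_change": 50,
}
DAYTIME_LIMIT = 25     # 6 AM - 10 PM
OVERNIGHT_LIMIT = 200  # 10 PM - 6 AM


def time_window_limit(start: time) -> int:
    """Return the time-of-day limit for an action starting at `start`."""
    if time(6, 0) <= start < time(22, 0):
        return DAYTIME_LIMIT
    return OVERNIGHT_LIMIT


def is_allowed(action_type: str, risk_points: int, start: time) -> bool:
    """Assumption: both the action-type rule and the time-of-day rule
    apply, so the stricter (lower) of the two limits wins."""
    limit = min(ACTION_LIMITS[action_type], time_window_limit(start))
    return risk_points < limit


# A 40-point bug fix is blocked mid-afternoon but allowed late at night.
print(is_allowed("bug_fix_release", 40, time(14, 0)))  # False
print(is_allowed("bug_fix_release", 40, time(23, 0)))  # True
```

Anything the check rejects would go through whatever exceptions policy the organization has agreed on, rather than being silently overridden.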

The second factor that should be considered in the overall risk analysis is the human factor. As people perform riskier and riskier activities, their level of risk tolerance goes up. This human conditioning can work very well for us when we need to adapt to a new environment, but when it comes to controlling risk in a system, it can lead us astray. If a sabre-toothed tiger has moved into the neighborhood and you still have to leave your cave each day to hunt, the ability to adapt to the new risk in your life is critical to your survival. Otherwise, you might stay in your cave all day and starve. Pushing more and more changes to your production environment because you haven't been burnt yet and you feel somewhat invincible is a good way to cause serious issues.

总体风险分析中应考虑的第二个因素是人的因素。随着人们进行越来越危险的活动，他们的风险承受能力也会不断提高。当我们需要适应新环境时，这种人类调节对我们来说非常有效，但当涉及到控制系统风险时，这可能会让我们误入歧途。如果一只剑齿虎搬进了附近，而你仍然必须每天离开洞穴去捕猎，那么适应生活中新风险的能力对你的生存至关重要。否则，你可能会整天呆在山洞里挨饿。因为你还没有被烧死并且你感觉有点无敌，所以对你的生产环境进行越来越多的改变是导致严重问题的好方法。

We recommend that to manage the overall amount of risk in a system, you adopt a set of rules such as those in Table 16.4, which lays out the amount of risk, as determined by FMEA, for specific time periods. If you are using a different methodology than FMEA, you need to adjust the risk level column to some scale that makes sense; for example, instead of < 150 pts you could use < 5 green or 3 yellow actions. As with the acute risk management process, you will need to account for objections and overrides. You should plan ahead and have an escalation process established. One idea is that a director can grant an extra 50 points to any risk level, a VP can grant 100 points, and the CTO can grant 250 points, but these grants are not cumulative. However you decide to set this up, what matters most is that it makes sense for your organization and that it is documented and adhered to strictly.

我们建议，为了管理系统中的总体风险量，您采用表 16.4 中的一组规则，其中列出了 FMEA 确定的特定时间段的风险量。如果您使用与 FMEA 不同的方法，则需要以某种有意义的规模调整风险级别列，例如您可以使用 < 5 个绿色或 3 个黄色操作，而不是 < 150 点。与紧急风险管理流程一样，您需要考虑反对和推翻。您应该提前计划并建立升级流程。一个想法是，董事可以为任何风险级别额外授予 50 分，副总裁可以授予 100 分，首席技术官可以授予 250 分，但不能累积。无论您决定以何种方式进行设置，最重要的是它对您的组织有意义，并且记录在案并严格遵守。
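The escalation idea above can be sketched as a small budget check. The override amounts come from the text; everything else is an assumption made for illustration: the function names, the 150-point base limit in the usage lines, and the choice to treat the limit as strict, mirroring the "<" used in Table 16.3.

```python
from typing import Iterable

# Extra points each role may grant, taken from the escalation idea in the text.
OVERRIDE_POINTS = {"director": 50, "vp": 100, "cto": 250}


def effective_limit(base_limit: int, approvers: Iterable[str] = ()) -> int:
    """Grants are not cumulative: only the largest single grant applies."""
    extra = max((OVERRIDE_POINTS[role] for role in approvers), default=0)
    return base_limit + extra


def within_budget(
    period_scores: Iterable[int],
    base_limit: int,
    approvers: Iterable[str] = (),
) -> bool:
    """Check whether the summed risk of all actions in a time period
    stays below the limit for that period (hypothetical base limit)."""
    return sum(period_scores) < effective_limit(base_limit, approvers)


# Three actions totalling 180 points against a hypothetical 150-point limit.
scores = [60, 50, 70]
print(within_budget(scores, 150))                # False
print(within_budget(scores, 150, ["director"]))  # True: limit becomes 200
print(effective_limit(150, ["director", "vp"]))  # 250, not 300
```

Taking the maximum rather than the sum of the grants is what keeps a director and a VP together from exceeding what the VP alone could approve.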

Conclusion 结论

In this chapter, we have focused on risk. Our discussion started with the purpose of risk management and how it relates to scalability. We concluded that risk is prevalent in all businesses, especially startups. To be successful, you have to take risks in the business world. In the Web 2.0 and SaaS world, scalability is part of this risk/reward structure. You must take risks in terms of your system's scalability or else you will overbuild your system and not deliver products that will make the business successful. By actively managing your risk, you will increase the availability and scalability of your system.

在本章中，我们重点关注风险。我们的讨论从风险管理的目的及其与可扩展性的关系开始。我们的结论是，风险在所有企业中都普遍存在，尤其是初创企业。为了获得成功，你必须在商业世界中承担风险。在 Web 2.0 和 SaaS 世界中，可扩展性是这种风险/回报结构的一部分。您必须在系统的可扩展性方面承担风险，否则您将过度构建系统，而无法交付使业务成功的产品。通过积极管理风险，您将提高系统的可用性和可扩展性。

Our next discussion in this chapter focused on how to assess risk. Although there are many different approaches used for this, we offered three. The first was the gut feeling method. We noted that some people are naturally gifted at it, but many others are credited with the ability when they actually lack it and are simply mislabeled.

本章的下一个讨论重点是如何评估风险。尽管有许多不同的方法用于此目的，但我们提供了三种不同的方法。第一个是直觉，我们放弃了这种感觉，即有些人天生就有天赋，但许多其他人被认为是有能力的，但实际上缺乏能力，而且只是被贴错了标签。

The second method was the traffic light, which assessed components as low risk (green), medium risk (yellow), or high risk (red). The combination of all components in an action, release, change, or maintenance was the overall risk level. We provided some examples of how this overall number could be calculated.

第二种方法是红绿灯，它将组件评估为低风险（绿色）、中风险（黄色）或高风险（红色）。操作、发布、变更或维护中所有组件的组合就是总体风险级别。我们提供了一些示例来说明如何计算这个总数。

The third approach, and the one we recommend, is the Failure Mode and Effect Analysis methodology. In this method, experts are asked to assess the risk of components by identifying the failure modes that are possible with each component or feature and the effect that each failure would cause. An example given was a credit card payment feature that could fail by charging a wrong amount to the credit card, the effect being a charge to the customer that was too large or too small. These failure modes and effects were scored by their likelihood of occurrence, the severity if they were to occur, and the ability to detect the failure if it did occur. These scores were multiplied for a total risk score. The experts would then recommend remediation steps that would reduce one or more of the factors and thus reduce the overall risk score. After the risk assessment was completed, the management of risk needed to begin.

第三种也是我们推荐的方法是故障模式和影响分析方法。在这种方法中，要求专家通过识别每个组件或功能可能出现的故障模式以及该故障可能造成的潜在影响来评估组件的风险。给出的一个例子是信用卡支付功能，该功能可能会因向信用卡收取错误金额而失败，其结果是对客户来说收费太大或太小。这些故障模式和影响是根据其发生的可能性、发生的严重程度以及检测是否发生的能力进行评分的。这些乘以总风险评分。然后，专家会建议采取补救措施，以降低一种或多种因素的风险，从而降低总体风险评分。风险评估完成后，需要开始风险管理。
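The scoring step summarized above is a simple product of three factors, which a short Python sketch can illustrate. The class and field names are hypothetical, the numeric scores are invented for illustration, and two things are assumptions rather than statements from the text: that a higher detectability score means the failure is harder to detect, and that an action's total is the sum of its failure-mode scores.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FailureMode:
    description: str
    likelihood: int     # how likely the failure is to occur
    severity: int       # how bad the effect is if it occurs
    detectability: int  # assumption: higher means harder to detect

    @property
    def risk_score(self) -> int:
        """Risk score is the product of the three factors."""
        return self.likelihood * self.severity * self.detectability


def total_risk(modes: Iterable[FailureMode]) -> int:
    """Assumption: an action's risk is the sum of its failure-mode scores."""
    return sum(mode.risk_score for mode in modes)


# Credit card example from the text, with made-up scores.
wrong_amount = FailureMode("charges wrong amount", 3, 9, 3)
print(wrong_amount.risk_score)  # 81

# A hypothetical remediation that makes the failure easier to detect
# lowers one factor and therefore the overall score.
remediated = FailureMode("charges wrong amount", 3, 9, 1)
print(remediated.risk_score)  # 27
```

Because the factors are multiplied, lowering any single one of them reduces the score proportionally, which is why remediation can target whichever factor is cheapest to improve.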

We broke this up into the management of acute risk and the management of overall risk. Acute risk dealt with single actions, releases, maintenances, and so on, whereas overall risk dealt with all changes over periods of time such as hours, days, or weeks. For both acute and overall risk, we recommended the adoption of rules that specify predetermined amounts of risk that will be tolerated for each action or time period. Additionally, in preparation for objections, we recommended that an escalation path be established ahead of time so that the first crisis does not create its own path without thought and proper input from all parties.

我们将其分解为急性风险管理和总体风险管理。急性风险涉及单个操作、发布、维护等，而总体风险涉及一段时间内（例如几小时、几天或几周）的所有变化。无论是急性的还是总体的，我们建议采用规则，规定每个行动或时间段可以容忍的预先确定的风险量。此外，为了准备反对意见，我们建议提前建立一条升级路径，以便在没有各方深思熟虑和适当投入的情况下，第一次危机不会自行发展。

As with most processes, the most important aspect of both risk assessment and risk management is the fit within your organization at this particular time. As your organization grows and matures, there may be a need to modify or augment these processes. For risk management to be effective, it must be used, and in order for it to be used, it needs to be a good fit for your team.

与大多数流程一样，风险评估和风险管理最重要的方面是在特定时间适合您的组织。随着您的组织的成长和成熟，可能需要修改或增强这些流程。为了使风险管理有效，必须使用它，并且为了使用它，它需要非常适合您的团队。

Key Points 关键点

Business is inherently risky; the changes that we make to improve the scalability of our systems can be risky as well.
商业本质上是有风险的；我们为提高系统的可扩展性而做出的改变也可能存在风险。

Managing the amount of risk in a system is key to availability and to ensuring the system can scale.
管理系统中的风险量是可用性和确保系统可扩展的关键。

Risk is cumulative with some degree of degradation over time.
随着时间的推移，风险会累积并出现一定程度的退化。

For best results, use a method of risk assessment that is repeatable and measurable.
为了获得最佳结果，请使用可重复且可测量的风险评估方法。

Risk assessments, like other processes, can be improved over time.
与其他流程一样，风险评估可以随着时间的推移而得到改进。

There are advantages and disadvantages to various risk assessment approaches.
各种风险评估方法各有优点和缺点。

There is a great deal of difference in the accuracy of various risk assessment approaches.
各种风险评估方法的准确性存在很大差异。

Risk management can be viewed as both acute and overall.
风险管理可以被视为既尖锐又全面。

Acute risk management deals with single instances of change such as a release or a maintenance procedure.
急性风险管理处理单个变更实例，例如发布或维护过程。

Overall risk management is about watching and administering the total level of risk in the system at any point in time.
总体风险管理是指在任何时间点监视和管理系统中的总体风险水平。

For the risk management process to be effective, it must be used and followed.
为了使风险管理流程有效，必须使用并遵循它。

The best way to ensure a process is adhered to is to make sure it is a good fit for the organization.
确保遵守流程的最佳方法是确保它适合组织。
