### Chapter 10 Controlling Change in Production Environments

> If you know neither the enemy nor yourself, you will succumb in every battle.
>
> Sun Tzu

In engineering and chemistry circles, the word stability means resistance to deterioration, or constancy in makeup and composition. Something is “highly unstable” if its composition changes regardless of the actual rate of activity within the system, and it is “stable” if its composition remains constant and it does not disintegrate or deteriorate. In the hosted services world, and with enterprise systems, one way to create a stable service is simply to not allow activity on it and to limit the number of changes made to the system. Change, in the previous sentence, refers to the activities that an engineering team might perform on a system, such as modifying configuration files or updating a revision of code on the system. Unfortunately for many of us, eliminating changes within a system, while potentially accomplishing stability, will limit the ability of our business to grow. Therefore, we must allow and enable changes with the intent of limiting impact and managing risk, thereby creating a stable platform or service.

If unmanaged, a high rate of change will cause you significant problems and will result in the more modern definition of instability within software: something that does not work or is not consistently reliable. The service will deteriorate or disintegrate (that is, become unavailable) with unmanaged and undocumented change.
A high rate of change, if not managed, will cause the events of Chapter 8, Managing Incidents and Problems, and Chapter 9, Managing Crisis and Escalations, to happen as a result of your actions. And, as we discussed in Chapters 8 and 9, incidents and crises run counter to your scalability objectives. It follows that you must manage change to ensure that you have a scalable service and happy customers.

In our experience, one of the greatest consumers of scalability is change, especially when a change includes the implementation of new functionality. An implementation that supports two times the current user demand on Tuesday may barely handle all the user requests after a release that includes a series of new features is made on Wednesday. Some of the impact may be a result of poorly tuned queries or bugs, and some may just be a result of unexpected user demand after the release of the new functionality. Whatever the reason, you’ve now put yourself in a very desperate situation for which there may be no easy and immediate solution.

Similarly, infrastructure changes can have a significant negative impact on your ability to handle user demand, and this presents yet another scalability concern. Perhaps you implement a new tier of firewalls and as a result all customer transactions take an additional 10 milliseconds to complete.
Maybe that doesn’t sound like a lot to you, but if the departure rate of the requests now taking an additional 10 milliseconds to complete is significantly less than the arrival rate of those requests, you are going to have an increasingly slow system that may eventually fail altogether. If the terms departure rate and arrival rate are confusing to you, think of departure rate as the rate (requests over time) at which your system completes end-user requests, and arrival rate as the rate (requests over time) at which new requests arrive. A reduction in departure rate resulting from an increase in processing time might then mean that you have fewer requests completing within a given timeframe than you have arriving. Such a situation will cause a backlog of requests, and should such a backlog continue to grow over time, your systems might appear to end users to stop responding to new requests.

If your scalability goals include both increasing your availability and increasing the percentage of time that you adhere to internally or externally published service levels for critical functions, having processes that help you manage the effect of your changes is critical to your success. The absence of any process to help manage the risk associated with change is a surefire way to cause both you and your customers a great deal of heartache. Thinking back to our “shareholder” test, can you really see yourself walking up to one of your largest shareholders and saying, “We will never log our changes or attempt to manage them as it is a complete waste of time”?
Chances are you wouldn’t make such a statement, and if you wouldn’t, then you agree that monitoring and managing change is important to your success.

#### What Is a Change?

Sometimes, we define a change as any action that has the possibility of breaking something. In our experience, there are two problems with this definition. The first is that it is too “subjective” and allows too many actions to be excluded, giving people the luxury of saying that “this action wouldn’t possibly cause a problem.” The second is that it is sometimes too inclusive, as it is pretty simple to make the case that all customer transactions could cause a problem if they encounter a bug. This latter point is often cited as a reason not to log changes. The argument is that there are too many activities that induce “change” and therefore it simply isn’t worth trying to capture them all.

We are going to assume that you understand that all businesses have some amount of risk. By virtue of being in business, you have already accepted that you are willing to take the risk of allowing customers to interact with your systems for the purpose of generating revenue. In the case of back office IT systems, we are going to assume that you are willing to take the risk of stakeholder interactions in order to reduce cost within your company or increase employee productivity.

Although you wish to manage the risk of customer or stakeholder interactions causing incidents, we assume that you manage that risk through appropriate testing, inspections, and audits. Further, we are going to assume that you want to manage the risk of interacting with your system, platform, or product in a fashion for which it is not designed. In our experience, such interactions are more likely to cause incidents than the “planned” interactions that your system is designed to handle. The intent of managing such interactions, then, is to reduce the number and duration of incidents associated with them. We will call this last set of interactions “changes.” A change, then, is any action you take to modify the system or data outside the normal customer or stakeholder interactions provided by that system.

Changes include modifications in configuration, such as modifying values used during startup or run time of your operating systems, databases, proprietary applications, firewalls, network devices, and so on. Changes also include any modifications to code, additions of hardware, removal of hardware, connection of network cables to network devices, and powering systems on and off. As a general rule, any time any one of your employees needs to touch, twiddle, prod, or poke any piece of hardware, software, or firmware, it is a change.

##### What If I Have a Small Company?

Every company needs to have some level of process around managing and documenting change.
Even a company of a single individual likely has a process for identifying what has changed, even if only as a result of that one individual having a great memory and being able to instinctively understand the relationships among the systems she has created in order to manage her risk of changes.

The real question here is how much process you need and how much needs to be documented. The answer is the same as with any process: You should implement exactly enough to maximize the benefit of the process. This in turn means that the process should return more to you in benefit than you spend in time to document and adhere to it.

A small company with few employees and few services or systems interactions might get away with only change identification. A large company with a completely segmented services-oriented architecture and a moderate level of change might also need only change identification, or maybe it implements a very lightweight change management process. A large company with a complex system with several dependencies and interactions in a hosted SaaS environment likely needs complex change identification and change management.

#### Change Identification

The very first thing you should do to limit the impact of changes is to ensure that each and every change that goes into your production environment gets logged with

* Exact time and date of the change
* System undergoing change
* Actual change
* Expected results of the change
* Contact information of person making the change

An example of the minimum necessary information for a change log is included in Table 10.1.

![](https://blog.baidu-google.com/usr/uploads/2024/06/600201167.png)

To understand why you should include all of the information from these five bullets, let’s examine an event at AllScale. The HRM system login functionality starts to fail and all attempted logins result in a “website not found” error. AllScale defines a crisis as a failure rate above 10% for any critical component (login is considered to be critical). The crisis manager is paged, and she starts to assemble the crisis management team with the composition that we discussed in Chapter 9. When everyone is assembled in a room or on a telephonic conference bridge, what do you think should be the first question out of the crisis manager’s mouth?

We often get answers to this question ranging from “What is going on right now?” to “How many customers are impacted?” and “What are the customers experiencing?” All of these are good questions and absolutely should be asked, but they are not the question most likely to reduce the time and amount of impact of your current incident. The question you should ask first is “What most recently changed?” In our experience, changes cause more incidents in production environments than any other reason. It is possible that you have an unusual environment where some piece of faulty equipment fails daily, but after that type of incident is fixed, you are most likely to find that your own interaction with your system causes more customer impact issues than any other situation.

Asking “What most recently changed?” gets people thinking about what they did that might have caused the problem at hand. It gets your team focused on attempting to quickly undo anything that is correlated in time to the beginning of the incident. In our experience, it is the best opening question for any discussion around any ongoing incident, from a small customer impact to a crisis. It is a question focused on restoration of service rather than problem resolution.
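
The five logged fields and the “What most recently changed?” question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ChangeLogEntry` and `most_recent_changes` names, the sample entries, and the 24-hour lookback window are all hypothetical and do not come from the chapter or from Table 10.1.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeLogEntry:
    """One logged change, holding the five fields listed in this section."""
    changed_at: datetime   # exact time and date of the change
    system: str            # system undergoing change
    description: str       # the actual change
    expected_result: str   # expected results of the change
    contact: str           # contact information of the person making the change


def most_recent_changes(log, incident_start, lookback=timedelta(hours=24)):
    """Return changes made within `lookback` before the incident, newest first.

    The 24-hour default is an arbitrary illustrative window, not a rule.
    """
    window_start = incident_start - lookback
    candidates = [e for e in log if window_start <= e.changed_at <= incident_start]
    return sorted(candidates, key=lambda e: e.changed_at, reverse=True)


# Hypothetical log entries, invented for this sketch.
log = [
    ChangeLogEntry(datetime(2024, 6, 3, 9, 0), "search01", "Rebuilt index",
                   "Faster queries", "ops@example.com"),
    ChangeLogEntry(datetime(2024, 6, 4, 8, 0), "fw-edge", "Added firewall rule",
                   "Blocks unused port", "net@example.com"),
    ChangeLogEntry(datetime(2024, 6, 4, 13, 45), "login-web", "Updated virtual host config",
                   "New hostname served", "dev@example.com"),
    ChangeLogEntry(datetime(2024, 6, 4, 16, 0), "report-db", "Added index",
                   "Faster reports", "dba@example.com"),
]

incident_start = datetime(2024, 6, 4, 14, 0)
suspects = most_recent_changes(log, incident_start)
for entry in suspects:
    print(entry.changed_at.isoformat(), entry.system, entry.description, entry.contact)
```

Sorting newest first puts the change closest in time to the start of the incident at the top, which is the one the crisis manager would most likely try to undo first.
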

One of the most humorous answers we encounter time and again after asking “What most recently changed?” goes like this: “We just changed the configuration of the (insert system or software name here) but that can’t possibly be the cause of this problem!” Collectively, we’ve heard this phrase hundreds if not thousands of times in our careers, and we can almost guarantee that if you ever hear it you will know exactly what the problem is. Stop right there! Cease all work! Focus on the action identified in the (insert system or software name here) portion of the answer and “undo” the change! In our experience, the person might as well have said “I caused this. Sorry!” We’re not sure why there is such a high correlation between “that can’t possibly be the cause of this problem” and the actual cause of the problem, but it probably has something to do with our subconscious knowing that it is the cause of the problem while our conscious mind hopes that it isn’t. Okay, back to more serious matters.

Unless you are a very small company, it is not likely that when you ask “What most recently changed?” you will have everyone who performed a change on the phone or in the room with you. And even if you are a small company of, say, three engineers, it is entirely possible that you’d be asking the question of yourself in the middle of the night while your partners are sound asleep. As such, you really need a place to easily collect the information identified earlier.
The system that stores this information does not need to be an expensive, third-party change management and logging tool. It can easily be a shared email folder, with all changes identified in the subject line and sent to the folder at the time of the actual change by the person making the change. Larger companies probably need more functionality, including a way to query the system by the subsystem being affected, type of change, and so on. But all companies need a place to log changes in order to quickly recover from those that have an adverse customer or stakeholder impact.

#### Change Management

Change identification is a component of a much larger and more complex process called change management. The intent of change identification is to limit the impact of any change by being able to determine its correlation in time to the start of an event and thereby its probability of causing that event; this limitation of impact increases your ability to scale, as less time is spent working on value-destroying incidents. The intent of change management is to limit the probability of changes causing production incidents by controlling them through their release into the production environment and logging them as they are introduced to production. Great companies implement change management not to reduce the rate of change, but rather to allow the rate of change to increase while decreasing the number of change-related incidents and their impact on shareholder wealth creation.
Increasing the velocity and quantity of change while decreasing the impact and probability of change-related incidents is how change management increases the scalability of your organization, service, or platform.

#### Change Management and Air Traffic Control

Sometimes, it is easiest to view change management as the same type of function that the Federal Aviation Administration (FAA) provides for aircraft at busy airports. Air Traffic Control (ATC) exists to reduce, and ideally eliminate, the frequency and impact of aircraft accidents during takeoff and landing at airports, just as change management exists to reduce the frequency and impact of change-related incidents within your platform, product, or system.

ATC works to order aircraft landings and takeoffs based on the availability of the aircraft, its individual needs (does the aircraft have a declared emergency, is it low on fuel, and so on), and its order in the queue for takeoffs and landings. Queue order may be changed for a number of reasons, including the aforementioned declaration of emergencies.

Just as ATC orders aircraft for safety, so does the change management process order changes for safety. Change management considers the expected delivery date of a change, its business benefit to help indicate ordering, the risk associated with the change, and its relationship with other changes in an attempt to deliver the fewest accidents possible.

Change identification is a point-in-time action, where someone indicates a change has been made and moves on to other activities. Change management is a life cycle process whereby changes are

* Proposed
* Approved
* Scheduled
* Implemented and logged
* Validated as successful
* Reviewed and reported on over time

The change management process may start as early as when a project is going through its business validation (or return on investment analysis), or it may start as late as when a project is ready to be moved into the production environment. Change management also includes continual process improvement, whereby metrics regarding incidents and resulting impact are collected in order to improve the change management process itself.

##### Change Management and ITIL

The Information Technology Infrastructure Library (ITIL) defines the goal of change management as follows:

> The goal of the Change Management Process is to ensure that standardized methods and procedures are used for efficient and prompt handling of all changes, in order to minimize the impact of change-related incidents upon service quality, and consequently improve the day-to-day operations of the organization.

Change management is responsible for managing change processes involving

* Hardware
* Communications equipment and software
* System software
* All documentation and procedures associated with the running, support, and maintenance of live systems

The ITIL is a great source of information should you decide to implement a robust change management process as defined by a recognized industry standard. For our purposes, we are going to describe a lightweight change management process that should be considered for any medium-sized enterprise.

##### Change Proposal

As described, the proposal of a change can occur anywhere in your cycle. The IT Service Management (ITSM) and ITIL frameworks hint at identification occurring as early in the cycle as the business analysis for a change. Within these frameworks, the change proposal is called a request for change. Opponents of ITSM actually cite the inclusion of business/benefit analysis within the change process as one of the reasons that ITSM and ITIL are not good frameworks. These opponents state that the business benefit analysis and feature/product selection steps have nothing to do with managing change. Although we agree that these are two separate processes, we also believe that a business benefit analysis should be performed somewhere. If business benefit analysis isn’t conducted as part of another process, including it within the change management process is a good first step. That said, this is a book on scalability and not on product and feature selection, so we will leave it at this: a benefit analysis should occur.
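
The life cycle stages listed earlier in this chapter (proposed, approved, scheduled, implemented and logged, validated, reviewed) can be sketched as a small state machine in which a proposal is the only entry point. This is a minimal sketch, not a prescribed design: the `Stage` and `ManagedChange` names are hypothetical, and the strictly linear progression is a simplifying assumption, since a real process may also reject, reschedule, or roll back a change.

```python
from enum import Enum


class Stage(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    SCHEDULED = "scheduled"
    IMPLEMENTED = "implemented and logged"
    VALIDATED = "validated as successful"
    REVIEWED = "reviewed and reported on"


# Stages in definition order; this sketch only allows moving to the next one.
_ORDER = list(Stage)


class ManagedChange:
    """A change that must pass through every life cycle stage in order."""

    def __init__(self, summary):
        self.summary = summary
        self.stage = Stage.PROPOSED      # every managed change starts as a proposal
        self.history = [Stage.PROPOSED]

    def advance(self, target):
        position = _ORDER.index(self.stage)
        expected = _ORDER[position + 1] if position + 1 < len(_ORDER) else None
        if target is not expected:
            raise ValueError(
                f"cannot move from {self.stage.value!r} to {target.value!r}"
            )
        self.stage = target
        self.history.append(target)


change = ManagedChange("Raise web server thread limit")
change.advance(Stage.APPROVED)
change.advance(Stage.SCHEDULED)
try:
    change.advance(Stage.VALIDATED)      # skips "implemented and logged"
except ValueError as err:
    print("rejected:", err)
```

Refusing to skip a stage is what separates a managed change from a change that is merely identified after the fact.
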

The most important thing to remember regarding a change proposal is that it kicks off all other activities. Ideally, it will occur early enough to allow some evaluation of the impact of the change and its relationship with other changes. For the change to actually be “managed,” we need to know certain things about the proposed change:

* The system, subsystem, and component being changed
* Expected result of the change
* Some information regarding how the change is to be performed
* Known risks associated with the change
* Relationship of the change to other systems and to recent or planned changes

You may decide to track significantly more information than this, but we consider this the minimum information necessary to properly plan change schedules.

The system undergoing change is important as we hope to limit the number of changes to a given system during a single time interval. Consider a system to be the equivalent of a runway at an airport. We don’t want two changes colliding in time on the same system because if there is a problem during the change, we won’t immediately know which change caused it. As such, we need to know the item being changed down to the granularity of what is actually being modified. For instance, if this is a software change and there is a single large executable or script that contains 100% of the code for that subsystem, we need only identify that we are changing out that executable or script.
On the other hand, if we are modifying one of several hundred configuration files, we should identify exactly which file is being modified. If we are changing a file, configuration, or software on an entire pool of servers with similar functionality, the pool is the most granular thing being changed and should be identified here; the steps of the change, including rolling it to each of the systems in the pool, would be identified in the information regarding how the change will be performed.

Architecture here plays a huge role in helping us increase change velocity. If we have a technology platform composed of a number of noncommunicating services, we increase the number of airports or runways for which we are managing traffic; as a result, we can have many more “landings,” or changes. If the services communicate asynchronously, we would have a few more concerns, but we are also likely more willing to take risks. On the other hand, if the services all communicate synchronously with each other, there isn’t much more fault tolerance than with a monolithic system (see Chapter 21, Creating Fault Isolative Architectural Structures) and we are back to managing a single runway at a single airport.

The expected result of the change is important as we want to be able to verify later that the change was successful.
For instance, if a change is being made to a Web server and that change is to allow more threads of execution in the Web server, we should state that as the expected result. If we are making a modification to our proprietary code to correct an error where the capital letter Q shows up as its hex value 51, we should indicate such.

Information regarding how the change is to be performed will vary with your organization and system. You may need to indicate precise steps if the change will take some time or requires a lot of work. For instance, if a server needs to be stopped and rebooted, that might impact what other changes can be going on at the same time. The larger and more complex the steps for the change in production, the more you should consider requiring those steps to be clearly outlined.

Identifying the known risks of the change is an often overlooked step. Very often, requesters of a change will quickly type in a commonly used risk to speed through the change request process. A little time spent in this area could pay huge dividends in avoiding a crisis. If there is a risk that data corruption may occur should a certain database table not be “clean” or truncated prior to the change, that risk should be called out in the change proposal. The more risks that are identified, the more likely it is that the change will receive the proper management oversight and risk mitigation, and the higher the probability of success for the change. We will cover risk identification and management in a future chapter in much more detail.
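
The minimum proposal information discussed above can be captured as a simple record with a completeness check. The sketch below is illustrative only: the `ChangeProposal` class, its field names, and the `missing_information` helper are hypothetical, and treating every field as free text is a simplifying assumption.

```python
from dataclasses import dataclass, fields


@dataclass
class ChangeProposal:
    """The minimum information needed to plan a change schedule."""
    target: str           # system, subsystem, and component being changed
    expected_result: str  # what we will verify after the change
    method: str           # how the change is to be performed
    known_risks: str      # risks associated with the change
    relationships: str    # ties to other systems and to recent or planned changes


def missing_information(proposal):
    """Return the names of fields that were left blank."""
    return [f.name for f in fields(proposal) if not getattr(proposal, f.name).strip()]


# Hypothetical proposal based on the Web server example in the text.
proposal = ChangeProposal(
    target="Web server pool, thread configuration file",
    expected_result="Web server allows more threads of execution",
    method="Update the configuration, then restart each server in the pool in turn",
    known_risks="",   # left blank: the requester skipped the risk analysis
    relationships="No dependency on other planned changes",
)

print(missing_information(proposal))
```

A check like this only proves that something was typed into each field. It cannot tell a thoughtful risk analysis from a commonly used risk pasted in to speed through the request.
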

Complacency often sets in quickly with these processes, and teams are quick to feel that identifying risks is simply a “check the box” exercise. A great way to encourage the appropriate behaviors and to get your team to analyze risks is to reward those who identify and avoid risks and to counsel those who have incidents occur outside of the risks they identified. This isn’t a new technique, but rather the application of tried and true management techniques. Reminding the team that a little time spent managing risks can save a lot of time in managing incidents, and even showing the team data from your own environment that demonstrates it, is a great tactic.

Finally, identifying the relationship to other systems and changes is a critical step. For instance, take the case that a requested change requires a modification to the login service of AllScale’s site and that this change depends upon another change to the account services module in order for the login service to function properly. The requester of the change should identify this dependency in her request. Ideally, the requester will identify that if the account services module is not changed, the login service will not work, or will corrupt data, or whatever the consequence of the dependency might be.

Depending upon the process that you ultimately develop, you may or may not decide to include a required or suggested date for your change to take place.
We highly recommend developing a process that allows individuals to suggest a date; however, the approving and scheduling authorities should be responsible for deciding on the final date based on all other changes, business priorities, and risks.

##### Change Approval

Change approval is a simple portion of the change management process. Your approval process may simply be a validation that all of the information required to “request” the change is indeed present, that is, that the change proposal has all required fields filled out appropriately. To the extent that you’ve implemented some form of the RASCI model, you may also decide to require that the appropriate A, or owner of the system in question, has signed off on the change and is aware of it. The primary reason for including this step in the change control process is to validate that everything that should happen prior to the change has in fact happened. This is also the place at which changes may be questioned with respect to their priority relative to other changes.

An approval here is not a validation that the change will have the expected results; it simply means that everything has been discussed and that the change has met with the appropriate approvals in all other processes prior to rolling out to your system, product, or platform. Bug fixes, for instance, may have an abbreviated approval process compared to a complete reimplementation of your entire product, platform, or system.
The former is addressing a current issue and might not require the approval of any organization other than QA, whereas the latter might require the final sign-off of the CEO.

##### Change Scheduling

The process of scheduling changes is where most of the additional benefit of change management occurs, over and above the benefit you get from implementing change identification. This is the point where the real work of the “air traffic controllers” comes in. Here, a group tasked with ensuring that changes do not collide or conflict applies a set of rules identified by its management team to maximize change benefit while minimizing change risk.

The business rules very likely will include limiting changes during peak utilization of your platform or system. If you have the heaviest utilization between 10 AM and 2 PM and between 7 PM and 9 PM, it probably doesn’t make sense to make your largest and most disruptive changes during these timeframes. You might limit or altogether eliminate changes during these timeframes if your risk tolerance is low. The same might hold true for specific times of the year. Sometimes though, as in very high volume change environments, we simply don’t have the luxury of disallowing changes during certain portions of the day and we need to find ways to manage our change risks elsewhere.
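
A peak-utilization rule of this kind is easy to check mechanically. The following is a minimal sketch, assuming the two peak windows from the example above and a change that starts and ends on the same day; the names `PEAK_WINDOWS` and `in_blackout` are hypothetical and not part of any particular change management tool.

```python
from datetime import datetime, time

# Assumed peak windows, taken from the example in the text:
# 10 AM to 2 PM and 7 PM to 9 PM, local time.
PEAK_WINDOWS = [(time(10, 0), time(14, 0)), (time(19, 0), time(21, 0))]


def in_blackout(start: datetime, end: datetime, windows=PEAK_WINDOWS) -> bool:
    """Return True if the change window [start, end) overlaps any peak window."""
    if start.date() != end.date() or end <= start:
        raise ValueError("sketch only handles same-day windows with end after start")
    for peak_start, peak_end in windows:
        # Two half-open intervals overlap when each starts before the other ends.
        if start.time() < peak_end and peak_start < end.time():
            return True
    return False
```

A scheduling group with low risk tolerance might reject any proposal for which `in_blackout` returns `True`, while a high volume shop might only use the result to require additional sign-off.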
##### The Business Change Calendar

Many businesses, from large to small, put the next three to six months and maybe even the next year’s worth of proposed changes into a shared calendar for internal viewing. This concept helps communicate changes to various organizations and often helps reduce the risks of changes, as teams start requesting dates that are not already full of changes. Consider the Change Calendar concept as part of your change management system. In very small companies, a change calendar may be the only thing you need to implement (along with change identification).

This set of business rules might also include an analysis of risk of the type discussed in Chapter 16, Determining Risk. We are not arguing for an intensive analysis of risk or even indicating that your process absolutely needs to have risk analysis. Rather, we are stating that if you can develop a high-level and easy risk analysis for the change, your change management process will be more robust and will likely yield better results. Each change might be assigned a risk profile of, say, high, medium, or low during the change proposal portion of the process. The company then may decide that it wants no more than three high risk changes, six medium risk changes, and 20 low risk changes happening in a week. Obviously, as the number of change requests increases over time, the company’s willingness to accept more risk on any given day within any given category will need to go up, or changes will back up in the queue and the time to market for any change will increase.
One way to help both limit the risk associated with change and increase change velocity is to implement fault isolative architectures, as we describe in Chapter 21.

Another consideration during the change scheduling portion of the process might be the beneficial business impact of the change. This analysis ideally is done in some other process, rather than being done for the first time within the change process. Someone, somewhere decided that the initiative requiring the change was of benefit to the company, and if you can represent that analysis in a lightweight way within the change process, you will likely benefit from it. If the risk analysis measures the product of the probability of failure multiplied by the effect of failure, benefit would then analyze the probability of success together with the impact of success. The company would have an incentive to move as many high value activities to the front of the queue as possible while being wary not to starve lower value changes.

An even better process would be to implement both analyses, with each recognizing the other in the form of a cost-benefit analysis. Risk and reward might offset each other to produce some value that the company defines, along with guidelines for implementing changes on any given day only when the risk-reward tradeoff falls between two values. We’ll cover the concepts of risk and benefit analysis in Chapter 16.
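
The risk limits and the risk versus reward ordering described above can be sketched together. This is only an illustration under stated assumptions: the weekly limits are the example numbers from the text, benefit is assumed to be the product of the probability and impact of success (by analogy with the risk product), and ranking by benefit minus risk is one possible way to let the two offset each other. The names `Proposal`, `WEEKLY_LIMITS`, and `schedule_week` are hypothetical.

```python
from dataclasses import dataclass

# Example weekly limits from the text; real limits are a business decision.
WEEKLY_LIMITS = {"high": 3, "medium": 6, "low": 20}


@dataclass
class Proposal:
    name: str
    risk_level: str       # "high", "medium", or "low"
    p_failure: float      # estimated probability that the change fails
    failure_cost: float   # estimated effect of a failure
    p_success: float      # estimated probability that the change succeeds
    success_value: float  # estimated impact of success

    @property
    def expected_risk(self) -> float:
        return self.p_failure * self.failure_cost

    @property
    def expected_benefit(self) -> float:
        # Assumption: benefit mirrors the risk product.
        return self.p_success * self.success_value


def schedule_week(proposals, limits=WEEKLY_LIMITS):
    """Pick changes for one week, highest net value first, within per-level limits."""
    used = {level: 0 for level in limits}
    scheduled, deferred = [], []
    ranked = sorted(
        proposals,
        key=lambda p: p.expected_benefit - p.expected_risk,
        reverse=True,
    )
    for p in ranked:
        if used[p.risk_level] < limits[p.risk_level]:
            used[p.risk_level] += 1
            scheduled.append(p)
        else:
            deferred.append(p)
    return scheduled, deferred
```

Note that this sketch does nothing to prevent starvation of lower value changes; a real process would need some aging rule or manual override so that deferred changes eventually reach the front of the queue.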
##### Key Aspects of Change Scheduling

Change scheduling is intended to minimize conflicts and reduce change related incidents. Key aspects of most scheduling processes are

* Change blackout times/dates during peak utilization or revenue generation
* Analysis of risk versus reward to determine priority of changes
* Analysis of relationships of changes for dependencies and conflicts
* Determination and management of maximum risk per time period or number of changes per time period to minimize probability of incidents

Change scheduling need not be burdensome; it can be contained within another meeting, and in small companies it can be quick and easy to implement without additional headcount.

##### Change Implementation and Logging

Change implementation and logging is the function of implementing the change in a production environment in accordance with the steps identified within the change proposal and consistent with the limitations, restrictions, or requests identified within the change scheduling phase. This phase consists of two steps: starting the change and logging its start time, and completing the change and logging its completion time. This is slightly more robust than the change identification process identified earlier in the chapter, but it will also yield greater results in a high change environment. If the change proposal does not include the name of the individual performing the change, the change implementation and logging steps should name the individuals associated with the change.
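
The two logging steps can be captured in a very small record. The following is a minimal sketch, assuming an in-memory record rather than a real changes database; `ChangeLogEntry` and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ChangeLogEntry:
    proposal_id: str
    implementer: str  # name of the individual performing the change
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None

    def start(self, now: Optional[datetime] = None) -> None:
        """Step 1: record when implementation of the change begins."""
        if self.started_at is not None:
            raise RuntimeError("change already started")
        self.started_at = now or datetime.now(timezone.utc)

    def complete(self, now: Optional[datetime] = None) -> None:
        """Step 2: record when implementation of the change finishes."""
        if self.started_at is None:
            raise RuntimeError("cannot complete a change that was never started")
        if self.completed_at is not None:
            raise RuntimeError("change already completed")
        self.completed_at = now or datetime.now(timezone.utc)
```

Recording both timestamps, rather than a single time as in change identification, makes it possible to tell whether an incident began while a change was still in flight.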
##### Change Validation

No process should be complete without verification that you accomplished what you expected to accomplish. While this should seem intuitively obvious to the casual observer, how often have you asked yourself “Why the heck didn’t Sue check that before she said she was done?” That question follows us outside of the technology world and into everything in our life: The electrical contractor completes the work on your new home, but you find several circuits that don’t work; your significant other says that his portion of the grocery shopping is done, but you find five items missing; the systems administrator claims that he is done with rebooting and repairing a faulty system, but your application doesn’t work.

Our point here is that you shouldn’t perform a change unless you know what you expect to get from that change. And it stands to reason that should you not get that expected result, you should consider undoing the change and rolling back, or at least pausing and discussing the alternatives. Maybe you made it halfway to where you want to be if it was a tuning change to help with scalability, and that’s good enough for now.

Validation becomes especially important in high scalability environments. If you are a hyper-growth company, we highly recommend adding a scalability validation to every significant change. Did you change the load, CPU utilization, or memory utilization for the worse on any critical systems as a result of your change?
If so, does that put you in a dangerous position during peak utilization/demand periods? The result of validation should be one of three things: an entry, made by the person making the change, recording when validation was completed; a rollback of the change if it did not meet the validation criteria; or an escalation to resolve the question of whether to roll back the change.

##### Change Review

The change management process should include a periodic review of its effectiveness. Looking back and remembering Chapter 5, Management 101, you simply cannot improve that which you do not measure. Key metrics to analyze during the change review are

* Number of change proposals submitted
* Number of successful change proposals (without incidents)
* Number of failed change proposals (without incidents, but the change was unsuccessful and didn’t make it to the validation phase)
* Number of incidents resulting from change proposals
* Number of aborted changes or changes rolled back due to failure to validate
* Average time to implement a proposal from submission

Obviously, we are looking for data indicating the effectiveness of our process.
If we have a high rate of change but also a high percentage of failures and incidents, something is definitely wrong with our change management process and something is likely wrong with other processes, our organization, and maybe our architecture. Aborted changes, on one hand, should be a source of pride for the organization, because they show that the validation step is finding issues and keeping incidents from happening; on the other hand, they are a source of future corrections to process or architecture, as the primary goal should be to have a successful change.

#### The Change Control Meeting

We’ve several times referred to a meeting wherein changes are approved and scheduled. The ITIL and ITSM refer to such meetings and gatherings of people as the Change Control Board or Change Approval Board. Whatever you decide to call it, we recommend a regularly scheduled meeting with a consistent set of people. It is absolutely okay for this to be an additional responsibility for several individual contributors and/or managers within your organization; oftentimes, having a diverse group of folks from each of your technical teams and even some of the business teams helps to make the most effective reviewing authority possible.

Depending upon your rate of change, you should consider a meeting once a day, once a week, or once a month. Attendees ideally will include representatives of each of your technical organizations and hopefully at least one team outside of technology that can represent the business or customer needs.
Typically, we see the head of the infrastructure or operations teams “chairing” the meeting, as he most often has the tools to be able to review change proposals and completed or failed changes.

The team should have access to the database wherein the change proposals and completed changes are stored. The team should also have a set of guidelines by which it analyzes changes and attempts to schedule them for production. Some of these guidelines were discussed previously in this chapter.

Part of the change control meetings, on a somewhat periodic basis, should include a review of the change control process using the metrics we’ve identified. It is absolutely acceptable to augment these metrics. Where necessary, postmortems should be scheduled to analyze failures of the change control process. These postmortems should be run consistently with the postmortem process we identified in Chapter 8. The output of the postmortems should be tasks to correct issues associated with the change control process, or should feed into requests for architecture changes or changes to other processes.

#### Continuous Process Improvement

Besides the periodic internal review of the change control process identified within the preceding “Change Control Meeting” section, you should implement a quarterly or annual review of the change control process. Are changes taking too long to implement as a result of the process? Are change related incidents increasing or decreasing as a percentage of total incidents? Are risks being properly identified? Are validations consistently performed and consistently correct?
As with any other process, the change control process should not be assumed to be correct. Although it might work well for a year or two given some rate of change within your environment, as you grow in complexity, rate of change, and rate of transactions, it very likely will need tweaking to continue to meet your needs. As we discussed in Chapter 7, Understanding Why Processes Are Critical to Scale, no process is right for every stage of your company.

##### Change Management Checklist

Your change management process has, at a minimum, the following phases:

* Change Proposal (the ITIL Request for Change or RFC)
* Change Approval
* Change Scheduling
* Change Implementation and Logging
* Change Validation
* Change Review

Your change management meeting should be composed of representatives from all teams within technology and members of the business responsible for working with your customers or stakeholders.

Your change management process should have a continual process improvement loop that helps drive changes to the change management process as your company and needs mature, and that also drives changes to other processes, organizations, and architectures as they are identified with change metrics.

#### Conclusion

We’ve discussed two separate change processes for two very different companies. Change identification is a very lightweight process for very young and small companies. It is powerful in that it can help limit the customer impact of changes when they go badly.
However, as companies grow and their rate of change grows, they often need a much more robust process that more closely approximates our air traffic control system.

Change management is a process whereby a company attempts to take control of its changes. Change management processes can vary from lightweight processes that simply attempt to schedule changes and avoid change related conflicts to very mature processes that attempt to manage the total risk and reward tradeoff on any given day or hour within a system. As your company grows and as your need to manage change-associated risks grows, you will likely move from a simple change identification process to a very mature change management process that takes into consideration risk, reward, timing, and system dependencies.

##### Key Points

* A change happens any time any one of your employees needs to touch, twiddle, prod, or poke any piece of hardware, software, or firmware.
* Change identification is an easy process for young or small companies, focused on being able to find recent changes and roll them back in the event of an incident.
* At a minimum, an effective change identification process should include the exact time and date of the change, the system undergoing change, the expected results of the change, and the contact information of the person making the change.
* The intent of change management is to limit the impact of changes by controlling them through their release into the production environment and logging them as they are introduced to production.
* Change management consists of the following phases or components: change proposal, change approval, change scheduling, change implementation and logging, change validation, and change efficacy review.
* The change proposal kicks off the process and should contain, as a minimum, the following information: system or subsystem being changed, expected result of the change, information on how the change is to be performed, known risks, known dependencies, and relationships to other changes or subsystems.
* The change proposal in more advanced processes may also contain information regarding risk, reward, and suggested or proposed dates for the change.
* The change approval step validates that all information is correct and that the person requesting the change has the authorization to make the change.
* The change scheduling step is the process of limiting risk by analyzing dependencies and rates of change on subsystems and components, and attempting to minimize the risk of an incident. Mature processes will include an analysis of risk and reward.
* The change implementation step is similar to the lightweight change identification process, but it includes the logging of start and completion times within the changes database.
* The change validation step is responsible for ensuring that the change had the expected result. A failure here might trigger a rollback of the change, or an escalation if partial benefit is achieved.
* The change review step is the change management team’s internal review of the change process and the results. It looks at data relating to rates of changes, failure rates, impact to time to market, and so on.
* The change control meeting is the meeting in which changes are approved, scheduled, and reviewed after implementation. It is typically chaired by the head of operations and/or infrastructure, and its members are participants from each engineering team and from customer facing business teams.
* The change management process should be reviewed by teams outside the change management team to determine its efficacy. A quarterly or annual review is appropriate and should be performed by the CTO/CIO and members of the executive staff of the company.
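
The change review metrics listed earlier in the chapter can be computed directly from the change records. The following is a minimal sketch, assuming each record carries an outcome label, an incident flag, and the time from submission to implementation; `ChangeRecord`, the outcome labels, and `review_metrics` are hypothetical names, not a prescribed schema.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ChangeRecord:
    outcome: str              # assumed labels: "success", "failed", "rolled_back"
    caused_incident: bool
    days_to_implement: float  # from proposal submission to implementation


def review_metrics(records):
    """Summarize change records into the counts and rates used in a change review."""
    total = len(records)
    if total == 0:
        return {"submitted": 0}
    incidents = sum(1 for r in records if r.caused_incident)
    return {
        "submitted": total,
        "successful": sum(
            1 for r in records if r.outcome == "success" and not r.caused_incident
        ),
        "failed": sum(1 for r in records if r.outcome == "failed"),
        "rolled_back": sum(1 for r in records if r.outcome == "rolled_back"),
        "incidents": incidents,
        "incident_rate": incidents / total,
        "avg_days_to_implement": mean(r.days_to_implement for r in records),
    }
```

Tracking these values from one review to the next shows whether the incident rate and the time to implement are trending in the right direction as the rate of change grows.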