### Chapter 18 Barrier Conditions and Rollback

> He will conquer who has learned the artifice of deviation. Such is the art of maneuvering.
>
> Sun Tzu

Whether you develop with an agile methodology, a classic waterfall methodology, or some hybrid, good processes for promoting systems into your production environment can protect you from significant failures, whereas poor processes may end up damning you to near-certain technical death. Checkpoints and barrier conditions within your product development life cycle can increase quality and reduce the cost of developing your product by detecting early when you are off course. But processes alone are not always enough. Even the best teams, with the best processes and great technology, make mistakes and incorrectly analyze the results of certain tests or reviews. If your platform implements a service, whether a Software as a Service play or a traditional back office IT system, you need to be able to quickly roll back significant releases to keep scale-related events from creating availability incidents.

Developing effective go/no-go processes or barrier conditions, ideally within a fault isolative infrastructure, and coupling them with a process and capability to roll back production changes, are necessary components of any highly available service and are critical to the success of your scalability goals. The companies focused most intensely on cost-effectively scaling their systems while guaranteeing high availability create several checkpoints in their development processes.
These checkpoints are an attempt to guarantee the lowest probability of a scalability-related event and to minimize the impact of that event should it occur. These companies also make sure that they can quickly get out of any event created through recent changes by ensuring that they can always roll back from any major change.

#### Barrier Conditions

You might read this heading and immediately assume that we are proposing that waterfall development cycles are the key to success within highly scalable environments. Very often, barrier conditions or entry and exit criteria are associated with the phases of waterfall development and sometimes identified as a reason for the inflexibility of a waterfall development model. Our intent here is not to promote the waterfall methodology, but rather to discuss the need for standards and protective measures regardless of your approach to development. For the purposes of this discussion, assume that a barrier condition is a standard against which you measure success or failure within your development life cycle. Ideally, you want to have these conditions or checkpoints established within your cycle to help you decide whether you are indeed on the right path for the product or enhancements that you are developing. Remember our discussion on goals in Chapter 4, Leadership 101, and Chapter 5, Management 101, and the need to establish and measure these goals. Barrier conditions are static goals, checked at regular “heartbeats” within a development cycle, to ensure that what you are developing aligns with your vision and need.
Barrier conditions for scalability might include desk checking a design against your architectural principles within an Architecture Review Board before the design is implemented, code reviewing the implementation to ensure it is consistent with the design, or performance testing an implementation within QA and then measuring the impact to scalability upon release to the production environment.

##### Example Scalability Barrier Conditions

We often recommend that the following barrier conditions be inserted into your development methodology or life cycle. Each is intended to limit the probability of occurrence and the resulting impact of any scalability issues within your production environment:

1. Architecture Review Board. From Chapter 14, Architecture Review Board, the ARB exists to ensure that designs are consistent with architectural principles. Architectural principles, in turn, ideally address one or more key scalability tenets within your platform. The intent of this barrier is to ensure that time isn’t wasted implementing or developing systems that are difficult or impossible to scale to your needs.
2. Code Reviews.
Modifying what is hopefully an existing and robust code review process so that it also verifies that architectural principles are followed within the implementation of the system in question is critical to ensuring that scalability problems are fixed in the code before they are identified within QA and have to be fixed later.
3. Performance Testing. From Chapter 17, Performance and Stress Testing, performance testing helps you identify potential issues of scale before introducing the system into a production environment and potentially impacting your customers with a scalability-related issue.
4. Production Monitoring and Measurement. Ideally, your system has been designed to be monitored as discussed within Chapter 12, Exploring Architectural Principles. Even if it is not, capturing key performance data from the user, application, and system perspectives after release and comparing it to previous releases can help you identify potential scalability-related issues early, before they impact your customers.

Your processes may include additional barrier conditions that you’ve found useful over time, but we consider these to be the bare minimum to help manage the risk of releasing systems that negatively impact customers due to scalability-related problems.

##### Barrier Conditions and Agile Development

In our practice, we have found that many of our clients have a mistaken perception that including or defining standards, constraints, or processes in agile processes is a violation of the agile mindset.
The very notion that process runs counter to agile methodologies is flawed from the outset, as any agile method is itself a process. Most often, we find the Agile Manifesto quoted out of context as a reason for eschewing any process or standard. As a review, and from the Agile Manifesto, agile methodologies value

* Individuals and interactions over processes and tools
* Working software over comprehensive documentation
* Customer collaboration over contract negotiation
* Responding to change over following a plan

Organizations often take “Individuals and interactions over processes and tools” out of context without reading the line that follows these bullets, which states, “That is, while there is value in the items on the right, we value the items on the left more.” It is clear from this line that processes add value, but that people and interactions should take precedence over them where we need to make choices. We absolutely agree with this approach and prefer to inject process into agile development most often as barrier conditions to test for an appropriate level of quality, scalability, and availability, or to help ensure that engineers are properly evaluated and taught over time. Let’s examine how some key barrier conditions enhance our agile method.

We’ll start with valuing working software over comprehensive documentation. None of the suggestions we’ve made, from ARB and code reviews to performance testing and production measurement, violate this rule.
The barrier conditions represented by ARB and Joint Architecture Design (JAD) are used within agile methods to ensure that the product under development can scale appropriately. ARB and JAD can be performed orally in a group and with limited documentation and are therefore consistent with the agile method.

The inclusion of barrier conditions and standards to help ensure that systems and products work properly in production actually supports the development of working software. We have not defined comprehensive documentation as necessary in any of our proposed activities, although it is likely that the results of these activities will be logged somewhere. Remember, we are interested in improving our processes over time, so logging performance results, for instance, will help us determine how often we are making mistakes in our development process that result in failed performance tests in QA or scalability issues within production.

The processes we’ve suggested also do not in any way hinder customer collaboration or favor contract negotiation over customer collaboration. In fact, one might argue that they foster a better working environment with the end customer, in that by inserting scalability barrier conditions you are actually looking out for your customer’s needs. Your customer is not likely capable of performing the type of design evaluation, reviews, testing, or measuring that is necessary to determine whether your product will scale to its needs. Your customer does, however, expect that you are delivering a product or service that will meet not only its business objectives but its scalability needs as well.
Collaborating to develop tests and measurements that will help ensure that your product meets customer needs, and inserting those tests and measurements into your development process, is a great way to take care of your customers and create shareholder value.

Finally, the inclusion of the barrier conditions we’ve suggested helps us to respond to change by helping us identify when that change is occurring. The failure of a barrier condition is an early alert to issues that we need to address immediately. Identifying in an ARB session that a component is incapable of being scaled horizontally (scale out, not up, from our recommended architectural principles) is a good indication of potential issues for our customer. Although we may make the executive decision to launch the feature, product, or service, we had better ensure that future agile cycles are used to fix the issue we’ve identified. However, if the need for scale is so dramatic that a failure to scale out will keep us from being successful, should we not respond immediately to that issue and fix it? Without such a process and series of checks, how would we ensure that we are meeting our customer’s needs?

Hopefully, we’ve convinced you that the addition of criteria against which you can evaluate the success of your scalability objectives is a good idea within your agile implementation.
If we haven’t, please remember our “board of directors” test within Chapter 5, Management 101. Would you feel comfortable stating that you absolutely would not develop processes within your development life cycle to ensure that your products and services could scale? Imagine yourself saying, “In no way, shape, or form will we ever implement barrier conditions or criteria to ensure that we don’t release products with scalability problems!” How long do you think you would have a job?

##### Cowboy Coding

Development without any process, without any plans, and without measurements to ensure that the results meet the needs of the business is what we often refer to as cowboy coding. The complete lack of process in cowboy-like environments is a significant barrier to success for any scalability initiative.

Often, we find that teams attempt to claim that cowboy implementations are “agile.” This simply isn’t true. The agile methodology is a defined life cycle that is tailored to be adaptive to your needs over time, versus other models that tend to be more predictive. The absence of process, as in any cowboy implementation, is neither adaptive nor predictive. Agile methodologies are not arguments against measurement or management. They are methodologies tuned to release small components or subsets of functionality quickly. They were developed to help control chaos by managing small, easily managed components rather than repeatedly failing at attempts to predict and control very large complex projects.

Do not allow yourself or your team to fall prey to the misconception that agile methodologies should not be measured or managed. Using a metric such as velocity to improve the estimation ability of engineers, but not to beat them up, is a fundamental part of the agile methodology. A lack of measuring dooms you to never improving, and a lack of managing dooms you to getting lost en route to your goals and vision. Being a cowboy when it comes to designing highly scalable solutions is a sure way to get thrown off of the bucking scalability bronco!

##### Barrier Conditions and Waterfall Development

The inclusion of barrier conditions within waterfall models is not a new concept. Most waterfall implementations include a concept of entry criteria and exit criteria for each phase of development. For instance, in a strict waterfall model, design may not start until the requirements phase is completed. The exit criteria for the requirements phase in turn may include a signoff by key stakeholders, a review of requirements by the internal customer (or an external representative), and a review by the organizations responsible for producing those requirements. In modified, overlapping, or hybrid waterfall models, requirements may need to be complete for the systems to be developed first but may not be complete for the entire product or system. If prototyping is employed, those requirements potentially need to be mocked up in a prototype before major design starts.

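Exit criteria like these lend themselves to automation. The following is a minimal sketch, not a prescribed implementation, of a go/no-go check for a testing phase that fails when any critical resource grows by more than an allowed percentage relative to the previous release. The `ResourceSample` type, the resource names, the sample numbers, and the 10% threshold are all assumptions made for illustration, as is the choice to flag only increases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSample:
    """Peak utilization of one critical resource (hypothetical metric, in percent)."""
    name: str
    baseline: float   # measured on the previous release
    candidate: float  # measured on the release under test


def percent_change(sample: ResourceSample) -> float:
    """Relative change of the candidate versus the baseline, in percent."""
    if sample.baseline <= 0:
        raise ValueError(f"baseline for {sample.name} must be positive")
    return (sample.candidate - sample.baseline) / sample.baseline * 100.0


def exit_criteria_failures(samples, max_increase_pct):
    """Return the names of resources whose increase exceeds the allowed percentage."""
    return [s.name for s in samples if percent_change(s) > max_increase_pct]


# Illustrative numbers only.
samples = [
    ResourceSample("cpu", baseline=40.0, candidate=42.0),      # about +5%
    ResourceSample("memory", baseline=50.0, candidate=60.0),   # about +20%
    ResourceSample("disk_io", baseline=30.0, candidate=27.0),  # about -10%
]

failures = exit_criteria_failures(samples, max_increase_pct=10.0)
print("NO-GO" if failures else "GO", failures)  # NO-GO ['memory']
```

A failed check does not have to block the release outright; it simply forces an explicit decision, which is the purpose of a barrier condition.
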

For our purposes, we need only inject the four processes we identified earlier into the existing barrier conditions. The Architecture Review Board lines up nicely as an exit criterion for the design phase of our project. Code reviews, including a review for consistency with our architectural principles, might create exit criteria for our coding or implementation phase. Performance testing should be performed during the validation or testing phase, with the requirement that no more than a specific percentage change be present for any critical system resource. Production measurements being defined and implemented should be the entry criteria for the maintenance phase, and significant unexpected increases in any measured area should trigger work to reduce the impact of the implementation or changes in architecture to allow for more cost-effective scalability.

##### Barrier Conditions and Hybrid Models

Many companies have developed models that merge agile and waterfall methodologies, and some continue to follow the predecessor to agile methods known as rapid application development (RAD). For instance, some companies may be required to develop software consistent with contracts and predefined requirements, such as those that interact with governmental organizations. These companies may wish to have some of the predictability of dates associated with a waterfall model, but desire to implement chunks of functionality quickly as in agile approaches.

The question for these models is where to place the barrier conditions for the greatest benefit. To answer that question, we need to return to the objectives of the barrier conditions. Our intent with any barrier condition is to ensure that we catch problems or issues early in our development so that we reduce the amount of rework needed to meet our objectives. It costs us less in time and work, for instance, to catch a problem in our QA organization than it does in our production environment. Similarly, it costs us less to catch an issue in ARB than to allow it to be implemented and caught in a code review.

The answer to the question of where to place the barrier conditions, then, is to place them where they add the most value and incur the least cost to our processes. Code reviews should be placed at the completion of each coding cycle or at the completion of chunks of functionality. The architectural review should occur prior to the beginning of implementation, production metrics obviously need to be gathered within the production environment, and performance testing should happen prior to the release of a system into the production environment.

#### Rollback Capabilities

You might argue that an effective set of barrier conditions in your development process should obviate the need for being able to roll back major changes within your production environment. We can’t really argue with that thought or approach, as technically it is correct.
However, arguing against the capability to roll back is really an argument against having an insurance policy. You may believe, for instance, that you don’t have a need for health insurance because you are a healthy individual and fairly wealthy. Or, you may argue against automobile insurance because you are, in the words of Dustin Hoffman in Rain Man, “an excellent driver.” But what happens when you contract a treatable cancer and don’t have the funds for the treatment, or someone runs into your vehicle and doesn’t have liability insurance? If you are like most people, your view of whether you need (or needed) this insurance changes immediately when it would become useful. The same holds true when you find yourself in a situation where fixing forward is going to take quite a bit of time and have quite an adverse impact on your clients.

##### Rollback Window Requirements

Rollback requirements differ significantly by business. The question to ask yourself in determining your specific rollback needs, at least from the perspective of scalability, is by when you will have enough information regarding performance to determine whether you need to undo your recent changes. For many companies, the bare minimum is to allow a weekly business day peak utilization period to pass so that you can have great confidence in the results of your analysis. This bare minimum may be enough for modifications to existing functionality, but when new functionality is added, it may not be enough.

New functions or features often have adoption curves that take more than one day to get enough traffic through that feature to determine its resulting impact on system performance. The amount of data gathered over time within any new feature may also have an adverse performance impact and, as a result, negatively impact your scalability.

Let’s return to Johnny Fixer and the HRM application at AllScale. Johnny’s team has been busy implementing a “degrees of separation” feature in the resume tracking portion of the system. The idea is that the system will identify people within the company who either know a potential candidate personally or who might know people who know the candidate, with the intent being to enable background checking through individuals’ relationships. The feature takes as inputs all companies at which current employees have worked and the list of companies for any given candidate. Johnny’s team initially figures that a linear search should be appropriate, as the list of potential companies and resulting overlaps are likely to be small.

The new feature is released and starts to compute relationship maps over the course of the next few weeks. Initially, all goes well and Johnny’s team is happy with the results and the runtime of the application. However, as the list of candidates grows, so does the list of companies for which the candidates have worked. Additionally, given the growth of AllScale, the number of employees has grown, as have their first and second order relationship trees.
Soon, many of the processes relying upon the degrees of separation function start timing out and customers are getting aggravated.

The crisis management process kicks in and Johnny’s team quickly identifies the culprit as the degrees of separation functionality. Working with the entire team, Johnny feels that the team can change this feature to use a more cost-effective search algorithm within a day and get it tested and rolled out to the site within 30 hours. Christine, the CEO, is concerned that the company will see a significant departure in user base if the problem is not fixed within a few hours.

If Johnny had followed our advice and made sure that he could roll back his last release, he could simply roll the code back and then roll it back out when the fix is made, assuming that his rollback process allowed him to roll back code released three days ago. Although this may cause some user confusion, proper messaging could help control that, and within two days Johnny could have the new code out and functioning properly without impact to his current scalability. If Johnny didn’t take our advice, or Johnny’s rollback process only allowed rolling back within the first six hours of a release, our guess is that Johnny would become a convert to ensuring he always has a rollback insurance policy that meets his needs.

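The story does not say which search algorithm Johnny’s team would adopt. As one hypothetical illustration of why the original choice degrades with growth, the sketch below contrasts a linear scan over every employee’s history with a lookup in a prebuilt index keyed by company. The function names and sample data are invented here, and only first-order relationships (a shared former employer) are covered.

```python
from collections import defaultdict


def linear_matches(employee_history, candidate_companies):
    """Naive approach: scan every employee's company list for every candidate company."""
    matches = set()
    for employee, companies in employee_history.items():
        for company in candidate_companies:
            if company in companies:  # membership test on a list is itself a linear scan
                matches.add(employee)
    return matches


def build_company_index(employee_history):
    """One pass over the data: map each company to the employees who worked there."""
    index = defaultdict(set)
    for employee, companies in employee_history.items():
        for company in companies:
            index[company].add(employee)
    return index


def indexed_matches(company_index, candidate_companies):
    """Per-candidate work now depends on the candidate's companies, not on head count."""
    matches = set()
    for company in candidate_companies:
        matches |= company_index.get(company, set())
    return matches


# Illustrative data only.
employee_history = {
    "ana": ["Acme", "Globex"],
    "bo": ["Initech"],
    "cy": ["Globex", "Umbrella"],
}
candidate_companies = ["Globex", "Hooli"]

index = build_company_index(employee_history)
print(sorted(linear_matches(employee_history, candidate_companies)))  # ['ana', 'cy']
print(sorted(indexed_matches(index, candidate_companies)))            # ['ana', 'cy']
```

Both functions return the same answer on small inputs, which is exactly why the problem only surfaced weeks after release, once the data had grown.
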

The last major consideration for determining your rollback window size deals with the frequency of your releases and how many releases you need to be capable of rolling back. Maybe you have a release process that has you releasing new functionality to your site several times a week. In this case, you may need to roll back more than one release if the adoption rate of any new functionality extends into the next release cycle. If this is the case, your process needs to be slightly more robust, as you are concerned about multiple changes and multiple releases rather than just one release to the next.

##### Rollback Window Requirements Checklist

To determine the timeframe within which you must be able to perform a rollback, you should consider the following things:

* How long is it between your release and the first heavy traffic period for your product?
* Is this a modification of existing functionality or a new feature?
* If this is a new feature, what is the adoption curve for this new feature?
* For how many releases do I need to consider rolling back based on my release frequency? We call this the rollback version number requirement.

Your rollback window should allow you to roll back after significant adoption of a new feature (say up to 50% adoption) and after or during your first time period of peak utilization.

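The checklist can be turned into a rough calculation. The sketch below is one simplified way to do so, under stated assumptions: the window is the later of the first peak and the point of target adoption, and the rollback version number is estimated as the number of release intervals that fit in that window. The `ReleaseProfile` type, its field names, and the sample figures are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseProfile:
    """Answers to the checklist questions for one release (all fields are assumptions)."""
    hours_to_first_peak: float        # release until the first heavy traffic period ends
    is_new_feature: bool              # False means a modification of existing functionality
    hours_to_target_adoption: float   # e.g. time to reach roughly 50% adoption
    hours_between_releases: float     # release frequency


def rollback_window_hours(profile: ReleaseProfile) -> float:
    """How long after release we must still be able to roll back."""
    if profile.is_new_feature:
        return max(profile.hours_to_first_peak, profile.hours_to_target_adoption)
    return profile.hours_to_first_peak


def rollback_version_count(profile: ReleaseProfile) -> int:
    """Estimated number of releases we must be able to undo (never fewer than one)."""
    window = rollback_window_hours(profile)
    return max(1, math.ceil(window / profile.hours_between_releases))


# Illustrative figures: releases twice a week (every 84 hours).
new_feature = ReleaseProfile(72.0, True, 240.0, 84.0)
modification = ReleaseProfile(72.0, False, 0.0, 84.0)

print(rollback_window_hours(new_feature), rollback_version_count(new_feature))    # 240.0 3
print(rollback_window_hours(modification), rollback_version_count(modification))  # 72.0 1
```

In this example, the new feature’s slow adoption curve, not the first traffic peak, drives the window, and with it the need to undo three releases rather than one.
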

##### Rollback Technology Considerations

During our discussions around the rollback insurance policy, we often hear that clients in general agree that being able to roll back would be great but that it is technically not feasible for them. Our answer to this is that it is almost always possible; it just may not be possible with your current team, processes, or architecture.

The most commonly cited reason for an inability to roll back in Web-enabled platforms and back office IT systems is database schema incompatibility. The argument usually goes that for any major development effort, there may be significant changes to the schema, resulting in an incompatibility between the ways old and new data are stored. This modification may result in table relationships changing, candidate keys changing, table columns changing, tables added, tables merged, tables disaggregated, and tables removed.

The key to fixing these database issues is to grow your schema over time and keep old database relationships and entities for at least as long as you might need to roll back to them should you run into significant performance issues. In the case where you need to move data to create schemas of varying normal forms, either for functionality reasons or performance reasons, consider using data movement programs potentially started by a database trigger, or using a data movement daemon or third-party replication technology. This data movement can cease whenever you have met or exceeded the rollback version number limit identified during your requirements. Ideally, you can turn off such data movement systems within a week or two after implementation, once you have validated that you do not need to roll back.

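To make “grow your schema over time” concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The `candidate` table, its columns, and the helper functions are hypothetical. The release adds new columns without dropping or renaming the old one, new code writes both shapes, and a short-lived data movement step fills in the new columns, so the previous release can still read its data after a rollback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema as the previous release knows it (hypothetical table).
conn.execute("CREATE TABLE candidate (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO candidate (full_name) VALUES ('Johnny Fixer')")

# New release: only add to the schema; the old column stays in place.
conn.execute("ALTER TABLE candidate ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE candidate ADD COLUMN last_name TEXT")


def save_candidate(conn, first_name, last_name):
    """New code writes both the old and the new shape during the rollback window."""
    conn.execute(
        "INSERT INTO candidate (full_name, first_name, last_name) VALUES (?, ?, ?)",
        (f"{first_name} {last_name}", first_name, last_name),
    )


def backfill_new_columns(conn):
    """Short-lived data movement: populate the new columns from the old data."""
    rows = conn.execute(
        "SELECT id, full_name FROM candidate WHERE first_name IS NULL"
    ).fetchall()
    for row_id, full_name in rows:
        first, _, last = full_name.partition(" ")
        conn.execute(
            "UPDATE candidate SET first_name = ?, last_name = ? WHERE id = ?",
            (first, last, row_id),
        )


save_candidate(conn, "Pat", "Example")
backfill_new_columns(conn)

# What the previous release would read after a rollback: its column is intact.
old_view = [row[0] for row in conn.execute("SELECT full_name FROM candidate ORDER BY id")]
new_view = conn.execute("SELECT first_name, last_name FROM candidate ORDER BY id").fetchall()
print(old_view)  # ['Johnny Fixer', 'Pat Example']
print(new_view)  # [('Johnny', 'Fixer'), ('Pat', 'Example')]
```

Dropping `full_name` would be deferred to a later release, after the rollback window has passed and the dual writes have been switched off.
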

Ideally, you will limit such data movement, and instead populate new data in new tables or columns while leaving old data in its original columns and tables. In many cases, this is sufficient to accomplish your needs. In the case where you are reorganizing data, simply move the data from the new to old positions for the period of time necessary to perform the rollback. If you need to change the name of a column or its meaning within an application, you must first make the change in the application, leaving the database alone, and then come back in a future release and change the database. This is an example of the general rollback principle of making the change in the application in release one and making the change in the database in a later release.

##### Cost Considerations of Rollback

If you’ve gotten to this point and determined that designing and implementing a rollback insurance policy has a cost, you are absolutely right! For some releases, the cost can be significant, adding as much as 10% or 20% to the cost of the release. In most cases and for most releases, we believe that you can implement an effective rollback strategy for less than 1% of the cost or time of the release, as very often you are really just talking about different ways to store data within a database or other storage system. Insurance isn’t free, but it exists for a reason.

Many of our clients have implemented procedures that allow them to violate the rollback architectural principle as long as several other risk mitigation steps or processes are in place. We typically suggest that the CEO or general manager of the product or service in question sign off on the risk and review the risk mitigation plan (see Chapter 16, Determining Risk) before agreeing to violate the rollback architectural principle. In the ideal scenario, the principle is only violated with very small, very low risk releases where the cost of being able to roll back exceeds the value of the rollback given the size and impact of the release. Unfortunately, what typically happens is that the rollback principle is violated for very large and complex releases in order to hit time to market constraints. The problem with this approach is that these large complex releases are often the ones for which you need rollback capability the most.

Challenge your team whenever it indicates that the cost or difficulty of implementing a rollback strategy for a particular release is too high. Often, there are simple solutions, such as implementing short-lived data movement scripts, to help mitigate the cost and increase the possibility of implementing the rollback strategy. Sometimes, the risk of a release can be significantly mitigated by implementing markdown logic for complex features rather than needing to ensure that the release can be rolled back.
In our consulting practice at AKF Partners, we have seen many team members who start by saying, "We cannot possibly roll back." After they accept the fact that it is possible, they are then able to come up with creative solutions for almost any challenge.

#### Markdown Functionality: Design to Be Disabled

Another of our architectural principles from Chapter 12 was designing a feature to be disabled. This differs from rolling back features in at least two ways. The first is that, if implemented properly, it is typically faster to turn a feature off than it is to replace it with the previous version or release of the system. When done well, the application may listen to a dedicated communication channel for instructions to disallow or disable certain features. Other approaches may require a restart of the application to pick up new configuration files. Either way, it is typically much faster to disable functions causing scalability problems than it is to replace the system with the previous release.

Another way functionality disabling differs from rolling back is that it might allow all of the other functions within any given release, both modified and new, to continue to function as normal. If in our example of our dating site we had released both the "has he dated a friend of mine" search and another feature that allowed the rating of any given date, we would only need to disable our search feature until it is fixed rather than rolling back and in effect turning off both features.
This obviously gives us an advantage in releases containing multiple fixes and both modified and new functionality.

Designing all features to be disabled, however, can sometimes add an even more significant cost than designing to roll any given release back. The ideal case is that the cost is low for both designing to be disabled and rolling back, and the company chooses to do both for all new and modified features. Most likely, you will identify features that are high risk, using the Failure Mode and Effects Analysis described in Chapter 16, to determine which features should have markdown functionality enabled. Code reuse or a shared service that is called asynchronously may help to significantly reduce the cost of implementing functions that can be disabled on demand. Implementing both rollback and feature disabling helps enable agile methods by creating an adaptive and flexible production environment rather than relying on predictive methods such as extensive, costly, and often low-return performance testing.

If implemented properly, designing to be disabled and designing for rollbacks can actually decrease your time to market by allowing you to take some risks in production that you would not take in their absence. Although not a replacement for load and performance testing, it allows you to perform such testing much more quickly in recognition of the fact that you can easily move back from implementations once released.
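
The per-feature switch described in this section can be sketched as follows. This is a minimal illustration rather than a prescribed design: the `FeatureSwitches` class, the text command format, and the two handler functions are all hypothetical, and a real system would receive the commands over whatever dedicated channel or configuration mechanism it already has.

```python
import threading


class FeatureSwitches:
    """In-process registry of features that can be marked down at runtime."""

    def __init__(self, features):
        self._lock = threading.Lock()
        # Every known feature starts enabled.
        self._enabled = {name: True for name in features}

    def is_enabled(self, name):
        # Unknown features are treated as disabled.
        with self._lock:
            return self._enabled.get(name, False)

    def apply_command(self, command):
        """Apply a control message such as 'disable friend_date_search'."""
        action, _, name = command.strip().partition(" ")
        if action not in ("enable", "disable"):
            raise ValueError(f"unknown action: {action!r}")
        with self._lock:
            if name not in self._enabled:
                raise KeyError(name)
            self._enabled[name] = action == "enable"


# Hypothetical features from the dating site example.
SWITCHES = FeatureSwitches(["friend_date_search", "date_rating"])


def friend_date_search(member, candidate):
    """Expensive search; returns a degraded response when marked down."""
    if not SWITCHES.is_enabled("friend_date_search"):
        return {"status": "unavailable"}
    return {"status": "ok", "results": []}  # placeholder for the real query


def rate_date(date_id, stars):
    """Independent feature that keeps working while the search is off."""
    if not SWITCHES.is_enabled("date_rating"):
        return {"status": "unavailable"}
    return {"status": "ok", "date_id": date_id, "stars": stars}


if __name__ == "__main__":
    SWITCHES.apply_command("disable friend_date_search")
    print(friend_date_search("alice", "bob"))  # {'status': 'unavailable'}
    print(rate_date(7, 5))  # {'status': 'ok', 'date_id': 7, 'stars': 5}
```

Only the search is marked down; the rating feature released alongside it keeps serving requests, which is the advantage over rolling back the whole release.
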

##### The Barrier Condition, Rollback, and Markdown Checklist

Do you have the following?

* Something to block bad scalability designs from proceeding to implementation?
* Reviews to ensure that code is consistent with a scalable design or principles?
* A way to test the impact of an implementation before it goes to production?
* Ways to measure the impact of production releases immediately?
* A way to roll back a major release that impacts your ability to scale?
* A way to disable functionality that impacts your ability to scale?

Answering yes to all of these puts you on a path to identifying scale issues early and being able to recover from them quickly when they happen.

#### Conclusion

This chapter covered topics such as barrier conditions, rollback capabilities, and markdown capabilities that help companies manage the risk associated with scalability incidents and recover quickly from them if and when they happen. Barrier conditions (a.k.a. go/no-go processes) focus on identifying and eliminating risks to future scalability early within a development process, thereby lowering the cost of identifying the issue and eliminating the threat of it in production. Rollback capabilities allow for the immediate removal of any scalability related threat, thereby limiting its impact on customers and shareholders. Markdown and disabling capabilities allow features impacting scalability to be disabled on a per-feature basis, removing them as threats when they cause problems.

Ideally, you will consider implementing all of these. Sometimes, on a per-release basis, the cost of implementing either rollback or markdown capabilities is exceptionally high. In these cases, we recommend a thorough review of the risks and all of the risk mitigation steps possible to help minimize the impact on your customers and shareholders. In the event of high cost for both markdown and rollback, consider implementing at least one unless the feature is small and not complex. Should you decide to forgo implementing both markdown and rollback, ensure that you perform adequate load and performance testing and that you have all of the necessary resources available during product launch to monitor and recover from any incidents quickly.

##### Key Points

* Barrier conditions or go/no-go processes exist to isolate faults early in your development life cycle.
* Barrier conditions can work with any development life cycle. They do not need to be document intensive, though data should be collected to learn from past mistakes.
* Architecture Review Board, code reviews, performance testing, and production measurements can all be considered examples of barrier conditions if the result of a failure of one of these conditions is to rework the system in question.
* Designing the capability to roll back into an application helps limit the scalability impact of any given release.
Consider it an insurance policy for your business, shareholders, and customers.
* Designing to disable, or mark down, features complements designing for rollback and adds the flexibility of keeping the most recent release in production while eliminating the impact of offending features or functionality.