这篇文章上次修改于 217 天前,可能其部分内容已经发生变化,如有疑问可询问作者。 Chapter 21Creating Fault Isolative Architectural Structures创建断层隔离建筑结构 > The natural formation of the country is the soldier’s best ally.—Sun Tzu > 国家的自然形态是士兵最好的盟友。——孙子 Part III, Architecting Scalable Solutions, focuses on the technology aspects of scale. Ifyou purchased this book and flipped immediately to Part III, we entreat you toreview the introduction and Parts I, Staffing a Scalable Organization, and II, BuildingProcesses for Scale, of this book. The ties between technology and architecture andscalability are obvious, but where companies most often go wrong is in also address-ing the issues of process and organization. In our minds, a failure to treat the issueswithin your process and organization is akin to treating acquired diabetes with insu-lin and not addressing exercise and diet as well. 第三部分,构建可扩展的解决方案,重点关注规模的技术方面。如果您购买了本书并立即翻到第三部分,我们恳请您阅读本书的简介和第一部分“为可扩展的组织配备人员”和第二部分“规模化流程”。技术、架构和可扩展性之间的联系是显而易见的,但公司最常出错的地方还在于解决流程和组织问题。在我们看来,未能解决流程和组织中的问题就像用胰岛素治疗后天糖尿病而不解决锻炼和饮食问题一样。 This chapter focuses on architecting to isolate and limit the effects of failure withinany system. In the days before full duplex and 10 gigabit Ethernet, when repeatersand hubs were used within CSMA/CD (carrier sense multiple access with collisiondetection) networks, collisions among transmissions were common. Collisionsreduced the speed and effectiveness of the network as collided packets would likelynot be delivered on their first attempt. Although the Ethernet protocol (then animplementation of CSMA/CD) used collision detection and binary exponential backoff to protect against congestion in such networks, network engineers additionallydeveloped the practice of segmenting networks to allow for fewer collisions and afaster overall network. This segmentation into multiple collision domains also cre-ated a fault isolative infrastructure wherein a bad or congested network segmentwould not necessarily propagate its problems to each and every other peer or siblingnetwork segment. With this approach, collisions were reduced, overall speed of deliv-ery in most cases was increased, and failures in any given segment would not bringthe entire network down. 本章重点介绍如何构建隔离和限制任何系统内故障影响的架构。在全双工和 10 Gb 以太网出现之前,当中继器和集线器在 CSMA/CD(带冲突检测的载波侦听多路访问)网络中使用时,传输之间的冲突很常见。冲突降低了网络的速度和有效性,因为冲突的数据包可能无法在第一次尝试时传送。尽管以太网协议(当时是 CSMA/CD 的实现)使用冲突检测和二进制指数退避来防止此类网络中的拥塞,但网络工程师还开发了分段网络的做法,以减少冲突并加快整体网络的速度。这种划分为多个冲突域的方式还创建了一个故障隔离基础设施,其中不良或拥塞的网段不一定会将其问题传播到每个其他对等或同级网段。通过这种方法,可以减少冲突,在大多数情况下提高整体交付速度,并且任何给定网段的故障都不会导致整个网络瘫痪。 This same general approach can be applied to not just networks, but every othercomponent of your system’s architecture. When we use the term system’s architec-ture, we are referring to the way in which your holistic product or platform works.The platform is composed of several entities or subcomponents: a network, propri-etary software, servers, databases, operating systems, firewalls, third-party software,application servers, Web servers, and so on. The concept of creating a fault isolativearchitecture can be applied to each of these individual components and to the sys-tem’s architecture as a whole. 同样的通用方法不仅可以应用于网络,还可以应用于系统架构的每个其他组件。当我们使用术语系统架构时,我们指的是整体产品或平台的工作方式。平台由多个实体或子组件组成:网络、专有软件、服务器、数据库、操作系统、防火墙、第三方软件、应用程序服务器、Web 服务器等。创建故障隔离架构的概念可以应用于每个单独的组件以及整个系统的架构。 ####Fault Isolative Architecture Terms 故障隔离架构术语 In our practice, we often refer to fault isolative architectures as swim lanes. Althoughwe did not coin the term, we believe it to be a great metaphor for what it is we wantto create within architectures. For swimmers, the swim lane represents both a barrierand a guide. The barrier exists to ensure that the swimmer does not cross over intoanother lane and interfere with another swimmer. In a race, this helps to ensure thatno interference happens to unduly influence the probability that any given swimmerwill win the race. In practice, or in exercise pools, the barriers exist to ensure thatnovice swimmers do not interfere with swimmers of greater capabilities. Addition-ally, the lanes help guide the swimmer toward her objective with minimal effort onthe part of the swimmer; in strokes requiring the submersion of a swimmer’s head,she can see the lanes as the head is turned or raised for air. 在我们的实践中,我们经常将故障隔离架构称为泳道。尽管我们没有创造这个术语,但我们相信它是我们想要在架构中创建的内容的一个很好的隐喻。对于游泳者来说,泳道既是屏障,又是引导者。屏障的存在是为了确保游泳者不会跨入另一条泳道并干扰另一名游泳者。在比赛中,这有助于确保不会发生干扰,从而不适当地影响任何特定游泳运动员赢得比赛的可能性。在实践中或在练习池中,存在障碍以确保新手游泳者不会干扰能力更强的游泳者。此外,泳道有助于引导游泳者以最小的努力实现目标;在需要将头部浸入水中的泳姿中,当头部转动或抬起换气时,她可以看到泳道。 Swim lanes in architecture protect your systems operations similarly to how swimlanes protect swimmers and ensure safe and efficient operation of a pool. Operationsof a set of systems within a swim lane are meant to stay within the guide ropes of thatswim lane and not cross into the operations of other swim lanes. Furthermore, swimlanes provide guides for architects and engineers designing new functionality to helpthem decide what set of functionality should be placed in what type of swim lane forprogress toward the architectural goal of high scalability. 建筑中的泳道可保护您的系统运行,就像泳道保护游泳者并确保泳池安全高效的运行一样。泳道内的一组系统的操作应留在该泳道的引导绳内,并且不会交叉到其他泳道的操作中。此外,泳道为设计新功能的架构师和工程师提供了指南,帮助他们决定应将哪组功能放置在哪种类型的泳道中,以实现高可扩展性的架构目标。 The term swim lane, however, is not the only fault isolative term used within thetechnical community. Terms like a pod are often used to define fault isolativedomains representing a group of customers or set of functionality. Podding is the actof splitting some set of data and functionality into several groups of fault isolation.Sometimes pods are used to represent groups of services and sometimes they are usedto represent separation of data. Thinking back to our definition of fault isolation asapplied to either components or systems, the separation of data or services alonewould be fault isolation at a component level only. Although this has benefits to theoverall system, it is not a complete fault isolation domain from a systems perspectiveand as such only protects you for the component in question. 然而,泳道一词并不是技术界使用的唯一故障隔离术语。像 Pod 这样的术语通常用于定义代表一组客户或一组功能的故障隔离域。 Podding 是将一组数据和功能拆分为多个故障隔离组的行为。有时 Pod 用于表示服务组,有时用于表示数据分离。回想一下我们对应用于组件或系统的故障隔离的定义,数据或服务的分离仅是组件级别的故障隔离。尽管这对整个系统有好处,但从系统角度来看,它并不是一个完整的故障隔离域,因此只能保护有问题的组件。 Shard is yet another term that is often used within the technical community. Mostoften, it describes a database structure or storage subsystem. Sharding is the splittingof these systems into failure domains with the failure of a single shard not bringingthe remainder of the system down as a whole. A storage system comprised of 100shards may have a single failure that allows the other 99 shards to continue to oper-ate. As with pods, however, this does not mean that the systems addressing thoseremaining 99 shards will function properly. We will discuss this concept in moredetail later in this chapter. 分片是技术社区中经常使用的另一个术语。大多数情况下,它描述数据库结构或存储子系统。分片是将这些系统划分为多个故障域,单个分片的故障不会导致整个系统的其余部分崩溃。由 100 个分片组成的存储系统可能会出现一个故障,但允许其他 99 个分片继续运行。然而,与 Pod 一样,这并不意味着处理剩余 99 个分片的系统将正常运行。我们将在本章后面更详细地讨论这个概念。 Slivers, chunks, and pools are also terms with which we have become familiar overtime. Slivers are often used as a replacement for shards. Chunks often are used as asynonym for pods. Pools most often reference a group of servers that perform similartasks, this is a fault isolation term but not in the same fashion as swim lanes as we’lldiscuss later. Most often, these are application servers or Web servers performingsome portion of functionality for your platform. All of these terms most often repre-sent components of your overall system design, though the term can easily beextended to mean the entire system or platform rather than just its subcomponent. 条子、块和池也是我们逐渐熟悉的术语。条子通常用作碎片的替代品。 chunk 通常用作 pod 的同义词。池通常引用一组执行类似任务的服务器,这是一个故障隔离术语,但与我们稍后讨论的泳道不同。大多数情况下,这些是为您的平台执行某些功能的应用程序服务器或 Web 服务器。所有这些术语通常代表整个系统设计的组件,尽管该术语可以轻松扩展为表示整个系统或平台而不仅仅是其子组件。 Ultimately, there is no single “right” answer regarding what you should call yourfault isolative architecture. Choose whatever word you like the most or make upyour own descriptive word. There is, however, a “right” approach and that’s todesign to allow for scale and graceful failure under extremely high demand. 最终,对于所谓的故障隔离架构,没有单一的“正确”答案。选择您最喜欢的单词或创建您自己的描述性单词。然而,有一个“正确”的方法,那就是设计允许在极高的需求下进行规模化和优雅的失败。 #####Common Fault Isolation Terms 常见故障隔离术语 Swim lane is most often used to describe a fault isolative architecture from a platform or com-plete system perspective. 泳道最常用于从平台或完整系统的角度描述故障隔离架构。 Pod is most often used as a replacement for swim lane, especially when fault isolation is per-formed on a customer or geographic basis. Pod 最常用作泳道的替代品,尤其是在按客户或地理基础执行故障隔离时。 Shard is a fault isolation term most often used when referencing the splitting of databases orstorage subcomponents. 分片是一个故障隔离术语,在涉及数据库或存储子组件的拆分时最常使用。 Sliver is a synonym for pod, often also used for storage and database subcomponents.Chunk is a synonym for pods. Sliver 是 pod 的同义词,通常也用于存储和数据库子组件。Chunk 是 pod 的同义词。 Pool is a fault isolation term commonly applied to software services but is not necessarily aswim lane in implementation. 池是一个常用于软件服务的故障隔离术语,但在实现中不一定是泳道。 ####Benefits of Fault Isolation 故障隔离的好处 Fault isolative architectures offer many benefits within a platform or product. Thesebenefits range from the obvious benefits of increased availability and scalability tothe less obvious benefits of decreased time to market and cost of development. Com-panies find it easier to roll back releases, as we described in Chapter 18, Barrier Con-ditions and Rollback, and push out new functionality while the site, platform, orproduct is “live” and serving customers. 故障隔离架构在平台或产品中提供了许多好处。这些好处包括从提高可用性和可扩展性等明显好处到缩短上市时间和降低开发成本等不太明显的好处。公司发现更容易回滚版本,正如我们在第 18 章“障碍条件和回滚”中所述,并在站点、平台或产品“上线”并为客户提供服务时推出新功能。 #####Fault Isolation and Availability—Limiting Impact 故障隔离和可用性——限制影响 As the name would seem to imply, fault isolation greatly benefits the availability ofyour platform or product. When a fault isolation domain or swim lane fails at theplatform or systems architecture level, you only lose the functionality, geography, orset of customers that the swim lane serves. Of course, this assumes that you havearchitected your swim lane properly and that other swim lanes are not making calls tothe swim lane in question. Of course, the choice of a swim lane in this case can result inhaving no net benefit to your availability, so the architecting of swim lanes becomesvery important. To explain this, let’s look at a swim lane architecture that supportshigh availability and contrast it with a poorly architected swim lane architecture. 顾名思义,故障隔离极大地提高了平台或产品的可用性。当故障隔离域或泳道在平台或系统架构级别发生故障时,您只会失去泳道所服务的功能、地理位置或客户集。当然,这假设您已经正确构建了泳道,并且其他泳道没有调用相关泳道。当然,在这种情况下选择泳道可能不会对您的可用性产生任何净效益,因此泳道的设计变得非常重要。为了解释这一点,让我们看一下支持高可用性的泳道架构,并将其与架构不佳的泳道架构进行对比。 Our fictitious company AllScale, which we have been using to provide examplesfor various topics, is at it again. The AllScale team decides to apply the concept ofcreating swim lanes to both the newly developed customer relationship management(CRM) and the existing human resources management (HRM) system. Both are SaaS(Software as a Service) platforms. Johnny Fixer, CTO, and his team develop theCRM platform from scratch to allow for multitenancy at a company level, meaningthat multiple companies can reside within the same physical database to reduce youroverall cost and make the most efficient use of your capital. The AllScale architectsalso recognize the need for scalability long term as their customer base grows overtime. As such, they decide that they will split the application and databases along cus-tomer boundaries for both the newly developed CRM solution and the existing HRMsolution. Johnny and the AllScale team decide that the smallest customer segmentthey will ever need to split upon is a division within a company. The AllScale archi-tects also decide to run multiple live data centers throughout the United States. 我们虚构的公司 AllScale 再次出现,我们一直用它来提供各种主题的示例。 AllScale 团队决定将创建泳道的概念应用到新开发的客户关系管理 (CRM) 和现有的人力资源管理 (HRM) 系统中。两者都是SaaS(软件即服务)平台。首席技术官 Johnny Fixer 和他的团队从头开始开发 CRM 平台,以允许公司级别的多租户,这意味着多个公司可以驻留在同一个物理数据库中,以降低总体成本并最有效地利用您的资本。 AllScale 架构师还认识到,随着客户群的不断增长,长期需要可扩展性。因此,他们决定将新开发的 CRM 解决方案和现有的 HRM 解决方案沿着客户边界拆分应用程序和数据库。 Johnny 和 AllScale 团队决定,他们需要划分的最小客户群是公司内部的部门。 AllScale 架构师还决定在美国各地运行多个实时数据中心。 The AllScale architects select a swim lane, or fault isolative architecture, thatplaces somewhere between a very large company division and several smaller compa-nies into a single data center with all of the services necessary to run those customersout of that data center. The swim lane construction is depicted in Figure 21.1.Thedata center location is selected based on proximity to the corporate headquarters ofthe companies the data center will serve. No services are allowed to communicatebetween data centers. As a result, when any set of components from the database tothe border routers fail within a data center, only the customers contained within thatdata center are impacted for the period of outage or service degradation. AllScale 架构师选择了一个泳道或故障隔离架构,将一个非常大的公司部门和几个较小的公司之间的某个位置放入一个数据中心,并提供从该数据中心运行这些客户所需的所有服务。泳道结构如图 21.1 所示。数据中心位置的选择是基于与数据中心将服务的公司总部的距离。不允许任何服务在数据中心之间进行通信。因此,当数据中心内从数据库到边界路由器的任何组件集发生故障时,只有该数据中心内包含的客户在中断或服务降级期间受到影响。 ![](https://blog.baidu-google.com/usr/uploads/2024/06/3863393965.png) The AllScale architects further identify a way to scale with fault isolation within adata center. With virtual local area network segmentation and multiple databases,more than one division or group of companies can be placed into one of many faultisolation domains within the data center. This allows for fault isolation of systemsand services below the internal router and inclusive of network LANs, databases,application servers, and so on. Again, services are not allowed to communicate acrossthese fault isolation domains. Here too, the result is that any equipment failure otherthan shared network components such as routers and border routers will be isolatedto the customers within a single zone or domain within the data center. In implemen-tation, the design exceeds expectations and allows the company to roll beta productsto isolated customer segments, thereby further decreasing risk. AllScale 架构师进一步确定了一种在数据中心内通过故障隔离进行扩展的方法。通过虚拟局域网分段和多个数据库,可以将多个部门或一组公司置于数据中心内的多个故障隔离域之一中。这允许对内部路由器下方的系统和服务进行故障隔离,包括网络 LAN、数据库、应用程序服务器等。同样,服务不允许跨这些故障隔离域进行通信。在这里,结果也是除了共享网络组件(例如路由器和边界路由器)之外的任何设备故障都将被隔离到数据中心内单个区域或域内的客户。在实施过程中,该设计超出了预期,并允许公司将测试版产品推向孤立的客户群,从而进一步降低风险。 Contrast the preceding approach with an approach where fault isolation domainsare created on a services level. Let’s imagine that the AllScale team created fault isola-tive structures along the boundaries of services rather than customer boundaries. Inthis case, the team might create a swim lane for the login service, one for the popula-tion and updating of leads, one for the reading of leads, one for reporting on leadmetrics, and so on. This initially seems like an appropriate approach to achieve somelevel of scale within the system and isolate faults. The issue with this approach is thatthe failure of at least one of these services is going to have an unintended downstreameffect on the other services. For instance, in the preceding example, a login servicefailure will prevent access to the system, and although many of the other services maystill be available, you would expect system utilization to decay over time as new log-ins won’t be accepted. This in turn affects 100% of the AllScale clients attempting tointeract with the platform after the login failure. 将上述方法与在服务级别创建故障隔离域的方法进行对比。让我们想象一下,AllScale 团队沿着服务边界而不是客户边界创建了故障隔离结构。在这种情况下,团队可能会为登录服务创建一条泳道,一条用于填充和更新潜在客户,一条用于读取潜在客户,一条用于报告潜在客户指标,等等。最初,这似乎是在系统内实现一定规模并隔离故障的适当方法。这种方法的问题在于,这些服务中至少一个出现故障将对其他服务产生意想不到的下游影响。例如,在前面的示例中,登录服务失败将阻止对系统的访问,尽管许多其他服务可能仍然可用,但您预计系统利用率会随着时间的推移而下降,因为新的登录将不被接受。这反过来会影响登录失败后尝试与平台交互的 100% AllScale 客户端。 This is not to say that such a service-oriented isolation approach should never beused. Quite the contrary, it is an excellent way to isolate your code base, speed timeto market through the isolation, and reduce the scalability requirements on cachingfor action specific services. But whenever you have services that rely upon anotherservice, either synchronously as described earlier or simply in a time oriented serieswhen one service is called before another, you subject yourself to a higher rate of fail-ure. You can either ensure that the first order service (the first one to be called beforeany other service can be used such as a login) is so highly available and redundant asto minimize the risk or you can perform multiple splits to further isolate failures. 这并不是说永远不应该使用这种面向服务的隔离方法。恰恰相反,它是隔离代码库、通过隔离加快上市时间并降低特定于操作的服务的缓存的可扩展性要求的绝佳方法。但是,只要您有依赖于另一个服务的服务,无论是如前所述同步,还是简单地以面向时间的系列方式(当一个服务在另一个服务之前被调用时),您都会面临更高的失败率。您可以确保第一个订单服务(在使用任何其他服务(例如登录)之前调用的第一个服务)具有高可用性和冗余性,以最大限度地降低风险,或者您可以执行多个拆分以进一步隔离故障。 The former approach of making the service even more highly available can beaccomplished by adding significantly more capacity than is typically needed. In addi-tion, the incorporation of markdown functionality (see the following sidebar “Mark-down Logic Revisited” or Chapter 18 for a review) on a per company basis might helpus isolate certain problems. Forcing a small percentage of customers through a spe-cialized login pool service for new login code might reduce AllScale’s risk with newrollouts, and establishing connection limits on the servers might help them keep somecustomers logging in properly when they have slowdowns in service for some reason. 前一种提高服务可用性的方法可以通过显着增加比通常需要的容量来实现。此外,在每个公司的基础上合并降价功能(请参阅下面的边栏“降价逻辑重温”或第 18 章进行回顾)可能有助于我们隔离某些问题。强制一小部分客户通过专门的登录池服务获取新的登录代码可能会降低 AllScale 进行新部署的风险,并且在服务器上建立连接限制可能会帮助他们在某些客户因某种原因导致服务速度减慢时保持正确登录。 #####Markdown Logic Revisited 重新审视 Markdown 逻辑 You may recall from Chapter 18 that we provided an implementation of the architectural princi-ple Design to be Disabled in what we called markdown functionality. Markdown functionalityenables certain features within a product to be turned off without affecting other features. Thetypical reason companies invest in markdown functionality is to limit the negative impact of newfeature releases on either availability or scalability. 你可能还记得第 18 章,我们在所谓的 Markdown 功能中提供了“禁用设计”架构原则的实现。 Markdown 功能可以关闭产品中的某些功能,而不影响其他功能。公司投资降价功能的典型原因是为了限制新功能发布对可用性或可扩展性的负面影响。 Proper markdown functionality allows a new release to remain in a production environmentwhile the offending code or system is fixed and without rolling the entire release back. Theoffending code or system is simply taken offline typically through a software “toggle” and isbrought back online after the cause of unintended behavior is rectified. 适当的降价功能允许新版本保留在生产环境中,同时修复有问题的代码或系统,而无需回滚整个版本。通常通过软件“切换”将有问题的代码或系统脱机,并在纠正意外行为的原因后将其重新上线。 The latter approach of performing multiple splits to isolate failures is our pre-ferred method of addressing both scalability and availability. In this method, AllScalecould combine the splits of services and splits of customers on a per company basis.AllScale could have the services oriented split of logins be the primary method of splitand isolation by implementing a login service swim lane and then make a customeroriented split with swim lanes by companies within the services swim lane. Alternatively,AllScale could swap the approach and create a customer pod (or swim lane) for groupsof companies; within that pod, it could break out its services by swim lanes, one ofwhich would be a login service. Either method is fine, though most companies findthat the customer oriented split is a little more intuitive. We will discuss these types ofsplits in greater detail in Chapters 22, Introduction to the AKF Scale Cube, 23, Split-ting Applications for Scale, and 24, Splitting Databases for Scale, where we address theAKF Scale Cube and how to apply it to services, databases, and storage structures. 后一种执行多次拆分以隔离故障的方法是我们解决可扩展性和可用性问题的首选方法。在这种方法中,AllScale 可以将每个公司的服务拆分和客户拆分结合起来。AllScale 可以通过实现登录服务泳道,将面向服务的登录拆分作为拆分和隔离的主要方法,然后进行面向客户的拆分服务泳道内的公司的泳道。或者,AllScale 可以交换方法并为公司集团创建客户群(或泳道);在该 Pod 内,它可以通过泳道分解其服务,其中之一是登录服务。两种方法都可以,尽管大多数公司发现以客户为导向的分割更直观一些。我们将在第 22 章“AKF Scale Cube 简介”、第 23 章“针对 Scale 的拆分应用程序”和第 24 章“针对 Scale 的数据库拆分”中更详细地讨论这些类型的拆分,其中我们介绍了 AKF Scale Cube 以及如何将其应用于服务、数据库和存储结构。 #####Fault Isolation and Availability—Incident Detection and Resolution 故障隔离和可用性——事件检测和解决 Fault isolation also increases availability because incidents are easier to detect, iden-tify, and resolve. If you have several swim lanes, each dedicated to a group of custom-ers, and only a single swim lane goes down, you know quite a bit about the failureimmediately; it is limited to a set of customers. As a result, your questions to resolvethe incident are nearly immediately narrowed. More than likely, the issue is a resultof systems or services that are servicing that set of customers alone. Maybe it’s a data-base unique to that customer swim lane. You might ask, “Did we just roll code out tothat swim lane or pod?” or more generally, “What were the most recent changes tothat swim lane or pod?” As the name implies, fault isolation has incredible benefits toincident detection and resolution. Not only does fault isolation isolate the incidentfrom propagation throughout your platform, it focuses your incident resolution pro-cess like a laser and shaves critical time off the restoration of service. 故障隔离还提高了可用性,因为事件更容易检测、识别和解决。如果您有多个泳道,每个泳道专用于一组客户,并且只有一个泳道发生故障,您会立即了解有关故障的情况;它仅限于一组客户。因此,您解决事件的问题几乎立即就缩小了。该问题很可能是由单独为这组客户提供服务的系统或服务造成的。也许这是该客户泳道独有的数据库。您可能会问,“我们刚刚将代码发布到泳道或吊舱吗?”或者更笼统地说,“该泳道或池的最新变化是什么?”顾名思义,故障隔离对于事件检测和解决具有令人难以置信的好处。故障隔离不仅可以将事件与整个平台的传播隔离开来,还可以像激光一样集中事件解决过程,并缩短恢复服务的关键时间。 #####Fault Isolation and Scalability 故障隔离和可扩展性 This is a book on scalability and it should be no surprise, given that we’ve includedfault isolation as a topic, that it somehow benefits your scalability initiatives. Thesubject of exactly how fault isolation affects scalability has to do with how you splityour services, as we’ll discuss in Chapters 22 through 24, and has to do with thearchitectural principle of scaling out rather than up. The most important thing toremember is that to have a swim lane it must not communicate synchronously intoany other swim lane. It can make asynchronous calls with the appropriate timeoutsand discard mechanisms to other swim lanes, but you cannot have a connection ori-ented communication to any other service outside of the swim lane. We’ll discusshow to construct and test swim lanes later in this chapter. 这是一本关于可扩展性的书,考虑到我们已将故障隔离作为一个主题,它在某种程度上有利于您的可扩展性计划,这应该不足为奇。故障隔离究竟如何影响可扩展性这一主题与您如何拆分服务有关,正如我们将在第 22 章到第 24 章中讨论的那样,并且与横向扩展而不是纵向扩展的架构原则有关。最重要的是要记住,要拥有一条泳道,它不能与任何其他泳道同步通信。它可以使用适当的超时和丢弃机制对其他泳道进行异步调用,但您无法与泳道之外的任何其他服务进行面向连接的通信。我们将在本章后面讨论如何构建和测试泳道。 #####Fault Isolation and Time to Market 故障隔离和上市时间 Creating architectures that allow you to isolate code into service oriented or resourceoriented systems gives you the flexibility of focus and the ability to dedicate engineersto those services. When you are a small company, this probably doesn’t make muchsense. But as your company grows, the lines of code, number of servers, and overallcomplexity of your system will grow. To handle this growth in complexity, you willneed to focus your engineering staff. Failing to specialize and focus your staff will resultin too many engineers having too little information on the entire system to be effective. 创建允许您将代码隔离到面向服务或面向资源的系统的架构,可以让您灵活地关注重点,并能够让工程师专门致力于这些服务。当你是一家小公司时,这可能没有多大意义。但随着公司的发展,系统的代码行数、服务器数量和整体复杂性都会增加。为了应对这种复杂性的增长,您需要集中您的工程人员的精力。如果您的员工无法专业化和专注,将导致太多工程师对整个系统的信息太少而无法发挥作用。 If you run a commerce site, you might have code, objects, methods, modules, serv-ers, and databases focused on checkout, finding, comparing, browsing, shipping,inventory management, and so on. By dedicating teams to these areas, each team willbecome an expert on a codebase that is itself complex, challenging, and growing. Theresulting specialization will allow for faster new feature development and a fastertime to resolve known or current incidents and problems. All of this increase in speedto delivery may result in a faster time to market for bug fixes, incident resolution,and new feature development. 如果您运行一个商业网站,您可能拥有专注于结账、查找、比较、浏览、运输、库存管理等的代码、对象、方法、模块、服务器和数据库。通过将团队专门致力于这些领域,每个团队都将成为本身复杂、具有挑战性且不断发展的代码库的专家。由此产生的专业化将有助于更快地开发新功能,并更快地解决已知或当前的事件和问题。所有这些交付速度的提高可能会加快错误修复、事件解决和新功能开发的上市时间。 Additionally, this isolation of development and ideally isolation of systems or ser-vices will reduce the merge conflicts that would happen within monolithic systemsdevelopment. Here, we use the term monolithic systems development to identifysource that is shared across all set of functions, objects, procedures, and methodswithin a given product. Duplicate checkouts for a complex system across many engi-neers will result in an increase in merge conflicts and errors. Specializing the code andthe engineering teams reduce these conflicts. 此外,这种开发隔离以及理想情况下系统或服务的隔离将减少单体系统开发中可能发生的合并冲突。在这里,我们使用术语“整体系统开发”来标识在给定产品内的所有函数、对象、过程和方法集之间共享的源。许多工程师对复杂系统的重复检查将导致合并冲突和错误的增加。专业化的代码和工程团队可以减少这些冲突。 This is not to say that code reuse should not be a focus for the organization; itabsolutely should be a focus. Shared libraries should be developed, and potentiallyyou should consider a dedicated team responsible for shared library development andoversight. These libraries can be implemented as services to services, as shareddynamically loadable libraries, or compiled and/or linked during the build of theproduct. Our preferred approach, however, would be to have shared libraries dedicatedto a team, and should a nonshared library team develop a useful and potentially shar-able component, that component should be moved to the shared library team. 这并不是说代码重用不应该成为组织的重点;这绝对应该成为一个焦点。应该开发共享库,并且您可能应该考虑一个专门的团队负责共享库的开发和监督。这些库可以实现为服务的服务、共享的动态可加载库,或者在产品构建期间编译和/或链接。然而,我们的首选方法是让共享库专用于一个团队,并且如果非共享库团队开发出有用且可能可共享的组件,则应将该组件移至共享库团队。 Recognizing that engineers like to continue to be challenged, you might be con-cerned that engineers will not want to spend a great deal of time on a specific area ofyour site. You can slowly rotate engineers to get them a better understanding of theentire system and in so doing stretch and develop them over time. Additionally, youstart to develop potential future architects with a breadth of knowledge regardingyour system or fast reaction SWAT team members that can easily get into and resolveincidents and problems. 认识到工程师喜欢继续接受挑战,您可能会担心工程师不想在站点的特定区域上花费大量时间。您可以慢慢地轮换工程师,让他们更好地了解整个系统,并随着时间的推移不断扩展和发展他们。此外,您开始培养潜在的未来架构师,他们对您的系统或快速反应的特警团队成员拥有广泛的知识,可以轻松地处理并解决事件和问题。 #####Fault Isolation and Cost 故障隔离和成本 In the same ways and for the same reason that fault isolation reduces time to market,it can also reduce cost. One way to look at it is as you get greater throughput for eachhour and day spent on a per engineer basis, your per unit cost goes down. Forinstance, if it normally took you 5 engineering days to produce the average story oruse-case in a complex monolithic system, it might now take you 4.5 engineering daysto produce the average story or use-case in a disaggregated system with swim lanes.The average per unit cost of your engineering endeavors was just reduced by 10%! 以同样的方式和原因,故障隔离可以缩短上市时间,也可以降低成本。看待这个问题的一种方法是,当您在每个工程师的每小时和每天的工作量上获得更大的吞吐量时,您的单位成本就会下降。例如,如果通常需要 5 个工程日才能在复杂的整体系统中生成平均故事或用例,那么现在可能需要 4.5 个工程日才能在带有泳道的分解系统中生成平均故事或用例。您工程工作的单位成本仅降低了 10%! You can do one of two things with this per unit cost reduction, both of whichimpact net income and as a result shareholder wealth. You might decide to reduceyour engineering staff by 10% and produce exactly the same amount of productenhancements, changes, and bug fixes at a lower absolute cost than before. Thisreduction in cost increases net income without an increase in revenue. 通过降低单位成本,您可以做两件事之一,这两者都会影响净收入和股东财富。您可能决定将工程人员减少 10%,并以比以前更低的绝对成本进行完全相同数量的产品增强、更改和错误修复。成本的降低在不增加收入的情况下增加了净利润。 Alternatively, you might decide that you are going to keep your current cost struc-ture and attempt to develop more products at the same cost. The thought here is thatyou will make great product choices that increase your revenue. If you are successful,you also increase net income and as a result your shareholders will become wealthier. 或者,您可能决定保持当前的成本结构并尝试以相同的成本开发更多产品。这里的想法是,您将做出很好的产品选择,从而增加您的收入。如果你成功了,你的净利润也会增加,你的股东也会变得更富有。 You may correctly believe that additional sites usually end up costing more capitalthan running out of a single site and that operational expenses may increase.Although this is true, most companies aspire to have products capable of weatheringgeographically isolated disasters and invest to varying levels in disaster recovery initi-atives that help mitigate the effects of such disasters. As we will discuss in Chapter 32,Planning Data Centers, assuming you have an appropriately fault isolated architecture,the capital and expense associated with running three or four properly fault isolateddata centers can be significantly less than two completely redundant data centers. 您可能正确地认为,额外的站点通常最终会比耗尽单个站点花费更多的资金,并且运营费用可能会增加。尽管这是事实,但大多数公司都希望拥有能够抵御地理上孤立的灾难的产品,并在灾难恢复方面进行不同程度的投资有助于减轻此类灾害影响的举措。正如我们将在第 32 章“规划数据中心”中讨论的那样,假设您有一个适当的故障隔离架构,则与运行三个或四个适当的故障隔离数据中心相关的资本和费用可能大大少于两个完全冗余的数据中心。 Another consideration in justifying fault isolation is the effect that it has on reve-nue. Referring back to Chapter 6, Making the Business Case, you can attempt to cal-culate the lost opportunity (lost revenue) over some period of time. Typically, thiswill be the easily measured loss of a number of transactions on your system added tothe future loss of a higher than expected customer departure rate and the resultingreduction in revenue. This loss of current and future revenue can be used to deter-mine if the cost of implementing a fault isolated architecture is warranted. In ourexperience, some measure of fault isolation is easily justified through the increase inavailability and the resulting decrease in lost opportunity. 证明故障隔离合理性的另一个考虑因素是它对收入的影响。回顾第 6 章“制定业务案例”,您可以尝试计算一段时间内失去的机会(收入损失)。通常,这将是系统上许多交易的容易测量的损失,加上未来因高于预期的客户离开率而造成的损失以及由此产生的收入减少。当前和未来收入的损失可用于确定实施故障隔离架构的成本是否合理。根据我们的经验,通过提高可用性以及由此导致的机会损失的减少,可以轻松地证明某种故障隔离措施的合理性。 ####How to Approach Fault Isolation 如何进行故障隔离 The most fault isolative systems are those that make absolutely no calls and have nointeraction with anything outside of their functional or data boundaries. The bestway to envision this is to think of a group of lead lined, concrete structures, eachwith a single door. Each door opens into a long isolated hallway that has a singledoor at each end; one door accesses the lead lined concrete structure and one dooraccesses a shared room with an infinite number of desks and people. In each of theseconcrete structures is a piece of information that one of the people sitting at the manydesks might want. To get that information, he has to travel the long hallway dedi-cated to the room with the information he needs and then walk back to his desk.After that journey, he may decide to get a second piece of information from the roomhe just entered, or travel down another hallway to another room. It is impossible fora person to cross from one room to the next; he must always make the long journey.If too many people get caught up attempting to get to the same room down the samehallway, it will be immediately apparent to everyone in the room and they can eitherdecide to travel to another room or simply wait. 最具故障隔离性的系统是那些绝对不进行任何调用并且不与其功能或数据边界之外的任何内容进行交互的系统。想象这一点的最好方法是想象一组铅衬混凝土结构,每个结构都有一扇门。每扇门都通向一条长长的孤立走廊,走廊两端各有一扇门;一扇门通向铅衬混凝土结构,一扇门通向一间拥有无数桌子和人员的共用房间。这些具体结构中的每一个都包含坐在许多办公桌前的人可能想要的一条信息。为了获取该信息,他必须穿过专用于带有他需要的信息的房间的长走廊,然后走回他的办公桌。在这段旅程之后,他可能决定从他刚刚进入的房间获取第二条信息,或沿着另一条走廊前往另一个房间。一个人不可能从一个房间走到另一个房间;如果太多人试图沿着同一条走廊到达同一个房间,房间里的每个人都会立即意识到这一点,他们可以决定前往另一个房间或干脆等待。 In this example, we’ve not only illustrated how to think about fault isolativedesign, but we’ve demonstrated two benefits of such a design. The first benefit is thata failure in capacity of the hallway does not keep anyone from moving on to anotherroom. The second benefit is that everyone knows immediately which room has thecapacity problem. Contrast this with an example where each of the rooms is con-nected to a shared hallway and there is but one entrance to this shared hallway fromour rather large room. Although each of the rooms is isolated, should he back up intothe hallway, it becomes both difficult to determine which room is at fault and impos-sible to travel to the other rooms. This example also illustrates our first principle offault isolative architecture. 在这个例子中,我们不仅说明了如何考虑故障隔离设计,而且还展示了这种设计的两个好处。第一个好处是,走廊容量的故障不会阻止任何人前往另一个房间。第二个好处是每个人都能立即知道哪个房间出现容量问题。与此相比,每个房间都连接到一个共享走廊,并且从我们相当大的房间到这个共享走廊只有一个入口。尽管每个房间都是孤立的,但如果他回到走廊,就很难确定哪个房间有问题,而且无法前往其他房间。这个例子也说明了我们的第一原则故障隔离架构。 #####Principle 1: Nothing Is Shared 原则 1:不共享任何内容 The first principle of fault isolative design or architecture is that absolutely nothing isshared. Of course, this is an extreme and may not be financially feasible in somecompanies, but it is nevertheless the starting point for fault isolative design. If youwant to ensure that capacity or system failure does not cause problems for multiplesystems, you need to isolate system components. This may be very difficult in severalareas, like border or gateway routers. That said, and recognizing both the financialand technical barriers in some cases, the more thoroughly you apply the principle, thebetter your results. 故障隔离设计或架构的首要原则是绝对不共享任何内容。当然,这是一个极端,对于某些公司来说可能在经济上不可行,但它仍然是故障隔离设计的起点。如果要确保容量或系统故障不会导致多个系统出现问题,则需要隔离系统组件。这在某些领域可能非常困难,例如边界或网关路由器。也就是说,认识到某些情况下的财务和技术障碍,应用该原则越彻底,结果就越好。 One often overlooked area is URIs/URLs. For instance, consider using different sub-domains for different groups. If grouping by customers, consider cust1.allscale.com tocustN.allscale.com. If grouping by services, maybe view.allscale.com, update.allscale.com,input.allscale.com, and so on. The domain grouping ideally also references isolatedWeb and app servers as well as databases and storage unique to that URI/URL. Iffinancing allows and demand is appropriate, dedicated load balancers, DNS, andaccess switches should be used. URI/URL 是一个经常被忽视的领域。例如,考虑为不同的组使用不同的子域。如果按客户分组,请考虑 cust1.allscale.com 到 custN.allscale.com。如果按服务分组,可能是 view.allscale.com、update.allscale.com、input.allscale.com 等。理想情况下,域分组还引用隔离的 Web 和应用程序服务器以及该 URI/URL 特有的数据库和存储。如果资金允许且需求适当,则应使用专用负载平衡器、DNS 和访问交换机。 If you identify two swim lanes and have them communicate to a shared database,they are the same swim lane in the big picture. You may have two smaller fault isola-tion zones from a service’s perspective (for instance, the application servers), whichwill help when one application server fails; but should the database fail, it will bringdown both of these service swim lanes. 如果您确定两条泳道并让它们与共享数据库进行通信,那么从总体上看它们是同一条泳道。从服务的角度来看,您可能有两个较小的故障隔离区域(例如应用程序服务器),这将在一台应用程序服务器发生故障时提供帮助;但如果数据库出现故障,这两条服务泳道都会瘫痪。 #####Principle 2: Nothing Crosses a Swim Lane Boundary 原则 2:任何事物都不能跨越泳道边界 This is another important principle in designing fault isolative systems. If you havesystems communicating synchronously and even asynchronously, they can cause apotential fault. Although it is true that asynchronous systems are less likely to causesuch a fault, they have caused plenty of issues in extremely high demand scenarioswhen timeouts aren’t aggressive enough to bump unlikely to complete processes. 这是设计故障隔离系统的另一个重要原则。如果您的系统进行同步甚至异步通信,它们可能会导致潜在的故障。尽管异步系统确实不太可能导致此类故障,但它们在极高需求的场景中引起了很多问题,因为超时不够严重,无法阻止不太可能完成的进程。 You cannot build a fault isolation zone and have that zone communicate to any-thing outside of the zone. Think back to our concrete room analogy: the room and itshallway were the fault isolation zone or domain. The large shared room was theInternet. There was no way to move from one room to the next without travellingback to the area of the desk (our browser) and then starting down another path. As aresult, we know exactly where bottlenecks or problems are immediately and we canfigure out how to handle those problems. 您无法构建故障隔离区域并让该区域与该区域之外的任何事物进行通信。回想一下我们的具体房间的比喻:房间及其走廊是故障隔离区或域。共用的大房间就是互联网。如果不返回办公桌区域(我们的浏览器)然后开始另一条路径,就无法从一个房间移动到另一个房间。因此,我们可以立即准确地知道瓶颈或问题在哪里,并且可以找出如何处理这些问题。 Any communication between zones, or paths between rooms in our scenario, cancause problems with our fault isolation. A backup of people in one hallway may bethe cause of the hallway connected to that room or any of a series of rooms con-nected by other hallways. How can we tell easily without a thorough diagnosis? Con-versely, a backup in any room may have an unintended effect in some other room; asa result, our room availability goes down. 在我们的场景中,区域之间的任何通信或房间之间的路径都可能导致故障隔离出现问题。一个走廊中的人员备份可能是导致该走廊连接到该房间或由其他走廊连接的一系列房间中的任何一个的原因。在没有彻底诊断的情况下,我们如何能够轻松判断?相反,任何房间的备份可能会对其他房间产生意想不到的影响;结果,我们的房间供应量下降了。 #####Principle 3: Transactions Occur Along Swim Lanes 原则 3:交易发生在泳道上 Given the name and the previous principle, this principle should go without saying;but we learned long ago not to assume anything. In technology, assumption is themother of catastrophe. Have you ever seen swimmers line up facing the length of apool, but seen the swim lane ropes running widthwise? Of course not, but the result-ing water obstacle course would probably be great fun to watch. 考虑到名称和前面的原则,这个原则应该是不言而喻的;但我们很久以前就学会了不要假设任何事情。在技术领域,假设是灾难之母。您是否见过游泳者面向泳池的长度排成一排,但也见过泳道绳索横向延伸?当然不是,但由此产生的水障碍训练场可能会非常有趣。 The same is true for technical swim lanes. It is incorrect to say that you’ve createda swim lane of databases, for instance. How would transactions get to the databases?Communication would have to happen across the swim lane; and per Principle 2,that should never happen. In this case, you may well have created a pool, but becausetransactions cross a line, it is not a swim lane as we define it. 技术泳道也是如此。例如,说您已经创建了一条数据库泳道是不正确的。事务如何到达数据库?通信必须跨越泳道进行;根据原则 2,这种情况永远不应该发生。在这种情况下,您很可能已经创建了一个池,但由于交易跨越了一条线,因此它不是我们定义的泳道。 ####When to Implement Fault Isolation 何时实施故障隔离 If only money grew on trees . . . . Fault isolation isn’t free and it is not even cheap.Although it has a number of benefits, attempting to design every single function of yourplatform to be fault isolative would likely be cost prohibitive. Moreover, the share-holder return just wouldn’t be there. And that’s the answer to the preceding heading.After twenty and a half chapters, you probably can sense where we are going. 要是钱长在树上就好了。 。 。 。故障隔离不是免费的,而且价格也不便宜。尽管它有很多好处,但尝试将平台的每个功能都设计为故障隔离的可能成本过高。此外,股东回报也不存在。这就是前一个标题的答案。二十章半之后,你可能已经能感觉到我们要走向何方了。 You should implement just the right amount of fault isolation in your system togenerate a positive shareholder return. “OK, thanks, how about telling me how to dothat?” you might ask. 您应该在系统中实施适量的故障隔离,以产生积极的股东回报。 “好的,谢谢,告诉我怎么做怎么样?”你可能会问。 The answer, unfortunately, is going to depend on your particular needs, the rate ofgrowth and unavailability and causes of unavailability in your system, customerexpectation with respect to availability, contractual availability commitments, and awhole host of things that result in a combinatorial explosion, which make it impossi-ble for us to describe for you what you need to do in your environment. 不幸的是,答案将取决于您的特定需求、系统中的增长率和不可用性以及不可用性的原因、客户对可用性的期望、合同可用性承诺以及导致组合爆炸的所有因素,这使我们无法向您描述您在您的环境中需要做什么。 That said, there are some simple rules to apply to increase your scalability andavailability. We present some of the most useful here to help you in your fault isola-tion endeavors. 也就是说,可以应用一些简单的规则来提高可扩展性和可用性。我们在这里提供一些最有用的内容来帮助您进行故障隔离工作。 #####Approach 1: Swim Lane the Money-Maker 方法一:泳道赚钱 Whatever you do, always make sure that the thing that is most closely related tomaking money is appropriately isolated from the failures and demand limitations ofother systems. If you are a commerce site, this might be your purchase flow from the“buy” button and checkout process through the processing of credit cards. If you area content site and you make your money through proprietary advertising, ensure thatthe advertising system functions separately from everything else. If you are a recur-ring registration fee based site, ensure that the processes from registration to billingare appropriately fault isolated. 无论你做什么,始终确保与赚钱最密切相关的事情与其他系统的故障和需求限制适当隔离。如果您是一个商业网站,这可能是您从“购买”按钮和结帐流程到信用卡处理的购买流程。如果您经营内容网站并通过专有广告赚钱,请确保广告系统与其他系统分开运行。如果您是一个基于定期注册费用的站点,请确保从注册到计费的过程适当地进行故障隔离。 It stands to reason that you might have some subordinate flows that are closelytied to the money making functions of your site and you should consider these forswim lanes as well. For instance, in a commerce site, the search and browse function-ality might need to be in swim lanes. In content sites, the most heavily traffickedareas might need to be in their own swim lanes or several swim lanes to help withdemand and capacity projections. Social networking sites may create swim lanes forthe most commonly hit profiles or segment profile utilization by class. 按理说,您可能有一些与您网站的赚钱功能密切相关的从属流程,您也应该考虑这些泳道。例如,在商业网站中,搜索和浏览功能可能需要位于泳道中。在内容网站中,流量最大的区域可能需要位于自己的泳道或多个泳道中,以帮助进行需求和容量预测。社交网站可以为最常点击的配置文件创建泳道或按类别细分配置文件使用情况。 #####Approach 2: Swim Lane the Biggest Sources of Incidents 方法二:泳道是最大的事故来源 If in your recurring quarterly incident review (Chapter 8, Managing Incidents andProblems), you identify that certain components of your site are repeatedly causingother incidents, you should absolutely consider these for future headroom projects(Chapter 11, Determining Headroom for Applications) and isolate these areas. Thewhole purpose of the quarterly incident review is to learn from our past mistakes,and if demand related issues are causing availability problems on a recurring basis,we should isolate those areas from impacting the rest of our product or platform. 如果在定期的季度事件审查(第 8 章,管理事件和问题)中,您发现站点的某些组件反复导致其他事件,那么您绝对应该在未来的净空项目中考虑这些组件(第 11 章,确定应用程序的净空)并隔离这些组件地区。季度事件审查的全部目的是从我们过去的错误中吸取教训,如果与需求相关的问题反复导致可用性问题,我们应该隔离这些区域,以免影响我们产品或平台的其余部分。 #####Approach 3: Swim Lane Along Natural Barriers 方法 3:沿着天然屏障的泳道 This is especially useful in multitenant SaaS systems and most often relies upon the z-axis of scale discussed later in Chapters 22 to 24.The sites and platforms needing thegreatest scalability often have to rely on segmentation along the z-axis, which is mostoften implemented on customer boundaries. Although this split is often first accom-plished along the storage or database tier of architecture, it follows that we shouldcreate an entire swim lane from request to data storage or database and back. 这在多租户 SaaS 系统中特别有用,并且通常依赖于稍后在第 22 章到第 24 章中讨论的 z 轴规模。需要最大可扩展性的站点和平台通常必须依赖于沿 z 轴的分段,这通常是实现的关于客户边界。尽管这种分割通常首先沿着架构的存储或数据库层完成,但我们应该创建一个从请求到数据存储或数据库并返回的完整泳道。 Very often, multitenant indicates that you are attempting to get cost efficienciesfrom common utilization. In many cases, this approach means that you can designthe system to run one or many “tenants” in a single swim lane. If this is true for yourplatform, you should make use of it. If you have a tenant that is very busy, assign it aswim lane. A majority of your tenants have very low utilization? Assign them all to asingle swim lane. You get the idea. 很多时候,多租户表明您正在尝试从共同利用中获得成本效率。在许多情况下,这种方法意味着您可以将系统设计为在单个泳道中运行一个或多个“租户”。如果您的平台确实如此,您应该利用它。如果您的租户非常繁忙,请将其分配为泳道。您的大多数租户的利用率都很低?将它们全部分配到一条泳道。你明白了。 #####Fault Isolation Design Checklist 故障隔离设计清单 The design principles for fault isolative architectures are 故障隔离架构的设计原则是 * Principle 1: Nothing Is Shared (a.k.a. share as little as possible). The less that is shared within a swim lane, the more fault isolative the swim lane becomes. * 原则 1:不共享任何内容(又称尽可能少地共享)。泳道内共享的信息越少,泳道的故障隔离能力就越强。 * Principle 2: Nothing Crosses a Swim Lane Boundary. Communication never crosses a swim lane boundary or the boundary is drawn incorrectly. * 原则 2:任何事物都不能跨越泳道边界。通信永远不会跨越泳道边界,或者边界绘制不正确。 * Principle 3: Transactions Occur with Swim Lanes. You can’t create a swim lane of ser-vices as the communication to those services would violate Principle 2. * 原则 3:交易发生在泳道上。您无法创建服务泳道,因为与这些服务的通信将违反原则 2。 The approaches for fault isolative architectures are 故障隔离架构的方法是 * Approach 1: Swim Lane the Money-Maker. Never allow your cash register to be compro-mised by other systems. * 方法一:泳道赚钱。切勿让您的收银机受到其他系统的损害。 * Approach 2: Swim Lane the Biggest Sources of Incidents. Identify the recurring causes of pain and isolate them. * 方法二:泳道是最大的事故来源。找出疼痛反复出现的原因并加以隔离。 * Approach 3: Swim Lane Natural Barriers. Customer boundaries make good swim lanes. * 方法 3:泳道天然屏障。客户边界构成了良好的泳道。 Although there are a number of approaches, these will go a long way to increasing yourscalability while not giving your CFO a heart attack. 尽管方法有很多种,但这些方法对于提高可扩展性大有帮助,同时又不会让 CFO 心脏病发作。 ####How to Test Fault Isolative Designs 如何测试故障隔离设计 The easiest way to test a fault isolative design is to draw your platform at a high levelon a whiteboard. Draw a dotted line for any communication between systems, and asolid line for where you believe your swim lanes exist or should exist. Anywhere adotted line crosses a solid line, you have a violation of a swim lane. From a puristperspective, it does not matter if that communication is synchronous or asynchro-nous, though synchronous transactions and communications are a more egregiousviolation from both a scalability and an availability perspective. This test will testagainst the first and second principles of fault isolative designs and architectures. 测试故障隔离设计的最简单方法是在白板上绘制高层平台。为系统之间的任何通信画一条虚线,为您认为存在或应该存在泳道的位置画一条实线。任何虚线与实线交叉的地方,都属于违反泳道的行为。从纯粹的角度来看,通信是同步还是异步并不重要,尽管从可扩展性和可用性的角度来看,同步事务和通信都是更严重的违规行为。该测试将针对故障隔离设计和架构的第一和第二原则进行测试。 To test the third principle, simply draw an arrow from the user to the last systemon your whiteboard. The arrow should not cross any lines for any swim lane; if itdoes, you have violated the third principle. 要测试第三个原则,只需在白板上从用户到最后一个系统绘制一个箭头即可。箭头不应越过任何泳道的任何线;如果是这样,你就违反了第三条原则。 ####Conclusion 结论 In this chapter, we discussed the need for fault isolative architectures, principles ofimplementation, approaches for implementation, and finally a design test. We mostcommonly use swim lanes to identify a completely fault isolative component of anarchitecture, though terms like pods and slivers are often used to mean the samething. 在本章中,我们讨论了故障隔离架构的必要性、实现原理、实现方法以及最后的设计测试。我们最常使用泳道来识别建筑中完全断层隔离的组件,尽管像豆荚和条子这样的术语经常被用来表示相同的意思。 Fault isolative designs increase availability by ensuring that subsets of functional-ity do not hamper the overall functionality of your entire product or platform. Theyfurther aid in increasing availability by allowing for immediate detection of the areascausing problems within the system. They lower both time to market and cost byallowing for dedicated, deeply experienced resources to focus on the swim lanes andby reducing merge conflicts and other barriers and costs to rapid development. Scal-ability is increased by allowing for scale in multiple dimensions as discussed in Chap-ters 22 through 24. 故障隔离设计通过确保功能子集不会妨碍整个产品或平台的整体功能来提高可用性。它们通过允许立即检测系统内引起问题的区域来进一步帮助提高可用性。它们允许专门的、经验丰富的资源专注于泳道,并减少合并冲突和其他快速开发的障碍和成本,从而缩短上市时间并降低成本。如第 22 章到第 24 章所述,通过允许多维度扩展来提高可扩展性。 The principles of swim lane construction include a principle addressing sharing,one addressing swim lane boundaries, and one addressing swim lane direction. Thefewer things that are shared within a swim lane, the more isolative and beneficial thatswim lane becomes to both scalability and availability. Swim lane boundaries shouldnever have lines of communication drawn across them. Swim lanes always move inthe direction of communication and customer transactions and never across them. 泳道建设的原则包括泳道共享原则、泳道边界原则和泳道方向原则。泳道内共享的东西越少,泳道就越独立,对可扩展性和可用性就越有利。泳道边界不应划有跨越其的通讯线。泳道始终朝着沟通和客户交易的方向移动,而永远不会穿过它们。 Always address the transactions making the company money first when consideringswim lane implementation. Then, move functions causing repetitive problems intoswim lanes. Finally, consider the natural layout or topology of the site for opportuni-ties to swim lane, such as customer boundaries in a multitenant SaaS environment. 在考虑实施泳道时,始终首先解决为公司赚钱的交易。然后,将导致重复问题的功能移至泳道中。最后,考虑站点的自然布局或拓扑以获得泳道机会,例如多租户 SaaS 环境中的客户边界。 #####Key Points 关键点 * Swim lane is a term used to describe a fault isolative architecture construct inwhich a failure in the swim lane is not propagated and does not affect otherplatform functionality. * 泳道是一个术语,用于描述故障隔离架构构造,其中泳道中的故障不会传播,也不会影响其他平台功能。 * Pods, shards, and chunks are often used in place of the term swim lane, butoften they do not represent a “full system” view of functionality and fault isolation. * Pod、分片和块通常用来代替术语“泳道”,但它们通常并不代表功能和故障隔离的“完整系统”视图。 * Fault isolation increases availability and scalability while decreasing time tomarket and cost of development. * 故障隔离提高了可用性和可扩展性,同时缩短了上市时间并降低了开发成本。 * The less you share in a swim lane, the greater its benefit to availability andscalability. * 在泳道中共享的越少,其对可用性和可扩展性的好处就越大。 * No communication or transaction should ever cross a swim lane boundary. * 任何通信或交易都不应跨越泳道边界。 * Swim lanes go in the direction of and never across transaction flow. * 泳道沿着交易流的方向,而不是穿过交易流。 * Swim lane the functions that directly impact revenue, followed by those thatcause the most problems and any natural boundaries that can be defined foryour product. * 泳道是直接影响收入的功能,其次是导致最多问题的功能以及可以为您的产品定义的任何自然边界。
没有评论