### Chapter 11 Determining Headroom for Applications

> Knowing the place and the time of the coming battle, we may concentrate from the greatest distances in order to fight.
>
> Sun Tzu

If you were blindfolded and dropped off in the middle of the woods with a map and compass, what is the first thing you would do? If you are an experienced outdoors person or, even better, an orienteering expert, you would probably try to determine your exact location. You might accomplish this by looking around you at terrain such as mountains, streams, or roads and trying to match that to a position on the map that has similar terrain elements depicted. If there is a stream to your east and a mountain to your north, you look on the map for streams and find a likely position along that stream where you would have a mountain to the north. The reason you do this is that in order to have a better chance of navigating your way out of the woods, you need to know the point from which you are starting.

Scalability is like the preceding scenario. You need to know where you are starting from in order to move confidently to a better place. In scalability terms, this means understanding your application's headroom. We use the term headroom to mean the amount of free capacity that exists within your system before you start having problems such as a degradation of performance or an outage. Because your application is a system that involves many different components such as databases, firewalls, and application servers, in order to truly understand headroom, you need to first understand the headroom for each of these.
There are many scenarios in which you will need to determine the headroom of an application. Your company might have acquired another company and now you have responsibility for an application that you know nothing about. Or, you are designing a brand-new system that you need to be able to scale because of an expected influx of traffic. Or, you have an existing application that is starting to have outages and you need to determine how to scale the application. Most commonly, you will make several changes to your existing application such that it no longer looks or behaves like the previous version for which you determined headroom. All of these and many more are scenarios that you may encounter that will require you to determine the headroom of an application in order for it to scale.

This chapter will walk you through the process of determining headroom for your application. We will start with a brief discussion of the purpose of headroom and where it is used. Then, we will talk about how to determine the headroom of some common components found in systems. Lastly, we will discuss the ideal conditions that you want to look for in your components in terms of loads or performance.

#### Purpose of the Process

The purpose of determining the headroom of your application, as we started to discuss, is to understand where your system stands in terms of its capability to continue to serve the needs of your customers as that customer base grows or the demands for the service grow.
If you do not plot out where you are in terms of capacity usage and determine what your growth path looks like, you are likely to be blindsided by a surge in demand for capacity from any number of sources. There are a number of different places within the product development life cycle where you will find a good use for your headroom calculations or projections.

One of the very earliest places that you will probably use your headroom projections is when planning an annual budget. If you are planning on capital investments or expense expenditures for increased capacity in the form of application servers, database servers, network gear, or network bandwidth, you need a good idea of your application's headroom in all those various areas. If you don't have a good handle on what amount of headroom you have, you are just guessing when it comes to a budget of how much you will need to spend next year. It is unfortunate if you approach budgeting this way, but you are not alone, and we will show you a better way. Many organizations do a rough, back-of-the-envelope calculation by saying, for example, they grew x% this year and spent $y, so therefore if they expect to grow x% again next year, they should spend $y again. Although this passes as a planned budget in many organizations, it is guaranteed to be wrong. Without taking into account different types of growth, existing headroom capacity, and optimizations, there is no way your projections could be accurate other than by pure luck.

Another area very early in the planning process where you will need headroom projections is when putting together a hiring plan. Determining how many engineers versus network engineers versus database administrators to hire can either be left to the squeaky wheel method or actually planned out based on probable workloads. We prefer putting a little bit of science behind the plan. If you understand that you have plenty of headroom on your application servers and on your database but you are bumping up against the bandwidth capacity of your firewalls and load balancers, you may want to add another network engineer to the hiring plan instead of another systems administrator.

As you design and plan for new features during your product development life cycle, you should be considering very early what hardware implications the new features will have. If you are building a brand-new service, you will likely want to run it on its own pool of servers. If this feature is an enhancement of another service, you should consider what ramifications it will have on the headroom of the current servers. Will the new feature require the use of more memory, larger log files, intensive CPU operations, the storage of external files, or more SQL calls? Any of these can impact the headroom projections for your entire application, from network to database to application servers.

The last area that you can and should use your headroom projections for is prioritization of headroom or scalability projects. As you establish the processes outlined in this book, you will begin to amass a list of projects to improve and maintain your scalability. Without a way to prioritize this list, the projects that people like the most or that are someone's pet project are the ones that will get worked on. The proper way of selecting the project priority is to use a cost-and-benefits analysis. The cost is the estimated time in engineering and operations effort to complete the project. The benefit is the increase in headroom or scale that the project will bring. After reading through the chapter on risk management, you may want to add a third comparison, and that is risk. How risky is the project in terms of impact to customers, completion within the timeline, or impact to future feature development?

Those are the four principal areas in which you should consider using headroom projections when planning. Budgets, headcount, feature development, and scalability projects all can benefit from the introduction of headroom calculations. Using headroom data, you will start making much more data-driven decisions and become much better at planning and predicting.

#### Structure of the Process

The process of determining your application's headroom is straightforward but not simple. It requires research, insight, and calculations.
The more attention to detail that you pay during each step of the process, the better and more accurate your headroom projections will be. There will be enough ambiguity in the numbers already, but if you cut corners when you should spend the time to find the right answer, you will ensure that the variability will be so large as to make the numbers worthless. You already have to account for unknown user behavior, undetermined future features, and many more variables that are not easy to pin down. Do not add more variation by skipping the homework or legwork.

The very first step in the headroom process is to identify the major components of the system. Typically, these are items such as application servers, database servers, and network infrastructure, which should be broken down even further if at all possible. If you have a Service Oriented Architecture and different services reside on different servers, treat each pool separately. A sample list might look like this:

* Account management service application servers
* Reports and configuration services application servers
* Firewalls
* Load balancers
* Bandwidth
* Oracle database cluster

After you have the major component list of your system, assign responsibility to the appropriate party to determine the actual usage over time, preferably the past year, and the maximum capacity in whatever is the appropriate measurement. For most of the components, there will be multiple measurements. The database, for example, would include the number of SQL transactions (based on the current query mix), the storage, and the server loads.
These assignees should be the people responsible for the health and welfare of these components whenever possible. The database administrators are most likely the best candidates for the database analysis, and the systems administrators for the application servers.

The next step can be done by a manager, CTO, product manager, project manager, or anyone with insight into the business plans for the next 12 or more months. This person should determine the growth rate of the business. This growth rate is made up of many parts. The first rate of growth is the natural or intrinsic growth. This is how much growth would occur if nothing else was done to the system or by the business (no deals, no marketing, no advertising, and so on) except basic maintenance. This would include the rate of walk-up users that occurs naturally and the increase or decrease in usage by existing users. The second growth rate is the expected increase in growth caused by business activities such as developing new or better features, marketing, or signing deals that bring more customers or activities.

The natural growth rate can be determined by analyzing periods of growth that have no business activity to explain them. For instance, if in June the application has a 5% increase in traffic and there was no big signed deal in the prior month nor release of customer-facing features to explain the increase, this could be taken as a natural monthly growth rate.
Determining the business activity growth rate requires knowledge of the planned feature initiatives, business department growth goals, marketing campaigns, increases in advertising budgets, and any other similar metric or goal that may influence how quickly the application usage will grow. In most businesses, the profit and loss (P&L) owner, general manager, or business development team is assigned a goal to meet for the upcoming year in terms of customer acquisition, revenue, usage, or any combination. To meet these goals, they put together plans that include signing deals with customers for distribution, developing products to entice more users or increased usage, or marketing campaigns to get the word out about their fabulous products. These plans should have some correlation to their business goals and can be the background for determining how these will affect the application in terms of usage and growth.

After you have solid projections of natural and man-made growth, you can move on to understanding the seasonality effect. Some retailers see 75% of their revenue in the last 45 days of the year due to the holiday season. Some see the summer doldrums as people spend more time on vacations and less time browsing sites or purchasing books. Whatever is the case for your application, you should take this into account in order to understand what point of the seasonality curve you are on and how much you can expect this curve to raise or lower the demand. If you have at least one year's worth of data, you can begin projecting seasonal differences.
The way to accomplish this is to strip out the average growth rate from the numbers and see how the traffic or usage changed from month to month. You are looking for a sine wave or something similar to Figure 11.1.

Now that you have seasonality data, growth data, and actual usage data, you need to determine how much headroom you are likely to reclaim through your scalability initiatives next year. Similar to the way we used the business growth rates for customer-facing features, you need to determine an amount of headroom that you will gain by developing infrastructure features, or scalability projects, as these are sometimes called. These infrastructure features could be projects such as splitting a database or adding a caching layer. For this, you can use various approaches such as historic gains from similar projects or multiple estimations by several architects, as you would when estimating effort in story points. When organized into a timeline, these projects will give you a projected increase in headroom throughout the year. Sometimes, projects have not been identified for the entire next 12 months; in that case, you would use an estimation process similar to what you would do for business-driven growth. Use historic data to provide the most likely outcome of future projects, weighted with an amount of expert insight from your architects or chief engineers who best understand the system.
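
To make the seasonality step concrete, the sketch below strips a smooth compound-growth trend out of monthly usage and reports each month relative to that trend. It is a rough illustration under assumptions that are not from the chapter: the function name `seasonal_indices`, the choice of a compound monthly growth rate fitted between the first and last data points, and the made-up usage numbers are all hypothetical. It also assumes 13 months of data, so that the same calendar month sits at both ends and seasonality does not distort the fitted growth rate.

```python
def seasonal_indices(monthly_usage):
    """Return usage relative to a compound-growth trend (1.0 means on trend)."""
    periods = len(monthly_usage) - 1
    if periods < 1:
        raise ValueError("need at least two data points")
    # Average compound growth per month between the first and last observation.
    rate = (monthly_usage[-1] / monthly_usage[0]) ** (1 / periods) - 1
    # Usage each month would have shown under pure, seasonless growth.
    trend = [monthly_usage[0] * (1 + rate) ** month for month in range(periods + 1)]
    return [actual / expected for actual, expected in zip(monthly_usage, trend)]


# Hypothetical traffic (thousands of requests per day), January through the next January.
usage = [100, 98, 103, 107, 104, 99, 97, 101, 110, 118, 131, 140, 112]
for month, index in enumerate(seasonal_indices(usage)):
    print(f"month {month:2d}: {index:.2f}")
```

Values above 1.0 mark months running ahead of the growth trend and values below 1.0 mark months running behind it. Plotted over a year, they should trace the kind of seasonal curve described above.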

![](https://blog.baidu-google.com/usr/uploads/2024/06/433009385.png)

The last step is to bring all the data together to calculate the headroom. The formula for doing this is shown in Figure 11.2.

![](https://blog.baidu-google.com/usr/uploads/2024/06/2759597376.png)

This equation states that the headroom of a particular component of your system is equal to the ideal usage percentage of the maximum capacity, minus the current usage, minus the sum over a time period (here it is 12 months) of the growth rate minus the optimization. We will cover the ideal usage percentage in the next section of this chapter; for now, let's use 50% as the number.

If the headroom number is positive, you have enough headroom for the period of time used in the equation. If it is negative, you do not. Let's return to our team at AllScale and follow them through a headroom calculation to illustrate this and what it means. Tom Harde, the director of infrastructure and operations, had never performed a headroom analysis, so Johnny Fixer, the CTO, has offered to guide Tom and his team through the steps. The exercise was to calculate the headroom of the HRM database in terms of SQL queries. Tom's DBA stated that assuming a similar query mix (reads, writes, use of indexes, and so on), they could service 100 queries per second. The HRM application is currently running 25 queries per second on this database node and has a combined (natural and man-made) growth of 10 more queries per second over the next year.
Johnny explains to Tom and the team that the real growth rate is likely to be different each month depending on seasonality as well as when certain projects get released to production, but using this projection is a good estimate. Continuing with the exercise, Tom expects that they can reduce the queries per second by 5 through some infrastructure projects and database tuning. Johnny goes to the whiteboard and uses the following unit abbreviations: "q/s" is queries per second and "tp" is time period (in this exercise, "tp" is one year or 12 months). Johnny writes the following equation:

![](https://blog.baidu-google.com/usr/uploads/2024/06/333278393.png)

And then Johnny begins solving the equation, resulting in the following:

![](https://blog.baidu-google.com/usr/uploads/2024/06/229964261.png)

Johnny explains that because the number is positive, they have enough headroom to make it through the next 12 months; this was the time period that the growth, seasonality, and optimization covered.

Tom raised the question that was on most of his team members' minds: What does the headroom number 20 q/s mean? Johnny explained that strictly speaking this means that the HRM application has 20 queries per second of spare capacity.
Additionally, this number, when combined with the summation clause (growth, seasonality, and optimization over the time period), tells the team how much time it has before the application runs out of headroom. Johnny goes back to the whiteboard and writes the equation for this, as shown in Figure 11.3.

![](https://blog.baidu-google.com/usr/uploads/2024/06/4031669348.png)

Johnny continues with the exercise, stating that they have Headroom Time = 20 q/s ÷ 5 q/s/tp = 4.0 tp. Because the time period is 12 months or one year, Johnny states that if their projected growth rates continue as predicted for the first 12 months, they have four years of headroom remaining on this database server. Tom and his team are pretty impressed with not only this answer but the entire process of calculating headroom and are excited to try it on their own for some other components in the HRM system. Johnny cautions them that although this is a great way to determine how much longer an application can grow on a particular set of hardware, it involves lots of estimates and those should be rechecked periodically.

#### Ideal Usage Percentage

If you recall, we used a variable in our headroom calculations that we called the ideal usage percentage. This is a pretty fancy name, but its definition is really simple. We are describing the amount of capacity for a particular component that should be planned for usage. Why not 100% of capacity, you ask?
Well, there are several reasons for not wanting to plan on using every spare bit of capacity that you have in a component, whether that is a database server or load balancer. The first reason is that you might be wrong. I know it's hard to believe, but you and your fine team of engineers and database architects might be wrong about the actual maximum capacity because stress testing is not always equal to real-world testing. We'll cover the issues with stress testing in Chapter 17, Performance and Stress Testing. The other way you might be wrong is that your projections could be off. Either way, you should leave some amount of room in the plan for being off in your estimates.

The second reason that you do not want to use 100% of your capacity is that as you approach maximum usage, unpredictable things start happening, such as thrashing, which is the excessive swapping of data or program instructions in and out of memory. Unpredictability in our hardware and software, when discussed as a theoretical concept, is entertaining, but when it occurs in the real world, there is nothing entertaining about it. Sporadic behavior is a factor that makes a problem incredibly hard to diagnose.

##### Thrashing

As one example of unpredictable behavior, let's look at thrashing or excessive swapping. You are probably familiar with the concept, but as a quick review, almost all operating systems have the ability to swap programs or data in and out of memory if the program or data that is being run is larger than the allocated physical memory.
Some operating systems divide memory into pages, and these pages are swapped out and written to disk. This capability to swap is very important for two reasons. First, some items used by a program during startup are used very infrequently and should be removed from active memory. Second, when the program or dataset is larger than the physical memory, swapping the needed parts into memory makes the execution much faster. This speed difference between disk and memory is actually what causes problems. Memory is accessed in nanoseconds, whereas disk access is typically measured in milliseconds, so disk access is many thousands of times slower.

Thrashing occurs when a page is swapped out to disk but is soon needed and must be swapped back in. After it is in memory, it gets swapped back out in order to let something else have the freed memory. The reading and writing of pages to disk is very slow compared to memory; therefore, the entire execution begins to slow down while processes wait for pages to land back in memory. There are lots of factors that influence thrashing, but closing in on the limits of capacity on a machine is a very likely cause.

What is the ideal percentage of capacity to use for a particular component? As with most things, the answer is that it depends. It depends on a number of variables, one of the most important being the type of component. Certain components, most notably networking gear, are well known to be predictable as demand ramps.
Application servers are in general much less predictable, not because of the hardware being inferior, but because of their general-use nature. They can and usually do run a wide variety of processes, even when dedicated to single services. Therefore, you may decide to use a higher percentage in the headroom equation for your load balancer than you do on your application server.

As a general rule of thumb, we like to start at 50% as the ideal usage percentage and work up from there as the arguments dictate. Your app servers are probably your most variable component, but someone could argue that the servers dedicated to API traffic are less variable and therefore could run closer to the true maximum usage, perhaps 60%. And then there is the networking gear that we discussed earlier, which you may feel comfortable running as high as 75% of maximum. We're open to these changes, but as a guideline, we recommend starting at 50% and having your teams or yourself make the arguments for why you should feel comfortable running at a higher percentage. We would not recommend going above 75% because you still have to account for error in your estimates of growth.

Another way that you can arrive at the usage percentage, or how close you can run to the maximum capacity of a component, is by using statistics. The concept is to figure out how much variability resides in your services running on the particular component and then use that as a guide to buffer away from the maximum capacity.
If you are planning on using this method, you should consider revisiting these numbers often, especially after releases with major features, because the performance of services can change dramatically based on user behavior or new code. For this method, we look at weeks' or months' worth of performance data, such as load on a server, and then calculate the standard deviation of that data. We then subtract 3 × the standard deviation from the maximum capacity and use that in the headroom equation as the substitute for the Ideal Usage Percentage × Maximum Capacity.

In Table 11.1, Load Averages, we have three weeks' worth of maximum load values on our application servers. The standard deviation for this sample data set is 1.49. If we take 3 × that amount, 4.48, and subtract it from the maximum load capacity that we have established for this server class, we then have the amount of load capacity that we can plan to use up to but not exceed. In this case, our systems administrators believe that 15 is the maximum; therefore 15 - 4.48 ≈ 10.5 is the maximum amount we can plan for. This is the number that we would use in the headroom equation to replace the Ideal Usage Percentage × Maximum Capacity.

![](https://blog.baidu-google.com/usr/uploads/2024/06/228355155.png)

##### Headroom Calculation Checklist

These are the major steps that should be followed when completing a headroom calculation:

1. Identify major components.
2. Assign responsibility for determining actual usage and maximum capacity.
3. Determine intrinsic or natural growth rate.
4. Determine business activity based growth rates.
5. Determine peak seasonality effects.
6. Estimate headroom or capacity reclaimed by infrastructure projects.
7. Make headroom calculation:
   * If positive, you have capacity for the timeframe analyzed.
   * If negative, you do not have capacity over the specified timeframe.
8. Divide headroom by the (growth rate + seasonality - optimizations) to get the amount of time remaining to use up the capacity.

#### Conclusion

In this chapter, we started our discussion by investigating the purpose of the headroom process. We decided that there are four principal areas where you should consider using headroom projections when planning: budgets, headcount, feature development, and scalability projects.

We then covered the basic structure of the headroom process. This process consists of many steps, which are detail oriented, but overall the process is very straightforward. The steps include identifying major components, assigning responsible parties for determining actual usage and maximum capacity for those components, determining the intrinsic growth rate as well as the growth rate caused by business activities, accounting for seasonality, making estimates for improvement in usage based on infrastructure projects, and then performing the calculations.

The last topic we covered in this chapter was the ideal usage percentage for components. We stipulated that in general we prefer to use a simple 50% as the amount of maximum capacity that should be planned on using.
The reason is that this accounts for variability or mistakes in determining the maximum capacity as well as errors in the growth projections. We conceded that we could be convinced to increase this percentage if the administrators or engineers could make sound and reasonable arguments for why the system is very well understood and not very variable. An alternative method of determining this maximum usage capacity is to subtract three standard deviations of actual usage from the believed maximum and use that number as the planning maximum.

##### Key Points

* The reason that you should want to know your headroom for various components is that you need this information for budgets, hiring plans, release planning, and scalability project prioritization.
* Headroom should be calculated for each major component within the system, such as each pool of application servers, networking gear, bandwidth usage, and database servers.
* Without a sound and factually based argument for deviating, we recommend not planning on using more than 50% of the maximum capacity on any one component.
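
As a closing illustration, the headroom equation, the headroom time calculation, and the three-standard-deviation planning maximum from this chapter can be sketched in Python. This is a minimal sketch rather than a definitive implementation: the function names (`headroom`, `headroom_time`, `planning_maximum`) and their parameters are assumptions made for illustration, and `planning_maximum` assumes that the sample standard deviation is the intended statistic. The numbers in the usage lines are the AllScale HRM database figures from the chapter, with one 12-month time period.

```python
from statistics import stdev


def headroom(ideal_usage_pct, max_capacity, current_usage, growth, optimization):
    """Ideal share of maximum capacity, minus current usage, minus net growth.

    growth and optimization are sequences with one value per time period.
    """
    net_growth = sum(g - o for g, o in zip(growth, optimization))
    return ideal_usage_pct * max_capacity - current_usage - net_growth


def headroom_time(headroom_value, net_growth_per_period):
    """Number of time periods until the remaining headroom is used up."""
    return headroom_value / net_growth_per_period


def planning_maximum(load_samples, max_capacity, sigmas=3):
    """Maximum capacity buffered by a multiple of the observed standard deviation.

    Assumption: statistics.stdev (sample standard deviation) is used.
    """
    return max_capacity - sigmas * stdev(load_samples)


# AllScale example: 100 q/s maximum, 25 q/s current usage,
# 10 q/s growth and 5 q/s optimization over one 12-month period.
hrm_headroom = headroom(0.50, 100, 25, growth=[10], optimization=[5])
print(hrm_headroom)                         # 20.0 q/s
print(headroom_time(hrm_headroom, 10 - 5))  # 4.0 time periods (years)
```

A positive result from `headroom` means the component has capacity for the periods analyzed. The value returned by `planning_maximum` would stand in for the ideal usage percentage times maximum capacity term when the statistical method is preferred.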