### Chapter 27 Too Much Data

> The skillful soldier does not raise a second levy, nor are his supply wagons loaded more than once.
>
> Sun Tzu

Hyper growth, or even slow steady growth over time, presents some unique scalability problems with data retention and storage. We might log information relevant at the time of a transaction, insert information relevant to a purchase, or keep track of user account changes. We may log all customer contacts or allow users to store data ranging from pictures to videos. The resulting size, as we will discuss later, has significant cost implications for our business and can negatively affect our ability to scale, or at least to scale cost-effectively.

Time also affects the value of our data. Although not universally true, in many systems the value of data decreases over time. Old customer contact information, although potentially valuable, probably isn’t as valuable as the most recent contact information. Old photos and videos aren’t likely to be accessed as often, and old log messages probably aren’t as relevant to us today. So as our costs increase with all of the additional data being stored, the value per unit of data stored decreases, presenting unique challenges for most businesses.

The size of data alone can present issues for your business. Assuming that not all elements of the data are valuable to all requests or actions against that data, we need to find ways to process and store this data quickly and cost-effectively.

This chapter is all about data size, or the amount of data that you store. How do we handle it, process it, and keep our business from being overly burdened by it? What data do we get rid of, and how do we store data in a tiered fashion that allows all data to be accretive to shareholder value?

#### The Cost of Data

Data is costly. Your first response to this might be that the costs of mass storage devices have decreased steadily over time and that, with the introduction of cloud storage services, storage has become “nearly free.” But free and nearly free obviously aren’t the same thing, as a whole lot of something that is nearly free actually turns out to be quite expensive. As the price of storage decreases over time, we tend to care less about how much we use, and as a result our usage typically increases significantly. Prices might drop by 50%, and rather than passing that 50% reduction on to shareholders as a reduction in our cost of operations, we may very likely allow the size of our storage to double because it is “cheap.”

But the initial cost of this storage is not the only cost you incur with every piece of data you store on it. The more storage you have, the more storage management you need. This might be the overhead of systems administrators to handle the data, or capacity planners to plan for the growth, or maybe even software licenses that allow you to “virtualize” your storage environment and manage it more easily. As your storage grows, so does the complexity of managing that storage.

Furthermore, as your storage increases, the power and space costs of handling that storage increase as well. You might argue here that the advent of the Massive Array of Idle Disks (MAID) has offset those costs, or maybe you are thinking of even less costly solutions such as cloud storage services. We applaud you if you have put your infrequently accessed data on such a storage infrastructure. But the fact of the matter is that running one massive array will cost you less than running 10 massive arrays, and less storage in the cloud will cost you less than more storage in the cloud. In the case of MAID solutions, those disks spin from time to time, and they take power just to ensure that they are “functioning.” Furthermore, you either paid for the power distribution units (power sockets) into which they are plugged, or you pay a colocation provider a monthly or annual fee to have the plug and power available. Finally, you either paid to build an infrastructure capable of some maximum power utilization, likely driven by a percentage of those drives being active, or you pay someone else (again in the case of colocation) to handle that for you. And of course, if you aren’t using MAID drives, the cost of the power to run systems that are always spinning is even higher. If you are using cloud services, you still need the staff and processes to understand where that storage is located and to ensure that you can properly access it.

And that’s not all! If this data resides in a database upon which you are performing transactions for end users, the cost of each query against that data increases with the size of the data being queried. We’re not talking about the cost of the physical storage at this point, but rather the time to complete the query. It’s true that if you are querying a properly balanced index, the time to query that data does not grow linearly (it is more likely proportional to log₂ N, where N is the number of elements), but it nevertheless increases with the size of the data. Sixteen elements in a binary tree will not cost twice as much to traverse as eight elements, but they will still cost more. This increase in the steps needed to traverse data elements takes more processor time per user query, which in turn means that fewer things can be processed within any given amount of time. Let’s say that we have eight elements and it takes us on average 1.5 steps to find our item with a query. Let’s then say that with 16 elements it takes us on average two steps to find our item. This is a 33% increase in processing time to handle 16 elements versus eight. Although doubling the data for a 33% increase in work seems like good leverage, it is still taking more time. And it doesn’t just cost more time on the database. This increase in time, even if the query is performed asynchronously, is probably time that an app server is waiting for the query to finish, time that the Web server is waiting for the app server to return the data, and time that your customer is waiting for a page to load.

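The sublinear growth described above can be checked with a few lines of Python. This is only a sketch: it uses log₂ N as a stand-in for the number of steps in a balanced index (the chapter's 1.5 and 2 steps are illustrative averages), and the function names `lookup_steps` and `relative_increase` are made up for this example.

```python
import math


def lookup_steps(n_elements: int) -> float:
    """Approximate steps to find one key in a balanced binary index (log2 N)."""
    if n_elements < 2:
        raise ValueError("need at least two elements")
    return math.log2(n_elements)


def relative_increase(small: int, large: int) -> float:
    """Fractional growth in per-lookup steps when the data grows from small to large."""
    return lookup_steps(large) / lookup_steps(small) - 1.0


if __name__ == "__main__":
    # Doubling the data always adds one step; the relative cost of that step
    # shrinks as the data set grows, but it never reaches zero.
    for small, large in [(8, 16), (1_000_000, 2_000_000)]:
        growth = relative_increase(small, large)
        print(f"{small} -> {large} elements: {growth:.1%} more steps per lookup")
```

Under this model, going from 8 to 16 elements reproduces the one-third increase used in the text, while doubling a much larger data set adds a smaller, yet still nonzero, percentage to every query.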
Let’s now consider our peak utilization time of, say, 1 to 2 PM. If each query takes us 33% more time on average to complete and we want to run at 100% utilization during our peak traffic period, we might need as many as 33% more systems to handle twice the data (16 elements versus the original eight) if we do not want user response time to be adversely impacted. In other words, we either let each query take 33% more time to complete and degrade the user experience as new queries back up behind longer-running queries on constrained capacity, or we add capacity to try to limit the impact on users. At some point, of course, without disaggregation of the data similar to the trick we performed with search in Chapter 24, Splitting Databases for Scale, user experience will begin to suffer. Although you can argue that faster processors, better caching, and faster storage will help the user experience, none of these changes the fact that more data costs you more in processing time than less data on similar systems.

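The capacity arithmetic in the paragraph above can be sketched as follows. The model is deliberately naive and rests on stated assumptions: servers are fully utilized at peak, work scales linearly with per-query time, and the fleet size of 100 servers is a hypothetical figure, not one from the chapter.

```python
def servers_needed(current_servers: int, extra_query_time_pct: int) -> int:
    """Servers required to keep peak throughput constant when each query takes longer.

    Assumes the current fleet runs at 100% utilization at peak and that total
    work grows in proportion to per-query processing time.
    """
    if current_servers < 1 or extra_query_time_pct < 0:
        raise ValueError("need at least one server and a non-negative increase")
    total = current_servers * (100 + extra_query_time_pct)
    return -(-total // 100)  # integer ceiling division


if __name__ == "__main__":
    before = 100  # hypothetical fleet size
    after = servers_needed(before, 33)
    print(f"{before} servers at peak become {after} after a 33% rise in query time")
```

Integer arithmetic is used so that rounding up to whole servers is exact; a real capacity plan would also account for headroom, caching, and uneven load.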
If you think that’s the end of your costs relative to storage, you are probably wrong again. You undoubtedly back up your storage from time to time, potentially to an offsite storage facility. As your data grows, the amount of work you do to perform a “full backup” grows as well. Not only that, but you do that work over and over again with each full backup. Much of your data probably isn’t changing, but you are nevertheless rewriting it time and again. Although incremental backups (backing up only the changed data) help with this concern, you more than likely perform a periodic full backup to forgo the cost of needing to apply a multitude of incremental backups to a single full backup that might be years old. If you took only a single full backup and then relied on incremental backups alone to recover some section of your storage infrastructure, your recovery time (the amount of time needed to recover from a storage failure) would be long indeed!

Hopefully, we’ve disabused you of the notion that storage is free. Storage prices may be falling, but they are only a portion of your true cost to store information, data, and knowledge.

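The backup trade-off mentioned earlier in this section (periodic full backups versus a long chain of incrementals) can be illustrated with a toy model. Everything here is an assumption made for illustration: one incremental backup per day, a fixed restore time per backup, and the hour figures in the usage lines.

```python
def worst_case_restore_hours(
    full_interval_days: int,
    full_restore_hours: float,
    incremental_restore_hours: float,
) -> float:
    """Worst-case restore time with one full backup per interval and daily incrementals.

    The worst case is a failure just before the next full backup, when the last
    full backup plus (interval - 1) incrementals must all be applied.
    """
    if full_interval_days < 1:
        raise ValueError("full_interval_days must be at least 1")
    incrementals = full_interval_days - 1
    return full_restore_hours + incrementals * incremental_restore_hours


if __name__ == "__main__":
    weekly = worst_case_restore_hours(7, 8.0, 0.5)
    two_years = worst_case_restore_hours(730, 8.0, 0.5)
    print(f"weekly full backups: up to {weekly} hours to restore")
    print(f"one full backup two years ago: up to {two_years} hours to restore")
```

The point of the sketch is only the shape of the curve: the longer the interval between full backups, the more incrementals sit on the recovery path.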
##### The Six Costs of Data

As the amount of data that you store increases, the following costs increase:

* Storage costs to store the data
* People and software to manage the storage
* Power and space to make the storage work
* Capital to ensure the proper power infrastructure
* Processing power to traverse the data
* Backup time and costs

Data isn’t just about the physical storage, and sometimes the other costs identified here can even eclipse the actual cost of storage.

#### The Value of Data and the Cost-Value Dilemma

Not all data is created equal in terms of its value to our business. In many businesses, time negatively impacts the value that we can get from any specific data element. For instance, old data in most data warehouses is less likely to be useful in modeling business transactions. Old data regarding a given customer’s interaction with your ecommerce platform might be useful to you, but it’s not likely to be as useful as the most current data that you have. A phone company’s call detail records from years ago aren’t as valuable to users as new call records, and banking transactions from three years ago probably aren’t as useful as the ones that occurred in the last couple of weeks. Old photos and videos might be referenced from time to time, but they aren’t likely to be accessed as often as the most recent uploads. Although we won’t argue that it is a law that older data is less valuable than new data, we believe it holds true often enough in most businesses to call it generally true and directionally correct.

If the value of data decreases over time and the cost of keeping it increases over time, why do we so very often keep so darn much of it? We call this question the Cost-Value Data Dilemma. In our experience, most companies simply do not pay attention to the deteriorating value of data and the increasing cost of data retention over time. Often, new or faster technologies allow us to store the same data for lower cost or store more data for the same cost. As the per-unit cost of storage drops, our willingness to keep more of it increases.

Moreover, many companies point to the option value of data. How can you possibly know what you might use that data for in the future? It might become incredibly valuable at some point in the company’s future. Nearly all of us can point to a case at some point in our careers where we have said, “If only we had kept that data.” We use that experience or set of experiences to drive decisions about all future data; if we needed one or a few pieces of data once and didn’t have them, that becomes a reason to keep all other data for all time.

Another common reason is strategic advantage.
Very often, this reason is couched as, “We keep this data because our competition doesn’t keep it.” That becomes reason enough, as it is most often decided by the general manager or CEO and a number of surveys support its implementation. In fact, it might be a source of competitive advantage, though our experience is that keeping data infinitely is not as much of an advantage as simply keeping it longer than your competition (but not infinitely).

Ignoring the Cost-Value Data Dilemma, citing the option value of data, and claiming competitive advantage through infinite data retention all potentially have dilutive effects on shareholder value. If the real upside of the decision (or of the lack of a decision, in the case of ignoring the dilemma) does not create more value than it costs, the decision is suboptimal. In the cases where legislation or regulation requires you to retain data, such as emails or financial transactions, you have little choice but to comply with the letter of the law. But in all other cases, it is possible to assign some real or perceived value to the data and compare it to the costs. Consider the fact that the value is likely to decrease over time and that the costs of data retention, although going down on a per-unit basis, will likely increase in aggregate in hyper-growth companies.

As a real-world analog, your company may be mature enough to associate a certain value and cost with a class of user. Business schools often spend a great deal of time discussing the concept of unprofitable customers.
An unprofitable customer is a customer that costs you more to keep than you make from them over the life of the relationship. Ideally, you do not want to service or keep your unprofitable customers, assuming that you have correctly identified them. Identification is the hard part: a single customer may be unprofitable to you on a standalone basis but serve to bring in several profitable customers whom you might not have without that single unprofitable relationship. The science and art of determining and pruning unprofitable customers is more difficult in some businesses than in others.

The same concept of profitable and unprofitable customers nevertheless applies to your data. In nearly any environment, with enough investigation, you will likely find data that adds shareholder value and data that is dilutive to shareholder value because the cost of retaining it on its existing storage solution is greater than the value it creates. Just as we may have customers that are more costly to service than their total value to the company (even when considering the profitable customers that they bring along), so do we have unprofitable and value-destroying data.

#### Making Data Profitable

The business and technology approach for what data to keep and how to keep it is pretty straightforward: architect storage solutions that allow you to keep all data that is profitable for your business, or is likely to be accretive to shareholder value, and remove the rest.
Let’s look at the most common reasons driving data bloat and then examine ways to match our data storage costs to the value of the data contained within that storage.

##### Option Value

All options have some value to us. The value may be determined by what we believe the probability is that we will ultimately exercise the option to our personal benefit. This may be a probabilistic equation that weighs both the probability that the option will be exercised and the likely benefit of exercising it. Clearly, we cannot claim that the option value is “infinite”; in so doing, we would be saying that the option will produce infinite value for our shareholders. If that were the case, we should simply disclose our wealth of information and watch our share price rise sharply. What do you think the chance of that is? The answer is that if you were to make such a disclosure, your share price probably wouldn’t move noticeably; at least it wouldn’t move noticeably as a result of your data disclosure.

The option value of our data, then, is some finite number. We should start asking ourselves questions like these: How often have we used data in the past to make a valuable decision? What was the age of the data used in that decision? What was the value that we ultimately created versus the cost of maintaining that data? Was the net result profitable?

Remember, we aren’t talking about flushing all data or advocating the removal of all data from your systems.
Your platform probably wouldn’t work if it didn’t have some meaningful data in it. We are simply indicating that you should evaluate and question your data retention to ensure that all of the data you are keeping is in fact valuable and, as we will discuss later in this chapter, that the solution for storing that data is priced and architected with the data’s value in mind. If you haven’t made use of the data in the past to make better decisions, there is a good chance that you’re not going to start using all of it tomorrow. Even when you start using your data, you aren’t likely to use all of it; as such, you should decide which data has real value, which data has value but should be stored in a lower-cost storage solution, and which data can be removed.

##### Strategic Competitive Differentiation

This is one of our favorite reasons to keep data. It’s the easiest to claim and the hardest to disprove. The general thought is that you are better than all of your competitors because they do not keep all of their data. You make better decisions, your customers have access to more and better data, and as a result you will win in your market segment. You probably even have market research that shows that your approach is appreciated by your clients.

Let’s address the market research first. What do you think the answer will be if you ask your customers if they value having all of their “widgets” available for eternity? Depending upon your industry, they are probably going to respond favorably. There are at least a couple of reasons for this.
The first is that they already have a bit of confirmation bias at work from using your platform over a competitor’s, and you’ve just given them a reason to cite for why they use your platform. Another reason is that you haven’t presented them with a cost, at least not in the question, of keeping the data infinitely. As such, with no associated cost, they are probably happy with the near infinite storage.

On the other hand, what if we asked what someone would be willing to pay for near infinite storage? The answers would likely be very different indeed! How about if we were to ask why our customers use our product rather than our competitors’ and forced them to write in an answer? Our guess is that you may find out that the first thing that comes to mind is not the infinite storage.

The right question here is what the incremental value of “infinite” data storage is over, say, 10 years, or t years. What about the difference between 20 years and 10 years? Our guess is that as the retention period increases, each year adds less value than the previous year. Year 19, for instance, is probably more valuable than year 20, and year 1 is probably more valuable than year 2. As the years increase, the incremental value dwindles toward zero even as our costs increase with the amount of storage. It’s starting to appear to us that the company that constrains its storage is very likely going to have a competitive advantage over the company that does not. What is that advantage? Greater profitability!

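The diminishing-returns argument above can be made concrete with a toy model. This is a sketch under loud assumptions, none of which come from the chapter: each additional year of retained data is worth a fixed fraction (`decay`) of the year before it, every retained year costs the same to store, and the dollar figures are invented.

```python
def profitable_retention_years(
    first_year_value: float,
    decay: float,
    cost_per_year: float,
    max_years: int = 100,
) -> int:
    """Count the years of data whose incremental value still exceeds their storage cost.

    Assumes geometric decay: year t is worth first_year_value * decay ** (t - 1).
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in the range (0, 1]")
    years = 0
    value = first_year_value
    while years < max_years and value > cost_per_year:
        years += 1
        value *= decay  # the next year of history is worth less than this one
    return years


if __name__ == "__main__":
    # Hypothetical: year 1 is worth 100, each older year half as much, storage costs 10/year.
    print(profitable_retention_years(100.0, 0.5, 10.0))
```

Whatever the real decay curve looks like, the exercise is the same: find the retention period beyond which another year of data costs more than it returns, and stop paying premium prices for anything older.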
Of course, the preceding comparisons all assume that storage is similarly priced, that all storage solutions are equivalent, that all types of access require the same service levels for response time, and so on. After we recognize that some data has immense value, some data has lower value, some data “might have value,” and some data has no value at all, we can determine a tiered cost storage solution for data with value and remove the data with very low or no value. We can also transform and compact the data to make sure that we retain most of the value at significantly lower cost.

##### Cost Justify the Solution (Tiered Storage Solutions)

Maybe you have some data that has meaningful business value, but the cost of storing that data exceeds its value or expected value. This is the time to consider a tiered storage solution. Many young companies settle on a certain type of storage based on the primary needs of their transaction processing systems. The result of this decision is that just about everything else relies upon this (typically) premium storage solution. Not absolutely everything needs the redundancy, high availability, and response time of your primary applications. For your lower-value, but nevertheless valuable, services and needs, consider moving to tiered storage solutions.

For instance, infrequently accessed data that does not necessarily require immediate response times might be provisioned on the aforementioned massive array of idle disks. Or maybe you just move some of this data to less expensive and slower network attached storage systems. Potentially, you decide to simply split up your architecture and serve some of these data needs from a y-axis split that addresses the function of “serve archived data.” To conserve processing power, maybe the requests to “serve archived data” are made in an asynchronous fashion and the results are emailed after they are compiled.

You may decide to keep all old emails on tape storage for the period of time mandated by current legislation or regulation within your industry. Perhaps you take infrequently accessed customer data and put it on cloud storage systems. Data accessed sometimes (where sometimes is more than infrequently but less than frequently) might go to MAID farms. Data that is frequently accessed but has low corporate value might go onto inexpensive slower-speed devices, and frequently accessed data of high value might go on your “tier 1” high-performance access systems.

Let’s return to our example of AllScale and examine how it approaches the problem within its human resource management (HRM) system. The HRM solution allows all correspondence on HR matters to be stored within the company’s platform.
Some correspondence is searched frequently, and that correspondence tends to be for events happening within the last couple of months. Search results older than several months are seldom reviewed, and if an email is older than two years, it is almost never viewed. Furthermore, that correspondence is still held within the customer’s email systems and is kept by the customer for the period required by its user agreements and/or regulatory requirements.

The team decides on a multitier architecture for all storage. Common searches will be precalculated and cached within the platform. The data associated with these searches will be stored in a tiered fashion, with the most relevant search results on high-speed local or storage area network storage devices. Less frequently accessed data will be moved progressively to cheaper and slower storage, including MAID devices and, for very infrequently accessed data, cloud storage. For very old data, the platform simply keeps records of where the data can be found on the customer-managed mail system; the actual correspondence itself is first archived to tape and then permanently purged after no more than five years.

The solution here is to match the cost of the solution with the value that it creates; in other words, to cost justify it. Not every system or piece of data offers the same value to the business. We typically pay our employees based on their merit or value to the business, so why shouldn’t we approach system design in the same fashion? If there is some, but not much, value in some group of data, simply build the system to support that value.

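The placement rules described in this section can be sketched as a small lookup function. This is one possible encoding, not a prescribed design: the enum names, the three-level value scale, and the `retention_mandated` flag are all hypothetical, and a real policy would be driven by measured access patterns and actual storage prices.

```python
from enum import Enum


class Access(Enum):
    FREQUENT = "frequent"
    SOMETIMES = "sometimes"
    INFREQUENT = "infrequent"


class Value(Enum):
    HIGH = "high"
    LOW = "low"
    NONE = "none"


class Tier(Enum):
    TIER_1 = "tier 1 high-performance storage"
    LOW_COST_DISK = "inexpensive slower-speed devices"
    MAID = "MAID farm"
    CLOUD = "cloud storage"
    TAPE = "tape archive"
    REMOVE = "remove the data"


def choose_tier(access: Access, value: Value, retention_mandated: bool = False) -> Tier:
    """Map a class of data to a storage tier using the chapter's rules of thumb."""
    if retention_mandated:
        return Tier.TAPE  # kept only because legislation or regulation requires it
    if value is Value.NONE:
        return Tier.REMOVE  # data with no value should not cost anything to keep
    if access is Access.FREQUENT:
        return Tier.TIER_1 if value is Value.HIGH else Tier.LOW_COST_DISK
    if access is Access.SOMETIMES:
        return Tier.MAID
    return Tier.CLOUD


if __name__ == "__main__":
    print(choose_tier(Access.FREQUENT, Value.HIGH).value)
    print(choose_tier(Access.INFREQUENT, Value.LOW).value)
```

Keeping the policy in one function makes the cost-versus-value decision explicit and easy to review when prices or access patterns change.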
This approach does have some drawbacks, such as the requirement that the operations staff support and maintain multiple storage tiers, but as long as those additional costs are evaluated properly, the tiered storage solution works well for many companies.

##### Transform the Data

Often, the data we keep for transactional purposes simply isn’t in a form that is consumable or meaningful for our other needs. As a result, we end up processing the data in near real time to make it meaningful to corporate decision making or to make it useful to our product and platform for a better customer experience.

As an example of the former case, where we are concerned about making good business decisions, consider the needs of a marketing organization concerned about individual consumer behavior. Our marketing organization might be interested in demographic analysis of purchases over time of any of a number of our products. Keeping the exact records of every purchase might be the most flexible approach to fulfilling their needs, but the marketing organization is probably comfortable with being able to match buyer purchases of products by month. All of a sudden, our data requirements have shrunk: because many of our customers are repeat purchasers, we can collapse individual transaction records into records indicating the buyer, the items purchased, and the month in which those items were purchased. Now, we might keep online transaction details for four months to facilitate the most recent quarterly reporting needs, and then roll up those transactions into summary transactions by individual for marketing and by internal department for finance.
Our data storage requirements might go down by as much as 50%. Furthermore, because we would otherwise perform this summarization at the time of the marketing request, we have reduced the response time of the application generating this data (it is now prepopulated) and, as a result, increased the efficiency of our marketing organization.

As an example of the latter case, we might want to make product recommendations to our customers while they are interacting with our platform. These product recommendations might give insight into what was bought by other customers who have viewed or purchased similar items. It goes without saying that scanning all purchases to develop such a map of customer affinity to products would likely be too complex to calculate and present while someone is attempting to shop. For this reason alone, we would want to precalculate the product and customer relationships. However, such calculation also reduces our need to store the details of all transactions over time. As a result of developing our precalculated affinity map, we have not only reduced response times for our customers, we have also reduced some of our long-term data retention needs.

The principles on which data transformation is based are couched within a process that data warehousing experts refer to as Extract, Transform, and Load (ETL).
It is beyond the scope of this book to even attempt to scratch the surface of data warehousing, but the concepts inherent to ETL can help obviate some of the need for storing larger amounts of data within your transactional systems. Ideally, these ETL processes, besides removing the data from your primary transaction systems, also reduce your overall storage needs as compared to keeping the raw data over similar time periods. Condensing expensive detailed records into summary tables and fact tables focused on answering specific questions saves both space and processing time.

####Handling Large Amounts of Data

Having spent several pages discussing the need to match storage cost with data value and to eliminate data of very low value, let’s now turn our attention to a more exciting problem: What do we do when our data is valuable but there is just way too much of it to process efficiently?

If you’ve ever had an algebra class, and chances are you have, you probably already know the answer to this question. Remember your algebra or calculus teacher or professor reminding you to simplify equations before attempting to solve them? Well, the same advice that would make you successful in solving a math problem will make you successful in solving problems associated with large amounts of data.

If the data is easily segmented into resources or can be easily associated with services, we need only apply the concepts we learned in Chapters 22 through 24. The AKF Scale Cube will solve your needs for these situations.
But how about the case when an entire dataset needs to be traversed to produce a single answer, such as the count by word within all of the works contained within the Library of Congress, or potentially an inventory count within a very large and complex inventory system? If we want to get through this work quickly, we are going to need to find a way to distribute the work efficiently. This distribution of work might take the form of a multiple-pass system where the first pass analyzes (or maps) the work and the second pass calculates (or reduces) the work. Google introduced a software framework called MapReduce to support distributed processing of such large datasets. The following is a description of that model and an example of how it can be applied to large problems.

At a high level, MapReduce has a Map function and a Reduce function. The Map function takes as its input a key-value pair and produces intermediate key-value pairs. This might not immediately seem useful to the layperson, but the intent is that this is a distributed process creating useful intermediate information for another distributed process to compile. The input key might be the name of a document or, remembering that the work is distributed, the name of or a pointer to a piece of a document. The value could be content consisting of all the words within the document itself. In our distributed inventory system, the key might be the inventory location and the value all of the names of inventory within that location, with one name for each piece and quantity of inventory. For instance, if we had five screws and two nails, the value would be screw, screw, screw, screw, screw, nail, nail.
The canonical form of Map looks like this in pseudocode:

![](https://blog.baidu-google.com/usr/uploads/2024/06/1609552558.png)

We’ve identified parenthetically that this pseudocode could work for both the word count example (also given by Google) and the distributed parts inventory example. Only one or the other would exist in reality for your application, and you would eliminate the parentheses. The input_key and input_values, along with the output keys and values, are presented in Figure 27.1. The first example is a set of phrases including the word “red,” of which we are fond, and the second is a small set of inventories for different locations.

![](https://blog.baidu-google.com/usr/uploads/2024/06/3177825605.png)

![](https://blog.baidu-google.com/usr/uploads/2024/06/3717373106.png)

Note here how Map takes each of the documents and simply emits each word with a count of 1 as we move through the document. For the sake of speed, we had a separate Map process working on each of the documents. Figure 27.2 shows the output of this process.

Again, we have taken each of our initial key-value pairs, with the key being the location of the inventory and the value being the individual components listed, with one listing for each occurrence of that component per location. The output is the name of the component and a value of 1 for each component listing. Again, we used separate Map processes.
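
The Map step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Google’s implementation: the names `map_fn` and `run_map` and the sample documents and inventory locations are hypothetical and are not the contents of Figure 27.1. Each call to `map_fn` stands in for what would be a separate Map process in a distributed system.

```python
from typing import Dict, Iterator, List, Tuple


def map_fn(input_key: str, input_value: str) -> Iterator[Tuple[str, int]]:
    """Emit (item, 1) for every word or part name in input_value.

    input_key is the document name or inventory location. It identifies
    the input but is not needed to build the intermediate pairs.
    """
    for item in input_value.split():
        yield (item, 1)


def run_map(inputs: Dict[str, str]) -> List[Tuple[str, int]]:
    """Apply map_fn to every input pair and collect the intermediate pairs.

    A real system would run each map_fn call in its own process or host;
    here the calls run one after another to keep the sketch short.
    """
    intermediate: List[Tuple[str, int]] = []
    for key, value in inputs.items():
        intermediate.extend(map_fn(key, value))
    return intermediate


# Hypothetical sample data (assumed for illustration only).
documents = {
    "doc1": "red sky red barn",
    "doc2": "big red truck",
}
inventory = {
    "location_a": "screw screw screw screw screw nail nail",
    "location_b": "nail washer screw",
}

print(run_map(documents))
print(run_map(inventory))
```

The same function serves both examples because, in each case, the value is simply a sequence of names and the output is one pair per occurrence.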
What is the value of such a construct? We can now feed these key-value pairs into a distributed process that will combine them and create an ordered result of key-value pairs, where the value is the number of items that we have of each type (either a word or a part). The trick in our distributed system is to ensure that each key gets routed to one and only one collector or reducer. We need this affinity to a reducer (or tier of reducers, as we will discuss in a minute) to ensure an accurate count. If the part screw is going to go to reducer 1, all instances of screw must go to reducer 1.

Let’s see how the Google reduce function works in pseudocode:

![](https://blog.baidu-google.com/usr/uploads/2024/06/927264828.png)

![](https://blog.baidu-google.com/usr/uploads/2024/06/3188381033.png)

For our reduce function to work, we need to add a program to group the words or parts and append the values for each in a list. This is a rather trivial program that will sort and group the pairs by key. This too could be distributed, assuming that the key-value pairs emitted from the Map function for a given key are sent to the same sort-and-group function, which then submits them to the reduce function. Passing over the trivial function of sorting and grouping, which is the subject of many computer science undergraduate textbooks, we can display our reduce function as in Figure 27.3 for our inventory system (we will leave the word count output as an exercise for our readers). Multiple layers of sorting, grouping, and reducing can be employed to help speed along the process.
For instance, if there were 50 map systems, they could send their results to 50 sorters, which could in turn send their results to 25 sorters and groupers, and so on until we had a single sorted and grouped list of parts and value lists to send to our multiple reducer functions. The system is highly scalable in terms of the number of processors and the processing power you can throw at it. We highly recommend that you read the Google Labs MapReduce documentation.

####Conclusion

This chapter discussed what to do with large datasets. On one end of the spectrum, we have the paradoxical relationship of cost and value for data. As data ages and data sizes grow, the cost to the organization increases. Yet as this data ages in most companies, its value to the company and platform typically decreases. The reasons for clinging to data past its valuable life to a company include ignorance, perceived option value, and perceived strategic competitive differentiation. Our remedies for perceived option value and perceived competitive differentiation are based on applying real dollar values to these perceived values in order to properly justify the existence (and the cost) of the data.

After identifying the value and costs of data, we proposed implementing tiered storage solutions that match the cost and access speed of data to the value that it creates for shareholders.
On one end of our tiered strategy are high-end, very fast storage devices, and on the opposite end is the deletion or purging of low-value data. Data transformation and summarization can help reduce the cost, and therefore increase the profitability, of data where the reduction in size does not significantly change the value of the data.

Finally, we addressed one approach to parallelizing the processing of very large datasets. Google’s MapReduce approach is widely adopted across many industries as a standard for processing large datasets quickly in a distributed fashion.

#####Key Points

* Data is costly, and the cost of data consists of more than just the cost of the storage itself. People, power, capital costs of power infrastructure, processing power, and backup time and costs all affect the cost of data.
* The value of data in most companies tends to decrease over time.
* Companies often keep too much data due to ignorance, perceived option value, and perceived competitive differentiation.
* Perceived option value and perceived competitive differentiation should include values and time limits on data to properly determine if the data is accretive or dilutive to shareholder value.
* Eliminate data that is dilutive to shareholder value, or find alternative storage approaches to make the data accretive. Tiered storage strategies and data transformation are both methods of cost-justifying data.
* Applying concepts of distributed computing to large datasets helps us process those datasets quickly.
Google’s MapReduce is a good example of a software framework for acting on large datasets.
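
As a rough illustration of the routing, grouping, and reducing steps covered in this chapter, the following Python sketch takes intermediate `(name, 1)` pairs, sends every occurrence of a key to exactly one reducer, groups the values into a list per key, and sums them. All names here (`partition`, `shuffle`, `reduce_fn`, `run_reduce`) and the use of a CRC32 checksum for routing are assumptions made for this example. They are not part of Google’s framework, and the sample pairs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from zlib import crc32


def partition(key: str, num_reducers: int) -> int:
    """Choose a reducer for a key; equal keys always get the same reducer."""
    return crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs: Iterable[Tuple[str, int]],
            num_reducers: int) -> List[Dict[str, List[int]]]:
    """Route each pair to one reducer bucket and group its values by key."""
    buckets: List[Dict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_reducers)
    ]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the list of counts collected for a single key."""
    return key, sum(values)


def run_reduce(pairs: Iterable[Tuple[str, int]],
               num_reducers: int = 2) -> Dict[str, int]:
    """Run every reducer bucket in turn and merge the ordered results."""
    totals: Dict[str, int] = {}
    for bucket in shuffle(pairs, num_reducers):
        for key in sorted(bucket):
            name, total = reduce_fn(key, bucket[key])
            totals[name] = total
    return dict(sorted(totals.items()))


# Hypothetical Map output: five screws and two nails at one location,
# plus a nail, a washer, and a screw at another.
intermediate = (
    [("screw", 1)] * 5 + [("nail", 1)] * 2
    + [("nail", 1), ("washer", 1), ("screw", 1)]
)

print(run_reduce(intermediate))  # {'nail': 3, 'screw': 6, 'washer': 1}
```

The sketch routes with a checksum instead of Python’s built-in `hash` because string hashing in Python is randomized per process by default, so separate Map processes could otherwise disagree about which reducer owns a key.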