Posted on September 5, 2012 by dbtube
In order to meet the challenges of Big Data, you must rethink data systems from the ground up. You will discover that some of the most basic ways people manage data in traditional systems like the relational database management system (RDBMS) are too complex for Big Data systems. The simpler, alternative approach is a new paradigm for Big Data. In this article based on chapter 1, author Nathan Marz shows you this approach, which he has dubbed the “lambda architecture.”
This article is based on Big Data, to be published in Fall 2012. This eBook is available through the Manning Early Access Program (MEAP). Download the eBook instantly from manning.com. All print book purchases include free digital formats (PDF, ePub and Kindle). Visit the book’s page for more information about Big Data. This content is being reproduced here by permission from Manning Publications.
Author: Nathan Marz
Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system.
The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
Figure 1 – Lambda Architecture
Everything starts from the “query = function(all data)” equation. Ideally, you could literally run your query functions on the fly on a complete dataset to get the results. Unfortunately, even if this were possible, it would take a huge amount of resources and would be unreasonably expensive. Imagine having to read a petabyte dataset every time you want to answer the query of someone’s current location.
The alternative approach is to precompute the query function. Let’s call the precomputed query function the batch view. Instead of computing the query on the fly, you read the results from the precomputed view. The precomputed view is indexed so that it can be accessed quickly with random reads. This system looks like this:
Figure 2 – Batch layer
In this system, you run a function on all of the data to get the batch view. Then, when you want to know the value for a query function, you use the precomputed results to complete the query rather than scan through all of the data. Because the batch view is indexed, you can get the values you need from it very quickly.
Since this discussion is somewhat abstract, let’s ground it with an example.
Suppose you’re building a web analytics application and you want to query the number of pageviews for a URL on any range of days. If you were computing the query as a function of all the data, you would scan the dataset for pageviews for that URL within that time range and return the count of those results. This, of course, would be enormously expensive because you would have to look at all the pageview data for every query you do.
The batch view approach instead runs a function on all the pageviews to precompute an index from a key of [url, day] to the count of the number of pageviews for that URL for that day. Then, to resolve the query, you retrieve all of the values from that view for all of the days within that time range and sum up the counts to get the result. The precomputed view indexes the data by URL, so you can quickly retrieve all of the data points you need to complete the query.
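The pageview example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: raw pageviews are taken to be (url, day) pairs, an in-memory Counter stands in for an indexed view, and the names build_batch_view and pageviews_in_range are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

def build_batch_view(pageviews):
    """Precompute a count for every (url, day) key from the raw pageviews."""
    return Counter((url, day) for url, day in pageviews)

def pageviews_in_range(batch_view, url, start, end):
    """Sum the precomputed daily counts for url from start to end, inclusive."""
    total = 0
    day = start
    while day <= end:
        total += batch_view.get((url, day), 0)
        day += timedelta(days=1)
    return total

# Hypothetical raw data: one (url, day) record per pageview.
raw_pageviews = [
    ("example.com/a", date(2012, 9, 1)),
    ("example.com/a", date(2012, 9, 1)),
    ("example.com/a", date(2012, 9, 3)),
    ("example.com/b", date(2012, 9, 2)),
]
view = build_batch_view(raw_pageviews)
total_a = pageviews_in_range(view, "example.com/a", date(2012, 9, 1), date(2012, 9, 3))
```

The query touches one precomputed entry per day in the range, no matter how many raw pageview records went into the view.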
You might be thinking that there’s something missing from this approach as described so far. Creating the batch view is clearly going to be a high latency operation because it’s running a function on all of the data you have. By the time it finishes, a lot of new data that’s not represented in the batch views will have been collected, and the queries are going to be out of date by many hours. You’re right, but let’s ignore this issue for the moment because we’ll be able to fix it. Let’s pretend that it’s okay for queries to be out of date by a few hours and continue exploring this idea of precomputing a batch view by running a function on the complete dataset.
Batch layer
The portion of the lambda architecture that precomputes the batch views is called the batch layer. The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset. The master dataset can be thought of as a very large list of records.
Figure 3 – Batch layer
The batch layer needs to be able to do two things to do its job: store an immutable, constantly growing master dataset, and compute arbitrary functions on that dataset. The key word here is arbitrary. If you’re going to precompute views on a dataset, you need to be able to do so for any view and any dataset. There’s a class of systems called batch processing systems that are built to do exactly what the batch layer requires. They are very good at storing immutable, constantly growing datasets, and they expose computational primitives to allow you to compute arbitrary functions on those datasets. Hadoop is the canonical example of a batch processing system, and we will use Hadoop to demonstrate the concepts of the batch layer.
Figure 4 – Batch layer
The simplest form of the batch layer can be represented in pseudo-code like this:
function runBatchLayer():
    while(true):
        recomputeBatchViews()
The batch layer runs in a while(true) loop and continuously recomputes the batch views from scratch. In reality, the batch layer will be a little more involved, but this is the best way to think about it for the purpose of this article.
The nice thing about the batch layer is that it’s so simple to use. Batch computations are written like single-threaded programs yet automatically parallelize across a cluster of machines. This implicit parallelization makes batch layer computations scale to datasets of any size. It’s easy to write robust, highly scalable computations on the batch layer.
Here’s an example of a batch layer computation. Don’t worry about understanding this code; the point is to show what an inherently parallel program looks like.
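// Group the raw pageviews by URL and count the records in each group.
Pipe pipe = new Pipe("counter");
pipe = new GroupBy(pipe, new Fields("url"));
pipe = new Every(
    pipe,
    new Count(new Fields("count")),
    new Fields("url", "count"));
// Connect the source in srcDir and the sink to the pipe, then run the flow.
Flow flow = new FlowConnector().connect(
    new Hfs(new TextLine(new Fields("url")), srcDir),
    new StdoutTap(),
    pipe);
flow.complete();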
This code computes the number of pageviews for every URL, given an input dataset of raw pageviews. What’s interesting about this code is that all of the concurrency challenges of scheduling work, merging results, and dealing with runtime failures (such as machines going down) are handled for you. Because the algorithm is written in this way, it can be automatically distributed on a MapReduce cluster, scaling to however many nodes you have available. So, if you have 10 nodes in your MapReduce cluster, the computation will finish about 10 times faster than if you only had one node! At the end of the computation, the output directory will contain a number of files with the results.
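To see why a count like this parallelizes so naturally, here is a conceptual Python sketch. It is not how Hadoop or Cascading are implemented; the function names are hypothetical and the partitions are processed one after another in a single process. The point is only that each partition can be counted independently and the partial results merged afterwards.

```python
from collections import Counter

def count_partition(urls):
    """Map step: count pageviews per URL within one partition of the data."""
    return Counter(urls)

def merge_counts(partials):
    """Reduce step: add up the partial counts from every partition."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def split(records, n):
    """Deal records into n partitions, as a cluster would spread them over nodes."""
    return [records[i::n] for i in range(n)]

pageviews = ["/a", "/b", "/a", "/c", "/a", "/b"]
# Each partition could be counted on a different machine; here they run in turn.
partials = [count_partition(p) for p in split(pageviews, 3)]
counts = merge_counts(partials)
```

Because the merged result does not depend on how the records were split, the same computation gives the same answer on one node or on many.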
Serving layer
The batch layer emits batch views as the result of its functions. The next step is to load the views somewhere so that they can be queried. This is where the serving layer comes in. For example, your batch layer may precompute a batch view containing the pageview count for every [url, hour] pair. That batch view is essentially just a set of flat files, though: there’s no way to quickly get the value for a particular URL out of that output.
Figure 5 – Serving layer
The serving layer indexes the batch view and loads it up so it can be efficiently queried to get particular values out of the view. The serving layer is a specialized distributed database that loads in batch views, makes them queryable, and continuously swaps in new versions of a batch view as they’re computed by the batch layer. Since the batch layer usually takes at least a few hours to do an update, the serving layer is updated every few hours.
A serving layer database only requires batch updates and random reads. Most notably, it does not need to support random writes. This is a very important point because random writes cause most of the complexity in databases. By not supporting random writes, serving layer databases can be very simple. That simplicity makes them robust, predictable, easy to configure, and easy to operate. ElephantDB, a serving layer database, is only a few thousand lines of code.
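The contract of a serving layer database (batch updates in, random reads out, no random writes) can be illustrated with a toy Python class. This is a single-process sketch with hypothetical names; it is not ElephantDB’s API and it leaves out distribution and replication entirely.

```python
class ServingView:
    """Read-only key/value view that is replaced wholesale, never edited in place."""

    def __init__(self):
        self._index = {}

    def swap_in(self, batch_view):
        """Batch update: build the new index fully, then switch to it in one step."""
        new_index = dict(batch_view)
        self._index = new_index

    def get(self, key, default=0):
        """Random read of a single precomputed value."""
        return self._index.get(key, default)

serving = ServingView()
# First version of the batch view, keyed by (url, hour).
serving.swap_in({("example.com/a", "2012-09-05T10"): 42})
before = serving.get(("example.com/a", "2012-09-05T10"))
# A later batch run produces a complete new version that replaces the old one.
serving.swap_in({("example.com/a", "2012-09-05T10"): 57,
                 ("example.com/b", "2012-09-05T10"): 3})
after = serving.get(("example.com/a", "2012-09-05T10"))
```

There is deliberately no method for changing a single key: the only way to change what readers see is to swap in a whole new view.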
Batch and serving layers satisfy almost all properties
So far you’ve seen how the batch and serving layers can support arbitrary queries on an arbitrary dataset, with the trade-off that queries will be out of date by a few hours. The long update latency is due to the fact that new pieces of data take a few hours to propagate through the batch layer into the serving layer, where they can be queried.
The important thing to notice is that, other than low latency updates, the batch and serving layers satisfy every property desired in a Big Data system. Let’s go through them one by one:
* Robust and fault tolerant: The batch layer handles failover when machines go down by using replication and restarting computation tasks on other machines. The serving layer uses replication under the hood to ensure availability when servers go down. The batch and serving layers are also human fault tolerant, since, when a mistake is made, you can fix your algorithm or remove the bad data and recompute the views from scratch.
* Scalable: Both the batch and serving layers are easily scalable. They can both be implemented as fully distributed systems, so scaling them is as easy as adding new machines.
* General: The architecture described is as general as it gets. You can compute and update arbitrary views of an arbitrary dataset.
* Extensible: Adding a new view is as easy as adding a new function of the master dataset. Since the master dataset can contain arbitrary data, new types of data can be easily added. If you want to tweak a view, you don’t have to worry about supporting multiple versions of the view in the application. You can simply recompute the entire view from scratch.
* Allows ad hoc queries: The batch layer supports ad hoc queries innately. All of the data is conveniently available in one location and you’re able to run any function you want on that data.
* Minimal maintenance: The batch and serving layers consist of very few pieces, yet they generalize arbitrarily. So, you only have to maintain a few pieces for a huge number of applications. As explained before, the serving layer databases are simple because they don’t do random writes. Since a serving layer database has so few moving parts, there’s much less that can go wrong, which makes it easier to maintain.
* Debuggable: You will always have the inputs and outputs of computations run on the batch layer. In a traditional database, an output can replace the original input, for example when incrementing a value. In the batch and serving layers, the input is the master dataset and the output is the views. Likewise, you have the inputs and outputs for all of the intermediate steps. Having the inputs and outputs gives you all the information you need to debug when something goes wrong.
The beauty of the batch and serving layers is that they satisfy almost all of the properties you want with a simple, easy-to-understand approach. There are no concurrency issues to deal with, and it scales trivially. The only property missing is low latency updates. The final layer, the speed layer, fixes this problem.
Speed layer
The serving layer updates whenever the batch layer finishes precomputing a batch view. This means that the only data not represented in the batch views is the data that came in while the precomputation was running. All that’s left to do to have a fully realtime data system (that is, arbitrary functions computed on arbitrary data in real time) is to compensate for those last few hours of data. This is the purpose of the speed layer.
Figure 6 – Speed layer
You can think of the speed layer as similar to the batch layer in that it produces views based on data it receives. There are some key differences, though. One big difference is that, in order to achieve the lowest latencies possible, the speed layer doesn’t look at all the new data at once. Instead, it updates the realtime view as it receives new data, rather than recomputing it like the batch layer does. These are called incremental updates, as opposed to recomputation updates. Another big difference is that the speed layer only produces views on recent data, whereas the batch layer produces views on the entire dataset.
Let’s continue the example of computing the number of pageviews for a URL over a range of time. The speed layer needs to compensate for pageviews that haven’t been incorporated in the batch views, which will be a few hours of pageviews. Like the batch layer, the speed layer maintains a view from a key [url, hour] to a pageview count. Unlike the batch layer, which recomputes that mapping from scratch each time, the speed layer modifies its view as it receives new data.
When it receives a new pageview, it increments the count for the corresponding [url, hour] in the database.
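An incremental update of this kind might look like the following Python sketch. The class and method names are hypothetical, hours are plain integers for brevity, and an in-memory dictionary stands in for a database that supports random writes.

```python
from collections import defaultdict

class RealtimeView:
    """Mutable (url, hour) -> count view, updated one pageview at a time."""

    def __init__(self):
        self.counts = defaultdict(int)

    def record_pageview(self, url, hour):
        """Incremental update: a random write that bumps a single counter."""
        self.counts[(url, hour)] += 1

realtime = RealtimeView()
# Pageviews arrive one by one; nothing is recomputed from scratch.
for url, hour in [("/a", 10), ("/a", 10), ("/b", 10), ("/a", 11)]:
    realtime.record_pageview(url, hour)
```

Each new pageview touches exactly one key, which is what keeps latency low and also what forces the underlying store to handle random writes.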
The speed layer requires databases that support random reads and random writes. Because these databases support random writes, they are orders of magnitude more complex than the databases you use in the serving layer, both in terms of implementation and operation.
The beauty of the lambda architecture is that, once data makes it through the batch layer into the serving layer, the corresponding results in the realtime views are no longer needed. This means you can discard pieces of the realtime view as they’re no longer needed. This is a wonderful result, since the speed layer is way more complex than the batch and serving layers. This property of the lambda architecture is called complexity isolation, meaning that complexity is pushed into a layer whose results are only temporary. If anything ever goes wrong, you can discard the state for the entire speed layer and everything will be back to normal within a few hours. This property greatly limits the potential negative impact of the complexity of the speed layer.
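The following Python sketch illustrates complexity isolation under some simplifying assumptions that are not spelled out in the article: a query is answered by adding the batch view’s value to the realtime view’s value, each hour is covered by exactly one of the two views, and hours are plain integers. The function names are hypothetical.

```python
def expire_realtime(realtime_counts, batch_covered_through_hour):
    """Drop realtime entries for hours the batch view now covers."""
    return {(url, hour): n for (url, hour), n in realtime_counts.items()
            if hour > batch_covered_through_hour}

def query(batch_view, realtime_counts, url, hour):
    """Answer from the batch view plus whatever the realtime view still holds."""
    key = (url, hour)
    return batch_view.get(key, 0) + realtime_counts.get(key, 0)

realtime_counts = {("/a", 9): 4, ("/a", 10): 7, ("/a", 11): 2}
# Assume a batch run has just finished and covers all data through hour 10.
batch_view = {("/a", 9): 4, ("/a", 10): 7}
realtime_counts = expire_realtime(realtime_counts, 10)
```

If the realtime state were lost altogether, queries for hours 9 and 10 would still be answered correctly from the batch view; only hour 11 would be missing until the next batch run absorbs it.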