Posted on September 5, from Dbtube
In order to meet the challenges of Big Data, you must rethink data systems from the ground up. You'll discover that some of the very basic ways people manage data in traditional systems like the relational database management system (RDBMS) are too complex for Big Data systems. The simpler, alternative approach is a new paradigm for Big Data. In this article based on Chapter 1, author Nathan Marz shows you this approach, which he has dubbed the "lambda architecture."
This article is based on Big Data, to be published in Fall 2012. This eBook is available through the Manning Early Access Program (MEAP). Download the eBook instantly from manning.com. All print book purchases include free digital formats (PDF, ePub and Kindle). Visit the book's page for more information on Big Data. This content is being reproduced here by permission from Manning Publications.
Author: Nathan Marz
Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There is no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system.
The lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
Figure 1 – Lambda Architecture
Everything starts from the "query = function(all data)" equation. Ideally, you could literally run your query functions on the fly on the complete dataset to get the results. Unfortunately, even if this were possible, it would take a huge amount of resources to do and would be unreasonably expensive. Imagine having to read a petabyte dataset every time you want to answer the query of someone's current location.
The alternative approach is to precompute the query function. Let's call the precomputed query function the batch view. Instead of computing the query on the fly, you read the results from the precomputed view. The precomputed view is indexed so that it can be accessed quickly with random reads. This system looks like this:
Figure 2 – Batch Layer
In this system, you run a function on all the data to get the batch view. Then, when you want to know the value for a query function, you use the precomputed results to complete the query rather than scan through all of the data. The batch view enables you to get the values you need from it very quickly because it's indexed.
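As a rough illustration of the difference, here is a minimal Python sketch. It is not code from the book: the record layout (person, timestamp, location), the sample data, and the function names `query_on_the_fly` and `build_batch_view` are all assumptions chosen to match the "current location" query mentioned earlier.

```python
# Toy master dataset of (person, timestamp, location) records. Hypothetical data.
ALL_DATA = [
    ("alice", 1, "Paris"),
    ("bob", 2, "Tokyo"),
    ("alice", 3, "Berlin"),
    ("bob", 5, "Osaka"),
    ("alice", 4, "Rome"),
]

def query_on_the_fly(all_data, person):
    """query = function(all data): scan every record on each query."""
    latest = None
    for name, ts, location in all_data:
        if name == person and (latest is None or ts > latest[0]):
            latest = (ts, location)
    return latest[1] if latest else None

def build_batch_view(all_data):
    """Precompute the answer for every person in one pass over the data."""
    newest = {}
    for name, ts, location in all_data:
        if name not in newest or ts > newest[name][0]:
            newest[name] = (ts, location)
    # The resulting dict plays the role of the index: one lookup per query.
    return {name: location for name, (ts, location) in newest.items()}

batch_view = build_batch_view(ALL_DATA)
```

With the view in place, answering a query is a single lookup such as `batch_view["alice"]`, while `query_on_the_fly` has to touch every record each time.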
Since this discussion is somewhat abstract, let's ground it with an example.
Suppose you're building a web analytics application and you want to query the number of pageviews for a URL in any range of days. If you were computing the query as a function of all the data, you would scan the dataset for pageviews for that URL within that time range and return the count of those results. This, of course, would be enormously expensive because you would have to look at all the pageview data for every query you do.
The batch view approach instead runs a function on all the pageviews to precompute an index from a key of [URL, day] to the count of the number of pageviews for that URL on that day. Then, to resolve the query, you retrieve all of the values from that view for all of the days within that time range and sum up the counts to get the result. The precomputed view indexes the data by URL and so can quickly retrieve all of the data points you need to complete the query.
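The pageview example can be sketched in a few lines of Python. This is only an illustration under assumed names and sample data (`PAGEVIEWS`, `build_pageview_view`, `pageviews_in_range`); a real batch view would be computed by a batch processing system and stored in an indexed database rather than an in-memory dictionary.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical raw pageviews: one (url, day) record per view.
PAGEVIEWS = [
    ("/home", date(2012, 9, 1)),
    ("/home", date(2012, 9, 1)),
    ("/home", date(2012, 9, 2)),
    ("/about", date(2012, 9, 2)),
    ("/home", date(2012, 9, 4)),
]

def build_pageview_view(pageviews):
    """Batch step: map each (url, day) key to its pageview count."""
    return Counter(pageviews)

def pageviews_in_range(view, url, start, end):
    """Query step: sum the per-day counts for url over [start, end] inclusive."""
    total = 0
    day = start
    while day <= end:
        total += view.get((url, day), 0)
        day += timedelta(days=1)
    return total

view = build_pageview_view(PAGEVIEWS)
```

A query over a range of days then costs one lookup per day in the range, independent of how many raw pageviews were recorded.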
You might be thinking that there's something missing from this approach as described so far. Creating the batch view is clearly going to be a high latency operation because it's running a function on all of the data you have. By the time it finishes, a lot of new data that's not represented in the batch views will have been collected, and the queries are going to be out of date by many hours. You're right, but let's ignore this issue for the moment because we'll be able to fix it. Let's pretend that it's okay for queries to be out of date by a few hours and continue exploring this idea of precomputing a batch view by running a function on the complete dataset.
Batch Layer
The portion of the lambda architecture that precomputes the batch views is called the batch layer. The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset. The master dataset can be thought of as a very large list of records.
Figure 3 – Batch Layer
The batch layer needs to be able to do two things to do its job: store an immutable, constantly growing master dataset, and compute arbitrary functions on that dataset. The key word is arbitrary. If you're going to precompute views on a dataset, you need to be able to do so for any view and any dataset. There's a class of systems called batch processing systems that are built to do exactly what the batch layer requires. They are very good at storing immutable, constantly growing datasets, and they expose computational primitives to compute arbitrary functions on those datasets. Hadoop is the canonical example of a batch processing system, and we'll use Hadoop to demonstrate the concepts of the batch layer.
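The two requirements can be pictured with a small Python sketch. It is a single-process toy, not a batch processing system, and the class and function names (`MasterDataset`, `compute_view`) are assumptions made for illustration.

```python
class MasterDataset:
    """Append-only list of records: nothing is updated or deleted in place."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def records(self):
        # Hand out a tuple so callers cannot mutate the stored list.
        return tuple(self._records)

def compute_view(dataset, fn):
    """Run an arbitrary function over the whole dataset to produce a view."""
    return fn(dataset.records())

master = MasterDataset()
for url in ["/home", "/about", "/home"]:
    master.append({"url": url})

# Two unrelated views computed from the same master dataset.
total_views = compute_view(master, len)
distinct_urls = compute_view(master, lambda rs: sorted({r["url"] for r in rs}))
```

The point of the sketch is that `compute_view` places no restriction on `fn`: any function of the full record list yields a valid view.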
Figure 4 – Batch Layer
The simplest form of the batch layer can be represented in pseudo-code like this:
function runBatchLayer():
    while (true):
        recomputeBatchViews()
The batch layer runs in a while(true) loop and continuously recomputes the batch views from scratch. In reality, the batch layer is a little more involved, but this is the best way to think about the batch layer for the purpose of this article.
The nice thing about the batch layer is that it's so simple to use. Batch computations are written like single-threaded programs yet automatically parallelize across a cluster of machines. This implicit parallelization makes batch layer computations scale to datasets of any size. It's easy to write robust, highly scalable computations on the batch layer.
Here's an example of a batch layer computation. Don't worry about understanding this code; the point is to show what an inherently parallel program looks like.
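// Start a pipe named "counter" and group its records by URL.
Pipe pipe = new Pipe("counter");
pipe = new GroupBy(pipe, new Fields("url"));
// Count the records in each group, keeping the url and count fields.
pipe = new Every(
    pipe,
    new Count(new Fields("count")),
    new Fields("url", "count"));
// Connect the source, the sink, and the pipe into a flow, then run it.
Flow flow = new FlowConnector().connect(
    new Hfs(new TextLine("url"), srcDir),
    new StdoutTap(),
    pipe);
flow.complete();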
This code computes the number of pageviews for every URL, given an input dataset of raw pageviews. What's interesting about this code is that all of the concurrency challenges of scheduling work, merging results, and dealing with runtime failures (such as machines going down) are handled for you. Because the algorithm is written in this way, it can be automatically distributed on a MapReduce cluster, scaling to however many nodes you have available. So, if you had ten nodes in your MapReduce cluster, the computation would finish about ten times faster than if you only had one node! At the end of the computation, the output directory would contain a number of files with the results.
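The job above needs a Hadoop cluster to run. As a rough single-machine illustration of why a per-URL count parallelizes so easily, the Python sketch below splits some hypothetical pageviews into partitions, counts each partition independently, and merges the partial counts. It does not reflect how Cascading or MapReduce is implemented; the partition count and the function names are assumptions.

```python
from collections import Counter
from functools import reduce

def partition(records, n):
    """Split records into n chunks, one per hypothetical worker."""
    return [records[i::n] for i in range(n)]

def count_partition(urls):
    """Work done independently by each worker: count the URLs in its chunk."""
    return Counter(urls)

def merge_counts(partials):
    """Combine partial counts; addition is associative, so order is irrelevant."""
    return reduce(lambda a, b: a + b, partials, Counter())

raw_pageviews = ["/home", "/about", "/home", "/blog", "/home", "/about"]
partials = [count_partition(chunk) for chunk in partition(raw_pageviews, 3)]
url_counts = merge_counts(partials)
```

Because each `count_partition` call depends only on its own chunk, the calls could run on separate machines, and a failed chunk could simply be recounted without affecting the others.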
Serving Layer
The batch layer emits batch views as the result of its functions. The next step is to load the views somewhere so that they can be queried. This is where the serving layer comes in. For example, your batch layer could precompute a batch view containing the pageview count for every [URL, hour] pair. That batch view is essentially just a set of flat files, though: there's no way to quickly get the value for a particular URL out of that output.
Figure 5 – Serving Layer
The serving layer indexes the batch view and loads it up so it can be efficiently queried to get particular values out of the view. The serving layer is a specialized distributed database that loads in batch views, makes them queryable, and continuously swaps in new versions of a batch view as they're computed by the batch layer. Since the batch layer usually takes at least a few hours to do an update, the serving layer is updated every few hours.
A serving layer database only requires batch updates and random reads. Most notably, it does not need to support random writes. This is a very important point because random writes cause most of the complexity in databases. By not supporting random writes, serving layer databases can be very simple. That simplicity makes them robust, predictable, easy to configure, and easy to operate. ElephantDB, a serving layer database, is only a few thousand lines of code.
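The interface this implies is small. The following Python sketch is a hypothetical in-memory stand-in, not ElephantDB's API: the class name `ServingLayerStore` and its methods are assumptions. It offers exactly two operations, a bulk swap and a keyed read, and deliberately has no per-key write.

```python
class ServingLayerStore:
    """Read-only key/value store whose contents change only by bulk swap."""

    def __init__(self):
        self._view = {}
        self._version = 0

    def swap_in(self, new_view):
        """Batch update: replace the whole view with a freshly computed one."""
        self._view = dict(new_view)  # copy so later changes to new_view don't leak in
        self._version += 1

    def get(self, key, default=0):
        """Random read: look up one key in the current view."""
        return self._view.get(key, default)

    @property
    def version(self):
        return self._version

store = ServingLayerStore()
store.swap_in({("/home", 10): 5, ("/about", 10): 1})
first = store.get(("/home", 10))
# A later batch run produces a newer view that replaces the old one wholesale.
store.swap_in({("/home", 10): 5, ("/home", 11): 7, ("/about", 10): 1})
```

Since readers only ever see a complete view, there is no partially updated state to reason about, which is one way to read the claim that dropping random writes removes most of the complexity.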
Batch and serving layers satisfy almost all properties
So far you've seen how the batch and serving layers can support arbitrary queries on an arbitrary dataset with the tradeoff that queries will be out of date by a few hours. The long update latency is due to the fact that new pieces of data take a few hours to propagate through the batch layer into the serving layer, where they can be queried.
The important thing to notice is that, other than low latency updates, the batch and serving layers satisfy every property desired in a Big Data system. Let's go through them one by one:
* Robust and fault tolerant: the batch layer handles failover when machines go down by using replication and restarting computation tasks on other machines. The serving layer uses replication under the hood to ensure availability when servers go down. The batch and serving layers are also human fault tolerant, since, when a mistake is made, you can fix your algorithm or remove the bad data and recompute the views from scratch.
* Scalable: both the batch layer and the serving layer are easily scalable. They can both be implemented as fully distributed systems, whereupon scaling them is as easy as adding new machines.
* General: the architecture described is as general as it gets. You can compute and update arbitrary views of an arbitrary dataset.
* Extensible: adding a new view is as easy as adding a new function of the master dataset. Since the master dataset can contain arbitrary data, new types of data can be easily added. If you want to tweak a view, you don't have to worry about supporting multiple versions of the view in the application. You can simply recompute the entire view from scratch.
* Allows ad hoc queries: the batch layer supports ad hoc queries innately. All of the data is conveniently available in one location and you're able to run any function you want on that data.
* Minimal maintenance: the batch and serving layers consist of very few pieces, yet they generalize arbitrarily. So, you only have to maintain a few pieces for a huge number of applications. As explained before, the serving layer databases are simple because they don't do random writes. Since a serving layer database has so few moving parts, there's lots less that can go wrong. As a consequence, it's much less likely that anything will go wrong with a serving layer database, so they are easier to maintain.
* Debuggable: you always have the inputs and outputs of computations run on the batch layer. In a traditional database, an output can replace the original input, for example when incrementing a value. In the batch and serving layers, the input is the master dataset and the output is the views. Likewise, you have the inputs and outputs for all of the intermediate steps. Having the inputs and outputs gives you all the information you need to debug when something goes wrong.
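Two of the properties above, human fault tolerance and debuggability, both follow from keeping the inputs and recomputing the outputs. The Python sketch below illustrates this with made-up data; the record layout and the `compute_view` function are assumptions, not part of the book's code.

```python
from collections import Counter

def compute_view(master_dataset):
    """Batch function: pageview count per URL, recomputed from scratch."""
    return Counter(record["url"] for record in master_dataset)

# Hypothetical master dataset; the record with url=None came from a buggy client.
master_dataset = [
    {"id": 1, "url": "/home"},
    {"id": 2, "url": None},
    {"id": 3, "url": "/home"},
    {"id": 4, "url": "/about"},
]

# The view built from the raw inputs shows the symptom of the bad record.
bad_view = compute_view(master_dataset)

# Fix: drop the bad record, then rebuild the view from the remaining inputs.
cleaned = [r for r in master_dataset if r["url"] is not None]
good_view = compute_view(cleaned)
```

Nothing in the view has to be patched by hand: because the view is a pure function of the inputs, correcting the inputs (or the function) and rerunning is enough.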
The beauty of the batch and serving layers is that they satisfy almost all of the properties you want with a simple and easy to understand approach. There are no concurrency issues to deal with, and it scales trivially. The only property missing is low latency updates. The final layer, the speed layer, fixes this problem.
Speed Layer
The serving layer updates whenever the batch layer finishes precomputing a batch view. This means that the only data not represented in the batch views is the data that came in while the precomputation was running. All that's left to do to have a fully realtime data system (that is, arbitrary functions computed on arbitrary data in real time) is to compensate for those last few hours of data. This is the purpose of the speed layer.
Figure 6 – Speed Layer
You can think of the speed layer as similar to the batch layer in that it produces views based on data it receives. There are some key differences, though. One big difference is that, in order to achieve the fastest latencies possible, the speed layer doesn't look at all the new data at once. Instead, it updates the realtime view as it receives new data, rather than recomputing the view like the batch layer does. This is called incremental updates as opposed to recomputation updates. Another big difference is that the speed layer only produces views on recent data, whereas the batch layer produces views on the entire dataset.
Let's continue the example of computing the number of pageviews for a URL over a range of time. The speed layer needs to compensate for pageviews that haven't been incorporated in the batch views, which will be a few hours of pageviews. Like the batch layer, the speed layer maintains a view from a key [URL, hour] to a pageview count. Unlike the batch layer, which recomputes that mapping from scratch each time, the speed layer modifies its view as it receives new data.
When it receives a new pageview, it increments the count for the corresponding [URL, hour] in the database.
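An incremental update of this kind can be sketched in Python as follows. The in-memory dictionary stands in for a database that supports random reads and writes, and the class name `RealtimeView` and its methods are assumptions made for illustration.

```python
from collections import defaultdict

class RealtimeView:
    """Mutable view from (url, hour) to pageview count, updated per event."""

    def __init__(self):
        self._counts = defaultdict(int)

    def on_pageview(self, url, hour):
        # Incremental update: one random read plus one random write for a key.
        self._counts[(url, hour)] += 1

    def get(self, url, hour):
        return self._counts.get((url, hour), 0)

realtime = RealtimeView()
for url, hour in [("/home", 14), ("/home", 14), ("/about", 14), ("/home", 15)]:
    realtime.on_pageview(url, hour)
```

Each event touches a single key, so the update latency does not depend on how much data has already been seen. The price is that the old value is overwritten, which is exactly the in-place mutation the batch layer avoids.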
The speed layer requires databases that support random reads and random writes. Because these databases support random writes, they are orders of magnitude more complex than the databases you use in the serving layer, both in terms of implementation and operation.
The beauty of the lambda architecture is that, once data makes it through the batch layer into the serving layer, the corresponding results in the realtime views are no longer needed. This means you can discard pieces of the realtime view as they're no longer needed. This is a wonderful result, since the speed layer is way more complex than the batch and serving layers. This property of the lambda architecture is called complexity isolation, meaning that complexity is pushed into a layer whose results are only temporary. If anything ever goes wrong, you can discard the state for the entire speed layer and everything will be back to normal within a few hours. This greatly limits the potential negative impact of the complexity of the speed layer.
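The following Python sketch illustrates complexity isolation with made-up numbers. How queries combine the two views is not spelled out in this section, so the split used here (hours before a `batch_horizon` are read from the batch view, later hours from the realtime view) is an assumption, as are the function names.

```python
def expire_realtime(realtime_view, batch_horizon):
    """Drop realtime entries for hours the batch view now covers."""
    return {k: v for k, v in realtime_view.items() if k[1] >= batch_horizon}

def query(batch_view, realtime_view, batch_horizon, url, hours):
    """Read hours before the horizon from the batch view, the rest from realtime."""
    total = 0
    for hour in hours:
        source = batch_view if hour < batch_horizon else realtime_view
        total += source.get((url, hour), 0)
    return total

# Hypothetical state: the batch view covers every hour before 12.
batch_view = {("/home", 10): 5, ("/home", 11): 7}
realtime_view = {("/home", 11): 7, ("/home", 12): 3, ("/home", 13): 1}

before = query(batch_view, realtime_view, 12, "/home", range(10, 14))
# Hour 11 is now served by the batch view, so its realtime entry can go.
realtime_view = expire_realtime(realtime_view, 12)
after = query(batch_view, realtime_view, 12, "/home", range(10, 14))

# If the speed layer's state is lost entirely, only the recent hours are affected.
degraded = query(batch_view, {}, 12, "/home", range(10, 14))
```

Discarding the covered entries leaves query results unchanged, and throwing away the whole realtime view only loses the counts for hours the batch layer has not yet processed.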