4.1 Storage requirements for the master dataset
To determine the requirements for data storage, you must consider how your data will be written and how it will be read. The role of the batch layer within the Lambda Architecture affects both.
In chapter 2 we emphasized the key properties of data: data is immutable and eternally true. Consequently, each piece of your data will be written once and only once. There is never a need to alter your data; the only write operation will be to add a new data unit to your dataset. The storage solution must therefore be optimized to handle a large, constantly growing set of data.
The batch layer is also responsible for computing functions on the dataset to produce the batch views. This means the batch layer storage system needs to be good at reading lots of data at once. In particular, random access to individual pieces of data is not required.
With this "write once, bulk read many times" paradigm in mind, we can create a checklist of requirements for the data Storage.
4.2 Choosing a storage solution for the batch layer
With the requirements checklist in hand, you can now consider options for batch layer storage. With such loose requirements (not even needing random access to the data), it might seem that you could use pretty much any distributed database for the master dataset.
4.2.1 Using a key/value store for the master dataset
We haven't discussed distributed key/value stores yet, but you can essentially think of them as giant persistent hashmaps that are distributed among many machines. If you're storing a master dataset on a key/value store, the first thing you have to figure out is what the keys should be and what the values should be.
What a value should be is obvious: it's a piece of data you want to store. But what should a key be? There's no natural key in the data model, nor is one necessary, because the data is meant to be consumed in bulk. So you immediately hit an impedance mismatch between the data model and how key/value stores work. The only really viable idea is to generate a UUID to use as the key.
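To make the mismatch concrete, here's a minimal sketch of appending a data unit to a generic key/value store. The KeyValueStore interface and KeyValueMasterDatasetWriter class are hypothetical, not any particular product's API; the only real work the writer does is invent a key that the data model never needed.

```java
import java.util.UUID;

// A hypothetical client interface for a generic key/value store; this is
// not any particular product's API.
interface KeyValueStore {
    void put(String key, byte[] value);
}

class KeyValueMasterDatasetWriter {
    private final KeyValueStore store;

    KeyValueMasterDatasetWriter(KeyValueStore store) {
        this.store = store;
    }

    // The data unit has no natural key, so a random UUID is generated
    // purely to satisfy the store's key/value data model.
    void append(byte[] serializedDataUnit) {
        store.put(UUID.randomUUID().toString(), serializedDataUnit);
    }
}
```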
But that's only the start of the problems with using key/value stores for a master dataset. Because a key/value store needs fine-grained access to key/value pairs to do random reads and writes, you can't compress multiple key/value pairs together. So you're severely limited in tuning the trade-off between storage costs and processing costs.
Key/value stores are meant to be used as mutable stores, which is a problem when enforcing immutability is so crucial for the master dataset. Unless you modify the code of the key/value store you're using, you typically can't disable the ability to modify existing key/value pairs.
The biggest problem, though, is that a key/value store has a lot of things you don't need: random reads, random writes, and all the machinery behind making those work. In fact, most of the implementation of a key/value store is dedicated to features you don't need at all. This means the tool is enormously more complex than it needs to be to meet your requirements, making it much more likely you'll have a problem with it. Additionally, a key/value store indexes your data and provides unneeded services, which will increase your storage costs and lower your performance when reading and writing data.
4.2.2 Distributed filesystems
Files are sequences of bytes, and the most efficient way to consume them is by scanning through them. They're stored sequentially on disk (sometimes they're split into blocks, but reading and writing is still essentially sequential). You have full control over the bytes of a file, and you have full freedom to compress them however you want. Unlike a key/value store, a filesystem gives you exactly what you need and no more, while also not limiting your ability to tune storage cost versus processing cost. On top of that, filesystems implement fine-grained permissions systems, which are perfect for enforcing immutability.
The problem with a regular filesystem is that it exists on just a single machine, so you can only scale to the storage limits and processing power of that one machine. But it turns out there's a class of technologies called distributed filesystems that are quite similar to the filesystems you're familiar with, except they spread their storage across a cluster of computers. They scale by adding more machines to the cluster. Distributed filesystems are designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible.
There are some differences between distributed filesystems and regular filesystems. The operations you can do with a distributed filesystem are often more limited than what you can do with a regular filesystem. For instance, you may not be able to write to the middle of a file, or even modify a file at all after creation. Oftentimes having many small files is inefficient, so you want to keep your file sizes relatively large to make proper use of the distributed filesystem (the details depend on the tool, but a good rule of thumb is to keep files at least as large as the filesystem's block size).
4.3 How distributed filesystems work
HDFS and Hadoop MapReduce are the two prongs of the Hadoop project: a Java framework for distributed storage and distributed processing of large amounts of data. Hadoop is deployed across multiple servers, typically called a cluster, and HDFS is a distributed and scalable filesystem that manages how data is stored across the cluster. Hadoop is a project of significant size and depth, so we'll only provide a high-level description.
In an HDFS cluster, there are two types of nodes: a single namenode and multiple datanodes. When you upload a file to HDFS, the file is first chunked into blocks of a fixed size, typically between 64 MB and 256 MB. Each block is then replicated across multiple datanodes (typically three) that are chosen at random. The namenode keeps track of the file-to-block mapping and where each block is located. Distributing a file across many nodes in this way allows it to be easily processed in parallel. When a program needs to access a file stored in HDFS, it contacts the namenode to determine which datanodes host the file's contents.
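As a concrete illustration, here's a minimal sketch of writing and reading a file through the Hadoop FileSystem Java API. The path and contents are made up for the example; the point is that the client never deals with blocks or datanodes directly, because HDFS handles the chunking, replication, and namenode lookups behind this interface.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS in the configuration points the client at the namenode.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/example/data.txt");

        // Writing: the client streams bytes, and HDFS transparently chunks
        // them into blocks and replicates each block across datanodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("one data unit per line\n");
        }

        // Reading: the client asks the namenode which datanodes hold each
        // block, then streams the block contents directly from those nodes.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[4096];
            int n = in.read(buffer);
            if (n > 0) {
                System.out.print(new String(buffer, 0, n));
            }
        }
    }
}
```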
Additionally, with each block replicated across multiple nodes, your data remains available even when individual nodes are offline. Of course, there are limits to this fault tolerance: if you have a replication factor of three, three nodes go down at once, and you're storing millions of blocks, chances are that some blocks happened to exist on exactly those three nodes and will be unavailable. For example, on a 100-node cluster there are only about 161,700 distinct sets of three nodes, so a dataset with millions of blocks will, on average, have several blocks assigned to any particular set of three that fails.
4.4 Storing a master dataset with a distributed filesystem
Distributed filesystems vary in the kinds of operations they permit. Some distributed filesystems let you modify existing files, and others don't. Some let you append to existing files, and some don't have that feature.
Clearly, with unmodifiable files you can't store the entire master dataset in a single file. What you can do instead is spread the master dataset among many files, and store all those files in the same folder. Each file would contain many serialized data objects.
4.5 Vertical Partitioning
Although the batch layer is built to run functions on the entire dataset, many computations don't require looking at all the data. For example, you may have a computation that only requires information collected during a recent window of time. The batch storage should allow you to partition your data so that a function only accesses data relevant to its computation. This process is called vertical partitioning, and it can greatly contribute to making the batch layer more efficient. It's not strictly necessary for the batch layer, because the batch layer is capable of looking at all the data at once and filtering out what it doesn't need, but vertical partitioning enables large performance gains, so it's important to know how to use the technique.
Vertically partitioning data on a distributed filesystem can be done by sorting your data into separate folders. For example, suppose you're storing login information on a distributed filesystem. Each login contains a username, IP address, and timestamp. To vertically partition by day, you can create a separate folder for each day of data. Each day folder would have many files containing the logins for that day.
Now if you only want to look at a particular subset of your dataset, you can just look at the files in those particular folders and ignore the other files.
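Here's a minimal sketch of that idea with the Hadoop FileSystem API, assuming the day-folder layout just described; the folder and file names, and the login record itself, are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DayPartitionedLogins {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Each day of logins lives in its own folder.
        Path dayFolder = new Path("/logins/2012-10-25");
        fs.mkdirs(dayFolder);
        try (FSDataOutputStream out = fs.create(new Path(dayFolder, "logins-0.txt"))) {
            out.writeBytes("alex\t192.168.12.125\tThu Oct 25 22:33:20 -0700 2012\n");
        }

        // A computation over a single day only has to enumerate the files in
        // that day's folder; every other folder is simply never read.
        for (FileStatus status : fs.listStatus(dayFolder)) {
            System.out.println(status.getPath());
        }
    }
}
```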
4.6 Low-level nature of distributed filesystems
While distributed filesystems provide the storage and fault-tolerance properties you need for storing a master dataset, you'll find that using their APIs directly is too low-level for the tasks you need to run. We'll illustrate this using regular Unix filesystem operations and show the difficulties you can run into when doing tasks like appending to a master dataset or vertically partitioning a master dataset.
Let's start with appending to a master dataset. Suppose your master dataset is in the folder /master and you have a folder of data in /new-data that you want to put inside your master dataset. The naive approach is simply to move every file from /new-data into /master, as in the sketch below. Unfortunately, this approach has serious problems. If the master dataset folder contains any files with the same names, the mv operation will fail. To do it correctly, you have to rename each file to a random filename and so avoid conflicts.
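Here's a minimal sketch of that naive move, written against the Hadoop FileSystem Java API rather than shell commands; rename plays the role of mv, and the /new-data and /master paths are the ones from the discussion above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NaiveAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Naively move every file from /new-data into /master. rename() fails
        // if a file with the same name already exists in the target folder.
        for (FileStatus status : fs.listStatus(new Path("/new-data"))) {
            Path source = status.getPath();
            fs.rename(source, new Path("/master", source.getName()));
        }
    }
}
```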
There's another problem. One of the core requirements of storage for the master dataset is the ability to tune the trade-off between storage costs and processing costs. When storing a master dataset on a distributed filesystem, you choose a file format and compression format that makes the trade-off you desire. What if the files in /new-data are in a different format than those in /master? Then the mv operation won't work at all; you instead need to copy the records out of /new-data and into a brand new file that uses the file format of /master.
Let's take a look at doing the same operation but with a vertically partitioned master dataset.
Just putting the files from /new-data into the root of /master is wrong, because it wouldn't respect the vertical partitioning of /master. Either the append operation should be disallowed, because /new-data isn't correctly vertically partitioned, or /new-data should be vertically partitioned as part of the append operation, as in the sketch below. But when you're just using a files-and-folders API directly, it's very easy to make a mistake and break the vertical partitioning constraints of a dataset.
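For a sense of the extra bookkeeping involved, here's a rough sketch of an append that tries to respect a day-based partitioning, assuming the files in /new-data are already organized into matching day folders; the paths and layout are illustrative.

```java
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionedAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Assume /new-data contains one subfolder per day, mirroring the
        // day-based vertical partitioning of /master.
        for (FileStatus day : fs.listStatus(new Path("/new-data"))) {
            Path targetDay = new Path("/master", day.getPath().getName());
            fs.mkdirs(targetDay); // create the partition folder if it's new

            for (FileStatus file : fs.listStatus(day.getPath())) {
                // Move each file under a random name so it can't clobber
                // files already in the partition.
                fs.rename(file.getPath(),
                        new Path(targetDay, UUID.randomUUID().toString()));
            }
        }
    }
}
```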
All the operations and checks that need to happen to get these operations working correctly strongly indicate that files and folders are too low-level an abstraction for manipulating datasets.
4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem
Let's now look at how you can make use of a distributed filesystem to store the master dataset for SuperWebAnalytics.com.
When you last left this project, you had created a graph schema to represent the dataset. Every edge and property is represented via its own independent DataUnit.
A key observation is that a graph schema provides a natural vertical partitioning of the data. You can store all edge and property types in their own folders. Vertically partitioning the data this way lets you efficiently run computations that only look at certain properties and edges.
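Purely as an illustration, such a layout could be expressed as folders on the distributed filesystem, with one folder per edge or property type; the folder names below are hypothetical.

```java
import org.apache.hadoop.fs.Path;

public class GraphSchemaLayout {
    // Hypothetical root of the SuperWebAnalytics.com master dataset.
    static final Path ROOT = new Path("/data");

    // Each edge and property type gets its own folder, which is what gives
    // the graph schema its natural vertical partitioning.
    static Path folderFor(String kind, String type) {
        return new Path(new Path(ROOT, kind), type);
    }

    public static void main(String[] args) {
        System.out.println(folderFor("edges", "page-view"));
        System.out.println(folderFor("properties", "person-location"));
    }
}
```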