What if I need to process data at a much higher speed?
If I were a Weibo-like microblogging company, I wouldn't want to show the hot posts of the last 24 hours; I'd want a constantly changing hot list, updated with a delay of under a minute. The approaches above can't handle that. So another computational model was developed: stream computing, and Storm is the most popular streaming platform. The idea of stream computing is: if you want near-real-time updates, why not process the data the moment it flows in? Take word-frequency counting again: if my data stream is words arriving one by one, I simply count them as they flow past me. Stream computing is great, with essentially no latency, but its weakness is inflexibility: you must know in advance what you want to count. After all, once the data has flowed past, it's gone, and anything you didn't count can't be computed retroactively. So it's a good thing, but it can't replace data warehouses and batch systems.
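The running word count above can be sketched in a few lines of plain Python (this illustrates the idea only, not Storm's actual topology API): state is updated as each item arrives, so a "hot list" is available at any moment, but only for the metric chosen in advance.

```python
from collections import Counter

def stream_word_count(word_stream, top_n=3):
    """Count words as they flow past, keeping a running total.

    Unlike a batch job, a snapshot of the counts is available after
    every single item -- but only for what we decided to track upfront.
    """
    counts = Counter()
    for word in word_stream:
        counts[word] += 1                 # update state as each word arrives
        yield counts.most_common(top_n)   # a continuously changing "hot list"

# A toy stream of hashtags arriving one by one:
for snapshot in stream_word_count(["cat", "dog", "cat", "bird", "cat", "dog"]):
    pass
print(snapshot)  # -> [('cat', 3), ('dog', 2), ('bird', 1)]
```

Note that if a word flowed past before we started counting, it is simply lost; that is exactly the inflexibility described above.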
There is also a somewhat independent class of component: the KV store, such as Cassandra, HBase, MongoDB, and many, many others (more than you can imagine). A KV store means: I have a pile of key-value pairs, and I can quickly fetch the data bound to a given key. For example, given your ID number, I can fetch your identity data. This could be done with MapReduce too, but it might scan the entire data set. A KV store is dedicated to this one operation, and all storage and retrieval is optimized for it. Finding one ID number in several petabytes of data may take only a fraction of a second. This has vastly sped up some specialized operations at big data companies. For example, say I have a web page that looks up an order by its order number, and the whole site's orders can't fit in a single-machine database; then I'd consider storing them in a KV store. The KV store's philosophy is: it basically can't handle complex computation, mostly can't do joins, maybe can't do aggregation, and offers no strong consistency guarantee (with data distributed across machines, you may read different results each time, so you can't handle operations like bank transfers that require strong consistency). But boy, is it fast. Extremely fast.
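Why is a key lookup so fast compared to a scan? Here is a minimal toy sketch (class and method names are my own invention, not any real store's API): keys are hashed to pick which "machine" holds the value, so a lookup touches exactly one shard instead of scanning everything.

```python
import hashlib

class TinyKVStore:
    """Toy sketch of a distributed KV store.

    Each key is hashed to choose a shard (here each shard is just a
    dict standing in for a machine).  A get() touches one shard only,
    instead of scanning the whole data set.
    """

    def __init__(self, num_shards=4):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # Hash the key to deterministically pick a shard.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

store = TinyKVStore()
store.put("order:1001", {"item": "laptop", "qty": 1})
print(store.get("order:1001"))  # -> {'item': 'laptop', 'qty': 1}
```

Real systems add replication, persistence, and failure handling on top of this idea, which is also where the consistency trade-offs mentioned above come from: replicas on different machines may briefly disagree.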
Each KV store design makes different trade-offs: some are faster, some hold more data, some support more complex operations. There's bound to be one that suits you.
In addition, there are some more specialized systems and components. For example, Mahout is a distributed machine learning library, Protobuf is a data interchange format and library, ZooKeeper is a highly consistent distributed coordination system, and so on.
With so many messy tools all running on the same cluster, everyone needs to work together respectfully and in an orderly way. So another important component is the scheduling system, and YARN is the most popular one now. You can think of it as a central manager, like your mom supervising in the kitchen: hey, your sister has finished chopping the vegetables, so you can take the knife and go butcher the chicken. As long as everyone obeys your mom's assignments, everyone can cook together happily.
You can think of the big data ecosystem as a kitchen-tool ecosystem. To make different dishes, Chinese cuisine, Japanese cuisine, French cuisine, you need a variety of different tools. And as the guests' demands keep getting more complicated, your kitchen utensils keep being invented; no single one can handle every situation, so the whole thing grows more and more complex.
Lao Li Shares: The Big Data Ecosystem, Part 2