Project needs led me to learn Hadoop, and like every overheated technology, words such as "big data" and "massive scale" are flying all over the internet. Hadoop is a very well-designed distributed programming framework, and right now there is no substitute of comparable weight. My work also touches an internally used framework that encapsulates and customizes Hadoop to better fit business requirements. I recently wanted to write up some of my experience learning and using Hadoop, but seeing how flooded the internet already is with such articles, I don't think another set of notes adds much value. It's better to calm down and look at what problems Hadoop is not suited to solve.
The diagram above is Hadoop's architecture diagram. Map and reduce are the two most basic processing stages; before them come the input data format definition and data splitting, and after them the output data format definition. In between the two you can also implement combine, a local reduce operation, and partition, the policy that redirects mapper output to reducers. Additional customizations and enhancements (a sketch of where these hooks plug in follows the list) can include:
Enhancing input and output data handling: for example, a data set management layer can unify and merge all kinds of data sets, and can even add a first-pass data filtering operation, since in practice the core data sources span a wide range of business systems;
Extending the data partitioning strategy: we often need data that shares certain business characteristics to be processed together;
Extending combine and partition: mainly a set of policy implementations that recur across many Hadoop jobs;
Extending the monitoring tools: I have also seen custom monitoring tools built inside other companies;
Enhancing the communication protocols and the file system: for the file system in particular, it works best when the tools feel close to local commands, and examples of this can be found on the internet;
Further encapsulating the data access programming interface: mainly to fit the business and make it more convenient to use;
......
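To make these hook points concrete, here is a minimal sketch of a Hadoop job in the org.apache.hadoop.mapreduce API, showing where the input/output formats, the combiner, and the partitioner plug in around map and reduce. Everything specific in it, the class names, the "businessPrefix|id" key layout, and the idea of routing records that share a business prefix to the same reducer, is an assumption made for illustration, not a description of any particular internal framework.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class BusinessKeyCount {

    // Hypothetical mapper: assumes each input line starts with a business key
    // such as "order|1234", followed by a tab and the rest of the record.
    public static class KeyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            outKey.set(line.toString().split("\t", 2)[0]);
            context.write(outKey, ONE);
        }
    }

    // Reducer, also reused as the combiner (a local reduce on the map side).
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    // Hypothetical partitioner: keys that share the same business prefix
    // (the part before '|') land on the same reducer, so records with the
    // same business characteristic are processed together.
    public static class BusinessKeyPartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            String prefix = key.toString().split("\\|", 2)[0];
            return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "business key count");
        job.setJarByClass(BusinessKeyCount.class);

        // Input/output format definitions: the stages before map and after reduce.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(KeyMapper.class);
        job.setCombinerClass(SumReducer.class);                 // combine = local reduce
        job.setPartitionerClass(BusinessKeyPartitioner.class);  // mapper-output routing policy
        job.setReducerClass(SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The partitioner is the piece that addresses the "process data with the same business characteristics together" point: the default hash partitioning would scatter "order|1" and "order|2" across reducers, while this one keeps the whole "order" family on a single reducer.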
These customizations also reflect, to some extent, the limitations of Hadoop in actual use or in its original design, but those are minor problems that can be fixed through customization and extension. There are, however, some problems that Hadoop is inherently unable to solve, or for which Hadoop is simply not the right tool.
1. The most important point: a problem Hadoop can solve must fit MapReduce. This carries two specific requirements. First, the problem must be splittable; some problems look big, but splitting them is hard. Second, the sub-problems must be independent. Many Hadoop textbooks use the Fibonacci sequence as a counterexample: each step's data is not independent and must depend on the results of the previous two steps. In other words, if a big problem cannot be divided into small, independent problems, there is simply no way to use Hadoop.
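A tiny sketch of why the Fibonacci example fails the independence test: each value can only be computed after the two before it, so the work forms a strict chain and there are no disjoint, independent pieces to hand out to map tasks. Contrast this with something like word counting, where every line can be mapped without looking at any other line.

```java
public class Fibonacci {
    // Each step depends on the results of the two preceding steps, so the
    // computation is a strict chain; there is no way to split the sequence
    // into independent sub-problems for separate map tasks.
    static long fib(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long next = a + b; // needs both previous results
            a = b;
            b = next;
        }
        return a;
    }
}
```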
2. The data structure does not fit a key-value pattern. In Hadoop in Action, the author compares Hadoop with relational databases: structured data queries are not a good fit for Hadoop (even though tools like Hive simulate ANSI SQL syntax). Even so, the query overhead does not compare well with an ordinary relational database, a complex combination of conditions is nowhere near as powerful as in SQL, and writing the equivalent code by hand is time-consuming.
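To illustrate how verbose the hand-written route is, here is a rough sketch of just the map side of one structured query in raw MapReduce. The table name, the tab-separated field layout, and the salary threshold are made-up assumptions; a real job would also need a summing reducer and a driver like the one sketched earlier, plus packaging and submission to the cluster.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In SQL the whole job is one line:
//   SELECT dept, COUNT(*) FROM employees WHERE salary > 5000 GROUP BY dept;
// In raw MapReduce the same query needs this mapper, a summing reducer,
// a driver, a jar, and a job submission.
public class DeptCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text dept = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed tab-separated layout: name \t dept \t salary
        String[] fields = line.toString().split("\t");
        if (fields.length == 3 && Long.parseLong(fields[2]) > 5000) { // WHERE salary > 5000
            dept.set(fields[1]);                                      // GROUP BY dept
            context.write(dept, ONE);
        }
    }
}
```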
3. Hadoop is not suited to handling large numbers of small files. This is really a consequence of the NameNode's limitations: when files are too small, the metadata stored on the NameNode takes up a disproportionately large share of space, and the memory or disk cost becomes very high. Only when a task processes reasonably large files do the JVM startup and initialization, the cleanup after the task completes, and framework overhead such as the shuffle account for a much smaller share of the total time; otherwise processing throughput drops. (Someone ran an experiment on this; see: Link)
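A back-of-envelope calculation makes the NameNode cost concrete. A commonly cited rule of thumb is that each file, directory, and block object costs the NameNode roughly 150 bytes of heap; the file counts and the 128 MB block size below are made-up numbers for illustration.

```
10,000,000 files of ~1 MB, one block per file
  -> ~20,000,000 NameNode objects (one file object + one block object each)
  -> 20,000,000 x ~150 bytes ≈ 3 GB of NameNode heap, for only ~10 TB of data

the same ~10 TB stored as 1 GB files (128 MB blocks)
  -> ~10,000 file objects + ~80,000 block objects ≈ 90,000 objects
  -> ~90,000 x ~150 bytes ≈ 13 MB of NameNode heap
```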
4. Hadoop is not suited to tasks that require a timely response, or to handling highly concurrent requests. This is also easy to understand: with the JVM overhead, initialization and preparation time, and so on, even a job that does nothing at all can take a few minutes to run from start to finish.
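As a side note on the JVM cost: in Hadoop 1.x the per-task JVM can be reused across tasks of the same job via the mapred.job.reuse.jvm.num.tasks property, which trims some of the repeated startup cost, but it does nothing about job scheduling and initialization latency, so it does not turn Hadoop into a low-latency system. A minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;

public class JvmReuseExample {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Hadoop 1.x property: -1 lets one task JVM be reused for an unlimited
        // number of tasks of the same job instead of forking a fresh JVM per task.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        return conf;
    }
}
```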
5. Use Hadoop for real "big data", where scale-up genuinely has to turn into scale-out. With two or three shabby little machines and a dozen or so gigabytes of data, Hadoop is clumsy. An asynchronous system is also inherently less intuitive than a synchronous one, which is obvious, so on the whole the maintenance cost is not low.