Google's success rests largely on its powerful data storage and computing infrastructure. GFS and Bigtable largely freed it from expensive manual operations and maintenance while making better use of its machines; MapReduce lets it see the results of new search strategies quickly. Because of this, many imitators have sprung up at home and abroad, all styling themselves "high-tech" enterprises and frequently wearing the "cloud computing" label. Building a complete Google-style storage, computing, and query stack from scratch is extremely complex, and only a handful of giants can pull it off. Hadoop is an open-source, simplified implementation of that stack, and it has helped a great many technology companies. Some time ago Yahoo brought Hadoop's founder on board, which gave the project a major push forward and a leap in performance.
Hadoop mainly comprises HDFS (a distributed file system, corresponding to GFS), MapReduce (a distributed computing framework), and HBase (a distributed query/table system, corresponding to Bigtable); of these, HDFS and MapReduce are the more mature (a minimal word-count example of the MapReduce programming model appears after the list of design goals below). Hadoop also includes auxiliary systems such as the distributed lock service ZooKeeper, which corresponds to Google's Chubby. The design goals of the system are as follows:
1. Simplified operations and maintenance: in a large cluster, machine downtime, network faults, and disk errors are the norm, so failure detection and automatic recovery are core architectural goals. Google's design allows machines to join and leave the cluster at any time.
2. High throughput: high throughput and low latency are conflicting goals, and Hadoop favors throughput. Its design and implementation improve throughput with techniques such as batching small operations and applying updates through operation logs.
3. Lower machine cost: Hadoop encourages deployment on large-capacity, inexpensive machines (good price/performance, but with a higher failure rate). Data storage and serving are also split into two layers, HDFS and HBase, which makes fuller use of machine resources.
4. Single-master design: a single master greatly simplifies the system's design and implementation, at the cost of limits on cluster size and a single point of failure. On cluster size, Hadoop is data- and computation-intensive rather than metadata-intensive, so a single-master cluster can support thousands of machines, which is enough for almost all current applications; the single point of failure can be mitigated effectively through the distributed lock service or other failover mechanisms (see the ZooKeeper sketch after this list).
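To make the MapReduce programming model mentioned above concrete, here is the classic word-count job as a minimal sketch. It is written against the org.apache.hadoop.mapreduce API of more recent Hadoop releases (older releases use org.apache.hadoop.mapred instead), so exact class and method names may differ from the version discussed here; the input and output paths are simply taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word; also reused as a combiner,
  // which merges small map outputs early and helps throughput.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output directories live in HDFS and are passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```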
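And here is a rough illustration of the lock-service idea behind point 4: each candidate master tries to create the same ephemeral znode in ZooKeeper, the one that succeeds acts as the active master, and the others wait for the znode to disappear (which happens automatically when the active master's session dies). This is only a hypothetical sketch of the general pattern, not how Hadoop's own master daemons handle failover; the znode path and timeout are made up for illustration, and connection-loss handling is omitted.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical master-election sketch built on ZooKeeper's ephemeral znodes.
public class MasterElection implements Watcher {

  private static final String LOCK_PATH = "/master-lock"; // hypothetical znode path
  private final ZooKeeper zk;

  public MasterElection(String connectString) throws Exception {
    // 5-second session timeout; this object receives session-level watch events.
    this.zk = new ZooKeeper(connectString, 5000, this);
  }

  /** Blocks until this process becomes the active master. */
  public void runForMaster() throws Exception {
    while (true) {
      try {
        // EPHEMERAL: the znode is deleted automatically when our session expires,
        // so a crashed master releases the lock without manual cleanup.
        zk.create(LOCK_PATH, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        return; // we hold the lock and are now the active master
      } catch (KeeperException.NodeExistsException e) {
        // Someone else is master; watch the znode and wait for it to go away.
        CountDownLatch gone = new CountDownLatch(1);
        if (zk.exists(LOCK_PATH, event -> {
              if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                gone.countDown();
              }
            }) == null) {
          continue; // znode vanished between create() and exists(); retry at once
        }
        gone.await();
      }
    }
  }

  @Override
  public void process(WatchedEvent event) {
    // Session events (connected, expired, ...) would be handled here.
  }
}
```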
Other imitations of Google's systems include Microsoft's Dryad (modeled on Google's MapReduce) and Hypertable (a C++ implementation of Bigtable, an open-source project started by core members of the Hadoop HBase development team). Google's solutions are not all-powerful, but for ordinary engineers they are almost impossible to surpass. Hadoop, as a simplified implementation of Google's model, has many shortcomings. A few are listed here first; later posts will analyze Hadoop in more depth by reading the source code and the papers. Hadoop has the following obvious disadvantages:
1. It is implemented in Java. Java's I/O handling is not a performance bottleneck, but for CPU-intensive tasks it wastes resources. This can be seen, at least preliminarily, by comparing the two open-source Bigtable implementations, HBase and Hypertable.
2. It is an open-source project. Open source is a double-edged sword: it benefits most users, but for a company beyond a certain size, control over the project's direction, keeping technology confidential, and obtaining technical support are all issues that must be weighed when adopting Hadoop. In addition, as a relatively young project, Hadoop needs time to improve its performance and stability.
3. (To be continued)