Document directory
- 2.1 High availability of the NameNode: AvatarNode
- 2.2 Hadoop RPC compatibility and data block availability
- 2.3 Performance optimization for real-time workloads
- 2.4 HDFS sync optimization and concurrent reads
- 3.1 Row-level atomicity and consistency
- 3.2 Availability
- 3.3 Performance Optimization
Original article address:
http://blog.solrex.org/articles/facebook-realtime-hadoop-system.html
Author: Yang Wenbo
At SIGMOD in June 2011, Facebook published a conference paper (PDF) entitled "Apache Hadoop Goes Realtime at Facebook", which describes the in-house engineering Facebook used to build a real-time HBase system. Because the application scenarios discussed in the paper are close to the problem domain I am responsible for, I took the time to read it carefully. Below are some of my observations and impressions based on its content; if I have gotten anything wrong, please correct me.
The main content of this 10-page paper is Facebook's engineering practice on the Hadoop stack, and the goal of that practice is real time. Although I have little experience developing or operating Hadoop, I don't think that prevents me from understanding the paper: in my mind HDFS corresponds to GFS and HBase corresponds to Bigtable. Their implementations certainly differ, but the main ideas should be the same. If you are familiar with the GFS and Bigtable papers, this one can be read as an "advanced course" built on top of them.
1. Application scenarios and requirements
The paper first gives background information, presenting three application scenarios: Facebook Messaging, Facebook Insights, and the Facebook Metrics System (ODS). Messaging is Facebook's new messaging service, Insights is a data analysis tool offered to developers and site owners, and ODS is Facebook's internal system for collecting hardware and software statistics. The three scenarios each have their own characteristics, but in short they face the same problem: a single-machine or sharded relational database cannot meet their needs.
From the data characteristics of these scenarios, Facebook abstracted several requirements for the storage system, such as efficient, low-latency, strongly consistent semantics within a single data center; I will not list them all here. More interesting than the requirements are the "non-requirements", of which there are three:
- Tolerating network partitions within a single data center. Facebook believes this problem should be solved at the network hardware level (through redundant design) rather than in software;
- Keeping the service up when an entire data center goes down. Facebook considers such a disaster extremely unlikely and is willing to accept the risk;
- Hot standby across data centers. Facebook assumes that a user's data is pinned to a fixed data center, and that any resulting response-latency problem should be solved with caching.
These "non-requirements" show that Facebook is designing for a pragmatic, real-world situation rather than an idealized distributed system, which makes them a useful reference.
Given these requirements and non-requirements, Facebook then lays out its reasons for choosing the Apache Hadoop stack: the maturity of the community, plus Hadoop's strengths in consistency, scalability, availability, fault tolerance, read/write efficiency, and so on. These advantages are plain for everyone to see.
2. Building a real-time HDFS
HDFS is a distributed file system designed to support offline MapReduce computation. It scales well and delivers good throughput, but it does not do well on real-time workloads. If you want better performance from HBase running on top of HDFS, optimizing the HDFS layer is unavoidable. Facebook made the following optimizations to turn HDFS into a general-purpose, low-latency file system.
2.1 High availability of the NameNode: AvatarNode
The HDFS NameNode is a single point of failure, which means the whole system becomes unavailable when the NameNode goes down. Restarting a NameNode takes about 45 minutes, because it must load the namespace image, replay the edit log, and wait for the DataNodes to report their block information. Even with a BackupNode, the block reports from the DataNodes still have to be collected, so a failover can take more than 20 minutes. A system with real-time requirements, however, generally needs 24x7 availability. Facebook therefore replaced the single NameNode with a hot-standby pair of NameNodes called AvatarNode, as shown below:
(Figure: AvatarNode architecture)
Simply put, the standby AvatarNode stays in sync with the primary by reading and replaying the primary AvatarNode's transaction log over NFS, and it also receives block reports from the DataNodes. This keeps the data gap between primary and standby as small as possible, so the standby AvatarNode can quickly switch into the primary role. The roles of the primary and standby AvatarNodes are registered in ZooKeeper, and the DataNodes decide which AvatarNode to obey based on the information in ZooKeeper.
To make the hot-standby AvatarNode synchronize reliably and stay easy to use, Facebook also improved the NameNode transaction log and built DAFS (Distributed Avatar File System) to hide AvatarNode failover, making these changes transparent to the client. The paper does not say whether an AvatarNode switchover is manual or automatic, but given ZooKeeper's session/ephemeral-node mechanism, automatic failover would not be hard to implement.
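To make the ZooKeeper-based role switch concrete, here is a minimal hypothetical sketch, not Facebook's actual DAFS or AvatarNode code, of how a client or DataNode might resolve the current primary AvatarNode from ZooKeeper and pick up the new primary after a failover. The znode path and its payload format are assumptions made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: resolve the current primary AvatarNode address from ZooKeeper
// and re-read it whenever the znode changes (e.g. after a failover).
// The znode path "/hdfs/primary-avatarnode" and its payload are assumptions.
public class PrimaryAvatarNodeResolver implements Watcher {
    private final ZooKeeper zk;
    private volatile String primaryAddress;

    public PrimaryAvatarNodeResolver(String zkQuorum) throws Exception {
        this.zk = new ZooKeeper(zkQuorum, 30_000, this);
        refresh();
    }

    private void refresh() throws Exception {
        // Re-register the watch each time; ZooKeeper watches are one-shot.
        byte[] data = zk.getData("/hdfs/primary-avatarnode", this, null);
        primaryAddress = new String(data, StandardCharsets.UTF_8);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged
                || event.getType() == Event.EventType.NodeCreated) {
            try {
                refresh();   // failover happened: pick up the new primary's address
            } catch (Exception e) {
                // A real system would retry with backoff here.
            }
        }
    }

    public String getPrimaryAddress() {
        return primaryAddress;
    }
}
```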
2.2 Hadoop RPC compatibility and data block availability
The system requirements mentioned earlier include fault isolation, and Facebook's Hadoop deployments are confined to a single data center, so a single service inevitably uses multiple Hadoop clusters. To make cluster upgrades independent and convenient, it is natural for the client to be compatible with several versions of Hadoop RPC.
Although HDFS takes racks into account when choosing replica locations for a block, placement is still fairly random overall. I have actually discussed a similar question with colleagues before: should replica locations be chosen at random, or allocated by some grouping policy? Random placement is simple and balances load well, but when several machines go down at once, the random spread of replicas makes it quite likely that some block loses all of its copies. Grouped allocation has the opposite trade-off: if the failed machines are not all in the same group, no data is lost at all, but if they do fall in the same group, a great deal of data is lost. Facebook apparently chose the grouped placement policy, betting that multiple machines rarely fail within the same group.
I have my doubts about that assumption. Servers in the same rack or in adjacent racks usually went into service at the same time and use the same hardware model, so their failures are not completely independent: the probability of correlated failures is higher than an ideal failure distribution would suggest. I suspect that is why Facebook's final grouping is (2, 5), i.e. 2 racks and 5 servers per group: if the two racks are chosen carefully, this situation can largely be avoided. In the end, though, everything depends on execution; without knowing the deployment details, the choice of racks may not achieve the expected result.
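To make the trade-off concrete, here is a small self-contained simulation of my own (not from the paper) that estimates how often some block loses all three replicas when a handful of machines fail simultaneously, comparing fully random placement with placement confined to fixed groups of five machines. The cluster size, block count, and failure count are invented parameters; the simulation only measures how often data loss occurs, not how much is lost, and grouped placement loses far more data on the rare occasions it does lose any.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Monte Carlo sketch: probability that at least one block loses all 3 replicas
// when FAILURES machines die, under (a) random placement vs (b) grouped placement.
public class PlacementLossSim {
    static final int MACHINES = 1000, GROUP = 5, REPLICAS = 3,
                     BLOCKS = 100_000, FAILURES = 10, TRIALS = 200;

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int lossRandom = 0, lossGrouped = 0;
        for (int t = 0; t < TRIALS; t++) {
            Set<Integer> dead = new HashSet<>();
            while (dead.size() < FAILURES) dead.add(rnd.nextInt(MACHINES));

            boolean lostR = false, lostG = false;
            for (int b = 0; b < BLOCKS && !(lostR && lostG); b++) {
                // (a) random placement: 3 distinct machines anywhere in the cluster
                if (!lostR) {
                    Set<Integer> repl = new HashSet<>();
                    while (repl.size() < REPLICAS) repl.add(rnd.nextInt(MACHINES));
                    if (dead.containsAll(repl)) lostR = true;
                }
                // (b) grouped placement: 3 distinct machines inside one group of 5
                if (!lostG) {
                    int base = rnd.nextInt(MACHINES / GROUP) * GROUP;
                    Set<Integer> repl = new HashSet<>();
                    while (repl.size() < REPLICAS) repl.add(base + rnd.nextInt(GROUP));
                    if (dead.containsAll(repl)) lostG = true;
                }
            }
            if (lostR) lossRandom++;
            if (lostG) lossGrouped++;
        }
        System.out.printf("P(some data loss | %d failures): random=%.2f grouped=%.2f%n",
                FAILURES, lossRandom / (double) TRIALS, lossGrouped / (double) TRIALS);
    }
}
```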
2.3 Performance optimization for real-time workloads
Beyond the changes above, Facebook also optimized the RPC path on the client: it added a timeout mechanism to RPC and sped up the revocation of file leases (I am not sure why this particular HDFS file operation needed to be faster).
The paper also mentions the most important point: locality! Facebook added a check for whether a file block lives on the local machine; if it does, the block is read directly from the local disk. I don't know exactly how this is implemented, but it sounds rather crude and brute-force, and I wonder whether it could hurt data consistency.
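The paper's change lives inside the HDFS client, but the idea can be sketched against the public API: ask the NameNode where a block's replicas live and compare the hosts with the local hostname. This is only an illustration of the locality check, not Facebook's implementation, and the local short-circuit read itself is not shown.

```java
import java.net.InetAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: check whether the first block of a file has a replica on this machine.
public class LocalReplicaCheck {
    public static boolean firstBlockIsLocal(FileSystem fs, Path file) throws Exception {
        String localHost = InetAddress.getLocalHost().getHostName();
        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode where the replicas of the first block live.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, 1);
        if (blocks.length == 0) {
            return false;
        }
        for (String host : blocks[0].getHosts()) {
            if (host.equals(localHost)) {
                return true;   // a replica is on this machine: read it locally
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        System.out.println(firstBlockIsLocal(fs, new Path(args[0])));
    }
}
```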
2.4 HDFS sync optimization and concurrent reads
To improve write performance, Facebook allows a writer to keep writing without waiting for sync to finish. This also sounds rather brute-force, and I don't know whether it affects data correctness.
To let clients read the newest data, Facebook allows a client to read a data file that has not yet been closed. If the reader reaches the last block, which is still being written, it recomputes the checksum for that block.
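The client-visible side of this behavior can be sketched with the public HDFS API: a writer calls hflush() so its bytes become visible before the file is closed, and another stream can then open the same file and read what has been flushed so far. This assumes a Hadoop release that exposes hflush() (older releases call it sync()); the checksum recomputation for the in-flight last block happens inside the DFS client and is not shown here. The file path is made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write, hflush, then read the same still-open file through another stream.
public class ConcurrentReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/realtime-log");   // hypothetical path

        FSDataOutputStream out = fs.create(path, true);
        out.writeBytes("event-1\n");
        out.hflush();                 // make the bytes visible to new readers

        // A different client (here, just another stream) can already read them,
        // even though the writer has not closed the file yet.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n));
        }

        out.close();
    }
}
```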
3. Making HBase ready for a real-time production environment
3.1 Row-level atomicity and consistency
HBase already guarantees atomicity at the row level, but a node crash can leave the log entry of the last update incomplete. Facebook was not satisfied with this, so it introduced WALEdit, a log-transaction concept that guarantees the log entry for each update is either complete or recognizably incomplete.
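HBase's real WALEdit lives inside the RegionServer, so the sketch below only illustrates the idea with made-up classes: all the edits of one transaction are written as a single length-prefixed log record, so that on replay a truncated record can be detected and skipped instead of being applied partially.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the WALEdit idea: one transaction = one length-prefixed
// log record, so replay either sees the whole record or can tell it is truncated.
public class WalEditSketch {
    public static void writeTransaction(DataOutputStream log, List<byte[]> edits)
            throws IOException {
        log.writeInt(edits.size());          // how many edits belong to this transaction
        for (byte[] edit : edits) {
            log.writeInt(edit.length);
            log.write(edit);
        }
        log.flush();
    }

    // Returns the edits of the next transaction, or null if the record is truncated.
    public static List<byte[]> readTransaction(DataInputStream log) throws IOException {
        List<byte[]> edits = new ArrayList<>();
        try {
            int count = log.readInt();
            for (int i = 0; i < count; i++) {
                byte[] edit = new byte[log.readInt()];
                log.readFully(edit);
                edits.add(edit);
            }
        } catch (EOFException truncated) {
            return null;                     // incomplete record: skip it during replay
        }
        return edits;
    }
}
```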
On consistency, HBase as it stands seems to meet the requirements. However, for the case where checksum verification fails on all three replicas at once and the data block becomes unavailable, Facebook adds a post-mortem analysis mechanism instead of simply discarding the block.
3.2 Availability
Facebook tested HBase to improve its availability and solved the following problems:
- Rewrote the HBase master and moved region assignment information into ZooKeeper, so that failover completes correctly.
- Made compactions interruptible, so a RegionServer can shut down cleanly and quickly, and implemented rolling restarts (upgrading servers one at a time) to reduce the impact of upgrades on the service.
- Moved log splitting for a crashed RegionServer off the master and distributed it across multiple RegionServers, improving recovery time after a RegionServer failure.
The fixes for these problems are general-purpose, and I think it is very likely they will be merged into the mainline Hadoop/HBase code before long.
3.3 Performance Optimization
The performance work falls into two areas: compaction performance and read performance.
Anyone who has read the Bigtable paper will be familiar with memtables and compactions. Here the paper mainly discusses the benefit of letting minor compactions delete data as well, and how to improve the merge performance of major compactions.
On the read path, the discussion mainly concerns reducing I/O, including using Bloom filters and specific metadata (timestamps) to skip files that cannot contain the requested data. It also stresses keeping RegionServers and the physical files they serve co-located at deployment time: locality matters here too!
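As a concrete illustration of both techniques, here is a small sketch against the HBase client API: the column family is configured with a row-level Bloom filter so reads can skip store files that definitely do not contain the requested row, and a Scan is bounded by a timestamp range so files whose time ranges do not overlap can be skipped as well. The table and column family names are made up, and the classes used (HColumnDescriptor, HTableDescriptor, BloomType) assume an older (0.96–1.x era) HBase client API; newer clients use TableDescriptorBuilder/ColumnFamilyDescriptorBuilder instead.

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: reduce read I/O with a row Bloom filter and a timestamp-bounded scan.
public class ReadPathTuning {
    public static HTableDescriptor messageTable() {
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("messages"));
        HColumnDescriptor family = new HColumnDescriptor("d");
        family.setBloomFilterType(BloomType.ROW);   // skip HFiles that cannot contain the row
        table.addFamily(family);
        return table;
    }

    public static Scan lastHourScan(byte[] startRow, byte[] stopRow) throws Exception {
        long now = System.currentTimeMillis();
        Scan scan = new Scan(startRow, stopRow);
        // Only HFiles whose time range overlaps the last hour need to be opened.
        scan.setTimeRange(now - 3_600_000L, now);
        scan.addFamily(Bytes.toBytes("d"));
        return scan;
    }
}
```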
The paper also shares some of Facebook's deployment and operations experience, parts of which are quite interesting. I may write a separate article about that later, so I won't go into detail here.
4. Summary
We had previously discussed how to build a real-time data analysis system on top of a distributed file system, and at the time we thought that, given a mature GFS-like system, the work would be relatively simple. Reading this Facebook paper shows how naive that idea was. Judging from the technical details presented, it is entirely believable that this system took years of work, which also means it would not be easy for anyone else to copy.
Judging from the design, the system should satisfy the requirements set out at the beginning of the paper and cover the needs of most application scenarios. I do, however, have some questions about the real-time analytics capability provided for Insights: real time is fine, but how well can HBase really support analytics? I would need a deeper understanding of HBase's features before I could answer that.
Looking at the details of the system, there are many compromises and tricks, and I think that is the real world: it is hard to make everything perfect, and engineering is no exception. There is nothing wrong with pursuing perfection when designing a system, but cost and feasibility have to be weighed, and meeting the requirements remains the most important goal. It is also worth explicitly listing some "non-requirements": dropping those constraints can reduce the complexity of the system.