In April 2013, the data center (IDC machine room) hosting Alibaba's Ladder cluster was full, and the cluster could not be expanded any further there. Given the growth trend of Alibaba Group's data volume at the time, the cluster would soon be unable to grow at all for lack of rack space. The Hadoop version used by Ladder did not support a single cluster spanning multiple data centers, so Alibaba Group's big data business would have stalled on the cluster size limit. The Ladder cross-data-center project began against this background. Its goal was clear: build a Hadoop cluster that spans data centers.
Technical challenges
Building a Hadoop cluster that spans data centers involves many technical difficulties.
Difficulty 1: NameNode scalability
As is well known, the single NameNode in Hadoop HDFS is one of the biggest obstacles to scaling a Hadoop cluster indefinitely. Before the cross-data-center project, Ladder had always used a single-NameNode architecture, and however much a single NameNode is optimized, its service capacity has a limit. After several rounds of optimization by the Ladder development team, it could support more than 5,000 machines (about 2.5 billion RPC calls per day on average), but doubling that scale was clearly out of reach. The Ladder Hadoop version therefore had to support multiple NameNodes.
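To put the RPC figure in perspective, the following back-of-the-envelope calculation converts the daily total into an average per-second rate and shows what doubling the cluster would imply. It assumes, purely for illustration, that RPC load grows linearly with the number of machines; the article does not state this, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_rpc_per_second(rpc_per_day: float) -> float:
    """Average RPC rate over a day (peaks will be higher than this)."""
    return rpc_per_day / SECONDS_PER_DAY


def projected_rpc_per_day(rpc_per_day: float,
                          current_machines: int,
                          target_machines: int) -> float:
    """Project daily RPC volume, ASSUMING load scales linearly with machines."""
    return rpc_per_day * target_machines / current_machines


current_rate = average_rpc_per_second(2.5e9)
doubled_rate = average_rpc_per_second(
    projected_rpc_per_day(2.5e9, current_machines=5000, target_machines=10000)
)

print(f"current: {current_rate:,.0f} RPC/s")  # current: 28,935 RPC/s
print(f"doubled: {doubled_rate:,.0f} RPC/s")  # doubled: 57,870 RPC/s
```

Under that assumption, a single NameNode would have to sustain roughly twice its already optimized average load, which is why splitting the namespace across several NameNodes is the natural alternative.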
Difficulty 2: Network limits between data centers
The problem cannot be solved simply by connecting all the slaves in one data center to the master in another, because the bandwidth between data centers is a major obstacle.
Difficulty 3: How data should be distributed across data centers
Once the namespace is split across multiple NameNodes, the data must be partitioned by data center, or even distributed across data centers, and the distribution strategy has to be planned as a whole at the business level. The solution to this problem is outside the scope of this article, so it is only mentioned briefly. In practice, the Ladder team derived the data distribution by clustering the data-access patterns and requirements of the upper-layer applications.
Difficulty 4: How to schedule computation across data centers
Once the data is distributed across data centers, how should the scheduling strategy be designed so that data is not copied back and forth between data centers and jobs do not read their input across data centers?
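The article does not describe the scheduling strategy Ladder actually adopted. As a minimal sketch of the idea, the following greedy heuristic places a job in the data center that already holds most of its input, and reports how many bytes would still have to cross data centers. The function name, the path names, and the placement map are all hypothetical.

```python
from collections import defaultdict


def pick_data_center(input_bytes_by_path, path_to_data_center):
    """Choose the data center holding the most input bytes for a job.

    Returns (data_center, cross_dc_bytes), where cross_dc_bytes is the
    input that would still be read from other data centers.
    """
    if not input_bytes_by_path:
        raise ValueError("job has no inputs")

    totals = defaultdict(int)
    for path, size in input_bytes_by_path.items():
        totals[path_to_data_center[path]] += size

    # Iterate in sorted order so that ties are broken deterministically by name.
    best = max(sorted(totals), key=lambda dc: totals[dc])
    cross_dc_bytes = sum(totals.values()) - totals[best]
    return best, cross_dc_bytes


# Hypothetical job: a large log table in dc1 and a small dimension table in dc2.
job_inputs = {"/group/a/logs": 800, "/group/b/dim": 50}
placement = {"/group/a/logs": "dc1", "/group/b/dim": "dc2"}

chosen, remote = pick_data_center(job_inputs, placement)
print(chosen, remote)  # dc1 50
```

A real scheduler would also have to weigh free compute capacity in each data center and the available inter-data-center bandwidth; this sketch only captures the data-locality part of the decision.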
Difficulty 5: Migrating dozens of PB of data and upgrading with the data in place
Upgrading the whole cluster while it carries hundreds of PB of data, without losing any of that data, is a very big challenge.
Difficulty 6: How to stay transparent to users
After multiple masters are introduced, remaining transparent to users, so that the hundreds of thousands of jobs on Ladder stay seamlessly compatible without requiring changes, is another big challenge for the development team.
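The article does not say how Ladder achieved this transparency. One common approach, similar in spirit to the client-side mount tables of Hadoop's ViewFs, is to let the client map each path to the NameNode that owns it, so existing jobs keep using the same paths. The sketch below shows longest-prefix resolution; the function name, the mount table contents, and the paths are illustrative assumptions.

```python
def resolve_namenode(path, mount_table, default):
    """Return the NameNode owning `path`, using longest-prefix matching.

    mount_table maps directory prefixes to NameNode names; paths that
    match no prefix fall back to `default`.
    """
    best_len = -1
    best_namenode = default
    for prefix, namenode in mount_table.items():
        normalized = prefix.rstrip("/")
        # Match the directory itself or anything beneath it, but not
        # siblings that merely share a name prefix (e.g. /group/adsx).
        if path == normalized or path.startswith(normalized + "/"):
            if len(normalized) > best_len:
                best_len = len(normalized)
                best_namenode = namenode
    return best_namenode


# Hypothetical mount table: one subtree has been moved to NameNode2.
mounts = {"/group": "NameNode1", "/group/ads": "NameNode2"}

print(resolve_namenode("/group/ads/2013/04/part-0", mounts, "NameNode1"))  # NameNode2
print(resolve_namenode("/group/search/index", mounts, "NameNode1"))        # NameNode1
print(resolve_namenode("/tmp/scratch", mounts, "NameNode1"))               # NameNode1
```

Because the lookup happens inside the client, a job that reads `/group/ads/...` does not need to know that the directory now lives under a different NameNode.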
Difficulty 7: Can the scheme be extended to multiple data centers (>=3)?
So that Ladder can later span more data centers, the design has to support distribution across multiple data centers, not just two.
Detailed steps for the solution
With the requirements and difficulties identified, the next thing needed was a clear set of implementation steps. After communication and brainstorming among the development, test, operations, and business teams, the Ladder cross-data-center project settled on the following technical implementation steps. (The design is presented here in the order in which the project was actually carried out. To help readers understand the intent of each step, the problem each step solves is introduced alongside it.)
The first step is to upgrade the Ladder cluster to a version that supports Federation (developed on top of Ladder's own version). The existing NameNode becomes one namespace, "NameNode1", which holds the full set of Ladder data across the 5,000-machine cluster.
The second step is to build another NameNode, "NameNode2", in the same data center. Its namespace is empty and initially manages no data. At the same time, a block pool for NameNode2 is created on every DataNode, and the DataNodes report to NameNode2.
The third step is to migrate part of the data (for example, 50%) from NameNode1 to NameNode2. The migration covers both the metadata in the namespace and the blocks on the underlying DataNode disks. When this step is complete, Ladder has the structure shown in Figure 1. This step is one of the major difficulties.
Fig. 1 The multi-NameNode architecture of Ladder
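Steps two and three can be pictured with a small in-memory model: every DataNode keeps one block pool per NameNode, and migrating a directory moves its metadata from one namespace to the other and its blocks from one pool to the other on the same DataNode. This is only a sketch of the bookkeeping involved, not Ladder's actual migration tool; the class names, pool IDs, and paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    pools: dict = field(default_factory=dict)  # block pool id -> set of block ids


@dataclass
class Namespace:
    pool_id: str
    files: dict = field(default_factory=dict)  # path -> list of block ids


def migrate_subtree(prefix, src, dst, datanodes):
    """Move all files under `prefix` from namespace `src` to `dst`.

    Returns (number of files moved, number of distinct blocks moved).
    """
    base = prefix.rstrip("/")
    paths = [p for p in src.files if p == base or p.startswith(base + "/")]

    # 1. Metadata: move the file entries between namespaces.
    moved_blocks = set()
    for path in paths:
        blocks = src.files.pop(path)
        dst.files[path] = blocks
        moved_blocks.update(blocks)

    # 2. Blocks: on each DataNode, move local replicas between block pools.
    for dn in datanodes:
        src_pool = dn.pools.setdefault(src.pool_id, set())
        dst_pool = dn.pools.setdefault(dst.pool_id, set())
        local = src_pool & moved_blocks
        src_pool -= local
        dst_pool |= local

    return len(paths), len(moved_blocks)


nn1 = Namespace("BP-1", {"/group/a/f1": [1, 2], "/group/b/f2": [3]})
nn2 = Namespace("BP-2")  # step two: a new, empty namespace
dns = [DataNode("dn1", {"BP-1": {1, 3}}), DataNode("dn2", {"BP-1": {2, 3}})]

print(migrate_subtree("/group/a", nn1, nn2, dns))  # (1, 2)
```

In this model no block ever leaves its DataNode, so the total number of replicas before and after the migration is unchanged, which is the invariant a real migration has to preserve to avoid data loss.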