Hadoop in Facebook Applications

Facebook, the world-renowned social networking site, has more than 300 million active users, of whom about 30 million update their status at least once a day. Every month users upload more than 1 billion photos and 10 million videos, and every week they share 1 billion pieces of content, including journals, links, news, and microblog posts. The amount of data Facebook has to store and process is therefore enormous: every day it takes in 4 TB of new compressed data, scans 135 TB of data, runs more than 7,500 Hive jobs on the cluster, and performs about 80,000 computations per hour. A high-performance cloud platform is thus very important to Facebook, which uses Hadoop mainly for log processing, its recommendation system, and its data warehouse.

Facebook stores its data in a data warehouse built with Hadoop/Hive, which has 4,800 CPU cores, 5.5 PB of storage, and about 12 TB of data per node, connected by a two-level network topology, as shown in Figure 3-5. Facebook's MapReduce clusters are dynamic: they are moved among cluster nodes based on load conditions and configuration information.

Figure 3-5 Cluster network topology

Figure 3-6 shows the Facebook data warehouse architecture, in which web servers and internal services generate log data. Facebook uses an open-source logging system (Scribe) that stores hundreds of log data sets on NFS servers, but most of the log data is copied into one central HDFS instance, and the data stored in HDFS is organized into a data warehouse built with Hive. Hive provides a SQL-like language that integrates with MapReduce to create and publish a variety of summaries and reports and to run historical analyses on top of them. Hive's browser-based interface lets users submit Hive queries. Oracle and MySQL databases are used to publish these summaries, which are relatively small but are queried frequently and require real-time responses. Older data needs to be archived in time and moved to less expensive storage, as shown in Figure 3-7.

Figure 3-6 Facebook data warehouse architecture
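
To make the Hive workflow above concrete, the following sketch shows how a summary query of the kind described here could be submitted to Hive over JDBC. This is an illustration only, not Facebook's production pipeline; the HiveServer2 address, the credentials, and the page_views table with its dt and user_id columns are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DailySummary {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (ships with Hive).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical endpoint and database; adjust for a real cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

            // A "summary report" style Hive query: distinct users per day.
            // Hive compiles this SQL-like statement into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT dt, COUNT(DISTINCT user_id) AS daily_users " +
                "FROM page_views GROUP BY dt");

            while (rs.next()) {
                System.out.println(rs.getString("dt") + "\t" + rs.getLong("daily_users"));
            }
        }
    }
}
```

The result set of such a query is relatively small, which is why, as described above, summaries like this can then be published into Oracle or MySQL for frequent, low-latency access.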

The following outlines some of Facebook's work on the AvatarNode and on scheduling policies. The AvatarNode is mainly used for HDFS recovery and startup. If HDFS crashes, the original recovery procedure first takes 10 to 15 minutes to read and write the 12 GB file image, then 20 to 30 minutes to process the block reports from about 2,000 DataNodes, and finally 40 to 60 minutes to restore the crashed NameNode and deploy the software. Table 3-1 shows the differences between the BackupNode and the AvatarNode. The AvatarNode starts as a normal NameNode and handles all messages from the DataNodes. The AvatarDataNode, like the DataNode, supports multi-threading and multiple queues for multiple primary nodes, but it cannot distinguish the primary from the backup. Manual recovery uses the AvatarShell command-line tool: AvatarShell performs the failover and updates the zNodes in ZooKeeper, so the recovery process is transparent to users. The distributed Avatar file system is implemented on top of the existing file system.

Figure 3-7 Data archive

Table 3-1 Differences between BackupNode and AvatarNode
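
The failover step that AvatarShell performs, pointing clients at the newly promoted primary by updating data in ZooKeeper, can be illustrated with a short sketch. This is not Facebook's AvatarShell code; the ensemble address, the zNode path /avatar-primary, and the node addresses are hypothetical, and a real tool would also wait for the session to be established and handle retries.

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class FailoverZnodeUpdate {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (address is illustrative).
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, event -> { });

        String znode = "/avatar-primary";  // hypothetical zNode that clients read
        byte[] newPrimary = "standby-avatarnode.example.com:8020"
                .getBytes(StandardCharsets.UTF_8);

        // Create the zNode on first use, otherwise overwrite its data,
        // so clients looking up the primary see the newly promoted node.
        if (zk.exists(znode, false) == null) {
            zk.create(znode, newPrimary, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(znode, newPrimary, -1);  // -1 means "any version"
        }
        zk.close();
    }
}
```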

There are some practical problems with location-based scheduling strategies: tasks that need a lot of memory may be assigned to TaskTrackers with little memory, CPU resources are sometimes underutilized, and it is hard to configure TaskTrackers appropriately for different hardware. Facebook therefore uses a resource-based scheduling strategy: the fair-share scheduler monitors the system in real time, collects CPU and memory usage, analyzes real-time memory consumption, and allocates memory fairly among tasks. It parses each task's process tree by reading the /proc/ directory, collects the CPU and memory usage of every process in the tree, and then sends this information in the heartbeat via TaskCounters.
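
As a minimal sketch of the /proc-based measurement described above (not Facebook's scheduler code), the following reads the resident set size of a single process from /proc/&lt;pid&gt;/status on Linux; walking the whole process tree and reporting the totals through TaskCounters in the heartbeat is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ProcMemoryReader {

    /** Returns the resident set size (VmRSS) of a process in kilobytes, or -1 if unavailable. */
    public static long residentSetSizeKb(int pid) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/" + pid + "/status"));
        for (String line : lines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like: "VmRSS:     123456 kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        int pid = Integer.parseInt(args[0]);
        System.out.println("VmRSS of " + pid + ": " + residentSetSizeKb(pid) + " kB");
    }
}
```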

Facebook's data warehouse uses Hive, whose architecture is shown in Figure 3-8; see Chapter 11 for the Hive query language. Here HDFS supports three file formats: TextFile, which other applications can easily read and write; SequenceFile, which only Hadoop can read and which supports block compression; and RCFile, which uses block storage based on sequence files and stores each block by column, giving better compression and query performance. In the future Facebook will improve Hive to support new features such as indexes, views, and subqueries.

Figure 3-8 Hive architecture
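
To make the SequenceFile format above concrete, here is a small, self-contained sketch that writes a block-compressed SequenceFile with the standard Hadoop API; the output path and the IntWritable/Text key and value types are arbitrary choices for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq");  // illustrative output path

        // BLOCK compression groups many records into one compression block,
        // which is the block-compression mode mentioned above.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK));
        try {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();
        }
    }
}
```

Compressing whole blocks of records rather than individual values is what gives SequenceFile, and the column-oriented RCFile built on the same idea, better compression than plain TextFile.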

The challenges that Facebook now has with Hadoop are:

In terms of service quality and isolation, larger tasks can affect cluster performance;

In terms of security, what happens if a software flaw damages the NameNode transaction log?

In terms of data archiving, how to choose which data to archive and how to archive it;

In terms of performance, how to effectively resolve bottlenecks.
