How Baidu uses Hadoop, and what improvements it has made

As the world's largest Chinese-language search engine company, Baidu offers a variety of products built on its search engine and covers almost all search needs of the Chinese-speaking online world. Baidu therefore has to process and analyze very large amounts of data online, and must return results to its platforms within prescribed time limits. Workloads of this kind call for a cloud platform with strong performance, and Hadoop is a good choice. At Baidu, Hadoop is mainly used in the following areas:

Log storage and statistics;

Webpage data analysis and mining;

Business analysis, such as user behavior and advertising attention;

Online data feedback, such as timely collection of online advertising click data;

Web page clustering for users, analyzing recommendations for users and the degree of correlation between users.

MapReduce is mainly a way of thinking, and it cannot solve every computing problem in every field. Baidu's researchers believe a better model is the following:

HDFS provides shared storage; some problems are solved with MapReduce, some with MPI, and others are handled jointly by both. MapReduce is well suited to processing large, well-partitioned data, so it can be used to filter the raw data down to a basic vector matrix, which MPI then processes further to produce the result. Only by integrating the two technologies can such problems be solved well.
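As a concrete illustration of this split, the sketch below shows a minimal Hadoop mapper (old-style mapred API of that era) that filters raw tab-separated records down to feature-vector rows. The field layout and class name are assumptions for illustration, not Baidu's actual code; the emitted matrix is what an MPI program would then consume.

// Illustrative sketch: MapReduce does the filtering, MPI does the math.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class VectorFilterMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String[] fields = line.toString().split("\t");
        // Filter: drop malformed or irrelevant records so the MPI stage
        // only sees the dense, well-formed part of the data.
        if (fields.length < 3 || fields[1].isEmpty()) {
            return;
        }
        // Emit userId -> comma-separated feature vector (one matrix row).
        out.collect(new Text(fields[0]), new Text(fields[1] + "," + fields[2]));
    }
}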

Baidu now has three Hadoop clusters totaling about 700 machines, of which more than 100 are new and more than 600 are due to be retired (their combined computing power is equivalent to more than 200 new machines), and the clusters are still growing. MapReduce now runs about 3,000 jobs a day, processing roughly 120 TB of data per day.

To make better use of Hadoop for data processing, Baidu made improvements and adjustments in the following areas:

(1) Adjusting the MapReduce strategy

Limit the number of tasks that a single job can run at the same time;

Adjust the speculative-execution strategy and control the amount of speculative execution, since some tasks do not need it;

Schedule according to each node's memory status;

Compress intermediate results to balance intermediate output and reduce the I/O burden (see the sketch below).
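The compression point above maps onto real knobs in the Hadoop versions of that era. A minimal sketch, assuming the classic JobConf API; the gzip codec choice is illustrative:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressedShuffleJob {
    public static JobConf configure(JobConf conf) {
        // Compress map output before it is shuffled to the reducers,
        // trading CPU for less intermediate I/O and network traffic.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);
        return conf;
    }
}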

(2) Improving the efficiency and functionality of HDFS

Permission control: data at the PB scale should be shared so that it is easier to analyze, but access permissions need to be restricted;

Make partitions independent of nodes, so that when one partition fails, the other partitions on the same node can still be used normally;

Modify DFSClient's strategy for selecting the location of a block's replica, adding the ability for DFSClient to skip DataNodes that return errors when selecting a block (as sketched after this list);

Solve VFS (Virtual File System) and POSIX (Portable Operating System Interface for UNIX) compatibility issues.
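The DFSClient change above can be pictured with a simplified, standalone sketch; this is not actual HDFS code, just an illustration of skipping replicas on DataNodes that have already failed for the current read:

import java.util.List;
import java.util.Set;

public class ReplicaChooser {
    /** Returns the first replica location not already marked dead, or null. */
    public static String chooseDataNode(List<String> replicaNodes,
                                        Set<String> deadNodes) {
        for (String node : replicaNodes) {
            if (!deadNodes.contains(node)) {
                return node;   // healthy candidate found
            }
        }
        // All replicas failed; the caller must refetch block locations.
        return null;
    }
}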

(3) Modifying the speculative execution strategy

Replace the progress rate with its inverse, preventing the common situation where unevenly distributed data keeps speculative execution from ever starting (see the sketch after this list);

Require that a certain fraction of tasks have completed before speculative execution may start, to address reducers idling while waiting for map output;

When there is only one map or reduce task, start speculative execution directly.
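A rough sketch of the policy these three points describe follows; the thresholds and method names are assumptions, not Hadoop's or Baidu's real scheduler code. The key idea is that 1/rate (time per unit of progress) makes stragglers comparable even when data is skewed:

public class SpeculationPolicy {
    /**
     * @param progress      fraction of the task completed, in (0, 1]
     * @param elapsedMillis wall-clock time the attempt has run
     * @return inverse rate: estimated milliseconds per unit of progress
     */
    public static double inverseRate(double progress, long elapsedMillis) {
        return elapsedMillis / progress;
    }

    /** Launch a backup attempt only if enough peers finished and this one is slow. */
    public static boolean shouldSpeculate(double fractionTasksDone,
                                          double taskInverseRate,
                                          double meanInverseRate) {
        final double MIN_DONE = 0.20;   // assumed threshold: 20% of tasks done
        final double SLOWNESS = 1.5;    // assumed: 50% slower than the mean
        return fractionTasksDone >= MIN_DONE
                && taskInverseRate > SLOWNESS * meanInverseRate;
    }
}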

(4) Controlling resource usage

Control the physical memory used by applications. Because excessive memory usage can cause the operating system to kill tasks, Baidu modified the Linux kernel to restrict the physical memory a process may use, terminating the process if the threshold is exceeded.
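Baidu enforced this inside a modified kernel; the sketch below is only a rough user-space approximation of the same policy, polling a task's resident set size through /proc and killing it over a threshold (the 2 GB cap is an assumed value):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class RssWatchdog {
    /** Reads VmRSS (resident memory, in kB) from /proc/<pid>/status. */
    public static long readRssKb(int pid) throws IOException {
        try (BufferedReader r = new BufferedReader(
                new FileReader("/proc/" + pid + "/status"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith("VmRSS:")) {
                    return Long.parseLong(line.replaceAll("[^0-9]", ""));
                }
            }
        }
        return -1;
    }

    public static void enforce(int pid, Process proc) throws IOException {
        final long LIMIT_KB = 2L * 1024 * 1024;   // assumed 2 GB cap per task
        if (readRssKb(pid) > LIMIT_KB) {
            proc.destroy();   // terminate the over-limit task process
        }
    }
}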

Schedule computing resources in groups: storage is shared, computation is independent, and processes running in Hadoop cannot be preempted.

On the x86 platform the default page size is 4 KB. In a large file system, small pages mean a great deal of page-management work (a 1 GB file alone spans 262,144 pages at 4 KB), which raises the cost of data operations and hurts computational efficiency, so the page size needs to be increased.

Baidu also encountered some problems when using Hadoop, including the following:

MapReduce efficiency issues: for example, how to reduce the number of I/O operations during shuffle to improve parallelism; and, for sorting efficiency, how to make sorting configurable, because the sorting process consumes a lot of computing resources and in some cases sorting is not needed.

HDFS efficiency and reliability issues: how to improve random-access efficiency, and how to handle real-time writes; for example, if a log is written to HDFS as soon as each entry is produced, efficiency will be low.

Memory usage problems: the reducer-side shuffle uses memory heavily, and Baidu addressed this with a buddy system similar to Linux's, ensuring that Hadoop achieves the highest utilization with minimal overhead. When a Java process's memory usage is too high, the garbage collection (GC) strategy can be adjusted. Sometimes there is a lot of memory copying, which consumes considerable CPU and drives memory usage very high, so memory copying needs to be reduced.
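One concrete way to "adjust the GC strategy" for task JVMs is through the mapred.child.java.opts property that Hadoop of that period exposes; the heap size and collector choice below are illustrative, not Baidu's actual settings:

import org.apache.hadoop.mapred.JobConf;

public class ChildGcTuning {
    public static JobConf configure(JobConf conf) {
        // Give each map/reduce child JVM a bounded heap and a concurrent
        // collector so full-GC pauses during shuffle stay short.
        conf.set("mapred.child.java.opts",
                 "-Xmx1024m -XX:+UseConcMarkSweepGC");
        return conf;
    }
}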

Job scheduling problems: how to limit the number of map and reduce tasks in a job to ensure that critical computations get sufficient computing resources; and how to group TaskTrackers to restrict which machines a job runs on, determining the group to execute on when the job is submitted and authenticating the group.

Performance problems: the UserLogs cleanup at the end of each task has to inspect the logs to decide whether to clear them, which takes up some task resources; this can be solved by moving the cleanup thread from the child Java process into the TaskTracker. The Java process splits input into text lines and the map and reduce processes split them again, duplicating work, so the Java process's line splitting needs to be turned off. Sorting can also be parallelized to improve performance, and asynchronous data reads and writes improve performance as well.

Robustness issues: the memory consumed by mapper and reducer programs needs to be limited, which requires modifying the Linux kernel to cap each process's physical memory; multiple map programs can also share one piece of memory, reducing physical memory usage at some cost. The DataNode and TaskTracker UGI can be configured as a normal user with an account and password set, or the DataNode and TaskTracker can be started under separate accounts, to keep HDFS data secure and prevent the TaskTracker from manipulating DataNode content. Since not every user's program can be guaranteed to be robust, a process sometimes has to be terminated, and it must be ensured that when the parent process is terminated its child processes are terminated as well.

Limitations of Streaming: for example, only text data can be processed; the mapper and reducer communicate using a line-of-text protocol and cannot easily handle binary data. To solve this, Baidu engineers wrote a new class, BiStreaming (Binary Streaming), in which the Java child process and the user's mapper and reducer communicate in a (KeyLen, Key, ValLen, Value) format, and users can write programs according to this protocol.
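A minimal sketch of that (KeyLen, Key, ValLen, Value) framing is shown below. The record layout follows the article; the class name and helper signatures are assumptions:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BinaryRecordCodec {
    /** Writes one record as: 4-byte key length, key bytes, 4-byte value length, value bytes. */
    public static void write(DataOutputStream out, byte[] key, byte[] value)
            throws IOException {
        out.writeInt(key.length);
        out.write(key);
        out.writeInt(value.length);
        out.write(value);
    }

    /** Reads one length-prefixed record; returns {key, value}. */
    public static byte[][] read(DataInputStream in) throws IOException {
        byte[] key = new byte[in.readInt()];
        in.readFully(key);
        byte[] value = new byte[in.readInt()];
        in.readFully(value);
        return new byte[][] { key, value };
    }
}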

User authentication: the solution is for the NameNode and JobTracker to centrally maintain user names, passwords, and group membership; a user must supply a user name and password when connecting, which ensures the security of the data.

Baidu's next areas of focus will likely include the following:

Memory: reduce the NameNode's memory usage and study JVM memory management;

Scheduling: improve the handling of task preemption, and develop its own capacity-based job scheduler, in which waiting job queues have priorities, each queue's capacity can be set, and TaskTracker grouping is supported;

Compression algorithms: choose better methods to improve the compression ratio and reduce storage, and select efficient algorithms for compressing and decompressing shuffle data. Control the resources used by mapper and reducer programs to prevent excessive consumption from crashing machines: this was formerly done by modifying the Linux kernel, but Baidu is now considering controlling the resources used by mappers and reducers by introducing cgroups in Linux. Change the DataNode's concurrent read/write model from multithreading to select-based I/O, to support large-scale concurrent reads and writes and Hypertable applications.
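A rough sketch of the cgroup-based control mentioned above, using the standard files of the cgroup v1 memory controller; the group name, mount point, and 2 GB limit are assumptions for illustration:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CgroupMemoryLimit {
    public static void limitTask(String pid) throws IOException {
        // Assumed cgroup v1 mount point and group name.
        Path group = Paths.get("/sys/fs/cgroup/memory/hadoop-tasks");
        Files.createDirectories(group);
        // Cap physical memory for every process in the group at 2 GB.
        Files.write(group.resolve("memory.limit_in_bytes"),
                    "2147483648".getBytes());
        // Move the task process into the group.
        Files.write(group.resolve("tasks"), pid.getBytes());
    }
}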

Baidu also uses Hypertable, an open-source distributed data storage system modeled on Google's Bigtable, as a platform for analyzing user behavior. Baidu has likewise made improvements in areas such as metadata centralization, memory consumption optimization, and safe cluster shutdown with automatic recovery.
