White Elephant: An essential Hadoop tool for developers


LinkedIn is the world's largest professional social networking site. Founded in December 2002, it had grown to 200 million registered users by early 2013, adding roughly one new user per second; 86% of the Fortune 100 companies use LinkedIn's paid solutions, 2.7 million companies have pages on the site, and users run billions of searches every year. To handle data at this scale, LinkedIn uses Hadoop for product development, and to better understand how its Hadoop clusters are used across all of these use cases, the company created White Elephant.

The following is the full text of the article:

As Hadoop adoption grows, scheduling, capacity planning, and billing become critical, and they remain open problems. Today, we are pleased to announce that LinkedIn has open sourced its solution: White Elephant.

At LinkedIn, we use Hadoop for product development (predictive analytics applications such as People You May Know and Endorsements). To better understand how our Hadoop clusters are used across all of these use cases, we created White Elephant.

Although tools such as Ganglia provide system-level metrics, we want to understand the resources each individual user consumes over time. White Elephant parses the Hadoop logs to provide per-user, drill-down monitoring of the Hadoop cluster, along with aggregated task statistics, including total task time, slots used, CPU time, and failed jobs.
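
To make those statistics concrete, here is a minimal sketch of the kind of per-user roll-up White Elephant produces; the simplified in-memory records and field names are assumptions for illustration, not White Elephant's actual log format or parser.

```python
# Minimal sketch (not White Elephant's actual parser): aggregate per-user
# task statistics from simplified Hadoop task records.
from collections import defaultdict

# Each record: (user, task_seconds, cpu_seconds, succeeded) -- assumed format
sample_records = [
    ("user1", 3600, 3100, True),
    ("user1", 1800, 1500, False),
    ("user2", 7200, 6900, True),
]

def summarize(records):
    """Roll records up into per-user totals: task hours, CPU hours, failed jobs."""
    stats = defaultdict(lambda: {"task_hours": 0.0, "cpu_hours": 0.0, "failed_jobs": 0})
    for user, task_sec, cpu_sec, succeeded in records:
        s = stats[user]
        s["task_hours"] += task_sec / 3600.0
        s["cpu_hours"] += cpu_sec / 3600.0
        if not succeeded:
            s["failed_jobs"] += 1
    return dict(stats)

if __name__ == "__main__":
    for user, s in summarize(sample_records).items():
        print(user, s)
```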

White Elephant meets the following requirements:

Scheduling: White Elephant identifies periods of low utilization so that jobs can be scheduled into them, maximizing the efficiency of the cluster.

Capacity planning: Understanding how jobs' resource usage grows over time makes it possible to plan future hardware requirements.

Billing: A Hadoop cluster has limited capacity, so in a multi-tenant environment White Elephant makes it possible to allocate resource costs to jobs in proportion to what they use and the business value they deliver (see the chargeback sketch after this list).
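
As an illustration of the billing requirement, here is a hypothetical chargeback calculation that splits a cluster's cost in proportion to task hours consumed; the cost figure, team names, and proportional model are assumptions for illustration, not LinkedIn's actual billing scheme.

```python
# Hypothetical chargeback: bill each team in proportion to its share of
# total task hours. All numbers and team names below are illustrative.
monthly_cluster_cost = 50_000.0  # assumed total monthly cost of the cluster ($)

task_hours_by_team = {
    "recommendations": 12_000.0,
    "search": 6_000.0,
    "analytics": 2_000.0,
}

total_hours = sum(task_hours_by_team.values())
for team, hours in task_hours_by_team.items():
    share = hours / total_hours
    print(f"{team}: {share:.1%} of usage -> ${share * monthly_cluster_cost:,.2f}")
```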

In this article, we will share the architecture of White Elephant and show some of the visualizations it provides. We have published the code on GitHub, so you can try it yourself!

Architecture

Figure: White Elephant architecture

This diagram shows three Hadoop grids, A, B, and C. White Elephant processes their usage data in three stages:

Upload: A task runs regularly on each job tracker and incrementally copies new log files to a Hadoop grid for analysis (see the sketch of this incremental copy after this list).

Compute: A sequence of MapReduce jobs, coordinated by a job executor, parses the uploaded logs and computes the summary statistics.

View: A viewer application incrementally loads the summary statistics, caches them locally, and exposes a web interface for slicing and dicing the statistics of the Hadoop cluster.
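
Here is a minimal sketch of the incremental behaviour of the upload step: copy only log files that have not been copied before. The real upload task works against the job tracker and HDFS; the local paths and state file below are illustrative assumptions.

```python
# Sketch of an incremental log upload: copy only files not copied on a
# previous run. Paths and the state file are assumptions for illustration.
import os
import shutil

LOG_DIR = "/var/log/hadoop/history"                  # assumed source of job history logs
UPLOAD_DIR = "/data/white-elephant/incoming"         # assumed destination on the analysis grid
STATE_FILE = "/var/lib/white-elephant/uploaded.txt"  # remembers which files were already copied

def load_uploaded(state_file):
    """Return the set of file names copied on previous runs."""
    if not os.path.exists(state_file):
        return set()
    with open(state_file) as f:
        return {line.strip() for line in f}

def upload_new_logs(log_dir, upload_dir, state_file):
    """Copy only log files that have not been uploaded yet."""
    uploaded = load_uploaded(state_file)
    os.makedirs(upload_dir, exist_ok=True)
    os.makedirs(os.path.dirname(state_file), exist_ok=True)
    with open(state_file, "a") as state:
        for name in sorted(os.listdir(log_dir)):
            if name in uploaded:
                continue  # skip files copied on a previous run
            shutil.copy(os.path.join(log_dir, name), os.path.join(upload_dir, name))
            state.write(name + "\n")

if __name__ == "__main__":
    upload_new_logs(LOG_DIR, UPLOAD_DIR, STATE_FILE)
```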

Example

Here is an example of how we actually use it: we noticed that cluster usage has been rising over the past few months, but no one could account for the increase. We can use White Elephant to investigate.

The following chart shows the total task hours used per week on a sample data set over the last few months. Notice that the baseline of cluster usage has increased from approximately 6,000 hours to 10,000 hours per week since mid-January.
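
For context, the weekly totals behind such a chart can be produced by bucketing per-task hours by week, roughly as sketched below; the in-memory records are illustrative, not the actual summary data.

```python
# Sketch: bucket per-task hours by ISO week to get the weekly totals
# plotted in the chart. The records below are illustrative only.
from collections import defaultdict
from datetime import date

# (user, date the task finished, task hours) -- assumed, simplified records
records = [
    ("user1", date(2013, 1, 7), 120.0),
    ("user2", date(2013, 1, 9), 300.0),
    ("user1", date(2013, 1, 16), 450.0),
]

weekly_hours = defaultdict(float)
for user, day, hours in records:
    year, week, _ = day.isocalendar()
    weekly_hours[(year, week)] += hours

for (year, week), hours in sorted(weekly_hours.items()):
    print(f"{year}-W{week:02d}: {hours:,.0f} task hours")
```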

In the previous chart the entire data set was selected, so all users' data was grouped together. Let's instead look at a stacked graph of the top 20 users.

Now we can see the weekly usage of each of the top 20 users individually; the remaining 46 users are grouped into a single metric. Several users stand out as likely contributors to the increased cluster usage, so let's dig deeper.
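
The top-20-plus-remainder grouping can be expressed as a simple top-N selection with everyone else collapsed into an "other" series, as in this sketch using randomly generated usage numbers.

```python
# Sketch: keep the 20 heaviest users and collapse the rest into "other",
# mirroring the stacked chart's grouping. Usage numbers are random.
import random

random.seed(0)
usage_by_user = {f"user{i}": random.uniform(10, 2000) for i in range(1, 67)}  # 66 users

TOP_N = 20
ranked = sorted(usage_by_user.items(), key=lambda kv: kv[1], reverse=True)
top = dict(ranked[:TOP_N])
top["other"] = sum(hours for _, hours in ranked[TOP_N:])

for user, hours in top.items():
    print(f"{user}: {hours:,.1f} task hours")
```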

We can highlight these users by hovering the mouse over the legend.

With a drag-and-drop operation, we can rearrange the list so that these users appear at the bottom.

It looks like four users show significant usage increases: usage by User 1 and User 2 began to rise in mid-January, while usage by User 43 and User 65 began climbing steadily around December.

If we do not want to see these users' cluster usage, we can uncheck them in the legend.

Once these users are excluded, we can see that cluster usage did not change significantly over that period, so we have identified our culprits.

Let's go back to the four users. A multi-select control with a filter makes it easy to search for specific users by name and select them.

How do these four users compare to everyone else? For convenience, the remaining users are aggregated into a single total metric; we select only that metric and move it to the top.

Thanks to this unprecedented visibility into Hadoop usage, White Elephant let us find the problem. We can even view the queried data in a table and download it as a CSV.
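
As an example of working with such an export offline, the following sketch reads a CSV of weekly task hours per user and ranks the users whose usage grew the most between two weeks; the file name and column names are assumptions about the export format, not the viewer's actual schema.

```python
# Sketch: analyze an exported CSV of weekly task hours per user and find
# who grew the most between two weeks. Column names are assumed.
import csv
from collections import defaultdict

def usage_by_user_and_week(path):
    """Read rows with assumed columns: user, week, hours."""
    usage = defaultdict(dict)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            usage[row["user"]][row["week"]] = float(row["hours"])
    return usage

def biggest_increases(usage, week_a, week_b, top_n=5):
    """Rank users by growth in task hours from week_a to week_b."""
    deltas = {
        user: weeks.get(week_b, 0.0) - weeks.get(week_a, 0.0)
        for user, weeks in usage.items()
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

if __name__ == "__main__":
    usage = usage_by_user_and_week("white_elephant_export.csv")  # hypothetical file
    for user, delta in biggest_increases(usage, "2012-W50", "2013-W08"):
        print(f"{user}: +{delta:,.0f} task hours")
```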
