On the Automated Deployment and Operation of Large-Scale Hadoop Clusters

Last Update:2014-12-22 Source: Internet

Author: User

On November 22-23, 2013, the 2013 Hadoop China Technology Summit (China Hadoop Summit 2013), billed as the only large-scale industry event dedicated to sharing Hadoop technology and applications, was held at the Four Points by Sheraton hotel in Beijing. Nearly a thousand CIOs, CTOs, architects, IT managers, consultants, engineers, Hadoop enthusiasts, and IT vendors and technologists engaged in Hadoop research and promotion took part, drawn from a range of industries at home and abroad.

The Hadoop China Technology Summit was hosted by the China Hadoop Summit Committee of Experts, supported by IT168, ITPUB and ChinaUnix, and organized by Channel Media. With the theme of "effectiveness, application and innovation", the conference aimed to raise the level at which Chinese enterprise users apply Hadoop, to lower both the technical threshold and the investment budget needed to adopt it, and to popularize the value of big data applications through open and extensive sharing and exchange. The author is reporting from Forum One: Architecture and Practice. The keynote presented here was given by Shanglei, director at Beijing Digital Science and Technology, author of the EasyHadoop and phpHiveAdmin software, and development manager at Storm Audio and Video. His talk is titled "On the Automated Deployment and Operation of Large-Scale Hadoop Clusters". The following is a transcript of Shanglei's speech.

The deployment of Hadoop is a hassle.

Putting Hadoop into practice. As I said at the start, many companies talk about big data, but as far as I know, the ones that have actually put it into production and have real operational capability are concentrated among Internet companies. For other enterprises, adopting big data technology is considerably harder, because deployment is cumbersome and the operations and maintenance problems come afterwards. In fact, the hardest part of a distributed cluster is its later maintenance. When I took part in the Aliyun contest last year, I heard people from Taobao say that they run a very large cluster, and that once a cluster grows beyond 5,000 servers, the chance of a hard drive failing on any given day exceeds 99.6%. So it is not enough to understand Hadoop; you also need to know networking. They said that operating Hadoop is very complex: installation is not very complicated, and it is use and maintenance that are the real problem. So we looked for a way to automate operations.
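The 99.6% figure for a 5,000-server cluster can be sanity-checked with a little arithmetic. The sketch below reads it as the probability that at least one server has a drive failure on a given day, and assumes failures are independent with the same per-server daily probability. Both are assumptions made for illustration; the talk does not spell them out, and the function names are hypothetical.

```python
def prob_any_failure(n_servers: int, p_daily: float) -> float:
    """Probability that at least one of n_servers sees a failure on a given
    day, assuming independent failures with equal daily probability."""
    return 1.0 - (1.0 - p_daily) ** n_servers


def implied_daily_rate(n_servers: int, p_any: float) -> float:
    """Per-server daily failure probability implied by a cluster-wide
    'at least one failure' probability, under the same assumptions."""
    return 1.0 - (1.0 - p_any) ** (1.0 / n_servers)


# Figures quoted in the talk: more than 5,000 servers, more than 99.6%.
p = implied_daily_rate(5000, 0.996)
print(f"implied per-server daily failure probability: {p:.5f}")
print(f"cluster-wide check: {prob_any_failure(5000, p):.3f}")
```

Under these assumptions, a per-server daily failure probability of only about one in a thousand is enough to make a failure somewhere in the cluster a near-daily event, which is why maintenance dominates at this scale.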

Nothing in the Hadoop ecosystem is WYSIWYG. Most of the software runs from the command line, so many users simply cannot use it. Even inside our own Internet company, it is unrealistic to expect people such as product and operations staff to work on the command line. So usability is a problem for the Hadoop ecosystem as a whole.

Application development, whether for data applications, data analysis, or data mining, requires people who know both the company's business and Hadoop, which makes it more complicated. These, I think, are the key points in actually putting Hadoop into practice. From private conversations with a number of business owners, what enterprises really weigh, large enterprises perhaps most of all, is not whether the technology is useful, but who will "take the blame" when a problem occurs.

Hadoop's waters run deep.

Operating Hadoop is a discipline in itself. From my years of experience operating Hadoop, the water is very deep and the pitfalls are many, and we got through them one step at a time. Linux itself is an obstacle, since not everyone can use the Linux command line. You will run into all kinds of problems: the bug may be in the JVM, in Hadoop, or in a script written in an interpreted language. There may even be bugs in the JDK. Hadoop's logs are a great challenge for users. To solve the problems you encounter, you have to read the logs and analyze the Hadoop source code to find what went wrong, and this is what tests people the most.

Many problems also arise in between: VPN, VLAN, routing, and switch issues. There are algorithm issues too. We once hit a very serious problem, and when we finally investigated, the Hadoop logs had recorded nothing. In the end we tracked it down through the OS syslog and GDB. The Hadoop ecosystem is a very large body of source code with all kinds of logs, enough to drive a person to despair. Hadoop itself is a producer of big data: each server in our cluster produces roughly 2 GB of logs. To solve Hadoop's problems, you need to analyze the entire Hadoop log.
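As a rough illustration of what a first pass over Hadoop logs can look like, the sketch below counts WARN, ERROR, and FATAL entries per logger class. It assumes the logs use Hadoop's default log4j layout (timestamp, level, logger name, colon, message); clusters with a customized layout would need a different pattern. The sample lines are made up for the example and are not taken from the speaker's cluster, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed layout: "2013-11-22 10:15:32,123 LEVEL some.logger.Name: message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)\s+([\w.$]+): (.*)$"
)


def summarize(lines):
    """Count WARN/ERROR/FATAL entries per (level, short logger name)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # e.g. stack-trace continuation lines
        _, level, logger, _ = m.groups()
        if level in ("WARN", "ERROR", "FATAL"):
            counts[(level, logger.rsplit(".", 1)[-1])] += 1
    return counts


# Made-up sample lines in the assumed layout (illustrative only).
SAMPLE = [
    "2013-11-22 10:15:32,123 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: example message A",
    "2013-11-22 10:15:33,456 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: example message B",
    "2013-11-22 10:15:34,789 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: example message C",
    "java.io.IOException: example continuation line",
    "2013-11-22 10:15:35,001 ERROR org.apache.hadoop.mapred.TaskTracker: example message D",
]

for (level, logger), n in sorted(summarize(SAMPLE).items()):
    print(level, logger, n)
```

At around 2 GB of logs per server, a single-machine script like this only goes so far; the same counting logic would more likely run as a job on the cluster itself, which is the sense in which Hadoop produces its own big data.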

My personal view is that for Hadoop and its entire surrounding ecosystem, whichever company is involved, the most important thing is to improve the ecosystem's usability and its friendliness to users. A golden phoenix cannot stay forever on the branches of the parasol tree; she needs to fly down into ordinary households. We embrace the open source world: we look for open source components, try them in various combinations, and, together with our own research and development, enter the world of big data.

In the Hadoop ecosystem, ease of use is king

Now I'll talk about what our team has built for big data, all of which is open source. I strongly support being fully open source. We are also code contributors to Apache projects. In the Hadoop ecosystem, improving ease of use is king. We open our tools directly to all product and operations staff, which reduces the pressure on us, because they can work on their own. This is Stonger Phphbase, a Ella of the Apache project. You can see that our data volume is very large; it also handles monitoring of the access list and of hot regions.

Hadoop itself is also a creator of big data, so how do you analyze Hadoop's own logs? Products like Hunk cost money, so we implemented a very similar feature ourselves. Open source has a lot waiting for everyone to dig into, and it is not difficult. We are also working on a jointly built Hadoop project, which will have a cross-province Hadoop cluster for transferring data and submitting computing tasks.

As for open source, we all use it. We also open-source what we write ourselves and contribute it to everyone, and we encourage more people to do open source. Programmers really can change the world: as long as you have an idea, however big, it can change the world. Finally, borrowing from SQLite: may you do good and not evil; may you find forgiveness for yourself and forgive others; may you share freely, never taking more than you give.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
