Peripheral Ecosystem Software of Hadoop and Its Working Principles (II)

Sqoop:

Sqoop is another widely used piece of software in the Hadoop ecosystem, mainly serving as an ETL tool. It was developed at Yahoo and donated to Apache. Across the whole Hadoop ecosystem, most of the applications came out of Yahoo's research and development; Yahoo has contributed a great deal. Engineers who left Yahoo went on to found Cloudera and Hortonworks respectively.

The so-called ETL is data extraction (extract), transformation (transform), and loading (load): data in one form or representation is changed by code into another form or representation. Even reordering a matrix counts as an ETL step.
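A toy illustration in Python makes the definition concrete; it is not tied to any particular tool.

    # Extract a matrix, transform it by reordering (transposing) it,
    # and load the result by printing it.
    matrix = [[1, 2, 3],
              [4, 5, 6]]
    transposed = [list(col) for col in zip(*matrix)]  # the "T" in ETL
    print(transposed)  # [[1, 4], [2, 5], [3, 6]]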

Sqoop's most important capability is converting between many databases and data formats. By setting command-line parameters you can dump the structured data of Oracle or MySQL into unstructured storage in HDFS, and you can likewise extract data from HDFS and save it back into a database or as plain text. It is very flexible. Whether the intermediate transformation is done with Hive, your own MapReduce jobs, Pig, or Mahout does not matter: Sqoop simply provides an interface to the various systems, driven by command-line arguments.
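For example, a typical import might look like the sketch below, wrapped in Python like the rest of this article's examples. The host, credentials, table, and paths are hypothetical; the flags themselves (--connect, --table, --target-dir, -m) are standard Sqoop arguments.

    # Pull a structured MySQL table into HDFS as tab-separated text by
    # invoking the sqoop command line from Python.
    import subprocess

    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",  # JDBC URL of the source DB
        "--username", "etl_user",
        "--password", "secret",                    # plain text only for brevity
        "--table", "orders",                       # structured source table
        "--target-dir", "/data/orders",            # destination dir in HDFS
        "--fields-terminated-by", "\t",            # write tab-separated text
        "-m", "4",                                 # four parallel map tasks
    ]
    subprocess.run(cmd, check=True)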

In fact, Sqoop's implementation is not complicated, and you do not need to spend much time reimplementing it; as long as you understand its working principle, it is just a matter of building interfaces between the various databases and Hadoop. Rather than using Sqoop, we implemented a similar set of tools in Python.
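In that spirit, a stripped-down version of such a tool might look like this sketch: extract rows from MySQL and load them into HDFS as tab-separated text. It assumes the PyMySQL package and an hdfs command on the PATH; the connection details and table are hypothetical.

    # Extract: read rows out of MySQL.
    import subprocess
    import pymysql

    conn = pymysql.connect(host="dbhost", user="etl_user",
                           password="secret", database="sales")
    cur = conn.cursor()
    cur.execute("SELECT id, customer, amount FROM orders")

    # Transform: flatten each row into a tab-separated line.
    with open("orders.tsv", "w") as out:
        for row in cur:
            out.write("\t".join(str(v) for v in row) + "\n")
    conn.close()

    # Load: push the local dump into HDFS.
    subprocess.run(["hdfs", "dfs", "-put", "-f", "orders.tsv", "/data/orders/"],
                   check=True)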

Oozie:

Great stuff, a famous workflow system. It can string all sorts of data flows together; picture a street-side kebab stand. Just as one piece of tendon on the skewer is one data task and one piece of fat is another, and tendon and fat must be threaded alternately to produce the final delicacy, Oozie plays the part of the bamboo skewer: it strings the data tasks together, and after a period of waiting, the kebab is ready to eat. There may be all kinds of dependencies along the way, such as sprinkling salt or sprinkling pepper, and these too must be completed within the overall workflow.

A real BI decision is likely to pass through extremely complex data flows in which the data is highly interdependent. When task A finishes it can start tasks B and C, but B and C also depend on the data of task D; task E then relies on the data of B and C, and its result F is compared and analyzed against the output of A, finally yielding result G. Even this is a simple data flow; controlling the whole flow and its outputs along the way is exactly what you need Oozie for.
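Oozie itself is driven by XML workflow definitions, but the scheduling idea can be sketched in a few lines of Python (3.9+): topologically sort the dependency graph just described and launch each task once its prerequisites are done. The print is a stand-in for submitting a real MapReduce, Hive, or Pig job.

    # Each task maps to the set of tasks it depends on.
    from graphlib import TopologicalSorter

    deps = {
        "A": set(),
        "D": set(),
        "B": {"A", "D"},
        "C": {"A", "D"},
        "E": {"B", "C"},   # E's result F is compared against A's output...
        "G": {"E", "A"},   # ...to produce the final result G
    }

    for task in TopologicalSorter(deps).static_order():
        print("run task", task)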

Mahout:

Mahout can be called the crystallized wisdom of big-data algorithms. It contains many machine learning and artificial intelligence algorithms, some implemented on top of MapReduce and some not, and the collection covers almost every major area.

However, the Mahout algorithm library is too generic to fit every requirement. In actual use we rarely run our calculations with Mahout directly; more often we treat Mahout as a reference code base of algorithms and then refactor it to our own needs. For example, the recommendation, classification, and clustering algorithms used most heavily on the Internet all have to be implemented differently for different requirements. Still, even if only as an algorithm reference, Mahout is a very good thing. Its one problem is that updates have become very slow: since 0.7 was released in 2012, there has been no new release.
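To show the kind of algorithm involved, here is a miniature k-means clustering for 2-D points in plain Python, the sort of thing you might crib from Mahout's Java/MapReduce code and then refactor for your own needs. It is purely illustrative.

    import random

    def kmeans(points, k, iters=10):
        """Cluster 2-D points into k groups with plain k-means."""
        centers = random.sample(points, k)
        for _ in range(iters):
            # Assignment step: each point joins its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: (p[0] - centers[i][0]) ** 2
                                          + (p[1] - centers[i][1]) ** 2)
                clusters[nearest].append(p)
            # Update step: each center moves to the mean of its cluster.
            centers = [(sum(p[0] for p in c) / len(c),
                        sum(p[1] for p in c) / len(c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        return centers

    print(kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2))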

Pig:

Pig works much like Hive, appeared earlier than Hive, and was also developed by Yahoo. Before Hive came along, Pig had led the field alone in the Hadoop ecosystem; after Hive appeared it gradually declined. Pig Latin is, after all, a completely new language, whereas with SQL-like Hive a business can migrate almost without cost, and Pig still carries a certain learning cost. But Pig is more flexible than Hive for data processing, and it should count as the pioneer of compiling scripts into MapReduce applications.

But I still cannot write Pig Latin. Recently there has been an open source project that visualizes Pig jobs; it is very good, it is called Lipstick, and it is worth a try.

Bookkeeper:

BookKeeper is a subproject that came out of ZooKeeper. It is relatively new, and I have not looked at it closely, but judging from its introduction it should have a great deal to do with NameNode HA. The single point of failure has always been Hadoop's sore spot, and all kinds of distributed file systems share the problem: MooseFS, for instance, also has to rely on Heartbeat, DRBD, and the like to paper over its master single point. HDFS is no exception, so early on people proposed using ZooKeeper to provide NameNode warm standby and hot standby. But that approach is very complex: it must guard against split brain, and it cannot achieve near-real-time failover, because if ZooKeeper's check interval is set very short its load increases, while if the interval is set long you cannot get real-time hot standby. As I recall it had to be set to roughly 10-20 seconds. BookKeeper should be a subproject split out precisely to replace such overly complex solutions.
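The ZooKeeper-based failover idea can be sketched with the kazoo Python client. The hosts, election path, and handler below are hypothetical, and the session timeout is exactly the trade-off described above: shorter means faster failover but more pressure on ZooKeeper.

    from kazoo.client import KazooClient

    def become_active():
        # Hypothetical handler: fence the failed master and take over.
        print("this standby NameNode is now the active one")

    # timeout is the ZooKeeper session timeout in seconds.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181", timeout=10.0)
    zk.start()

    # Blocks until this node wins the election, then runs the callback.
    election = zk.Election("/hadoop/nn-election", identifier="standby-nn-1")
    election.run(become_active)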

Bigtop:

It was introduced in the previous article.
