Peripheral Ecosystem Software of Hadoop and Its Working Principles (II)

Sqoop:

Sqoop is another widely used piece of software in the Hadoop ecosystem, mainly serving as an ETL tool. It was developed at Yahoo and donated to Apache. Across the whole Hadoop ecosystem, most of the applications came out of Yahoo's research and development; Yahoo has contributed a great deal. Engineers who left Yahoo went on to found Cloudera and Hortonworks respectively.

The so-called ETL is data extraction (extract), transformation (transform), and loading (load): data in one form or representation is changed by code into another form or representation. Even reordering a matrix counts as an ETL step.
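A toy illustration in Python makes the definition concrete; it is not tied to any particular tool.

    # Extract a matrix, transform it by reordering (transposing) it,
    # and load the result by printing it.
    matrix = [[1, 2, 3],
              [4, 5, 6]]
    transposed = [list(col) for col in zip(*matrix)]  # the "T" in ETL
    print(transposed)  # [[1, 4], [2, 5], [3, 6]]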

Sqoop's most important capability is converting between many databases and data formats. By setting command-line parameters you can dump the structured data of Oracle or MySQL into unstructured storage in HDFS, and you can likewise extract data from HDFS and save it back into a database or as plain text. It is very flexible. Whether the intermediate transformation is done with Hive, your own MapReduce jobs, Pig, or Mahout does not matter: Sqoop simply provides an interface to the various systems, driven by command-line arguments.
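For example, a typical import might look like the sketch below, wrapped in Python like the rest of this article's examples. The host, credentials, table, and paths are hypothetical; the flags themselves (--connect, --table, --target-dir, -m) are standard Sqoop arguments.

    # Pull a structured MySQL table into HDFS as tab-separated text by
    # invoking the sqoop command line from Python.
    import subprocess

    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",  # JDBC URL of the source DB
        "--username", "etl_user",
        "--password", "secret",                    # plain text only for brevity
        "--table", "orders",                       # structured source table
        "--target-dir", "/data/orders",            # destination dir in HDFS
        "--fields-terminated-by", "\t",            # write tab-separated text
        "-m", "4",                                 # four parallel map tasks
    ]
    subprocess.run(cmd, check=True)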

In fact, Sqoop's implementation is not complicated, and you do not need to spend much time reimplementing it; as long as you understand its working principle, it is just a matter of building interfaces between the various databases and Hadoop. Rather than using Sqoop, we implemented a similar set of tools in Python.
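In that spirit, a stripped-down version of such a tool might look like this sketch: extract rows from MySQL and load them into HDFS as tab-separated text. It assumes the PyMySQL package and an hdfs command on the PATH; the connection details and table are hypothetical.

    # Extract: read rows out of MySQL.
    import subprocess
    import pymysql

    conn = pymysql.connect(host="dbhost", user="etl_user",
                           password="secret", database="sales")
    cur = conn.cursor()
    cur.execute("SELECT id, customer, amount FROM orders")

    # Transform: flatten each row into a tab-separated line.
    with open("orders.tsv", "w") as out:
        for row in cur:
            out.write("\t".join(str(v) for v in row) + "\n")
    conn.close()

    # Load: push the local dump into HDFS.
    subprocess.run(["hdfs", "dfs", "-put", "-f", "orders.tsv", "/data/orders/"],
                   check=True)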

Oozie:

Great stuff, a famous workflow system. It can string all sorts of data flows together; picture a street-side kebab stand. Just as one piece of tendon on the skewer is one data task and one piece of fat is another, and tendon and fat must be threaded alternately to produce the final delicacy, Oozie plays the part of the bamboo skewer: it strings the data tasks together, and after a period of waiting, the kebab is ready to eat. There may be all kinds of dependencies along the way, such as sprinkling salt or sprinkling pepper, and these too must be completed within the overall workflow.

A real BI decision is likely to pass through extremely complex data flows in which the data is highly interdependent. When task A finishes it can start tasks B and C, but B and C also depend on the data of task D; task E then relies on the data of B and C, and its result F is compared and analyzed against the output of A, finally yielding result G. Even this is a simple data flow; controlling the whole flow and its outputs along the way is exactly what you need Oozie for.
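Oozie itself is driven by XML workflow definitions, but the scheduling idea can be sketched in a few lines of Python (3.9+): topologically sort the dependency graph just described and launch each task once its prerequisites are done. The print is a stand-in for submitting a real MapReduce, Hive, or Pig job.

    # Each task maps to the set of tasks it depends on.
    from graphlib import TopologicalSorter

    deps = {
        "A": set(),
        "D": set(),
        "B": {"A", "D"},
        "C": {"A", "D"},
        "E": {"B", "C"},   # E's result F is compared against A's output...
        "G": {"E", "A"},   # ...to produce the final result G
    }

    for task in TopologicalSorter(deps).static_order():
        print("run task", task)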

Mahout:

Mahout can be called the crystallized wisdom of big-data algorithms. It contains many machine learning and artificial intelligence algorithms, some implemented on top of MapReduce and some not, and the collection covers almost every major area.

However, the Mahout algorithm library is too generic to fit every requirement. In actual use we rarely run our calculations with Mahout directly; more often we treat Mahout as a reference code base of algorithms and then refactor it to our own needs. For example, the recommendation, classification, and clustering algorithms used most heavily on the Internet all have to be implemented differently for different requirements. Still, even if only as an algorithm reference, Mahout is a very good thing. Its one problem is that updates have become very slow: since 0.7 was released in 2012, there has been no new release.
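To show the kind of algorithm involved, here is a miniature k-means clustering for 2-D points in plain Python, the sort of thing you might crib from Mahout's Java/MapReduce code and then refactor for your own needs. It is purely illustrative.

    import random

    def kmeans(points, k, iters=10):
        """Cluster 2-D points into k groups with plain k-means."""
        centers = random.sample(points, k)
        for _ in range(iters):
            # Assignment step: each point joins its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: (p[0] - centers[i][0]) ** 2
                                          + (p[1] - centers[i][1]) ** 2)
                clusters[nearest].append(p)
            # Update step: each center moves to the mean of its cluster.
            centers = [(sum(p[0] for p in c) / len(c),
                        sum(p[1] for p in c) / len(c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        return centers

    print(kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2))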

Pig:

Pig works much like Hive, appeared earlier than Hive, and was also developed by Yahoo. Before Hive came along, Pig had led the field alone in the Hadoop ecosystem; after Hive appeared it gradually declined. Pig Latin is, after all, a completely new language, whereas with SQL-like Hive a business can migrate almost without cost, and Pig still carries a certain learning cost. But Pig is more flexible than Hive for data processing, and it should count as the pioneer of compiling scripts into MapReduce applications.

But I still cannot write Pig Latin. Recently there has been an open source project that visualizes Pig jobs; it is very good, it is called Lipstick, and it is worth a try.

Bookkeeper:

BookKeeper is a subproject that came out of ZooKeeper. It is relatively new, and I have not looked at it closely, but judging from its introduction it should have a great deal to do with NameNode HA. The single point of failure has always been Hadoop's sore spot, and all kinds of distributed file systems share the problem: MooseFS, for instance, also has to rely on Heartbeat, DRBD, and the like to paper over its master single point. HDFS is no exception, so early on people proposed using ZooKeeper to provide NameNode warm standby and hot standby. But that approach is very complex: it must guard against split brain, and it cannot achieve near-real-time failover, because if ZooKeeper's check interval is set very short its load increases, while if the interval is set long you cannot get real-time hot standby. As I recall it had to be set to roughly 10-20 seconds. BookKeeper should be a subproject split out precisely to replace such overly complex solutions.
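The ZooKeeper-based failover idea can be sketched with the kazoo Python client. The hosts, election path, and handler below are hypothetical, and the session timeout is exactly the trade-off described above: shorter means faster failover but more pressure on ZooKeeper.

    from kazoo.client import KazooClient

    def become_active():
        # Hypothetical handler: fence the failed master and take over.
        print("this standby NameNode is now the active one")

    # timeout is the ZooKeeper session timeout in seconds.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181", timeout=10.0)
    zk.start()

    # Blocks until this node wins the election, then runs the callback.
    election = zk.Election("/hadoop/nn-election", identifier="standby-nn-1")
    election.run(become_active)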

Bigtop:

It was introduced in the previous article.
