Knowledge Chapter: An Introduction to Hadoop, the New Generation Data Processing Platform

In today's era of cloud computing and big data, Hadoop and its related technologies play a very important role: they form a technology platform that cannot be ignored. Indeed, thanks to being open source, low cost, and unprecedentedly scalable, Hadoop is becoming the new generation of data-processing platform.

Hadoop is a distributed data-processing framework written in Java. Looking back over its history, you can see that Hadoop was born with a distinguished pedigree and has had the wind at its back ever since:

In 2004, Google published the paper that introduced MapReduce to the world.

In early 2005, to support the Nutch search engine project, Nutch's developers built a working MapReduce implementation inside Nutch, based on the paper Google had released.

Later in 2005, all of the major Nutch algorithms were ported to run on MapReduce and NDFS (the Nutch Distributed File System).

In February 2006, the Apache Hadoop project was officially launched, allowing MapReduce and HDFS to be developed independently.

In 2007, Baidu began using Hadoop for offline processing; today nearly 80% of its Hadoop clusters are used for log processing.

In 2008, Taobao began building Ladder, a Hadoop-based system, and used it to process e-commerce data. Ladder 1 has a total capacity of about 9.3 PB across 1,100 machines, handling about 18,000 jobs and scanning 500 TB of data per day.

In January 2008, Hadoop became a top-level Apache project.

In July 2008, Hadoop broke the benchmark record for sorting 1 TB of data: a Hadoop cluster at Yahoo! finished the 1 TB sort in 209 seconds, beating the previous year's record of 297 seconds by nearly 90 seconds.

......

When many people first encounter Hadoop, they think of it as a single project. In fact, besides the core MapReduce and HDFS, Hadoop contains a number of subprojects; in other words, Hadoop has formed a rich technology ecosystem of its own.

With the rapid development of the Internet, data storage and analysis ran into a bottleneck: disk capacity has grown far faster than disk read speed. For a 1 TB disk with a transfer rate of 100 MB/s, just reading the disk end to end takes about 2.8 hours (1,000,000 MB ÷ 100 MB/s = 10,000 s), to say nothing of writing the data; enough to make your heart sink (although the practical use of SSDs in production environments has greatly eased this dilemma).
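To make that arithmetic concrete, here is a small back-of-the-envelope sketch of my own (not from the original article); the figures are the nominal ones above, and the 100-disk number assumes the data is chunked evenly and read in perfect parallel:

```java
// Back-of-the-envelope scan times: one disk vs. data spread across many disks.
public class ScanTime {
    public static void main(String[] args) {
        double diskSizeMB = 1_000_000;  // ~1 TB of data
        double throughputMBs = 100;     // ~100 MB/s sequential read
        double oneDisk = diskSizeMB / throughputMBs;  // 10,000 s
        System.out.printf("1 disk   : %.0f s (~%.1f h)%n", oneDisk, oneDisk / 3600);

        int disks = 100;                // chunk the same data across 100 disks
        double parallel = oneDisk / disks;            // 100 s
        System.out.printf("%d disks: %.0f s (~%.1f min)%n", disks, parallel, parallel / 60);
    }
}
```

Reading many disks in parallel is exactly the trick that chunked, distributed storage exploits.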

Data growth in Internet applications is plain to see: a successful Internet application easily has tens of millions of users, and both the data volume and the load keep increasing.

In addition, at the enterprise level, many large and medium-sized enterprises have been through more than a decade of informatization and have accumulated large amounts of unstructured data, with documents of every kind that need to be stored, backed up, analyzed, and presented, yet no good way to process them.

So how do we solve such problems? The technical gurus naturally had methods: reading and writing disks in parallel, chunking data into blocks, distributed file systems, redundant copies of data, the MapReduce algorithm, and so on. In the end, technologies like Hadoop emerged, and the rest of us ordinary folks were the beneficiaries. The heart of it, MapReduce, is sketched below.
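As a concrete illustration of the MapReduce model just mentioned, here is the classic word-count job, a minimal sketch following the shape of the standard Hadoop 2.x MapReduce tutorial (my own addition, not code from the original article). The mapper emits a (word, 1) pair for every word, the framework groups the pairs by word, and the reducer sums the counts:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: for each word, sum all the 1s emitted by the mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Once a cluster (or a pseudo-distributed setup) is running, it is launched with something like `hadoop jar wordcount.jar WordCount /input /output`.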

Isn't there a saying that more data beats a better algorithm? Given enough data, unexpected applications may emerge; just look at the derivative applications that have grown up around Facebook, Twitter, and microblogging. Besides, good algorithm or bad, more data almost always brings better recommendations, which is equally obvious.

So, however loudly the slogans of cloud computing and big data are shouted, Hadoop is a very pragmatic technology, and whether you are at an Internet company or a traditional software company, you should learn and understand it.

Hadoop offers three deployment modes: local (standalone) mode, pseudo-distributed mode, and fully distributed mode. I suggest you practice with the third, so that you understand how the system is used much more deeply; a configuration sketch follows below.
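For a feel of what distinguishes the modes, here is a minimal sketch of the two core HDFS configuration files for a small fully distributed cluster. Hadoop's configuration files really are XML, but the specific values here are illustrative assumptions of mine: the hostname master and port 9000 are hypothetical placeholders, and replication is set to 2 only to match the two-slave setup described below.

```xml
<!-- core-site.xml (on every node): where the NameNode lives.
     "master" is a hypothetical hostname used for illustration. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: keep two replicas of each block, one per slave. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```

In local mode these files stay essentially empty and everything runs in a single JVM; in pseudo-distributed mode fs.defaultFS points at localhost and replication drops to 1.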

This requires at least two machines to form a cluster, and the more convenient route is to use virtual machines. Hadoop natively supports Unix/Linux; if you want to play with it on Windows, you also need to install Cygwin as a simulation environment.

This is where being a Mac user pays off: I used a Mac as the master and brought up two virtual Linux machines as slaves; with an SSD and 8 GB of memory, it ran without strain. The benefit of this approach is the one described in The Art of Unix Programming: achieve the maximum working scope with the minimum working environment.
