This article is a brief introduction to the Hadoop technology ecosystem, along with a previously written hands-on practice tutorial for anyone who needs it.
Today, in the era of cloud computing and big data, Hadoop and its related technologies play a very important role; they form a technology platform that cannot be ignored. In fact, thanks to its open source nature, low cost, and unprecedented scalability, Hadoop is becoming the data processing platform of a new generation.
Hadoop is a distributed data processing framework written in Java. Looking back at its history, you can see that Hadoop was born with a noble pedigree and has had the wind at its back ever since:
In 2004, Google published a paper that introduced MapReduce to the world.
In early 2005, to support the Nutch search engine project, the Nutch developers built a working MapReduce implementation in Nutch based on Google's MapReduce paper.
During 2005, all of the major Nutch algorithms were ported to run on MapReduce and NDFS (the Nutch Distributed File System).
In February 2006, the Apache Hadoop project was officially launched to support the independent development of MapReduce and HDFS.
In 2007, Baidu started using Hadoop for offline processing; currently nearly 80% of its Hadoop clusters are used for log processing.
In 2008, Taobao began building its Hadoop-based system, Ladder, and used it to process e-commerce data. Ladder 1 has a total capacity of about 9.3 PB across roughly 1,100 machines, and handles about 18,000 jobs and scans about 500 TB of data every day.
In January 2008, Hadoop became a top-level Apache project.
In July 2008, Hadoop broke the world record for the 1 TB sorting benchmark: a Hadoop cluster at Yahoo sorted 1 TB of data in 209 seconds, beating the previous year's record of 297 seconds by nearly 90 seconds.
......
When people first encounter Hadoop, many think of it as a single project. In fact, beyond the core MapReduce and HDFS, Hadoop contains a number of subprojects; in other words, Hadoop has grown into a rich technology ecosystem, with projects such as Hive, HBase, Pig, and ZooKeeper all belonging to this family.
Why was such a technology born in the first place?
In short, with the rapid development of the Internet, storing and analyzing massive amounts of data has hit a bottleneck: disk capacity has grown far faster than disk read speed. For a 1 TB disk with a transfer rate of 100 MB/s, reading the whole disk already takes around two and a half to three hours, and that is before you even mention writing the data back; it is enough to make your heart sink (although SSDs in production environments have greatly alleviated this dilemma). Data growth in Internet applications is obvious: a successful Internet application easily has tens of millions of users, and whatever the data volume, the pressure keeps increasing. At the enterprise level, many large and medium-sized companies have been through ten-plus years of informatization and have accumulated large amounts of unstructured data, with documents of every kind that need to be stored, backed up, analyzed, and presented, yet with no good way to process them.
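To put a rough number on that claim (a back-of-the-envelope calculation, assuming decimal units and an ideal, sustained sequential read at the quoted 100 MB/s):

    1 TB ≈ 1,000,000 MB
    1,000,000 MB ÷ 100 MB/s = 10,000 s ≈ 2.8 hours

Real-world throughput with seeks and filesystem overhead is lower, so the figure only gets worse, which is exactly why reading from many disks in parallel matters.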
So how do you solve such problems? The technical gurus naturally have their methods: reading and writing disk data in parallel, splitting data into chunks, distributed file systems, data redundancy, the MapReduce algorithm, and so on. Eventually, technologies like Hadoop emerged, and commoners like me were blessed. A small code sketch of the MapReduce idea follows below.
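To make the MapReduce idea concrete, here is a minimal sketch of the classic word-count job written against Hadoop's org.apache.hadoop.mapreduce API (assuming a reasonably recent Hadoop release; minor details such as Job.getInstance differ slightly across versions). The input sits in HDFS, each mapper processes one block of it in parallel, and the framework shuffles all counts for the same word to a single reducer:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: each mapper reads one block of the input in parallel
      // and emits (word, 1) for every word it sees.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer it = new StringTokenizer(value.toString());
          while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: all counts for the same word arrive at one reducer,
      // which simply sums them up.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, it would be run with something like "bin/hadoop jar wordcount.jar WordCount /input /output", where /input and /output are HDFS paths and the output directory must not already exist.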
Isn't there a saying that more data beats a better algorithm? With enough data you may end up with unexpected applications; just look at Facebook, Twitter, Weibo, and the applications derived from them. Besides, whether your algorithm is good or bad, more data almost always brings a better recommendation effect, which is also obvious.
So, no matter how loudly the slogans of cloud computing and big data are shouted, Hadoop itself is a very pragmatic technology; whether you work at an Internet company or a traditional software company, you should learn about and understand it.
Below is the keynote from an internal technical exchange I gave earlier, a brief introduction to Hadoop plus a hands-on practice tutorial; I hope it helps a little.
One more point: Hadoop can be deployed in three modes, local (standalone) mode, pseudo-distributed mode, and fully distributed mode. I recommend practicing with the third one, because it gives you a much deeper understanding of how the system is really used. That means you need at least two machines to form a cluster, and the easiest way to get them is virtual machines. Hadoop natively supports Unix-like systems; if you want to play with it on Windows, you need to install Cygwin as a simulated environment. This is where Mac users have an advantage: I used my Mac as the master and started two virtual Linux machines as slaves, and with an SSD plus 8 GB of RAM there was no pressure at all. The benefit of doing it this way is, as the book on the Unix programming philosophy puts it, achieving the largest possible scope of work within the smallest possible working environment. A rough sketch of the cluster configuration follows below.
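For reference, a minimal sketch of what the fully distributed setup looks like (assuming a Hadoop 1.x-era layout, as in this article's time frame, with the master host hypothetically named "master" and default ports; in Hadoop 2.x the property fs.default.name becomes fs.defaultFS and MapReduce runs on YARN). The same config files go on every node.

    conf/core-site.xml:
      <configuration>
        <property>
          <name>fs.default.name</name>            <!-- where the HDFS NameNode lives -->
          <value>hdfs://master:9000</value>
        </property>
      </configuration>

    conf/hdfs-site.xml:
      <configuration>
        <property>
          <name>dfs.replication</name>            <!-- 2 copies is plenty for a toy cluster -->
          <value>2</value>
        </property>
      </configuration>

    conf/mapred-site.xml:
      <configuration>
        <property>
          <name>mapred.job.tracker</name>         <!-- where the JobTracker lives -->
          <value>master:9001</value>
        </property>
      </configuration>

The conf/slaves file simply lists the slave hostnames, one per line. With passwordless SSH set up from the master to every slave, you then run on the master:

    bin/hadoop namenode -format    # format HDFS once, before the first start
    bin/start-all.sh               # start NameNode/DataNodes and JobTracker/TaskTrackers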
Original link: http://www.cnblogs.com/chijianqiang/archive/2012/06/25/hadoop-info.html