Hadoop has always been a technology I wanted to learn. When our project team recently started building an e-commerce mall, I began to study Hadoop. Although we eventually concluded that Hadoop is not a good fit for our project, I will keep studying it at my own pace.
The Hadoop Essentials Tutorial is the first Hadoop book I have read. Admittedly, I have only read the free preview of its first chapter, but it gave me a preliminary understanding of Hadoop's history, core technologies, and application scenarios.
Hadoop's embryonic form dates back to Apache Nutch in 2002. Nutch is an open-source search engine implemented in Java; it provides all the tools needed to run your own search engine, including full-text search and a web crawler.
Then, in 2003, Google published a technical paper on the Google File System (GFS). GFS is a proprietary file system that Google designed to store its massive amounts of search data.
In 2004, Nutch founder Doug Cutting, drawing on Google's GFS paper, implemented a distributed file storage system named NDFS.
Also in 2004, Google published another technical paper, this one on MapReduce. MapReduce is a programming model for the parallel analysis of large datasets (larger than 1 TB).
In 2005, Doug Cutting again built on Google's work and implemented MapReduce inside the Nutch search engine.
In 2006, Yahoo hired Doug Cutting, who upgraded NDFS and MapReduce and renamed the result Hadoop. Yahoo set up a dedicated team for Doug Cutting to focus on developing Hadoop.
It has to be said that Google and Yahoo have contributed to Hadoop.
The core of Hadoop is HDFS and MapReduce. Both are theoretical foundations rather than concrete, high-level applications; many of Hadoop's classic sub-projects, such as HBase and Hive, are built on top of HDFS and MapReduce. To understand Hadoop, you have to know what HDFS and MapReduce are.
HDFS (Hadoop Distributed File System) is a highly fault-tolerant system designed to be deployed on inexpensive machines. It provides high-throughput data access and is well suited to applications with very large datasets.
The design features of HDFS are:
1. Large data files: HDFS is well suited to storing terabyte-scale files, or large collections of big data files; storing files that are only a few gigabytes or smaller brings little benefit.
2. Block storage: HDFS splits a complete large file into blocks and stores them on different machines, so a file can be read from multiple hosts at the same time, each supplying different blocks; reading from many hosts is far more efficient than reading from a single one.
3. Streaming data access: write once, read many times. Unlike traditional files, HDFS does not support modifying file contents in place; once written, a file should not change, and the only way to change it is to append content at its end (see the sketch after this list).
4. Cheap hardware: HDFS runs on ordinary PCs, a design that lets a company prop up a big-data cluster with just dozens of cheap machines.
5. Hardware failure: HDFS assumes any machine can fail. To prevent a host failure from making its blocks unreadable, it copies each block to several other hosts; if one host fails, a replica of the block can quickly be fetched from another.
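The write-once, append-only access pattern and block replication can be seen directly through the HDFS Java client API. The following is a minimal sketch under assumptions of my own: the NameNode address hdfs://namenode:9000 and the path /data/logs/app.log are hypothetical, and appending only works on a cluster that has append enabled.
Java code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/logs/app.log");       // hypothetical file path

        // Write once: create the file and stream the content into it (feature 3).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first batch of records\n");
        }

        // Existing content cannot be modified in place; new data can only be
        // appended at the end (assumes the cluster supports append).
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended records\n");
        }

        // Ask for three replicas of every block of this file (feature 5).
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}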
The key elements of HDFS are:
Block: a file is split into blocks, usually 64 MB each.
NameNode: stores the directory information of the entire file system together with file metadata and block information. It runs on a single dedicated host, so if that host fails, the NameNode fails with it. Starting with Hadoop 2.x, an active-standby mode is supported: if the primary NameNode fails, the standby host takes over and runs the NameNode.
DataNode: distributed across the inexpensive machines and used to store the block files themselves. The sketch below shows how a client can ask which blocks a file has and which DataNodes hold them.
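To make Block, NameNode, and DataNode concrete, here is a minimal sketch that queries a file's metadata through the FileSystem API: the NameNode answers the query, and the hosts returned for each block are DataNodes. The address and path are again hypothetical.
Java code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfoSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/logs/app.log");       // hypothetical file path
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size : " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // The NameNode serves this metadata; the hosts listed for each block are DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", stored on " + String.join(", ", block.getHosts()));
        }

        fs.close();
    }
}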
In layman's terms, MapReduce is a programming model for extracting and analyzing the elements we need from massive source data and finally returning a result set. Distributing the files across disks is only the first step; pulling the analysis we need out of that massive data is what MapReduce does.
Take finding the maximum value in a large amount of data as an example: a bank has billions of depositors and wants to know the largest amount deposited. The traditional way would be:
Java code
long[] moneys = ...;   // deposit amounts, all loaded into memory
long max = 0L;
for (int i = 0; i < moneys.length; i++) {
    if (moneys[i] > max) {
        max = moneys[i];
    }
}
If the array is small, this implementation works fine, but it becomes a problem when faced with massive amounts of data.
MapReduce does it like this: first the numbers are stored in different blocks; each map task takes a few blocks and computes the maximum value within them; then the maximum from each map is handed to a reduce task, which picks out the overall maximum and returns it to the user.
The basic principle of MapReduce is to split a large data-analysis job into small pieces, analyze each piece, then aggregate the extracted results and analyze them again to finally get the content we want. Of course, how to split the data for analysis and how to perform the reduce operation are quite complex; Hadoop already provides the implementation, and we only need to write a simple program stating our requirements to obtain the data we want. A sketch of such a program follows.
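For the bank example above, a job written against the org.apache.hadoop.mapreduce API might look roughly like the sketch below. It assumes the input files contain one deposit amount per line; the class name and the input/output paths passed as arguments are made up for illustration.
Java code
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxDeposit {

    // Map: each map task reads its blocks of the input and emits only the
    // largest amount it has seen, so very little data crosses the network.
    public static class MaxMapper
            extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
        private long max = Long.MIN_VALUE;

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            String line = value.toString().trim();  // one amount per line (assumption)
            if (!line.isEmpty()) {
                max = Math.max(max, Long.parseLong(line));
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), new LongWritable(max));
        }
    }

    // Reduce: take the per-map maxima and keep the overall maximum.
    public static class MaxReducer
            extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<LongWritable> values,
                              Context context) throws IOException, InterruptedException {
            long max = Long.MIN_VALUE;
            for (LongWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(NullWritable.get(), new LongWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max deposit");
        job.setJarByClass(MaxDeposit.class);
        job.setMapperClass(MaxMapper.class);
        job.setCombinerClass(MaxReducer.class);
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}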
In general, Hadoop is suitable for big-data storage and big-data analytics, fits clusters of several thousand to tens of thousands of servers, and supports petabytes of storage capacity.
Typical Hadoop applications include search, log processing, recommendation systems, data analysis, video and image analysis, data archiving, and more.
But bear in mind that Hadoop's range of application is much narrower than that of SQL or scripting languages such as Python, so don't use Hadoop blindly; after reading this preview chapter, I knew Hadoop would not work for our project. Still, Hadoop is a hot word in big data, and I think any avid programming enthusiast should learn it; maybe your next employer will need Hadoop talent, right?