What is Hadoop?
Hadoop is an open-source framework for writing and running distributed applications that process large-scale data. It is designed for offline, large-scale data analysis and is not suited to online transaction processing patterns that randomly read and write a few records at a time.
Hadoop = HDFS (the file system, i.e., the data-storage layer) + MapReduce (the processing layer). Hadoop's data source can take any form; it handles semi-structured and unstructured data better than a relational database does and offers more flexible processing power. Whatever its original form, the data is ultimately translated into key/value pairs; key/value is the basic data unit. Instead of SQL, Hadoop uses the functional-style MapReduce model: SQL expresses a query, while MapReduce uses scripts and code. For those accustomed to SQL on a relational database, Hadoop offers an open-source tool, Hive, as a substitute.
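As a hedged illustration of that key/value contract (using the classic word-count example and the standard org.apache.hadoop.mapreduce API, not anything specific to this article), the aggregation that SQL would write as `SELECT word, COUNT(*) ... GROUP BY word` becomes a pair of methods over key/value pairs:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count: the MapReduce counterpart of
//   SELECT word, COUNT(*) FROM lines GROUP BY word
public class WordCountSketch {

    // Input key/value: (byte offset of a line, the line's text).
    // Output key/value: (word, 1).
    public static class TokenizeMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }

    // All values sharing a key arrive together; summing them gives the count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable one : ones) {
                count += one.get();
            }
            context.write(word, new IntWritable(count));
        }
    }
}
```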
In short, Hadoop is a solution for distributed computing.
What can Hadoop do?
Hadoop excels at log analysis. Facebook uses Hive for log analysis; by 2009, 30% of the Facebook staff using HiveQL for data analysis were not programmers. The custom filters in Taobao search also use Hive. Pig can be used for more advanced data processing: Twitter and LinkedIn use it to discover people you may know, achieving recommendations similar to Amazon.com's collaborative filtering, and Taobao's product recommendations work the same way. At Yahoo!, 40% of Hadoop jobs are run with Pig, including spam identification and filtering and user feature modeling. (Update, August 25, 2012: Tmall's recommendation system uses Hive, with a few attempts at Mahout!)
The following example illustrates this:
Imagine this scenario: I have a 100 MB SQL file from a database backup, and I want to filter out what I need with a grep-style operation rather than importing it into a database, say, every record in a table that contains a certain keyword. There are two obvious ways to do this: use the Linux grep command directly, or write a program that reads the file and applies a regular expression to each line to collect the matches. Since it is only a 100 MB backup, either method handles it easily.
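For the first approach, a single invocation such as `grep 'some_keyword' backup.sql` does the job (file name and keyword are made up here). A minimal sketch of the second approach, a plain single-machine Java program that scans the file line by line with a regular expression, could look like this:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Single-machine version: scan a 100 MB SQL dump and print every line
// that matches the keyword. File name and keyword are hypothetical.
public class GrepLocal {
    public static void main(String[] args) throws IOException {
        Pattern keyword = Pattern.compile("some_keyword");
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("backup.sql"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (keyword.matcher(line).find()) {
                    System.out.println(line); // a matching record
                }
            }
        }
    }
}
```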
But what if the data is 1 GB, 1 TB, or even 1 PB? Do the two methods above still work? No; a single server's performance has its upper limit. So how do we get the results we want from such an oversized data file?
One answer is distributed computing. Its core idea is to use distributed algorithms to extend a program that runs on a single machine so that it runs in parallel on many machines, multiplying the data-processing capability. But this kind of distributed computing traditionally demands a great deal from programmers, and from the servers as well, so the cost becomes very high.
Hadoop was born to solve exactly this problem. Hadoop makes it easy to assemble a cluster of distributed nodes from many cheap Linux PCs. Programmers do not need to know distributed algorithms; they only implement the interface methods according to the MapReduce rules and leave the rest to Hadoop, which automatically distributes the computation to each node and then collects the results.
Take the example above: Hadoop first imports the 1 PB data file into HDFS. The programmer then defines map and reduce: each line's offset in the file becomes the key and the line's content becomes the value, the map applies a regular-expression match, and the successful matches are aggregated by reduce and returned. Hadoop distributes this program to n nodes to run in parallel.
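A minimal sketch of what that map/reduce definition might look like in Java, following the standard Hadoop MapReduce API (this is not the author's actual code; the class names and keyword are made up for illustration):

```java
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedGrep {

    // key = byte offset of the line in the file, value = the line's content
    public static class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private static final Pattern KEYWORD = Pattern.compile("some_keyword");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (KEYWORD.matcher(line.toString()).find()) {
                context.write(line, NullWritable.get()); // emit the matching line
            }
        }
    }

    // The reducer simply gathers the matches (duplicate lines collapse into one key).
    public static class GrepReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> ignored, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed grep");
        job.setJarByClass(DistributedGrep.class);
        job.setMapperClass(GrepMapper.class);
        job.setReducerClass(GrepReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input already sitting in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```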
A job that might otherwise take days to compute can then be shortened to a few hours, given enough nodes.
This is what is meant by big data and cloud computing. If that is still unclear, here is a simpler example.
Suppose we want to add up 100 million 1s. We know immediately that the result is 100 million, but a computer does not. A single computer would run a loop 100 million times, adding 1 to the result on each iteration. With distributed processing, I instead use 10,000 computers, each of which only needs to add up 10,000 1s; then one computer adds together the 10,000 partial results to get the final answer (see the sketch below).
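A toy sketch of that division of labor, simulating the 10,000 "computers" with tasks on one machine purely to illustrate the split-then-combine idea:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy illustration: add up 100,000,000 ones by splitting the work
// into 10,000 chunks of 10,000, then summing the partial results.
public class DistributedSumSketch {
    public static void main(String[] args) throws Exception {
        final int workers = 10_000;      // stands in for 10,000 computers
        final int onesPerWorker = 10_000;

        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Long>> partials = new ArrayList<>();

        for (int w = 0; w < workers; w++) {
            partials.add(pool.submit((Callable<Long>) () -> {
                long sum = 0;
                for (int i = 0; i < onesPerWorker; i++) {
                    sum += 1; // each "computer" adds up its own 10,000 ones
                }
                return sum;
            }));
        }

        long total = 0;
        for (Future<Long> partial : partials) {
            total += partial.get(); // one node combines the 10,000 partial results
        }
        pool.shutdown();

        System.out.println(total); // prints 100000000
    }
}
```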
In theory, the calculation becomes 10,000 times faster. It may not be a perfect example, but that is roughly what distributed computing, big data, and cloud computing are about.
What can Hadoop do for our company?
We have zero data foundation and zero data platform; our starting point is zero in every respect.
How to use Hadoop
In our company, Hadoop is also a research-and-development project: we want to go through the whole process using log analysis, because this phase does not require data-mining specialists. For the data-analysis phase the team already has a database engineer, for MapReduce we have Java development engineers, I can step in for the analysis myself, and visualization can be handled for now by front-end JavaScript. My original research plan for a big-data solution was Hadoop + R, but we know nothing about R, and since the company cannot yet commit many people, log analysis currently looks like the easiest way to produce results with a small team, so we chose this direction as a pilot.
Original link: http://blog.csdn.net/blue_it/article/details/26228227