What is Hadoop, and what can it do, in plain words?

What Hadoop is.
(1) Hadoop is an open-source framework for writing and running distributed applications that process large-scale data. It is designed for offline, large-scale data analysis and is not suited to online transaction processing patterns that randomly read and write a few records at a time. Hadoop = HDFS (the file system, i.e. the data-storage side) + MapReduce (the processing side). Hadoop can take data in any form, and it handles semi-structured and unstructured data better than a relational database, with more flexible processing. Whatever form the data starts in, it is ultimately translated into key/value pairs; key/value is the basic data unit. Functional-style MapReduce is used instead of SQL: SQL is a declarative query language, while MapReduce uses scripts and code. For those who are used to SQL on a relational database, Hadoop offers the open-source tool Hive as a substitute.
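As a rough illustration of the key/value idea (a plain-Java sketch, not Hadoop code), the classic word count can be written as a map step that emits (word, 1) pairs and a reduce step that sums the values for each key, which plays the role of SQL's GROUP BY with COUNT:

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the key/value idea: "map" turns every line into (word, 1) pairs,
// "reduce" sums the values for each key, like SQL's GROUP BY word / COUNT(*).
public class KeyValueSketch {
    public static void main(String[] args) {
        String[] lines = {
            "hadoop stores data in hdfs",
            "hadoop processes data with mapreduce"
        };

        Map<String, Long> wordCounts = Arrays.stream(lines)
            .flatMap(line -> Arrays.stream(line.split("\\s+")))                  // map: emit words
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));      // reduce: count per key

        wordCounts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}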

(2) Hadoop is a solution for distributed computing.


What Hadoop can do.

Hadoop excels at log analysis. Facebook uses Hive for log analysis; in 2009, 30% of the non-programmers at Facebook used HiveQL for data analysis. The custom filters in Taobao search also use Hive. Pig can be used for more advanced data processing: Twitter and LinkedIn use it for "people you may know" features, and it can produce recommendations similar to Amazon's collaborative filtering; Taobao's product recommendations work the same way. At Yahoo!, 40% of Hadoop jobs are run with Pig, including spam identification and filtering and user feature modeling. (Update, August 25, 2012: Tmall's recommendation system uses Hive, with a few experiments in Mahout.)

The following example illustrates this:

Imagine the following scenario. I have a 100 MB SQL dump of a database, and without importing it into a database I want to filter out what I need, much like running grep over it: for example, counting how many records in a table contain a given keyword. One way is to use the Linux grep command directly; another is to write a program that reads the file and applies a regular expression to each line to collect the results (see the sketch below). With a 100 MB dump, either approach handles the job easily.
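For the second approach, a minimal single-machine sketch in Java could look like this (the file name backup.sql and the keyword are made-up placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Read the dump line by line and count the lines that match a keyword.
public class SingleMachineGrep {
    public static void main(String[] args) throws IOException {
        Pattern keyword = Pattern.compile("some_keyword");                   // placeholder pattern
        try (Stream<String> lines = Files.lines(Paths.get("backup.sql"))) {  // placeholder file
            long matches = lines.filter(l -> keyword.matcher(l).find()).count();
            System.out.println("matching records: " + matches);
        }
    }
}

This works fine at 100 MB; the point of the next paragraphs is that it stops working at much larger sizes.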
But what if the data is 1 GB, 1 TB, or even 1 PB? Do the two approaches above still make sense? The answer is no; a single server's performance has its upper limit. So how do we get the results we want out of such an oversized data file?
One approach is distributed computing. The core of distributed computing is to use distributed algorithms to extend a program that runs on one machine so that it runs in parallel on many machines, multiplying the data-processing capacity. But this kind of distributed computing has traditionally placed heavy demands on programmers, and on the servers as well, so the cost becomes very high.
Hadoop was born to solve exactly this problem. Hadoop makes it easy to assemble many cheap Linux PCs into a cluster of distributed nodes. Programmers do not need to understand distributed algorithms; they only need to implement the interface methods required by the MapReduce model, and Hadoop takes care of the rest: it automatically distributes the computation to each node and then collects the results.
Take the example above. Hadoop first imports the 1 PB data file into HDFS. The programmer then defines map and reduce: each line's offset in the file is the key and the line's content is the value; map applies the regular expression, and successful matches are passed on to reduce, which aggregates them and returns the result. Hadoop distributes this program to n nodes to run in parallel.
Given enough nodes, a job that might otherwise take days to compute can be cut down to a few hours.
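A sketch of what such a job could look like using Hadoop's Java MapReduce API is below. The keyword and the input/output paths are placeholders rather than the exact program described above; a real job would normally take the pattern as a configuration parameter instead of hard-coding it.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map: for every input line (value) that matches the keyword, emit (keyword, 1).
// Reduce: sum the 1s, i.e. count the matching records across the whole data set.
public class GrepJob {

    public static class GrepMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Pattern KEYWORD = Pattern.compile("some_keyword"); // placeholder
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text("some_keyword");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (KEYWORD.matcher(line.toString()).find()) {
                context.write(outKey, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "grep");
        job.setJarByClass(GrepJob.class);
        job.setMapperClass(GrepMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same two small methods, map and reduce, run unchanged whether the input is 100 MB on one node or 1 PB spread across hundreds of nodes; that is the work Hadoop takes off the programmer's hands.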


This is what people call big data and cloud computing. If that is still unclear, here is a simple example.
Suppose we want to add up 100 million 1s. We know immediately that the result is 100 million, but a computer does not. A single computer would handle this by looping 100 million times, adding 1 to the result on each pass.
With distributed processing, we instead use 10,000 computers, each of which only has to add up 10,000 1s; then one computer adds together the results from the 10,000 computers to get the final answer.
In theory, the calculation is 10,000 times faster. This may not be a perfect example, but that is roughly the idea behind distributed computing, big data, and cloud computing.
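As a toy sketch of that partition-and-combine idea on a single machine (parallel tasks standing in for the 10,000 computers):

import java.util.stream.LongStream;

// Split 100 million 1s into 10,000 chunks of 10,000, sum each chunk
// independently in parallel, then add the partial sums in one final step.
public class PartitionedSum {
    public static void main(String[] args) {
        long total = LongStream.range(0, 10_000)          // 10,000 "workers"
            .parallel()
            .map(worker -> {
                long partial = 0;
                for (int i = 0; i < 10_000; i++) {        // each worker adds 10,000 ones
                    partial += 1;
                }
                return partial;
            })
            .sum();                                       // combine the partial results
        System.out.println(total);                        // prints 100000000
    }
}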


What Hadoop can do for our company.
No data foundation, no data platform; everything starts from zero.


Log processing, user segmentation, feature modeling, personalized advertising and recommendations, smart-device recommendations: all of it with increasing the business value of the enterprise as the core purpose and the ultimate goal.
How to use Hadoop
The use of Hadoop in my division is still a research and development project, and analyzing logs with it has to go through a process. At this stage we do not need data-mining professionals: for the data-analysis phase the team already has database engineers, for MapReduce we have Java development engineers, the analysis itself involves me directly, and visualization can for now be done with front-end JavaScript. My original research plan for a big-data solution was Hadoop + R, but we have no R expertise at all. Since the company has not committed a large number of people, log analysis currently looks like the easiest way to produce results, and a small team can still achieve something with it, so I chose this direction as a pilot.
