Hadoop note (1)

We live in the age of big data: the volume of data stored electronically worldwide is enormous and grows every day. Some examples:

1. Facebook stores about 10 billion photos, taking up roughly 1 PB of storage.

2. The Internet Archive stores about 2 PB of data and is growing by at least 20 TB per month.

3. The Large Hadron Collider near Geneva, Switzerland, produces about 15 PB of data each year.

Analyzing such large volumes of data yields useful information, for example inferring the preferences of individual users from the web content they browse and identifying potential customers; there are many scientific and engineering uses as well.

How do we store and process all this data? At this scale the main problems are read/write throughput, data safety, and hardware failures, so a highly available solution is needed. Hadoop provides one: a reliable, shared storage and analysis system in which HDFS handles storage and MapReduce handles analysis and processing. A minimal example is sketched below.
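As a rough illustration of the division of labor just described, the classic word-count job below reads text stored in HDFS with MapReduce and counts how often each word appears. It is a minimal sketch, not part of the original note: the class names and the /input and /output HDFS paths are assumptions made up for this example.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in every line read from HDFS.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts emitted for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output locally
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/input"));      // assumed HDFS input path
            FileOutputFormat.setOutputPath(job, new Path("/output"));   // assumed HDFS output path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, such a job is submitted to the cluster, and the framework distributes the map and reduce tasks across the machines that hold the data.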



Relational databases and MapReduce:

A database system has the advantage when only a small fraction of records needs updating; when most of a dataset must be updated, it is much less efficient than MapReduce, which uses sort/merge to rebuild the dataset. MapReduce suits batch processing of an entire dataset, while an RDBMS suits point queries and updates: once the dataset is indexed, the database can provide low-latency retrieval and fast updates of small amounts of data. MapReduce fits applications that write data once and read it many times, whereas a relational database is better for datasets that are continually updated.

Another difference is the degree of structure in the datasets they operate on. Structured data is organized into entities with a defined format, such as XML documents or database tables. Semi-structured data is looser: although it may have a schema, the schema is often ignored and serves only as a general guide to the data's structure; a spreadsheet is an example, since its structure is a grid of cells, yet each cell can hold data of any form. Unstructured data has no particular internal structure, such as plain text. MapReduce works well on unstructured or semi-structured data because the data is interpreted only at processing time (see the sketch below). Relational data, by contrast, is normalized to preserve integrity and remove redundancy. Normalization poses a problem for MapReduce because it makes reading a record a nonlocal operation, while a core premise of MapReduce is that it can perform high-speed streaming reads and writes.
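To make "interpreted only at processing time" concrete, the hypothetical mapper below applies a schema to raw comma-separated log lines inside map() itself; the field layout (timestamp, userId, url) is an assumption invented for this sketch, not something defined when the data was loaded into HDFS.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: the input is plain text, and the "schema"
    // (timestamp,userId,url) is applied only here, at read time.
    public class PageViewMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");   // interpret the raw line now
            if (fields.length >= 3) {                         // silently skip malformed records
                url.set(fields[2]);                           // third field assumed to be the URL
                context.write(url, ONE);                      // count one page view per URL
            }
        }
    }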



Distributed Computing:

1. MapReduce tries to run computation on the nodes where the data is stored, so that data access is fast and local; this data locality gives good performance and reduces the network bandwidth used.

2. MapReduce frees programmers from having to handle partial system failure: the implementation itself detects failed map or reduce tasks and reschedules them on healthy machines, which is possible because the tasks are independent of one another.

Hadoop design goals:

Hadoop is designed to serve jobs that complete in minutes or hours, running in a single data center connected by a high-speed network, on trusted, dedicated hardware.



Common Hadoop-related projects:

MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.

HDFS: a distributed file system that runs on large clusters of commodity machines.

Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides an SQL-based query language for querying it.

HBase: a distributed, column-oriented database. HBase uses HDFS as its underlying storage and supports both MapReduce batch computation and point queries (see the sketch after this list).

ZooKeeper: a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks for building distributed applications.
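As a rough sketch of the point-query style that HBase supports, the snippet below reads a single row with the standard HBase Java client. The table name, row key, and column names are illustrative assumptions, not anything defined in this note.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PointQueryExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("pageviews"))) {  // assumed table
                Get get = new Get(Bytes.toBytes("user#42"));                            // assumed row key
                Result result = table.get(get);                                         // single-row point query
                byte[] views = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("views")); // assumed column
                System.out.println(views == null ? "row not found" : Bytes.toString(views));
            }
        }
    }

The same table can also serve as input or output for a MapReduce job, which is the batch side of the same storage.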


