Hadoop: The Definitive Guide, Chapter 1: Meet Hadoop

Tags: hadoop, ecosystem

Meet Hadoop

1.1 Data!

Most of the data is locked up in the largest web properties (like search engines), or in scientific or financial institutions, isn't it? Does the advent of "big data," as it is being called, affect smaller organizations or individuals?

Although the bulk of this data sits on the web or is held by large research institutions rather than by ordinary people, the mining of big data is no longer the concern of those organizations alone.

From an individual's perspective, too, reading and filtering data consumes ever more time as data volumes continue to grow.

1.2 Data Storage and Analysis

Although the read speed of hard disks and other storage media keeps increasing, it has not kept pace with the growth in data volume, so retrieving and filtering a full dataset still takes a long time.

This is a long time to read all the data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.

In other words, the bottleneck is the single drive, and the obvious remedy is to spread the data across multiple drives and read them in parallel. The trade-off is lower hardware utilization: each drive holds only a fraction of any one dataset, though the spare capacity can be shared among many datasets.
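To make the arithmetic concrete, here is a minimal sketch in Java, assuming a 1 TB dataset and a sustained transfer rate of roughly 100 MB/s per drive (illustrative figures, not measurements):

// Illustrative arithmetic only; dataset size and transfer rate are assumed.
public class ReadTime {
    public static void main(String[] args) {
        double datasetMb = 1_000_000.0; // 1 TB expressed in MB
        double mbPerSec = 100.0;        // sustained rate of one drive
        double oneDrive = datasetMb / mbPerSec; // ~10,000 s (~2.8 hours)
        double hundredDrives = oneDrive / 100;  // ~100 s (under 2 minutes)
        System.out.printf("1 drive: %.0f s, 100 drives: %.0f s%n",
                oneDrive, hundredDrives);
    }
}

Under these assumptions a single drive needs almost three hours, while 100 drives working in parallel finish in well under two minutes, matching the figure quoted above.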

Reading data from multiple drives in parallel also introduces problems of its own:

1. Hardware failure: with many drives, the chance that one of them fails (and a read fails with it) goes up. To guard against data loss, the system keeps redundant copies of the data, so that in the event of a failure another copy is still available. This is the replication approach to data backup.

2. Correctly combining the data read from different drives is also a big challenge. This is the problem that leads to MapReduce (see the sketch after this list).
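To illustrate, here is the classic word-count job written against the Hadoop MapReduce Java API: the map phase reads input splits in parallel (potentially from many drives), and the reduce phase does the integration, combining per-split results into final counts. This is a minimal sketch; the class name and any paths are placeholders:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in an input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework has already grouped values by key,
    // so summing them completes the "integration" step.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would be submitted with something like hadoop jar wordcount.jar WordCount /input /output, where the input and output paths are hypothetical HDFS locations.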

1.3 Comparison with Other Systems

MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.

RDBMS (Relational Database Management System)

Grid Computing

Grid computing is a form of distributed computing that has emerged in recent years. Distributed computing allows two or more pieces of software to share information with each other; these pieces of software may run on the same computer or on multiple computers connected by a network.

Volunteer Computing

Volunteer computing is a model in which ordinary people around the world volunteer spare PC time over the Internet to take part in scientific computing or data analysis. It offers an effective answer to the large scale and heavy resource demands of basic scientific computing: for scientists it means nearly free and virtually unlimited computing resources, while volunteers gain a chance to understand and participate in science, promoting public understanding of it.

1.4 A Brief History of Hadoop

Apache Lucene: Hadoop was created by Doug Cutting, the creator of the widely used text search library Apache Lucene, and grew out of Apache Nutch, an open source web search engine that was itself part of the Lucene project.

1.5 Apache Hadoop and the Hadoop Ecosystem

Common: a set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro: a serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: a distributed filesystem that runs on large clusters of commodity machines.
Pig: a dataflow language and execution environment for processing very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data.
HBase: a distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop: a tool for efficiently moving data between relational databases and HDFS.
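To give a feel for how these components are used from code, here is a minimal sketch that writes and then reads a file through the HDFS Java API (the Common and HDFS pieces above). The fs.defaultFS address and the /tmp/hello.txt path are assumptions for illustration; in practice the address comes from the cluster's core-site.xml:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt"); // hypothetical path
        // Write a small file (overwrite if it already exists).
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        // Read the first line back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}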

1.6 Hadoop Releases
