Meet Hadoop
1.1 Data!
Isn't most data locked up in the largest web properties (like search engines), or in scientific and financial institutions? Does the advent of "big data," as it is being called, affect smaller organizations or individuals?
It does. Although most data today is held by large web companies and research institutions, big-data techniques matter beyond them: from an individual's perspective, reading and filtering data consumes more and more time as data volumes continue to expand.
1.2 Data Storage and Analysis
Although the storage capacity and read speed of hard drives continue to increase, access speeds have not kept pace with the growth in data volume. Reading all the data on a single drive takes a long time, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once: imagine we had 100 drives, each holding one hundredth of the data; working in parallel, we could read all of it in under two minutes. The trade-off is that each drive's capacity is underused, although more datasets can then share the same set of drives.
Reading data from multiple drives in parallel raises two problems:
1. Hardware failure: with many drives in use, the chance that one of them fails is high. To avoid losing data, the system keeps redundant copies, so that in the event of a failure another copy is still available (data replication).
2. Correctly combining the data read from different drives is also a big challenge. This is what leads to MapReduce.
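The speed-up from reading in parallel can be checked with a back-of-envelope calculation. The figures below are assumptions chosen for illustration (a 1 TB dataset and a sustained transfer rate of about 100 MB/s per drive), not numbers from this text:

```python
def read_time_seconds(dataset_bytes, drive_rate_bps, num_drives=1):
    """Time to scan a dataset striped evenly across num_drives drives,
    all reading in parallel."""
    return dataset_bytes / num_drives / drive_rate_bps

TB = 10**12
MB = 10**6

# One drive holding the whole dataset vs. 100 drives each holding 1/100th.
single = read_time_seconds(1 * TB, 100 * MB)
parallel = read_time_seconds(1 * TB, 100 * MB, num_drives=100)

print(f"single drive: {single / 3600:.1f} hours")    # ~2.8 hours
print(f"100 drives:   {parallel / 60:.1f} minutes")  # ~1.7 minutes
```

Under these assumptions a full scan drops from hours to under two minutes, which is the motivation for distributing data across many disks in the first place.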
1.3 Comparison with Other Systems
MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.
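The MapReduce programming model itself is simple enough to sketch in a few lines. The following is a minimal in-process illustration in plain Python, not the Hadoop Java API: a map function emits (key, value) pairs, the framework groups values by key (the "shuffle"), and a reduce function folds each group. The word-count example is the model's canonical demonstration:

```python
from collections import defaultdict

def map_word_count(line):
    # Map phase: emit (word, 1) for every word in one line of input.
    for word in line.split():
        yield word.lower(), 1

def reduce_word_count(word, counts):
    # Reduce phase: sum the occurrence counts for a single word.
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    # Shuffle: group all mapped values by key before reducing.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_mapreduce(lines, map_word_count, reduce_word_count))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In Hadoop the same mapper and reducer logic runs distributed across a cluster, with the framework handling the shuffle, scheduling, and failure recovery.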
RDBMS (Relational Database Management System)
Grid computing
Grid computing is a form of distributed computing: two or more software components cooperate and share information, running either on the same computer or on multiple computers connected by a network.
Volunteer computing
Volunteer computing lets ordinary people around the world donate idle PC time over the Internet to scientific computing or data-analysis projects. It offers an effective answer to the large scale and heavy resource demands of basic scientific computing: for scientists it means nearly free and almost unlimited computing resources, while volunteers gain an opportunity to understand and take part in science, promoting public understanding of it.
1.4 A Brief History of Hadoop
Apache Lucene
1.5 Apache Hadoop and the Hadoop Ecosystem
Common: a set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro: a serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: a distributed filesystem that runs on large clusters of commodity machines.
Pig: a data flow language and execution environment for processing very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine into MapReduce jobs) for querying the data.
HBase: a distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop: a tool for efficiently moving data between relational databases and HDFS.
1.6 Hadoop Releases