1. Background:
People and machines are generating data faster than ever, so a new way to process data is needed.
Drive capacity keeps growing, but drive throughput does not keep pace; the solution is to split the data across multiple hard disks and read from them in parallel.
Problems this raises:
Hardware failure -- solved by replicating the data (as RAID does)
Combining data that must be read from different disks for analysis -- solved by MapReduce
Hadoop provides:
1) Reliable shared storage (distributed storage)
2) An abstract analysis interface (distributed analysis)
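The map/reduce analysis model mentioned above can be sketched in plain Python. This is a minimal single-machine simulation: the distributed shuffle and sort between the two phases is replaced by a local sort, and all function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word.
    Sorting stands in for the shuffle/sort step of a real MapReduce run."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big hadoop", "hadoop big"]
result = dict(reduce_phase(map_phase(lines)))
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

In real Hadoop the map and reduce functions run on many nodes at once, but the contract is the same: map emits key-value pairs, the framework groups them by key, and reduce aggregates each group.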
2. Big Data
--can be understood as data that cannot be processed using a single machine
The core idea of big data is sample = population: with enough capacity you can analyze all the data instead of a sample of it.
Characteristics: volume, velocity, variety, veracity, complexity
Key technologies:
1) Data distributed across multiple machines
--Reliability: each data block is replicated to multiple nodes
--Performance: multiple nodes process the data at the same time
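The replication idea above can be sketched with a simple placement function. This assumes a round-robin policy for illustration (real HDFS placement is rack-aware); the 3-way replication matches HDFS's default.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin,
    so losing any single node never loses a block."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + i) % len(nodes)] for i in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(num_blocks=4, nodes=nodes)
# Every block is held by 3 distinct nodes, so any single node can fail:
for block, replicas in placement.items():
    assert len(set(replicas)) == 3
```

Replication serves both goals at once: any replica can be read if a node fails (reliability), and different replicas can be read in parallel (performance).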
2) Move computation to the data
Network IO speed << local disk speed, so a big data system tries to assign each task to the machine closest to its data.
(When a job runs, the program and its dependencies are copied to the machines where the data resides.)
Migrating code to the data avoids the large-scale data transfers that moving data to the code would cause; as far as possible, each piece of data is processed on the machine that stores it.
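The locality preference can be sketched as a scheduling rule. This is a simplification (the real YARN scheduler also weighs rack locality, queues, and resource capacities), and all names here are illustrative.

```python
def schedule(block, block_locations, free_nodes):
    """Prefer a free node that already stores the block (data-local);
    otherwise fall back to any free node and pay the network transfer."""
    local = [n for n in block_locations[block] if n in free_nodes]
    if local:
        return local[0], "local"
    return sorted(free_nodes)[0], "remote"

block_locations = {"blk_1": ["node1", "node2"]}
print(schedule("blk_1", block_locations, {"node2", "node3"}))  # ('node2', 'local')
print(schedule("blk_1", block_locations, {"node3"}))           # ('node3', 'remote')
```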
3) Sequential IO instead of random IO
With large sequential reads, seek time << transfer time, so seeks are amortized away; in general, data is not modified after it is written.
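The effect of sequential versus random IO can be shown with a rough cost model. The 10 ms seek time and 100 MB/s bandwidth are assumed, illustrative figures, not measurements.

```python
def read_time_ms(total_mb, chunk_mb, seek_ms=10.0, bandwidth_mb_per_s=100.0):
    """Total time to read `total_mb` in chunks of `chunk_mb`:
    one seek per chunk plus the raw transfer time."""
    seeks = total_mb / chunk_mb
    transfer_ms = total_mb / bandwidth_mb_per_s * 1000
    return seeks * seek_ms + transfer_ms

# Reading 1 GB in 128 MB blocks (HDFS-style) vs. 4 KB random pages:
print(read_time_ms(1024, 128))    # 8 seeks: transfer time dominates (~10.3 s)
print(read_time_ms(1024, 0.004))  # 256,000 seeks: seek time dominates (~43 min)
```

This is why HDFS uses large blocks (128 MB by default): the time spent seeking becomes negligible next to the time spent actually transferring data.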
** Big data mainly means more data, so it is stored across multiple machines; this raises the problems of how to store the data, how to keep it safe, and how to compute over it with good performance.
3. Hadoop
Hadoop offers high fault tolerance, high reliability, and high scalability, and is especially suited to write-once, read-many scenarios.
Suitable for:
Large-scale data
Streaming access (write once, read many times)
Commodity hardware
Not suitable for:
Low-latency data access
A large number of small files
Frequently modified files (files are basically written once)
4. Hadoop architecture
HDFS: Distributed File storage
YARN: Distributed resource management
MapReduce: Distributed Computing
Others: other data processing frameworks built on top of YARN's resource management
All internal components basically use a master-worker architecture
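The master-worker pattern can be sketched as a toy dispatcher, assuming a single-threaded master handing tasks to worker functions. Real masters such as the HDFS NameNode and the YARN ResourceManager also track worker heartbeats and reschedule failed tasks.

```python
from queue import Queue

def run_master(tasks, workers):
    """Master puts tasks on a queue; workers pull and process them in turn."""
    q = Queue()
    for t in tasks:
        q.put(t)
    results = {}
    while not q.empty():
        for w in workers:
            if q.empty():
                break
            task = q.get()
            results[task] = w(task)  # a worker processes one task
    return results

workers = [lambda x: x * 2, lambda x: x * 2]
print(run_master([1, 2, 3], workers))  # {1: 2, 2: 4, 3: 6}
```

The key property is that the master only coordinates; all actual work (storing blocks, running map/reduce tasks) happens on the workers, which is what lets the system scale out.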