Apache Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. It is an open-source implementation of similar Google technologies (such as GFS and MapReduce). Companies using Hadoop include Yahoo!, Facebook, Twitter, IBM, and others.
Why do we need such a system? When data exists in these quantities (terabytes or even petabytes per day), one of the main processing bottlenecks is that moving the data itself takes a significant amount of time. Apache Hadoop emerged to address this concern with its distinctive approach: move the work to the data, not the other way around. In other words, when the data volume is very large, Hadoop ships the processing code to where the data lives rather than shipping the data to the code.
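A rough back-of-the-envelope calculation makes this concrete (the 10 TB and 1 Gbps figures below are illustrative assumptions, not from the original text). Moving 10 TB across a single 1 Gbps link takes

$$ t = \frac{10 \times 10^{12}\ \text{bytes} \times 8\ \text{bits/byte}}{10^{9}\ \text{bits/s}} = 8 \times 10^{4}\ \text{s} \approx 22\ \text{hours}, $$

whereas shipping a few kilobytes of program code to the nodes that already store the data is effectively instantaneous.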
Hadoop includes the following sub-projects:
- Hadoop Common
: Common utilities and libraries that support the other Hadoop sub-projects
- Chukwa
: A data collection system for managing large distributed systems
- HBase
: A scalable, distributed database for structured data
- HDFS
: A distributed file system
- Hive
: A data warehouse infrastructure for data summarization and ad hoc queries
- MapReduce
: A software framework for distributed processing of large data sets (a minimal example follows this list)
- Pig
: A data-flow language and parallel execution framework
- ZooKeeper
: A high-performance coordination service for distributed applications
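As a minimal sketch of the MapReduce programming model, here is the classic word-count job in Java. The class names follow the conventional Hadoop example, and the input/output paths in the usage line are placeholders:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is typically submitted with something along the lines of `hadoop jar wordcount.jar WordCount /input /output`. The framework then runs map tasks on the nodes that already hold the input blocks, which is exactly the "move the work to the data" idea described above.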
[Figure: an overview of how Hadoop works]
There are also many Hadoop-based projects:
- Nutch: an open-source web search engine; clustered search engines can be built on top of it.
- Mahout: a scalable machine-learning library; social recommendation engines can be built on Apache Mahout.
- Cassandra: a distributed key-value storage system.
If you want to learn more about Hadoop, Hadoop: The Definitive Guide (also available in Chinese translation) should be your first choice. It is written by a Hadoop developer, so it is quite authoritative.