The title of this article may look naive and entry-level, but in fact the managers who sit through big data presentations every day while remaining technical novices will never openly ask this question.
Big data is an overused buzzword, and any discussion of big data inevitably mentions Hadoop. Unfortunately, most of today's big data advocates, and even many professionals, cannot really explain what Hadoop is and what it does, which leaves their non-technical management audience even more confused.
As we all know, Hadoop is an open source software platform managed by the Apache Software Foundation, but what exactly is Hadoop? In short, Hadoop is a way to store massive amounts of data on a distributed server cluster and run distributed analytics applications.
Hadoop is designed to be a very "robust" system: even if a server, or a whole group of servers in the cluster, goes down, the big data analytics applications running on it are not interrupted. Hadoop is also highly efficient because it does not require you to shuttle massive volumes of data back and forth across the network.
The following is Apache's official definition:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of servers using simple programming models. It is designed to scale from a single server up to thousands of machines, each offering local computation and storage. Rather than relying on hardware for high availability, the library itself detects and handles hardware failures at the application layer, thereby delivering a highly available service on top of a cluster of servers.
If we look more closely, Hadoop has even more attractive qualities. First, Hadoop is almost completely modular, which means you can swap out nearly any of its modules for a different software tool. This makes the Hadoop architecture exceptionally flexible without sacrificing reliability or efficiency.
Hadoop Distributed File System (HDFS)
If your mind goes blank when someone mentions Hadoop, keep this in mind: Hadoop has two main parts, a data processing framework and a distributed file system for data storage (HDFS).
HDFS is like the basket of the Hadoop system: you place your data neatly inside it, where it waits for the data analysis "chef" to turn it into an appetizing feast on the CEO's desk. You can, of course, analyze the data directly in Hadoop, or you can extract-transform-load (ETL) the data out of Hadoop into other tools for analysis.
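To make the "basket" concrete, here is a minimal sketch of writing a file into HDFS and reading it back through Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:9000) and the path /data/raw/events.txt are placeholders for this example, not values from the article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasket {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode (host and port are placeholders).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        Path path = new Path("/data/raw/events.txt");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Write a small file into HDFS; its blocks are replicated across DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("user1,click,2019-01-01\n");
            }
            // Read it back; the client streams the data directly from the DataNodes.
            try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(fs.open(path)))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```

Note that nothing here resembles a query: HDFS only stores and retrieves bytes, which is exactly why a processing layer such as MapReduce is needed on top of it.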
Data processing framework and MapReduce
As the name implies, the data processing framework is the tool that actually works on the data. Specifically, Hadoop's data processing framework is the Java-based system MapReduce, and you hear about MapReduce more often than about HDFS for two reasons:
1. MapReduce is the tool that actually accomplishes the data processing tasks.
2. MapReduce tends to drive its users crazy.
In a conventional relational database, data is located and analyzed with SQL (Structured Query Language). Non-relational databases also use query statements, but they are not confined to SQL, hence the term NoSQL.
One point that is easy to get confused about is that Hadoop is not actually a database: it can store and retrieve data, but there is no query language involved. Hadoop is more of a data warehousing system, so it needs a system like MapReduce to do the actual data processing.
MapReduce runs a series of tasks, each of which is a separate Java application that goes into the data and pulls out useful information. Using MapReduce instead of a query language makes Hadoop data analytics more powerful and flexible, but it also significantly increases the technical complexity.
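To see what such a Java application looks like, here is a sketch of the classic word-count job written against Hadoop's MapReduce API: a mapper emits (word, 1) pairs, a reducer sums them per word, and a driver wires the two together. The class names and the command-line input/output paths are illustrative, not taken from the article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures the job and points it at HDFS input/output paths.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Compare this with a one-line SQL GROUP BY: even a trivial word count requires a complete Java program, which is exactly the extra complexity the paragraph above refers to.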
Many tools are available to make Hadoop easier to use, such as Hive, which translates query statements into MapReduce jobs. But the complexity and limitations of MapReduce (single-task batch processing) mean that Hadoop serves more as a data warehouse than as a data analysis tool. Related reading: Hadoop is just the poor man's ETL.
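As a rough illustration of how Hive hides MapReduce behind familiar SQL, the sketch below submits a HiveQL query over JDBC. The HiveServer2 address, the credentials, and the web_logs table are assumptions for this example; behind the scenes, Hive compiles the statement into MapReduce jobs that run over files in HDFS.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Explicit driver registration; newer JDBC drivers register themselves automatically.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 commonly listens on port 10000; host, database, and user are placeholders.
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // A SQL-like query; Hive turns it into MapReduce tasks over the underlying HDFS data.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```

The query looks like ordinary SQL, but each statement may spawn one or more batch MapReduce jobs, which is why interactive, ad hoc analysis on plain Hadoop tends to feel slow.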
Another distinctive feature of Hadoop is that all of its functions are distributed, in contrast to the centralized architecture of traditional databases.