Apache Hadoop has become the driving force behind the big data industry. Technologies such as Hive and Pig come up constantly, but what do they actually do, and why do so many of them have strange names (Oozie, ZooKeeper, Flume)?
Hadoop makes it affordable to process big data (typically 10-100 GB or more, spanning a variety of data types, both structured and unstructured). But how is this different from what came before?
Today's enterprise data warehouses and relational databases are adept at processing structured data and can store very large volumes of it, but at considerable cost. Their rigid requirements on the data limit the kinds of data they can handle, and that inertia makes the data warehouse slow to adapt when confronted with massive amounts of heterogeneous data. In practice, this means valuable data sources inside the organization are never mined at all. This is the biggest difference between Hadoop and traditional data processing methods.
This article describes the components of the Hadoop system and explains the capabilities of each component.
The Hadoop ecosystem contains more than ten components or sub-projects, which brings challenges in installation, configuration, cluster-scale deployment, and management.
The main Hadoop components include:
Hadoop: A software framework written in Java that supports data-intensive distributed applications
ZooKeeper: A highly reliable distributed coordination service
MapReduce: A flexible parallel data processing framework for big data (a minimal sketch follows this component list)
HDFS: Hadoop Distributed File System
Oozie: A workflow scheduler responsible for MapReduce job scheduling
HBase: A distributed key-value store built on top of HDFS
Hive: A data warehouse package built on top of MapReduce
Pig: A higher-level data processing layer built on top of Hadoop; its Pig Latin language gives programmers a more intuitive way to specify data flows
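To make the division of labor between HDFS and MapReduce concrete, here is a minimal sketch of the classic word-count job written against the Hadoop Java MapReduce API. The class name and the input/output paths (passed as command-line arguments) are assumptions for illustration, not part of any vendor's packaged distribution.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in an input line read from HDFS.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output HDFS paths are supplied on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Once packaged into a JAR, a job like this is launched with the `hadoop jar` command; the framework splits the input files stored on HDFS, schedules map and reduce tasks across the cluster, and retries failed tasks automatically.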
Applications and Typical Characteristics of the Hadoop MapReduce Approach
- Huge volumes of data
- Little or no dependency between data records
- A mix of structured and unstructured data
- Suited to large-scale parallel processing
Application Use Cases
- Batch analysis fast enough to meet business and reporting needs, such as website traffic and product recommendation analysis.
- Iterative analysis using data mining and machine learning algorithms, such as association rule mining, K-means clustering, link analysis, classification, and the well-known Bayes algorithm.
- Statistical analysis and refinement, such as web log analysis and general data analysis.
- Behavioral analysis, such as clickstream analysis and user video-viewing behavior.
- Transformation and enrichment, such as processing social media data, ETL, data normalization, and more (see the HDFS loading sketch after this list).
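Before any of these analyses can run, the raw data (web server logs, for example) first has to land on HDFS. Below is a minimal sketch of that loading step using the HDFS Java FileSystem API; the class name, the NameNode address, and the local and HDFS paths are illustrative assumptions, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadLogsIntoHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address for illustration; in practice this comes
    // from fs.defaultFS in core-site.xml on the cluster.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    FileSystem fs = FileSystem.get(conf);

    // Hypothetical local log file and HDFS target directory.
    Path localLogs = new Path("/var/log/webserver/access.log");
    Path hdfsDir = new Path("/data/weblogs/raw");

    fs.mkdirs(hdfsDir);
    // Copy the local file into HDFS; it is then visible to MapReduce,
    // Hive, or Pig jobs running anywhere on the cluster.
    fs.copyFromLocalFile(localLogs, hdfsDir);

    // List what landed, as a quick sanity check.
    for (FileStatus status : fs.listStatus(hdfsDir)) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}
```

In production, continuous log collection is usually handled by a tool such as Flume rather than hand-written copies, but the FileSystem API shows how simple the contract with HDFS is.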
Typically, Hadoop is deployed in distributed environments. As happened earlier with Linux, vendors integrate and test the components of the Apache Hadoop ecosystem and add their own tools and management capabilities.
Hadoop-Related Components and Their Relationships