Spark Cultivation Path (Advanced) -- Spark from Getting Started to Mastery: Section II, Introduction to the Hadoop and Spark Ecosystems

Source: Internet
Author: User
Tags: Apache Mesos, Hadoop MapReduce, Hadoop ecosystem, Sqoop, Spark RDD, Spark MLlib

The main contents of this section:
    1. The Hadoop ecosystem
    2. The Spark ecosystem
1. Hadoop Ecosystem

Original address: http://os.51cto.com/art/201508/487936_all.htm#rd?sukey=a805c0b270074a064cd1c1c9a73c1dcc953928bfe4a56cc94d6f67793fa02b3b983df6df92dc418df5a1083411b53325
The key products in the Hadoop ecosystem are shown in the figure below:

Image source: http://www.36dsj.com/archives/26942

The following is a brief introduction to each of these products.

1 Hadoop

Apache's Hadoop project has almost become synonymous with big data. It has grown into a complete ecosystem with numerous open source tools for highly scalable distributed computing.

Supported operating systems: Windows, Linux, and OS X.

RELATED Links: http://hadoop.apache.org

2 Ambari

As part of the Hadoop ecosystem, this Apache project provides a web-based, intuitive interface for configuring, managing, and monitoring Hadoop clusters. For developers who want to integrate Ambari functionality into their own applications, Ambari provides a REST (Representational State Transfer) API.

Supported operating systems: Windows, Linux, and OS X.

RELATED Links: http://ambari.apache.org

3 Avro

This Apache project provides a data serialization system with rich data structures and a compact format. Schemas are defined in JSON and integrate easily with dynamic languages.

Supported operating systems: Operating system-independent.

RELATED Links: http://avro.apache.org

4 Cascading

Cascading is an application development platform for Hadoop. Commercial support and training services are available.

Supported operating systems: Operating system-independent.

RELATED Links: http://www.cascading.org/projects/cascading/

5 Chukwa

Chukwa, built on Hadoop, collects data from large distributed systems for monitoring purposes. It also includes tools for analyzing and displaying that data.

Supported operating systems: Linux and OS X.

RELATED Links: http://chukwa.apache.org

6 Flume

Flume can collect log data from other applications and then deliver that data into Hadoop. The official website claims: "It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms."

Supported operating systems: Linux and OS X.

RELATED Links: https://cwiki.apache.org/confluence/display/FLUME/Home

7 HBase

HBase is a distributed database designed for large tables with billions of rows and millions of columns; it allows random, real-time read/write access to big data. It is somewhat similar to Google's Bigtable, but built on top of Hadoop and the Hadoop Distributed File System (HDFS).

Supported operating systems: Operating system-independent.

RELATED Links: http://hbase.apache.org

8 Hadoop Distributed File System (HDFS)

HDFS is the file system of Hadoop, but it can also be used as a standalone distributed file system. It is Java-based, fault tolerant, highly scalable, and highly configurable.

Supported operating systems: Windows, Linux, and OS X.

RELATED Links: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

9 Hive

Apache Hive is the data warehouse of the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, a SQL-like language.

Supported operating systems: Operating system-independent.

RELATED Links: http://hive.apache.org
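
Since this series is about Spark, one convenient way to try HiveQL is through Spark's HiveContext (the Spark 1.x API). The sketch below is illustrative only: the table name `src` and its schema are made up, the spark-hive module must be on the classpath, and a local metastore is created if no existing Hive deployment is configured.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveQLFromSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("HiveQLFromSpark").setMaster("local[*]"))
    // HiveContext understands HiveQL; without a configured Hive
    // installation it falls back to a local metastore.
    val hiveContext = new HiveContext(sc)

    // The table `src` and its schema are invented for illustration.
    hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.sql("SELECT key, value FROM src LIMIT 10")
      .collect().foreach(println)

    sc.stop()
  }
}
```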

10 Hivemall

Hivemall combines a variety of machine learning algorithms for Hive. It includes many highly scalable algorithms usable for data classification, regression, recommendation, k-nearest neighbor, anomaly detection, and feature hashing.

Supported operating systems: Operating system-independent.

RELATED Links: https://github.com/myui/hivemall

11 Mahout

According to the official website, the Mahout project is designed to "create an environment for rapidly building scalable, high-performance machine learning applications." It includes many algorithms for data mining on Hadoop MapReduce, as well as some novel algorithms for the Scala and Spark environments.

Supported operating systems: Operating system-independent.

RELATED Links: http://mahout.apache.org

12 MapReduce

As an integral part of Hadoop, the MapReduce programming model provides a way to process large distributed datasets. It was originally developed by Google, and is now also used by several other big data tools, including CouchDB, MongoDB, and Riak.

Supported operating systems: Operating system-independent.

RELATED Links: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
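
To make the programming model concrete, here is a toy sketch of the map, shuffle, and reduce phases using plain Scala collections rather than the Hadoop API; the temperature records are invented for illustration.

```scala
// A toy illustration of the MapReduce model (NOT the Hadoop API):
// find the maximum temperature recorded per year.
object MapReduceModel {
  def main(args: Array[String]): Unit = {
    // Invented "year,temperature" records.
    val records = Seq("1949,111", "1949,78", "1950,0", "1950,22")

    // Map phase: each record becomes a (key, value) pair.
    val mapped = records.map { line =>
      val Array(year, temp) = line.split(",")
      (year, temp.toInt)
    }

    // Shuffle phase: pairs are grouped by key.
    val shuffled = mapped.groupBy(_._1)

    // Reduce phase: each key's values collapse to one result.
    val reduced = shuffled.map { case (year, pairs) =>
      (year, pairs.map(_._2).max)
    }

    reduced.foreach(println) // e.g. (1949,111) and (1950,22)
  }
}
```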

13 Oozie

This workflow scheduling tool is specifically designed to manage Hadoop tasks. It can trigger tasks by time or according to data availability, and integrates with MapReduce, Pig, Hive, Sqoop, and many other related tools.

Supported operating systems: Linux and OS X.

RELATED Links: http://oozie.apache.org

14 Pig

Apache Pig is a platform for distributed big data analytics. It relies on a programming language called Pig Latin, which offers the advantages of simplified parallel programming, optimization, and extensibility.

Supported operating systems: Operating system-independent.

RELATED Links: http://pig.apache.org

15 Sqoop

Organizations often need to transfer data between relational databases and Hadoop, and Sqoop is a tool for exactly that task. It can import data into Hive or HBase, and export data from Hadoop to a relational database management system (RDBMS).

Supported operating systems: Operating system-independent.

RELATED Links: http://sqoop.apache.org

16 Spark

As an alternative to MapReduce, Spark is a data processing engine. It is claimed to be up to 100 times faster than MapReduce when running in memory, and up to 10 times faster when running on disk. It can be used together with Hadoop and Apache Mesos, or run standalone.

Supported operating systems: Windows, Linux, and OS X.

RELATED Links: http://spark.apache.org
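
The in-memory speedup comes largely from caching: an RDD can be kept in memory and reused across computations instead of being re-read from disk each time. A minimal sketch (the path "data.txt" is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of Spark's in-memory reuse: an RDD is cached once
// and then queried repeatedly without rescanning the input.
object InMemoryDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InMemoryDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // "data.txt" is a placeholder; replace with a real file or HDFS URI.
    val lines = sc.textFile("data.txt").cache() // keep the RDD in memory

    // Both actions reuse the cached data instead of re-reading disk,
    // which is where much of the speedup over MapReduce comes from.
    val errors = lines.filter(_.contains("ERROR")).count()
    val warnings = lines.filter(_.contains("WARN")).count()
    println(s"errors=$errors, warnings=$warnings")

    sc.stop()
  }
}
```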

17 Tez

Tez, built on Apache Hadoop YARN, is "an application framework which allows for a complex directed acyclic graph (DAG) of tasks for processing data." It allows Hive and Pig to simplify complex tasks that would otherwise require multiple steps.

Supported operating systems: Windows, Linux, and OS X.

RELATED Links: http://tez.apache.org

18 ZooKeeper

This big data management tool describes itself as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." It allows the nodes within a Hadoop cluster to coordinate with each other.

Supported operating systems: Linux, Windows (for development environments only) and OS X (for development environments only).

RELATED Links: http://zookeeper.apache.org

2. Spark Ecosystem

Hadoop counts Spark as part of its ecosystem, but Spark can run entirely off the Hadoop platform: it is not tied to HDFS and YARN, and can use Standalone or Mesos, for example, for cluster resource management. This inclusiveness has won Spark a large body of open source contributors and users, and its ecosystem is thriving. The official Spark components are introduced below.

1 Spark SQL and DataFrame
Spark SQL is used for processing structured data; it provides the DataFrame abstraction as a distributed data query engine, and a big data warehouse can be built on top of this component. A DataFrame is a distributed dataset, conceptually similar to a table in a traditional database, in which the data is organized into named columns. DataFrame data sources can be structured data files, Hive tables, external databases, or existing RDDs.
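
A minimal sketch against the Spark 1.x SQLContext API described here; the sample rows and column names are invented:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DataFrameSketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A DataFrame built from an existing RDD; it could equally come from
    // a JSON/Parquet file, a Hive table, or an external database.
    val people = sc.parallelize(Seq(("Alice", 34), ("Bob", 19)))
      .toDF("name", "age")

    // Query through the DataFrame API...
    people.filter(people("age") > 21).show()

    // ...or through SQL against a registered temporary table (Spark 1.x).
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()

    sc.stop()
  }
}
```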

2 Spark Streaming

Spark Streaming is used for real-time stream processing; it is highly scalable, offers high throughput, and has a fault-tolerance mechanism. Data sources can be Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets. Its operation is based on the Discretized Stream (DStream), which can be viewed as a sequence of ordered RDDs, so real-time data processing can be completed simply through operations such as map, reduce, join, and window. Another very important point is that Spark Streaming can be used in combination with Spark MLlib, GraphX, and other components, making it powerful and seemingly omnipotent.
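
A minimal DStream sketch: word counts over 5-second batches from a TCP source, assuming a text server is listening on localhost:9999 (for example, started with `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one receives data, one processes it.
    val conf = new SparkConf()
      .setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches

    // Assumes a text source on localhost:9999 (e.g. `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print() // print each batch's word counts

    ssc.start()
    ssc.awaitTermination()
  }
}
```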

3 Spark Machine Learning
Spark integrates the MLlib library, whose distributed data structures are based on the RDD and interoperate with the other components, greatly lowering the barrier to machine learning, especially in distributed environments. Currently, Spark MLlib supports the following machine learning algorithms:

(1) Classification and regression
The algorithms currently implemented include: linear models (SVMs, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (random forests and gradient-boosted trees), and isotonic regression.

(2) Clustering
The algorithms currently implemented include: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), and streaming k-means.

(3) Collaborative filtering
The only algorithm currently implemented is alternating least squares (ALS).

(4) Dimensionality reduction
Singular value decomposition (SVD)
Principal component analysis (PCA)

In addition to the machine learning algorithms mentioned above, the library also includes statistics, feature extraction, and numerical computation algorithms. Starting with the 1.2 release, Spark gave its machine learning library a relatively large push: Spark machine learning is now split into two packages, mllib and ml. The ml package abstracts the whole machine learning process into a Pipeline, saving machine learning engineers from spending a great deal of time on preparatory work such as feature extraction and transformation before model training.
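
A minimal sketch of the ml package's Pipeline abstraction: tokenization, feature hashing, and logistic regression chained into one workflow. The toy training rows and labels are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PipelineSketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Invented toy training data: id, raw text, label.
    val training = sc.parallelize(Seq(
      (0L, "spark rdd dataframe", 1.0),
      (1L, "hadoop mapreduce hdfs", 0.0),
      (2L, "spark mllib pipeline", 1.0),
      (3L, "hive pig sqoop", 0.0)
    )).toDF("id", "text", "label")

    // Each stage consumes the columns produced by the previous one.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
      .fit(training)
    model.transform(training).select("text", "prediction").show()

    sc.stop()
  }
}
```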

4 Spark GraphX
GraphX is dedicated to distributed graph computation; its Graph abstraction is also implemented by extending the Spark RDD, and it provides graph operations such as subgraph, joinVertices, and aggregateMessages.
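
A minimal aggregateMessages sketch: each edge sends the message 1 to its destination vertex, messages are summed per vertex, and the result is each vertex's in-degree. The edge list is invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.Graph

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("GraphXSketch").setMaster("local[*]"))

    // An invented directed graph given as (srcId, dstId) edge tuples.
    val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L), (1L, 3L)))
    val graph = Graph.fromEdgeTuples(edges, defaultValue = 0)

    // aggregateMessages: every edge sends 1 to its destination vertex;
    // messages are merged per vertex, yielding in-degrees.
    val inDegrees = graph.aggregateMessages[Int](
      sendMsg = ctx => ctx.sendToDst(1),
      mergeMsg = _ + _
    )
    inDegrees.collect().foreach { case (id, deg) =>
      println(s"vertex $id: in-degree $deg")
    }

    sc.stop()
  }
}
```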

5 SparkR
The R language is widely used in the field of data analysis, but it was previously confined to single-machine environments. SparkR frees R from that single-machine fate, letting a large number of data engineers analyze data in a distributed environment at very low cost. SparkR exposes the RDD API, and R engineers can submit jobs through the R shell.

Other well-known products in the Spark ecosystem currently include (see http://spark-packages.org/):
1 Astro
Huawei's open source Spark SQL on HBase package. The Spark SQL on HBase package project, also known as Astro, is an end-to-end integration of Spark, Spark SQL, and HBase capabilities; it helps drive Spark into the NoSQL space for a broad customer base, and provides powerful online query and analysis as well as large-scale data processing capabilities for vertical enterprises. See http://www.ctiforum.com/news/guonei/458028.html

2 Apache Zeppelin
An open source, Spark-based web platform for interactive data analysis, with the following features:
(1) Automatic injection of SparkContext and SQLContext
(2) Runtime loading of jar dependencies
(3) Stopping jobs and showing job progress dynamically
Key features include: data ingestion, data discovery, data analytics, and data visualization & collaboration.
At present Zeppelin is still only an incubating project, but it surely has broad prospects; see http://zeppelin.incubator.apache.org/

3 Apache Pig on Apache Spark (Spork)
This one is self-explanatory; see http://blog.cloudera.com/blog/2014/09/pig-is-flying-apache-pig-on-apache-spark/

More products in the Spark ecosystem can be found at http://spark-packages.org/

Follow the author's WeChat public account to learn more about the latest Spark and Scala technical information.

Copyright notice: This is the blogger's original article and may not be reproduced without the blogger's permission.
