Core Components of the Spark Big Data Analytics Framework
The core components of the Spark Big Data analytics framework include the RDD in-memory data structure, the Spark Streaming stream-computing framework, GraphX for graph computation and network data mining, the MLlib machine learning library, the Spark SQL data retrieval language, the Tachyon file system, and the SparkR compute engine. Here is a brief introduction to each.
I. RDD In-Memory Data Structure
A big data analysis system generally includes subsystems for data acquisition, data cleansing, data processing and analysis, and report output. To simplify data processing and improve performance, Spark introduces the RDD (Resilient Distributed Dataset) in-memory data structure, which is similar in spirit to R's in-memory data objects. The user program only accesses the RDD interface; scheduling and exchange of data with the storage system are handled by the Spark driver and executors. RDDs can interact with storage systems such as Hadoop HBase and HDFS, and support for many other storage systems can be added through extensions.
The key benefit of the RDD is that it decouples the application's data model from physical storage and makes it much easier to perform repeated traversals and searches over large datasets. Hadoop's architecture is designed mainly for sequential, one-pass processing, so going back to re-scan data repeatedly is very inefficient, and it offers no unified framework for this pattern, leaving algorithm developers to devise their own workarounds, which is undeniably difficult. The RDD solves this problem to a significant extent. However, because the RDD is the core component, it is also difficult to implement well: its performance, capacity, and stability directly determine how well every other algorithm on top of it can work. To date, memory pressure from RDDs consuming too much memory remains a recurring problem.
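The caching behavior described above can be illustrated with a minimal Scala sketch (assuming a local Spark installation; the input path and object names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Load text from any supported store (a local path here; an hdfs://
    // path or an HBase-backed RDD looks the same to application code).
    val lines = sc.textFile("data/events.log")   // hypothetical input path

    // Transformations are lazy: nothing runs until an action is called.
    val errors = lines.filter(_.contains("ERROR"))

    // cache() keeps the RDD in memory, so repeated scans -- the access
    // pattern that is expensive on plain Hadoop MapReduce -- stay fast.
    errors.cache()

    val total  = errors.count()   // first pass materializes and caches the RDD
    val byHour = errors.map(l => (l.take(13), 1)).reduceByKey(_ + _)   // second pass reads from memory

    println(s"$total errors across ${byHour.count()} hourly buckets")
    sc.stop()
  }
}
```

Note that `cache()` is only a hint: if memory runs short, partitions are dropped and recomputed, which is one source of the memory-pressure issues mentioned above.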
II. Spark Streaming Framework
Streams are now an important data form for Twitter, Weibo, image services, the Internet of Things, location services, and more, so stream computing is becoming more important than ever. Streaming frameworks are core infrastructure for all major service providers: Amazon and Microsoft have launched event-messaging-bus cloud platforms, and Facebook and Twitter have open-sourced their own stream-computing frameworks.
Spark Streaming is designed specifically for handling streaming data. With Spark Streaming, data can be pushed through the processing pipeline quickly, with results fed back to users in the shortest possible time.
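A minimal sketch of this pipeline model, using a plain socket source as a stand-in for a real feed such as Kafka or Flume (host, port, and batch interval are illustrative assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Incoming data is cut into 5-second micro-batches, each processed as an RDD.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A socket text source stands in for Kafka, Flume, or another real feed.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()   // results appear within one batch interval

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because each micro-batch is an ordinary RDD, the same transformations used in batch jobs apply unchanged to streams.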
III. GraphX Graph Computation and Network Data Mining
The topology of physical networks, the connection relationships of social networks, and the E-R (entity-relationship) model of traditional databases are all typical graph data models. Hadoop is designed mainly for "large data volume" workloads and offers little support for processing relationships, and HBase's relational-processing capability is also very weak. Graph algorithms typically require fast, repeated scans over the data, and the RDD enables Spark to process graph-structured data far more efficiently, making it feasible to store and process large-scale graph networks. Comparable systems dedicated to graphs include Neo4j and others.
Compared with joins in a traditional database, GraphX can handle larger and deeper topological relations and can execute across multiple cluster nodes, making it a genuinely modern tool for studying data relationships.
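A minimal GraphX sketch (assuming a local Spark installation; the toy follower graph and names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // A tiny follower graph: vertices carry user names, edges a relation label.
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

    val graph = Graph(users, follows)

    // PageRank walks the topology repeatedly -- exactly the multi-scan
    // access pattern that RDD caching makes cheap.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach { case (id, rank) => println(f"vertex $id: $rank%.3f") }

    sc.stop()
  }
}
```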
IV. MLlib Machine Learning Library
By porting machine learning algorithms onto the Spark architecture, MLlib can take advantage of the underlying large-scale storage, the fast data access of the RDD, and the graph-processing and cluster-computing capabilities, so that machine learning can run on large-scale cluster systems. This greatly expands the range of problems to which machine learning algorithms can be applied.
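As a concrete illustration, a minimal k-means clustering sketch with MLlib (toy in-memory points stand in for data that would normally be loaded from HDFS or HBase):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MllibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mllib-sketch").setMaster("local[*]"))

    // Toy 2-D points; in practice this RDD would come from HDFS/HBase.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)
    )).cache()   // iterative training rescans the data, so keep it in memory

    // k-means is iterative: each iteration is a full pass over the cached
    // RDD, which is exactly where Spark beats disk-bound MapReduce.
    val model = KMeans.train(points, 2, 20)   // k = 2, maxIterations = 20
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```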
V. Spark SQL Data Retrieval Language
Spark SQL is somewhat similar to what Hive provides, but building on the RDD gives it better performance and makes operations such as joins and relational retrieval more convenient to express. It is designed as the standardized entry point for user interaction.
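A minimal join sketch using the Spark 1.x-era `SQLContext` API (the table and column names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case classes describe the schema via reflection.
case class Order(user: String, amount: Double)
case class User(name: String, city: String)

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sql-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Register two small in-memory tables (stand-ins for real sources).
    sc.parallelize(Seq(Order("alice", 12.5), Order("bob", 40.0)))
      .toDF().registerTempTable("orders")
    sc.parallelize(Seq(User("alice", "Oslo"), User("bob", "Bergen")))
      .toDF().registerTempTable("users")

    // A join expressed in plain SQL; the planner executes it over RDDs.
    val joined = sqlContext.sql(
      """SELECT u.city, SUM(o.amount) AS total
        |FROM orders o JOIN users u ON o.user = u.name
        |GROUP BY u.city""".stripMargin)
    joined.show()

    sc.stop()
  }
}
```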
VI. Tachyon File System
Tachyon is an HDFS-like file system, but it sits closer to the user-facing, in-memory layer, whereas HDFS is primarily block-oriented disk storage.
VII. SparkR Compute Engine
SparkR allows the R language to run on Spark's underlying compute architecture, providing R users with a distributed algorithm engine.