Core Components of the Spark Big Data Analytics Framework
The core components of the Spark Big Data analytics framework include the RDD in-memory data structure, the Spark Streaming stream-computing framework, GraphX for graph computation and network data mining, the MLlib machine learning library, the Spark SQL data retrieval language, the Tachyon file system, and the SparkR compute engine. Here is a brief introduction to each.
I. RDD In-Memory Data Structure
A big data analysis system generally includes subsystems for data acquisition, data cleansing, data processing and analysis, and report output. To simplify data processing and improve performance, Spark introduces the RDD (Resilient Distributed Dataset) in-memory data structure, which is similar in spirit to R's in-memory data objects. The user program only accesses the RDD interface; scheduling and exchange of data with the storage system are handled by the Spark driver and executors. RDDs can interact with storage systems such as Hadoop HBase and HDFS, and support for many other storage systems can be added through extensions.
The key benefit of the RDD is that it decouples the application's data model from physical storage and makes it much easier to perform repeated traversals and searches over large datasets. Hadoop's architecture is designed mainly for sequential, one-pass processing, so going back to re-scan data repeatedly is very inefficient, and it offers no unified framework for this pattern, leaving algorithm developers to devise their own workarounds, which is undeniably difficult. The RDD solves this problem to a significant extent. However, because the RDD is the core component, it is also difficult to implement well: its performance, capacity, and stability directly determine how well every other algorithm on top of it can work. To date, memory pressure from RDDs consuming too much memory remains a recurring problem.
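The caching behavior described above can be illustrated with a minimal Scala sketch (assuming a local Spark installation; the input path and object names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Load text from any supported store (a local path here; an hdfs://
    // path or an HBase-backed RDD looks the same to application code).
    val lines = sc.textFile("data/events.log")   // hypothetical input path

    // Transformations are lazy: nothing runs until an action is called.
    val errors = lines.filter(_.contains("ERROR"))

    // cache() keeps the RDD in memory, so repeated scans -- the access
    // pattern that is expensive on plain Hadoop MapReduce -- stay fast.
    errors.cache()

    val total  = errors.count()   // first pass materializes and caches the RDD
    val byHour = errors.map(l => (l.take(13), 1)).reduceByKey(_ + _)   // second pass reads from memory

    println(s"$total errors across ${byHour.count()} hourly buckets")
    sc.stop()
  }
}
```

Note that `cache()` is only a hint: if memory runs short, partitions are dropped and recomputed, which is one source of the memory-pressure issues mentioned above.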
II. Spark Streaming Framework
Streams are now an important data form for Twitter, Weibo, image services, the Internet of Things, location services, and more, so stream computing is becoming more important than ever. Streaming frameworks are core infrastructure for all major service providers: Amazon and Microsoft have launched event-messaging-bus cloud platforms, and Facebook and Twitter have open-sourced their own stream-computing frameworks.
Spark Streaming is designed specifically for handling streaming data. With Spark Streaming, data can be pushed through the processing pipeline quickly, with results fed back to users in the shortest possible time.
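A minimal sketch of this pipeline model, using a plain socket source as a stand-in for a real feed such as Kafka or Flume (host, port, and batch interval are illustrative assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Incoming data is cut into 5-second micro-batches, each processed as an RDD.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A socket text source stands in for Kafka, Flume, or another real feed.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()   // results appear within one batch interval

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because each micro-batch is an ordinary RDD, the same transformations used in batch jobs apply unchanged to streams.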
III. GraphX Graph Computation and Network Data Mining
The topology of physical networks, the connection relationships of social networks, and the E-R (entity-relationship) model of traditional databases are all typical graph data models. Hadoop is designed mainly for "large data volume" workloads and offers little support for processing relationships, and HBase's relational-processing capability is also very weak. Graph algorithms typically require fast, repeated scans over the data, and the RDD enables Spark to process graph-structured data far more efficiently, making it feasible to store and process large-scale graph networks. Comparable systems dedicated to graphs include Neo4j and others.
Compared with joins in a traditional database, GraphX can handle larger and deeper topological relations and can execute across multiple cluster nodes, making it a genuinely modern tool for studying data relationships.
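A minimal GraphX sketch (assuming a local Spark installation; the toy follower graph and names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // A tiny follower graph: vertices carry user names, edges a relation label.
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

    val graph = Graph(users, follows)

    // PageRank walks the topology repeatedly -- exactly the multi-scan
    // access pattern that RDD caching makes cheap.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach { case (id, rank) => println(f"vertex $id: $rank%.3f") }

    sc.stop()
  }
}
```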
IV. MLlib Machine Learning Library
By porting machine learning algorithms onto the Spark architecture, MLlib can take advantage of the underlying large-scale storage, the fast data access of the RDD, and the graph-processing and cluster-computing capabilities, so that machine learning can run on large-scale cluster systems. This greatly expands the range of problems to which machine learning algorithms can be applied.
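As a concrete illustration, a minimal k-means clustering sketch with MLlib (toy in-memory points stand in for data that would normally be loaded from HDFS or HBase):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MllibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mllib-sketch").setMaster("local[*]"))

    // Toy 2-D points; in practice this RDD would come from HDFS/HBase.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)
    )).cache()   // iterative training rescans the data, so keep it in memory

    // k-means is iterative: each iteration is a full pass over the cached
    // RDD, which is exactly where Spark beats disk-bound MapReduce.
    val model = KMeans.train(points, 2, 20)   // k = 2, maxIterations = 20
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```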
V. Spark SQL Data Retrieval Language
Spark SQL is somewhat similar to what Hive provides, but building on the RDD gives it better performance and makes operations such as joins and relational retrieval more convenient to express. It is designed as the standardized entry point for user interaction.
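A minimal join sketch using the Spark 1.x-era `SQLContext` API (the table and column names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case classes describe the schema via reflection.
case class Order(user: String, amount: Double)
case class User(name: String, city: String)

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sql-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Register two small in-memory tables (stand-ins for real sources).
    sc.parallelize(Seq(Order("alice", 12.5), Order("bob", 40.0)))
      .toDF().registerTempTable("orders")
    sc.parallelize(Seq(User("alice", "Oslo"), User("bob", "Bergen")))
      .toDF().registerTempTable("users")

    // A join expressed in plain SQL; the planner executes it over RDDs.
    val joined = sqlContext.sql(
      """SELECT u.city, SUM(o.amount) AS total
        |FROM orders o JOIN users u ON o.user = u.name
        |GROUP BY u.city""".stripMargin)
    joined.show()

    sc.stop()
  }
}
```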
VI. Tachyon File System
Tachyon is an HDFS-like file system, but it sits closer to the user-facing, in-memory layer, whereas HDFS is primarily block-oriented disk storage.
VII. SparkR Compute Engine
SparkR allows the R language to run on Spark's underlying compute architecture, providing R users with a distributed algorithm engine.