Spark Ecosystem (BDAS): An Introduction


The Spark ecosystem, also known as BDAS (Berkeley Data Analytics Stack), is a platform designed by Berkeley's AMPLab to showcase big data applications through large-scale integration of Algorithms, Machines, and People. Its core engine is Spark, which is built on the Resilient Distributed Dataset (RDD) abstraction. Through the Spark ecosystem, AMPLab combines big data, cloud computing, and communications resources with flexible technical solutions to turn massive amounts of opaque data into useful information, helping people understand the world better.

The Spark ecosystem already covers many fields, including machine learning, data mining, databases, information retrieval, natural language processing, and speech recognition. As Spark matures, its excellent performance is making it the next major open source data processing platform for both industry and academia. With the release of Spark 1.0.0 and the continued expansion of the ecosystem, Spark can be expected to attract ever more attention in the coming period.

Let's take a look at the current Spark 1.0.0 ecosystem, BDAS (Berkeley Data Analytics Stack), with a brief introduction to each component. The ecosystem uses Spark as its core engine; HDFS, S3, and Tachyon as the persistence layer for reading and writing raw data; and Mesos, YARN, or Spark's own standalone mode as the resource manager that schedules jobs to run Spark applications. These applications can come from different components: Spark batch processing, real-time processing with Spark Streaming, ad hoc queries with Spark SQL, trade-off queries with BlinkDB, machine learning with MLlib or MLbase, graph processing with GraphX, mathematical computing with SparkR, and so on.
For more information, see the project pages of Berkeley's AMPLab at https://amplab.cs.berkeley.edu/projects/ or the Spark Summit site at http://spark-summit.org/.

1: Introduction to the ecosystem components

A: Spark
Spark is a fast, general-purpose, large-scale data processing system. Compared to Hadoop MapReduce, it offers:
    • Better fault tolerance and in-memory computing
    • High speed: in-memory operations up to 100 times faster than MapReduce
    • Ease of use: the same application needs 2 to 5 times less code than its MapReduce equivalent
    • A rich API
    • Support for interactive and iterative programs
The Spark big data platform is thriving thanks to the excellent Spark core architecture, which:
    • Provides a distributed parallel computing framework that supports DAG execution graphs, reducing intermediate-result I/O overhead between successive computations
    • Provides a cache mechanism to support repeated iteration and data sharing, further reducing I/O overhead
    • Maintains lineage between RDDs, so that when an RDD fails it can be rebuilt automatically from its parent RDDs, ensuring fault tolerance
    • Moves computation to the data instead of moving the data: RDD partitions read data blocks from the distributed file system into each node's memory for computation
    • Uses a multi-threaded pool model to reduce task startup overhead
    • Avoids unnecessary sort operations during the shuffle
    • Uses the fault-tolerant, highly scalable Akka as its communication framework
    • ...
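The lineage and caching points above can be sketched in a few lines of plain Python. This is a conceptual toy, not the real Spark API: the `MiniRDD` class and its methods are invented for illustration. Each "RDD" records its parent and the transformation that produced it, so a lost result can be recomputed from the parent automatically.

```python
# Conceptual sketch (NOT the real Spark API) of lineage-based fault
# tolerance: each MiniRDD remembers its parent and the transform that
# produced it, so lost data can be rebuilt by replaying the lineage.

class MiniRDD:
    def __init__(self, data=None, parent=None, transform=None):
        self._data = data          # None until computed (or after a "failure")
        self._parent = parent      # lineage: where this RDD came from
        self._transform = transform
        self._cached = False

    def map(self, fn):
        # Lazy: record the transformation, do not compute yet.
        return MiniRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def cache(self):
        self._cached = True
        return self

    def collect(self):
        if self._data is not None:
            return self._data
        # Recompute from the parent via the recorded lineage.
        result = self._transform(self._parent.collect())
        if self._cached:
            self._data = result
        return result

    def fail(self):
        # Simulate losing this RDD's materialized data.
        if self._parent is not None:
            self._data = None

base = MiniRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2).cache()
print(doubled.collect())   # computed through the lineage: [2, 4, 6]
doubled.fail()
print(doubled.collect())   # rebuilt automatically from the parent: [2, 4, 6]
```

Real RDDs add partitioning and distributed scheduling on top of this idea, but the recover-by-recomputation principle is the same.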

B: Spark Streaming
Spark Streaming is a high-throughput, fault-tolerant stream processing system for real-time data streams. It can ingest a variety of data sources such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, apply complex operations such as map, reduce, join, and window, and save the results to an external file system or database, or push them to a real-time dashboard. The characteristics of the Spark Streaming processing system are:
    • Decomposes streaming computation into a series of short batch jobs
    • Re-executes failed or slow tasks in parallel on other nodes
    • Strong fault tolerance (based on RDD lineage)
    • Uses the same semantics as RDDs
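The first bullet is the core of the model: a stream is chopped into short batches, and the same batch job runs on every batch. A minimal stdlib-Python sketch of that decomposition (the batch size and the word-count job are illustrative choices, not the real DStream API):

```python
# Conceptual sketch of micro-batching: decompose a stream into a series
# of short batch jobs, each processed with ordinary batch semantics.

def micro_batches(stream, batch_size):
    """Chop an (infinite or finite) event stream into fixed-size batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def batch_job(events):
    """The same batch logic is reused on every micro-batch (word count here)."""
    counts = {}
    for word in events:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["a", "b", "a", "c", "a", "b"]
results = [batch_job(b) for b in micro_batches(stream, batch_size=3)]
print(results)  # [{'a': 2, 'b': 1}, {'c': 1, 'a': 1, 'b': 1}]
```

In the real system the batch boundary is a time interval rather than a count, and each batch becomes an RDD, which is why failed batches inherit RDD fault tolerance for free.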

C: Spark SQL
Spark SQL is an ad hoc query system. Its predecessor was Shark, but the code has been almost entirely rewritten while keeping Shark's best parts. Spark SQL can run queries on Spark expressed in SQL, HiveQL, or a Scala DSL. Currently, Spark SQL is still an alpha release. Features of Spark SQL:
    • Introduces a new RDD type, SchemaRDD, which can be defined the way a table is defined in a traditional database: a SchemaRDD consists of row objects plus a schema that describes the data type of each column.
    • A SchemaRDD can be converted from an ordinary RDD, read from a Parquet file, or obtained from Hive using HiveQL.
    • Data from different sources can be mixed in one application; for example, data obtained via HiveQL and data obtained via SQL can be joined.
    • The built-in Catalyst optimizer automatically optimizes user queries.
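The SchemaRDD idea of "rows plus a schema" can be illustrated with a stdlib-Python sketch. The class and method names below are invented for illustration and are not Spark SQL's API; the point is that attaching column names and types to plain rows is what makes declarative querying (and optimization) possible.

```python
# Conceptual sketch of the SchemaRDD idea: ordinary rows plus a schema
# naming each column and its type, so queries can be validated and planned.

class SchemaRows:
    def __init__(self, schema, rows):
        self.schema = schema  # e.g. [("name", str), ("age", int)]
        for row in rows:      # validate each row against the declared types
            assert all(isinstance(v, t) for v, (_, t) in zip(row, schema))
        self.rows = rows

    def select(self, *cols):
        idx = [i for i, (name, _) in enumerate(self.schema) if name in cols]
        return [tuple(row[i] for i in idx) for row in self.rows]

    def where(self, col, pred):
        i = [name for name, _ in self.schema].index(col)
        return SchemaRows(self.schema, [r for r in self.rows if pred(r[i])])

people = SchemaRows([("name", str), ("age", int)],
                    [("alice", 34), ("bob", 19), ("carol", 52)])
print(people.where("age", lambda a: a > 30).select("name"))
# [('alice',), ('carol',)]
```

In Spark SQL the same query would be written declaratively (SQL or HiveQL) and Catalyst would choose the execution plan; here the "plan" is just the method call order.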

D: BlinkDB
BlinkDB is an interesting interactive query system that works like a seesaw: the user must trade query accuracy against query time. To get results faster, the user sacrifices some accuracy in the results; conversely, to get more accurate results, the user sacrifices response time. Users can specify an error bound at query time. The core ideas of BlinkDB's design:
    • Build and maintain a set of multi-dimensional samples of the data
    • When a query arrives, select an appropriate sample on which to run it
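The trade-off can be made concrete with a stdlib-Python sketch: answer an aggregate query on a small random sample instead of the full data, returning an estimate plus a rough confidence interval. The sample size and the normal-approximation interval are illustrative choices, not BlinkDB's actual sampling strategy.

```python
# Conceptual sketch of approximate querying: aggregate over a pre-drawn
# random sample and report an estimate with a ~95% confidence half-width.

import math
import random

def approx_mean(sample):
    """Estimate the mean and a ~95% confidence half-width from a sample."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return mean, half_width

random.seed(0)
data = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]  # the "full table"
sample = random.sample(data, 1_000)                           # maintained sample

exact = sum(data) / len(data)          # slow: scans everything
estimate, err = approx_mean(sample)    # fast: scans 0.1% of the data
print(f"exact={exact:.2f}  estimate={estimate:.2f} +/- {err:.2f}")
```

Shrinking the sample makes the query faster but widens the error bar, which is exactly the seesaw described above.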

E: MLbase/MLlib
MLlib is Spark's implementation of common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and the underlying optimization primitives. MLbase, through clearly defined layers, aims to be a machine learning platform that lowers the barrier to entry, so that even users who do not understand machine learning can easily use its tools to process their own data. MLbase defines four layers:
    • ML Optimizer: chooses the most appropriate of the already-implemented machine learning algorithms and their parameters
    • MLI: an API and platform for implementing algorithms, providing feature extraction and high-level ML programming abstractions
    • MLlib: Spark's underlying distributed machine learning library, whose set of algorithms can be extended continuously
    • MLRuntime: the Spark computing framework itself, which brings Spark's distributed computing to the machine learning field
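To give a flavor of one algorithm in MLlib's catalogue, here is a single-machine k-means clustering sketch in plain Python. MLlib distributes these same assignment and update steps over RDD partitions; the naive "first k points" initialization and the 2D points are illustrative simplifications.

```python
# Conceptual single-machine sketch of k-means clustering, one of the
# algorithm families MLlib provides in distributed form.

def kmeans(points, k, iterations=10):
    centers = points[:k]  # naive initialization: first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
print(sorted(kmeans(points, k=2)))  # roughly [(1.03, 0.97), (8.03, 8.0)]
```

The two alternating steps are embarrassingly parallel over points, which is what makes the algorithm a natural fit for Spark's map/reduce-style execution.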

F: GraphX
GraphX is a Spark-based API for graph processing and graph-parallel computation. GraphX defines a new abstraction, the resilient distributed property graph: a directed multigraph with attributes attached to every vertex and edge. It introduces three core RDDs (vertices, edges, and triplets) and a set of basic operations such as subgraph, joinVertices, and mapReduceTriplets, and it continues to grow its collection of graph algorithms and graph-building tools to simplify graph analytics.
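The property-graph abstraction is easy to sketch in stdlib Python. The class and method names below are illustrative, not the GraphX API; the point is that vertices and edges carry attributes, and a "triplet" joins an edge with both endpoint attributes.

```python
# Conceptual sketch of a property graph with triplets and subgraph,
# mirroring (but not reproducing) GraphX's vertices/edges/triplets views.

class PropertyGraph:
    def __init__(self, vertices, edges):
        self.vertices = vertices  # {vertex_id: attribute}
        self.edges = edges        # [(src_id, dst_id, attribute), ...]

    def triplets(self):
        # Each triplet = (src attr, edge attr, dst attr).
        return [(self.vertices[s], attr, self.vertices[d])
                for s, d, attr in self.edges]

    def subgraph(self, vertex_pred):
        # Keep vertices passing the predicate and only edges between them.
        kept = {v: a for v, a in self.vertices.items() if vertex_pred(a)}
        return PropertyGraph(kept, [(s, d, a) for s, d, a in self.edges
                                    if s in kept and d in kept])

g = PropertyGraph({1: "alice", 2: "bob", 3: "spam-bot"},
                  [(1, 2, "follows"), (3, 1, "follows")])
print(g.triplets())
# [('alice', 'follows', 'bob'), ('spam-bot', 'follows', 'alice')]
print(g.subgraph(lambda name: "bot" not in name).triplets())
# [('alice', 'follows', 'bob')]
```

In GraphX these three views are themselves RDDs, so graph operations inherit Spark's partitioning and fault tolerance.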
G: SparkR
SparkR is an R package released by AMPLab that frees R from its single-machine fate: R code can run as Spark jobs on a cluster, greatly expanding R's data processing capacity. Several features of SparkR:
    • Provides Spark's Resilient Distributed Dataset (RDD) API in R, allowing users to run Spark jobs interactively from the R shell on a cluster.
    • Supports serializable closures: variables referenced in a user-defined function are automatically shipped to the other machines in the cluster.
    • SparkR can also easily call R packages: just load the package with includePackage before running the operation on the cluster (the package must, of course, be installed on the cluster machines).
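The "serializable closures" point deserves a small illustration: before a user-defined function can run on another machine, the system must discover which outside variables it references so their values can travel with it. This Python sketch (SparkR does this automatically for R functions; the discovery mechanism shown here is illustrative) finds the global variables a function's body refers to.

```python
# Conceptual sketch of closure capture: find the free (global) variables
# a user-defined function references, so they could be serialized and
# shipped to worker machines along with the function.

threshold = 10

def over_threshold(x):
    return x > threshold  # `threshold` is a free variable of this closure

# Names the function body loads but does not define locally:
free_vars = {name: globals()[name]
             for name in over_threshold.__code__.co_names
             if name in globals()}
print(free_vars)  # {'threshold': 10}
```

Shipping `{'threshold': 10}` alongside `over_threshold` is what lets a remote worker evaluate the function with the same result the user would see locally.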

2: Applications of the ecosystem
Built around the Spark core and the RDD, the Spark ecosystem creates a big data platform based on in-memory computing and provides people with an all-in-one data processing solution. Instead of stitching together multiple isolated systems to meet a scenario's needs, one can use several products of the Spark ecosystem to cover different application scenarios. Here are a few typical examples:

A: Scenario 1: Analysis and querying of historical and real-time data. Spark analyzes the historical data, Spark Streaming analyzes the real-time data, and finally Spark SQL or BlinkDB lets users query the results interactively.
B: Scenario 2: Fraud detection and discovery of abnormal behavior. Historical data is analyzed with Spark, a data model is built with MLlib, real-time data is evaluated against the model by Spark Streaming, and abnormal data is detected and flagged.
C: Scenario 3: Social network insight. Social relationships are computed with Spark and GraphX, and recommendations are generated from them.

