Spark Ecosystem (BDAS): An Introduction


The Spark ecosystem, also known as BDAS (Berkeley Data Analytics Stack), is a platform designed by Berkeley's AMPLab to showcase big data applications through large-scale integration of Algorithms, Machines, and People. Its core engine is Spark, which is built on the Resilient Distributed Dataset (RDD) abstraction. Through the Spark ecosystem, AMPLab combines big data, cloud computing, and communications resources with flexible technical solutions to turn massive amounts of opaque data into useful information, helping people understand the world better.

The Spark ecosystem already covers many fields, including machine learning, data mining, databases, information retrieval, natural language processing, and speech recognition. As Spark matures, its excellent performance is making it the next major open source data processing platform for both industry and academia. With the release of Spark 1.0.0 and the continued expansion of the ecosystem, Spark can be expected to attract ever more attention in the coming period.

Let's take a look at the current Spark 1.0.0 ecosystem, BDAS (Berkeley Data Analytics Stack), with a brief introduction to each component. The ecosystem uses Spark as its core engine; HDFS, S3, and Tachyon as the persistence layer for reading and writing raw data; and Mesos, YARN, or Spark's own standalone mode as the resource manager that schedules jobs to run Spark applications. These applications can come from different components: Spark batch processing, real-time processing with Spark Streaming, ad hoc queries with Spark SQL, trade-off queries with BlinkDB, machine learning with MLlib or MLbase, graph processing with GraphX, mathematical computing with SparkR, and so on.
For more information, see the project pages of Berkeley's AMPLab at https://amplab.cs.berkeley.edu/projects/ or the Spark Summit site at http://spark-summit.org/.

1: Introduction to the ecosystem components

A: Spark
Spark is a fast, general-purpose, large-scale data processing system. Compared to Hadoop MapReduce, it offers:
    • Better fault tolerance and in-memory computing
    • High speed: in-memory operations up to 100 times faster than MapReduce
    • Ease of use: the same application needs 2 to 5 times less code than its MapReduce equivalent
    • A rich API
    • Support for interactive and iterative programs
The Spark big data platform is thriving thanks to the excellent Spark core architecture, which:
    • Provides a distributed parallel computing framework that supports DAG execution graphs, reducing intermediate-result I/O overhead between successive computations
    • Provides a cache mechanism to support repeated iteration and data sharing, further reducing I/O overhead
    • Maintains lineage between RDDs, so that when an RDD fails it can be rebuilt automatically from its parent RDDs, ensuring fault tolerance
    • Moves computation to the data instead of moving the data: RDD partitions read data blocks from the distributed file system into each node's memory for computation
    • Uses a multi-threaded pool model to reduce task startup overhead
    • Avoids unnecessary sort operations during the shuffle
    • Uses the fault-tolerant, highly scalable Akka as its communication framework
    • ...
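The lineage and caching points above can be sketched in a few lines of plain Python. This is a conceptual toy, not the real Spark API: the `MiniRDD` class and its methods are invented for illustration. Each "RDD" records its parent and the transformation that produced it, so a lost result can be recomputed from the parent automatically.

```python
# Conceptual sketch (NOT the real Spark API) of lineage-based fault
# tolerance: each MiniRDD remembers its parent and the transform that
# produced it, so lost data can be rebuilt by replaying the lineage.

class MiniRDD:
    def __init__(self, data=None, parent=None, transform=None):
        self._data = data          # None until computed (or after a "failure")
        self._parent = parent      # lineage: where this RDD came from
        self._transform = transform
        self._cached = False

    def map(self, fn):
        # Lazy: record the transformation, do not compute yet.
        return MiniRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def cache(self):
        self._cached = True
        return self

    def collect(self):
        if self._data is not None:
            return self._data
        # Recompute from the parent via the recorded lineage.
        result = self._transform(self._parent.collect())
        if self._cached:
            self._data = result
        return result

    def fail(self):
        # Simulate losing this RDD's materialized data.
        if self._parent is not None:
            self._data = None

base = MiniRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2).cache()
print(doubled.collect())   # computed through the lineage: [2, 4, 6]
doubled.fail()
print(doubled.collect())   # rebuilt automatically from the parent: [2, 4, 6]
```

Real RDDs add partitioning and distributed scheduling on top of this idea, but the recover-by-recomputation principle is the same.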

B: Spark Streaming
Spark Streaming is a high-throughput, fault-tolerant stream processing system for real-time data streams. It can ingest a variety of data sources such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, apply complex operations such as map, reduce, join, and window, and save the results to an external file system or database, or push them to a real-time dashboard. The characteristics of the Spark Streaming processing system are:
    • Decomposes streaming computation into a series of short batch jobs
    • Re-executes failed or slow tasks in parallel on other nodes
    • Strong fault tolerance (based on RDD lineage)
    • Uses the same semantics as RDDs
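The first bullet is the core of the model: a stream is chopped into short batches, and the same batch job runs on every batch. A minimal stdlib-Python sketch of that decomposition (the batch size and the word-count job are illustrative choices, not the real DStream API):

```python
# Conceptual sketch of micro-batching: decompose a stream into a series
# of short batch jobs, each processed with ordinary batch semantics.

def micro_batches(stream, batch_size):
    """Chop an (infinite or finite) event stream into fixed-size batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def batch_job(events):
    """The same batch logic is reused on every micro-batch (word count here)."""
    counts = {}
    for word in events:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["a", "b", "a", "c", "a", "b"]
results = [batch_job(b) for b in micro_batches(stream, batch_size=3)]
print(results)  # [{'a': 2, 'b': 1}, {'c': 1, 'a': 1, 'b': 1}]
```

In the real system the batch boundary is a time interval rather than a count, and each batch becomes an RDD, which is why failed batches inherit RDD fault tolerance for free.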

C: Spark SQL
Spark SQL is an ad hoc query system. Its predecessor was Shark, but the code has been almost entirely rewritten while keeping Shark's best parts. Spark SQL can run queries on Spark expressed in SQL, HiveQL, or a Scala DSL. Currently, Spark SQL is still an alpha release. Features of Spark SQL:
    • Introduces a new RDD type, SchemaRDD, which can be defined the way a table is defined in a traditional database: a SchemaRDD consists of row objects plus a schema that describes the data type of each column.
    • A SchemaRDD can be converted from an ordinary RDD, read from a Parquet file, or obtained from Hive using HiveQL.
    • Data from different sources can be mixed in one application; for example, data obtained via HiveQL and data obtained via SQL can be joined.
    • The built-in Catalyst optimizer automatically optimizes user queries.
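The SchemaRDD idea of "rows plus a schema" can be illustrated with a stdlib-Python sketch. The class and method names below are invented for illustration and are not Spark SQL's API; the point is that attaching column names and types to plain rows is what makes declarative querying (and optimization) possible.

```python
# Conceptual sketch of the SchemaRDD idea: ordinary rows plus a schema
# naming each column and its type, so queries can be validated and planned.

class SchemaRows:
    def __init__(self, schema, rows):
        self.schema = schema  # e.g. [("name", str), ("age", int)]
        for row in rows:      # validate each row against the declared types
            assert all(isinstance(v, t) for v, (_, t) in zip(row, schema))
        self.rows = rows

    def select(self, *cols):
        idx = [i for i, (name, _) in enumerate(self.schema) if name in cols]
        return [tuple(row[i] for i in idx) for row in self.rows]

    def where(self, col, pred):
        i = [name for name, _ in self.schema].index(col)
        return SchemaRows(self.schema, [r for r in self.rows if pred(r[i])])

people = SchemaRows([("name", str), ("age", int)],
                    [("alice", 34), ("bob", 19), ("carol", 52)])
print(people.where("age", lambda a: a > 30).select("name"))
# [('alice',), ('carol',)]
```

In Spark SQL the same query would be written declaratively (SQL or HiveQL) and Catalyst would choose the execution plan; here the "plan" is just the method call order.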

D: BlinkDB
BlinkDB is an interesting interactive query system that works like a seesaw: the user must trade query accuracy against query time. To get results faster, the user sacrifices some accuracy in the results; conversely, to get more accurate results, the user sacrifices response time. Users can specify an error bound at query time. The core ideas of BlinkDB's design:
    • Build and maintain a set of multi-dimensional samples of the data
    • When a query arrives, select an appropriate sample on which to run it
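The trade-off can be made concrete with a stdlib-Python sketch: answer an aggregate query on a small random sample instead of the full data, returning an estimate plus a rough confidence interval. The sample size and the normal-approximation interval are illustrative choices, not BlinkDB's actual sampling strategy.

```python
# Conceptual sketch of approximate querying: aggregate over a pre-drawn
# random sample and report an estimate with a ~95% confidence half-width.

import math
import random

def approx_mean(sample):
    """Estimate the mean and a ~95% confidence half-width from a sample."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return mean, half_width

random.seed(0)
data = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]  # the "full table"
sample = random.sample(data, 1_000)                           # maintained sample

exact = sum(data) / len(data)          # slow: scans everything
estimate, err = approx_mean(sample)    # fast: scans 0.1% of the data
print(f"exact={exact:.2f}  estimate={estimate:.2f} +/- {err:.2f}")
```

Shrinking the sample makes the query faster but widens the error bar, which is exactly the seesaw described above.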

E: MLbase/MLlib
MLlib is Spark's implementation of common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and the underlying optimization primitives. MLbase, through clearly defined layers, aims to be a machine learning platform that lowers the barrier to entry, so that even users who do not understand machine learning can easily use its tools to process their own data. MLbase defines four layers:
    • ML Optimizer: chooses the most appropriate of the already-implemented machine learning algorithms and their parameters
    • MLI: an API and platform for implementing algorithms, providing feature extraction and high-level ML programming abstractions
    • MLlib: Spark's underlying distributed machine learning library, whose set of algorithms can be extended continuously
    • MLRuntime: the Spark computing framework itself, which brings Spark's distributed computing to the machine learning field
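To give a flavor of one algorithm in MLlib's catalogue, here is a single-machine k-means clustering sketch in plain Python. MLlib distributes these same assignment and update steps over RDD partitions; the naive "first k points" initialization and the 2D points are illustrative simplifications.

```python
# Conceptual single-machine sketch of k-means clustering, one of the
# algorithm families MLlib provides in distributed form.

def kmeans(points, k, iterations=10):
    centers = points[:k]  # naive initialization: first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
print(sorted(kmeans(points, k=2)))  # roughly [(1.03, 0.97), (8.03, 8.0)]
```

The two alternating steps are embarrassingly parallel over points, which is what makes the algorithm a natural fit for Spark's map/reduce-style execution.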

F: GraphX
GraphX is a Spark-based API for graph processing and graph-parallel computation. GraphX defines a new abstraction, the resilient distributed property graph: a directed multigraph with attributes attached to every vertex and edge. It introduces three core RDDs (vertices, edges, and triplets) and a set of basic operations such as subgraph, joinVertices, and mapReduceTriplets, and it continues to grow its collection of graph algorithms and graph-building tools to simplify graph analytics.
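The property-graph abstraction is easy to sketch in stdlib Python. The class and method names below are illustrative, not the GraphX API; the point is that vertices and edges carry attributes, and a "triplet" joins an edge with both endpoint attributes.

```python
# Conceptual sketch of a property graph with triplets and subgraph,
# mirroring (but not reproducing) GraphX's vertices/edges/triplets views.

class PropertyGraph:
    def __init__(self, vertices, edges):
        self.vertices = vertices  # {vertex_id: attribute}
        self.edges = edges        # [(src_id, dst_id, attribute), ...]

    def triplets(self):
        # Each triplet = (src attr, edge attr, dst attr).
        return [(self.vertices[s], attr, self.vertices[d])
                for s, d, attr in self.edges]

    def subgraph(self, vertex_pred):
        # Keep vertices passing the predicate and only edges between them.
        kept = {v: a for v, a in self.vertices.items() if vertex_pred(a)}
        return PropertyGraph(kept, [(s, d, a) for s, d, a in self.edges
                                    if s in kept and d in kept])

g = PropertyGraph({1: "alice", 2: "bob", 3: "spam-bot"},
                  [(1, 2, "follows"), (3, 1, "follows")])
print(g.triplets())
# [('alice', 'follows', 'bob'), ('spam-bot', 'follows', 'alice')]
print(g.subgraph(lambda name: "bot" not in name).triplets())
# [('alice', 'follows', 'bob')]
```

In GraphX these three views are themselves RDDs, so graph operations inherit Spark's partitioning and fault tolerance.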
G: SparkR
SparkR is an R package released by AMPLab that frees R from its single-machine fate: R code can run as Spark jobs on a cluster, greatly expanding R's data processing capacity. Several features of SparkR:
    • Provides Spark's Resilient Distributed Dataset (RDD) API in R, allowing users to run Spark jobs interactively from the R shell on a cluster.
    • Supports serializable closures: variables referenced in a user-defined function are automatically shipped to the other machines in the cluster.
    • SparkR can also easily call R packages: just load the package with includePackage before running the operation on the cluster (the package must, of course, be installed on the cluster machines).
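The "serializable closures" point deserves a small illustration: before a user-defined function can run on another machine, the system must discover which outside variables it references so their values can travel with it. This Python sketch (SparkR does this automatically for R functions; the discovery mechanism shown here is illustrative) finds the global variables a function's body refers to.

```python
# Conceptual sketch of closure capture: find the free (global) variables
# a user-defined function references, so they could be serialized and
# shipped to worker machines along with the function.

threshold = 10

def over_threshold(x):
    return x > threshold  # `threshold` is a free variable of this closure

# Names the function body loads but does not define locally:
free_vars = {name: globals()[name]
             for name in over_threshold.__code__.co_names
             if name in globals()}
print(free_vars)  # {'threshold': 10}
```

Shipping `{'threshold': 10}` alongside `over_threshold` is what lets a remote worker evaluate the function with the same result the user would see locally.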

2: Applications of the ecosystem
Built around the Spark core and the RDD, the Spark ecosystem creates a big data platform based on in-memory computing and provides people with an all-in-one data processing solution. Instead of stitching together multiple isolated systems to meet a scenario's needs, one can use several products of the Spark ecosystem to cover different application scenarios. Here are a few typical examples:

A: Scenario 1: Analysis and querying of historical and real-time data. Spark analyzes the historical data, Spark Streaming analyzes the real-time data, and finally Spark SQL or BlinkDB lets users query the results interactively.
B: Scenario 2: Fraud detection and discovery of abnormal behavior. Historical data is analyzed with Spark, a data model is built with MLlib, real-time data is evaluated against the model by Spark Streaming, and abnormal data is detected and flagged.
C: Scenario 3: Social network insight. Social relationships are computed with Spark and GraphX, and recommendations are generated from them.

