Getui's Spark Practice: How to Avoid Those Development "Pits"

Tags: hadoop, mapreduce, spark, mllib

As an open-source data processing framework, Spark caches intermediate data in memory during computation, which can greatly improve processing speed, especially for complex iterative computations. The Spark stack mainly includes Spark SQL, Spark Streaming, Spark MLlib, and graph computation (GraphX).

Introduction to Spark Core Concepts

1. RDD, the Resilient Distributed Dataset, is the abstraction on which you run the various operators that process and compute data. For example, to compute word frequencies with Spark, that is, to run WordCount over a piece of text, you load the text into an RDD, call the map and reduceByKey operators, and finally execute the count action to trigger the real computation (a minimal sketch follows this section).

2. Wide and narrow dependencies. Picture a factory with many assembly lines: upstream, one worker performs an operation on a product, and downstream another worker performs the next operation. A narrow dependency is very similar: each downstream step depends on a single upstream step. A so-called wide dependency is more like having several lines whose operations depend on one another, say an operation on line A that must wait for line B before it can continue; materials then have to be transported and coordinated between the two lines, which is inefficient.

As the figure shows, if B relies only on A, it is a narrow dependency. An operation like reduceByKey is a typical example of a wide dependency, where operations on several lines depend on each other, for example F depending on both E and B. The biggest problem with wide dependencies is that they force a shuffle; the sketches below make this visible.
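For item 1, here is a minimal WordCount sketch against the Spark 1.x RDD API; the input path is a placeholder assumption, not from the original article:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        // Load the text into an RDD; "input.txt" is a placeholder path.
        val lines = sc.textFile("input.txt")

        // flatMap/map/reduceByKey are lazy transformations; nothing runs yet.
        val counts = lines.flatMap(_.split("\\s+"))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)

        // count() is an action and triggers the real computation.
        println(s"distinct words: ${counts.count()}")
        sc.stop()
      }
    }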
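For item 2, one way to see where a wide dependency forces a shuffle is to print an RDD's lineage. This sketch over made-up data shows the ShuffledRDD that reduceByKey introduces, marking the stage boundary:

    import org.apache.spark.{SparkConf, SparkContext}

    object DependencyDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("DependencyDemo"))

        // Narrow dependency: map transforms each partition independently.
        val pairs = sc.parallelize(1 to 100).map(n => (n % 10, n))

        // Wide dependency: reduceByKey regroups the data by key across
        // partitions, which forces a shuffle (a new stage in the DAG).
        val sums = pairs.reduceByKey(_ + _)

        // The printed lineage shows a ShuffledRDD at the wide-dependency boundary.
        println(sums.toDebugString)
        sc.stop()
      }
    }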

Spark Streaming Introduction

Stream computing means processing data in real time, as it is generated. Spark itself is a batch-processing framework, so how does it implement streaming? Spark Streaming cuts the data into segments: a continuous data stream is discretized into many small, successive batches, and Spark then processes each batch in turn.
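A minimal sketch of this discretization, assuming a local socket source (for example, nc -lk 9999) and a 5-second batch interval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCount")
        val ssc  = new StreamingContext(conf, Seconds(5))  // batch interval

        // The stream is discretized into 5-second batches of lines.
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.flatMap(_.split("\\s+"))
             .map((_, 1))
             .reduceByKey(_ + _)
             .print()  // runs once per batch

        ssc.start()
        ssc.awaitTermination()
      }
    }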

Why Getui Chose Spark

1. Spark is better suited to iterative computation, which solved the bottleneck our team had previously run into when implementing iterative algorithms with Hadoop MapReduce.

2. Spark is one technology stack that can handle many types of data processing: batch, SQL, streaming, machine learning, and so on, which basically met our team's needs at the time.

3. Its API abstraction level is very high: with operators such as map, reduce, and groupBy you can implement data processing quickly and flexibly, greatly reducing development cost. In addition, Spark's multi-language support is very good; many data mining colleagues are familiar with Python while engineering colleagues are familiar with Java, so multi-language support lets both development and analysis people get started quickly.

4. In 2014 we were already using Hadoop YARN, and Spark can be deployed on YARN, so adopting Spark greatly reduced switching costs and let us take advantage of our existing Hadoop resources.

5. The Spark community is very active, which makes finding information very convenient.

Getui's Data Processing Architecture

It is a typical Lambda architecture, divided into three main layers: the blue boxes at the top of the figure are the offline batch processing, the layer below is the real-time data processing, and the middle layer stores and retrieves the result data.

Data is imported into HDFS in two ways: part of the data is written to Kafka by the business platform's log collection and then delivered to HDFS in near real time by LinkedIn's Camus (which we extended); the other part is imported into HDFS on a schedule by ops-side scripts.

For the offline processing part we still use both Hadoop MR and Spark. We did not abandon the original Hadoop MR jobs: many existing projects were built with MR and run very stably, so there was no need to redo them, and only some of the iterative tasks were reimplemented with Spark. In addition, Hive can be combined directly with Spark, so Hive tables can be queried through Spark SQL, as sketched below.
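A sketch of that Hive-plus-Spark-SQL combination, using the Spark 1.3-era HiveContext; the table and column names here are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveQuery {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HiveQuery"))
        val hiveContext = new HiveContext(sc)

        // Runs against the Hive metastore; "push_log" is a made-up table.
        val df = hiveContext.sql(
          "SELECT app_id, COUNT(*) AS cnt FROM push_log GROUP BY app_id")
        df.show()
        sc.stop()
      }
    }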

Deployment Status of Getui's Spark Cluster

Getui started with Spark 1.3.1, on blade servers: one chassis holds 16 blade machines, and a single blade has 192 GB of memory and 24 CPU cores. The official Spark documentation also recommends machines with fast network cards and large memory. After weighing requirements against cost, we chose blade machines to build the Spark cluster. The advantage of the blade chassis is that the blades are connected through the backplane, so transmission is fast and the relative cost is small. The deployment mode is Spark on YARN, which allows resource reuse.

Specific Uses of Spark in Getui's Business

1. For user profiling, model iteration, and some recommendation work we use MLlib directly; MLlib integrates many algorithms, which is very convenient (see the MLlib sketch after this list).

2. Getui has a BI toolbox that lets operations staff extract data. We implement it with Spark SQL plus Parquet-format wide tables. Parquet is a columnar storage format; with it you do not have to load the entire table, only the fields you care about, which greatly reduces IO consumption (see the Parquet sketch after this list).

3. Real-time statistical analysis: for example, one of Getui's products, the real-time heat map described in the next section, uses Spark Streaming for its real-time statistics.

4. We also use Spark for complex ETL tasks. For example, for the push reports we need to produce statistics across many dimensions every day. With Spark we cache the intermediate results and then compute the remaining dimensions from the cache, which greatly reduces I/O consumption and significantly speeds up the statistical processing.
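To illustrate point 1, here is a minimal MLlib sketch using the Spark 1.x RDD-based API; the input path, feature format, and parameters are assumptions for illustration, not Getui's actual pipeline:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object UserClustering {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("UserClustering"))

        // Each input line: space-separated numeric features for one user.
        val features = sc.textFile("user_features.txt")
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()  // reused across KMeans iterations

        val model = KMeans.train(features, k = 10, maxIterations = 20)
        model.clusterCenters.foreach(println)
        sc.stop()
      }
    }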
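And for point 2, a sketch of the Spark SQL plus Parquet wide-table pattern, written against the Spark 1.3-era API (parquetFile was later replaced by read.parquet); the path and column names are hypothetical. Because Parquet is columnar, only the selected columns are actually read from disk:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object BiQuery {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("BiQuery"))
        val sqlContext = new SQLContext(sc)

        // Load a Parquet wide table; only the two columns selected below
        // are read from disk, which is the point of columnar storage.
        val users = sqlContext.parquetFile("/data/warehouse/user_wide_table")
        users.select("app_id", "city")
             .groupBy("city")
             .count()
             .show()
        sc.stop()
      }
    }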

A Getui Spark Practice Case

The figure shows the processing architecture of Getui's heat map. On the left, the real-time location data of devices is obtained from the business platform; Spark Streaming computes the number of people in each geohash grid cell, pushes the statistical results to the business service layer in real time, and the results are then rendered on the client map, finally forming a real-time heat map. Spark Streaming is mainly used for the real-time statistical processing of the data.
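A sketch of the per-batch geohash counting step, assuming location events arrive through a receiver-based Kafka stream as "deviceId,geohash" strings; the topic name, ZooKeeper address, and message format are all assumptions for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object HeatmapStats {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("HeatmapStats")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // Receiver-based Kafka stream (Spark 1.x); all values are placeholders.
        val events = KafkaUtils.createStream(
          ssc, "zk1:2181", "heatmap-group", Map("locations" -> 2))

        // Each message value is "deviceId,geohash"; count devices per cell.
        val cellCounts = events.map(_._2.split(",")(1))
                               .map(cell => (cell, 1))
                               .reduceByKey(_ + _)

        // Each batch of counts would be pushed to the service layer;
        // print() stands in for that here.
        cellCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }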

Getui Teaches You to Avoid Those Development "Pits"

1. Data processing is often skewed, resulting in unbalanced load, so you need to do statistical analysis to find the characteristics of the skewed data and adjust the hash (partitioning) strategy accordingly.

2. Use Parquet columnar storage to reduce IO and improve Spark SQL efficiency.

3. On the real-time processing side: on the one hand, make sure the data source (Kafka) topic has enough partitions and that the data is hashed evenly across them, so that multiple Spark Streaming receivers can consume in parallel and the load is balanced. When using Spark Streaming, examine the DStream operations in the Spark history UI and optimize from there. On the other hand, we also built a real-time monitoring system to watch processing conditions such as inflow and outflow data rates; alarms from the monitoring system make it convenient to operate the Spark Streaming program. This small monitoring system is implemented mainly with InfluxDB plus Grafana.

4. Our test network often hit "cannot find third-party jar" errors. Anyone on CDH will generally run into this: starting with CDH 5.4, CDH's technical support staff said they removed the HBase and other jars, since they felt those jars should not be coupled into their own classpath. This can be fixed by adding the missing jars via spark.executor.extraClassPath.

5. Some newcomers confuse transformations and actions: they do not understand that a transformation is lazy and needs an action to trigger it, and that calling an action at two different points may have different effects (see the sketch after this list).

6. In our experience, an RDD that will be reused must be cached; the performance improvement is obvious (also shown in the sketch below).
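A sketch illustrating points 5 and 6 together, over made-up data: transformations do nothing until an action runs, and caching a reused RDD avoids recomputing its lineage on every subsequent action:

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyAndCache {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LazyAndCache"))

        // A transformation is lazy: this line does no distributed work yet.
        val doubled = sc.parallelize(1 to 1000000).map(_ * 2)

        // The first action triggers computation of the whole lineage.
        println(doubled.count())

        // doubled is reused below, so cache it; without cache() every
        // further action would recompute the map from scratch.
        doubled.cache()
        println(doubled.sum())  // computes again and populates the cache
        println(doubled.max())  // served from the cached partitions
        sc.stop()
      }
    }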
