Lesson 4: Spark Streaming's exactly-once transactions and non-repetitive output: complete mastery


This blog post is organized as follows:
One: Exactly-once transaction processing
Two: Non-duplicated output

One: Exactly-once transaction processing
1. What is transaction processing?
A) Data is processed, and processed exactly once. For example, in a bank transfer from A to B, the money leaves A exactly once.
b) Results are output, and output exactly once. Correspondingly, B receives the transfer exactly once.

2. Can transaction processing fail?

It is unlikely, because Spark Streaming processes the stream as batches, one per batch interval. The Spark application is allocated resources when it starts, and it can allocate resources dynamically as the computation proceeds.

3. The WAL (Write-Ahead Log) mechanism:

When data is received, it is first written to the file system through the WAL and then stored in memory or on disk by the executor. If the write does not succeed, however, the data is never stored in the executor, the executor therefore never reports it to the driver, and the data is never computed. So the WAL alone does not necessarily guarantee data safety.
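As a minimal sketch, the receiver WAL is switched on with a single configuration key (the application name, checkpoint path, and batch interval below are assumptions for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("WALDemo")
      // Persist received blocks to the write-ahead log before acknowledging them.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    // The WAL lives under the checkpoint directory, so one must be set.
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoint")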

4. The executor receives data one record at a time; the Receiver accumulates the data in memory and only writes it to the WAL, that is, to disk, once a certain amount has accumulated. But what if the Receiver crashes before enough data has accumulated?

5. The InputDStream is actually created on the driver side. The Receiver keeps receiving data, and to guarantee safety it keeps applying fault tolerance (writing the data to disk, keeping it in memory with replicas, or using the WAL).

StreamingContext: first, it receives the data; second, it generates the jobs.

6. If a crash occurs, how is it handled?

A) Driver-side recovery: the driver reads the data directly from the checkpoint file system. Internally this means restarting the SparkContext, rebuilding the StreamingContext from the checkpoint, restoring the metadata, producing the RDDs again, and submitting them to the Spark cluster again (see the sketch below).
b) Receiver recovery: the Receiver continues receiving data on top of the data already received; the data received earlier is recovered from disk through the WAL mechanism.
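A minimal sketch of driver-side recovery via checkpointing; the checkpoint path and batch interval are assumptions for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs://namenode:8020/spark/checkpoint"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("RecoverableApp")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir)
      // ... define the input DStreams and transformations here ...
      ssc
    }

    // After a driver crash, getOrCreate rebuilds the StreamingContext
    // (metadata, generated-but-unfinished jobs) from the checkpoint
    // instead of calling createContext again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()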

Data flows into the executor. The Receiver continuously receives data, and to guarantee the safety of that data it continuously applies fault tolerance; in practice that means writing the data to disk, keeping it in memory with a replica, or using the WAL.
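As a sketch of that replicated, fault-tolerant reception (the ZooKeeper address, consumer group, and topic are placeholders, and the StreamingContext ssc is assumed from the sketch above), a receiver-based Kafka input looks roughly like this:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Receiver-based input: received blocks are kept serialized in
    // memory/disk and replicated to two machines before being computed on.
    val lines = KafkaUtils.createStream(
      ssc,
      "zookeeper1:2181",        // ZooKeeper quorum (placeholder)
      "demo-consumer-group",    // consumer group ID (placeholder)
      Map("events" -> 1),       // topic -> number of receiver threads
      StorageLevel.MEMORY_AND_DISK_SER_2)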

Exactly-once transaction processing: data reception based on Kafka
1. Zero data loss: there must be reliable data sources and reliable Receivers, the entire application's metadata must be checkpointed, and data safety must be guaranteed through the WAL, covering both the received data and the metadata itself. In a real production environment the data source is generally Kafka; the Receiver receives data from Kafka with a default storage level of MEMORY_AND_DISK_SER_2. By default, the fault tolerance across two machines must complete before the actual computation begins. If the Receiver crashes after receiving data, no data is lost: if the default replication has not yet completed, the recovered Receiver simply receives the data again.
2. Spark Streaming 1.3, in order to avoid the WAL's performance loss and to achieve exactly-once semantics, provides the Kafka Direct API, which lets Kafka act as a file storage system!
3. Kafka is message middleware that can receive data dynamically, and Spark Streaming can operate on Kafka directly through the Direct API. Kafka then acts as a file storage system with the characteristics of both a stream and a file system. Kafka also keeps data around for a period of time, so the Direct API operates on the Kafka data by offset, which guarantees that no data is lost. Spark Streaming + Kafka therefore builds the perfect stream-processing setup:
(1. The data does not require an extra copy;
2. No WAL is required, so there is no WAL performance loss;
3. Kafka is much more efficient than HDFS, because Kafka works from memory and all executors consume the data directly through the Kafka API.)
How is reading the same data repeatedly avoided? By dealing with the offsets directly, as in the sketch below. Consumption is therefore not repeated, and the transaction semantics are achieved.
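A minimal sketch of the Direct API (the spark-streaming-kafka 0.8 integration); the broker list and topic are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("DirectKafkaDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder broker list and topic; no Receiver and no WAL are involved.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events")

    // Each RDD partition maps 1:1 to a Kafka partition plus an offset range,
    // so every batch covers an exact, replayable slice of the log.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()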

The driver's default fault-tolerance mechanism is checkpointing, generally to HDFS, because HDFS inherently keeps replicas. If the driver fails, the metadata can be read back from the checkpoint data.

Two: Non-duplicated output
Thinking:
Where might the data be lost?

  1. Data loss, and how to solve it specifically:
    A) If the driver suddenly crashes while the Receiver is receiving data and the driver is dispatching jobs to the executors to compute on it, the executors are killed and the data inside them is lost. To cope with this, all of the data must first go through fault-tolerant handling, for example by writing it through the WAL into HDFS; then, if the data in an executor is lost, it can be recovered through the WAL.
    b) Spark Streaming 1.3, to avoid the WAL's performance loss and to achieve exactly-once semantics, provides the Kafka Direct API, treating Kafka as a file storage system that combines the advantages of a stream and of a file system. Spark Streaming + Kafka thus makes up the perfect stream-processing world! First, the data does not require an extra copy; second, there is no WAL, so no WAL performance loss; third, Kafka is much more efficient than HDFS, because Kafka works from memory and all executors consume the data directly through the Kafka API. The offsets are therefore managed directly, so data is not consumed repeatedly. The data is thus guaranteed to be processed exactly once, and the transaction semantics are achieved.

  2. The data re-read scenario:
    A) Suppose the Receiver has received data and saved it to a persistence engine such as HDFS, but has not yet had time to update the offsets. If the Receiver then crashes and restarts, it reads the data again based on the metadata managed in Kafka's ZooKeeper. At this point Spark Streaming considers the data successfully handled, while Kafka considers it failed (because the offset was not updated into ZooKeeper), and the data ends up being consumed again.

  3. The drawback of the WAL approach: performance loss

    1. The drawback of the WAL approach is that it can greatly reduce the performance of Receivers receiving data in Spark Streaming. In practice the Receiver-based way of reading Kafka data is not used that often in enterprises; data is generally read from Kafka directly.
    2. If Kafka is used as the data source, the data already sits in Kafka, and the Receiver then stores another copy of it, which is a waste of storage. How is the re-read problem solved? Because the metadata can be accessed directly through ZooKeeper, the processed records can be written to an in-memory database at processing time, and each record can be checked against it: if it has already been processed, it is skipped (see the sketch below).
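A minimal sketch of that skip-if-processed check, assuming records carry a unique ID; the in-memory set below is only a stand-in for a real memory database:

    import org.apache.spark.streaming.dstream.DStream
    import scala.collection.mutable

    // Stand-in for the in-memory database; each executor JVM gets its own
    // copy of this singleton, so this only illustrates the idea. A real
    // deployment would check a shared external store (e.g. Redis) instead.
    object DedupStore {
      private val seen = mutable.Set[String]()
      // Returns true only the first time a given record ID is seen.
      def markIfNew(id: String): Boolean = seen.synchronized(seen.add(id))
    }

    // Process each (id, value) record at most once per store.
    def processOnce(stream: DStream[(String, String)]): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          records.foreach { case (id, value) =>
            if (DedupStore.markIfNew(id)) {
              // ... real output logic goes here ...
              println(s"processing $id -> $value")
            }
          }
        }
      }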

On Spark Streaming outputting data multiple times, and its solution:
1. Why does this problem exist? Because Spark Streaming computes on top of Spark Core, and Spark Core inherently does things that can cause Spark Streaming's results to be (partially) output repeatedly:

            Task retries; speculative execution of slow tasks; Stage retries; Job retries.
2. Concrete solutions:

Set the allowed number of task failures to 1: spark.task.maxFailures = 1.
Set spark.speculation to off (speculative execution of slow tasks is actually very performance-intensive, so turning it off can significantly improve Spark Streaming's processing performance).
When running Spark Streaming on Kafka, if a job fails, set auto.offset.reset to "largest", which resumes automatically (see the sketch below).
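A sketch of those settings; the application name and broker address are placeholders:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("ExactlyOnceOutput")
      // Fail the job on the first task failure instead of retrying the task.
      .set("spark.task.maxFailures", "1")
      // Disable speculative execution of slow tasks.
      .set("spark.speculation", "false")

    // Kafka 0.8 consumer parameter: with no valid stored offset,
    // resume from the latest offset.
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "auto.offset.reset" -> "largest")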

Spark Streaming inherits Spark Core's natural task retries and stage retries.
Finally, note that transform and foreachRDD can be used, with control logic based on the business code, to achieve non-repeated consumption and output of the data. These two methods are something like Spark Streaming's back doors: they allow control in any way you can think of.
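For example, a sketch of foreachRDD-based output control over the Direct API stream from the earlier sketch: each batch's Kafka offset ranges are read and committed together with the results, so a replayed batch can be detected. saveWithOffsets is a hypothetical helper standing in for whatever transactional store the application uses:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // Hypothetical helper: in one database transaction, insert the results
    // and upsert the offsets, so output for a replayed batch is rolled
    // back or skipped rather than duplicated.
    def saveWithOffsets(records: Iterator[(String, String)],
                        offsets: Array[OffsetRange]): Unit = {
      // ... application-specific transactional write ...
    }

    stream.foreachRDD { rdd =>
      // Offset ranges covered by this batch (Direct API only).
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { records =>
        saveWithOffsets(records, offsetRanges)
      }
    }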

Summary:

These course notes come from:

Lesson 4: Spark Streaming's exactly-once transactions and non-repetitive output: complete mastery
