StreamSets Multithreaded Pipelines

Source: Internet
Author: User

The following is taken from the official documentation:

Multithreaded Pipeline Overview

A multithreaded pipeline is a pipeline with an origin that supports parallel execution, enabling one pipeline to run in multiple threads.

Multithreaded pipelines enable processing high volumes of data in a single pipeline on one Data Collector, thus taking full advantage of all available CPUs on the Data Collector machine. When using multithreaded pipelines, make sure to allocate sufficient resources to the pipeline and Data Collector.

A multithreaded pipeline honors the configured delivery guarantee for the pipeline, but does not guarantee the order in which batches of data are processed.

How It Works

When you configure a multithreaded pipeline, you specify the number of threads that the origin should use to generate batches of data. You can also configure the maximum number of pipeline runners that Data Collector uses to perform pipeline processing.

A pipeline runner is a sourceless pipeline instance: an instance of the pipeline that includes all of the processors and destinations in the pipeline and represents all pipeline processing after the origin.

Origins perform multithreaded processing based on the origin systems they work with, but the following is true for all origins that generate multithreaded pipelines:

When you start the pipeline, the origin creates a number of threads based on the multithreaded property configured in the origin, and Data Collector creates a number of pipeline runners based on the pipeline Max Runners property to perform pipeline processing. Each thread connects to the origin system, creates a batch of data, and passes the batch to an available pipeline runner.
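
The hand-off described above can be sketched as a toy model in Python. This is not Data Collector code: the function name run_pipeline, the queue-based runner pool, and the upper-casing step that stands in for processors and destinations are all assumptions made for illustration.

```python
import queue
import threading


def run_pipeline(num_threads, max_runners, batches_per_thread):
    """Toy model: origin threads hand batches to a shared pool of runners."""
    # Pool of idle runner ids; get() blocks until a runner is available.
    runner_pool = queue.Queue()
    for runner_id in range(max_runners):
        runner_pool.put(runner_id)

    processed = []  # entries: (thread_id, batch_no, runner_id, result)
    lock = threading.Lock()

    def origin_thread(thread_id):
        for batch_no in range(batches_per_thread):
            # Each origin thread creates its own batch of records.
            batch = [f"t{thread_id}-b{batch_no}-r{i}" for i in range(3)]
            runner_id = runner_pool.get()  # wait for an available runner
            try:
                # Hypothetical stand-in for processors and destinations.
                result = [record.upper() for record in batch]
                with lock:
                    processed.append((thread_id, batch_no, runner_id, result))
            finally:
                runner_pool.put(runner_id)  # runner is available again

    threads = [
        threading.Thread(target=origin_thread, args=(i,))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


if __name__ == "__main__":
    results = run_pipeline(num_threads=4, max_runners=2, batches_per_thread=3)
    print(len(results), "batches processed")
```

With four origin threads and only two runners, a thread that finds no idle runner simply waits in runner_pool.get() until another batch finishes.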

Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property to specify the interval or to opt out of empty batch generation.
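
The idle behavior can be illustrated with a small Python sketch. It is only a model of the idea, not the Data Collector implementation: run_runner, idle_seconds, and the max_empty_batches cap (added so the demo terminates) are hypothetical names, and using None to mean "opt out" is an assumption of this sketch.

```python
import queue


def run_runner(inbox, idle_seconds, max_empty_batches):
    """Handle batches from inbox, emitting an empty batch after each idle interval.

    idle_seconds=None would block forever on an empty inbox, which models
    opting out of empty batch generation (an assumption of this sketch).
    """
    handled = []
    empty_batches = 0
    while empty_batches < max_empty_batches:
        try:
            batch = inbox.get(timeout=idle_seconds)
        except queue.Empty:
            # No data arrived within the idle interval: generate an empty batch.
            batch = []
            empty_batches += 1
        handled.append(batch)
    return handled


if __name__ == "__main__":
    inbox = queue.Queue()
    inbox.put(["a", "b"])
    inbox.put(["c"])
    print(run_runner(inbox, idle_seconds=0.01, max_empty_batches=2))
    # prints [['a', 'b'], ['c'], [], []]
```

The two real batches are handled first; once the inbox stays empty for the idle interval, the runner produces empty batches instead of waiting silently.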

Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline instances, the order in which batches are written to destinations is not ensured.
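
A short Python sketch makes the two ordering guarantees concrete. It is illustrative only: write_batches, the per-batch delays, and the in-memory destination list are assumptions, with the delays standing in for runners that finish at different times.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def write_batches(batches, delays):
    """Process each batch on its own worker and record the arrival order."""
    destination = []  # entries: (batch_id, records) in write order
    lock = threading.Lock()

    def runner(batch_id, records, delay):
        time.sleep(delay)  # simulated processing time for this batch
        with lock:
            # A batch is written as one unit, so record order inside it is kept.
            destination.append((batch_id, list(records)))

    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        for batch_id, (records, delay) in enumerate(zip(batches, delays)):
            pool.submit(runner, batch_id, records, delay)
    # Leaving the with-block waits for every submitted batch to finish.
    return destination


if __name__ == "__main__":
    batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for batch_id, records in write_batches(batches, [0.05, 0.0, 0.02]):
        print(batch_id, records)
```

Every batch arrives with its records in the original order, but the sequence of batch ids printed may differ from 0, 1, 2 because faster batches can be written first.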

For example, take a multithreaded pipeline that uses the HTTP Server origin, which processes HTTP POST and PUT requests passed from HTTP clients. When you configure the origin, you specify the number of threads to use; in this case, with the Max Concurrent Requests property.

Let's say you configure the pipeline to opt out of the Max Runners property. When you do this, Data Collector generates a matching number of pipeline runners for the number of threads.
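
The relationship between threads and runners can be written as a small Python helper. This is a sketch under stated assumptions, not documented behavior: the name effective_runner_count is hypothetical, treating 0 as the opt-out value is an assumption, and so is capping the runner count at the thread count when a limit is set.

```python
def effective_runner_count(origin_threads: int, max_runners: int = 0) -> int:
    """Return how many pipeline runners this toy model would create.

    Assumptions of this sketch: max_runners == 0 means "opted out", and a
    nonzero limit never yields more runners than there are origin threads.
    """
    if origin_threads < 1:
        raise ValueError("origin_threads must be at least 1")
    if max_runners < 0:
        raise ValueError("max_runners must be 0 or greater")
    if max_runners == 0:
        # Opted out: one runner per origin thread, as described above.
        return origin_threads
    return min(origin_threads, max_runners)


if __name__ == "__main__":
    print(effective_runner_count(5))     # opted out of the limit
    print(effective_runner_count(5, 2))  # limit of two runners
```

Under these assumptions, five origin threads with the limit opted out give five runners, which matches the scenario in this example.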

With Max Concurrent Requests set to 5, when you start the pipeline the origin creates five threads and Data Collector creates five pipeline runners. Upon receiving data, the origin passes a batch to each of the pipeline runners for processing.

Conceptually, the multithreaded pipeline consists of the HTTP Server origin running five threads, which pass batches to five pipeline runners.

Each pipeline runner performs the processing associated with the rest of the pipeline. After a batch is written to pipeline destinations (in this case, Azure Data Lake Store 1 and 2), the pipeline runner becomes available for another batch of data. Each batch is processed and written as quickly as possible, independently from batches processed by other pipeline runners, so the write order of the batches can differ from the read order.

At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.
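
The "up to five batches at a time" limit can be checked with a rough Python simulation. Nothing here comes from Data Collector itself: peak_concurrency, the thread pool standing in for the five pipeline runners, and the sleep that simulates processing are all assumptions of this sketch.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(num_runners, num_batches, work_seconds=0.01):
    """Return (peak batches in flight, batches completed) for a runner pool."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0, "done": 0}

    def process(batch_id):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(work_seconds)  # simulated processing and writing
        with lock:
            state["active"] -= 1
            state["done"] += 1

    # The pool size plays the role of the number of pipeline runners.
    with ThreadPoolExecutor(max_workers=num_runners) as runners:
        list(runners.map(process, range(num_batches)))
    return state["peak"], state["done"]


if __name__ == "__main__":
    peak, done = peak_concurrency(num_runners=5, num_batches=20)
    print(f"peak batches in flight: {peak}, batches completed: {done}")
```

However many batches are queued, the peak never exceeds the number of runners, and idle workers pick up new batches as soon as they arrive.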
