One, Control Flow
In a control flow, the task is the smallest unit of work, and tasks are synchronized by precedence constraints.
1, Control Flow cannot pass data between components; it executes tasks serially or in parallel and acts as the dispatcher of tasks.
If no precedence constraint is set between two tasks, the two tasks execute concurrently. When designing a package, maximizing task concurrency takes full advantage of server resources and helps reduce ETL execution time.
2, Control Flow is synchronized by precedence constraints
A precedence constraint is set between two tasks: the downstream task does not start executing until the upstream task has finished.
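This scheduling behavior can be pictured with a small sketch. The following Python snippet is only a conceptual model, not SSIS code or the SSIS API; the task names and the scheduler loop are made up for illustration. It shows how tasks with no precedence constraint between them run concurrently, while a constrained task waits for its upstream task to finish.

```python
# Conceptual sketch only (not the SSIS API): tasks joined by a precedence
# constraint run in order, while unrelated tasks run concurrently.
from concurrent.futures import ThreadPoolExecutor
import time

def run_package(tasks, constraints):
    """tasks: {name: callable}; constraints: {task: set of upstream task names}."""
    done, running = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            # A task becomes ready once every upstream task it depends on has finished.
            ready = [t for t in tasks
                     if t not in done and t not in running
                     and constraints.get(t, set()) <= done]
            for t in ready:
                running[t] = pool.submit(tasks[t])
            for t in [t for t, f in running.items() if f.done()]:
                running.pop(t).result()
                done.add(t)
            time.sleep(0.05)

# Hypothetical task names, used only for illustration.
tasks = {
    "TruncateStage": lambda: time.sleep(0.1),
    "LoadDim":       lambda: time.sleep(0.2),   # no constraint between LoadDim and
    "LoadFact":      lambda: time.sleep(0.2),   # LoadFact, so they run concurrently
}
constraints = {"LoadDim": {"TruncateStage"}, "LoadFact": {"TruncateStage"}}
run_package(tasks, constraints)
```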
Two, Data Flow
Data Flow consists of three kinds of components: sources, transformations, and destinations, which are used to extract data (sources), transform data, and load data (destinations).
The data flow has the characteristics of a stream, and data is processed concurrently as it flows. Data is not loaded all at once; it is split into multiple non-overlapping parts that form a stream flowing from upstream components to downstream components. While the data flows, all components in the stream work at the same time: after an upstream component finishes processing a batch of data, it hands that batch to the downstream component and continues processing the next batch. Data flows between components, and each component processes a different part of the data at the same time, until all of the data has been processed.
The data flow also has a feedback and auto-tuning mechanism: if a downstream component comes under pressure because its processing speed cannot keep up, SSIS applies back pressure to the upstream components and automatically slows the upstream data flow rate.
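As a rough model of this behavior (not of the SSIS engine's internals), a bounded queue between two threads shows how a slow downstream component blocks the upstream one, which is essentially what back pressure does. The batch size and buffer count here are arbitrary illustrative values.

```python
# A minimal model (not SSIS internals) of pipelined buffers with back pressure:
# a bounded queue blocks the upstream producer when the downstream consumer
# falls behind, which slows the upstream flow automatically.
import queue, threading, time

BUFFER_COUNT = 5                           # at most 5 in-flight buffers between two components
pipe = queue.Queue(maxsize=BUFFER_COUNT)

def source(rows, batch_size=1000):
    for i in range(0, len(rows), batch_size):
        pipe.put(rows[i:i + batch_size])   # blocks when the pipe is full (back pressure)
    pipe.put(None)                         # end-of-rowset marker

def destination():
    while (buf := pipe.get()) is not None:
        time.sleep(0.01)                   # simulate a slow destination
        print(f"loaded {len(buf)} rows")

rows = list(range(10_000))
t1 = threading.Thread(target=source, args=(rows,))
t2 = threading.Thread(target=destination)
t1.start(); t2.start(); t1.join(); t2.join()
```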
1, Memory buffer structure
The data flow uses memory to temporarily store data from the data source: as rows are extracted from the source into the SSIS engine, they are stored in pre-allocated memory buffers whose size is based on the row width (the number of bytes of all columns in a row). Set the DefaultBufferMaxRows property to adjust the number of rows per buffer, or set the DefaultBufferSize property to adjust the buffer size directly.
Properties of the Data Flow task
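As a rough illustration of how these two properties interact (the real engine applies its own sizing rules, so treat this only as a sketch), the number of rows that fit in one buffer is limited both by DefaultBufferSize divided by the row width and by DefaultBufferMaxRows. The default values used below are assumptions taken from the usual SSIS defaults.

```python
# Rough arithmetic sketch only: estimate how many rows fit in one buffer.
def estimated_rows_per_buffer(row_width_bytes,
                              default_buffer_size=10 * 1024 * 1024,  # 10 MB (assumed default)
                              default_buffer_max_rows=10_000):       # assumed default
    rows_that_fit = default_buffer_size // row_width_bytes
    return min(rows_that_fit, default_buffer_max_rows)

print(estimated_rows_per_buffer(200))    # narrow row: capped at DefaultBufferMaxRows -> 10000
print(estimated_rows_per_buffer(5000))   # wide row: limited by DefaultBufferSize -> 2097
```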
The SSIS engine allocates a set of buffers based on the server's resources and workload, and each buffer stores a distinct subset of the full data set. Behind the scenes, the SSIS engine transforms the data stream in an efficient way: it applies each transformation component's business logic to the same buffer in place, which is more efficient than copying the changed data to another buffer and then applying the transformation logic. In some cases, however, the SSIS engine has to copy buffers, or even hold up the data stream, in order to transform the entire data set, for example for aggregation and sorting.
2, Blocking types of transformations: non-blocking (streaming), semi-blocking, and blocking
- Non-blocking: data is passed through the pipeline to the downstream component immediately
- Semi-blocking: data accumulates to a certain number of rows before being passed downstream
- Blocking: all of the data must be received before anything is passed downstream
Most transformations are streaming, which means that applying the transformation logic to a row does not prevent that row from moving on to the next transformation.
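The three behaviors can be sketched with plain Python generators. This is only an analogy, not SSIS code, and the component names in the comments are just typical examples of each category.

```python
# Generator sketch (conceptual, not SSIS code) of the three blocking behaviors.
def non_blocking(rows):
    # e.g. Derived Column: each row is passed on as soon as it is transformed
    for r in rows:
        yield {**r, "full_name": f"{r['first']} {r['last']}"}

def semi_blocking(rows, batch_size=3):
    # e.g. Merge-style behavior: rows accumulate into small groups before moving on
    batch = []
    for r in rows:
        batch.append(r)
        if len(batch) == batch_size:
            yield from batch
            batch = []
    yield from batch

def blocking(rows):
    # e.g. Sort / Aggregate: nothing is emitted until every input row has arrived
    yield from sorted(rows, key=lambda r: r["last"])

rows = [{"first": "A", "last": "Zhang"}, {"first": "B", "last": "Li"}]
print(list(non_blocking(rows)))
```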
3, Synchronous and asynchronous transformation outputs
Every transformation has inputs and outputs. If the buffer used by the input is not the same as the buffer used by the output, the transformation output is asynchronous; in other words, some transformations cannot perform the required operation and keep the buffer unchanged (the same number and order of rows) at the same time, so the data must be copied to a new buffer to achieve the desired result.
In the advanced editor of a transformation component, every column has a LineageID property, which is a pointer into a buffer that identifies the column's position in that buffer.
Specifies the lineage ID of the output column when this item is first placed in the data flow.
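A small conceptual model may make the distinction concrete. This is not the SSIS object model; the Buffer class and the ID numbering are assumptions. A synchronous transformation works on the same buffer and keeps the existing LineageIDs, while an asynchronous transformation builds a new output buffer whose columns get new LineageIDs.

```python
# Conceptual model (not the SSIS object model) of LineageID and buffer reuse.
import itertools
_next_lineage = itertools.count(100)

class Buffer:
    def __init__(self, column_names):
        # Each column gets a LineageID that identifies its slot in this buffer.
        self.lineage = {name: next(_next_lineage) for name in column_names}
        self.rows = []

def synchronous_transform(buf, row_fn):
    # Works in place on the same buffer: no copy, LineageIDs unchanged.
    buf.rows = [row_fn(r) for r in buf.rows]
    return buf

def asynchronous_transform(buf, rowset_fn):
    # Builds a new output buffer: columns get new LineageIDs.
    out = Buffer(buf.lineage.keys())
    out.rows = rowset_fn(buf.rows)
    return out

src = Buffer(["ID", "Name"])
src.rows = [{"ID": 2, "Name": "b"}, {"ID": 1, "Name": "a"}]
same = synchronous_transform(src, lambda r: {**r, "Name": r["Name"].upper()})
print(same is src, same.lineage)    # True  -> same buffer, same LineageIDs
new = asynchronous_transform(src, lambda rows: sorted(rows, key=lambda r: r["ID"]))
print(new is src, new.lineage)      # False -> new buffer, new LineageIDs
```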
Three, Asynchronous transformation output example
An asynchronous transformation output means that the transformation's input buffer and output buffer are different.
1, Data Flow
2, Data Source Component
The Data Source has external columns and output columns.
The external columns come directly from the data source. The component creates the buffer, copies the data into it, and assigns each column a LineageID; this buffer is the output of the data source component.
3, Sort Component
In the Sort component's input columns, the ID column's LineageID is 197, the same as the LineageID of the ID column in the data source's output buffer, which means that the Sort transformation uses the data source's output buffer directly.
In the Sort transformation's output, the ID column's LineageID is 288. The same column has different LineageID values because the output of the Sort transformation is asynchronous: the input buffer differs from the output buffer, so the output column requires a new identifier.
The SortColumnId property value is 197, which identifies the source column of the output column: the SortColumnId property specifies the lineage identifier of the input column that is associated with this output column.
All semi-blocking and blocking transformations have asynchronous outputs; they do not pass the input buffers directly downstream, because the data must be held back for processing and reassembly.
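A tiny sketch of the mapping described above, using the LineageID values from this example (the dictionaries are only an illustration, not the SSIS object model):

```python
# The Sort output creates a new ID column with LineageID 288; its SortColumnId
# property (197) points back to the input column it came from.
sort_input_columns  = [{"name": "ID", "LineageID": 197}]
sort_output_columns = [{"name": "ID", "LineageID": 288, "SortColumnId": 197}]

def source_column(output_col, input_cols):
    # Resolve an output column to the input column identified by SortColumnId.
    return next(c for c in input_cols if c["LineageID"] == output_col["SortColumnId"])

print(source_column(sort_output_columns[0], sort_input_columns))
# -> {'name': 'ID', 'LineageID': 197}
```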
Four, Synchronous transformation output example
The IdentificationString property is used to identify a component's data flow (an input or output); its value is a string in a specific format, for example: Conditional Split.Inputs[Conditional Split Input]
The SynchronousInputID property identifies the IdentificationString of the input whose buffer a component's output shares. If the SynchronousInputID property is None, the component creates a new output buffer and the transformation output is asynchronous; if the SynchronousInputID property is an IdentificationString, the component's output and input use the same buffer.
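A minimal sketch of this rule, illustrative only: in the SSIS object model the property is exposed slightly differently, so the accepted "empty" values below are assumptions about how it may appear in the advanced editor.

```python
# Classify an output as synchronous or asynchronous from its SynchronousInputID.
def is_asynchronous_output(synchronous_input_id):
    # None (or an empty value) -> new buffer, asynchronous output;
    # otherwise the value names the input whose buffer the output shares.
    return synchronous_input_id in (None, "None", 0, "")

print(is_asynchronous_output(None))   # True  -> asynchronous output
print(is_asynchronous_output(
    "Conditional Split.Inputs[Conditional Split Input]"))  # False -> shares the input buffer
```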
1, Data Flow
2, Conditional Split Component
View the output of the Conditional Split component and notice that Case 1 has no output columns.
Click Case 1 and view its SynchronousInputID property; the value is Conditional Split.Inputs[Conditional Split Input].
View the Conditional Split component's input stream and the value of its IdentificationString property: it is the same as the SynchronousInputID property value of the component's Case 1 output stream.
3, Verify the LineageID values
Looking at the Sort component's input columns, the ID column's LineageID is 197, the same as in the output streams of the Data Source component and the Conditional Split, which shows that Conditional Split did not create a new data buffer: the transformation uses the buffer created by the data source.
Conclusion: with a synchronous transformation output, as soon as the transformation logic completes, the buffer is passed straight to the downstream transformation; that is, the transformation's input and output use the same buffer, which avoids copying the buffer to the output, so the LineageID of the same column stays the same.
Five, Identifying whether a transformation output is asynchronous
In the Advanced Editor, view the output's SynchronousInputID property, which specifies the ID of the input that supplies the rows for this output.
If the SynchronousInputID property value is None, the component's output is asynchronous; if the value is not None but a string in IdentificationString format, the transformation's output is synchronous.
Six, Allocation and release of data flow buffers
The Data Source component creates a new buffer and assigns LineageID values to the corresponding columns. When the buffer's data has been loaded into the destination, the destination releases the buffer.
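A minimal sketch of this lifecycle, assuming a hypothetical DataFlowBuffer class (not an SSIS type): the source allocates the buffer and assigns LineageIDs, and the destination releases it after loading.

```python
# Buffer lifecycle sketch: created by the source, released by the destination.
class DataFlowBuffer:
    live_buffers = 0
    def __init__(self, columns, next_id):
        DataFlowBuffer.live_buffers += 1
        self.lineage = {c: next_id + i for i, c in enumerate(columns)}
        self.rows = []
    def release(self):
        DataFlowBuffer.live_buffers -= 1

def destination_load(buf):
    # ... load buf.rows into the target, then hand the memory back
    buf.release()

buf = DataFlowBuffer(["ID", "Name"], next_id=100)   # created by the source
destination_load(buf)                               # released by the destination
print(DataFlowBuffer.live_buffers)                  # 0 -> all buffers returned
```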