Error outputs can be used to improve reliability, and they can improve scalability as well. In reliability terms, they let you redirect error rows: with proper configuration, you can divert bad rows out of the main data flow and send them to a downstream path, where they are specially processed or cleaned and then returned to the main data flow. They can be merged back explicitly, for example with a Union All transformation, or implicitly. Error rows are rarely simply discarded; in most cases they are logged or handled in a later step.
If some values are lost during data extraction but are still needed, the error output can help solve the problem. If the missing values can be obtained from another data source, use a Lookup transformation to retrieve them; if they cannot be obtained anywhere else, fall back to a default value.
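The lookup-then-default strategy can be sketched in plain Python. The reference table, column names, and default value below are illustrative assumptions, not part of the original example:

```python
# Sketch of the "lookup, then default" repair strategy described above.
# REFERENCE maps a description to its code; both it and DEFAULT_CODE are
# illustrative assumptions, not values from the article.

REFERENCE = {"Cardiology": "CARD01", "Oncology": "ONC01"}
DEFAULT_CODE = "UNKNOWN"

def repair_row(row):
    """Fill a missing code from a reference lookup, else use a default."""
    if row.get("specialty_code") is None:
        # Try to recover the code from another source (the lookup).
        code = REFERENCE.get(row.get("description"))
        # Fall back to a default value when no other source can supply it.
        row["specialty_code"] = code if code is not None else DEFAULT_CODE
    return row

rows = [
    {"description": "Cardiology", "specialty_code": None},
    {"description": "Dermatology", "specialty_code": None},
    {"description": "Oncology", "specialty_code": "ONC01"},
]
repaired = [repair_row(r) for r in rows]
```

The first row is recovered from the lookup, the second falls through to the default, and the third already had a value and passes through untouched.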
In other scenarios, the data may fall outside the processing range or the range of the target column. If a row fails an integrity check, you may be able to derive new, valid values from dependency constraints so that the row can be processed normally. If the data type conflicts with the target, you can simply truncate the value, or apply an extra piece of logic, such as converting a timestamp into a specific format.
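These two repairs, truncation and format conversion, can be sketched as follows; the column width and the date formats are illustrative assumptions:

```python
# Sketch of the range/type repairs described above.
# MAX_LEN and the date formats are assumed for illustration.
from datetime import datetime

MAX_LEN = 10  # assumed width of the destination column

def fit_text(value):
    """Truncate a string that exceeds the destination column width."""
    return value[:MAX_LEN]

def fit_timestamp(value):
    """Convert an incoming timestamp string to the format the target expects."""
    parsed = datetime.strptime(value, "%d/%m/%Y %H:%M")
    return parsed.strftime("%Y-%m-%d")

print(fit_text("Gastroenterology"))       # truncated to the column width
print(fit_timestamp("03/11/2024 17:45"))  # reformatted for the target
```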
Whenever such assumptions and repairs are applied to the data, mark or document them properly, so that there is a reference when new requirements arrive in the future, or when other users need to understand the package.
All of the scenarios above focus on repairing dirty data, keeping the data flow running continuously, and recovering from errors as far as possible. Compared with DTS, this is a new concept: error rows can be handled in real time while the data is being processed.
Another question: why not transform the data directly in the main data flow and avoid the errors altogether? Then every data flow would have a single direction with no branches, which looks simpler; and in fact, the simpler the data flow, the higher its reliability and scalability.
1. In this example, values in the SpecialtyCode and ConsultantCode columns are missing. The data source also carries a Description column, so a Lookup can retrieve the lost values. First, a Conditional Split transformation isolates the rows whose values are missing (NULL); once Lookups have resolved those NULLs, a Union All transformation merges the rows back into the main data flow. Both the SpecialtyCode and ConsultantCode columns are handled this way, ensuring that every row written to the destination is valid. This is an effective way to deal with such problems.
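A minimal Python sketch of this Figure 1 pattern, Conditional Split, then Lookup, then Union All; the column names and reference table are illustrative assumptions:

```python
# Sketch of the Figure 1 pattern: split on NULLs, look up, then re-merge.
# REFERENCE maps a description to (specialty_code, consultant_code);
# it and the column names are illustrative, not from the article.

REFERENCE = {"Cardiology": ("SP01", "CN01"), "Oncology": ("SP02", "CN02")}

def process(rows):
    good, needs_lookup = [], []
    # Conditional Split: route rows with a missing code to the repair path.
    for row in rows:
        if row["specialty_code"] is None or row["consultant_code"] is None:
            needs_lookup.append(row)
        else:
            good.append(row)
    # Lookup: resolve the missing codes from the Description value.
    for row in needs_lookup:
        sp, cn = REFERENCE[row["description"]]
        if row["specialty_code"] is None:
            row["specialty_code"] = sp
        if row["consultant_code"] is None:
            row["consultant_code"] = cn
    # Union All: merge the repaired rows back into the main flow.
    return good + needs_lookup
```

Rows that already carry both codes skip the repair path entirely; only the NULL rows pay for the lookup.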
Figure 1
2. Figure 2 shows an alternative data flow design. In the first design, you can ignore whether SpecialtyCode and ConsultantCode have problems and simply run every row through the Lookups. That may look laborious, but each step is simpler, and it ran 2% faster than the design in Figure 2. This result was obtained with test data at a 1/3 defect rate, that is, one row in three was missing its SpecialtyCode or ConsultantCode value. At a 1/6 defect rate, the two designs performed the same.
Figure 2
The second design assumes that the data has no defects and therefore writes it directly to the target. Rows that fail are redirected to the error output (the OLE DB Destination error output) and then repaired with two Lookups. The difference between the two designs is whether you process only the problematic rows or all rows. With a test file at a 1/3 defect rate, handling only the problematic rows was 20% faster; at a 1/6 defect rate, the advantage dropped to only 10%.
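The Figure 2 pattern can also be sketched in Python. The stand-in destination, reference table, and column names below are assumptions for illustration; in SSIS the rejection would come from the destination itself rather than an explicit check:

```python
# Sketch of the Figure 2 pattern: assume rows are clean, write them directly,
# and repair only the rows the destination rejects (the error output).
# REFERENCE and the write() check are illustrative stand-ins.

REFERENCE = {"Cardiology": ("SP01", "CN01"), "Oncology": ("SP02", "CN02")}

def write(row, destination):
    """Stand-in for the OLE DB destination: reject rows with missing codes."""
    if row["specialty_code"] is None or row["consultant_code"] is None:
        raise ValueError("constraint violation")
    destination.append(row)

def load(rows):
    destination, error_rows = [], []
    # Main path: write every row directly, assuming it is clean.
    for row in rows:
        try:
            write(row, destination)
        except ValueError:
            error_rows.append(row)  # redirected to the error output
    # Error path: two lookups repair only the rejected rows.
    for row in error_rows:
        sp, cn = REFERENCE[row["description"]]
        row["specialty_code"] = row["specialty_code"] or sp
        row["consultant_code"] = row["consultant_code"] or cn
        write(row, destination)
    return destination
```

Clean rows never touch the lookup path at all, which is exactly why this design wins when defects are rare and loses ground as the defect rate climbs.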
The example shows that the choice of design should be driven by the defect rate of the test file or test data.
The performance of the correction itself must also be taken into account. In the example above, Lookup is a very expensive transformation. The test data and the lookup reference table contain only six distinct records, which limits the impact on the overall test performance. With more distinct values in the reference table, the impact on performance grows: more data requires a larger cache and consumes more resources.
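As a rough illustration of why more distinct reference values cost more: a fully cached lookup must hold every distinct key in memory, so its footprint grows with the reference table. The key names and sizes below are illustrative:

```python
# A fully cached lookup loads the whole reference table into memory up front,
# so memory use grows with the number of distinct reference values.
# The key names and row counts here are illustrative.
import sys

small_reference = {f"desc{i}": f"CODE{i}" for i in range(6)}       # 6 distinct rows
large_reference = {f"desc{i}": f"CODE{i}" for i in range(60_000)}  # many distinct rows

footprint_grows = sys.getsizeof(large_reference) > sys.getsizeof(small_reference)
```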
The stricter the data validation, the more defective rows there are to handle. When defective rows are rare or the validation itself is expensive, keep the check as simple as possible and use the error output to deal with the defective rows.
The total number of rows also affects the design of the data flow, because any advantage or disadvantage is amplified across a large volume of data. When the data set is small, the resources consumed are small, yet every transformation still carries overhead during execution, so a more complex design may not be worth the effort for small data sets.
This technique of using error outputs and correcting data back into the main data flow should not be applied without restraint. Wherever possible, follow the approach shown in Figure 2: make sure as few rows as possible take the expensive paths, while most rows stay in the main flow.
Finally, do not dismiss error outputs. Developers often prefer complex, advanced transformations, but remember that an error output is the easiest way to handle bad rows without affecting the other transformations in the package.
The performance improvements mentioned here are only a guide. They illustrate different design approaches and should not be treated as dogma unless you face identical scenarios, identical data, and the same problem. The one principle that never changes is to run comparative tests and then settle on the better design.