SSIS Learning (2): Data Flow Task (I)


The Data Flow task is the core task in SSIS. It is safe to say that most ETL packages cannot do without data flow tasks, so that is where we begin our study.

A Data Flow task consists of three types of data flow components: sources, transformations, and destinations. Specifically:

Source: a set of data stores, including tables and views in relational databases, files (flat files, Excel files, XML files, and so on), and datasets in system memory.

Transformation: the core component of the Data Flow task. If the Data Flow task is the heart of ETL, then the transformations inside it are the heart of the Data Flow task. SSIS provides a rich set of transformation components, such as data conversion, aggregation, merge, multicast, sort, and lookup; roughly speaking, they cover most of what can be done with SQL statements.

Destination: the counterpart of the source, and likewise a set of data stores, including tables, views, files, multidimensional datasets (cubes), and in-memory recordsets.

Besides these three component types there is a fourth element, the path, which visually represents the route data takes from the source, through the transformations, to the destination. Paths also let us view the data in transit and attach annotations.

The following figure shows the relationships between sources, transformations, destinations, and paths.

The following walkthrough shows how to develop a data flow task by importing IIS logs into a database.

Before starting development, let's take a look at the structure of an IIS log.

It records essentially all web page browsing information, such as the date, time, client IP address, server IP address, page address, page parameters, and so on. Based on this information, create a table in a relational database to hold the records.
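For reference, here is a hypothetical excerpt of what such a log might look like, assuming the common W3C extended format (the exact field list and field order depend on how logging is configured on the server):

#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2009-01-01 00:00:00
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status cs(User-Agent)
2009-01-01 00:00:01 192.168.1.10 - 192.168.1.1 80 GET /Default.aspx id=1 200 Mozilla/4.0

Note that the hyphens in the W3C field names become underscores in the table's column names, since hyphens are not valid in unquoted SQL identifiers.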

Code

CREATE TABLE [dbo].[iislog] (
    [c_date]          [datetime]     NULL,
    [c_time]          [varchar](10)  NULL,
    [c_ip]            [varchar](20)  NULL,
    [cs_username]     [varchar](20)  NULL,
    [s_ip]            [varchar](20)  NULL,
    [s_computername]  [varchar](30)  NULL,
    [s_port]          [varchar](10)  NULL,
    [cs_method]       [varchar](10)  NULL,
    [cs_uri_stem]     [varchar](500) NULL,
    [cs_uri_query]    [varchar](500) NULL,
    [sc_status]       [varchar](20)  NULL,
    [sc_substatus]    [varchar](20)  NULL,
    [sc_win32_status] [varchar](20)  NULL,
    [sc_bytes]        [int]          NULL,
    [cs_bytes]        [int]          NULL,
    [time_taken]      [varchar](10)  NULL,
    [cs_version]      [varchar](20)  NULL,
    [cs_host]         [varchar](20)  NULL,
    [cs_user_agent]   [varchar](500) NULL,
    [cs_referer]      [varchar](500) NULL
) ON [PRIMARY]

Everything is ready, so the ETL development journey can begin. Open Visual Studio 2008, choose [File] --> [New] --> [Project], and select "Integration Services Project"; the ETL development interface jumps into view, and anyone who has done .NET development will find it familiar. Open the "Toolbox" on the left and drag a "Data Flow Task" onto the "Control Flow" panel in the main window:

Double-click "Data Flow task" on the "control flow" Panel to go to the "data flow" Panel. There is no difference between the two UIS, but the functions are different. Real Data Flow task development starts from now on.

Open the "toolbox" on the left and you can see three major parts: Data Pipeline, data stream conversion, and data stream target. Drag "flat file source" from "Data Sources" to the main window, double-click to open the "flat file source" editor, and click "new" to open the flat file connection management editor,

 

Enter a connection name, select the iislog file, and choose the row delimiter and column delimiter; the data can then be inspected in the preview window.

Note that different flat files use different row and column delimiters. If you do not choose the correct ones, you will not get the expected results; all the data may end up squeezed into a single column. Row delimiters are usually simple: lines are almost always separated by a line break ({CR}{LF}). Column delimiters vary much more. Columns can be separated by almost any text character, such as a comma (,), semicolon (;), colon (:), tab, or vertical bar (|), or even by ordinary letters and digits; alternatively, each column can be given a fixed width. The right choice depends on the file source.
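For the iislog file in this walkthrough, assuming the W3C-format excerpt shown earlier, the connection settings would be roughly:

Row delimiter:       {CR}{LF}
Column delimiter:    Space
Header rows to skip: 4    (the # directive lines at the top of the file)

Keep in mind that IIS may append fresh # directive lines in the middle of a log file when the service restarts; such rows will fail to parse as data and will be dealt with by the error handling configured below.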

On the "Advanced" page of the Flat File Connection Manager, define each column's name, data type, and character length. When the definitions are complete, click "OK" to return to the Flat File Source Editor; the connection created a moment ago appears automatically in the "Flat file connection manager" drop-down list. Then select the columns to be output.

Then select "error output". The default options are shown in:

This option is very important and requires us to configure how to handle errors in the source data. Generally, there are two possible errors in the source data: Data Type errors, for example, if the date format is incorrect or the number is changed, the character length is too long to exceed the column width. According to different situations, the processing methods are different. The system provides three solutions:

  Ignore failure: if a row contains an error, the error is ignored and execution continues, so the remaining data is still imported.

  Redirect row: the failing rows are diverted to the error output and can be sent to another destination for manual inspection and reprocessing.

  Fail component: the strictest option; as soon as a data error occurs, the component fails and execution stops.

For source files such as IIS logs, a few bad rows are common, and such a small amount of bad data will not affect the final result. To build in fault tolerance and keep the data quality requirements simple, we select "Ignore failure" so the package keeps running.
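If you choose "Redirect row" instead, you need a place to land the bad rows. Below is a minimal sketch of a staging table for them; the table name and the raw_line column are hypothetical choices of mine, while ErrorCode and ErrorColumn correspond to the two columns SSIS automatically adds to every error output:

CREATE TABLE [dbo].[iislog_errors] (
    [ErrorCode]   [int]           NULL,  -- error code added by SSIS to the error output
    [ErrorColumn] [int]           NULL,  -- internal ID of the failing column, also added by SSIS
    [raw_line]    [varchar](2000) NULL   -- the offending source data, kept for manual review
) ON [PRIMARY]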

Once everything is defined, you may notice a red cross (X) on the "Flat File Source" control: no destination has been defined for this source yet, which is exactly what we define next. You will also see two long arrows hanging from the control, one green and one red. The green arrow is the normal data flow path, and the red arrow is the error data flow path; if "Redirect row" was selected above, error rows flow along the red path to wherever the error data is to be stored.

Defining the destination is simpler. In the "Toolbox" on the left there are many kinds of data flow destinations; we choose "OLE DB Destination". Connect the green arrow under the "Flat File Source" control to the "OLE DB Destination", double-click it to open the "OLE DB Destination Editor" window, and create a database connection.
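Under the covers, the connection manager this creates amounts to an OLE DB connection string roughly along these lines (the server and database names here are hypothetical):

Provider=SQLNCLI10.1;Data Source=(local);Initial Catalog=TestDB;Integrated Security=SSPI;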


Return to the "ole db target Editor" window. In data access mode, select "table or view-fast loading" and select the corresponding table,

Configure the column mappings below.

If a table column has no corresponding source column, simply leave it unmapped (provided that column allows NULLs), and then configure the error handling, following the same approach used for the flat file source.
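To verify that the unmapped table columns really do allow NULLs, a quick query against the standard INFORMATION_SCHEMA views will do:

SELECT COLUMN_NAME, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'iislog';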

So far, a simple data flow task is basically complete. Click Run, and the long-awaited results appear.
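To confirm the load from the database side, a quick sanity check, a row count plus a peek at the first few rows, does the job:

SELECT COUNT(*) AS rows_loaded FROM [dbo].[iislog];

SELECT TOP 10 * FROM [dbo].[iislog];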


Of course, the actual development process is rarely this smooth, and many problems can arise. This article says little about them, mainly because this is only the beginning; I will cover them in more depth in future topics.

A simple data flow task is now complete. In fact, this is just a demo to give everyone an overview; as the saying goes, this is only the first step of a long march, and real ETL is not that simple. Next we will introduce "Data Flow Transformations", the most exciting part of ETL.
