Integration Services Learning

Source: Internet
Author: User
Tags: mssql, object model, ole, import, database, ssis

Summary

As the saying goes, "it takes ten years to sharpen a sword." After five years of careful work, Microsoft launched SQL Server 2005 in 2005, its first major release since SQL Server 2000. This enterprise-class database solution consists of several parts: the Database Engine, Data Mining, Analysis Services, Integration Services, and Reporting Services. Among them, Integration Services (that is, SSIS) is the intermediary and the link between them: it connects the various data sources, loads data into the data warehouse through ETL, builds the multidimensional cubes used for analysis and mining, and then delivers the results through Reporting Services to users at every level of the enterprise, supporting planning, decision-making, and monitoring.

SSIS, whose full name is SQL Server Integration Services, is a great tool in Microsoft's BI solution and the upgrade of DTS from SQL Server 2000. It is greatly improved in functionality, performance, and ease of use, as a comparison of the two interfaces below shows.

SQL Server 2000 DTS

SQL Server 2008 SSIS

Many people now say that SSIS is an ETL (extract-transform-load) tool. Personally I think that is not quite accurate. We may indeed use it mostly for ETL, but SSIS has gone beyond ETL; ETL is only one of its functions, and it is also very strong in other areas:

(1) System maintenance:

(a) Database maintenance:

I. Database backups;

II. Updating statistics;

III. Database integrity checks;

IV. Index rebuilds;

V. SSIS package execution;

VI. SSAS processing tasks.

(b) Business processes:

I. Execute SQL tasks;

II. Web Service tasks.

(c) Operating system maintenance:

I. WMI Event Watcher task;

II. File System tasks.

(d) Other:

I. Execute SQL task;

II. Execute Process task;

III. ActiveX Script task;

IV. Script task (VB/C#);

V. Web Service task.

Item (d) above deserves special mention: you can execute SQL tasks, call Web services, run operating system processes, and run VB/C# script tasks. This leaves a great deal of room for imagination; there is very little it cannot do. You have to admire it.
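To make item (1)(a) concrete, the statements below are the kind of thing an Execute SQL Task inside a maintenance package might run. This is only a sketch; the database name, backup path, and table name are placeholders, not values from the original article.

Code
-- Back up the database (placeholder name and path)
BACKUP DATABASE [IISLog]
TO DISK = N'D:\Backup\IISLog.bak'
WITH INIT;

-- Update statistics for all tables in the current database
EXEC sp_updatestats;

-- Check database integrity
DBCC CHECKDB (N'IISLog') WITH NO_INFOMSGS;

-- Rebuild all indexes on a table (placeholder; assumes the table has indexes)
ALTER INDEX ALL ON [dbo].[IISLog] REBUILD;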

The architecture of SSIS consists of four main components: the Integration Services service, the Integration Services object model, the Integration Services runtime with its run-time executables, and the Data Flow task, which encapsulates the data flow engine and data flow components (see figure):

This is what we beginners must understand: as long as we understand how the parts of the system fit together, understand the relationships between the components, and are clear about what the control flow is and what the data flow is, learning SSIS is not difficult.

In short, SSIS is not simply an upgraded version of DTS. Beyond the improvements mentioned above, Microsoft has played to its long-standing strength in development environments: SSIS is tightly integrated with Visual Studio, so developers can design and develop on a more familiar and convenient platform, which greatly lowers the barrier to entry and speeds up learning and development. Its building blocks are also more object-oriented: every package, every task, every control flow, and every data flow is a separate object with its own properties and events. The VB/C# script tasks and the parameterization of variables and properties are also striking; it feels almost all-powerful (a slight exaggeration perhaps, but it is far stronger than before). It is not complicated to use: install the SQL Server Integration Services 10.0 service (for SQL Server 2005 it is Integration Services 9.0), create a new project, select the Integration Services project type, and you can get a first, close-up feel for its breadth and depth.

Data Flow Task (Part 1)

The Data Flow task is a core task in SSIS; most ETL packages probably cannot do without it. So we will also start learning from the Data Flow task.

The Data Flow task consists of three different types of data flow components: sources, transformations, and destinations. Of these:

Source: a set of data stores, including tables and views of relational databases, files (flat files, Excel files, XML files, and so on), datasets in system memory, and so on.

Transformation: this is the core component of the Data Flow task. If the Data Flow task is the core of ETL, then the transformations inside it are the core of that core. It contains a very rich set of transformation components, such as data update, aggregation, merge, split, sort, lookup, and so on. It covers essentially everything the functions in a SQL statement can do.

Destination: the counterpart of "source", and likewise a set of data stores, including tables, views, files, cubes, in-memory recordsets, and so on.

In addition to the three component types above, there is one more element, the "path", which visually shows the route the data takes from the source, through the transformations, to the destination. We can use the path to view the data, add annotations, and so on.

The following picture shows the relationship between sources, transformations, destinations, and paths.

Let's take importing an IIS log into a database as an example to describe how to develop a Data Flow task.

Before developing, let's look at the structure of the IIS log, as shown in the figure:

It records essentially all the information about Web browsing, such as date, time, client IP, server IP, page address, page parameters, and much more. Based on this information, we create a corresponding table in a relational database to record it.

Code
CREATE TABLE [dbo].[IISLog] (
    -- column lengths are representative; adjust them to your own log format
    [c_date]          [datetime]      NULL,
    [c_time]          [varchar](10)   NULL,
    [c_ip]            [varchar](50)   NULL,
    [cs_username]     [varchar](50)   NULL,
    [s_ip]            [varchar](50)   NULL,
    [s_computername]  [varchar](50)   NULL,
    [s_port]          [varchar](10)   NULL,
    [cs_method]       [varchar](10)   NULL,
    [cs_uri_stem]     [varchar](1000) NULL,
    [cs_uri_query]    [varchar](1000) NULL,
    [sc_status]       [varchar](10)   NULL,
    [sc_substatus]    [varchar](10)   NULL,
    [sc_win32_status] [varchar](10)   NULL,
    [sc_bytes]        [int]           NULL,
    [cs_bytes]        [int]           NULL,
    [time_taken]      [varchar](10)   NULL,
    [cs_version]      [varchar](50)   NULL,
    [cs_host]         [varchar](255)  NULL,
    [cs_user_agent]   [varchar](255)  NULL,
    [cs_referer]      [varchar](1000) NULL
) ON [PRIMARY]

Everything is ready, so we can start the ETL development journey. Open Visual Studio 2008, choose [File] --> [New] --> [Project], and select "Integration Services Project". The ETL development interface comes into view; for friends who do .NET development this is a very familiar interface. Open the Toolbox on the left and drag a Data Flow Task onto the Control Flow panel of the main window, as shown in the figure:

  

Then double-click the "Data Flow Task" on the Control Flow panel to switch to the Data Flow panel. The two panels look almost identical; the difference is in what they do. Real Data Flow task development starts now.

Open the Toolbox on the left and you will see three major groups: Data Flow Sources, Data Flow Transformations, and Data Flow Destinations. From Data Flow Sources, drag a Flat File Source onto the main window, double-click it to open the Flat File Source Editor, click New, and the Flat File Connection Manager Editor opens, as shown in the figure:
  

Enter a connection name, select the IIS log file, choose the row delimiter and the column delimiter, and you can see the data in the preview window.

One point to note here: different flat files use different row and column delimiters, and if you choose the wrong ones you will not get the result you want; all the data may end up squeezed into one column. Row delimiters are relatively simple; rows are basically separated by a line break ({CR}{LF}). Column delimiters vary more: they can be any text character, such as a comma (,), semicolon (;), colon (:), tab, or vertical bar (|), as well as other commonly used characters, or you can define a fixed width for each column. This has to be handled according to the source file.

In the Flat File Connection Manager, select Advanced, and you can define the column name, data type, character length, and so on for each column. When all the definitions are complete, click OK to return to the Flat File Source Editor; the connection you just created automatically appears in the Flat File Connection Manager drop-down list. Then select the columns you want to output, as shown in the figure:

   

Then select Error Output; the default settings are shown in the following illustration:

This is a very important setting: it asks us what to do when the source data contains errors. Broadly there are two kinds of errors in source data: data type errors, such as a malformed date or a value that is not a valid number; and values that are too long and exceed the column width. Different situations call for different handling, and the system provides three options:

Ignore failure: if a row contains an error, ignore that row; execution is not affected and the other data continues to be imported.

Redirect row: the erroneous row is sent to another data flow destination so that it can be inspected and processed manually later.

Fail component: the strictest option; as soon as a data error is encountered, the component fails immediately and stops running.

For a source file like an IIS log, erroneous rows are common, but these small numbers of bad rows will not affect the final result. We should build in fault tolerance and relax the data quality requirements, so we generally choose "Ignore failure" so that the package can keep running.

After everything is defined, we see a red cross (X) on the Flat File Source component, which means that no destination has yet been defined for this data source; that is the next step. There are also two arrows, one green and one red. Green is the path for correct data; red is the path for erroneous data. If you chose "Redirect row" earlier, the bad rows will follow the red path to the error data store.

Defining the data flow destination is simpler. Again, in the Toolbox on the left you will see many destination types; we select OLE DB Destination, connect the green arrow under the Flat File Source component to the OLE DB Destination, then double-click to open the OLE DB Destination Editor window and create a New database connection, as shown in the figure:

Back in the OLE DB Destination Editor window, under Data access mode select "Table or view - fast load", and then select the corresponding table, as shown:

Next, configure the column mappings, as shown in the figure:

If a source column has no counterpart, simply ignore it (provided the corresponding table column allows NULL). Then configure error handling, following the same approach as for the Flat File Source.

So far, a simple Data Flow task is basically done. Click Run, and the result we have been waiting for appears.

Of course, in actual development things may not go so smoothly; you will run into all kinds of problems. We barely mention them in this article, mainly because this is only the beginning and we have not gone that deep; they will be explained gradually in future topics.

With that, a simple data flow task is complete. This is really just a demo to give an overview; it is only the first step of a long journey, and real ETL will not be this simple. Next we will introduce the most exciting part of ETL, data flow transformations. Stay tuned.

Data Flow Task (Part 2)


Part 1 described how to create a simple ETL package and how to use a simple Data Flow task to import data from a text file into a database. That data kept its original form and was imported character for character, but in real applications this is rarely the case. Take the IIS log file: it contains records of successful requests (sc-status = 200) as well as failed ones; it contains pages (*.aspx, *.htm, *.asp, *.php, and so on), images, style sheets (*.css), and script files (*.js). Flowers and weeds grow together, the essence mixed in with the dross. How do we weed out the chaff and extract the essence according to different needs? That is what we want to talk about today: data flow transformations.

Before turning to data flow transformations, let's first introduce the usage scenario: site traffic analysis (IP and PV analysis) based on the IIS log, with the following requirements:

(1) Analyze how site visits change over a period of time, and also understand site traffic by week and on each holiday.

(2) Analyze the load on the site during each period of the day (by hour).

(3) Understand which countries and regions the site's visitors come from.

To meet these requirements, we build the following data model:

Code
Use [IISLog]
Go
--Create a fact table
CREATE TABLE [dbo].[IISLog](
    -- fact table; column lengths are representative where the original listing was unreadable
    [lngid]        [bigint]        NOT NULL,
    [lngshopid]    [int]           NULL,
    [lngdateid]    [int]           NULL,
    [lngtimeid]    [int]           NULL,
    [csdatetime]   [datetime]      NULL,
    [lngipid]      [int]           NULL,
    [cip]          [varchar](50)   NULL,
    [csuristem]    [varchar](1000) NULL,
    [csuriquery]   [varchar](1000) NULL,
    [scstatus]     [varchar](10)   NULL,
    [useragent]    [varchar](255)  NULL,
    [lngreferer]   [int]           NULL,
    [csreferer]    [varchar](1000) NULL,
    [csrefererkpi] [varchar](1000) NULL,
    [lngflag]      [int]           NULL
) ON [PRIMARY]

--IP library
CREATE TABLE [dbo].[DimIP](
    [id]          [bigint] IDENTITY(1,1) NOT NULL,
    [ipsegment]   [nvarchar](20) NULL,
    [strcountry]  [varchar](50)  NULL,
    [strprovince] [varchar](50)  NULL,
    [strcity]     [varchar](50)  NULL,
    [strmemo]     [varchar](255) NULL,
 CONSTRAINT [pk_id] PRIMARY KEY CLUSTERED
(
    [id] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

--Date dimension
CREATE TABLE [dbo].[DimDate](
    [lngdateid]   [int]          NOT NULL,
    [lngyear]     [int]          NULL,
    [strmonth]    [varchar](10)  NULL,
    [dtdatetime]  [datetime]     NULL,
    [strquarter]  [varchar](10)  NULL,
    [strdateattr] [varchar](10)  NULL,
    [strmemo]     [varchar](255) NULL,
 CONSTRAINT [pk_dimdate] PRIMARY KEY CLUSTERED
(
    [lngdateid] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

--Time dimension
CREATE TABLE [dbo].[DimTime](
    [lngtimeid]   [int]          NOT NULL,
    [lnghour]     [int]          NULL,
    [strhour]     [varchar](10)  NULL,
    [strtimeattr] [varchar](10)  NULL,
    [strmemo]     [varchar](255) NULL,
 CONSTRAINT [pk_dimtime] PRIMARY KEY CLUSTERED
(
    [lngtimeid] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
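Once the fact table and the dimension tables have been loaded, queries along the following lines would answer requirements (1) to (3). This is only a sketch against the model above; the date range is a placeholder.

Code
-- (1) page views per day over a period (placeholder date range)
SELECT d.dtdatetime, COUNT(*) AS pv
FROM [dbo].[IISLog] AS f
JOIN [dbo].[DimDate] AS d ON d.lngdateid = f.lngdateid
WHERE d.dtdatetime BETWEEN '2010-01-01' AND '2010-01-31'
GROUP BY d.dtdatetime
ORDER BY d.dtdatetime;

-- (2) load by hour of the day
SELECT t.lnghour, COUNT(*) AS pv, COUNT(DISTINCT f.lngipid) AS ip
FROM [dbo].[IISLog] AS f
JOIN [dbo].[DimTime] AS t ON t.lngtimeid = f.lngtimeid
GROUP BY t.lnghour
ORDER BY t.lnghour;

-- (3) visitors by country and region
SELECT i.strcountry, i.strprovince, COUNT(DISTINCT f.lngipid) AS ip
FROM [dbo].[IISLog] AS f
JOIN [dbo].[DimIP] AS i ON i.id = f.lngipid
GROUP BY i.strcountry, i.strprovince
ORDER BY ip DESC;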

Next, we will describe step by step how to perform the data flow transformations that meet the requirements above.

(i), "Conditional Split" (Conditional split). Corresponds to the Where condition of the SQL statement. This may be the first step in all data flow conversion tasks, in order to reduce the amount of data that is subsequently processed, and to improve system performance, first filter out unwanted records. As I said before, the IISLog file includes a wide variety of records, and for this example, how do we filter the IP and PV data in order to accurately compute it?

(1) Keep only the pure page-view records, that is, the records for *.aspx and *.htm files (this site has only these two types of page).

(2) Keep only the records of successful requests (sc-status = 200).

Open the SSIS solution from the previous article and switch to the Data Flow tab. From the Toolbox on the left, open Data Flow Transformations, locate the Conditional Split component, drag it onto the Data Flow panel, and then drag the green arrow under the Flat File Source component onto the Conditional Split component. Double-click the Conditional Split component to open the Conditional Split Transformation Editor, as shown in the figure:

In this window, system variables, data source columns, and system functions are all available for use. To keep only the pure page-view records, we need to find the links in the Cs_uri_stem column that end with ".aspx", ".htm", or "/". In the "Output Name" field of the list shown above, enter "Form Records", and in the "Condition" field enter:

RIGHT(cs_uri_stem,5) == ".aspx" || RIGHT(cs_uri_stem,4) == ".htm" || RIGHT(cs_uri_stem,1) == "/"
Then filter for the records of successful requests, whose expression is:

Sc_status == "200"

Finally, combine the two expressions into one:

(RIGHT(cs_uri_stem,5) == ".aspx" || RIGHT(cs_uri_stem,4) == ".htm" || RIGHT(cs_uri_stem,1) == "/") && Sc_status == "200"
As shown in the figure:

Click OK. Data filtering is done.
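For comparison, if the same filter were written as a WHERE clause against the staging table from Part 1 (a rough T-SQL analogue only, not something SSIS generates), it would look like this:

Code
SELECT *
FROM [dbo].[IISLog]
WHERE (cs_uri_stem LIKE '%.aspx'
    OR cs_uri_stem LIKE '%.htm'
    OR cs_uri_stem LIKE '%/')
  AND sc_status = '200';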

(ii) Derived Column. This is the equivalent of a computed column in a SQL statement: based on other columns and a certain formula, it derives a new column. In this example, derived columns are needed in three situations (a T-SQL sketch of the equivalent logic follows this list):

(1) Date column: the log file provides the date and the time as two separate strings (varchar), while the corresponding field in the database is of type datetime. To establish the mapping, we need to derive a field of type datetime from the log file's date and time fields.

(2) Time period: likewise, the time in the log file is a string, and we need to extract the hour from it in order to match Lnghour in Dimtime.

(3) IP: we want to determine a visitor's country, province, and region from the client IP. For this I do not think an exact IP match is needed; matching the first three segments of the IP should be enough (no rigorous research here, just a personal feeling; corrections welcome), so we derive ipsegment = the first three segments of the IP and use it to map the region.
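As a rough illustration of the same logic in T-SQL (rather than the SSIS expression language), and assuming the staging table's time strings look like 'hh:mm:ss', the three derivations amount to:

Code
SELECT
    -- (1) combine the date and the time string into one datetime value
    CAST(CONVERT(varchar(10), c_date, 120) + ' ' + c_time AS datetime) AS dtdate,
    -- (2) extract the hour, to be matched against Lnghour in Dimtime
    CAST(LEFT(c_time, 2) AS int) AS lnghour,
    -- (3) keep the first three segments of the IP, e.g. 192.168.1.10 -> 192.168.1
    LEFT(c_ip, LEN(c_ip) - CHARINDEX('.', REVERSE(c_ip))) AS ipsegment
FROM [dbo].[IISLog];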

In the same vein, drag the Derived Column component from the Toolbox to below the Conditional Split component, then drag the green arrow under the Conditional Split component onto the Derived Column component. A window pops up asking you to select which output of the conditional split to use, as shown in the figure:

Select Form Records from the Drop-down list box and click OK.

Then double-click the derived columns component to open the Derived Column Transformation Editor, as shown in the figure:

This window looks familiar; isn't it just the Conditional Split editor? Yes, it is very similar, so I will not belabor it. As shown in the figure, enter the derived column name, select the derivation type, and enter the expression; the data type, length, precision, and other attributes that follow are generated automatically from the expression and generally cannot be modified.

(iii) Data type conversion. In Integration Services, data type matching requirements are fairly strict, especially for the Lookup component used later, where the data types must match exactly or the join will not succeed.

To be compatible with many kinds of data sources (flat files, MSSQL, Oracle, DB2, MySQL, and so on), Integration Services uses its own data types, which differ in form from the data types of any particular source. Once data enters the data flow of an Integration Services package, the data flow engine converts those columns to Integration Services data types, and the conditional split and derived column expressions described earlier operate on data of these types. So if you want to use the Lookup component later, you must first convert the data types to match the columns of the lookup source (the table or view in the relational database). The specific steps are:

From the Toolbox, drag the Data Conversion component onto the window, drag the green arrow under the previous (Derived Column) component onto it, and double-click to open the Data Conversion Transformation Editor, as shown in the figure:

Tick the columns whose data types need converting, Date and strdatetime, and convert them to MSSQL's datetime type.

The relationships between the Integration Services data types and the data types of the various relational databases are complex; if you have trouble finding the correspondence between them, refer to the Microsoft documentation, which describes it in great detail: Integration Services Data Types.

(iv) Lookup. Similar to a LEFT JOIN or RIGHT JOIN in SQL, it generally provides two capabilities: (1) outputting the matching rows, and (2) outputting the rows with no match. This feature is used very frequently in ETL, and used well it can accomplish a great deal. The previous two transformations (derived columns and data type conversion) were paving the way for Lookup. In this example there are three columns to look up: IP, date, and time. With all the preparation in place, the lookups themselves are much easier.

Drag the Lookup component onto the window, connect the green arrow from the previous component, and double-click to open the Lookup Transformation Editor, as shown in the figure:

This editor looks more complicated than the previous ones, but it is not as scary as it seems; for ordinary use, many settings can be left at their defaults and it is still very easy. However, this step is quite critical for ETL performance. First look at the cache modes:

Full cache: the reference dataset is loaded entirely into memory before the Lookup transformation runs, for use in later lookups.

Partial cache: the reference dataset is generated while the Lookup transformation runs; rows that match are loaded into the cache, and rows that do not match are discarded.

No cache: the reference dataset is generated during the Lookup transformation, but it is not loaded into the cache.

With this explanation the pros and cons are already obvious; different situations may call for different strategies, so weigh them for yourself.

The connection type is self-explanatory, so there is not much to say about it.

Specify how to handle rows with no matching entries: this option is important and has four choices:

Ignore failure: if there is no matching entry, ignore it and let the package continue.

Redirect rows to error output: rows with no match are sent out through the error data flow path (the red arrow) for later manual analysis.

Fail component: as soon as a row with no match is encountered, the component fails immediately and the package stops.

Redirect rows to no match output: output the set of rows that did not match. This option is typically used to detect new records: rows that already have a match have been imported and can be ignored, while new rows appear on the no-match output. In this example, the IP lookup uses this option: if a new IP is encountered, it is inserted into the data warehouse; otherwise the record is ignored rather than inserted again (a rough T-SQL analogue follows this list).
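As a rough T-SQL analogue of what the no-match output achieves for the IP lookup (a sketch only; SSIS handles this row by row in the pipeline, and the staging table from Part 1 stands in for the pipeline rows):

Code
-- insert IP segments that are not yet in the DimIP dimension
INSERT INTO [dbo].[DimIP] (ipsegment)
SELECT DISTINCT LEFT(s.c_ip, LEN(s.c_ip) - CHARINDEX('.', REVERSE(s.c_ip)))
FROM [dbo].[IISLog] AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM [dbo].[DimIP] AS d
    WHERE d.ipsegment = LEFT(s.c_ip, LEN(s.c_ip) - CHARINDEX('.', REVERSE(s.c_ip)))
);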

Select "Connect" as shown in the figure:

Select the connection manager "iislog", and select "Dimdate" in the "Use a table or a view" box.

Switch to "Columns" and drag "Dtdate" from the available input columns onto "Dtdatetime" in the available lookup columns; a line between the two fields indicates the join between them. As mentioned earlier, this relationship cannot be established if the data types of the two columns do not match. Finally, tick "Lngdateid" in the available lookup columns as an output. Click OK, and the Lngdateid lookup is complete.
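Conceptually, the Lngdateid lookup behaves like the LEFT JOIN below (a sketch only; [dbo].[Staging_IISLog] is a hypothetical stand-in for the rows flowing through the pipeline, with Dtdate as the derived datetime column):

Code
SELECT s.*,
       d.lngdateid                    -- the surrogate key returned by the lookup
FROM [dbo].[Staging_IISLog] AS s      -- hypothetical stand-in for the pipeline rows
LEFT JOIN [dbo].[DimDate]   AS d
       ON d.dtdatetime = s.dtdate;    -- the join only works when the data types match exactly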

The other two lookups are left for interested readers to try from scratch, to see whether they succeed.

With that, the data transformations are complete. Finally, add the appropriate data flow destination and map the source columns to the target, as shown in the figure:

Click on "Run", the dream of the Green Realm, appeared.

SOURCE download: IISLog source Download

This article address: http://www.fengfly.com/plus/view-168582-1.html
