Microsoft BI Series Essay-SSIS 2012 advanced App-Script component handling irregular flat files Introduction to the scene
When you use SSIS to import source data from a flat file, the following two scenarios are most commonly encountered:
- The flat file in which the rule is imported, and the parsing rules for each line of the file are the same
- Import irregular flat files, which may contain a variety of data structures, such as some rows are headers (head), some lines are content (contents), some lines are tails (Tail), and three types of data have different parsing formats
- The character encoding set of the source file is different, for example, if the file is from various systems (or regions), the encoding of these files may be either Ansi,unicode or UTF-8.
For the first case, you can use the Flat file connector (FlatFile Connection) to resolve, by configuring his character set, format, delimiter, etc. can be done.
In the second case, if you use a flat file connector (FlatFile Connection), you need to use a temporary file that reads the data for each row from the source file as a type using a flat file connector, and then detaches based on the Content usage criteria (Conditional Split) separate the content and reorganize the new files. The newly generated file is then routinely processed for data import.
For the third case, it is more complex and cannot be resolved using a flat file connector at this time. Although the flat File connector supports expression setting codepage and whether it is Unicode, as well as other settings, it is more cumbersome to do so. There are serious disturbances at design time, often resulting in a failure to compile.
This article will give the reader a thought for the second case and the third case, which is to use Script Component as the data source to parse the file, output the data, and deal with the second case.
Implementing steps analyzing Flat File formats
Suppose we have a flat file in the following format:
First of all, the encoding of this file is irregular, and he contains Spanish (and other documents that do not contain this text set).
Second, he has 02H and 02D two content formats, which would be considered headers and detail.
So this type of file uses a flat file connector that cannot be resolved at all.
So what do we do? Look underneath.
Script component Source Definition input and output
With script Component as a data source in SSIS, you can use C # or VB code to do anything, and then construct one or more datasets as output. As shown below:
In this example, I define a edcorder output to put the header content, orderdetail the content I do not need so no definition.
I also defined a failedrows to put the header to parse the failed content. No input.
Writing scripts
On the Script page, select the variable and scripting language you want to use.
Click Edit Script ... )
This will open a script-editing project for VSTA project.
This project consists of three files
- ComponentWrapper.cs-Contains variable classes, connection classes, and user control classes
- BufferWrapper.cs-Defines the output class
- Main.cs-the entry function of the script file, inherited from the user control class.
Main.cs This file has three methods
- PreExecute-Prior to execution
- PostExecute-After execution
- Createnewoutputrows Creating an output line
In order to parse the flat file, we define the following parameters, mainly a StreamReader to read the file, the time record and the line number:
System.IO.StreamReader sr; DateTime now; int rownumber,failedrows; Guid fileId; string contentlengtherror; string coposidnullerror; string errormsg;
Write in PreExecute, open the file according to the file name, define the error message, initialize the variable
/// <summary> ///This method was called once, before rows begin to being processed in the data flow. /// ///You can remove the This method if you don ' t need to do anything here. /// </summary> Public Override voidPreExecute () {Base. PreExecute (); Now=DateTime.Now; RowNumber=1; Failedrows=0; FileId=Guid.parse (Variables.pvFileID.ToString ()); Contentlengtherror="Line content validation failed. Line content length isn't equal to 1141."; Coposidnullerror="Line content validation failed. Copos ID is required."; ErrorMsg="Filename:"+ Variables.pvloadfile +System.Environment.NewLine; SR=NewSystem.IO.StreamReader (Variables.pvloadfile); }
Write this in PostExecute, close the file, return the error message to the script called the variable
/// <summary> ///This method was called after all the rows had passed through this component. /// ///You can delete this method if you don ' t need to do anything here. /// </summary> Public Override voidPostExecute () {Base. PostExecute (); Sr. Close (); Variables.pvfailedrows=failedrows; if(ErrorMsg! ="Filename:"+ Variables.pvloadfile +System.Environment.NewLine) Variables.pverrormessage+ = System.Environment.NewLine +errormsg; }
In Createnewoutputrows, the message type is judged based on the content of each line, the message content is validated, and the output line is generated.
Public Override voidcreatenewoutputrows () {/*Add rows by calling the AddRow method on the member variable named "<output Name>buffer". For example, call Myoutputbuffer.addrow () If your output is named "Myoutput". */ while(!Sr. Endofstream) {stringline =Sr. ReadLine (); if(line. StartsWith ("02H")) { if(line. Length! =1141) {failedrowsbuffer.addrow (); Failedrowsbuffer.rownumber=RowNumber; Failedrowsbuffer.errordescription=Contentlengtherror; Failedrowsbuffer.content=Line ; Failedrows++; ErrorMsg+="At Row:"+ rownumber.tostring () + System.Environment.NewLine +"Error Description:"+ Contentlengtherror +System.Environment.NewLine; } Else { if(!string. IsNullOrEmpty (line. Substring (3, -)) &&!string. Isnullorwhitespace (line. Substring (3, -)))//Coposid is required{edcorderbuffer.addrow (); Edcorderbuffer.coposid= line. Substring (3, -); Edcorderbuffer.custname= line. Substring ( -, -); Edcorderbuffer.custtel= line. Substring (257, -); Edcorderbuffer.dlvytitle= line. Substring (287,4); Edcorderbuffer.dlvyinitial= line. Substring (291,1); Edcorderbuffer.dlvyname= line. Substring (292, -); Edcorderbuffer.dlvyhouse= line. Substring (322, -); Edcorderbuffer.dlvytel= line. Substring (482, -); EDCORDERBUFFER.DSPTCHSRVC= line. Substring (882,9); Edcorderbuffer.dlvydate= line. Substring (891,8); Edcorderbuffer.dmworder= line. Substring (959, -); Edcorderbuffer.dsptchemail= line. Substring (989, -); Edcorderbuffer.orderdate= line. Substring (1059, A); Edcorderbuffer.fileid=fileId; Edcorderbuffer.createdon=Now ; } Else{failedrowsbuffer.addrow (); Failedrowsbuffer.rownumber=RowNumber; Failedrowsbuffer.errordescription=Coposidnullerror; Failedrowsbuffer.content=Line ; Failedrows++; ErrorMsg+="At Row:"+ rownumber.tostring () + System.Environment.NewLine +"Error Description:"+ Coposidnullerror +System.Environment.NewLine; } }} rowNumber++; } }
Compile the next VSTA project.
Processing Script Component Output
In this way, an irregular file is easily handled by us using Script component.
- With StreamReader, the default encoding format is used to automatically identify file encodings, avoiding garbled use of flat file parsers
- Because the file content is processed separately, can parse irregular file, produce multiple output
- By adding custom validation information, you can verify the contents of the message
Questions:
- Can the script component be executed in parallel, andEdcorderbuffer.addrow () is not already producing a line of output? This remains to be studied
- To write code parsing files, is there any length type can be configured, if the need for format conversion is convenient, or external conversion?
Leave these questions to the reader's friends to think about.
You can also give me a message to discuss with you.
Handling irregular flat files using the Script component source