Hawk principles: implementing a general ETL pipeline with IEnumerable


Several articles about IEnumerable already exist; this one describes how to use IEnumerable to implement ETL. ETL is an abbreviation of extract-transform-load: it describes the process of extracting data from a source (Extract), transforming it (Transform), and loading it into a destination (Load). Data collected from the original source typically has many problems, and its format may not match what the business requires, so an ETL process is needed.

ETL can be understood as a cleaning pipeline: data flows in at one end and out at the other. The amount of data can be large, so the pipeline can rarely, and need not, load the entire contents at once. Also, in general, the data leaving the pipeline flows into a new data pool; the original table is rarely modified in place.

As the pipeline metaphor suggests, ETL needs a composable chain: first implement a set of components, then implement a framework that can assemble those components into an ETL pipeline. IEnumerable, together with its pile of LINQ extensions, is exactly what lets us realize this idea.

1. Expression of data

Let's first discuss how to represent the data. Because processing involves dynamically adding and removing attributes, an ordinary entity class won't do, so we use a dictionary instead. For this I have wrapped a class that implements IDictionary<string, object>, called FreeDocument. It can be expressed roughly as follows:

    /// <summary>
    /// free-form document
    /// </summary>
    public interface IFreeDocument : IDictionarySerializable, IDictionary<string, object>, IComparable
    {
        #region Properties
        IDictionary<string, object> DataItems { get; set; }
        IEnumerable<string> PropertyNames { get; }
        #endregion
    }

Data processing therefore boils down to adding, removing, updating and reading key-value pairs on each dictionary object.
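
To make that concrete, here is a minimal sketch of working with such a dictionary-style document. It uses a plain Dictionary<string, object> as a stand-in for FreeDocument (an assumption for illustration, not Hawk's actual class):

    using System;
    using System.Collections.Generic;

    class DictionaryDocumentDemo
    {
        static void Main()
        {
            // A "document" is just a bag of named values; columns can be added or removed at runtime.
            var doc = new Dictionary<string, object>
            {
                ["name"] = "sample estate",
                ["price"] = "532"
            };

            doc["price"] = int.Parse((string)doc["price"]);   // change a column's type
            doc["floor"] = 21;                                // add a new column
            doc.Remove("name");                               // drop a column

            foreach (var kv in doc)
                Console.WriteLine($"{kv.Key} = {kv.Value}");
        }
    }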

2. Basic components

The base interface of a data cleansing component is ICollumProcess, defined as follows:

    public interface ICollumProcess : IDictionarySerializable
    {
        string CollumName { get; set; }          // name of the column to act on
        bool ShouldCalculated { get; set; }      // whether recalculation is required
        double Priority { get; set; }            // priority level
        void Finish();                           // cleanup function, called when processing is complete
        void Init(IList<IFreeDocument> datas);   // probe behavior: initialize from sample data
    }

More specifically, four kinds of components derive from it:

(1) Generator

A generator is a component that provides or produces data. This might mean generating the numbers 0-1000, reading rows from a data table, or retrieving results from a web page. Its interface can be expressed as follows:

    [Interface("ICollumGenerator", "Data Generators", SearchStrategy.FolderSearch)]
    public interface ICollumGenerator : ICollumProcess
    {
        /// <summary>
        /// position of the current iteration
        /// </summary>
        int Position { get; set; }

        IEnumerable<FreeDocument> Generate();

        /// <summary>
        /// number of documents the generator can generate
        /// </summary>
        /// <returns></returns>
        int? GenerateCount();
    }

The main method is Generate, which enumerates a set of documents; GenerateCount optionally (it may not be able to) reports the total number of documents that can be generated.
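
As an illustration of how a generator can stream data lazily, here is a minimal sketch using yield return. The RangeGenerator class and the plain dictionary documents are stand-ins of my own, not Hawk's implementation:

    using System;
    using System.Collections.Generic;

    // Generates documents with a single "id" column, streamed one at a time.
    class RangeGenerator
    {
        public int Count { get; set; } = 1000;

        public IEnumerable<Dictionary<string, object>> Generate()
        {
            for (var i = 0; i < Count; i++)
                yield return new Dictionary<string, object> { ["id"] = i };
        }

        public int? GenerateCount() => Count;   // the total is known for a range generator
    }

    class GeneratorDemo
    {
        static void Main()
        {
            var gen = new RangeGenerator { Count = 5 };
            foreach (var doc in gen.Generate())
                Console.WriteLine(doc["id"]);
        }
    }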

(2) Filter

A filter analyzes whether a document satisfies a condition and rejects the documents that do not. Its interface is also very simple:

  

    [Interface("ICollumDataFilter", "Data Column Filter", SearchStrategy.FolderSearch)]
    public interface ICollumDataFilter : ICollumProcess
    {
        bool FilteData(IFreeDocument data);
    }
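
As a sketch of what an implementation might look like, here is a filter that rejects documents whose target column is missing or empty. The NotEmptyFilter class and the use of plain dictionaries instead of IFreeDocument are assumptions for illustration:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Keeps only documents that contain a non-empty value for CollumName.
    class NotEmptyFilter
    {
        public string CollumName { get; set; } = "price";

        public bool FilteData(Dictionary<string, object> data)
            => data.TryGetValue(CollumName, out var v) && !string.IsNullOrEmpty(v?.ToString());
    }

    class FilterDemo
    {
        static void Main()
        {
            var docs = new[]
            {
                new Dictionary<string, object> { ["price"] = "532" },
                new Dictionary<string, object> { ["price"] = "" },
                new Dictionary<string, object> { ["area"] = "90" }
            };

            var filter = new NotEmptyFilter();
            // The pipeline simply plugs the filter into Where(); failing documents are dropped lazily.
            Console.WriteLine(docs.Where(filter.FilteData).Count());   // 1
        }
    }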

(3) Sequencer

As the name implies, this is the interface for sorting data. It is defined as follows:

    [Interface("ICollumDataSorter", "Data Sequencer", SearchStrategy.FolderSearch)]
    public interface ICollumDataSorter : IDictionarySerializable, ICollumProcess, IComparer<object>
    {
        SortType SortType { get; set; }
        IEnumerable<IFreeDocument> Sort(IEnumerable<IFreeDocument> data);
    }

Sorting generally needs ascending and descending order, but the bigger problem with sorting is that it destroys the one-way streaming and virtualization of the pipeline. In the standard LINQ implementation at least, sorting happens in memory, so all of the data must be loaded before it can be sorted, which seriously hurts performance. For now, sorting is best done only on small data sets.
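
The buffering behaviour is easy to see with a toy example: Where yields as it goes, while OrderBy must drain its whole source before producing anything (this demo uses plain integers rather than documents):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class SortBufferingDemo
    {
        static IEnumerable<int> Source()
        {
            for (var i = 5; i > 0; i--)
            {
                Console.WriteLine($"producing {i}");
                yield return i;
            }
        }

        static void Main()
        {
            // Where()/First() stream: only "producing 5" is printed before the result appears.
            Console.WriteLine(Source().First(x => x > 4));

            // OrderBy() buffers: all five "producing" lines print before the smallest value comes out.
            Console.WriteLine(Source().OrderBy(x => x).First());
        }
    }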

(4) Column converter

This is the most important component. The whole ETL process is essentially transforming one set of columns into another, producing additional columns along the way (a column being a key-value pair). The definition is as follows:

    [Interface("ICollumDataTransformer", "Data Converters", SearchStrategy.FolderSearch)]
    public interface ICollumDataTransformer : ICollumProcess
    {
        string NewCollumName { get; set; }
        SimpleDataType TargetDataType { get; set; }
        ObservableCollection<ICollumDataFilter> FilterLogics { get; set; }
        object TransformData(IFreeDocument datas);
        IEnumerable<string> AffectedCollums { get; }
    }

It looks complicated, but all it does is convert some columns of a document into some other columns. For example, you can apply a regex substitution to a string column, or convert its data type (say from string to int). Take the simplest HTML encoder/decoder as an example:

    public override object TransformData(IFreeDocument document)
    {
        object item = document[CollumName];
        if (item == null)
            return "";
        switch (ConvertType)
        {
            case ConvertType.Decode:
                return HttpUtility.HtmlDecode(item.ToString());
            case ConvertType.Encode:
                return HttpUtility.HtmlEncode(item.ToString());
        }
        return "";
    }

  

3. ETL Pipeline Design

As you have probably guessed by now, the core of the ETL pipeline is the dynamic assembly of LINQ queries.

A basic ETL management class should have the following properties:

    public ObservableCollection<ICollumProcess> CurrentETLTools { get; set; }   // ETL tools that are currently loaded

    protected List<Type> AllETLTools { get; set; }   // all ETL tools that can be used

(Type is used here only for ease of understanding; it would be more appropriate to record the extension class's component metadata, its name and its description.)

And a method:

    public IEnumerable<IFreeDocument> RefreshDatas(IEnumerable<IFreeDocument> docuts)   // convert raw data into new data

Then, the implementation of this function can be defined as follows:

    public IEnumerable<IFreeDocument> RefreshDatas(IEnumerable<IDictionarySerializable> docuts)
    {
        if (SampleMount <= 0)
        {
            SampleMount = 10;
        }
        IEnumerable<IFreeDocument> ienumable =
            docuts.Where(d => d != null).Select(d => d.DictSerialize());
        ErrorLogs = new List<ErrorLog>();
        List<IFreeDocument> samples =
            docuts.Take((int)SampleMount).Select(d => d as IFreeDocument).ToList();

        foreach (ICollumProcess tool in
            CurrentETLTools.Where(d => d.ShouldCalculated).OrderByDescending(d => d.Priority))
        {
            tool.SourceCollection = CurrentCollection;
            tool.Init(samples);

            if (tool is ICollumDataTransformer)
            {
                var ge = tool as ICollumDataTransformer;
                ienumable = Transform(ge, ienumable);
            }
            if (tool is ICollumGenerator)
            {
                var ge = tool as ICollumGenerator;
                if (!ge.CanAppend)   // direct stitching
                    ienumable = ienumable.Concat(ge.Generate());
                else
                    ienumable = ienumable.MergeAll(ge.Generate());
            }
            else if (tool is ICollumDataFilter)
            {
                var t = tool as ICollumDataFilter;
                ienumable = ienumable.Where(t.FilteData);
            }
            else if (tool is ICollumDataSorter)
            {
                var s = tool as ICollumDataSorter;
                switch (s.SortType)
                {
                    case SortType.AscendSort:
                        ienumable = ienumable.OrderBy(d => d, s);
                        break;
                    case SortType.DescendSort:
                        ienumable = ienumable.OrderByDescending(d => d, s);
                        break;
                }
            }
            tool.Finish();
        }
        return ienumable;
    }

The basic idea of the implementation is as above: all loaded ETL components are sorted by priority, a sample of the data is extracted to initialize each component once, and then the transformers, generators, sorters and filters are assembled, one wrapping the other, into a new IEnumerable object. Note that the whole process is deferred; the actual work happens only when the ETL result is really needed.
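
The assembly idea can be boiled down to a few lines of LINQ: each component merely wraps the incoming IEnumerable in another operator, and nothing executes until the result is enumerated. A minimal sketch with inline lambdas standing in for the real components:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class PipelineDemo
    {
        static void Main()
        {
            IEnumerable<Dictionary<string, object>> docs =
                Enumerable.Range(0, 100).Select(i => new Dictionary<string, object> { ["id"] = i });

            // "Assemble" the pipeline: each step only wraps the previous IEnumerable.
            docs = docs.Where(d => (int)d["id"] % 2 == 0);                                      // filter
            docs = docs.Select(d => { d["square"] = (int)d["id"] * (int)d["id"]; return d; });  // transformer
            docs = docs.Concat(new[] { new Dictionary<string, object> { ["id"] = -1 } });       // generator (append)

            // Nothing has executed yet; enumerating below drives the whole chain lazily.
            foreach (var d in docs.Take(3))
                Console.WriteLine(string.Join(", ", d.Select(kv => $"{kv.Key}={kv.Value}")));
        }
    }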

4. Optimizing ETL Pipelines and implementing virtual views

The above is the basic idea of ETL. But it's not enough to do just that. The following is the core of this article.

An ETL pipeline destroys the characteristics of the original collection: the original collection might support index access or even high-performance lookups, but ETL degrades it to something that can only be enumerated. Enumeration means you can only walk from head to tail, with no way to go back or jump to an index; with the new collection you can only access its first n elements, or all of it. This is clearly a disadvantage for some operations.

Consider the indexer first. If the following conditions can be met:

(1) The pipeline contains no sorter and no filter, because they break the positional correspondence between the collections.

(2) The original collection supports an indexer.

(3) Any generator used can report the number of items it generates, and can itself implement an indexer.

(4) The converters perform only 1-to-1 conversions, with no additional side effects.

Then the correspondence between elements of the original collection and the new collection is computable, and the indexer can work. In practice converters are by far the most commonly used component, so these conditions are not that harsh.
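
Under those four conditions the mapping is simply new[i] = Transform(original[i]), so an indexer can be exposed without materializing the new collection. A rough sketch of such a virtual view (the TransformedView class is illustrative, not part of Hawk):

    using System;
    using System.Collections.Generic;

    // A read-only "virtual view": elements are computed on demand from an indexable source.
    class TransformedView<TSource, TResult>
    {
        private readonly IList<TSource> _source;
        private readonly Func<TSource, TResult> _transform;

        public TransformedView(IList<TSource> source, Func<TSource, TResult> transform)
        {
            _source = source;
            _transform = transform;
        }

        public int Count => _source.Count;                              // size is known because the source is indexable
        public TResult this[int index] => _transform(_source[index]);   // 1-to-1, no side effects
    }

    class ViewDemo
    {
        static void Main()
        {
            var raw = new List<int> { 1, 2, 3, 4 };
            var view = new TransformedView<int, int>(raw, x => x * 5);
            Console.WriteLine(view[2]);   // 15, computed without transforming the whole list
        }
    }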

For high-performance lookups, let's first consider the simplest query, a find of the form item[key] == value, rather than complex SQL queries. Even this is more demanding:

(1) Key must support high-performance lookups in the original collection

(2) The four indexer conditions above must be met.

(3) Converters that operate on the key column must be reversible, and ideally implement a 1-to-1 mapping.

Reversible means that a converter which maps A to B can also recover A from the result B. But how harsh a condition this is! An operation like a*5=b is reversible, whereas regex conversions, replacements, and most other operations are not.

So what can we do? One option is for the converter to dynamically save the results of converting the key while the conversion runs. A lookup on the new collection can then ultimately fall back to a lookup on the original collection. Is there a better way?
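
One way to realize that idea is for the transformer to record, for the key column only, which transformed value came from which original value; a lookup on the new collection then becomes a cache hit plus a lookup on the original collection. A sketch under that assumption (class and method names are made up):

    using System;
    using System.Collections.Generic;

    class KeyMappingTransformer
    {
        // Maps transformed key value -> original key value, filled in as documents flow through.
        private readonly Dictionary<string, string> _reverse = new Dictionary<string, string>();

        public string Transform(string originalKey)
        {
            var transformed = originalKey.Trim().ToUpperInvariant();   // not reversible by itself...
            _reverse[transformed] = originalKey;                       // ...so record the mapping as we go
            return transformed;
        }

        // item[key] == value on the new collection falls back to a lookup on the original collection.
        public bool TryResolve(string transformedKey, out string originalKey)
            => _reverse.TryGetValue(transformedKey, out originalKey);
    }

    class LookupDemo
    {
        static void Main()
        {
            var t = new KeyMappingTransformer();
            t.Transform(" beanZhuang-5 ");
            Console.WriteLine(t.TryResolve("BEANZHUANG-5", out var orig) ? orig : "not found");
        }
    }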

What about making the new collection handle complex SQL queries? You would first need to parse the SQL, which involves a great deal of mathematical derivation and transformation, and the restrictions pile up so quickly that it is basically impossible to implement. Take filtering the key to a certain range as an example: every such condition would have to be derived backwards through the pipeline, and that derivation is very difficult.

 

5. Intelligent ETL and User experience optimization

The whole ETL process consists of a human observing the characteristics of the data and then combining and configuring different ETL components. Can this process be automated?

People are intelligent: they can look at data of different formats and types and discover its characteristics. Take the following data as an example:

High floor/21 floors, south, Fa Tau Cui Cheng Xin Yuan, built in 2004, tower
Middle floor/5 floors, north-south, Bean Zhuang 5th Courtyard, built in 2003, slab building

By looking at two such rows, a person can roughly work out what each piece of information means and how to split and convert it: use a regex to extract the first number, which is the floor; use \d{4} to extract the year; and split on commas to get the neighborhood name.
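
That manual recipe translates almost directly into code. A sketch, with patterns and column positions chosen only to fit the sample rows above:

    using System;
    using System.Text.RegularExpressions;

    class ListingParser
    {
        static void Main()
        {
            var row = "High floor/21 floors, south, Fa Tau Cui Cheng Xin Yuan, built in 2004, tower";

            var floor = Regex.Match(row, @"\d+").Value;    // first number -> floor
            var year = Regex.Match(row, @"\d{4}").Value;   // four digits in a row -> year built
            var parts = row.Split(',');                    // comma-split; the name sits at a fixed position
            var estate = parts.Length > 2 ? parts[2].Trim() : "";

            Console.WriteLine($"floor={floor}, year={year}, estate={estate}");
        }
    }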

However, this still requires someone with at least some programming background. If a machine is to do it, how? The automation can be divided into two levels:

(1) Automatic segmentation and alignment.

Data, especially data from the web, is generated by programs, so its format is highly uniform and the delimiters are similar: commas, semicolons, spaces, slashes and so on. You can therefore count how often different separators occur, and where, and use a probabilistic model to produce the most likely segmentation scheme, so that the pieces each row is split into are as consistent in length and count as possible.
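
A crude version of that statistic is sketched below: count candidate delimiters in each row and prefer the one whose per-row count is most consistent. This is a toy heuristic, not Hawk's actual probabilistic model:

    using System;
    using System.Linq;

    class SeparatorGuesser
    {
        static char Guess(string[] rows)
        {
            var candidates = new[] { ',', ';', '/', '\t', '|' };
            return candidates
                .Where(c => rows.All(r => r.IndexOf(c) >= 0))                                   // present in every row
                .OrderBy(c => rows.Select(r => r.Count(ch => ch == c)).Distinct().Count())      // most consistent count
                .ThenByDescending(c => rows.Sum(r => r.Count(ch => ch == c)))                   // then most frequent
                .First();
        }

        static void Main()
        {
            var rows = new[]
            {
                "High floor/21 floors, Cui Cheng Xin Yuan, built in 2004, tower",
                "Middle floor/5 floors, Bean Zhuang 5th Courtyard, built in 2003, slab building"
            };
            Console.WriteLine(Guess(rows));   // ','
        }
    }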

(2) Automatic recognition of content

Automatic recognition of content can rely on rules or on recognizers. A fairly reliable approach is to build a set of rule groups from regex-based text rules: a value like 200x is easily understood as a year, and a structure like 12:32 is easily recognized as a time. A structure-based recognition engine can not only identify "what this is" but also extract its metadata, such as the year, month and day of a date, to prepare for later work.
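
A tiny sketch of such a rule group: each regex pattern suggests a semantic type for a value. The patterns and type names here are illustrative assumptions:

    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    class ContentRecognizer
    {
        // Each rule pairs a structural pattern with the semantic type it suggests.
        static readonly List<(string Type, Regex Pattern)> Rules = new List<(string, Regex)>
        {
            ("year",   new Regex(@"^(19|20)\d{2}$")),
            ("time",   new Regex(@"^\d{1,2}:\d{2}$")),
            ("number", new Regex(@"^\d+(\.\d+)?$"))
        };

        static string Recognize(string value)
        {
            foreach (var rule in Rules)
                if (rule.Pattern.IsMatch(value))
                    return rule.Type;
            return "text";
        }

        static void Main()
        {
            Console.WriteLine(Recognize("2004"));    // year
            Console.WriteLine(Recognize("12:32"));   // time
            Console.WriteLine(Recognize("Tower"));   // text
        }
    }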

The greatest benefit of a web table is its consistent format: by analyzing a very small number of representative sample rows, you can capture the characteristics of the entire dataset. This makes it possible to pay a relatively high cost up front to obtain the best recognition scheme, while keeping performance at a maximum during execution.

  
