Large-scale processing: converting from ETL to ELT with Netezza


CIO: Why doesn't our super [commercial-grade RDBMS] system have enough horsepower? Didn't we just upgrade it?

Manager: Yes, but the upgrade wasn't accepted.

CIO: Not accepted? That sounds like a doctor talking about an organ transplant. Did the CPUs get rejected? (Laughs)

Manager: (calmly) No, it just wasn't accepted by the users. The system is still too slow.

CIO: We've invested [X] million dollars in hardware. It had better hit its targets, or I'm afraid I'll have it broken down for parts and show you and your chief architect out of the company!

Manager: I can't explain it. The white paper says "..." (trails off, staring at a marketing document that might as well be written in a lost language)

CIO: I've been through that manual too. Last week all of your staff were wearing T-shirts with this vendor's logo, and I notice they've switched back to their regular polo shirts. Did something happen?

Manager: We can't hit the expected numbers. We're tuning.

CIO: Tuning? The system is getting worse. That's not what anyone would call "tuning."

Manager: (voice dropping) Yes, I know. (murmurs something and slowly walks away)

It is unfair to blame the manager for everything. Hardware that does not fit the specific workload was imposed on him; when the hardware's original design goals have nothing to do with the work at hand, how can anyone expect it to deliver the results?

Outside the database, "batch data processing" has only one meaning: extract, transform, load (ETL), with dozens of tools to choose from. These choices escalate quickly, creating a domino effect that demands not only the supporting infrastructure (such as dedicated ETL machines) but also people who know the tools and how to apply them. Scalability is not just about hardware; it is about logistics. The goal is to deliver more functionality without hiring a small army of developers to support it. But what if you don't want to hire tool specialists? What if UNIX fundamentals, solid SQL, and command of the enterprise's own data are the core skill set of a large, successful data-processing environment?

Inside a database machine, batch data processing takes only one form: SQL INSERT/SELECT statements, usually executed between intermediate database tables and/or persistent database tables (that is, ELT). Beyond the mechanisms for generating the SQL that supports this form (no, we should not be writing SQL statements for business logic by hand!), we also need to dig into the hardware and determine whether the approach is feasible for large-scale data.
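To make the ELT form concrete, here is a minimal sketch of a two-step flow between an intermediate table and a persistent table; all table and column names here are hypothetical:

-- Step 1 (Load): land the raw feed in an intermediate table as-is.
INSERT INTO STG_SALES (sale_id, sale_date, amount_cents)
SELECT sale_id, sale_date, amount_cents FROM EXT_SALES_FEED;

-- Step 2 (Transform): the SELECT does the transformation work,
-- inside the database, on the way into the persistent table.
INSERT INTO SALES (sale_id, sale_date, amount_usd)
SELECT sale_id, sale_date, amount_cents / 100.0
FROM STG_SALES WHERE amount_cents IS NOT NULL;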

The common general-purpose platform is "horizontal": the CPUs are physically separated from the disk drives, connected over a backplane, with all resources shared at the hardware level. Processing means pulling large amounts of data from disk, across the backplane, through the CPUs and the software's logical representation of the tables, then back across the backplane and onto disk. In such a data flow, data shuttles back and forth over a (saturated) backplane. This inevitably breeds large, bloated, monolithic queries, because we can only process data after we have fetched it. This multiple-instruction, multiple-data (MIMD) model simply cannot scale for batch processing.

By contrast, the IBM® Netezza® platform is a purpose-built appliance that uses finely tuned commodity parts and pairs each CPU with its own disk drive (no shared hardware). When you create a table, logically there is only one table, but physically its contents are spread across the CPU-disk pairs: with 10 such pairs, a table of 1,000 records might be split into 10 slices of 100 records each. A single query engages all of these CPUs at once, so moving 10 groups of 100 records in parallel takes one-tenth the time of moving all 1,000 records as a single block. This is the single-instruction, multiple-data (SIMD) model. SIMD scales for batch processing, which is why a TwinFin® with 90 or more CPUs can work through billions of records in a very short time, far outperforming a general-purpose competitor of similar capacity.
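On Netezza, the physical spread of a table across CPU-disk pairs is declared when the table is created, via a distribution clause. A minimal sketch (the table definition itself is illustrative):

CREATE TABLE EMPLOYEE (
    emp_id         INTEGER,
    emp_fname      VARCHAR(50),
    emp_lname      VARCHAR(50),
    emp_start_date DATE
)
DISTRIBUTE ON (emp_id);  -- rows are hashed on emp_id across all CPU-disk pairs
-- DISTRIBUTE ON RANDOM would instead spread the rows round-robin.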

Many companies that decide to shift logic from ETL machines to Netezza get stuck on a simple question: can this approach scale? In short, SQL INSERT/SELECT statements carry some logistical baggage that we must manage sensibly to build a reliable methodology. Consider the following query:

INSERT INTO EMPLOYEE (emp_id, emp_fname, emp_lname, emp_start_date)
SELECT ID, FirstName, LastName, st_date FROM OldEmployees;

It looks simple; perhaps too simple. Now consider the following query:

INSERT INTO EMPLOYEE SELECT ID, FirstName, LastName, st_date FROM OldEmployees;

Note that in this query the INSERT column list is implicit. Perhaps some readers have already spotted the great danger here: if a column is later added to the middle of the target table, the remaining columns shift and no longer line up with the source columns. One would hope that at least one of the misaligned target columns would raise an error, but what actually happens? I have seen misaligned queries run for days, even months: the values happened to fit acceptable target data types, but the content was completely wrong. We have to take control of this problem.
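To see the failure mode concretely, suppose the target table is later rebuilt with the two name columns swapped (a hypothetical illustration):

-- EMPLOYEE is now (emp_id, emp_lname, emp_fname, emp_start_date).
-- The implicit insert still runs, because VARCHAR fits VARCHAR,
-- but first and last names silently land in each other's columns:
INSERT INTO EMPLOYEE SELECT ID, FirstName, LastName, st_date FROM OldEmployees;
-- emp_lname receives FirstName; emp_fname receives LastName. No error, wrong content.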

What if the target is a fact table with 150 columns? An implicit INSERT/SELECT immediately becomes unmanageable. What if we need to add or remove columns, or update the query to track changes in the data model? Any of these can break the query and leave the solution fragile, where subtle changes make the whole thing behave erratically.

The best defense is to keep each target column explicitly paired with the source rule that populates it, and to generate the elements of the SQL SELECT phrase from those pairs:

emp_id          ID
emp_fname       FirstName
emp_lname       LastName
emp_start_date  st_date

We can extend this template across the database machine: the target columns always stay paired with their processed source columns, and nothing ever shifts. Does this sound more advanced than simply relying on an ETL tool? Don't be fooled. The whole purpose is simplification, reducing the logistical overhead of generated SQL statements. The structure is easy to maintain and does not require painful impact reviews. A simple tool can scan these pairs, compare them against the catalog content, and find, apply, or report changes directly. Keep in mind that when the data model changes, every affected column is visible here; our main concern becomes the exceptions, since more than 90% of the reconciliation work for a new data model can be completed automatically. Reconciling a new data model thus becomes agile, even extremely agile.
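One way to automate that scan is to keep the column pairs in a control table and compare them against the system catalog. A sketch, assuming a hypothetical control table ETL_COLUMN_MAP and Netezza's _v_relation_column catalog view:

-- Report target columns present in the catalog but missing from the
-- mapping, i.e., data-model changes awaiting reconciliation.
SELECT c.attname
FROM _v_relation_column c
WHERE c.name = 'EMPLOYEE'
  AND c.attname NOT IN (
      SELECT target_column FROM ETL_COLUMN_MAP
      WHERE target_table = 'EMPLOYEE');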

With this control over query generation, we are free to produce large numbers of queries without losing our grip on the logistics. Can this approach scale? Yes, and it already has. In its more advanced form, this template is driven by a framework engine, and we often see dozens of such queries executed in a controlled order toward a single functional goal, which simplifies both maintenance and reuse.
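As a sketch of what such a controlled order looks like, here is one functional goal reached through three simple steps rather than one monolithic statement (the working tables are hypothetical):

-- Step 1: isolate today's new employees into a working table.
INSERT INTO WRK_NEW_EMPS (emp_id, emp_fname, emp_lname, emp_start_date)
SELECT ID, FirstName, LastName, st_date
FROM OldEmployees WHERE st_date = CURRENT_DATE;

-- Step 2: standardize the names in a second working table.
INSERT INTO WRK_CLEAN_EMPS (emp_id, emp_fname, emp_lname, emp_start_date)
SELECT emp_id, INITCAP(emp_fname), INITCAP(emp_lname), emp_start_date
FROM WRK_NEW_EMPS;

-- Step 3: publish to the persistent table.
INSERT INTO EMPLOYEE (emp_id, emp_fname, emp_lname, emp_start_date)
SELECT emp_id, emp_fname, emp_lname, emp_start_date
FROM WRK_CLEAN_EMPS;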

The main logistical difference between MIMD and SIMD: with MIMD we must use fewer, more complex SQL statements, necessarily serialized because each one saturates the hardware. SIMD encourages more, simpler SQL statements running in parallel, because the machine dispatches each statement quickly. The SIMD queries therefore finish sooner than their MIMD equivalents.
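The contrast in styles, using the running example (names again hypothetical): the MIMD habit is one statement that does everything at once, while the SIMD habit breaks the same goal into steps the machine can dispatch immediately.

-- MIMD habit: one complex, serialized statement.
INSERT INTO EMPLOYEE (emp_id, emp_fname, emp_lname, emp_start_date)
SELECT o.ID, INITCAP(o.FirstName), INITCAP(o.LastName), o.st_date
FROM OldEmployees o
WHERE o.st_date = CURRENT_DATE
  AND o.ID NOT IN (SELECT emp_id FROM EMPLOYEE);

-- SIMD habit: the same goal as a series of simple statements, like the
-- three-step sequence sketched above, each cheap enough to run at once.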

A general-purpose MIMD platform carries high (and permanent) complexity from the outset; a purpose-built SIMD platform lets us start simple and stay simple. Simplicity means everything is easier to maintain, extend, and troubleshoot, laying a foundation for agility over the entire life of the system.
